chriki.de ::: Natural Language Processing Library

Overview

Introduction

This site hosts a Java library containing a collection of NLP tools developed by Christian Spurk. For now, the library provides only a minimally supervised recognizer for generalized names (GNs, cf. Yangarber et al., 2002) and some utility classes that were built around it. Other NLP tools may or may not follow someday. The primary goal of this site is to release the GN recognizer I developed for my master's thesis (actually a German “Diplomarbeit”) to the public, in the hope that it will be useful to somebody else.

The GN recognizer

The algorithm behind the GN recognizer is essentially language- and domain-independent, but it is currently implemented only for English. It builds on a seed list for each GN type that is to be learned. A major advantage of the recognizer is that the corpus to be annotated does not need any linguistic preprocessing (such as part-of-speech tagging, lemmatization, chunking or parsing). Furthermore, a target corpus has to be read at most twice: in the first pass, a GN classifier is built from the raw text of the corpus, and in the second pass, this classifier is used to annotate the corpus with GNs. If you already have a well-trained classifier, the first pass can be skipped. The system is therefore usable in a wide range of application scenarios, which was the main design goal.
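The two-pass workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name TwoPassSketch, the methods train and annotate, the seed words and the sample sentences are hypothetical and are not part of the library's API. The "classifier" is a toy stand-in that only remembers which word directly precedes a seed name; it is not the trie-based classifier the library actually uses.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration of "train on raw text, then annotate"; NOT the library's API.
class TwoPassSketch {

    /** Pass 1: remember which word directly precedes a seed name of each type. */
    static Map<String, String> train(List<String> corpus, Map<String, String> seeds) {
        Map<String, String> contextToType = new HashMap<>();
        for (String sentence : corpus) {
            String[] tokens = sentence.split("\\s+");
            for (int i = 1; i < tokens.length; i++) {
                String type = seeds.get(tokens[i]);
                if (type != null) {
                    contextToType.put(tokens[i - 1].toLowerCase(Locale.ROOT), type);
                }
            }
        }
        return contextToType;
    }

    /** Pass 2: tag seed names, plus any token that follows a learned context word. */
    static String annotate(String sentence, Map<String, String> seeds,
                           Map<String, String> model) {
        String[] tokens = sentence.split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            String type = seeds.get(tokens[i]);
            if (type == null && i > 0) {
                type = model.get(tokens[i - 1].toLowerCase(Locale.ROOT));
            }
            if (i > 0) {
                out.append(' ');
            }
            out.append(type == null ? tokens[i] : "[" + type + " " + tokens[i] + "]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical seed list for a single GN type.
        Map<String, String> seeds = Map.of("malaria", "DISEASE", "cholera", "DISEASE");
        List<String> corpus = List.of(
                "Doctors treated malaria in the region",
                "Officials reported cholera near the border",
                "Doctors treated Ebola in the city");

        Map<String, String> model = train(corpus, seeds);      // first pass
        for (String sentence : corpus) {                        // second pass
            System.out.println(annotate(sentence, seeds, model));
        }
    }
}
```

In this sketch, "Ebola" is tagged in the third sentence although it is not in the seed list, because the first pass learned "treated" as a context word from the seed "malaria". A model returned by train could be kept and reused on another corpus, which corresponds to skipping the first pass.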

To be independent of such preprocessing, the system uses its own noun group (NG) chunker, which is one of the two core parts of the system. This NG chunker is based on lists of closed-class words for the target language. Because it is a self-contained component, it can also be used separately if necessary. The second core part of the system is a GN classifier based on the algorithm by Cucerzan and Yarowsky (1999). The algorithm has been improved in several ways; for example, the MD-trie idea of Whitelaw and Patrick (2002) has been incorporated in a further optimized form.
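The idea of chunking with closed-class word lists can be sketched as follows. This is a deliberately naive illustration under stated assumptions, not the library's chunker: the class ClosedClassChunkerSketch, the method chunk and the tiny word list are hypothetical. The sketch treats every closed-class word and every trailing punctuation mark as a boundary and returns the runs of remaining words as candidate noun groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Naive closed-class-word chunker; an illustration only, NOT the library's chunker.
class ClosedClassChunkerSketch {

    // Hypothetical, tiny closed-class list; a real list would be far larger.
    static final Set<String> CLOSED_CLASS =
            Set.of("a", "an", "the", "of", "in", "on", "and", "or", "is", "was");

    /** Returns maximal runs of tokens that contain no closed-class word. */
    static List<String> chunk(String text) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.split("\\s+")) {
            // Strip leading and trailing punctuation from the token.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            boolean boundary = word.isEmpty()
                    || CLOSED_CLASS.contains(word.toLowerCase(Locale.ROOT));
            if (!boundary) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(word);
            }
            // A closed-class word or trailing punctuation ends the current group.
            if (boundary || !token.endsWith(word)) {
                flush(current, groups);
            }
        }
        flush(current, groups);
        return groups;
    }

    private static void flush(StringBuilder current, List<String> groups) {
        if (current.length() > 0) {
            groups.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(chunk("An outbreak of avian influenza in northern Italy."));
    }
}
```

For the sample sentence, the sketch yields the groups "outbreak", "avian influenza" and "northern Italy". Open-class words that are not nouns, such as full verbs and adverbs, would end up inside the groups with this naive approach, so a usable chunker needs additional knowledge about them.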

Documentation

The best way to get started with the library is to read the “Getting Started” document. It lists the prerequisites of the library and gives an overview of the basic classes for common recognition tasks.

You can also have a look at the code itself, which is thoroughly documented with standard comments as well as Javadoc comments. The latter were used to generate the API documentation, which is available online.

For your convenience, all documentation is also available for download. If none of this helps, please feel free to contact me by e-mail; I may be able to give you some hints or short usage examples.

License

The library is licensed under a “triple license”, just like the sources of the well-known Mozilla web browser Firefox: you can choose between the Mozilla Public License (MPL) version 1.1, the GNU General Public License (GPL) version 2.0 and the GNU Lesser General Public License (LGPL) version 2.1. Questions about this licensing scheme are answered by the license policy of the Mozilla Foundation.

Download

Several packages, with or without the Java source files, are available for download under the terms of the license described above:

Package description and contents	Package file	Download size
Software binaries only – this package contains the compiled Java classes of the library.	chriki.de-nlp-bin-1.0.jar	76 KB
Software sources only – this package contains the source files of the Java classes in the library.	chriki.de-nlp-src-1.0.zip	137 KB
Resources only – this package contains a bundle of resources for English Named Entity recognition with the library. You'll find lists of closed-class words, domain models, an NG transformer example file as well as adverb and verb models.	chriki.de-nlp-res-1.0.zip	3.17 MB
API only – this package contains the API reference in Javadoc format, which is also available online.	chriki.de-nlp-api-1.0.zip	317 KB
“Getting Started” document only – this package contains just the “Getting Started” document, which you can also read online.	chriki.de-nlp-gettingstarted-1.0.zip	28.7 KB
Complete SDK – this package contains all parts of the other packages, i.e. the compiled binaries, the source files, the API reference, the “Getting Started” document and the resources package.	chriki.de-nlp-sdk-1.0.zip	3.67 MB

References

Cucerzan, S. and D. Yarowsky (1999): Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora 1999, pp. 90–99.
Yangarber, R., W. Lin and R. Grishman (2002): Unsupervised Learning of Generalized Names. In: Proceedings of the 19th International Conference on Computational Linguistics: COLING-2002, Taipei (Taiwan).
Whitelaw, C. and J. Patrick (2002): Orthographic tries in language independent named entity recognition. In: Proceedings of ANLP02, pp. 1–8, Centre for Language Technology, Macquarie University.
