Linas' collection of NLP data

Here is a collection of linguistic data, including parsed texts from Voice of America, Project Gutenberg, the simple English Wikipedia, and a portion of the full English Wikipedia. This data is the result of many CPU-years' worth of number-crunching, and is meant to provide pre-digested input for higher-order linguistic processing. Two types of data are provided: parsed and tagged texts, and large SQL tables of statistical correlations.

The texts were dependency-parsed with a combination of RelEx and Link Grammar, and are marked with dependencies (subject, object, prepositional relations, etc.), with features (part-of-speech tags, verb-tense and noun-number tags, etc.), with Link Grammar linkage relations, and with phrasal constituency structure. The data is in the RelEx compact output format, which captures all of the parser output in a form that is easy to process with basic Perl scripts. For example, these texts can be quickly and easily imported into OpenCog using a Perl script from the RelEx package, the src/perl/cff-to-opencog.pl script.

The Lexical Attraction package was used to compile tables of statistical correlations, including mutual information between word pairs, and conditional probabilities of observing specific link-grammar linkages. In particular, the Mihalcea word-sense disambiguation algorithm was used to tag text with likely word-senses taken from WordNet 3.0, and correlations between these and link-grammar linkages were compiled. The lexat directory contains database dumps of these tables.
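As a rough illustration of what "mutual information between word pairs" means, here is a minimal Python sketch that computes pointwise mutual information from a list of observed (left word, right word) pairs. It is not code from the Lexical Attraction package and does not reflect the schema of the SQL tables. The function name pair_mutual_information and the toy word pairs are made up for this example, and the probabilities are simple relative frequencies with no smoothing.

```python
import math
from collections import Counter


def pair_mutual_information(pairs):
    """Return {(left, right): MI in bits} for an iterable of word pairs.

    MI(l, r) = log2( P(l, r) / (P(l, *) * P(*, r)) ), where all
    probabilities are relative frequencies over the observed pairs.
    """
    pair_counts = Counter(pairs)
    total = sum(pair_counts.values())

    # Marginal counts: how often each word appears on the left / right.
    left_counts = Counter()
    right_counts = Counter()
    for (left, right), n in pair_counts.items():
        left_counts[left] += n
        right_counts[right] += n

    mi = {}
    for (left, right), n in pair_counts.items():
        p_pair = n / total
        p_left = left_counts[left] / total
        p_right = right_counts[right] / total
        mi[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return mi


# Hypothetical toy input: word pairs as they might be read off of parses.
toy_pairs = [("the", "dog"), ("the", "cat"), ("a", "dog"), ("the", "dog")]
toy_mi = pair_mutual_information(toy_pairs)
for pair, value in sorted(toy_mi.items()):
    print(pair, round(value, 3))
```

A positive value means the two words occur together more often than their individual frequencies would suggest; a negative value means they occur together less often.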

The full set of data files is in the data directory. Some highlights of what can be found:

Link-grammar word-sense-disambiguation dictionaries are here.
Individual parsed simple-English Wikipedia pages are here. This directory contains *all* of the simple-English Wikipedia articles.
A tarball of all of the parsed simple-English Wikipedia pages is here.
A small selection of parsed books from Project Gutenberg is here.
English Wikipedia, letters A through M.
Parses of the sentences from MIT ConceptNet, as well as OpenCog database dumps of the extracted semantic triples.

Please let me know if you are using or planning to use any of this data -- I would like to share ideas and updates with you.

Related Work

The Wacky project provides pre-parsed text for the English-language Wikipedia. POS-tagging and lemmatization were done with the TreeTagger, and parsing was done with the MaltParser.

Created June 2008, last updated January 2010. Contact Linas Vepstas at linasvepstas at gmail dot com for more details. See also the affiliated OpenCog Project for more info.