Parsed Wikipedia
----------------
This directory contains a parsed copy of an October 2010 dump of
Wikipedia. (Another directory contains a complete parsed version of a
May 2008 dump.) The data was parsed using the Link-Grammar parser, and
post-processed with the RelEx dependency relationship extractor. The
output is stored in an easy-to-handle format, the so-called "compact
file format". Parsing the full set of 4.5 million articles in the dump
will require approximately 20 CPU-years.

See http://wiki.opencog.org/w/RelEx_compact_output for more information
on the file format.
See http://www.abisource.com/projects/link-grammar/ for more information
about Link Grammar.
See http://wiki.opencog.org/w/RelEx for more information about RelEx.

The parsed files are easily converted to OpenCog format using a Perl
script in the RelEx package, src/perl/cff-to-opencog.pl

These files are mirrored at:
http://gnucash.org/linas/nlp

-------------------------------------------------------------
Any given sentence may have one unique parse or it may have dozens,
hundreds or more different parses. The files here contain at most 20
alternative parses for each sentence, ranked by confidence.

-------------------------------------------------------------
The enwiki-20101011-pages-articles.xml.bz2 file contains the original
Wikipedia page dump.
    size: 6652983189
    crc32: 4d008924
    md5sum: 7a4805475bba1599933b3acd5150bd4d

The enwiki-20101011-alpha.tar.bz2 file contains articles from Wikipedia,
stripped of all HTML and wiki markup; they should be just plain text.
There are 3461265 (about 3.5 million) articles. These are split out into
directories A-Z holding articles starting with letters A-Z.
    size: 2683431354
    crc32: 48a5fad4
    md5sum: ae60bcccf451deb08b5d3c272c8262cc

-------------------------------------------------------------
The enwiki-20101011-parsed-1.tar.bz2 contains 40000 parsed articles.
    size: 680915739
    md5sum: 4f4abbe2f303dca8dc280e2dfc7d8b93

The enwiki-20101011-parsed-2.tar.bz2 contains 40000 parsed articles.
    size: 686932867
    md5sum: 8b68fb6369fceadcc1922b1d3bdf87e3

The enwiki-20101011-parsed-3.tar.bz2 contains 40000 parsed articles.
    size: 701181473
    md5sum: 228f7491826d7ae170ebc6efabdd9db6

The enwiki-20101011-parsed-4.tar.bz2 contains 40000 parsed articles.
    size: 692138370
    md5sum: b6c67f2076f107432909bd585adf9043

The enwiki-20101011-parsed-5.tar.bz2 contains 40000 parsed articles.
    size: 646406025
    md5sum: 810b93a50631810e0061a3cbe16a570f
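
-------------------------------------------------------------
The sizes and checksums listed in this file can be used to confirm that
a download from the mirror is complete. The following is a minimal
sketch in Python, not part of the RelEx package. It assumes that the
sizes are byte counts and that the crc32 values are the standard
zlib/PKZIP CRC-32 written in hexadecimal; neither is stated above. The
function names file_fingerprint and matches are hypothetical, and only
two of the listed files are shown in the EXPECTED table.

```python
import hashlib
import os
import zlib

# Values copied from this README: (size, md5sum, crc32 or None).
EXPECTED = {
    "enwiki-20101011-alpha.tar.bz2":
        (2683431354, "ae60bcccf451deb08b5d3c272c8262cc", "48a5fad4"),
    "enwiki-20101011-parsed-1.tar.bz2":
        (680915739, "4f4abbe2f303dca8dc280e2dfc7d8b93", None),
}


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex, crc32_hex) for the file at path."""
    md5 = hashlib.md5()
    crc = 0
    # Read in chunks so multi-gigabyte archives do not need to fit in memory.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            crc = zlib.crc32(chunk, crc)
    crc_hex = format(crc & 0xFFFFFFFF, "08x")
    return os.path.getsize(path), md5.hexdigest(), crc_hex


def matches(path, size, md5sum, crc32=None):
    """True if the file agrees with the listed size, md5sum and (if given) crc32."""
    actual_size, actual_md5, actual_crc = file_fingerprint(path)
    if actual_size != size or actual_md5 != md5sum.lower():
        return False
    # Assumption: the listed crc32 is a plain zlib CRC-32 in hex.
    return crc32 is None or actual_crc == crc32.lower()


if __name__ == "__main__":
    for name, (size, md5sum, crc32) in EXPECTED.items():
        if not os.path.exists(name):
            print(name, "not found, skipping")
            continue
        print(name, "OK" if matches(name, size, md5sum, crc32) else "MISMATCH")
```

Run the script from the directory holding the downloaded archives. A
MISMATCH most likely indicates a truncated or corrupted download, but
if only the crc32 comparison fails, the CRC variant assumed here may
differ from the one used to produce the values above.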