Parsed Wikipedia
----------------

This directory contains a parsed copy of a May 2008 dump of Wikipedia.
Other directories contain more recent Wikipedia versions.

This data was parsed using the Link-Grammar parser, and post-processed
with the RelEx dependency relationship extractor. The output is stored
in an easy-to-handle format, the so-called "compact file format". The
files in this directory represent approximately 10 cpu-years of
processing.

See http://wiki.opencog.org/w/RelEx_compact_output for more information
on the file format.
See http://www.abisource.com/projects/link-grammar/ for more information
about Link Grammar.
See http://wiki.opencog.org/w/RelEx for more information about RelEx.

The files are easily converted to OpenCog format using a perl script in
the RelEx package, src/perl/cff-to-opencog.pl

These files are mirrored at: http://gnucash.org/linas/nlp

-------------------------------------------------------------
There are, in total, about 455703+97740+87945 = 641388 articles
starting with the letters A-E.

Circa 2008, the parse rate seems to be about 100K articles/month on
5 cpus. Assuming 2.5M articles, that'll be about 25 months (roughly
2 years) of wall-clock time, or about 10 cpu-years.

A variety of improvements were made to the parser during the course of
parsing, and later files provide higher-quality data than the earlier
ones. If you need just a limited amount of data, use the latest files
first. For a description of the specific changes, see the ChangeLog
files for the Link-Grammar and the RelEx projects.

Any given sentence may have one unique parse or it may have dozens,
hundreds or more different parses. The
enwiki-20080524-parsed-1.tar.bz2 through
enwiki-20080524-parsed-24.tar.bz2 files contain at most four
alternative parses for each sentence, ranked by confidence. The
enwiki-20080524-parsed-25.tar.bz2 and later files have at most 20
alternative parses for each sentence.

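The rule above can be written as a small helper for scripts that choose
which archives to download. This is only a sketch: the function names
are made up for illustration, and sorting by archive number assumes
that higher-numbered archives are the "later" (higher-quality) ones.

```python
import re

# Limits stated in this README: archives 1-24 keep at most 4 parses per
# sentence; archives 25 and later keep at most 20.
FIRST_ARCHIVE_WITH_20_PARSES = 25

_ARCHIVE_RE = re.compile(r"enwiki-20080524-parsed-(\d+)\.tar\.bz2$")


def archive_number(filename):
    """Extract N from enwiki-20080524-parsed-N.tar.bz2 (hypothetical helper)."""
    match = _ARCHIVE_RE.search(filename)
    if match is None:
        raise ValueError("not a parsed-archive name: %r" % filename)
    return int(match.group(1))


def max_parses_per_sentence(filename):
    """Upper bound on alternative parses per sentence in this archive."""
    if archive_number(filename) >= FIRST_ARCHIVE_WITH_20_PARSES:
        return 20
    return 4


def newest_first(filenames):
    """Order archives so the highest-numbered ones come first.

    Assumption: a higher archive number means a later parser version.
    """
    return sorted(filenames, key=archive_number, reverse=True)
```

For example, newest_first() applied to a directory listing puts
enwiki-20080524-parsed-37.tar.bz2 ahead of
enwiki-20080524-parsed-1.tar.bz2.
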
-------------------------------------------------------------
The enwiki-20080524-pages-articles.xml.bz2 file contains the original
Wikipedia page dump.
  size:   3978796938
  crc32:  28337b7d
  md5sum: 353e3c5ac1a6d16b0352cdfd4497b19d

The enwiki-20080524-stripped.tar.bz2 file contains articles from
Wikipedia, stripped of all html and wiki markup; they should be just
plain text. There are approximately 2.4 million articles. All articles
are in one directory.
  size:   1769166261
  crc32:  05bc7f7b
  md5sum: e0e406bbd9d623978c071d8197463dab

The enwiki-20080524-alpha.tar.bz2 file contains the stripped articles,
but split out into directories A-Z, etc., holding the articles that
start with the corresponding letter. Image and template articles have
been removed, leaving a total of 2292429 (about 2.3 million) articles.
A little more manageable, that way.
  size:   1755656634
  crc32:  5deee8a0
  md5sum: 01df0064036f9f470817cff3fe9577e4

-------------------------------------------------------------
The enwiki-20080524-parsed-1.tar.bz2 contains parsed versions of about
15% of the A-E entries (97740 entries total, to be precise).
  size:   866668838
  crc32:  6cde583c
  md5sum: 9828756c92050f7fbc5045c688bb5ba1

The enwiki-20080524-parsed-2.tar.bz2 contains parsed versions of another
87945 entries from the A-E letters.
  size:   795440272
  crc32:  a028009d
  md5sum: 33d7b2a3cfcfa8f46f8ca21c1a2f5f9b

The enwiki-20080524-parsed-3.tar.bz2 contains parsed versions of another
70846 entries from the A-E letters. This is the first batch of articles
that will contain the inflected word tags. (Inflection was added
9 Sept 2008, and deployed weeks(?) later.)
  size:   633046871
  crc32:  47333516
  md5sum: fd97a4102545c6a4ea280bb635b64c75

The enwiki-20080524-parsed-4.tar.bz2 contains parsed versions of another
117209 entries from the A-E letters.
  size:   1084348830
  crc32:
  md5sum: f1563925a501ee7d33da27d4471699bb

The enwiki-20080524-parsed-5.tar.bz2 contains parsed versions of another
86609 entries from the A-F letters.
  size:   814706752
  md5sum: 690b443e802262f486249125518b43d8

The enwiki-20080524-parsed-6.tar.bz2 contains parsed versions of another
69822 entries from the A-D, F,G letters.
  size:   387433172
  md5sum: 6f46840c0ff2c6967f35dc996def21ab

The enwiki-20080524-parsed-7.tar.bz2 contains parsed versions of another
73266 entries from the A-C, F,G letters.
  size:   456500482
  md5sum: be5685875c738da96f8852a8995c445e

The enwiki-20080524-parsed-8.tar.bz2 contains parsed versions of another
68877 entries from the A,C,F,G letters.
  size:   677671415
  md5sum: 4cd9000d9b38fb39f7085505186cea11

The enwiki-20080524-parsed-9.tar.bz2 contains parsed versions of another
77487 entries from the C,F,G letters.
  size:   738404406
  md5sum: 4bc7b79926ca9754dbe21e87b05192da

The enwiki-20080524-parsed-10.tar.bz2 contains parsed versions of another
65405 entries from the C,G,H,I letters.
  size:   601187223
  md5sum: 5b81b319a093e1381973f4a3bc6c11e3

The enwiki-20080524-parsed-11.tar.bz2 contains parsed versions of another
74488 entries from the G,H,I letters.
  size:   699696904
  md5sum: 240a1ab0193bb09840050c5714a55ded

The enwiki-20080524-parsed-12.tar.bz2 contains parsed versions of another
63794 entries from the H,I,J letters.
  size:   688075980
  md5sum: c0545fc14767f23dda5e3a5543e3c3f3

The enwiki-20080524-parsed-13.tar.bz2 contains parsed versions of another
70158 entries from the H,J,K letters.
  size:   688939878
  md5sum: e4d0cc8286b45c3c2070f46c9de5a688

The enwiki-20080524-parsed-14.tar.bz2 contains parsed versions of another
62610 entries from the H,J,K letters.
  size:   569345483
  md5sum: e6613156935faac3c34d8411e797ca65

The enwiki-20080524-parsed-15.tar.bz2 contains parsed versions of another
64364 entries from the J,K,L,M letters.
  size:   543937714
  md5sum: 782300c8d361d37b07ad638039990d25

The enwiki-20080524-parsed-16.tar.bz2 contains parsed versions of another
65722 entries from the J,L,M,N letters.
  size:   598485148
  md5sum: 1af4fa340db57abb1645f2ab4cd9724d

The enwiki-20080524-parsed-17.tar.bz2 contains parsed versions of another
65959 entries from the L,M,N letters.
  size:   612890922
  md5sum: 851911d8209a627c2414a0bf41c54114

The enwiki-20080524-parsed-18.tar.bz2 contains parsed versions of another
61003 entries from the L,M,N letters.
  size:   561417680
  md5sum: 4b1f06ab948e05d0c5d14ce8b75e1d71

The enwiki-20080524-parsed-19.tar.bz2 contains parsed versions of another
63263 entries from the L,M,N,O,P letters.
  size:   573597608
  md5sum: b9eb7ed9dd40007a497b9efd85e2ec75

The enwiki-20080524-parsed-20.tar.bz2 contains parsed versions of another
64212 entries from the U-Z letters.
  size:   554890618
  md5sum: 1f1cba0660fa6cd4177f8daac8716910

The enwiki-20080524-parsed-21.tar.bz2 contains parsed versions of another
59687 entries from the U,V,W,Y,Z letters.
  size:   527634834
  md5sum: d30440786b90e7cf845c27309364c998

The enwiki-20080524-parsed-22.tar.bz2 contains parsed versions of another
63347 entries from the M,O,P letters.
  size:   593194103
  md5sum: c099e34bc6768ea88580ac5114765f88

The enwiki-20080524-parsed-23.tar.bz2 contains parsed versions of another
58927 entries from the U,V,W,Y letters.
  size:   613797128
  md5sum: 4356abee20bf8a8406fbb750e3e8461f

The enwiki-20080524-parsed-24.tar.bz2 contains parsed versions of another
68045 entries from the M,O,P,Q,R letters.
  size:   629246754
  md5sum: 0b5d1a00f5ed0e0c332232239a88c4ee

The enwiki-20080524-parsed-25.tar.bz2 contains parsed versions of another
23725 entries from the letter S.
  size:   440510200
  md5sum: 50fc4755ce8dcd7335a8f0d7d67511ab

The enwiki-20080524-parsed-26.tar.bz2 contains parsed versions of another
37024 entries from the letters S, T.
  size:   631949366
  md5sum: 7d64a0f268019faf42419d53cffadc2a

The enwiki-20080524-parsed-27.tar.bz2 contains parsed versions of another
35546 entries from the letters S, T.
  size:   581751501
  md5sum: 24f42060b7e7e45b083a05755ac2a8d5

The enwiki-20080524-parsed-28.tar.bz2 contains parsed versions of another
35499 entries from the letters S, T.
  size:   603995657
  md5sum: 4bf3d9c5e51383c3dd4247e3a0303180

The enwiki-20080524-parsed-29.tar.bz2 contains parsed versions of another
37019 entries from the letters R, S, T.
  size:   666389986
  md5sum: dd1f01d2df5faec6abcc608db41db480

The enwiki-20080524-parsed-30.tar.bz2 contains parsed versions of another
36506 entries from the letters R, S, T.
  size:   651100873
  md5sum: 130107a2ab8125cd40368e7c4c80398e

The enwiki-20080524-parsed-31.tar.bz2 contains parsed versions of another
36359 entries from the letters R, S, T.
  size:   657862642
  md5sum: 96cee371fe4ca78b2bad87c68765fbce

The enwiki-20080524-parsed-32.tar.bz2 contains parsed versions of another
37184 entries from the letters R, S, T.
  size:   750206066
  md5sum: 2e1c42b31751322537bae43c382ebf5c

The enwiki-20080524-parsed-33.tar.bz2 contains parsed versions of another
36145 entries from the letters R, S, T.
  size:   672615829
  md5sum: afb384619dcafa69051d3d8183aa20f4

The enwiki-20080524-parsed-34.tar.bz2 contains parsed versions of another
36648 entries from the letters R, S, T.
  size:   690005839
  md5sum: 4a887af50f53b8ab207fe917c7a93f1a

The enwiki-20080524-parsed-35.tar.bz2 contains parsed versions of another
35849 entries from the letters R, S, T.
  size:   638985616
  md5sum: 1777dac6bdfc9e0a0337eed8f60efac9

The enwiki-20080524-parsed-36.tar.bz2 contains parsed versions of another
36564 entries from the letters R, S, T.
  size:   699489332
  md5sum: ef40441a149657a992eb3dc4098b4ca9

The enwiki-20080524-parsed-37.tar.bz2 contains parsed versions of another
35932 entries from the letters R, S, T.
  size:   692721848
  md5sum: 7af22090434f016a1b97a9b644dd68d1
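
-------------------------------------------------------------
The size and md5sum values listed above can be used to check a download
before unpacking it. The following is a minimal sketch, not part of the
original tooling: the function names are made up for illustration, and
it assumes that the crc32 values, where given, are the standard CRC-32
(as computed by zlib) written as eight hex digits.

```python
import hashlib
import zlib


def file_checksums(path, chunk_size=1 << 20):
    """Return (size_in_bytes, crc32_hex, md5_hex) for the file at path."""
    md5 = hashlib.md5()
    crc = 0
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so that multi-gigabyte archives need not fit
        # in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
    return size, format(crc & 0xFFFFFFFF, "08x"), md5.hexdigest()


def verify(path, expected_size, expected_md5, expected_crc32=None):
    """Compare a file against the values listed in this README.

    expected_crc32 is optional because many entries list no crc32.
    """
    size, crc32_hex, md5_hex = file_checksums(path)
    if size != expected_size:
        return False
    if md5_hex != expected_md5.lower():
        return False
    if expected_crc32 is not None and crc32_hex != expected_crc32.lower():
        return False
    return True
```

For example, verify("enwiki-20080524-parsed-37.tar.bz2", 692721848,
"7af22090434f016a1b97a9b644dd68d1") should return True for an intact
copy of that archive.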