corpusmaker

Clojure utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Project status

Stalled: this project is no longer active due to licensing issues (Cascading is a GPL-licensed dependency, so this code can never be contributed to the ASF) and is being replaced by https://github.com/ogrisel/pignlproc/ instead.

This project is experimental code. Features are implemented when needed. Expect bugs and not-implemented exceptions.

Building from source

Get the latest version of leiningen to build from source (see the example commands after this list):

  1. Download the script.
  2. Place it on your path and chmod it to be executable.
  3. Run: lein self-install
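Concretely, on a Unix-like system these steps look roughly like the following (the download URL below is an assumption, check the leiningen project page for the current location of the script):

$ wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
$ chmod +x lein
$ mv lein ~/bin/   # or anywhere else on your PATH
$ lein self-install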

Then, at the root of the corpusmaker source tree:

$ lein deps # install dependencies in lib/

$ lein compile-java # compile a custom helper class for the Wikipedia parser

$ lein uberjar # build a standalone jar with all dependencies

Note: when executing the resulting standalone jar you might get a security exception, with Java complaining that the signed jar is invalid:

java.lang.SecurityException: Invalid signature file digest for Manifest main attributes

This can be fixed by removing the DUMMY.SF file that is mistakenly included:

% zip corpusmaker-standalone.jar -d META-INF/DUMMY.SF

Hackers can also use the following leiningen commands for development / deployment purposes:

$ JAVA_OPTS="-Xmx256m" lein test [TESTS] # run the tests in the TESTS namespaces, or all tests

$ lein repl # launch a REPL with the project classpath configured

$ lein pom # generate a pom.xml file suitable for maven deployment

Fetching the data

You can get the latest Wikipedia dump of the English articles here (around 5.4 GB compressed, 23 GB uncompressed):

enwiki-latest-pages-articles.xml.bz2
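For example, from the command line (the download URL below is an assumption, check the Wikimedia dumps site for the current location):

$ wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2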

The DBpedia links and entity types datasets are available here:

wikipage_en.nt.bz2

instancetype_en.nt.bz2

longabstract_en.nt.bz2

pagelinks_en.nt.bz2

redirect_en.nt.bz2

All of those datasets are also available from the Amazon cloud as public EBS volumes:

Wikipedia XML dataset EBS Volume: snap-8041f2e9 (all languages - 500GB)

DBPedia Triples dataset EBS Volume: snap-63cf3a0a (all languages - 67GB)

It is planned to add crane-based utility functions to load them into HDFS directly from the EBS volumes.
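In the meantime, a manual workflow on EC2 could look roughly like this (volume / instance ids, availability zone, mount point and file paths on the volume are all placeholders or assumptions, and the exact commands depend on your EC2 tooling):

$ ec2-create-volume --snapshot snap-8041f2e9 --availability-zone us-east-1a
$ ec2-attach-volume vol-XXXXXXXX -i i-XXXXXXXX -d /dev/sdf
$ sudo mount /dev/sdf /mnt/wikipedia
$ hadoop fs -put /mnt/wikipedia/enwiki-latest-pages-articles.xml wikipedia/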

Usage

Evaluate popularity of entities by counting incoming links

$ java -Xmx1g -server -jar corpusmaker-standalone.jar count-incoming \
  --pagelinks-file pagelinks_en.nt \
  --redirect-file redirects_en.nt \
  --output-folder incoming-counts-out/

Next step: flow the popularity through the link graph using a TunkRank- or PageRank-style iterative algorithm (see the sketch below).
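This iterative step is not implemented yet. As a rough, self-contained illustration (not corpusmaker code), a single PageRank-style update over an in-memory link graph could look like this in Clojure, where graph maps each page to the set of pages it links to and ranks maps each page to its current score:

(defn pagerank-step
  "One PageRank-style update: each page shares its current score equally
  among its outgoing links, damped by the usual damping factor."
  [graph ranks damping]
  (let [n (count ranks)
        base (/ (- 1.0 damping) n)
        ;; sum the contributions flowing into each target page
        contribs (apply merge-with +
                        (for [[page targets] graph
                              :let [share (/ (get ranks page 0.0)
                                             (max 1 (count targets)))]
                              target targets]
                          {target share}))]
    (into {} (for [page (keys ranks)]
               [page (+ base (* damping (get contribs page 0.0)))]))))

;; usage: iterate until the scores stabilize, e.g.
;; (nth (iterate #(pagerank-step graph % 0.85) initial-ranks) 20)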

Build a Lucene index of DBpedia resources

Build a fulltext (Lucene-based) index of the abstracts of DBpedia resources:

$ java -Xmx1g -server -jar corpusmaker-standalone.jar build-index \
  --input-folder ~/data/dbpedia \
  --index-dir  ~/lucene/dbpedia-index

Building a NER training / evaluation corpus

TODO: explain how to extract a BIO-formatted corpus suitable for training sequence labeling algorithms such as CRFs with Mallet or crfsuite. A small illustration of the format is given below.
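For reference, BIO-formatted data tags each token with B- (beginning of an entity), I- (inside an entity) or O (outside any entity), one token per line, for example:

Angela    B-PER
Merkel    I-PER
visited   O
Paris     B-LOC
yesterday O
.         O

The exact column layout and label set produced by corpusmaker may differ; this only illustrates the general scheme.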

Building a document classification corpus

TODO: explain how to extract bag-of-words / document frequency features suitable for document classification. A rough sketch is given below.
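As a rough illustration (not corpusmaker code), turning a document into a bag-of-words term-frequency map in Clojure could be as simple as:

;; Hypothetical sketch: lowercase the text, tokenize on alphanumeric runs
;; and count term frequencies.
(defn bag-of-words [text]
  (frequencies (re-seq #"[a-z0-9]+" (.toLowerCase text))))

;; (bag-of-words "The cat sat on the mat")
;; => {"the" 2, "cat" 1, "sat" 1, "on" 1, "mat" 1}

Document frequencies would then be obtained by counting, for each term, the number of documents whose bag-of-words map contains it.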
