Clojure utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.
Stalled: this project is no longer active due to licensing issues (Cascading being a GPL dependency, it will never be possible to contribute this code to the ASF). It is being replaced by https://github.com/ogrisel/pignlproc/ instead.
This project is experimental code. Features are implemented when needed. Expect bugs and not-implemented exceptions.
Get the latest version of Leiningen to build from the sources (see the example after this list):
- Download the script.
- Place it somewhere on your PATH and chmod it to make it executable.
- Run: lein self-install
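For example, something like this (a sketch: the script URL and the ~/bin location are assumptions, check the Leiningen README for the current instructions):

$ wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
$ mv lein ~/bin/ && chmod +x ~/bin/lein
$ lein self-install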
Then, at the root of the corpusmaker source tree:
$ lein deps # install dependencies in lib/
$ lein compile-java # compile a custom helper class for the Wikipedia parser
$ lein uberjar # build a standalone jar with all dependencies
Note: when executing the resulting standalone jar you might get a security exception with java complaining that the signed jar is invalid:
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
This can be fixed by removing the DUMMY.SF file that is mistakenly included:
$ zip corpusmaker-standalone.jar -d META-INF/DUMMY.SF
Hackers can also use the following Leiningen commands for development / deployment purposes:
$ JAVA_OPTS="-Xmx256m" lein test [TESTS] # run the tests in the TESTS namespaces, or all tests
$ lein repl # launch a REPL with the project classpath configured
$ lein pom # generate a pom.xml file suitable for maven deployment
You can get the latest Wikipedia dump of the English articles here (around 5.4 GB compressed, 23 GB uncompressed):
enwiki-latest-pages-articles.xml.bz2
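For instance, with wget (the URL below is an assumption based on the usual Wikimedia dumps layout; mirrors and paths may change):

$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2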
The DBpedia links and entity types datasets are available here:
All of these datasets are also available from the Amazon cloud as public EBS snapshots:

Wikipedia XML dataset EBS snapshot: snap-8041f2e9 (all languages - 500GB)
DBpedia Triples dataset EBS snapshot: snap-63cf3a0a (all languages - 67GB)
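To use one of them, create a volume from the snapshot and attach it to an instance running in the same availability zone. A sketch with the classic EC2 API tools (the volume id, instance id, zone and device name below are placeholders):

$ ec2-create-volume --snapshot snap-8041f2e9 -z us-east-1a
$ ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdf
$ sudo mkdir -p /mnt/wikipedia && sudo mount /dev/sdf /mnt/wikipedia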
It is planned to add crane-based utility functions to load them into HDFS directly from the EBS volumes.
To count the number of incoming links for each DBpedia resource:

$ java -Xmx1g -server -jar corpusmaker-standalone.jar count-incoming \
--pagelinks-file pagelinks_en.nt \
--redirect-file redirects_en.nt \
--output-folder incoming-counts-out/
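The --pagelinks-file and --redirect-file inputs are DBpedia N-Triples files, i.e. one "<subject> <predicate> <object> ." statement per line. For illustration (the exact predicate URI depends on the DBpedia release):

<http://dbpedia.org/resource/Paris> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/France> .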
Next step: flow the popularity through the links graph using a TunkRank- or PageRank-style iterative algorithm.
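For reference, here is a minimal in-memory Clojure sketch of such a PageRank-style update (a hypothetical helper for illustration, not part of corpusmaker; the real computation would have to run on Hadoop over the counts produced above):

(defn pagerank-step
  "One PageRank-style update. links maps each page to a non-empty
  collection of the pages it links to, ranks maps each page to its
  current score, damping is typically 0.85. Dangling pages are
  ignored for brevity."
  [links ranks damping]
  (let [base (/ (- 1 damping) (count ranks))
        ;; each page distributes its rank evenly among its outgoing links
        contribs (for [[page targets] links
                       :let [share (/ (ranks page) (count targets))]
                       target targets]
                   [target share])
        ;; sum the incoming contributions per target page
        summed (reduce (fn [acc [target share]]
                         (update-in acc [target] (fnil + 0) share))
                       {} contribs)]
    (into {} (for [page (keys ranks)]
               [page (+ base (* damping (get summed page 0)))]))))

(defn pagerank
  "Iterate pagerank-step a fixed number of times from a uniform start."
  [links pages iterations]
  (let [init (zipmap pages (repeat (/ 1.0 (count pages))))]
    (nth (iterate #(pagerank-step links % 0.85) init) iterations)))

For example, (pagerank {:a [:b] :b [:a :c] :c [:a]} [:a :b :c] 20) converges towards the relative popularity of the three pages.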
Build a fulltext (Lucene-based) index of the abstracts of DBpedia resources:
$ java -Xmx1g -server -jar corpusmaker-standalone.jar build-index \
--input-folder ~/data/dbpedia \
--index-dir ~/lucene/dbpedia-index
TODO: Explain how to extract a BIO-formatted corpus suitable for training sequence labeling algorithms such as CRFs with Mallet or crfsuite.
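For reference, a BIO-formatted corpus tags each token as Beginning, Inside or Outside of an entity span, one token per line (the sentence below is purely illustrative):

Barack  B-PER
Obama   I-PER
visited O
Paris   B-LOC
.       O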
TODO: Explain how to extract bag-of-words / document frequency features suitable for document classification.
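For reference, the raw ingredients of such features can be sketched in a few lines of Clojure (illustrative only, this is not the corpusmaker API):

(defn term-frequencies
  "Bag-of-words representation of a single tokenized document."
  [tokens]
  (frequencies tokens))

(defn document-frequencies
  "Number of documents in which each term occurs, given a seq of
  tokenized documents. Combined with term-frequencies this yields
  the usual TF-IDF weights."
  [docs]
  (->> docs (map set) (apply concat) frequencies))

;; e.g. (document-frequencies [["the" "cat"] ["the" "dog"]])
;; => {"the" 2, "cat" 1, "dog" 1}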