Clojure utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.
Stalled: this project is no longer active because of licensing issues (Cascading is a GPL dependency, so it will never be possible to contribute the project to the ASF) and is being replaced by instead.
This project is experimental code. Features are implemented as needed. Expect bugs and not-implemented exceptions.
Get the latest version of Leiningen to build from source:
- Download the script.
- Place it on your path and chmod it to be executable.
- Run: lein self-install
Then, at the root of the corpusmaker source tree:
$ lein deps # install dependencies in lib/
$ lein compile-java # compile a custom helper class for the Wikipedia parser
$ lein uberjar # build a standalone jar with all dependencies
Note: when executing the resulting standalone jar you might get a security exception, with Java complaining that the signed jar is invalid:
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
This can be fixed by removing the DUMMY.SF file that is mistakenly included:
% zip corpusmaker-standalone.jar -d META-INF/DUMMY.SF
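Where the zip command is not available, the same fix can be sketched in Python with the standard zipfile module. This is only an illustration, not part of corpusmaker: the function name strip_signature_file and the choice to write a cleaned copy instead of editing the jar in place are assumptions.

```python
import zipfile


def strip_signature_file(jar_path, out_path, entry="META-INF/DUMMY.SF"):
    """Copy jar_path to out_path without the given entry.

    Returns True if the entry was present in the source jar.
    """
    found = False
    with zipfile.ZipFile(jar_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == entry:
                found = True  # skip the stray signature file
                continue
            # Reuse the original ZipInfo so names and timestamps are kept
            dst.writestr(info, src.read(info.filename))
    return found
```

The cleaned copy can then be run in place of the original standalone jar.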
Hackers can also use the following Leiningen commands for development and deployment purposes:
$ JAVA_OPTS="-Xmx256m" lein test [TESTS] # run the tests in the TESTS namespaces, or all tests
$ lein repl # launch a REPL with the project classpath configured
$ lein pom # generate a pom.xml file suitable for maven deployment
You can get the latest Wikipedia dumps for the English articles here (around 5.4 GB compressed, 23 GB uncompressed):
The DBpedia links and entity types datasets are available here:
All of those datasets are also available from the Amazon cloud as public EBS volumes:
Wikipedia XML dataset EBS Volume: snap-8041f2e9 (all languages - 500GB)
DBpedia Triples dataset EBS Volume: snap-63cf3a0a (all languages - 67GB)
A crane-based utility function is planned to load them to HDFS directly from the EBS volumes.
$ java -Xmx1g -server -jar corpusmaker-standalone.jar count-incoming \
--pagelinks-file pagelinks_en.nt \
--redirect-file redirects_en.nt \
--output-folder incoming-counts-out/
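To make the idea behind count-incoming concrete, here is a small in-memory Python sketch. It is not the project's implementation (which runs as a distributed job); it assumes both files are N-Triples with URI subjects and objects, and that redirects should be resolved before counting. The names parse_triples, resolve and count_incoming are hypothetical.

```python
import re
from collections import Counter

# Matches N-Triples lines whose subject, predicate and object are all URIs
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def parse_triples(lines):
    """Yield (subject, object) pairs, skipping comments and other lines."""
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match:
            yield match.group(1), match.group(3)


def resolve(uri, redirects, max_hops=10):
    """Follow redirects for at most max_hops steps, stopping on cycles."""
    seen = set()
    while uri in redirects and uri not in seen and len(seen) < max_hops:
        seen.add(uri)
        uri = redirects[uri]
    return uri


def count_incoming(pagelink_lines, redirect_lines):
    """Count incoming links per page, crediting redirect targets."""
    redirects = dict(parse_triples(redirect_lines))
    counts = Counter()
    for _source, target in parse_triples(pagelink_lines):
        counts[resolve(target, redirects)] += 1
    return counts
```

Both arguments can be open file objects, since the functions only iterate over lines. For full dumps the redirect table may not fit in memory, which is one reason to run this kind of count as a distributed job.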
Next step: propagate the popularity through the link graph using a TunkRank- or PageRank-style iterative algorithm.
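As a rough illustration of what such an iterative algorithm could look like, here is a minimal PageRank power iteration in Python. This step is not implemented in corpusmaker; the damping factor of 0.85, the fixed iteration count and the uniform redistribution of rank from pages without outgoing links are conventional choices assumed here, and TunkRank is not covered.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each node to the list of nodes it links to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The scores sum to 1, so they can be read as a popularity distribution over pages.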
Build a fulltext (Lucene-based) index of the abstracts of DBpedia resources:
$ java -Xmx1g -server -jar corpusmaker-standalone.jar build-index \
--input-folder ~/data/dbpedia \
--index-dir ~/lucene/dbpedia-index
TODO: Explain how to extract a BIO-formatted corpus suitable for training sequence labeling algorithms such as CRFs with Mallet or crfsuite.
TODO: Explain how to extract bag-of-words / document frequency features suitable for document classification.
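Until that explanation is written, the target format itself can be illustrated with a short Python sketch. It assumes the text is already tokenized and that entity annotations are given as (start, end, type) token spans with an exclusive end; how corpusmaker would derive those spans is not described here, and the function names are hypothetical.

```python
def bio_tags(tokens, entities):
    """Label tokens with BIO tags.

    entities is a list of (start, end, type) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        tags[start] = "B-" + entity_type  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type  # continuation tokens
    return tags


def to_bio_lines(tokens, entities):
    """Render one 'token TAG' line per token."""
    tags = bio_tags(tokens, entities)
    return ["%s %s" % (token, tag) for token, tag in zip(tokens, tags)]
```

For the tokens Barack, Obama, visited, Paris with spans (0, 2, "PER") and (3, 4, "LOC"), the tags are B-PER, I-PER, O, B-LOC. The exact column layout expected by Mallet or crfsuite may differ from the simple two-column lines produced here.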
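As a placeholder for that explanation, the two feature types can be sketched in a few lines of Python. The tokenizer below (lowercased runs of ASCII letters and digits) is a simplifying assumption and not what corpusmaker uses; all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Term counts for a single document."""
    return Counter(tokenize(text))


def document_frequency(documents):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for text in documents:
        df.update(set(tokenize(text)))
    return df
```

Document frequencies are typically used to down-weight very common terms, for example in a TF-IDF weighting, before feeding the vectors to a classifier.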