Summarizes Wikipedia's link structure, category structure, page types, etc.
A sequence of Hadoop jobs is provided to extract statistics and summaries from Wikipedia's static XML dumps. These jobs scale roughly linearly with the size of Wikipedia and the number of machines available to the Hadoop cluster.
Models Wikipedia as easy-to-understand Java classes
such as Article, Category, Anchor, etc. See the Javadoc for details.
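A rough sketch of what working with this object model might look like. The package, class and method names below are assumptions based on the description above, not a verbatim copy of the API; consult the Javadoc before relying on them.

```java
// Sketch only: Wikipedia, Article, Category and the lookup methods shown are
// assumptions based on the class names mentioned above.
import org.wikipedia.miner.model.Article;
import org.wikipedia.miner.model.Category;
import org.wikipedia.miner.model.Wikipedia;

public class ModelSketch {

    // Print an article's title and the categories it belongs to.
    static void describe(Wikipedia wikipedia, String title) {
        Article article = wikipedia.getArticleByTitle(title);
        System.out.println(article.getTitle());

        for (Category category : article.getParentCategories()) {
            System.out.println("  in category: " + category.getTitle());
        }
    }
}
```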
Indexes data for efficient access
The summarized data is stored persistently in a Berkeley DB Java Edition database environment. You can access it immediately, without waiting for anything to load.
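Opening the pre-built environment might look something like the following sketch. The constructor signatures and the configuration file name are assumptions and may differ from the real API.

```java
// Sketch only: WikipediaConfiguration and the Wikipedia constructor arguments
// are assumptions; "wikipedia.xml" is a hypothetical configuration file.
import java.io.File;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.WikipediaConfiguration;

public class OpenEnvironmentSketch {
    public static void main(String[] args) throws Exception {
        // Point the configuration at the database environment produced by the
        // Hadoop extraction step.
        WikipediaConfiguration conf = new WikipediaConfiguration(new File("wikipedia.xml"));

        // 'false' stands for "do not preload anything"; data is served
        // immediately, straight from the database.
        Wikipedia wikipedia = new Wikipedia(conf, false);

        System.out.println(wikipedia.getArticleByTitle("New Zealand").getId());

        wikipedia.close();
    }
}
```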
Caches summaries to memory if required
Sometimes you would rather spend time pre-loading summaries into memory, so you can avoid the overhead of constantly querying the database. The toolkit lets you flexibly cache individual databases in memory, depending on the needs of your application.
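Configuring which databases to hold in memory might look roughly like this sketch. The caching methods and enum values shown are assumptions about the configuration API, not a verbatim copy of it.

```java
// Sketch only: addDatabaseToCache and DatabaseType are assumed names for the
// caching hooks described above.
import java.io.File;
import org.wikipedia.miner.db.WDatabase.DatabaseType;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.WikipediaConfiguration;

public class CachingSketch {
    public static void main(String[] args) throws Exception {
        WikipediaConfiguration conf = new WikipediaConfiguration(new File("wikipedia.xml"));

        // Ask for the databases this application hits constantly to be held
        // in memory; everything else stays on disk.
        conf.addDatabaseToCache(DatabaseType.label);
        conf.addDatabaseToCache(DatabaseType.pageLinksIn);

        // 'true' requests that the selected databases be preloaded up front,
        // so later lookups avoid the database entirely.
        Wikipedia wikipedia = new Wikipedia(conf, true);

        // ... use wikipedia ...
        wikipedia.close();
    }
}
```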
Provides flexible searching, via link anchors, titles and redirects
as they occur, or via stemming, case-folding, etc. You can also add your own search methods and prepare the data so that these methods can be used efficiently.
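For instance, looking up every sense a term could refer to might look like the sketch below. The Anchor lookup and the case-folding text processor are assumptions based on the class names mentioned above.

```java
// Sketch only: getAnchor, Anchor.Sense and CaseFolder are assumed names for
// the anchor search described above.
import org.wikipedia.miner.model.Anchor;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.text.CaseFolder;

public class SearchSketch {

    // Look up everything the term "kiwi" could refer to, using case-folding
    // so that "Kiwi", "KIWI" and "kiwi" resolve to the same anchor.
    static void senses(Wikipedia wikipedia) throws Exception {
        Anchor anchor = wikipedia.getAnchor("kiwi", new CaseFolder());

        for (Anchor.Sense sense : anchor.getSenses()) {
            System.out.println(sense.getTitle() + "\t" + sense.getProbability());
        }
    }
}
```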
Measures how Wikipedia's concepts relate to each other
The toolkit includes proven semantic relatedness measures that efficiently and accurately quantify how strongly topics relate to each other.
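Comparing two topics might look like the following sketch; getRelatednessTo is an assumption about the Article API, and the 0 to 1 range is how such measures are conventionally reported.

```java
// Sketch only: getRelatednessTo is an assumed method name for the relatedness
// measures described above.
import org.wikipedia.miner.model.Article;
import org.wikipedia.miner.model.Wikipedia;

public class RelatednessSketch {

    // Compare two topics; the measure returns a value between 0 (unrelated)
    // and 1 (identical).
    static double compare(Wikipedia wikipedia, String a, String b) throws Exception {
        Article first = wikipedia.getArticleByTitle(a);
        Article second = wikipedia.getArticleByTitle(b);
        return first.getRelatednessTo(second);
    }
}
```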
Detects Wikipedia topics when they are mentioned in documents
This includes machine-learned approaches for disambiguating ambiguous terms and for identifying the topics that are most likely to be of interest to the reader.
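A rough sketch of that detection pipeline is shown below, assuming Disambiguator, TopicDetector and LinkDetector classes in an annotation package; the constructor arguments, method signatures and any model-training steps are assumptions and are likely to differ from the real API.

```java
// Sketch only: the annotation classes and their signatures are assumptions
// about the machine-learned detection pipeline described above.
import java.util.Collection;
import org.wikipedia.miner.annotation.Disambiguator;
import org.wikipedia.miner.annotation.Topic;
import org.wikipedia.miner.annotation.TopicDetector;
import org.wikipedia.miner.annotation.weighting.LinkDetector;
import org.wikipedia.miner.model.Wikipedia;

public class DetectionSketch {

    // Detect the Wikipedia topics mentioned in a plain-text document and keep
    // only those the weighting model considers interesting enough to link to.
    static Collection<Topic> detect(Wikipedia wikipedia, String text) throws Exception {
        Disambiguator disambiguator = new Disambiguator(wikipedia);
        TopicDetector detector = new TopicDetector(wikipedia, disambiguator);
        LinkDetector weighter = new LinkDetector(wikipedia);

        Collection<Topic> allTopics = detector.getTopics(text, null);
        return weighter.getBestTopics(allTopics, 0.5); // keep topics weighted above 0.5
    }
}
```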