What this does
**Summarizes Wikipedia's link structure, category structure, page types, etc.**

A sequence of Hadoop jobs is provided to extract statistics and summaries from Wikipedia's static XML dumps. These jobs scale roughly linearly with the size of Wikipedia and the number of machines available in the Hadoop cluster.
**Models Wikipedia as easy-to-understand Java classes**

Wikipedia is modelled with simple Java classes such as Article, Category and Anchor. See the JavaDoc for details, and the short sketch below for an illustration.
**Indexes data for efficient access**

The summarized data is stored persistently in a Java Berkeley DB database environment, so you can access it immediately, without waiting for anything to load.
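The sketch below illustrates the two points above: the object model and the database-backed access. It is a minimal, assumption-laden example; the class and method names (WikipediaConfiguration, getArticleByTitle, getLinksOut, getParentCategories) follow the classes mentioned on this page but are not guaranteed to match the toolkit's exact signatures, so treat the JavaDoc as authoritative.

```java
import java.io.File;

import org.wikipedia.miner.model.Article;
import org.wikipedia.miner.model.Category;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.WikipediaConfiguration;

public class BrowseArticle {

    public static void main(String[] args) throws Exception {
        // Point the toolkit at a configuration describing where the extracted
        // summaries live (hypothetical path); nothing needs to be pre-loaded.
        WikipediaConfiguration conf = new WikipediaConfiguration(new File("configs/en.xml"));
        Wikipedia wikipedia = new Wikipedia(conf, false);

        // Articles, categories, etc. are plain Java objects.
        Article kiwi = wikipedia.getArticleByTitle("Kiwi");
        System.out.println(kiwi.getTitle());

        // Browse the summarized link and category structure.
        for (Article target : kiwi.getLinksOut())
            System.out.println("  links to: " + target.getTitle());

        for (Category parent : kiwi.getParentCategories())
            System.out.println("  category: " + parent.getTitle());
    }
}
```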
**Caches summaries to memory if required**

Sometimes you would rather spend time pre-loading summaries into memory, to avoid the overhead of constantly querying the database. The toolkit allows you to flexibly cache individual databases to memory, depending on the needs of your application.
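A hypothetical sketch of that trade-off, assuming the configuration object exposes a way to nominate databases for pre-loading; the addDatabaseToCache call and DatabaseType names below are assumptions, not confirmed API.

```java
import java.io.File;

import org.wikipedia.miner.db.WDatabase.DatabaseType;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.WikipediaConfiguration;

public class CachedStartup {

    public static void main(String[] args) throws Exception {
        WikipediaConfiguration conf = new WikipediaConfiguration(new File("configs/en.xml"));

        // Assumed call: nominate the anchor/label summaries for pre-loading,
        // trading a longer start-up for much faster repeated lookups.
        conf.addDatabaseToCache(DatabaseType.label);

        // Assumed flag: whether caching happens up-front or in a background thread.
        Wikipedia wikipedia = new Wikipedia(conf, false);
        System.out.println("Caching requested; ready to serve queries.");
    }
}
```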
**Provides flexible searching via link anchors, titles and redirects**

Searches can match these exactly as they occur, or via stemming, case-folding, etc. You can also add your own search methods and prepare the data so that they can be used efficiently.
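A rough sketch of searching by title and by link anchor, with case-folding applied. The Anchor, Sense and CaseFolder names follow the classes mentioned on this page, but the exact signatures (getAnchor, getSenses, getProbability) are assumptions.

```java
import org.wikipedia.miner.model.Anchor;
import org.wikipedia.miner.model.Article;
import org.wikipedia.miner.model.Wikipedia;
import org.wikipedia.miner.util.text.CaseFolder;

public class SearchSketch {

    static void searchExamples(Wikipedia wikipedia) throws Exception {
        // Title lookup (redirects such as "NZ" are assumed to resolve to the same page).
        Article byTitle = wikipedia.getArticleByTitle("New Zealand");
        System.out.println(byTitle.getTitle());

        // Link anchors: the texts editors actually use when linking to articles.
        // The assumed second argument applies case-folding before lookup.
        Anchor anchor = wikipedia.getAnchor("kiwi", new CaseFolder());
        for (Anchor.Sense sense : anchor.getSenses())
            System.out.println(sense.getTitle() + " (prior prob. " + sense.getProbability() + ")");
    }
}
```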
**Measures how Wikipedia's concepts relate to each other**

The toolkit includes proven semantic relatedness measures that efficiently and accurately quantify how strongly topics relate to each other.
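For example, one would expect two closely related topics to score higher than two unrelated ones. The getRelatednessTo method below is an assumption about how the relatedness measures are exposed on Article; the real entry point may differ.

```java
import org.wikipedia.miner.model.Article;
import org.wikipedia.miner.model.Wikipedia;

public class RelatednessSketch {

    static void compare(Wikipedia wikipedia) throws Exception {
        Article kiwi   = wikipedia.getArticleByTitle("Kiwi");
        Article takahe = wikipedia.getArticleByTitle("Takahe");
        Article train  = wikipedia.getArticleByTitle("Train");

        // Relatedness scores typically fall in [0,1]; two flightless birds
        // should relate more strongly than a bird and a vehicle.
        System.out.println("kiwi vs takahe: " + kiwi.getRelatednessTo(takahe));
        System.out.println("kiwi vs train:  " + kiwi.getRelatednessTo(train));
    }
}
```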
**Detects Wikipedia topics when they are mentioned in documents**

This includes machine-learned approaches for disambiguating ambiguous terms and for identifying the topics that are most likely to be of interest to the reader.
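A final, heavily hedged sketch of topic detection. The Disambiguator, TopicDetector and Topic names follow the annotation feature described above, but their constructors, arguments and the getWeight accessor are assumptions; in practice the trained models would also need to be loaded.

```java
import java.util.Collection;

import org.wikipedia.miner.annotation.Disambiguator;
import org.wikipedia.miner.annotation.Topic;
import org.wikipedia.miner.annotation.TopicDetector;
import org.wikipedia.miner.model.Wikipedia;

public class DetectionSketch {

    static void detect(Wikipedia wikipedia, String document) throws Exception {
        // Assumed construction: the disambiguator resolves ambiguous terms,
        // the detector gathers and weights candidate topics in the document.
        Disambiguator disambiguator = new Disambiguator(wikipedia);
        TopicDetector detector = new TopicDetector(wikipedia, disambiguator);

        Collection<Topic> topics = detector.getTopics(document, null);
        for (Topic topic : topics)
            System.out.println(topic.getTitle() + "\t" + topic.getWeight());
    }
}
```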