es-categorizer

Extracting categories from an arbitrary block of text using Elasticsearch and Wikipedia.

TODO:

Get better wiki article results.
- Adapt according to text size?

How?

Term vectors and the OpenNLP Ingest Processor plugin can extract keywords from a given text, but only keywords that already appear in that text, which makes them unsuitable for category extraction. What is needed is a dataset that pairs indexable content with tags, like an encyclopedia:

1. Index Wikipedia. See this guide and Setup below.
2. Use a More Like This Query to find documents (wiki articles) that match a given text.
3. Gather the wiki categories of the highest scoring documents.
4. Return a general category based on those categories.
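
The query and gathering steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the function names are hypothetical, and the field names `text`, `title`, and `category` are assumptions about how the indexed wiki documents are laid out. `min_term_freq` and `min_doc_freq` are lowered from their defaults so that terms in a short input text are not ignored.

```python
from collections import Counter


def build_mlt_query(text, size=3, fields=("text",)):
    """Build a More Like This search body for a block of free text.

    The field names are assumptions about the index mapping.
    """
    return {
        "size": size,
        "_source": ["title", "category"],
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": text,
                # Short inputs rarely repeat a term, so accept single uses.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


def tally_categories(hits):
    """Count how often each wiki category appears across the given hits."""
    counts = Counter()
    for hit in hits:
        counts.update(hit["_source"].get("category", []))
    return counts
```

The dictionary returned by `build_mlt_query` would be sent as the body of a search request, and `tally_categories` would be applied to the `hits` list in the response.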

Prerequisite Dependencies

Python 3
Elasticsearch 5.5.2
ICU Analysis Plugin
Wikimedia's Extra Queries and Filters API Extension Plugin
- Also referred to as Elasticsearch Trigram Accelerated Regular Expression Filter

If you're using OS X and Homebrew:

brew install python3
brew install elasticsearch
brew services start elasticsearch  # or just "elasticsearch" for foreground execution
elasticsearch-plugin install analysis-icu
elasticsearch-plugin install org.wikimedia.search:extra:5.5.2

Setup

Download the latest dump of the search index.

curl -O "https://dumps.wikimedia.org/other/cirrussearch/current/enwiki-20170925-cirrussearch-content.json.gz"
# If the URL is invalid, go to https://dumps.wikimedia.org/other/cirrussearch/current/ to find an alternative

Install this program's dependencies: pip install -r requirements.txt.
- You should probably do this in a virtual environment.
Run ./main.py.
- You will be prompted for the path of the dump file you downloaded above.
- If you've got limited storage, consider setting delete to True for load_chunks() beforehand.
- This command is equivalent to:

./main.py delete_index
./main.py create_index
./main.py chunk /path/to/dump
./main.py load
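
For readers curious what a chunking step might involve, here is a hedged sketch. It assumes the dump is a gzipped, newline-delimited file in Elasticsearch bulk format, where every document takes two lines (an action line followed by the document source). The function name, the output file naming, and the chunk size are all hypothetical and are not taken from main.py.

```python
import gzip
import itertools
import os


def chunk_dump(dump_path, out_dir, docs_per_chunk=500):
    """Split a gzipped bulk-format dump into smaller plain-text files.

    Each document occupies two lines (action line, then source line), so
    chunks are cut on even line boundaries to keep every pair together.
    Returns the list of chunk file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    lines_per_chunk = docs_per_chunk * 2
    paths = []
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        for index in itertools.count():
            lines = list(itertools.islice(dump, lines_per_chunk))
            if not lines:
                break
            # Hypothetical naming scheme: chunk_00000.json, chunk_00001.json, ...
            path = os.path.join(out_dir, "chunk_{:05d}.json".format(index))
            with open(path, "w", encoding="utf-8") as chunk:
                chunk.writelines(lines)
            paths.append(path)
    return paths
```

Cutting on pair boundaries means each chunk can be posted to the bulk endpoint on its own, without a document being separated from its action line.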

Test wiki category extraction with ./main.py extract "Some text".

Sample output for ./main.py extract -s 3 "The Cubs are destroying the Mets right now.":

1) 21.56842 "1969 Chicago Cubs season"
   - Use mdy dates from November 2013
   - Articles with hCards
   - Chicago Cubs seasons
   - 1969 Major League Baseball season
2) 21.366503 "2015 Chicago Cubs season"
   - Use mdy dates from August 2015
   - Articles with hCards
   - 2015 Major League Baseball season
   - 2015 in sports in Illinois
   - Chicago Cubs seasons
3) 21.28783 "2015 National League Championship Series"
   - Pages using deprecated image syntax
   - 2015 Major League Baseball season
   - National League Championship Series
   - Chicago Cubs postseason
   - 2015 in sports in Illinois
   - 21st century in Chicago
   - 2015 in sports in New York City
   - New York Mets postseason
   - October 2015 sports events

NOTE: Results may differ between search index versions.
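
The sample output mixes topical categories with Wikipedia maintenance categories such as "Articles with hCards". One possible way to drop the latter before picking a general category is a prefix filter. The prefix list below is a hypothetical starting point drawn from the sample output, not something main.py is known to use, and a broad prefix like "Articles with" could also remove a few legitimate categories.

```python
# Hypothetical prefixes, drawn from the maintenance categories in the sample output.
MAINTENANCE_PREFIXES = (
    "Use mdy dates",
    "Use dmy dates",
    "Articles with",
    "Pages using",
)


def drop_maintenance(categories, prefixes=MAINTENANCE_PREFIXES):
    """Return only the categories that do not start with a maintenance prefix."""
    return [name for name in categories if not name.startswith(prefixes)]


first_result = [
    "Use mdy dates from November 2013",
    "Articles with hCards",
    "Chicago Cubs seasons",
    "1969 Major League Baseball season",
]
print(drop_maintenance(first_result))
# ['Chicago Cubs seasons', '1969 Major League Baseball season']
```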

Test general category extraction with categorize.

./main.py categorize "The Cubs are destroying the Mets right now." should yield sports.
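
How the wiki categories are reduced to one general category is not described here, so the following is only an illustrative sketch under stated assumptions: a hypothetical keyword table maps each general category to words that may appear in wiki category names, and the general category with the most matching wiki categories wins. The table does not reflect the actual contents or format of category.json.

```python
from collections import Counter

# Hypothetical keyword table; not the format or contents of category.json.
GENERAL_KEYWORDS = {
    "sports": ("baseball", "league", "season", "sports"),
    "politics": ("election", "government", "political"),
}


def generalize(wiki_categories, keywords=GENERAL_KEYWORDS):
    """Pick the general category whose keywords match the most wiki categories.

    Returns None when no keyword matches any category.
    """
    votes = Counter()
    for category in wiki_categories:
        lowered = category.lower()
        for general, words in keywords.items():
            # Each wiki category casts at most one vote per general category.
            if any(word in lowered for word in words):
                votes[general] += 1
    return votes.most_common(1)[0][0] if votes else None
```

With the categories from the sample output, for example `generalize(["Chicago Cubs seasons", "2015 Major League Baseball season"])`, every match falls under the hypothetical "sports" entry.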
