This repository provides a tool for scraping Wikipedia on any topic and generating a knowledge graph from the scraped articles.
The new Neuralcoref from explosion.ai uses the state-of-the-art clustering algorithm MentionRank to cluster mentions in a document. This algorithm is considerably more accurate than the previous one, but it is also much slower.
The new version of Neuralcoref is not compatible with the old one, so the code in this repository has been adapted to the new version.
However, the end results differ from before: the new version of Neuralcoref currently produces lower-quality clusters, so the resulting graphs are not as good as they used to be. This is a known issue, and the developers are working on it.
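For context, "clustering mentions" means grouping every span of text that refers to the same underlying entity. Below is a deliberately naive sketch of the *output shape* of such clustering, using exact string matching; it is only an illustration and has nothing to do with Neuralcoref's actual neural mention-scoring model:

```python
# Naive mention clustering: group mentions by normalized surface form.
# Real coreference systems (like Neuralcoref) also link pronouns and
# paraphrases, which exact-string matching cannot do.

def cluster_mentions(mentions):
    """Group mentions whose lowercased text matches exactly."""
    clusters = {}
    for mention in mentions:
        key = mention.lower()
        clusters.setdefault(key, []).append(mention)
    # Keep only groups with more than one mention, mirroring how
    # coreference clusters always contain at least two spans.
    return [group for group in clusters.values() if len(group) > 1]

mentions = ["The Federal Reserve", "the federal reserve", "Congress", "it"]
print(cluster_mentions(mentions))
```

Note that the pronoun "it" is left unclustered here, which is exactly the gap a real coreference model fills.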
- Python 3.7
- Wikipedia-API
- spaCy
- Neuralcoref
- NetworkX
- spaCy en_core_web_lg
We recommend using conda. Create a new environment from the environment.yml file in the root of this repository:
conda env create -f environment.yml
Then, activate the environment:
conda activate spacy_pos_kg
Alternatively, you can use virtualenv. Create a new environment from the requirements.txt file in the root of this repository:
virtualenv -p python3.7 venv
source venv/bin/activate
pip install -r requirements.txt
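If you want to confirm that the core dependencies are importable before running the demo, a small helper like the one below can report what is missing. This snippet is not part of the repository; the module names are assumed from the requirements list above (the Wikipedia-API package imports as `wikipediaapi`):

```python
import importlib.util

# Module names assumed from the requirements list; adjust if yours differ.
REQUIRED = ["spacy", "neuralcoref", "networkx", "wikipediaapi"]

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

print(missing_packages(REQUIRED))  # an empty list means everything is installed
```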
Below is an example of how to run the code. It will scrape Wikipedia for the text query "2008 recession", generate a knowledge graph, and plot it.
python demo.py --target "2008 recession" --sub-graph-target "The federal reserve"
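Internally, a knowledge graph like this can be represented as a directed graph of (subject, relation, object) triples. Here is a minimal sketch using NetworkX; the triples are invented for illustration, while the repository's actual code extracts them from scraped text using spaCy's POS tags and coreference resolution:

```python
import networkx as nx

# Hypothetical triples standing in for the output of the extraction step.
triples = [
    ("2008 recession", "caused by", "subprime mortgage crisis"),
    ("The federal reserve", "responded to", "2008 recession"),
    ("The federal reserve", "lowered", "interest rates"),
]

G = nx.DiGraph()
for subj, rel, obj in triples:
    # Store the relation as an edge attribute so it can be drawn as a label.
    G.add_edge(subj, obj, relation=rel)

print(G.number_of_nodes(), G.number_of_edges())
print(G["The federal reserve"]["2008 recession"]["relation"])
```

Edge attributes like `relation` can later be rendered as edge labels by matplotlib via `nx.draw_networkx_edge_labels`.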
Output:
The graph was generated using the en_core_web_lg model from spaCy and plotted with NetworkX and matplotlib.
The sub-graph was generated for the entity "The federal reserve".
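The `--sub-graph-target` option presumably keeps only the neighborhood of the named entity. With NetworkX, such a sub-graph can be extracted with `ego_graph`; this is a sketch under that assumption, using a toy graph, and the demo's exact logic may differ:

```python
import networkx as nx

# Toy graph standing in for the full knowledge graph.
G = nx.Graph()
G.add_edges_from([
    ("The federal reserve", "interest rates"),
    ("The federal reserve", "2008 recession"),
    ("2008 recession", "subprime mortgage crisis"),
])

# Keep the target node and everything within one hop of it.
sub = nx.ego_graph(G, "The federal reserve", radius=1)
print(sorted(sub.nodes()))
# → ['2008 recession', 'The federal reserve', 'interest rates']
```

Increasing `radius` widens the neighborhood, which is useful when the one-hop sub-graph around an entity is too sparse to be informative.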