- Built an NLP pipeline to link web text to a knowledge base, on Spark, in Python
- Extracted raw text with BeautifulSoup and pre-processed data with Pandas
- Tokenized text and recognized named entities with NLTK and Stanford NER
- Linked each mention to candidate entities in the Freebase knowledge base via Elasticsearch
- Queried each candidate's abstract in the DBpedia database using SPARQL
- Computed entity similarity with Scikit-learn, obtaining 3.4% precision, 1.2% recall, 5.4% F1
This project performs Entity Linking on a collection of web pages. The method consists of the following five steps:
- extract text from HTML pages in WARC files using BeautifulSoup
- tokenize each text and recognize named entities in the content using NLTK
- link each entity mention to a set of candidate entities in Freebase using Elasticsearch; this step yields a list of candidate Freebase IDs for each entity mention
- query each candidate's abstract in DBpedia using SPARQL
- compute two cosine similarities using scikit-learn:
  - between the abstract and the text the entity mention was retrieved from
  - between the entity mention and the abstract's object (the noun phrase before the first verb)

The sum of these two scores is the candidate entity's similarity score; the entity mention is linked to the candidate entity with the highest score. Results are returned in the format: document ID + '\t' + entity surface form + '\t' + Freebase entity ID.
We also run our method with Spark in cluster mode.
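The candidate-ranking step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `rank_candidates` and its inputs are hypothetical, and the "noun phrase before the first verb" extraction is crudely approximated by splitting on " is ".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cos_sim(a, b):
    """Cosine similarity between two strings via TF-IDF vectors."""
    tfidf = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

def rank_candidates(mention, source_text, candidates):
    """candidates: list of (freebase_id, abstract) pairs.
    Score = cos(abstract, source text) + cos(mention, abstract's object)."""
    best_id, best_score = None, -1.0
    for fb_id, abstract in candidates:
        # crude stand-in for "noun phrase before the first verb"
        obj = abstract.split(" is ")[0]
        score = cos_sim(abstract, source_text) + cos_sim(mention, obj)
        if score > best_score:
            best_id, best_score = fb_id, score
    return best_id, best_score
```

For example, given the mention "Rome" in a text about Italy and two candidate abstracts (the IDs here are made up), the Italian-capital candidate outscores the other.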
Python packages: beautifulsoup4, nltk, scikit-learn, requests

```shell
pip install -U beautifulsoup4 nltk scikit-learn requests
```

Stanford NER:

```shell
wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip
```
path in DAS-4:

```shell
cd /home/wdps1811/scratch/wdps-group11
```
run without Spark

```shell
# SCRIPT: starter-code.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
# OUTPUT: sample
bash run.sh
```
run with Spark

setup environment

```shell
# set up a virtualenv
python3 -m venv venv
source venv/bin/activate
export PYTHONPATH=""
# install the Python packages
pip install -U beautifulsoup4 nltk scikit-learn requests
# download nltk_data and zip it
python -m nltk.downloader -d ./ all
zip -r nltk_data.zip ./nltk_data
# download Stanford NER
wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip
```
run

```shell
# SCRIPT: starter-code-spark.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
# OUTPUT: sample
bash run_venv.sh <SCRIPT> <INPUT> <OUTPUT>
```
compute F1-score

```shell
# if run with Spark, the output is in HDFS
hdfs dfs -cat /user/wdps1811/sample/* > output.tsv
# compute the F1-score
python score.py data/sample.annotations.tsv output.tsv
```
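`score.py` presumably compares predicted (document ID, surface form, Freebase ID) triples against the gold annotations; a minimal sketch of micro-averaged precision, recall, and F1 over such tab-separated triples (the exact field layout is assumed from the output format described above):

```python
def f1_score(gold_lines, pred_lines):
    """Micro P/R/F1 over tab-separated (doc_id, surface_form, freebase_id) lines."""
    gold = set(tuple(l.strip().split("\t")) for l in gold_lines if l.strip())
    pred = set(tuple(l.strip().split("\t")) for l in pred_lines if l.strip())
    tp = len(gold & pred)  # exact triple matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

For instance, with two gold triples and two predictions of which one matches, all three scores come out to 0.5.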
- run.sh: run locally
- run_venv.sh: run on the cluster
- starter-code.py: the main Entity Linking pipeline, including the code for ranking candidate entities
- starter-code-spark.py: uses RDD operations to perform Entity Linking
- html2text.py: WARC -> HTML -> text; removes HTML tags and useless text (script, comment, code, style, ...) and gets the text in the tag
- nlp_preproc.py: text -> tokens -> clean tokens (stopwords removed) -> NER-tagged tokens
- nlp_preproc_spark.py: uses NLTK NER to tag tokens
- elasticsearch.py: candidate entity generation; searches for candidate Freebase entity IDs via Elasticsearch
- sparql.py: queries each candidate entity's abstract
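The DBpedia lookup in sparql.py can be sketched as below. This is an assumption-laden illustration: the public DBpedia endpoint is used here, and the exact query the repository sends is not shown in this README.

```python
import requests

# public DBpedia SPARQL endpoint (an assumption; the project may use a local mirror)
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def abstract_query(resource):
    """Build a SPARQL query for the English abstract of a DBpedia resource.
    `dbo:` is a prefix predefined by the DBpedia endpoint."""
    return (
        "SELECT ?abstract WHERE { "
        f"<http://dbpedia.org/resource/{resource}> dbo:abstract ?abstract . "
        "FILTER (lang(?abstract) = 'en') }"
    )

def fetch_abstract(resource):
    """Hypothetical helper: query the endpoint and return the abstract, or None."""
    resp = requests.get(
        DBPEDIA_ENDPOINT,
        params={"query": abstract_query(resource), "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None
```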
main idea:
- the most salient entities are relevant to the title of the HTML page
- compute the Jaccard similarity between the entity's abstract and the title of the page
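Assuming the Jaccard similarity is taken over token sets (the extension's exact tokenization isn't shown here), a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard index between the token sets of two strings:
    |intersection| / |union| after lowercasing and whitespace splitting."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0  # avoid division by zero for two empty strings
    return len(sa & sb) / len(sa | sb)
```

An abstract whose tokens overlap heavily with the page title then marks its entity as salient.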
extension prerequisites: unidecode, numpy

```shell
pip install -U unidecode numpy
```
run

```shell
# SCRIPT: extension-starter-code.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
bash entension_run.sh > <OUTPUT>
```
- entension_run.sh: runs the extension
- extension-starter-code.py: detects the most salient entities by computing the Jaccard index
- extensionhtml2text.py: gets the page title