- Built an NLP pipeline to link web text to a knowledge base, on Spark, in Python
- Extracted raw text with BeautifulSoup and pre-processed data with Pandas
- Tokenized text and recognized named entities with NLTK and Stanford NER
- Linked each mention to candidate entities in the Freebase knowledge base via Elasticsearch
- Queried each candidate's abstract in the DBpedia database using SPARQL
- Computed entity similarity with Scikit-learn, obtaining 3.4% precision, 1.2% recall, 5.4% F1
This project performs Entity Linking on a collection of web pages. The method consists of the following five steps:
- extract text from HTML pages in WARC files using BeautifulSoup
- tokenize each text and recognize named entities in the content using NLTK
- link each entity mention to a set of candidate entities in Freebase using Elasticsearch; this step yields a list of candidate Freebase IDs for each entity mention
- query each candidate's abstract in DBpedia using SPARQL
- compute two cosine similarities using scikit-learn:
  - between the abstract and the text the entity mention was retrieved from
  - between the entity mention and the abstract's object (the noun phrase before the first verb)

The sum of these two scores is the candidate entity's similarity score; the entity mention is linked to the candidate entity with the highest score. Results are returned in the format: document ID + '\t' + entity surface form + '\t' + Freebase entity ID.
We also run our method with Spark in cluster mode.
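The candidate-ranking step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `rank_candidates` and its inputs are hypothetical, and the "noun phrase before the first verb" extraction is crudely approximated by splitting on " is ".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cos_sim(a, b):
    """Cosine similarity between two strings via TF-IDF vectors."""
    tfidf = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

def rank_candidates(mention, source_text, candidates):
    """candidates: list of (freebase_id, abstract) pairs.
    Score = cos(abstract, source text) + cos(mention, abstract's object)."""
    best_id, best_score = None, -1.0
    for fb_id, abstract in candidates:
        # crude stand-in for "noun phrase before the first verb"
        obj = abstract.split(" is ")[0]
        score = cos_sim(abstract, source_text) + cos_sim(mention, obj)
        if score > best_score:
            best_id, best_score = fb_id, score
    return best_id, best_score
```

For example, given the mention "Rome" in a text about Italy and two candidate abstracts (the IDs here are made up), the Italian-capital candidate outscores the other.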
Python packages: beautifulsoup4, nltk, scikit-learn, requests

```shell
pip install -U beautifulsoup4 nltk scikit-learn requests
```

Stanford NER:

```shell
wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip
```
path in DAS-4:

```shell
cd /home/wdps1811/scratch/wdps-group11
```
run without Spark

```shell
# SCRIPT: starter-code.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
# OUTPUT: sample
bash run.sh
```
run with Spark

setup environment

```shell
# set up a virtualenv
python3 -m venv venv
source venv/bin/activate
export PYTHONPATH=""
# install the Python packages
pip install -U beautifulsoup4 nltk scikit-learn requests
# download nltk_data and zip it
python -m nltk.downloader -d ./ all
zip -r nltk_data.zip ./nltk_data
# download Stanford NER
wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip
```
run

```shell
# SCRIPT: starter-code-spark.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
# OUTPUT: sample
bash run_venv.sh <SCRIPT> <INPUT> <OUTPUT>
```
compute F1-score

```shell
# if run with Spark, the output is in HDFS
hdfs dfs -cat /user/wdps1811/sample/* > output.tsv
# compute the F1-score
python score.py data/sample.annotations.tsv output.tsv
```
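`score.py` presumably compares predicted (document ID, surface form, Freebase ID) triples against the gold annotations; a minimal sketch of micro-averaged precision, recall, and F1 over such tab-separated triples (the exact field layout is assumed from the output format described above):

```python
def f1_score(gold_lines, pred_lines):
    """Micro P/R/F1 over tab-separated (doc_id, surface_form, freebase_id) lines."""
    gold = set(tuple(l.strip().split("\t")) for l in gold_lines if l.strip())
    pred = set(tuple(l.strip().split("\t")) for l in pred_lines if l.strip())
    tp = len(gold & pred)  # exact triple matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

For instance, with two gold triples and two predictions of which one matches, all three scores come out to 0.5.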
- run.sh: run locally
- run_venv.sh: run on the cluster
- starter-code.py: the main Entity Linking pipeline, including the code for ranking candidate entities
- starter-code-spark.py: uses RDD operations to perform Entity Linking
- html2text.py: WARC -> HTML -> text; removes HTML tags and useless text (script, comment, code, style, ...) and gets the text in the tag
- nlp_preproc.py: text -> tokens -> clean tokens (stopwords removed) -> NER-tagged tokens
- nlp_preproc_spark.py: uses NLTK NER to tag tokens
- elasticsearch.py: candidate entity generation; searches for candidate Freebase entity IDs via Elasticsearch
- sparql.py: queries each candidate entity's abstract
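The DBpedia lookup in sparql.py can be sketched as below. This is an assumption-laden illustration: the public DBpedia endpoint is used here, and the exact query the repository sends is not shown in this README.

```python
import requests

# public DBpedia SPARQL endpoint (an assumption; the project may use a local mirror)
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def abstract_query(resource):
    """Build a SPARQL query for the English abstract of a DBpedia resource.
    `dbo:` is a prefix predefined by the DBpedia endpoint."""
    return (
        "SELECT ?abstract WHERE { "
        f"<http://dbpedia.org/resource/{resource}> dbo:abstract ?abstract . "
        "FILTER (lang(?abstract) = 'en') }"
    )

def fetch_abstract(resource):
    """Hypothetical helper: query the endpoint and return the abstract, or None."""
    resp = requests.get(
        DBPEDIA_ENDPOINT,
        params={"query": abstract_query(resource), "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None
```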
main idea:
- the most salient entities are relevant to the title of the HTML page
- compute the Jaccard similarity between the entity's abstract and the title of the page
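Assuming the Jaccard similarity is taken over token sets (the extension's exact tokenization isn't shown here), a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard index between the token sets of two strings:
    |intersection| / |union| after lowercasing and whitespace splitting."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0  # avoid division by zero for two empty strings
    return len(sa & sb) / len(sa | sb)
```

An abstract whose tokens overlap heavily with the page title then marks its entity as salient.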
extension prerequisites: unidecode, numpy

```shell
pip install -U unidecode numpy
```
run

```shell
# SCRIPT: extension-starter-code.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
bash entension_run.sh > <OUTPUT>
```
- entension_run.sh: runs the extension
- extension-starter-code.py: detects the most salient entities by computing the Jaccard index
- extensionhtml2text.py: gets the page title