
Web Data Process System (NLP/Entity Linking)

  • Built an NLP pipeline in Python on Spark that links web text data to a knowledge base
  • Extracted raw text with BeautifulSoup and pre-processed data with Pandas
  • Tokenized text and recognized named entities with NLTK and Stanford NER
  • Linked each mention to candidate entities in the Freebase knowledge base via Elasticsearch
  • Queried each candidate's abstract in the DBpedia database using SPARQL
  • Computed entity similarity with scikit-learn, obtaining 3.4% precision, 1.2% recall, and 5.4% F1

How to understand web data?


Description

This project performs Entity Linking on a collection of web pages. The method consists of the following five steps:

  1. Extract text from HTML pages in WARC files using BeautifulSoup.
  2. Tokenize each text and recognize named entities in the content using NLTK.
  3. Link each entity mention to a set of candidate entities in Freebase using Elasticsearch.
    This step yields a list of candidate Freebase IDs for each entity mention.
  4. Compute each candidate's abstract by querying DBpedia using SPARQL.
  5. Compute two cosine similarities using scikit-learn (a sketch follows this list):
    • between the abstract and the text the entity mention was retrieved from
    • between the entity mention and the abstract's object (the noun phrase before the first verb)
      The sum of these two scores is the candidate entity's similarity score. Link the entity mention to the candidate entity with the highest similarity score, and return the result in the format: document ID + '\t' + entity surface form + '\t' + Freebase entity ID.
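
As a concrete illustration of step 5, here is a minimal ranking sketch. It assumes candidates arrive as (Freebase ID, abstract) pairs from steps 3 and 4, and it approximates the repo's noun-phrase extraction with the abstract's opening words; function names are illustrative, not the repo's exact code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    # TF-IDF cosine similarity between two strings
    vectors = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

def rank_candidates(doc_id, mention, text, candidates):
    best_id, best_score = None, float("-inf")
    for freebase_id, abstract in candidates:
        # similarity 1: candidate abstract vs. the text the mention came from
        score = cosine(abstract, text)
        # similarity 2: the mention vs. the abstract's object; the noun-phrase
        # extraction is approximated here by the abstract's opening words
        score += cosine(mention, " ".join(abstract.split()[:5]))
        if score > best_score:
            best_id, best_score = freebase_id, score
    # output format: document ID \t entity surface form \t Freebase entity ID
    return "\t".join((doc_id, mention, best_id))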

We also run the method with Spark in cluster mode.

  • Example: the mention "Jobs" is linked to the Freebase ID /m/0k8z
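
starter-code-spark.py holds the real implementation; below is a hedged sketch of the idea, splitting the archive into per-page records on the WARC/1.0 header and running the pipeline as RDD operations. The link_entities helper is a stand-in for the steps above, and the paths reuse the INPUT/OUTPUT values from the run instructions below.

from pyspark import SparkContext

sc = SparkContext(appName="entity-linking")

def link_entities(record):
    # placeholder for html2text -> NER -> Elasticsearch -> SPARQL -> ranking;
    # should return lines of "doc_id \t surface form \t freebase_id"
    return []

# read the WARC file, using the record header as the delimiter,
# so each RDD element is one web page
rdd = sc.newAPIHadoopFile(
    "hdfs:///user/bbkruit/sample.warc.gz",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "WARC/1.0"},
)
results = rdd.map(lambda kv: kv[1]).flatMap(link_entities)
results.saveAsTextFile("hdfs:///user/wdps1811/sample")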

Prerequisites

Python packages: beautifulsoup4, nltk, scikit-learn, requests

pip install -U beautifulsoup4 nltk scikit-learn requests

Stanford NER

wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip  

Path on DAS-4

cd /home/wdps1811/scratch/wdps-group11

How to run

run without Spark

# run
# SCRIPT: starter-code.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
# OUTPUT: sample
bash run.sh

run with Spark

setup environment

# setup virtualenv
python3 -m venv venv
# download python-packages
source venv/bin/activate
export PYTHONPATH=""
pip install -U beautifulsoup4 nltk scikit-learn requests

# download nltk_data and zip it
python -m nltk.downloader -d ./ all
zip -r nltk_data.zip ./nltk_data
# download stanford-ner
wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip 

run

# run
# SCRIPT: starter-code-spark.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
# OUTPUT: sample
bash run_venv.sh <SCRIPT> <INPUT> <OUTPUT>

compute F1-score

# if run with Spark, the output is in HDFS
hdfs dfs -cat /user/wdps1811/sample/* > output.tsv
# compute F1-score
python score.py data/sample.annotations.tsv output.tsv
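
score.py is the repo's scorer; as a rough illustration of what such a scorer computes, here is a minimal sketch that treats both files as sets of document-ID / surface-form / Freebase-ID triples (score.py itself may differ):

import sys

def load_triples(path):
    with open(path) as f:
        return {tuple(line.rstrip("\n").split("\t")[:3])
                for line in f if line.strip()}

gold = load_triples(sys.argv[1])       # e.g. data/sample.annotations.tsv
predicted = load_triples(sys.argv[2])  # e.g. output.tsv

correct = len(gold & predicted)
precision = correct / len(predicted) if predicted else 0.0
recall = correct / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")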

Notes

  1. run.sh: runs the pipeline locally
    run_venv.sh: runs it on the cluster

  2. starter-code.py: main Entity Linking pipeline, including the candidate-ranking code
    starter-code-spark.py: uses RDD operations to perform Entity Linking

  3. html2text.py: WARC -> HTML -> text (sketched below)
    removes HTML tags and useless text (script, comment, code, style, ...)
    gets the text inside '<p>' tags

  4. nlp_preproc.py: text -> tokens -> clean tokens (stopwords removed) -> NER-tagged tokens (sketched below)
    nlp_preproc_spark.py: uses NLTK NER to tag tokens

  5. elasticsearch.py: candidate entity generation
    searches Elasticsearch for candidate Freebase entity IDs (sketched below)

  6. sparql.py: queries each candidate entity's abstract (sketched below)
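
A minimal sketch of the note-3 extraction, using BeautifulSoup as the repo does; the parser choice is an assumption:

from bs4 import BeautifulSoup, Comment

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # drop useless elements: script, style, and code blocks...
    for element in soup(["script", "style", "code"]):
        element.decompose()
    # ...and HTML comments
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # keep only paragraph text
    return "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))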
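
A minimal sketch of the note-4 preprocessing with NLTK's built-in NE chunker (the pipeline can also use Stanford NER); it assumes nltk_data has been downloaded as described above:

import nltk
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def ner_tag(text):
    # text -> tokens -> clean tokens -> NER-tagged chunks
    tokens = nltk.word_tokenize(text)
    clean = [t for t in tokens if t.lower() not in STOPWORDS]
    tree = nltk.ne_chunk(nltk.pos_tag(clean))
    # collect (surface form, entity type) pairs from the chunk tree
    return [(" ".join(word for word, _ in subtree.leaves()), subtree.label())
            for subtree in tree if hasattr(subtree, "label")]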
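
A hypothetical sketch of the note-5 candidate generation; the endpoint URL, index name, and _source field are placeholders for whatever Elasticsearch index of Freebase labels is available:

import requests

ES_URL = "http://localhost:9200/freebase/label/_search"  # placeholder endpoint

def search_candidates(mention, size=20):
    # full-text search over Freebase labels; each hit is assumed to carry
    # the entity's Freebase ID (e.g. /m/0k8z) in its _source
    response = requests.get(ES_URL, params={"q": mention, "size": size})
    hits = response.json().get("hits", {}).get("hits", [])
    return [hit["_source"]["resource"] for hit in hits]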
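
A hedged sketch of the note-6 lookup, assuming the public DBpedia SPARQL endpoint and its owl:sameAs links to Freebase RDF URIs; sparql.py itself may query differently:

import requests

def query_abstract(freebase_id):
    # /m/0k8z -> <http://rdf.freebase.com/ns/m.0k8z>, the form that
    # DBpedia's owl:sameAs links use (an assumption; adapt to your setup)
    fb_uri = "http://rdf.freebase.com/ns/" + freebase_id.strip("/").replace("/", ".")
    query = f"""
        SELECT ?abstract WHERE {{
          ?s owl:sameAs <{fb_uri}> ;
             dbo:abstract ?abstract .
          FILTER (lang(?abstract) = "en")
        }} LIMIT 1
    """
    response = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
    )
    bindings = response.json()["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else ""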

Extension - Detect the most salient entities

main idea:

  • the most salient entities are relevant to the title of the HTML page
  • compute the Jaccard similarity between an entity's abstract and the page title
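
A minimal sketch of that score (the repo's version may additionally normalize text with unidecode):

def jaccard_index(abstract, title):
    # Jaccard similarity between the token sets of abstract and title
    a, b = set(abstract.lower().split()), set(title.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0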

Extension prerequisites: unidecode, numpy

pip install -U unidecode numpy

run

# SCRIPT: extension-starter-code.py
# INPUT: hdfs:///user/bbkruit/sample.warc.gz
bash entension_run.sh > <OUTPUT>

  1. entension_run.sh: runs the extension
  2. extension-starter-code.py: detects the most salient entities by computing the Jaccard index
  3. extensionhtml2text.py: extracts the page title
