pip3 install -r requirements.txt
python3 -m spacy download en_core_web_trf
Run main.py, which processes data/warcs/sample.warc.gz, performing the pre-processing, entity linking, and relation extraction.
Get our implementation from Docker Hub:
docker pull tawvk/wdp-docker
Run the Docker Hub image:
docker run -t tawvk/wdp-docker
Build using:
docker build --tag IMAGE_NAME .
Run using:
docker run --name CONTAINER_NAME IMAGE_NAME
Copy files out of the container using:
docker cp CONTAINER_NAME:PATH/TO/SRC PATH/TO/DEST
Download spaCy model using:
python -m spacy download en_core_web_trf
The demo is performed on a subset of the data due to time limitations. In addition to running the demo on this subset, we also show the output of a run over all the data.
- Preprocessing is done over all warc files.
- Entity linking and relation extraction are done for the first 50 warc files with HTML content.
The pipeline consists of 4 stages:
- Preprocessing
- Entity recognition
- Entity linking
- Relation extraction
- Get iterator over warc files.
- Perform map operation over all individual warc files.
- If a file contains no HTML, skip it.
- Normalize HTML to unescape unicode characters.
- Pass HTML to BeautifulSoup.
- Process title, headers and p tags.
- Remove empty words, non-alphanumeric words with length of 1, and words containing a character that is not alphanumeric, '$', '€', ':', '.', ',', or '-'.
- Sanitize words to remove unnecessary punctuation or non-alphanumeric characters at the end of the word.
- Split the words into sentences based on punctuation or separation by tags.
- Return the text as a single string split into sentences.
- Return mapped warc files as key-title-headers-text tuples.
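As a rough illustration of the extraction and word-filtering rules above, here is a minimal sketch assuming BeautifulSoup's built-in parser; the helper names and the exact character handling are ours, not necessarily the project's:

import html
import re
from bs4 import BeautifulSoup

# Words may contain alphanumerics plus '$', '€', ':', '.', ',' and '-'.
ALLOWED_WORD = re.compile(r"^[0-9A-Za-z$€:.,-]+$")

def extract_text(raw_html):
    # Unescape unicode characters, then pull out title, header, and p tags.
    soup = BeautifulSoup(html.unescape(raw_html), "html.parser")
    title = soup.title.get_text(" ", strip=True) if soup.title else ""
    headers = [h.get_text(" ", strip=True)
               for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])]
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return title, headers, paragraphs

def keep_word(word):
    if not word:                                  # drop empty words
        return False
    if len(word) == 1 and not word.isalnum():     # drop 1-character non-alphanumeric words
        return False
    return bool(ALLOWED_WORD.match(word))         # drop words with disallowed characters

def sanitize_word(word):
    return word.rstrip(".,:;!?")                  # strip trailing punctuation (approximation)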
- Load spaCy model en_core_web_trf to extract named entities and sentences.
- Turn the key-title-headers-text tuples into text-key pairs; the text will be processed and the key is passed along as context.
- Give text-key pairs to nlp.pipe.
- Take a named entity.
- If the named entity is already known, return its mapping immediately; otherwise continue.
- Construct a SPARQL query.
- Execute the query on the http://dbpedia.org/sparql endpoint.
- If an error occurred, try again after 15 seconds.
- Return the result, if any.
- Store entity mention to Wikipedia link mapping.
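A minimal sketch of this lookup, assuming a requests-based call to the public endpoint; the actual SPARQL query, caching structure, and retry limit used in the project may differ:

import time
import requests

ENDPOINT = "http://dbpedia.org/sparql"
linked_entities = {}  # cache: entity mention -> Wikipedia link

def link_entity(mention, max_attempts=3):
    if mention in linked_entities:            # known mention: return its mapping immediately
        return linked_entities[mention]
    # Illustrative query: find a resource with this label and take its Wikipedia page.
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
        "SELECT DISTINCT ?page WHERE { "
        f'?s rdfs:label "{mention}"@en ; foaf:isPrimaryTopicOf ?page . '
        "} LIMIT 1"
    )
    for _ in range(max_attempts):
        try:
            resp = requests.get(ENDPOINT,
                                params={"query": query, "format": "json"},
                                timeout=30)
            resp.raise_for_status()
            bindings = resp.json()["results"]["bindings"]
            link = bindings[0]["page"]["value"] if bindings else None
            linked_entities[mention] = link   # store the mention -> Wikipedia link mapping
            return link
        except requests.RequestException:
            time.sleep(15)                    # on error, try again after 15 seconds
    return None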
- Use ReVerb with the spaCy model vocab.
- Pass the text to ReVerb.
- Loop over all sentences.
- Per sentence extract possible relations.
- Return all relations that have a linked entity on both sides of the relation.
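Conceptually, the filtering looks like the sketch below. The ReVerb wrapper interface shown here (an extract method yielding (arg1, relation, arg2) triples per sentence) is an assumption, not the project's actual API:

def extract_relations(doc, linked_entities, reverb):
    relations = []
    for sentence in doc.sents:                            # loop over all sentences
        for arg1, rel, arg2 in reverb.extract(sentence):  # assumed per-sentence extraction call
            # keep only relations with a linked entity on both sides
            if arg1 in linked_entities and arg2 in linked_entities:
                relations.append((linked_entities[arg1], rel, linked_entities[arg2]))
    return relations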
There are two methods with which the processing was parallelized.
- nlp.pipe()
- pool.map()
nlp.pipe is used for the named entity recognition, which is necessary for the entity linking and relation extraction.
All text is combined with the key in a list of text_context pairs and passed to nlp.pipe using as_tuples=True.
spaCy will process the text and pass the key with it. This processing is done in parallel, where spaCy chooses how many processes to use.
We leave this choice to spaCy to avoid running out of memory, as too many model instantiations could be overly expensive.
import spacy
import spacy_transformers
nlp = spacy.load("en_core_web_trf", disable=[
"textcat",
"tok2vec",
"parser",
"lemmatizer"
])
nlp.add_pipe("sentencizer")
text_context = [(pre_proc_file[3], pre_proc_file[0]) for pre_proc_file in pre_proc_files]
doc_tuples = nlp.pipe(text_context, as_tuples=True)
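Iterating over doc_tuples then yields (doc, key) pairs, so each processed document stays paired with the key of the warc file it came from.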
pool.map is used to parallelize the preprocessing, entity linking, and relation extraction.
This is done using the multiprocessing library where the used pool size is taken as the available CPU count.
import multiprocessing as mp
pool_size = mp.cpu_count()
The preprocessing map is a map over individual warc files, split by the split_records iterator.
with mp.Pool(processes=pool_size) as pool:
    processed_files = pool.map(process_payload, split_records(fo))
The entity linking and relation extraction are parallelized together over individual rows, where a row is a warc file that contained HTML. Each process also receives an instance of the Extraction class, constructed with the available vocabulary; this instance is used as a cache throughout the processing of the row.
with mp.Pool(processes=pool_size) as pool:
    extraction = Extraction(vocab)
    results = pool.map(extraction.process_row, doc_tuples)
All stages combined take ~80 minutes over the full warc contents. The result is 8602 linked entities and 1862 linked relations.