pip3 install -r requirements.txt
python3 -m spacy download en_core_web_trf
Run main.py, which processes data/warcs/sample.warc.gz, performing the pre-processing, entity linking, and relation extraction.
Get our implementation from Docker Hub:
docker pull tawvk/wdp-docker
Run the Docker Hub image:
docker run -t tawvk/wdp-docker
Build using:
docker build --tag IMAGE_NAME .
Run using:
docker run --name CONTAINER_NAME IMAGE_NAME
Copy files out of the container using:
docker cp CONTAINER_NAME:PATH/TO/SRC PATH/TO/DEST
Download spaCy model using:
python -m spacy download en_core_web_trf
The demo is performed on a subset of the data due to time limitations. In addition to running the demo on this subset, we also show the output of a run over all the data.
- Preprocessing is done over all warc files.
- Entity linking and relation extraction are done for the first 50 warc files with HTML content.
The pipeline consists of 4 stages:
- Preprocessing
- Entity recognition
- Entity linking
- Relation extraction
- Get iterator over warc files.
- Perform map operation over all individual warc files.
- If a file contains no HTML, skip it.
- Normalize HTML to unescape unicode characters.
- Pass HTML to BeautifulSoup.
- Process title, headers and p tags.
- Remove empty words, non-alphanumeric words with length of 1, and words containing a character that is not alphanumeric, '$', '€', ':', '.', ',', or '-'.
- Sanitize words to remove unnecessary punctuation or non-alphanumeric characters at the end of the word.
- Split the words into sentences based on punctuation or separation by tags.
- Return the text as a single string split into sentences.
- Return mapped warc files as key-title-headers-text tuples.
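As a rough illustration of the extraction and word-filtering rules above, here is a minimal sketch assuming BeautifulSoup's built-in parser; the helper names and the exact character handling are ours, not necessarily the project's:

import html
import re
from bs4 import BeautifulSoup

# Words may contain alphanumerics plus '$', '€', ':', '.', ',' and '-'.
ALLOWED_WORD = re.compile(r"^[0-9A-Za-z$€:.,-]+$")

def extract_text(raw_html):
    # Unescape unicode characters, then pull out title, header, and p tags.
    soup = BeautifulSoup(html.unescape(raw_html), "html.parser")
    title = soup.title.get_text(" ", strip=True) if soup.title else ""
    headers = [h.get_text(" ", strip=True)
               for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])]
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return title, headers, paragraphs

def keep_word(word):
    if not word:                                  # drop empty words
        return False
    if len(word) == 1 and not word.isalnum():     # drop 1-character non-alphanumeric words
        return False
    return bool(ALLOWED_WORD.match(word))         # drop words with disallowed characters

def sanitize_word(word):
    return word.rstrip(".,:;!?")                  # strip trailing punctuation (approximation)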
- Load spaCy model en_core_web_trf to extract named entities and sentences.
- Turn the key-title-headers-text tuples into text-key pairs; the text will be processed and the key is passed along as context.
- Give text-key pairs to nlp.pipe.
- Take a named entity.
- If the named entity is already known, return its mapping immediately; otherwise continue.
- Construct a SPARQL query.
- Execute the query on the http://dbpedia.org/sparql endpoint.
- If an error occurred, try again after 15 seconds.
- Return the result, if any.
- Store entity mention to Wikipedia link mapping.
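A minimal sketch of this lookup, assuming a requests-based call to the public endpoint; the actual SPARQL query, caching structure, and retry limit used in the project may differ:

import time
import requests

ENDPOINT = "http://dbpedia.org/sparql"
linked_entities = {}  # cache: entity mention -> Wikipedia link

def link_entity(mention, max_attempts=3):
    if mention in linked_entities:            # known mention: return its mapping immediately
        return linked_entities[mention]
    # Illustrative query: find a resource with this label and take its Wikipedia page.
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
        "SELECT DISTINCT ?page WHERE { "
        f'?s rdfs:label "{mention}"@en ; foaf:isPrimaryTopicOf ?page . '
        "} LIMIT 1"
    )
    for _ in range(max_attempts):
        try:
            resp = requests.get(ENDPOINT,
                                params={"query": query, "format": "json"},
                                timeout=30)
            resp.raise_for_status()
            bindings = resp.json()["results"]["bindings"]
            link = bindings[0]["page"]["value"] if bindings else None
            linked_entities[mention] = link   # store the mention -> Wikipedia link mapping
            return link
        except requests.RequestException:
            time.sleep(15)                    # on error, try again after 15 seconds
    return None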
- Use ReVerb with the spaCy model vocab.
- Pass the text to ReVerb.
- Loop over all sentences.
- Per sentence extract possible relations.
- Return all relations that have a linked entity on both sides of the relation.
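Conceptually, the filtering looks like the sketch below. The ReVerb wrapper interface shown here (an extract method yielding (arg1, relation, arg2) triples per sentence) is an assumption, not the project's actual API:

def extract_relations(doc, linked_entities, reverb):
    relations = []
    for sentence in doc.sents:                            # loop over all sentences
        for arg1, rel, arg2 in reverb.extract(sentence):  # assumed per-sentence extraction call
            # keep only relations with a linked entity on both sides
            if arg1 in linked_entities and arg2 in linked_entities:
                relations.append((linked_entities[arg1], rel, linked_entities[arg2]))
    return relations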
There are two methods with which the processing was parallelized.
- nlp.pipe()
- pool.map()
nlp.pipe is used for the named entity recognition, which is necessary for the entity linking and relation extraction.
All text is combined with the key in a list of text_context pairs and passed to nlp.pipe using as_tuples=True.
spaCy will process the text and pass the key with it. This processing is done in parallel, where spaCy chooses how many processes to use.
We leave this choice to spaCy to avoid running out of memory, as too many model instantiations could be overly expensive.
import spacy
import spacy_transformers
nlp = spacy.load("en_core_web_trf", disable=[
"textcat",
"tok2vec",
"parser",
"lemmatizer"
])
nlp.add_pipe("sentencizer")
text_context = [(pre_proc_file[3], pre_proc_file[0]) for pre_proc_file in pre_proc_files]
doc_tuples = nlp.pipe(text_context, as_tuples=True)
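Iterating over doc_tuples then yields (doc, key) pairs, so each processed document stays paired with the key of the warc file it came from.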
pool.map is used to parallelize the preprocessing, entity linking, and relation extraction.
This is done using the multiprocessing library where the used pool size is taken as the available CPU count.
import multiprocessing as mp
pool_size = mp.cpu_count()
The preprocessing map is a map over individual warc files, split by the split_records iterator.
with mp.Pool(processes=pool_size) as pool:
    processed_files = pool.map(process_payload, split_records(fo))
The entity linking and relation extraction are parallelized together over individual rows, where a row is a warc file that contained HTML. Each process also receives an instance of the Extraction class, constructed with the available vocabulary; this instance is used as a cache throughout the processing of the row.
with mp.Pool(processes=pool_size) as pool:
    extraction = Extraction(vocab)
    results = pool.map(extraction.process_row, doc_tuples)
All stages combined take ~80 minutes over the full warc contents. The result is 8602 linked entities and 1862 linked relations.