
Tools for Processing Wikipedia Data for QA and NER Tasks

Aim

Extract the content of a Wikipedia dump and save it as a csv file compatible with the annotation software Label Studio.

  • By default, the tool samples the first 100 articles from the dump.
  • It extracts the title, id, url, and text of each article.

Quick Start

Modify the two variables in the shell script wikiDumpToCSV.sh to your needs. These are marked with # TODO in the script. Then, run the script, e.g.:

./wikiDumpToCSV.sh

The script will download the dump file, run Wiki Extractor on it, and then run wikiExtract2csv.py on the output generated by Wiki Extractor. Details are described in the section below. The two output .csv files (articles_sampled_QA.csv and articles_sampled_NER.csv) will be saved in the directory specified by OUT_FOLDER in the script.

Details

  1. Download the Wikipedia dump corresponding to the desired language from https://dumps.wikimedia.org/<ISO CODE OF THE LANGUAGE>wiki/, e.g. for ALS. Then select the pages-articles.xml.bz2 variant of the dump file.

For ALS, the link is https://dumps.wikimedia.org/alswiki/20230701/alswiki-20230701-pages-articles.xml.bz2.

  2. Run Wiki Extractor on the downloaded dump file, e.g.: python -m wikiextractor.WikiExtractor alswiki-20230701-pages-articles.xml.bz2 --output ../../Documents/MRL_ST_2023/enwiki-20230420_extracted. There is no need to unzip the file before running Wiki Extractor.
  3. To convert the data to the required format, run the script wikiExtract2csv.py on the output generated in step 2 (a rough sketch of this conversion follows the list). Make sure the output file's suffix is .csv.
  4. To generate the question-answer pairs, run the script questions2QA.py (described below) on a Label Studio question project snapshot. Again, make sure the output file's suffix is .csv.
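For orientation, here is a minimal sketch of what the conversion in step 3 boils down to, assuming Wiki Extractor's classic <doc ...> output format (not --json). The paths, regex, and column layout are illustrative assumptions, not the actual implementation of wikiExtract2csv.py:

```python
# Hedged sketch of the core extraction step; not the actual wikiExtract2csv.py code.
import csv
import re
from pathlib import Path

# Wiki Extractor's classic output wraps each article in <doc id= url= title=> ... </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]+)" url="(?P<url>[^"]+)" title="(?P<title>[^"]+)">\n'
    r'(?P<text>.*?)</doc>',
    re.S,
)

def extract_articles(input_dir, sample_size=100, min_length=1000):
    articles = []
    for path in sorted(Path(input_dir).glob("wiki_*")):
        for m in DOC_RE.finditer(path.read_text(encoding="utf-8")):
            text = m.group("text").strip()
            if len(text) < min_length:          # --min filter
                continue
            articles.append((m.group("title"), m.group("id"), m.group("url"), text))
            if len(articles) >= sample_size:    # --sample_size: keep the first N articles
                return articles
    return articles

rows = extract_articles("../../Documents/MRL_ST_2023/enwiki-20230420_extracted/AA/")
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "id", "url", "text"])
    writer.writerows(rows)
```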

Dependencies

wikiExtract2csv.py

Usage

python wikiExtract2csv.py --input INPUT_DIR [--output OUTPUT_DIR] [--sample_size SAMPLE_SIZE] [--min MIN_LENGTH] [--split_by SPLIT_BY]
  • INPUT_DIR: path to a directory containing Wiki Extractor output, i.e. one of the subfolders it creates, e.g. ../../Documents/enwiki/AA/.
  • OUTPUT_DIR: path where the output csv file will be saved, default is ./articles.csv
  • sample_size: number of articles to sample from the input, default is 100
  • min: minimum number of characters an article needs to have to be included in the output, default is 1000.
  • split_by: choose whether to split the text of each article into sentences or paragraphs. Default is both, which creates two separate csv output files (a rough splitting sketch follows the example below). Use --split_by sentence to split by sentences, or --split_by paragraph to split by paragraphs.

Example usage

To generate sentence samples and paragraph samples:

python wikiExtract2csv.py --input "../../Documents/MRL_ST_2023/enwiki-20230420_extracted/AA/" --output "../../Documents/MRL_ST_2023/enwiki/" --sample_size 100 --min 1000
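The effect of --split_by can be pictured with a small sketch. The naive punctuation-based sentence splitter below is an illustrative assumption; the script itself may use a proper tokenizer:

```python
# Hedged sketch of the two splitting modes behind --split_by.
import re

def split_paragraphs(text):
    # Paragraphs are assumed to be separated by blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text):
    # Naive split on sentence-final punctuation; language-specific
    # tokenizers would be more robust.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "First sentence. Second sentence.\n\nNew paragraph here."
print(split_paragraphs(text))  # ['First sentence. Second sentence.', 'New paragraph here.']
print(split_sentences(text))   # ['First sentence.', 'Second sentence.', 'New paragraph here.']
```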

questions2QA.py

This is a simple Python script to create question-answer pairs from a Label Studio question project snapshot. It extracts the questions and texts from the snapshot and saves them in a csv file, creating one task per question; a rough sketch of this conversion follows the example below.

Usage

python questions2QA.py --input INPUT_FILE --output OUTPUT_FILE [--labels] [--n_tasks N_TASKS]
  • INPUT_FILE: json file containing LabelStudio snapshot
  • OUTPUT_FILE: path to the output csv file, default is ./article_question_pairs.csv
  • labels: add this flag to include the labels (answers) in the output csv file.
  • N_TASKS: number of tasks to keep from the input (only the first N_TASKS are kept), default is 100

Example usage

 python questions2QA.py --input /Users/dug/Py/wikiExtract2csv/Question_Exports/ID_Questions.json --output /Users/dug/Py/wikiExtract2csv/Answer_Tasks/answer_tasks_ID.csv
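For orientation, a hedged sketch of the conversion: read the snapshot JSON, pull one row per annotated question, and write a csv. The field names ("text" in the task data, and the textarea result under value.text) depend on the labeling configuration and are assumptions here:

```python
# Hedged sketch of questions2QA.py; field names are assumptions about the
# Label Studio labeling configuration.
import csv
import json

def snapshot_to_qa_rows(snapshot_path, n_tasks=100):
    with open(snapshot_path, encoding="utf-8") as f:
        tasks = json.load(f)[:n_tasks]           # keep the first N_TASKS tasks
    rows = []
    for task in tasks:
        context = task["data"].get("text", "")   # article text shown to annotators
        for ann in task.get("annotations", []):
            for result in ann.get("result", []):
                # textarea results carry their strings under value.text
                for question in result.get("value", {}).get("text", []):
                    rows.append({"text": context, "question": question})
    return rows

rows = snapshot_to_qa_rows("Question_Exports/ID_Questions.json")
with open("article_question_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "question"])
    writer.writeheader()
    writer.writerows(rows)
```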

answers2csv.py

Prerequisites: a Label Studio answer project snapshot, exported as a csv file. This script creates clean csv files containing only the text, the question and, optionally, the answer; it eliminates annotator information (see the sketch after the usage line below).

Usage

python answers2csv.py --input INPUT_FILE_PATH --output OUTPUT_FOLDER [--labels]
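Conceptually, the clean-up amounts to keeping only the content columns of the export. The column names below are assumptions and should be matched to the actual Label Studio export:

```python
# Hedged sketch of the clean-up in answers2csv.py: keep only content columns
# and drop annotator metadata. Column names are assumptions.
import pandas as pd

df = pd.read_csv("answer_project_export.csv")
keep = [c for c in ("text", "question", "answer") if c in df.columns]
df[keep].to_csv("answers_clean.csv", index=False)
```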

get_text_from_url.py

Script to scrape the relevant Wikipedia articles from the urls in the language biography csv files. The script extracts the text from the urls and saves it in a csv file containing the id, url, and text of each article. It also splits the text into sentences and paragraphs and saves the output in separate csv files; a rough scraping sketch follows the example below.

Usage

First, create the language biography file by running wikiExtract2csv/NER/get_wikipedia_url_from_wikidata.py on the id_list.csv file. Make sure to change the language codes in the script.

python get_text_from_url.py --input INPUT_FILE_PATH  --output OUTPUT_FOLDER --split_by SPLIT_BY         
  • INPUT_FILE_PATH: path to the language biography csv file containing the urls and ids of the articles, e.g. NER/lang_biography/en.csv
  • OUTPUT_FOLDER: path to the output folder where the processed text files will be saved.
  • SPLIT_BY: choose whether to split the text of each article into sentences or paragraphs. Default is both, which creates two separate csv output files. Use --split_by sentence to split by sentences, or --split_by paragraph to split by paragraphs.

Example Usage

python get_text_from_url.py --input NER/lang_biography/en.csv --output Test_Outputs_NER/ --sample_size 10 --split_by sentence         
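A hedged sketch of the scraping step; the column names in the biography csv and the use of requests/BeautifulSoup are assumptions about the setup, not the script's actual implementation:

```python
# Hedged sketch: fetch each URL from the language biography csv and keep the
# paragraph text. Column names ("id", "url") are assumptions.
import csv
import requests
from bs4 import BeautifulSoup

def scrape(input_csv, output_csv):
    with open(input_csv, encoding="utf-8") as f_in, \
         open(output_csv, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["id", "url", "text"])
        for row in csv.DictReader(f_in):
            html = requests.get(row["url"], timeout=30).text
            soup = BeautifulSoup(html, "html.parser")
            # Wikipedia article bodies mostly live in <p> tags
            paragraphs = [p.get_text(" ", strip=True) for p in soup.select("p")]
            writer.writerow([row["id"], row["url"], "\n\n".join(paragraphs)])

scrape("NER/lang_biography/en.csv", "Test_Outputs_NER/en_text.csv")
```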

NER_postprocessing.py

Script to postprocess the conll file exported from a Label Studio NER project. Its main purpose is to remove date tags; a rough sketch follows the option list below.

Usage

python NER_postprocessing.py --input INPUT_FILE_PATH  --output OUTPUT_FILE_PATH     
  • INPUT_FILE_PATH: path to the conll file exported from LabelStudio, e.g. NER/lang_biography/ALS_NER.conll
  • OUTPUT_FILE_PATH: path to the file where the postprocessed output will be written.
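A hedged sketch of the date-tag removal, assuming the label is the last whitespace-separated field on each token line of the exported conll file:

```python
# Hedged sketch of NER_postprocessing.py: rewrite DATE labels as "O".
def strip_date_tags(in_path, out_path):
    with open(in_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            parts = line.rstrip("\n").split()
            if parts and parts[-1].endswith("DATE"):   # e.g. B-DATE, I-DATE
                parts[-1] = "O"
            f_out.write(" ".join(parts) + "\n")        # blank lines are preserved

strip_date_tags("NER/lang_biography/ALS_NER.conll", "ALS_NER_no_dates.conll")
```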

Evaluation

NER Evaluation

Using conlleval-python from sighsmile/conlleval.

  1. Clone the sighsmile/conlleval repository and move the conlleval.py file to the current directory.

  2. Append the predictions to the conll file such that each line has the format token true_label predicted_label. This can be done with combine_conll_file_tags.py (a sketch of the combining logic follows these steps):

    python combine_conll_file_tags.py --predictions NER_ALS_Test_PREDICTIONS.conll --labels NER_ALS_Test_GOLD.conll --output NER_ALS_Test_combined.conll
  3. Score the conll file using:

    python conlleval.py < NER_ALS_Test_combined.conll > NER_ALS_Test_Result.txt     
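For reference, a hedged sketch of the combining performed in step 2, assuming the gold and prediction files align token for token and carry the label in the last column; combine_conll_file_tags.py itself may differ in detail:

```python
# Hedged sketch: pair gold and prediction conll files line by line and emit
# "token true_label predicted_label", the format conlleval.py expects.
def combine(gold_path, pred_path, out_path):
    with open(gold_path, encoding="utf-8") as gold, \
         open(pred_path, encoding="utf-8") as pred, \
         open(out_path, "w", encoding="utf-8") as out:
        for g_line, p_line in zip(gold, pred):
            g, p = g_line.split(), p_line.split()
            if not g:                      # sentence boundary
                out.write("\n")
                continue
            # token from the gold file, then gold tag, then predicted tag
            out.write(f"{g[0]} {g[-1]} {p[-1]}\n")

combine("NER_ALS_Test_GOLD.conll", "NER_ALS_Test_PREDICTIONS.conll",
        "NER_ALS_Test_combined.conll")
```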

QA Evaluation

Metrics

All metrics are calculated using implementations from the Hugging Face Evaluate library. The following metrics are used (a usage sketch follows the list):

  • ChrF ('char_order': 6, 'word_order': 0, 'beta': 2)
  • ChrF+ ('char_order': 6, 'word_order': 1, 'beta': 2)
  • ChrF++ ('char_order': 6, 'word_order': 2, 'beta': 2)
  • RougeL (Longest common subsequence based scoring)
  • BERTScore F1 using embeddings from RoBERTa-base ('hashcode': 'roberta-base_L10_no-idf_version=0.3.12(hug_trans=4.34.0)')
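A hedged sketch of computing these metrics with the Hugging Face evaluate library; evaluate_QA.py may aggregate the scores differently:

```python
# Hedged sketch of the metric computation using Hugging Face `evaluate`.
import evaluate

predictions = ["Bern is the capital of Switzerland."]
references = [["The capital of Switzerland is Bern."]]

chrf = evaluate.load("chrf")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# ChrF / ChrF+ / ChrF++ differ only in word_order (0 / 1 / 2)
for word_order in (0, 1, 2):
    score = chrf.compute(predictions=predictions, references=references,
                         char_order=6, word_order=word_order, beta=2)
    print(f"chrF word_order={word_order}: {score['score']:.2f}")

# RougeL: longest-common-subsequence based scoring
print(rouge.compute(predictions=predictions,
                    references=[r[0] for r in references])["rougeL"])

# BERTScore F1 with roberta-base embeddings
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references],
                        model_type="roberta-base")["f1"])
```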

Usage

evaluate_QA.py can be used as follows; it needs to be run separately for every language and system:

python evaluate_QA.py --predictions PRED_FILE --labels GOLD_FILE --results RESULTS_FILE
  • GOLD_FILE is the file containing the gold answers
  • PRED_FILE is the file containing a system's predicted answers
  • RESULTS_FILE is the file to which the average scores per language will be written.

Detailed scores (the scores for each paragraph) will be appended as columns to the PRED_FILE.