
Tools for Processing Wikipedia Data for QA and NER Tasks

Aim

Extract the content of a Wikipedia dump and save it as a csv file compatible with the annotation software Label Studio.

  • By default, the tool samples the first 100 articles from the dump.
  • It extracts the title, id, url, and text of each article.

Quick Start

Modify the two variables in the shell script wikiDumpToCSV.sh to your needs. These are marked with # TODO in the script. Then, run the script, e.g.:

./wikiDumpToCSV.sh

The script will download the dump file, run Wiki Extractor on it, and then run wikiExtract2csv.py on the output generated by Wiki Extractor. Details are described in the section below. The two output .csv files (articles_sampled_QA.csv and articles_sampled_NER.csv) will be saved in the directory specified by OUT_FOLDER in the script.

Details

  1. Download the Wikipedia dump corresponding to the desired language from https://dumps.wikimedia.org/<ISO CODE OF THE LANGUAGE>wiki/, e.g. for ALS. Then select the pages-articles.xml.bz2 variant of the dump file.

For ALS, the link is https://dumps.wikimedia.org/alswiki/20230701/alswiki-20230701-pages-articles.xml.bz2.

  2. Run Wiki Extractor on the downloaded dump file, e.g.: python -m wikiextractor.WikiExtractor alswiki-20230701-pages-articles.xml.bz2 --output ../../Documents/MRL_ST_2023/enwiki-20230420_extracted. There is no need to unzip the file before running Wiki Extractor.
  3. To convert the data to the required format, run the script wikiExtract2csv.py on the output generated in step 2 (a rough sketch of this conversion follows the list). Make sure the output file's suffix is .csv.
  4. To generate the question-answer pairs, run the script questions2QA.py (described below) on a Label Studio question project snapshot. Again, make sure the output file's suffix is .csv.
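For orientation, here is a minimal sketch of what the conversion in step 3 boils down to, assuming Wiki Extractor's classic <doc ...> output format (not --json). The paths, regex, and column layout are illustrative assumptions, not the actual implementation of wikiExtract2csv.py:

```python
# Hedged sketch of the core extraction step; not the actual wikiExtract2csv.py code.
import csv
import re
from pathlib import Path

# Wiki Extractor's classic output wraps each article in <doc id= url= title=> ... </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]+)" url="(?P<url>[^"]+)" title="(?P<title>[^"]+)">\n'
    r'(?P<text>.*?)</doc>',
    re.S,
)

def extract_articles(input_dir, sample_size=100, min_length=1000):
    articles = []
    for path in sorted(Path(input_dir).glob("wiki_*")):
        for m in DOC_RE.finditer(path.read_text(encoding="utf-8")):
            text = m.group("text").strip()
            if len(text) < min_length:          # --min filter
                continue
            articles.append((m.group("title"), m.group("id"), m.group("url"), text))
            if len(articles) >= sample_size:    # --sample_size: keep the first N articles
                return articles
    return articles

rows = extract_articles("../../Documents/MRL_ST_2023/enwiki-20230420_extracted/AA/")
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "id", "url", "text"])
    writer.writerows(rows)
```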

Dependencies

wikiExtract2csv.py

Usage

python wikiExtract2csv.py --input INPUT_DIR [--output OUTPUT_DIR] [--sample_size SAMPLE_SIZE] [--min MIN_LENGTH] [--split_by SPLIT_BY]
  • INPUT_DIR: path to a directory containing Wiki Extractor output, i.e. one of the subfolders it creates, e.g. ../../Documents/enwiki/AA/.
  • OUTPUT_DIR: path where the output csv file will be saved, default is ./articles.csv
  • sample_size: number of articles to sample from the input, default is 100
  • min: minimum number of characters an article needs to have to be included in the output, default is 1000.
  • split_by: choose whether to split the text of each article into sentences or paragraphs. Default is both, which creates two separate csv output files (a rough splitting sketch follows the example below). Use --split_by sentence to split by sentences, or --split_by paragraph to split by paragraphs.

Example usage

To generate sentence samples and paragraph samples:

python wikiExtract2csv.py --input "../../Documents/MRL_ST_2023/enwiki-20230420_extracted/AA/" --output "../../Documents/MRL_ST_2023/enwiki/" --sample_size 100 --min 1000
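The effect of --split_by can be pictured with a small sketch. The naive punctuation-based sentence splitter below is an illustrative assumption; the script itself may use a proper tokenizer:

```python
# Hedged sketch of the two splitting modes behind --split_by.
import re

def split_paragraphs(text):
    # Paragraphs are assumed to be separated by blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text):
    # Naive split on sentence-final punctuation; language-specific
    # tokenizers would be more robust.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "First sentence. Second sentence.\n\nNew paragraph here."
print(split_paragraphs(text))  # ['First sentence. Second sentence.', 'New paragraph here.']
print(split_sentences(text))   # ['First sentence.', 'Second sentence.', 'New paragraph here.']
```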

questions2QA.py

This is a simple Python script to create question-answer pairs from a Label Studio question project snapshot. It extracts the questions and texts from the snapshot and saves them in a csv file, creating one task per question; a rough sketch of this conversion follows the example below.

Usage

python questions2QA.py --input INPUT_FILE --output OUTPUT_FILE [--labels] [--n_tasks N_TASKS]
  • INPUT_FILE: json file containing LabelStudio snapshot
  • OUTPUT_FILE: path to the output csv file, default is ./article_question_pairs.csv
  • labels: add this flag to include the labels (answers) in the output csv file.
  • N_TASKS: number of tasks to keep from the input (only the first N_TASKS are kept), default is 100

Example usage

 python questions2QA.py --input /Users/dug/Py/wikiExtract2csv/Question_Exports/ID_Questions.json --output /Users/dug/Py/wikiExtract2csv/Answer_Tasks/answer_tasks_ID.csv
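For orientation, a hedged sketch of the conversion: read the snapshot JSON, pull one row per annotated question, and write a csv. The field names ("text" in the task data, and the textarea result under value.text) depend on the labeling configuration and are assumptions here:

```python
# Hedged sketch of questions2QA.py; field names are assumptions about the
# Label Studio labeling configuration.
import csv
import json

def snapshot_to_qa_rows(snapshot_path, n_tasks=100):
    with open(snapshot_path, encoding="utf-8") as f:
        tasks = json.load(f)[:n_tasks]           # keep the first N_TASKS tasks
    rows = []
    for task in tasks:
        context = task["data"].get("text", "")   # article text shown to annotators
        for ann in task.get("annotations", []):
            for result in ann.get("result", []):
                # textarea results carry their strings under value.text
                for question in result.get("value", {}).get("text", []):
                    rows.append({"text": context, "question": question})
    return rows

rows = snapshot_to_qa_rows("Question_Exports/ID_Questions.json")
with open("article_question_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "question"])
    writer.writeheader()
    writer.writerows(rows)
```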

answers2csv.py

Prerequisites: a Label Studio answer project snapshot, exported as a csv file. This script creates clean csv files containing only the text, the question and, optionally, the answer; it eliminates annotator information (see the sketch after the usage line below).

Usage

python answers2csv.py --input INPUT_FILE_PATH --output OUTPUT_FOLDER [--labels]
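Conceptually, the clean-up amounts to keeping only the content columns of the export. The column names below are assumptions and should be matched to the actual Label Studio export:

```python
# Hedged sketch of the clean-up in answers2csv.py: keep only content columns
# and drop annotator metadata. Column names are assumptions.
import pandas as pd

df = pd.read_csv("answer_project_export.csv")
keep = [c for c in ("text", "question", "answer") if c in df.columns]
df[keep].to_csv("answers_clean.csv", index=False)
```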

get_text_from_url.py

Script to scrape the relevant Wikipedia articles from the urls in the language biography csv files. The script extracts the text from the urls and saves it in a csv file containing the id, url, and text of each article. It also splits the text into sentences and paragraphs and saves the output in separate csv files; a rough scraping sketch follows the example below.

Usage

First, create the language biography file by running wikiExtract2csv/NER/get_wikipedia_url_from_wikidata.py on the id_list.csv file. Make sure to change the language codes in the script.

python get_text_from_url.py --input INPUT_FILE_PATH  --output OUTPUT_FOLDER --split_by SPLIT_BY         
  • INPUT_FILE_PATH: path to the language biography csv file containing the urls and ids of the articles, e.g. NER/lang_biography/en.csv
  • OUTPUT_FOLDER: path to the output folder where the processed text files will be saved.
  • SPLIT_BY: choose whether to split the text of each article into sentences or paragraphs. Default is both, which creates two separate csv output files. Use --split_by sentence to split by sentences, or --split_by paragraph to split by paragraphs.

Example Usage

python get_text_from_url.py --input NER/lang_biography/en.csv --output Test_Outputs_NER/ --sample_size 10 --split_by sentence         
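A hedged sketch of the scraping step; the column names in the biography csv and the use of requests/BeautifulSoup are assumptions about the setup, not the script's actual implementation:

```python
# Hedged sketch: fetch each URL from the language biography csv and keep the
# paragraph text. Column names ("id", "url") are assumptions.
import csv
import requests
from bs4 import BeautifulSoup

def scrape(input_csv, output_csv):
    with open(input_csv, encoding="utf-8") as f_in, \
         open(output_csv, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["id", "url", "text"])
        for row in csv.DictReader(f_in):
            html = requests.get(row["url"], timeout=30).text
            soup = BeautifulSoup(html, "html.parser")
            # Wikipedia article bodies mostly live in <p> tags
            paragraphs = [p.get_text(" ", strip=True) for p in soup.select("p")]
            writer.writerow([row["id"], row["url"], "\n\n".join(paragraphs)])

scrape("NER/lang_biography/en.csv", "Test_Outputs_NER/en_text.csv")
```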

NER_postprocessing.py

Script to postprocess the conll file exported from a Label Studio NER project. Its main purpose is to remove date tags; a rough sketch follows the option list below.

Usage

python NER_postprocessing.py --input INPUT_FILE_PATH  --output OUTPUT_FILE_PATH     
  • INPUT_FILE_PATH: path to the conll file exported from LabelStudio, e.g. NER/lang_biography/ALS_NER.conll
  • OUTPUT_FILE_PATH: path to the file where the postprocessed output will be written.
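A hedged sketch of the date-tag removal, assuming the label is the last whitespace-separated field on each token line of the exported conll file:

```python
# Hedged sketch of NER_postprocessing.py: rewrite DATE labels as "O".
def strip_date_tags(in_path, out_path):
    with open(in_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            parts = line.rstrip("\n").split()
            if parts and parts[-1].endswith("DATE"):   # e.g. B-DATE, I-DATE
                parts[-1] = "O"
            f_out.write(" ".join(parts) + "\n")        # blank lines are preserved

strip_date_tags("NER/lang_biography/ALS_NER.conll", "ALS_NER_no_dates.conll")
```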

Evaluation

NER Evaluation

Using conlleval-python from sighsmile/conlleval.

  1. Clone the sighsmile/conlleval repository and move the conlleval.py file to the current directory.

  2. Append the predictions to the conll file such that each line has the format token true_label predicted_label. This can be done with combine_conll_file_tags.py (a sketch of the combining logic follows these steps):

    python combine_conll_file_tags.py --predictions NER_ALS_Test_PREDICTIONS.conll --labels NER_ALS_Test_GOLD.conll --output NER_ALS_Test_combined.conll
  3. Score the conll file using:

    python conlleval.py < NER_ALS_Test_combined.conll > NER_ALS_Test_Result.txt     
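For reference, a hedged sketch of the combining performed in step 2, assuming the gold and prediction files align token for token and carry the label in the last column; combine_conll_file_tags.py itself may differ in detail:

```python
# Hedged sketch: pair gold and prediction conll files line by line and emit
# "token true_label predicted_label", the format conlleval.py expects.
def combine(gold_path, pred_path, out_path):
    with open(gold_path, encoding="utf-8") as gold, \
         open(pred_path, encoding="utf-8") as pred, \
         open(out_path, "w", encoding="utf-8") as out:
        for g_line, p_line in zip(gold, pred):
            g, p = g_line.split(), p_line.split()
            if not g:                      # sentence boundary
                out.write("\n")
                continue
            # token from the gold file, then gold tag, then predicted tag
            out.write(f"{g[0]} {g[-1]} {p[-1]}\n")

combine("NER_ALS_Test_GOLD.conll", "NER_ALS_Test_PREDICTIONS.conll",
        "NER_ALS_Test_combined.conll")
```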

QA Evaluation

Metrics

All metrics are calculated using implementations from the Hugging Face Evaluate library. The following metrics are used (a usage sketch follows the list):

  • ChrF ('char_order': 6, 'word_order': 0, 'beta': 2)
  • ChrF+ ('char_order': 6, 'word_order': 1, 'beta': 2)
  • ChrF++ ('char_order': 6, 'word_order': 2, 'beta': 2)
  • RougeL (Longest common subsequence based scoring)
  • BERTScore F1 using embeddings from RoBERTa-base ('hashcode': 'roberta-base_L10_no-idf_version=0.3.12(hug_trans=4.34.0)')
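A hedged sketch of computing these metrics with the Hugging Face evaluate library; evaluate_QA.py may aggregate the scores differently:

```python
# Hedged sketch of the metric computation using Hugging Face `evaluate`.
import evaluate

predictions = ["Bern is the capital of Switzerland."]
references = [["The capital of Switzerland is Bern."]]

chrf = evaluate.load("chrf")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# ChrF / ChrF+ / ChrF++ differ only in word_order (0 / 1 / 2)
for word_order in (0, 1, 2):
    score = chrf.compute(predictions=predictions, references=references,
                         char_order=6, word_order=word_order, beta=2)
    print(f"chrF word_order={word_order}: {score['score']:.2f}")

# RougeL: longest-common-subsequence based scoring
print(rouge.compute(predictions=predictions,
                    references=[r[0] for r in references])["rougeL"])

# BERTScore F1 with roberta-base embeddings
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references],
                        model_type="roberta-base")["f1"])
```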

Usage

evaluate_QA.py can be used as follows; it needs to be run separately for every language and system:

python evaluate_QA.py --predictions PRED_FILE --labels GOLD_FILE --results RESULTS_FILE
  • GOLD_FILE is the file containing the gold answers
  • PRED_FILE is the file containing a system's predicted answers
  • RESULTS_FILE is the file to which the average scores per language will be written.

Detailed scores (the scores for each paragraph) will be appended as columns to the PRED_FILE.