Extracts the content of a Wikipedia dump and saves it as a csv file compatible with the annotation software Label Studio.
- by default, it samples the first 100 articles from the dump
- it extracts the title, id, url, and text of each article
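For reference, the extraction step amounts to roughly the following (a minimal sketch assuming WikiExtractor's default <doc ...> output format; the function and variable names are illustrative, not the actual code of wikiExtract2csv.py):

```python
# Minimal sketch of the extraction step, assuming WikiExtractor's default
# <doc id=... url=... title=...> ... </doc> output format.
import csv
import re
from pathlib import Path

DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n(?P<text>.*?)</doc>',
    re.DOTALL,
)

def extract_articles(extracted_dir, out_csv, sample_size=100):
    """Collect (title, id, url, text) rows from WikiExtractor output files."""
    rows = []
    for path in sorted(Path(extracted_dir).glob("wiki_*")):
        for match in DOC_RE.finditer(path.read_text(encoding="utf-8")):
            rows.append([match["title"], match["id"], match["url"], match["text"].strip()])
            if len(rows) >= sample_size:
                break
        if len(rows) >= sample_size:
            break
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "id", "url", "text"])
        writer.writerows(rows)
```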
Modify the two variables in the shell script wikiDumpToCSV.sh to your needs. These are marked with # TODO
in the script. Then, run the script, e.g.:
./wikiDumpToCSV.sh
The script will download the dump file, run Wiki Extractor on it, and then run the script wikiExtract2csv.py on the output generated by Wiki Extractor. Details are described in the section below. The two output .csv files (articles_sampled_QA.csv and articles_sampled_NER.csv) will be saved in the directory specified by OUT_FOLDER in the script.
- Download the Wikipedia dump corresponding to the desired language from https://dumps.wikimedia.org/<ISO CODE OF THE LANGUAGE>wiki/ (e.g. https://dumps.wikimedia.org/alswiki/ for ALS). Then select the pages-articles.xml.bz2 variant of the dump file. For ALS, the link is https://dumps.wikimedia.org/alswiki/20230701/alswiki-20230701-pages-articles.xml.bz2.
- Run Wiki Extractor on the downloaded dump file, e.g.:
python -m wikiextractor.WikiExtractor alswiki-20230701-pages-articles.xml.bz2 --output ../../Documents/MRL_ST_2023/enwiki-20230420_extracted
There is no need to unzip the file before running Wiki Extractor.
- To convert the data to the required format, run the script wikiExtract2csv.py on the output generated in step 2. Make sure the output file's suffix is .csv.
- To generate the question-answer pairs, run the script questions2QA.py on the output generated in step 2. Again, make sure the output file's suffix is .csv.
- Python 3.9.12
- other dependencies are listed in requirements.txt
python wikiExtract2csv.py --input INPUT_DIR [--output OUTPUT_DIR] [--sample_size SAMPLE_SIZE] [--min MIN_LENGTH] [--split_by SPLIT_BY]
- INPUT_DIR: path to the directory containing the output of Wiki Extractor, i.e. the first folder in the output folder generated by Wiki Extractor, e.g. ../../Documents/enwiki/AA/
- OUTPUT_DIR: path to the directory where the output csv file will be saved, default is ./articles.csv
- --sample_size: number of articles to sample from the input, default is 100
- --min: minimum number of characters an article needs to have to be included in the output, default is 1000
- --split_by: choose whether to split the text of each article into sentences or paragraphs. The default is both, which creates two separate csv output files. To split by sentences, use --split_by sentence; to split by paragraphs, use --split_by paragraph.
To generate sentence samples and paragraph samples:
python wikiExtract2csv.py --input "../../Documents/MRL_ST_2023/enwiki-20230420_extracted/AA/" --output "../../Documents/MRL_ST_2023/enwiki/" --sample_size 100 --min 1000
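For illustration, the splitting step can be approximated as follows (a hedged sketch of the assumed behaviour, not the actual implementation of wikiExtract2csv.py; the real script may use a proper sentence tokenizer):

```python
# Illustrative sketch of the sentence/paragraph splitting (assumed behaviour).
import re

def split_article(text, split_by="both"):
    """Split an article's text into paragraphs and/or sentences."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    # Naive sentence splitter on end-of-sentence punctuation.
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    if split_by == "sentence":
        return {"sentence": sentences}
    if split_by == "paragraph":
        return {"paragraph": paragraphs}
    return {"sentence": sentences, "paragraph": paragraphs}  # default: both
```

Each resulting unit then becomes one row in the corresponding output csv file.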
This is a simple Python script to create question-answer pairs from a Label Studio question project snapshot. It extracts the questions and texts from the snapshot and saves them in a csv file, creating one task per question.
python questions2QA.py --input INPUT_FILE [--output OUTPUT_FILE] [--labels] [--n_tasks N_TASKS]
- INPUT_FILE: json file containing the Label Studio snapshot
- OUTPUT_FILE: path to the output csv file, default is ./article_question_pairs.csv
- --labels: add this flag to include the labels (answers) in the output csv file
- --n_tasks: first n tasks to keep from the input, default is 100
python questions2QA.py --input /Users/dug/Py/wikiExtract2csv/Question_Exports/ID_Questions.json --output /Users/dug/Py/wikiExtract2csv/Answer_Tasks/answer_tasks_ID.csv
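A hedged sketch of this conversion is shown below; the exact JSON layout of a Label Studio export depends on the labeling configuration, so the field names used here ("data", "annotations", "result", "value", "text", "labels") are assumptions rather than the script's actual logic:

```python
# Hedged sketch: turn a Label Studio question-project snapshot (JSON export)
# into one QA task per annotated question. Field names are assumptions.
import csv
import json

def snapshot_to_tasks(snapshot_path, out_csv, n_tasks=100, include_labels=False):
    with open(snapshot_path, encoding="utf-8") as f:
        tasks = json.load(f)[:n_tasks]           # keep only the first n tasks
    header = ["text", "question"] + (["answer"] if include_labels else [])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for task in tasks:
            text = task.get("data", {}).get("text", "")
            for annotation in task.get("annotations", []):
                for result in annotation.get("result", []):
                    value = result.get("value", {})
                    question = " ".join(value.get("text", []))        # assumed key
                    row = [text, question]
                    if include_labels:
                        row.append(" ".join(value.get("labels", []))) # assumed key
                    writer.writerow(row)
```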
Prerequisites: Label Studio answer project snapshot, exported as a csv file.
This script creates clean csv files containing only the text, the question and, optionally, the answer. It eliminates annotator information.
python answers2csv.py --input INPUT_FILE_PATH --output OUTPUT_FOLDER [--labels]
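A minimal sketch of what this cleanup could look like with pandas, assuming the export contains columns named text, question, and answer alongside annotator metadata (the actual column names depend on the project configuration):

```python
# Hedged sketch of the cleanup step; column names are assumptions.
import pandas as pd

def clean_answer_export(in_csv, out_csv, keep_answer=False):
    """Keep only the content columns, dropping annotator metadata."""
    df = pd.read_csv(in_csv)
    columns = ["text", "question"] + (["answer"] if keep_answer else [])
    df[columns].to_csv(out_csv, index=False)
```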
Script to scrape the relevant Wikipedia articles from the urls in the language biography csv files. The script extracts the text from the urls and saves it in a csv file containing the id, url, and text of each article. The script also splits the text into sentences and paragraphs and saves the output in separate csv files.
First create the language biography file by running wikiExtract2csv/NER/get_wikipedia_url_from_wikidata.py on the id_list.csv file. Make sure to change the language codes in the script.
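For orientation, mapping a Wikidata ID to a Wikipedia URL can be done via the public Wikidata API as sketched below (this is an assumption about the approach; get_wikipedia_url_from_wikidata.py may implement it differently):

```python
# Hedged sketch: resolve a Wikidata item ID to its Wikipedia article URL.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikipedia_url(qid, lang="en"):
    """Return the <lang>.wikipedia.org URL for a Wikidata item, or None if missing."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "sitelinks/urls",
        "format": "json",
    }
    data = requests.get(WIKIDATA_API, params=params, timeout=30).json()
    sitelinks = data["entities"][qid].get("sitelinks", {})
    link = sitelinks.get(f"{lang}wiki")
    return link["url"] if link else None
```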
python get_text_from_url.py --input INPUT_FILE_PATH --output OUTPUT_FOLDER [--sample_size SAMPLE_SIZE] [--split_by SPLIT_BY]
- INPUT_FILE_PATH: path to the language biography csv file containing the urls and ids of the articles, e.g. NER/lang_biography/en.csv
- OUTPUT_FOLDER: path to the output folder where the processed text files will be saved
- --split_by: choose whether to split the text of each article into sentences or paragraphs. The default is both, which creates two separate csv output files. To split by sentences, use --split_by sentence; to split by paragraphs, use --split_by paragraph.
python get_text_from_url.py --input NER/lang_biography/en.csv --output Test_Outputs_NER/ --sample_size 10 --split_by sentence
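A hedged sketch of the scraping step (the actual script may use a different library or the Wikipedia API; this version assumes requests and BeautifulSoup, and that the input csv has id and url columns):

```python
# Hedged sketch: fetch article pages and keep their paragraph text.
import csv
import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    """Return the concatenated paragraph text of a Wikipedia article page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.select("div.mw-parser-output p")]
    return "\n\n".join(p for p in paragraphs if p)

def scrape_biographies(in_csv, out_csv):
    """Read (id, url) rows and write (id, url, text) rows."""
    with open(in_csv, encoding="utf-8") as f_in, \
         open(out_csv, "w", newline="", encoding="utf-8") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.writer(f_out)
        writer.writerow(["id", "url", "text"])
        for row in reader:
            writer.writerow([row["id"], row["url"], scrape_article(row["url"])])  # assumed column names
```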
Script to postprocess the conll file exported from a Label Studio NER project. Its main purpose is to remove date tags.
python NER_postprocessing.py --input INPUT_FILE_PATH --output OUTPUT_FILE_PATH
- INPUT_FILE_PATH: path to the conll file exported from Label Studio, e.g. NER/lang_biography/ALS_NER.conll
- OUTPUT_FILE_PATH: path to the file that will contain the postprocessed output
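The core of the postprocessing can be sketched as follows (the tag names B-DATE/I-DATE are assumptions about the project's label set, not necessarily the script's exact logic):

```python
# Hedged sketch: replace date tags with "O" in a conll file, keeping all tokens.
def remove_date_tags(in_path, out_path):
    with open(in_path, encoding="utf-8") as f_in, open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            parts = line.rstrip("\n").split()
            if parts and parts[-1] in ("B-DATE", "I-DATE"):
                parts[-1] = "O"                       # drop the date annotation only
                f_out.write(" ".join(parts) + "\n")
            else:
                f_out.write(line)                     # blank lines and other tags unchanged
```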
Scoring uses conlleval-python from sighsmile/conlleval.
- Clone the sighsmile/conlleval repository and move the conlleval.py file to the current directory.
- Append the predictions to the conll file such that each line has the format token true_label predicted_label, using the following command (see the sketch after this list):
  python combine_conll_file_tags.py --predictions NER_ALS_Test_PREDICTIONS.conll --labels NER_ALS_Test_GOLD.conll --output NER_ALS_Test_combined.conll
- Score the conll file using:
  python conlleval.py < NER_ALS_Test_combined.conll > NER_ALS_Test_Result.txt
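The combination step amounts to the following sketch (assuming the gold and prediction conll files are token-aligned and the tag is the last column of each line; not the actual code of combine_conll_file_tags.py):

```python
# Hedged sketch: merge tokens + gold tags with predicted tags, line by line.
def combine_conll(pred_path, gold_path, out_path):
    with open(gold_path, encoding="utf-8") as f_gold, \
         open(pred_path, encoding="utf-8") as f_pred, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for gold_line, pred_line in zip(f_gold, f_pred):
            gold_parts, pred_parts = gold_line.split(), pred_line.split()
            if not gold_parts:                        # keep sentence boundaries
                f_out.write("\n")
                continue
            token, true_label = gold_parts[0], gold_parts[-1]
            predicted_label = pred_parts[-1]
            f_out.write(f"{token} {true_label} {predicted_label}\n")
```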
All metrics are calculated using implementations from Huggingface Evaluate Metric. The following metrics are used:
- ChrF ('char_order': 6, 'word_order': 0, 'beta': 2)
- ChrF+ ('char_order': 6, 'word_order': 1, 'beta': 2)
- ChrF++ ('char_order': 6, 'word_order': 2, 'beta': 2)
- RougeL (Longest common subsequence based scoring)
- BERTScore F1 using embeddings from RobertaBase ('hashcode': 'roberta-base_L10_no-idf_version=0.3.12(hug_trans=4.34.0)')
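These configurations can be reproduced with Huggingface Evaluate roughly as follows (a sketch, not necessarily the exact code of the evaluation scripts):

```python
# Hedged sketch of the metric configurations using Huggingface Evaluate.
import evaluate

chrf = evaluate.load("chrf")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def score(predictions, references):
    """Compute ChrF/ChrF+/ChrF++, RougeL, and BERTScore F1 for parallel lists of strings."""
    listed_refs = [[ref] for ref in references]   # chrf expects a list of references per prediction
    results = {}
    for name, word_order in [("ChrF", 0), ("ChrF+", 1), ("ChrF++", 2)]:
        results[name] = chrf.compute(
            predictions=predictions, references=listed_refs,
            char_order=6, word_order=word_order, beta=2,
        )["score"]
    results["RougeL"] = rouge.compute(
        predictions=predictions, references=references, rouge_types=["rougeL"],
    )["rougeL"]
    bert = bertscore.compute(
        predictions=predictions, references=references, model_type="roberta-base",
    )
    results["BERTScore_F1"] = sum(bert["f1"]) / len(bert["f1"])
    return results
```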
evaluate_QA.py can be used as follows; it needs to be run separately for every language and system:
python evaluate_QA.py --predictions PRED_FILE --labels GOLD_FILE --results RESULTS_FILE
- GOLD_FILE: the file containing the gold answers
- PRED_FILE: the file containing a system's predicted answers
- RESULTS_FILE: the file to which the average scores per language will be written

Detailed scores (the scores for each paragraph) will be appended as columns to the PRED_FILE.
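The output handling described above can be pictured as follows (a sketch under the assumption that pandas is used and that one score list per metric is available; column names are illustrative):

```python
# Hedged sketch: append per-paragraph scores to the predictions file and write averages.
import pandas as pd

def write_results(pred_file, results_file, per_row_scores):
    """per_row_scores maps a metric name to a list with one score per paragraph."""
    preds = pd.read_csv(pred_file)
    for metric, values in per_row_scores.items():
        preds[metric] = values                    # detailed scores as extra columns
    preds.to_csv(pred_file, index=False)
    averages = {metric: sum(vals) / len(vals) for metric, vals in per_row_scores.items()}
    pd.DataFrame([averages]).to_csv(results_file, index=False)
```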