Get plain text from Wikipedia pages, as clean as possible.
Based on the latest versions of the Wikimedia dumps, the idea is to parse the HTML pages and produce the cleanest possible plain text, with Markdown formatting for headers, lists, and tables.
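To give an idea of what this cleaning involves, here is a minimal sketch using BeautifulSoup; it is only an illustration of the approach, not the project's actual implementation, and the tag handling is deliberately simplified.

```python
# Minimal sketch of the kind of HTML -> Markdown-ish cleaning performed.
# This is an illustration only, not the project's actual implementation.
from bs4 import BeautifulSoup

def html_to_plain_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that should never end up in the plain text.
    for tag in soup(["script", "style", "sup"]):
        tag.decompose()
    lines = []
    for el in soup.find_all(["h2", "h3", "p", "li"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name == "h2":
            lines.append("## " + text)
        elif el.name == "h3":
            lines.append("### " + text)
        elif el.name == "li":
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```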
Examples of output can be found in the folder `tests/examples_markdown`:
- Wikipedia pages: with math, with chemical formulas, with several kinds of tables
- Wikisource page
- Wiktionary page
This code was used to generate HuggingFace datasets. Those datasets are intended to be cleaner and more complete than the French subsets of the Wikimedia datasets:
- wikimedia/wikipedia (20231201.fr) is missing the information coming from templates. See the discussion here.
- wikimedia/wikisource (20231201.fr) is an incomplete dump (it only contains 13 million words) and sometimes includes raw HTML code.
git clone git@github.com:OpenLLM-France/wikiplaintext.git
cd wikiplaintext
pip install -r requirements.txt
All the scripts mentioned below are in the subfolder `wikiplaintext`.
The following command will:
- Download the latest version of the Wikipedia dump from the Wikimedia Enterprise HTML dumps
- Extract the ndjson files from the dump
- Extract one HTML file per Wikipedia page (a sketch of these two steps is shown below, after the output description)
- Parse each HTML file to get a clean plain text, and save it in a file
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiki
This will generate plain text files in the subfolder `/path/to/Wikipedia/{YYYYMMDD}/frwiki_txt/frwiki_namespace_0_*`, where `{YYYYMMDD}` is the latest available dump date. There is one file per Wikipedia page, with the page id and title as the filename.
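For reference, each line of an extracted ndjson file is a JSON record describing one page. The following sketch illustrates the ndjson-to-HTML step; the field names and the filename pattern are assumptions, not necessarily the script's actual behaviour.

```python
# Sketch of the ndjson -> one-HTML-file-per-page step. The field names
# ("name", "identifier", "article_body"/"html") are assumptions about the
# Enterprise dump schema, and the filename pattern is only an illustration.
import json
import os

def split_ndjson_to_html(ndjson_path: str, html_dir: str) -> None:
    os.makedirs(html_dir, exist_ok=True)
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            page_id = page["identifier"]
            title = page["name"].replace("/", "_")  # keep the filename safe
            html = page["article_body"]["html"]
            out_path = os.path.join(html_dir, f"{page_id}_{title}.html")
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(html)
```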
To process a specific dump date instead of the latest one, add the --date option:
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiki \
--date 20231201
This will generate plain text files in the subfolder `/path/to/Wikipedia/20231201/frwiki_txt/frwiki_namespace_0_*`.
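Once generated, the plain text files can be read back directly, for example to inspect them or build a corpus. The sketch below assumes the files sit directly under the frwiki_namespace_0_* subfolders; adjust the glob pattern if the actual layout differs.

```python
# Sketch: read back the generated plain text files, e.g. to build a corpus.
# The layout under the frwiki_namespace_0_* subfolders is an assumption;
# adjust the glob pattern to what the script actually produces.
import glob
import os

base = "/path/to/Wikipedia/20231201/frwiki_txt"
for path in sorted(glob.glob(os.path.join(base, "frwiki_namespace_0_*", "*"))):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    print(os.path.basename(path), len(text.split()), "words")
```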
The process can be parallelized by launching the same command several times with the option `--subset {i}/{n}`.
For example, 5 processes can be launched with the following commands:
python dump_wiki_html.py ... --subset 1/5 &
python dump_wiki_html.py ... --subset 2/5 &
python dump_wiki_html.py ... --subset 3/5 &
python dump_wiki_html.py ... --subset 4/5 &
python dump_wiki_html.py ... --subset 5/5 &
We recommend running these commands in separate windows of a tmux session (or a screen session).
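For reference, the sketch below shows one common way such a `--subset {i}/{n}` split can work; it is an assumption about the implementation, meant only to make clear why the five commands above cover the whole dump without overlap.

```python
# Assumption about what --subset {i}/{n} does: process every n-th page,
# starting at offset i-1, so that the n subsets are disjoint and cover
# the whole dump. Not the project's actual code, just the idea.
def select_subset(items, subset: str):
    i, n = (int(x) for x in subset.split("/"))
    return items[i - 1::n]

# The subsets 1/5 ... 5/5 together cover all pages exactly once.
pages = list(range(12))
covered = sorted(p for k in range(1, 6) for p in select_subset(pages, f"{k}/5"))
assert covered == pages
```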
The process is very similar to the one for Wikipedia (see above).
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiktionary
This will generate plain text files in the subfolder `/path/to/Wikipedia/{YYYYMMDD}/frwiktionary_txt/frwiktionary_namespace_0_*`, where `{YYYYMMDD}` is the latest available dump date. There is one file per Wiktionary page, with the page id and title as the filename.
To process a specific dump date instead of the latest one, add the --date option:
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiktionary \
--date 20231201
This will generate plain text files in the subfolder `/path/to/Wikipedia/20231201/frwiktionary_txt/frwiktionary_namespace_0_*`.
For Wikisource, the process is a bit different because the Wikimedia dump is quite incomplete. It consists of the following steps:
- get all the page titles from the latest HuggingFace dataset from Wikimedia
- download the HTML pages from the Wikimedia API (a minimal sketch of this step follows the list)
- parse the HTML pages and get the plain text
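As an illustration of the download step, the sketch below fetches the rendered HTML of a single page through the Wikimedia REST API; the actual script may use a different endpoint or request strategy.

```python
# Sketch of the download step, assuming the standard Wikimedia REST API
# endpoint for rendered HTML. The actual script may use another endpoint
# or add batching, throttling and retries.
import urllib.parse

import requests

def fetch_wikisource_html(title: str, language: str = "fr") -> str:
    url = (
        f"https://{language}.wikisource.org/api/rest_v1/page/html/"
        + urllib.parse.quote(title, safe="")
    )
    response = requests.get(url, headers={"User-Agent": "wikiplaintext-example"})
    response.raise_for_status()
    return response.text
```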
The whole pipeline can be run with the following command:
python dump_wikisource_api.py \
--output_dir /path/to/Wikipedia \
--language fr \
--version 20231201 \
--dump_html
This will generate plain text files in the folder `/path/to/Wikipedia/20231201/frwikisource_txt/frwikisource_namespace_0_0`.
With the option `--dump_html`, it will also dump all the HTML pages in the folder `/path/to/Wikipedia/20231201/frwikisource_html/frwikisource_namespace_0_0`.
This is useful to restart the process later, if the cleaning code evolves, using:
python dump_wikisource_api2.py \
--output_dir /path/to/Wikipedia \
--language fr \
--version 20231201
- Wikimedia
- HuggingFace for hosting the datasets