Get plain text from Wikipedia pages, as clean as possible.
Based on the latest versions of the Wikimedia dumps, the idea is to parse the HTML pages and produce the cleanest possible plain text, with Markdown formatting for headers, lists, and tables.
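To give an idea of what this cleaning involves, here is a minimal sketch using BeautifulSoup; it is only an illustration of the approach, not the project's actual implementation, and the tag handling is deliberately simplified.

```python
# Minimal sketch of the kind of HTML -> Markdown-ish cleaning performed.
# This is an illustration only, not the project's actual implementation.
from bs4 import BeautifulSoup

def html_to_plain_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that should never end up in the plain text.
    for tag in soup(["script", "style", "sup"]):
        tag.decompose()
    lines = []
    for el in soup.find_all(["h2", "h3", "p", "li"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name == "h2":
            lines.append("## " + text)
        elif el.name == "h3":
            lines.append("### " + text)
        elif el.name == "li":
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```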
Examples of output can be found in the folder `tests/examples_markdown`:
- Wikipedia pages: with math, with chemical formulas, with several kinds of tables
- Wikisource page
- Wiktionary page
This code was used to generate HuggingFace datasets. Those datasets are intended to be cleaner and more complete than the French subsets of the Wikimedia datasets:
- wikimedia/wikipedia (20231201.fr) is missing the information coming from templates. See the discussion here.
- wikimedia/wikisource (20231201.fr) is an incomplete dump (it only contains 13 million words) and sometimes includes raw HTML code.
git clone git@github.com:OpenLLM-France/wikiplaintext.git
cd wikiplaintext
pip install -r requirements.txt
All the scripts mentioned below are in the subfolder `wikiplaintext`.
The following command will:
- Download the latest version of the Wikipedia dump from the Wikimedia Enterprise HTML dumps
- Extract the ndjson files from the dump
- Extract one HTML file per Wikipedia page (a sketch of these two steps is shown below, after the output description)
- Parse each HTML file to get a clean plain text, and save it in a file
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiki
This will generate plain text files in the subfolder `/path/to/Wikipedia/{YYYYMMDD}/frwiki_txt/frwiki_namespace_0_*`, where `{YYYYMMDD}` is the latest available dump date. There is one file per Wikipedia page, with the page id and title as the filename.
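For reference, each line of an extracted ndjson file is a JSON record describing one page. The following sketch illustrates the ndjson-to-HTML step; the field names and the filename pattern are assumptions, not necessarily the script's actual behaviour.

```python
# Sketch of the ndjson -> one-HTML-file-per-page step. The field names
# ("name", "identifier", "article_body"/"html") are assumptions about the
# Enterprise dump schema, and the filename pattern is only an illustration.
import json
import os

def split_ndjson_to_html(ndjson_path: str, html_dir: str) -> None:
    os.makedirs(html_dir, exist_ok=True)
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            page_id = page["identifier"]
            title = page["name"].replace("/", "_")  # keep the filename safe
            html = page["article_body"]["html"]
            out_path = os.path.join(html_dir, f"{page_id}_{title}.html")
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(html)
```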
To process a specific dump date instead of the latest one, add the --date option:
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiki \
--date 20231201
This will generate plain text files in the subfolder `/path/to/Wikipedia/20231201/frwiki_txt/frwiki_namespace_0_*`.
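Once generated, the plain text files can be read back directly, for example to inspect them or build a corpus. The sketch below assumes the files sit directly under the frwiki_namespace_0_* subfolders; adjust the glob pattern if the actual layout differs.

```python
# Sketch: read back the generated plain text files, e.g. to build a corpus.
# The layout under the frwiki_namespace_0_* subfolders is an assumption;
# adjust the glob pattern to what the script actually produces.
import glob
import os

base = "/path/to/Wikipedia/20231201/frwiki_txt"
for path in sorted(glob.glob(os.path.join(base, "frwiki_namespace_0_*", "*"))):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    print(os.path.basename(path), len(text.split()), "words")
```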
The process can be parallelized by launching the same command several times with the option `--subset {i}/{n}`.
For example, 5 processes can be launched with the following commands:
python dump_wiki_html.py ... --subset 1/5 &
python dump_wiki_html.py ... --subset 2/5 &
python dump_wiki_html.py ... --subset 3/5 &
python dump_wiki_html.py ... --subset 4/5 &
python dump_wiki_html.py ... --subset 5/5 &
We recommend running these commands in separate windows of a tmux session (or a screen session).
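For reference, the sketch below shows one common way such a `--subset {i}/{n}` split can work; it is an assumption about the implementation, meant only to make clear why the five commands above cover the whole dump without overlap.

```python
# Assumption about what --subset {i}/{n} does: process every n-th page,
# starting at offset i-1, so that the n subsets are disjoint and cover
# the whole dump. Not the project's actual code, just the idea.
def select_subset(items, subset: str):
    i, n = (int(x) for x in subset.split("/"))
    return items[i - 1::n]

# The subsets 1/5 ... 5/5 together cover all pages exactly once.
pages = list(range(12))
covered = sorted(p for k in range(1, 6) for p in select_subset(pages, f"{k}/5"))
assert covered == pages
```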
The process is very similar to the one for Wikipedia (see above).
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiktionary
This will generate plain text files in the subfolder `/path/to/Wikipedia/{YYYYMMDD}/frwiktionary_txt/frwiktionary_namespace_0_*`, where `{YYYYMMDD}` is the latest available dump date. There is one file per Wiktionary page, with the page id and title as the filename.
To process a specific dump date instead of the latest one, add the --date option:
python dump_wiki_html.py \
--output_dir /path/to/Wikipedia \
--language fr \
--source wiktionary \
--date 20231201
This will generate plain text files in the subfolder `/path/to/Wikipedia/20231201/frwiktionary_txt/frwiktionary_namespace_0_*`.
For Wikisource, the process is a bit different because the Wikimedia dump is quite incomplete. It consists of the following steps:
- get all the page titles from the latest HuggingFace dataset from Wikimedia
- download the HTML pages from the Wikimedia API (a minimal sketch of this step follows the list)
- parse the HTML pages and get the plain text
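As an illustration of the download step, the sketch below fetches the rendered HTML of a single page through the Wikimedia REST API; the actual script may use a different endpoint or request strategy.

```python
# Sketch of the download step, assuming the standard Wikimedia REST API
# endpoint for rendered HTML. The actual script may use another endpoint
# or add batching, throttling and retries.
import urllib.parse

import requests

def fetch_wikisource_html(title: str, language: str = "fr") -> str:
    url = (
        f"https://{language}.wikisource.org/api/rest_v1/page/html/"
        + urllib.parse.quote(title, safe="")
    )
    response = requests.get(url, headers={"User-Agent": "wikiplaintext-example"})
    response.raise_for_status()
    return response.text
```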
The whole pipeline can be run with the following command:
python dump_wikisource_api.py \
--output_dir /path/to/Wikipedia \
--language fr \
--version 20231201 \
--dump_html
This will generate plain text files in the folder `/path/to/Wikipedia/20231201/frwikisource_txt/frwikisource_namespace_0_0`.
With the option `--dump_html`, it will also dump all the HTML pages in the folder `/path/to/Wikipedia/20231201/frwikisource_html/frwikisource_namespace_0_0`.
This is useful to restart the process later, if the cleaning code evolves, using:
python dump_wikisource_api2.py \
--output_dir /path/to/Wikipedia \
--language fr \
--version 20231201
- Wikimedia
- HuggingFace for hosting the datasets