🌟 ArXiv Preprint
Make the largest monolingual datasets for low-resource languages from all of Common Crawl! Data Collection, Benchmarking, and Fine-tuning.
- UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
- Download monolingual datasets from Common Crawl. See the section How to download the data from CC
- Deduplicate the dataset using exact substring matching. See Deduplicate Data
- Benchmark the original pre-trained models, and fine-tune them using the crawled data
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc
conda create -y --name data_env python==3.10.12
conda activate data_env
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
# install gcc compiler
sudo apt update && sudo apt install -y gcc unzip
pip install -r setup/requirements_data.txt -r setup/requirements.txt
# Alternatively, to exactly mirror our dependency environment, use the command below -
# pip install --extra-index-url https://download.pytorch.org/whl/cu118 -r setup/requirements_full.txt
# install duckdb
mkdir ~/opt
cd ~/opt
wget https://github.com/duckdb/duckdb/releases/download/v0.8.1/duckdb_cli-linux-amd64.zip
unzip ./duckdb_cli-linux-amd64.zip
~/opt/duckdb -c "INSTALL httpfs"
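duckdb with the httpfs extension lets you query Common Crawl's columnar URL index directly on S3. As a rough illustration of the kind of query the download scripts perform (this sketch uses the duckdb Python package rather than the CLI installed above; the index path, column names, and output file name are assumptions, and the project's actual scripts may differ):

```python
# Illustration only: filter Common Crawl's columnar URL index for one language.
# The project's actual filtering is done by download_and_filter_warc_index.sh;
# the index path and column names below are assumptions.
import os
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
con.sql("SET s3_region='us-east-1'")
con.sql(f"SET s3_access_key_id='{os.environ['S3_ACCESS_KEY']}'")
con.sql(f"SET s3_secret_access_key='{os.environ['S3_SECRET_ACCESS_KEY']}'")

crawl = os.environ.get("CC_CRAWL_VERSION", "CC-MAIN-2023-06")
language = "amh"  # ISO-639-3 code as used in the content_languages column

os.makedirs(f"datasets/{crawl}", exist_ok=True)
# Keep only index rows whose detected content language matches the target,
# and record where each page lives inside the WARC files.
con.sql(f"""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl={crawl}/subset=warc/*.parquet')
    WHERE content_languages LIKE '%{language}%'
""").write_csv(f"datasets/{crawl}/{language}_warc_index.csv")
```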
Run the following in the terminal. Set the environment variables CC_CRAWL_VERSION, S3_ACCESS_KEY, and S3_SECRET_ACCESS_KEY before running!
CC_CRAWL_VERSION="CC-MAIN-2023-06" ./download_data/download_and_filter_warc_index.sh 2>&1 | tee datasets/errors.txt
The above command sometimes fails with errors, so re-run the following retry script as many times as needed -
CC_CRAWL_VERSION="CC-MAIN-2023-06" ./download_data/download_and_filter_warc_index_retry.sh 2>&1 | tee datasets/errors.txt
CC_CRAWL_VERSION="CC-MAIN-2023-06" ./download_data/download_and_filter_warc_index_retry.sh 2>&1 | tee datasets/errors.txt
If your language has many URLs that were crawled repeatedly, you may benefit from removing duplicate URLs in the WARC index files across all dumps. This can be done with the pandas library, for example, as in the sketch below.
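A minimal sketch with pandas, assuming the filtered index for each crawl was saved as a CSV (the file names used here are hypothetical; adapt them to wherever your filtered index files live):

```python
# Minimal sketch: drop duplicate URLs across all dumps with pandas.
# The per-crawl index file name used here is hypothetical.
import glob
import pandas as pd

index_files = sorted(glob.glob("datasets/CC-MAIN-*/amh_warc_index.csv"))
combined = pd.concat((pd.read_csv(f) for f in index_files), ignore_index=True)

# Keep the first occurrence of every URL across all crawls.
deduped = combined.drop_duplicates(subset="url", keep="first")
print(f"Removed {len(combined) - len(deduped)} duplicate URL records")
deduped.to_csv("datasets/amh_warc_index_unique_urls.csv", index=False)
```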
python ./download_data/download_and_extract_text_from_warc.py --cc-crawl-version=CC-MAIN-2023-06
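This script fetches the WARC records listed in the filtered index and extracts the page text. As a rough illustration of fetching a single record (not the project's actual implementation, which also handles batching, retries, and clean text extraction), the warcio library can read a record from an HTTP range request:

```python
# Illustration only: fetch one WARC response record by byte range and return
# its raw HTML payload. download_and_extract_text_from_warc.py is the real
# implementation and additionally extracts clean text from the HTML.
import io
import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_record_html(warc_filename, offset, length):
    url = f"https://data.commoncrawl.org/{warc_filename}"
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read().decode("utf-8", errors="ignore")
    return None
```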
curl --proto '=https' --tlsv1.3 https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
# check if rust is installed successfully
rustc --version
Note - Increase/decrease the variable jobs_at_once in the file make_suffix_array.py to increase/decrease the number of parallel jobs based on your CPU cores. Decreasing the number of parallel jobs may also help reduce RAM usage.
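For example (the variable's default value and exact location inside the upstream deduplicate-text-datasets code may differ):

```python
# In make_suffix_array.py: lower this to reduce RAM usage,
# raise it to use more CPU cores.
jobs_at_once = 16
```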
cd ./deduplicate_data/deduplicate-text-datasets
cargo build
First, combine the files of each crawl into one file. The following command does this for all crawls.
python deduplicate_data/combine_single_dump.py
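A rough sketch of what this combining step amounts to (the shard directory layout is an assumption; combine_single_dump.py is the actual implementation):

```python
# Rough sketch: concatenate the per-WARC JSONL shards of one crawl into the
# single combined file expected by the deduplication step. The shard
# directory name is hypothetical.
import glob

crawl_dir = "datasets/CC-MAIN-2023-06"
shards = sorted(glob.glob(f"{crawl_dir}/amh_txt_collections_hf/*.jsonl"))

with open(f"{crawl_dir}/amh_txt_collections_hf_combined.jsonl", "w", encoding="utf-8") as out:
    for shard in shards:
        with open(shard, encoding="utf-8") as f:
            for line in f:  # each line is one JSON document with a "text" field
                out.write(line)
```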
cd deduplicate_data/text-dedup
# remove any previous deduplicated files
rm -rf ./output && rm -rf ../deduplicate-text-datasets/.cache/ ../deduplicate-text-datasets/output/ ../deduplicate-text-datasets/tmp
# "(When running on larger files, if you get an error that you have too many open files, that's because this script opens lots of files. You should run ulimit -Sn 1000000 to "fix" the error. You might want to do this preemptively before hitting this crash after hour ten of the job.)"
ulimit -Sn 1000000
# To de-duplicate a single-crawl
CC_CRAWL="CC-MAIN-2023-06"
LANGUAGE="amh"
python -m text_dedup.suffix_array \
--path "json" \
--data_files "../../datasets/$CC_CRAWL/amh_txt_collections_hf_combined.jsonl" \
--output "../../datasets/$CC_CRAWL/${LANGUAGE}_dedup" \
--split 'train' \
--column 'text' \
--google_repo_path "../deduplicate-text-datasets" \
--local \
--batch_size 10000 \
--k 50
We can use a simple bash for loop to run the above for all the crawls using -
LANGUAGE="amh"
for i in ../../datasets/CC-MAIN* ; do
echo $i
CC_CRAWL=`basename $i`
echo $CC_CRAWL
rm -rf ./output && rm -rf ../deduplicate-text-datasets/.cache/ ../deduplicate-text-datasets/output/ ../deduplicate-text-datasets/tmp
python -m text_dedup.suffix_array \
--path "json" \
--data_files "../../datasets/$CC_CRAWL/${LANGUAGE}_txt_collections_hf_combined.jsonl" \
--output "../../datasets/$CC_CRAWL/amh_dedup" \
--split 'train' \
--column 'text' \
--google_repo_path "../deduplicate-text-datasets" \
--local \
--batch_size 10000 \
--k 50 ;
done
We deduplicate all the crawls separately first because the original code requires much more time and memory when there is a very large number of duplicates (see the related issue discussion in the deduplicate-text-datasets repository).
We can then run the command below to deduplicate across crawls, starting from the already-deduplicated single crawls -
rm -rf ./output && rm -rf ../deduplicate-text-datasets/.cache/ ../deduplicate-text-datasets/output/ ../deduplicate-text-datasets/tmp
python -m text_dedup.suffix_array \
--path "arrow" \
--data_files "../../datasets/*/${LANGUAGE}_dedup/*.arrow" \
--output "../../datasets/${LANGUAGE}_dedup" \
--split 'train' \
--column 'text' \
--google_repo_path "../deduplicate-text-datasets" \
--local \
--batch_size 10000 \
--k 50
After deduplication, some documents become very short, so we remove documents that are shorter than 100 characters. Change the LANGUAGE variable at the top of the file.
python deduplicate_data/remove_short_docs.py
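A minimal sketch of the idea behind remove_short_docs.py, assuming the cross-crawl deduplicated output is stored as Arrow files as above (the paths and output location are assumptions):

```python
# Minimal sketch: keep only documents with at least 100 characters of text.
# Paths are assumptions; remove_short_docs.py is the actual implementation.
from datasets import load_dataset

LANGUAGE = "amh"
dataset = load_dataset("arrow", data_files=f"datasets/{LANGUAGE}_dedup/*.arrow", split="train")
filtered = dataset.filter(lambda doc: len(doc["text"]) >= 100)
print(f"Kept {len(filtered)} of {len(dataset)} documents")
filtered.save_to_disk(f"datasets/{LANGUAGE}_dedup_min100")
```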
While we present instructions for Amharic, the same methods can be used for any language.
- Change the values in the file evaluate_model/hyperparameters.py
- Run the model using
python evaluate_model/run_model.py
While we present instructions for Amharic, the same method can be used for any language.
Run the script below to fine-tune. Change the necessary variables first.
# rm -rf ./finetune_model/output_dir/facebook/xglm-4.5B/
./finetune_model/run_finetune.sh
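The paper adapts the pre-trained models on the crawled data using efficient adapter methods. The sketch below is a rough, hedged illustration of LoRA fine-tuning with Hugging Face transformers and peft, not the project's actual run_finetune.sh; the dataset path, target modules, and all hyperparameters are assumptions:

```python
# Rough, hedged sketch of adapter-based (LoRA) fine-tuning on the deduplicated
# data. The project's actual training is driven by finetune_model/run_finetune.sh
# and its hyperparameters, which may differ from everything shown below.
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "facebook/xglm-4.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Train small LoRA adapters instead of all model weights; the target module
# names assume XGLM's attention projection layers.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                                         target_modules=["q_proj", "v_proj"]))

# Hypothetical path from the short-document filtering sketch above.
dataset = load_from_disk("datasets/amh_dedup_min100")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune_model/output_dir/facebook/xglm-4.5B",
                           per_device_train_batch_size=1, gradient_accumulation_steps=16,
                           num_train_epochs=1, logging_steps=50),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```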
If you have any questions about the code or the paper, feel free to email Bethel Melesse at the email address provided in the manuscript. If you encounter any problems when using the code, please open an issue!
Please cite our paper if you find the repo helpful in your work:
@article{tessema2024unifiedcrawl,
author = {Bethel Melesse Tessema and
Akhil Kedia and
Tae-Sun Chung},
title = {UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages},
journal = {CoRR},
volume = {abs/2411.14343},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2411.14343},
doi = {10.48550/ARXIV.2411.14343},
eprinttype = {arXiv},
eprint = {2411.14343}
}