DOM-aware tokenization for Hugging Face language models.
Input:
<html>
 <head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <meta name="viewport" content="width=device-width">
  <title>Hello world</title>
  <script>
   document.getElementById("demo").innerHTML = "Hello JavaScript!";
  </script>
  ...
Output:
[CLS] html head meta http equiv content type content text html charset
utf 8 meta name viewport content width device width title hello world
script document get element by id demo inner html hello java script ...
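The splitting visible above can be approximated with ordinary word segmentation: lowercase everything and break on punctuation, digits and camelCase boundaries. The Python sketch below illustrates that behaviour; it is only an approximation of the example, not the pre-tokenizer dom-tokenizers actually uses:

import re

# Rough approximation of the splitting shown above: break on case
# changes, digits and non-alphanumerics, then lowercase. Not the real
# dom-tokenizers pre-tokenizer.
WORDS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def rough_split(text: str) -> list[str]:
    return [w.lower() for w in WORDS.findall(text)]

print(rough_split('document.getElementById("demo").innerHTML = "Hello JavaScript!";'))
# ['document', 'get', 'element', 'by', 'id', 'demo', 'inner', 'html',
#  'hello', 'java', 'script']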
Install from PyPI:
pip install dom-tokenizers[train]
Or install from source:
git clone https://github.com/gbenson/dom-tokenizers.git
cd dom-tokenizers
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e .[dev,train]
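To confirm the editable install is visible to Python, query the installed distribution's version; the distribution name comes from the pip commands above:

from importlib.metadata import version

# Distribution name as used in the pip commands above.
print(version("dom-tokenizers"))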
Check everything's working using a small dataset of around 300 examples:
train-tokenizer gbenson/interesting-dom-snapshots
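The same snapshots can be inspected with the datasets library before training. The dataset name comes from the command above; the "train" split name is an assumption, so check the dataset card if it differs:

from datasets import load_dataset

# Dataset name taken from the command above; the split name is an
# assumption -- consult the dataset card if it differs.
snapshots = load_dataset("gbenson/interesting-dom-snapshots", split="train")
print(snapshots)           # number of rows and column names
print(snapshots.features)  # schema of a single DOM snapshot record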
Train a tokenizer with a 10,000-token vocabulary using a dataset of 4,536 examples and upload it to the Hub:
train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
huggingface-cli login
huggingface-cli upload dom-tokenizer-10k
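The upload can also be done from Python via huggingface_hub, which huggingface-cli wraps. This sketch assumes the trained tokenizer was written to a local dom-tokenizer-10k directory, as the CLI example above suggests; "your-username" is a placeholder:

from huggingface_hub import HfApi

# Programmatic equivalent of the huggingface-cli commands above;
# requires a prior `huggingface-cli login` (or a token passed to HfApi).
# "your-username" is a placeholder, and the folder path assumes the
# trained tokenizer was saved to ./dom-tokenizer-10k.
api = HfApi()
api.create_repo("your-username/dom-tokenizer-10k", exist_ok=True)
api.upload_folder(
    folder_path="dom-tokenizer-10k",
    repo_id="your-username/dom-tokenizer-10k",
)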