A language model for Swiss German based on Huggingface/Transformers.
It uses BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding), pre-trained on cased German text by Deepset.ai, whose training data consisted of a German Wikipedia dump (6 GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).

The model is then fine-tuned on the Swiss German data of the Leipzig Corpora Collection and SwissCrawl.

Alternatively, a GPT-2 model can also be trained, but there is no German pre-trained model available for that.
- Python 3
- PyTorch
- Huggingface/Transformers
All dependencies can be installed with pip:

```sh
pip install --user -r requirements.txt
```
On Windows the PyTorch packages may not be available on PyPI, hence you need to point to the official PyTorch registry:

```sh
pip install --user -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
```
If you'd like to use a different installation method or another CUDA version with PyTorch, follow the instructions on PyTorch - Getting Started.
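As a rough sketch, installing PyTorch for a specific CUDA version usually means pointing pip at the corresponding PyTorch package index (the CUDA tag `cu121` below is just an example and an assumption; take the exact command from PyTorch - Getting Started for your setup):

```sh
pip install torch --index-url https://download.pytorch.org/whl/cu121
```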
| Model | Configuration | Datasets used for training | Download |
|---|---|---|---|
| bert-swiss-lm | bert-base | Leipzig and SwissCrawl | Google Drive |
Data for training is essentially raw text files, but since the Leipzig corpus uses a TSV style, that format has been kept. However, instead of the sentences being in the second column (the first one in the Leipzig corpus is the index), they are now in the first column. This means you can add more columns after the first one if your dataset needs additional labels (e.g. the sentiment of the sentence) or any other additional information; these extra columns are ignored during training.
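As an illustration (not actual data from the corpora), a prepared file could look like this, with tab-separated columns, the sentence in the first column and a hypothetical sentiment label in the second that is simply ignored during training:

```
Das isch de erscht Satz.	positive
Das isch de zwöit Satz.	negative
```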
The Leipzig corpus can be converted with `prepare_data.py`:

```sh
python prepare_data.py -i data/leipzig.tsv -o leipzig-converted --split 80
```
`-o`/`--output` is the output directory for all the generated data. If not given, it will be generated to `data/<basename-of-input-file>`, which would be `data/leipzig/` in this example.
`--split` optionally generates a training/validation split (80/20 in this case) in addition to the full data.
You can also generate vocabularies (SentencePiece, WordPiece in BERT's style, and Byte Pair Encoding (BPE)) from the input by supplying the `--vocab` flag, as in the sketch below.
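For example, conversion, splitting and vocabulary generation could be combined in one call (a sketch, assuming `--vocab` is simply added to the options from the example above):

```sh
python prepare_data.py -i data/leipzig.tsv -o leipzig-converted --split 80 --vocab
```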
Similarly, the SwissCrawl data can be prepared by setting the `-t`/`--type` option to `swiss-crawl`. For this preparation a minimum probability of 0.99 is used by default, which can be changed with `-p`/`--p`.

```sh
python prepare_data.py -i data/swiss-crawl.csv -o swiss-crawl-converted --split 80 -t swiss-crawl
```
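To keep only sentences that are more confidently Swiss German, the threshold could be raised (a sketch; the value 0.995 is just an example):

```sh
python prepare_data.py -i data/swiss-crawl.csv -o swiss-crawl-converted --split 80 -t swiss-crawl -p 0.995
```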
Training is done with the `train.py` script:

```sh
python train.py --name some-name -c log/some-name/checkpoints/0022/ --train-text /path/to/text.tsv --validation-text /path/to/text.tsv --fp16
```
The `--name` option gives the experiment a name; otherwise the checkpoints are just numbered without a name. `-c` resumes training from the given checkpoint; if it is not specified, training starts fresh.
Modern GPUs contain Tensor Cores (starting from the V100 and the RTX series), which enable mixed precision training: optimised fp16 operations are used while the weights are still kept in fp32, preserving precision. Mixed precision can be enabled by setting the `--fp16` flag. GPUs without Tensor Cores do not benefit from mixed precision, since they only execute fp32 operations, and training may even become slower.
Different models can be selected with the `-m`/`--model` option, which is either `bert` or `gpt2`, to fine-tune a pre-trained model. The pre-trained model can be changed with the `--pre-trained` option by specifying a model available at Huggingface Transformers - Pretrained Models or by specifying a path to a pre-trained model.
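For example, fine-tuning a GPT-2 model could look like this (a sketch; `gpt2` as the `--pre-trained` value refers to the English GPT-2 from the Huggingface pretrained models, since no German one is available, and the text paths are placeholders):

```sh
python train.py --name gpt2-swiss -m gpt2 --pre-trained gpt2 --train-text /path/to/text.tsv --validation-text /path/to/text.tsv
```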
There's also the possibility to train either of the models from scratch by choosing `bert-scratch` or `gpt-scratch` for the `--model`. The configuration used can still be changed with the `--pre-trained` option, but the pre-trained weights will not be loaded, just the configuration. Additionally, the vocabulary can be changed with `--vocab` (path to the directory of the generated vocabularies) if another vocabulary should be used instead of the pre-trained one, as in the sketch below.
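A from-scratch run with a custom vocabulary might then look like this (a sketch; the vocabulary path is assumed to be the output directory produced by `prepare_data.py` with `--vocab`, and the text paths are placeholders):

```sh
python train.py --name bert-from-scratch -m bert-scratch --vocab leipzig-converted/ --train-text /path/to/text.tsv --validation-text /path/to/text.tsv
```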
For all options see `python train.py --help`.
During training, various types of logs are created with Lavd. Everything can be found in `log/`, grouped by experiment name:
- Summary
- Checkpoints
- Top 5 Checkpoints
- TensorBoard
- Event logs
To visualise the logged data run:

```sh
lavd log/
```