
NLP Classification


Photo by Amador Loureiro on Unsplash

About

Text Classification (also called Text Categorization or Document Classification) is the process of analyzing natural language texts and labeling them with a predefined set of categories, which makes large collections of documents easier to manage.

Typical use cases are:

  • Spam Classification
  • Support Ticket Classification
  • Sentiment Analysis
  • Document Labeling

This repository focuses on classification of German texts using state-of-the-art deep learning models.

Datasets
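
The baseline experiments use the 10k German News Articles dataset (10kGNAD); see the Baseline section below.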

German Language Models

The basis for text classification is Transformer language models pre-trained on a corpus of German texts. All of the following German language models are available through the Hugging Face model hub.

  • BERT
  • DistilBERT
  • Electra
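
For illustration, a pre-trained German model can be loaded from the hub as sketched below (bert-base-german-cased is just one example model id, and num_labels=9 is an assumption matching the 10kGNAD dataset used later):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example model id; any of the German models above works the same way.
model_name = "bert-base-german-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels must match the number of classes in the target dataset.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=9)
```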

Experiments

Many different factors influence the performance of an NLP model, ranging from the quality of the training data to the choice of hyperparameters for model tuning. In the following, we first establish a baseline and then run additional experiments to further improve the classification accuracy.

Baseline

Transfer learning with a pre-trained Transformer model, using SimpleTransformers with the default classification head, on the 10k German News Articles dataset.
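
A minimal sketch of this baseline with SimpleTransformers (the file paths, model id, label count, and training epochs are assumptions, not fixed choices):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# SimpleTransformers expects DataFrames with "text" and "labels" columns.
train_df = pd.read_csv("train.csv")  # hypothetical file paths
eval_df = pd.read_csv("test.csv")

# Pre-trained German BERT with the default classification head.
model = ClassificationModel(
    "bert", "bert-base-german-cased",
    num_labels=9,  # assumed: 9 topic classes in 10kGNAD
    args={"num_train_epochs": 3, "overwrite_output_dir": True},
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)
```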

Preparation of Training Data

  • Text Preprocessing (see the sketch after this list)
    • original text - (baseline) no preprocessing
    • lower case - ignore capitalization of words, e.g. at the beginning of a sentence
    • sentence splitting - one sentence per line (spaCy, SoMaJo)
    • removal of special characters - markup, URLs, etc.
    • maximum text length - shorter texts are harder to learn from, but longer texts do not necessarily add extra value
  • Tokenization
    • word splitting - (baseline) just split words and punctuation
    • German umlauts - keep them or use transliteration
    • compound words - keep them or split into parts

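A minimal preprocessing sketch along the lines of the list above (the cleanup regexes, the character limit, and the use of spaCy's rule-based sentencizer are assumptions, not a fixed recipe):

```python
import re
import spacy

# Blank German pipeline with a rule-based sentencizer (no model download needed).
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

def preprocess(text: str, lower: bool = True, max_chars: int = 2000) -> list[str]:
    """Clean a raw text and return it as one sentence per list entry."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)       # remove markup tags
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    if lower:
        text = text.lower()                    # ignore capitalization
    text = text[:max_chars]                    # cap the text length
    return [sent.text for sent in nlp(text).sents]

print(preprocess("Die Straßenbahn fährt. Mehr Infos: https://example.org <b>bald</b>!"))
```
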
Model Training

  • Language Model (LM) Training
    • use pre-trained LM - (baseline) BERT, DistilBERT, RoBERTa, Electra, ...
    • language-specific vs. multilingual - the latter are larger
  • Domain Adaptation of Language Model
    • refine the LM with task-specific data (optional; see the first sketch after this list)
  • Downstream task training
    • class imbalance - do nothing, oversampling, class weights (see the second sketch after this list)
    • model head config
      • layers
      • dropout
    • model training
      • batch size
      • learning rate
      • iterations/steps
    • cross validation
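
The optional domain-adaptation step could be sketched with SimpleTransformers' language-modeling wrapper as follows (the corpus file name and training args are assumptions; the file is expected to be plain text with one document per line):

```python
from simpletransformers.language_modeling import LanguageModelingModel

# Continue masked-LM pre-training on in-domain text before the classification step.
lm = LanguageModelingModel(
    "bert", "bert-base-german-cased",  # example model id
    args={"num_train_epochs": 1, "output_dir": "lm_adapted/", "overwrite_output_dir": True},
)
lm.train_model("domain_corpus.txt")  # hypothetical in-domain corpus

# The adapted checkpoint in lm_adapted/ can then be used as the base model
# for the downstream ClassificationModel.
```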
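
For the class-imbalance and hyperparameter choices, a hedged sketch using per-class loss weights (the weight computation via scikit-learn and the concrete hyperparameter values are assumptions; SimpleTransformers accepts a weight list for a weighted loss):

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight
from simpletransformers.classification import ClassificationModel

train_df = pd.read_csv("train.csv")  # same hypothetical file as in the baseline sketch
labels = train_df["labels"].values

# "balanced" weights counteract class imbalance in the loss function.
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(labels), y=labels
)

model = ClassificationModel(
    "bert", "bert-base-german-cased",
    num_labels=len(np.unique(labels)),
    weight=class_weights.tolist(),  # per-class loss weights
    args={
        "train_batch_size": 16,  # example hyperparameters
        "learning_rate": 2e-5,
        "num_train_epochs": 3,
    },
)
model.train_model(train_df)
```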
