
# Goethe

Bringing word2vec to the German language.

## Getting started with the Leipzig Corpus

The Leipzig Corpora Collection contains sentences from news articles and Wikipedia for each year from 1995 to 2015.

### Download Leipzig Corpus

To download all news and Wikipedia corpora for all years, run:

```python
>>> from goethe.utils import leipzig_corpora_downloader
>>> leipzig_corpora_downloader.download_corpora_news()
>>> leipzig_corpora_downloader.download_corpora_wiki()
```

### Import Leipzig Corpus

The Leipzig Corpora Collection is a quick way to start training models for the German language. You can load a corpus and iterate over its sentences with the following code:

```python
from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')
```

Assuming that you have a file structure like this:

```
path/containing/corpora/
    deu_news_2015_3M/
        deu_news_2015_3M-sentences.txt
        ...
    deu_wikipedia_2014_3M/
        deu_wikipedia_2014_3M-sentences.txt
        ...
    ...
```
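
Since `LeipzigCorpus` is iterable, you can inspect the data directly. A minimal sketch, assuming the corpus yields one tokenized sentence (a list of strings) per iteration, which is the format word2vec-style trainers expect:

```python
from itertools import islice

from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')

# Print the first three sentences without loading the whole corpus.
for sentence in islice(sentences, 3):
    print(sentence)
```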

## Model building

You can train models using gensim:

```python
import gensim

from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')
model = gensim.models.Word2Vec(sentences)
```
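
`Word2Vec` takes a number of hyperparameters worth tuning. A sketch with commonly adjusted ones (argument names as in gensim before 4.0, where the vector dimensionality is called `size`; gensim 4.x renamed it to `vector_size`):

```python
import gensim

from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')

model = gensim.models.Word2Vec(
    sentences,
    size=100,      # dimensionality of the word vectors (vector_size in gensim >= 4.0)
    window=5,      # maximum distance between the current and predicted word
    min_count=5,   # ignore words that occur less often than this
    workers=4,     # number of training threads
)

# Persist the model; reload later with gensim.models.Word2Vec.load()
model.save('german_word2vec.model')
```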

## Evaluation

A trained model can be queried for semantic similarity. For example, to ask "Obama is to the USA what Putin is to X", we pass `Putin` and `USA` as positive examples and `Obama` as a negative example; the model then returns the words closest to the resulting vector:

```python
>>> model.most_similar(['Putin', 'USA'], ['Obama'], topn=3)
[('Russland', 0.7132166028022766),
 ('USA,', 0.7057479619979858),
 ('China', 0.6795132160186768)]
```
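
`most_similar` with positive and negative examples is gensim's analogy query; the same model also answers simpler lookups (outputs omitted, since they depend on the trained model):

```python
# Cosine similarity between two word vectors.
print(model.similarity('Berlin', 'Hamburg'))

# The five nearest neighbours of a single word.
print(model.most_similar('Berlin', topn=5))
```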

To test the model on multiple such queries, you can use the `model_accuracy` function:

```python
>>> from goethe.evaluation import model_accuracy
>>> model_accuracy(model, 'evaluation/bestmatch-questions.txt', topn=5)
[('Land-Währung', 0.5238095238095238),
 ('Hauptstad-Land', 0.47619047619047616),
 ('Land-Kontinent', 0.34615384615384615),
 ('Land-Sprache', 0.15384615384615385),
 ('Politik', 0.0),
 ('Technik', 0.6666666666666666),
 ('Geschlecht', 0.5220588235294118)]
```

The resulting list contains a tuple for each section with its name and accuracy. The accuracy is the fraction of a section's 4-tuples for which the `topn` words returned by `most_similar` contained the correct word.
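
The question file itself is not shown here; assuming it follows the common word2vec analogy format (which the section names above suggest), it would look roughly like this, with a `: section` header followed by one 4-tuple per line:

```
: Land-Währung
Deutschland Euro Japan Yen
Russland Rubel Schweiz Franken

: Land-Sprache
Frankreich Französisch Italien Italienisch
```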