
# Goethe

Bringing word2vec to the German language.

## Getting started with the Leipzig Corpus

The Leipzig Corpora Collection contains sentences from news articles and Wikipedia for each year from 1995 to 2015.

### Download Leipzig Corpus

To download all news and Wikipedia corpora for all years, run:

```python
>>> from goethe.utils import leipzig_corpora_downloader
>>> leipzig_corpora_downloader.download_corpora_news()
>>> leipzig_corpora_downloader.download_corpora_wiki()
```

### Import Leipzig Corpus

The Leipzig Corpora Collection is a quick way to start training models for the German language. You can load a corpus and iterate over its sentences with the following code:

```python
from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')
```

Assuming that you have a file structure like this:

```
path/containing/corpora/
    deu_news_2015_3M/
        deu_news_2015_3M-sentences.txt
        ...
    deu_wikipedia_2014_3M/
        deu_wikipedia_2014_3M-sentences.txt
        ...
    ...
```
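
Since `LeipzigCorpus` is iterable, you can inspect the data directly. A minimal sketch, assuming the corpus yields one tokenized sentence (a list of strings) per iteration, which is the format word2vec-style trainers expect:

```python
from itertools import islice

from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')

# Print the first three sentences without loading the whole corpus.
for sentence in islice(sentences, 3):
    print(sentence)
```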

## Model building

You can train models using gensim:

```python
import gensim

from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')
model = gensim.models.Word2Vec(sentences)
```
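
`Word2Vec` takes a number of hyperparameters worth tuning. A sketch with commonly adjusted ones (argument names as in gensim before 4.0, where the vector dimensionality is called `size`; gensim 4.x renamed it to `vector_size`):

```python
import gensim

from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')

model = gensim.models.Word2Vec(
    sentences,
    size=100,      # dimensionality of the word vectors (vector_size in gensim >= 4.0)
    window=5,      # maximum distance between the current and predicted word
    min_count=5,   # ignore words that occur less often than this
    workers=4,     # number of training threads
)

# Persist the model; reload later with gensim.models.Word2Vec.load()
model.save('german_word2vec.model')
```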

## Evaluation

A trained model can be queried for semantic similarity. For example, to ask "Obama is to the USA what Putin is to X", we pass `Putin` and `USA` as positive examples and `Obama` as a negative example; the model then returns the words closest to the resulting vector:

```python
>>> model.most_similar(['Putin', 'USA'], ['Obama'], topn=3)
[('Russland', 0.7132166028022766),
 ('USA,', 0.7057479619979858),
 ('China', 0.6795132160186768)]
```
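
`most_similar` with positive and negative examples is gensim's analogy query; the same model also answers simpler lookups (outputs omitted, since they depend on the trained model):

```python
# Cosine similarity between two word vectors.
print(model.similarity('Berlin', 'Hamburg'))

# The five nearest neighbours of a single word.
print(model.most_similar('Berlin', topn=5))
```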

To test the model on multiple such queries, you can use the `model_accuracy` function:

```python
>>> from goethe.evaluation import model_accuracy
>>> model_accuracy(model, 'evaluation/bestmatch-questions.txt', topn=5)
[('Land-Währung', 0.5238095238095238),
 ('Hauptstad-Land', 0.47619047619047616),
 ('Land-Kontinent', 0.34615384615384615),
 ('Land-Sprache', 0.15384615384615385),
 ('Politik', 0.0),
 ('Technik', 0.6666666666666666),
 ('Geschlecht', 0.5220588235294118)]
```

The resulting list contains a tuple for each section with its name and accuracy. The accuracy is the fraction of a section's 4-tuples for which the `topn` words returned by `most_similar` contained the correct word.
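
The question file itself is not shown here; assuming it follows the common word2vec analogy format (which the section names above suggest), it would look roughly like this, with a `: section` header followed by one 4-tuple per line:

```
: Land-Währung
Deutschland Euro Japan Yen
Russland Rubel Schweiz Franken

: Land-Sprache
Frankreich Französisch Italien Italienisch
```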