Backend: fastText

The fasttext backend implements a text classification algorithm based on word embeddings and machine learning. It is a wrapper around the fastText library created by Facebook Research. The model resembles a feed-forward neural network with one hidden layer, though some shortcuts are used for computational efficiency.

The quality of results can be very good, but many parameters have to be selected to get optimal results. Training can be computationally intensive; by default it can train using up to 12 CPU cores in parallel.

Installation

See Optional features and dependencies

Example configuration

This is a simple configuration that creates a relatively small model (1.3GB when trained on the yso-finna-fi dataset):

[fasttext-en]
name=fastText English
language=en
backend=fasttext
analyzer=snowball(english)
dim=100
lr=0.25
epoch=5
loss=hs
limit=100
chunksize=24
vocab=yso-en

This is a more advanced configuration that creates a larger (3.6GB when trained on the yso-finna-fi dataset), but more accurate model which also takes much longer to train:

[yso-fasttext-fi]
name=YSO fastText Finnish
language=fi
backend=fasttext
analyzer=snowball(finnish)
dim=430
lr=0.74
epoch=75
minn=4
maxn=7
minCount=3
loss=hs
limit=1000
chunksize=24
vocab=yso-fi

Backend-specific parameters

With the exception of chunksize and limit, all the parameters are passed directly to the fastText algorithm. If you omit a parameter, a default value is used. You can check out the fastText documentation about options for more details about the parameters.

The backend processes longer documents in chunks: the document is represented as a list of sentences, and that list is turned into chunks. With a chunksize of 24 (as above), each chunk is made of 24 sentences, except for the last chunk which may be shorter. Each chunk is analyzed separately and the results are averaged. Setting chunksize to a high value such as 10000 will in practice disable chunking.

The most important parameters are:

Parameter	Description
chunksize	How many sentences per chunk
limit	Maximum number of results to return
dim	Dimensionality of word vectors (i.e. hidden layer size)
lr	Learning rate
epoch	How many passes over training data to perform
loss	Loss function: `ns`, `hs` or `softmax`. `hs` is much faster than the others.
minn	Lower limit of character n-gram length
maxn	Upper limit of character n-gram length
minCount	Minimum word (or n-gram) frequency to include it in the model

Retraining with cached training data

Preprocessing the training data can take a significant portion of the training time. If you want to experiment with different parameter settings, you can reuse the preprocessed training data by using the --cached option - see Reusing preprocessed training data. Only the analyzer and vocab settings affect the preprocessing; you can use the --cached option as long as you haven't changed these parameters.

Usage

Load a vocabulary:

annif loadvoc fasttext-en /path/to/Annif-corpora/vocab/yso-en.tsv

Train the model:

annif train fasttext-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz

Retrain the model, reusing already preprocessed training data:

annif train fasttext-en --cached

Test the model with a single document:

cat document.txt | annif suggest fasttext-en

Evaluate a directory full of files in fulltext document corpus format:

annif eval fasttext-en /path/to/documents/

Provide feedback

Saved searches