Which tokenizer does Flair use? #394

Deep1994 · 2019-01-16T03:01:18Z

Hi, I want to tokenize my own text the same way Flair does. Which tokenizer does Flair use? My sentences vary in length: do Flair embeddings treat any special token as a padding token? If not, how should I handle variable-length sentences when using Flair? Thanks!

stefan-it · 2019-01-16T09:04:21Z

flair uses the segtok library for tokenization :)

Here you can find an example of how to use it:

https://github.com/zalandoresearch/flair/blob/390cdc51c1aaa992d776622ba28286efd962e883/resources/docs/TUTORIAL_2_TAGGING.md#tagging-a-list-of-sentences

You could use it like:

text = "This is a sentence. This is another sentence. I love Berlin."

# use a library to split into sentences
from segtok.segmenter import split_single
sentences = [sent for sent in split_single(text)]

stefan-it · 2019-01-16T09:36:02Z

A padding example can be found here :)
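
To illustrate what padding involves for variable-length sentences, here is a minimal plain-Python sketch: shorter sentences are filled with zero vectors so that every sentence in a batch has the same length. This is only an illustration of the idea, not Flair's internal code; `pad_batch` and the toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def pad_batch(batch: List[List[Vector]], dim: int) -> List[List[Vector]]:
    """Pad every sentence (a list of token vectors) to the longest length in the batch."""
    max_len = max(len(sentence) for sentence in batch)
    return [
        sentence + [[0.0] * dim for _ in range(max_len - len(sentence))]
        for sentence in batch
    ]


# Two toy "sentences" with 3 and 1 token vectors of dimension 2 (made-up values)
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
]
padded = pad_batch(batch, dim=2)

# Keep the true lengths so the padded positions can be ignored later
lengths = [len(sentence) for sentence in batch]
```

Keeping `lengths` alongside the padded batch is what lets a model skip the zero vectors instead of treating them as real tokens.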

Deep1994 · 2019-01-16T10:43:23Z

Thank you. What about word tokenization? Also, do you know which combination of word embeddings works best? One problem is that to combine Flair embeddings and BERT embeddings I must use the same word tokenizer, but BERT has its own. Have you run any experiments on different ways of combining word embeddings, and if so, which one performs better?

alanakbik · 2019-01-16T10:52:26Z

If you pass use_tokenizer=True, the segtok library is used to tokenize the sentence. So initialize your sentence like this:

from flair.data import Sentence

# Make a sentence object by passing an untokenized string and the 'use_tokenizer' flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

# Print the object to see what's in there
print(sentence)

As @stefan-it wrote, padding is already taken care of when you use our embedding classes, so you need not worry about special padding tokens.

However, I should caution that segtok was built with English in mind. If you want to use a different tokenizer, it may be better to first tokenize your text with that tokenizer and write out whitespace-tokenized strings. Then read those in without use_tokenizer=True.

Hope this helps!
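
The pre-tokenization route described above could look roughly like the sketch below. The regex-based `custom_tokenize` is a hypothetical stand-in for whatever external tokenizer you prefer, and `to_whitespace_tokenized` is likewise an illustrative helper, not part of Flair.

```python
import re
from typing import List


def custom_tokenize(text: str) -> List[str]:
    """Stand-in for an external tokenizer: separates words from punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_whitespace_tokenized(text: str) -> str:
    """Join tokens with single spaces so a plain whitespace split recovers them."""
    return " ".join(custom_tokenize(text))


line = to_whitespace_tokenized("Das Gras ist grün.")
print(line)  # Das Gras ist grün .

# With Flair installed, the line can then be read in without use_tokenizer=True,
# as described above, so that it is split on whitespace only:
# from flair.data import Sentence
# sentence = Sentence(line)
```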

stefan-it · 2019-01-16T11:04:44Z

@alanakbik Do you think it is worth adding support for different tokenization libraries? E.g. it seems that everyone is using spacy now (here's an overview of available models/languages). But I guess spacy is computationally not as cheap as segtok 🤔

alanakbik · 2019-01-16T11:10:43Z

Yes it would be good to add support for different tokenizers. I was thinking of a unified interface for all tokenizers so that users can easily switch them out.

I am a bit hesitant about adding spacy since -- like allennlp -- it is a massive library with lots of subdependencies that would get installed by default. Since we try to keep Flair lightweight, we generally don't want to add too many dependencies. So I am not sure what the best way forward is on tokenization.
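
A unified interface of the kind mentioned here might be sketched as follows. All class and function names are hypothetical and chosen for illustration; this is not the interface Flair ships, and `RegexTokenizer` merely stands in for a real library such as segtok.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List


class Tokenizer(ABC):
    """Hypothetical common interface: every tokenizer maps a string to a list of tokens."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...


class WhitespaceTokenizer(Tokenizer):
    def tokenize(self, text: str) -> List[str]:
        return text.split()


class RegexTokenizer(Tokenizer):
    """Separates punctuation from words; a stand-in for a real tokenization library."""

    def tokenize(self, text: str) -> List[str]:
        return re.findall(r"\w+|[^\w\s]", text)


# Registry so callers can switch tokenizers by name
TOKENIZERS: Dict[str, Tokenizer] = {
    "whitespace": WhitespaceTokenizer(),
    "regex": RegexTokenizer(),
}


def tokenize(text: str, tokenizer: str = "whitespace") -> List[str]:
    return TOKENIZERS[tokenizer].tokenize(text)


print(tokenize("The grass is green."))           # ['The', 'grass', 'is', 'green.']
print(tokenize("The grass is green.", "regex"))  # ['The', 'grass', 'is', 'green', '.']
```

Because callers depend only on `tokenize`, a heavier backend could be registered later without touching the calling code.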

stefan-it · 2019-01-21T01:02:25Z

I would really like to have this kind of interface. I think we could make all dependencies like spacy optional (using try/except statements, as is done with the allennlp library when using the ELMo embeddings).

I currently use the BERT tokenizer code for simple tokenization :)
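
The optional-dependency idea could be sketched like this: spacy is imported only when a tokenizer is requested, and a whitespace fallback is used if it is not installed. `load_tokenizer` is a hypothetical helper, and the fallback behaviour is an assumption made for illustration, not how Flair handles it.

```python
from typing import Callable, List


def load_tokenizer() -> Callable[[str], List[str]]:
    """Return a spaCy-based tokenizer if spaCy is installed, else a whitespace fallback."""
    try:
        import spacy  # optional dependency, imported only when needed
    except ImportError:
        # spaCy is missing: fall back to a plain whitespace split
        return lambda text: text.split()

    nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no model download
    return lambda text: [token.text for token in nlp(text)]


tokenize = load_tokenizer()
tokens = tokenize("I love Berlin.")
print(tokens)
```

The exact tokens depend on which branch is taken: the fallback keeps "Berlin." as one token, while spaCy would split off the final period.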

stale · 2020-04-30T02:53:52Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

alanakbik · 2020-04-30T10:41:07Z

This functionality was added a while back

Deep1994 added the "question" label on Jan 16, 2019

tabergma added the "feature" label on Jan 22, 2019

alanakbik mentioned this issue on Feb 24, 2019: Flair 0.5 features #563 (closed, 5 tasks)

pwichmann mentioned this issue on Aug 1, 2019: What is the expected result for ORGs with apostrophe s? #939 (closed)

stale bot added the "wontfix" label on Apr 30, 2020

alanakbik removed the "wontfix" label on Apr 30, 2020

alanakbik closed this as completed on Apr 30, 2020
