Which tokenizer does Flair use? #394

Closed
Deep1994 opened this issue Jan 16, 2019 · 9 comments
Labels
feature A new feature question Further information is requested

Comments

@Deep1994

Deep1994 commented Jan 16, 2019

Hi, I want to tokenize my own text the way Flair does, so which tokenizer does Flair use? My sentences vary in length: do Flair embeddings treat any special token as a padding token? If not, how should I handle variable-length sentences when I use Flair? Thanks!

@Deep1994 Deep1994 added the question Further information is requested label Jan 16, 2019
@stefan-it
Member

flair uses the segtok library for tokenization :)

Here you can find an example of how to use it:

https://github.com/zalandoresearch/flair/blob/390cdc51c1aaa992d776622ba28286efd962e883/resources/docs/TUTORIAL_2_TAGGING.md#tagging-a-list-of-sentences

You could use it like:

from segtok.segmenter import split_single

text = "This is a sentence. This is another sentence. I love Berlin."

# use segtok to split the raw text into sentences
sentences = list(split_single(text))

@stefan-it
Member

A padding example can be found here :)

@Deep1994
Author

Deep1994 commented Jan 16, 2019

Thank you. What about word tokenization? Also, do you know which combination of word embeddings works best? One problem is that if I want to combine Flair embeddings and BERT embeddings, I must use the same word tokenizer, but BERT has its own word tokenizer. Have you run any experiments on the different ways of combining word embeddings, and if so, which one performs better?

@alanakbik
Collaborator

If you pass use_tokenizer=True the segtok library is used to tokenize a sentence. So initialize your sentence like this:

from flair.data import Sentence

# Make a sentence object by passing an untokenized string and the 'use_tokenizer' flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

# Print the object to see what's in there
print(sentence)

As @stefan-it wrote, padding is already taken care of when you use our embedding classes, so you need not worry about special padding tokens.
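To make "padding is taken care of" concrete, here is a minimal sketch of what padding a batch of variable-length sentences generally involves. It is a generic illustration in plain Python, not Flair's internal code: the `pad_batch` helper, the zero-vector padding value, and the toy 2-dimensional embeddings are all assumptions made for the example.

```python
from typing import List, Tuple

Vector = List[float]


def pad_batch(batch: List[List[Vector]], dim: int) -> Tuple[List[List[Vector]], List[int]]:
    """Pad every sentence (a list of token vectors) to the longest length in the batch.

    Hypothetical helper for illustration only; returns the padded batch and
    the original lengths so that padded positions can be ignored later.
    """
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths, default=0)
    padded = [
        seq + [[0.0] * dim for _ in range(max_len - len(seq))]  # append zero vectors
        for seq in batch
    ]
    return padded, lengths


# Two "sentences" with 3 tokens and 1 token, each token a toy 2-dim embedding
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
]
padded, lengths = pad_batch(batch, dim=2)
print(lengths)         # original sentence lengths
print(len(padded[1]))  # the short sentence now has as many positions as the longest
```

Keeping the original lengths alongside the padded batch is the usual way to let a downstream model skip the padded positions, which is why no special padding token is needed on the user side.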

However, I should caution that segtok was built with English in mind. If you want to use a different tokenizer, it may be better to first tokenize your text with that tokenizer and write out whitespace-tokenized strings. Then read those in without use_tokenizer=True.
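The pre-tokenization workflow described above could look roughly like the sketch below. The `my_tokenizer` function is a hypothetical stand-in for whatever external tokenizer you prefer (here just a simple regular expression), and the sample sentences are made up. The Flair call is shown only as a comment and follows the API as described in this thread.

```python
import re


def my_tokenizer(text):
    """Hypothetical stand-in for an external tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


raw_lines = ["Das Gras ist grün.", "Ich liebe Berlin!"]

# Write out whitespace-tokenized strings: one space between every pair of tokens
pretokenized = [" ".join(my_tokenizer(line)) for line in raw_lines]

for line in pretokenized:
    print(line)

# With Flair as described in this thread, each string would then be read in
# WITHOUT the use_tokenizer flag, so that it is split on whitespace only:
#   from flair.data import Sentence
#   sentence = Sentence(pretokenized[0])
```

Because the tokens are already separated by single spaces, the token boundaries chosen by your own tokenizer are the ones the downstream embeddings see.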

Hope this helps!

@stefan-it
Member

@alanakbik Do you think it is worth adding support for different tokenization libraries? E.g. it seems that everyone is using spacy now (here's an overview of available models/languages). But I guess spacy is not as computationally cheap as segtok 🤔

@alanakbik
Collaborator

Yes it would be good to add support for different tokenizers. I was thinking of a unified interface for all tokenizers so that users can easily switch them out.
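As a rough sketch of what such a unified interface might look like: the class and method names below (`Tokenizer`, `tokenize`, `WhitespaceTokenizer`, `RegexTokenizer`, `count_tokens`) are hypothetical and chosen for illustration, not the interface Flair ended up shipping.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Hypothetical common interface: every tokenizer maps a string to a list of tokens."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...


class WhitespaceTokenizer(Tokenizer):
    """Splits on whitespace only; suitable for text that is already tokenized."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class RegexTokenizer(Tokenizer):
    """Separates words from single punctuation characters with a simple regex."""

    _pattern = re.compile(r"\w+|[^\w\s]")

    def tokenize(self, text: str) -> List[str]:
        return self._pattern.findall(text)


def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    """Caller code depends only on the interface, so tokenizers can be swapped freely."""
    return len(tokenizer.tokenize(text))


print(count_tokens("The grass is green.", WhitespaceTokenizer()))
print(count_tokens("The grass is green.", RegexTokenizer()))
```

A wrapper around segtok, spacy, or the BERT tokenizer would then be one more subclass, and user code would not need to change when switching between them.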

I am a bit hesitant about adding spacy since -- like allennlp -- this is a massive library with lots of subdependencies that would get installed by default. Since we try to keep Flair lightweight, we generally don't want to add too many dependencies. So I am not sure what the best way forward is on tokenization.

@stefan-it
Member

I would really like to have this kind of interface - I think we could make all dependencies like spacy optional (using try/except statements, as is done with the allennlp library when using the ELMo embeddings).

I currently use the BERT tokenizer code for simple tokenization :)

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 30, 2020
@alanakbik alanakbik removed the wontfix This will not be worked on label Apr 30, 2020
@alanakbik
Collaborator

This functionality was added a while back
