Which tokenizer does Flair use? #394
Here you can find an example of how to use it. You could use it like:

```python
# use a library to split text into sentences
from segtok.segmenter import split_single

text = "This is a sentence. This is another sentence. I love Berlin."
sentences = [sent for sent in split_single(text)]
```
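To make the idea concrete without installing anything, here is a rough, stdlib-only sketch of what a sentence splitter does. The function name and regex are my own illustration; segtok handles abbreviations, decimals and many other edge cases that this naive version does not.

```python
import re

def naive_split_sentences(text):
    """Very rough sentence splitter: break on '.', '!' or '?'
    followed by whitespace.

    Illustration only -- a real library like segtok is far more
    robust (abbreviations like 'Dr.' or decimals break this version).
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sentences = naive_split_sentences(
    "This is a sentence. This is another sentence. I love Berlin."
)
print(sentences)
# ['This is a sentence.', 'This is another sentence.', 'I love Berlin.']
```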
A padding example can be found here :)
Thank you! What about word tokenization? Do you know which combination of word embeddings is the best one? One problem is that if I want to combine Flair embeddings and BERT embeddings, I must use the same word tokenizer, but BERT has its own word tokenizer. Have you conducted any experiments on the different ways of combining word embeddings, and which one performs better?
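For readers wondering how a subword tokenizer like BERT's WordPiece can be reconciled with word-level embeddings at all: a common approach is to map each word to its subword pieces and then pool the piece vectors back to one vector per word. The sketch below only shows the index alignment step; the function name is hypothetical and this is not Flair's actual implementation.

```python
def align_subwords_to_words(words, subwords):
    """Map each word to the indices of its subword pieces.

    Assumes a BERT-style convention where continuation pieces start
    with '##'. Illustration only; Flair's own BERT wrapper handles
    the equivalent bookkeeping internally.
    """
    alignment = []
    i = 0
    for word in words:
        pieces = [i]      # first piece of the current word
        i += 1
        while i < len(subwords) and subwords[i].startswith("##"):
            pieces.append(i)  # continuation pieces belong to the same word
            i += 1
        alignment.append(pieces)
    return alignment

words = ["I", "love", "Berlin", "tokenizers"]
subwords = ["I", "love", "Berlin", "token", "##izer", "##s"]
print(align_subwords_to_words(words, subwords))
# [[0], [1], [2], [3, 4, 5]]
```

Given such an alignment, the subword vectors for each word can be averaged (or the first piece taken) to get one embedding per word-level token.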
If you pass the `use_tokenizer` flag, the text is tokenized for you:

```python
from flair.data import Sentence

# Make a sentence object by passing an untokenized string and the 'use_tokenizer' flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

# Print the object to see what's in there
print(sentence)
```

As @stefan-it wrote, padding is already taken care of when you use our embedding classes, so you need not worry about special padding tokens. However, I should caution that […] Hope this helps!
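To illustrate what "padding is taken care of" means in general terms, here is a generic, stdlib-only sketch of batch padding for variable-length token sequences. The function and the `"<pad>"` token are my own placeholders, not Flair internals.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad tokenized sentences to the length of the longest one.

    Generic illustration of batch padding; Flair's embedding classes
    do the equivalent internally, so users never see a pad token.
    """
    max_len = max(len(sent) for sent in batch)
    return [sent + [pad_token] * (max_len - len(sent)) for sent in batch]

batch = [["The", "grass", "is", "green", "."],
         ["I", "love", "Berlin", "."]]
print(pad_batch(batch))
# [['The', 'grass', 'is', 'green', '.'],
#  ['I', 'love', 'Berlin', '.', '<pad>']]
```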
@alanakbik Do you think it would be worth adding support for different tokenization libraries? E.g. it seems that everyone is using spaCy.
Yes, it would be good to add support for different tokenizers. I was thinking of a unified interface for all tokenizers so that users can easily switch them out. I am a bit hesitant about adding spaCy since, like allennlp, it is a massive library with lots of subdependencies that would get installed by default. Since we try to keep Flair lightweight, we generally don't want to add too many dependencies. So I am not sure what the best way forward is on tokenization.
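A unified interface like the one proposed above could be sketched as follows. All names here are hypothetical illustrations of the idea (Flair later shipped something along these lines), not the actual API: the key point is that every tokenizer maps a string to a list of token strings, so implementations can be swapped freely.

```python
from abc import ABC, abstractmethod
from typing import List

class Tokenizer(ABC):
    """Hypothetical unified tokenizer interface: string in, tokens out."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...

class WhitespaceTokenizer(Tokenizer):
    """Trivial reference implementation: split on whitespace."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()

# Any Tokenizer subclass (segtok-based, spaCy-based, ...) could be
# dropped in here without changing downstream code.
tok: Tokenizer = WhitespaceTokenizer()
print(tok.tokenize("The grass is green ."))
# ['The', 'grass', 'is', 'green', '.']
```

Heavy backends like spaCy could then live behind optional extras, keeping the core install lightweight.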
I would really like to have this kind of interface. I think we could make all dependencies like spaCy optional. I currently use the BERT tokenizer code for simple tokenization :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This functionality was added a while back.
Hi, I want to tokenize my own text the way Flair does, so I would like to know which tokenizer Flair uses. Also, the length of my sentences varies: do Flair embeddings have a special token that serves as a padding token? Or how should I process variable-length sentences when using Flair? Thanks!