Tokenization notebook #173
Conversation
It would be good to add a stride/overlap when we chunk since this is a best practice. With ELSER, we recommend 50% overlap/256 token stride.
notebooks/search/tokenization.ipynb
Outdated
"\n",
"Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw.\n",
"\n",
"Furthermore, it is best practice to make the chunks overlap (**TODO add reference**). With ELSER, we recommend 50% token overlap (i.e. a 256 token stride)."
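To make the chunking described in the cell above concrete, here is a minimal sketch of splitting with a 256-token stride, assuming the notebook's Hugging Face bert_tokenizer and an input string long_text; the CHUNK_SIZE and STRIDE names are illustrative, not part of the notebook:

CHUNK_SIZE = 510  # 512 minus the [CLS] and [SEP] special tokens
STRIDE = 256      # each new chunk starts 256 tokens after the previous one (~50% overlap)

tokens = bert_tokenizer.encode(long_text, add_special_tokens=False)
chunks = [
    bert_tokenizer.decode(tokens[start : start + CHUNK_SIZE])
    for start in range(0, len(tokens), STRIDE)  # trailing chunks may be shorter
]

Each decoded chunk can then be sent to Elasticsearch separately, as the cell above suggests.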
@joshdevins Do you have a reference that we can link here?
I do not. Maybe @qherreros does? I don't think it's in public docs anywhere, just in our internal benchmarking.
The best reference we have on this subject is in internal documentation.
I found this paper, which explicitly measures improvements from overlapping chunks, but it's for an encoder/decoder architecture that fuses chunks in the decoder. I still think their conclusion can be applied to more than just the FiD architecture.
Alright, for now I went without a reference and this generic statement: "in practice we often see improved performance when using overlapping chunks".
How do you feel about making this notebook run cleanly under nbtest?
Definitely. Runs fine for me locally:

$ nbtest notebooks/search/tokenization.ipynb
Running notebooks/search/tokenization.ipynb... OK

@miguelgrinberg can you confirm that nbtest always expects me to set env variables for Elastic Cloud access even if my notebook does not require it?
There shouldn't be a need to set env vars if they are not used by the notebook. Are you getting any error(s) that I can see? Also, great that the notebook runs cleanly! What you need to do now is add a Makefile, and then reference the Makefile in the top-level Makefile, so that this gets included when I run
LGTM
This is brilliant; we should have done it ages ago.
notebooks/search/tokenization.ipynb
Outdated
"name": "stdout",
"output_type": "stream",
"text": [
"The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\n",
Pulp Fiction! What do I win?
A Royale with cheese 🍔
notebooks/search/tokenization.ipynb
Outdated
"source": [
"tokens = bert_tokenizer.encode(long_text)[1:-1] # exclude special tokens at the beginning and end\n",
"chunked = [\n",
"    bert_tokenizer.decode(tokens_chunk)\n",
Encoding the text involves a step that normalises it to Unicode NFD form and then strips non-spacing marks (Unicode category Mn).
Decoding does a reverse look-up on the BERT vocabulary file and joins the compound elements (those starting with ##).
Decoding isn't perfect: those non-spacing marks are lost, so the decoded text may appear different from the original.
This might be a level of detail too far for most readers, but it's worth mentioning. Maybe something like:
"Tokenisation involves a normalisation step and strips non-spacing marks. If decoding is implemented as a reverse lookup from token IDs to vocabulary entries, those stripped marks will not be recovered, resulting in decoded text that could be slightly different from the original."
Again, I'm not sure we need this level of detail.
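For readers who want to see roughly what that normalisation step amounts to, here is an illustrative sketch using only the standard library (not the tokenizer's actual code path); the example string is the one used further down in this thread:

import unicodedata

def strip_accents(text):
    # Decompose to NFD, then drop non-spacing marks (Unicode category Mn)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Da\u0305vis"))  # -> Davis: the combining mark is gone, so decoding cannot restore it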
Interesting point. This seems to be a problem only for BERT, though. The E5 tokenizer preserves the non-spacing marks:
latvian_name = "Da\u0305vis"  # contains the combining overline U+0305 (a non-spacing mark)
print(latvian_name)
print(bert_tokenizer.decode(bert_tokenizer.encode(latvian_name)))  # BERT round trip drops the mark and lowercases
print(e5_tokenizer.decode(e5_tokenizer.encode(latvian_name)))      # E5 round trip preserves the mark
Da̅vis
[CLS] davis [SEP]
<s> Da̅vis</s>
If we recommend using ELSER (i.e. the BERT tokenizer) for English-only content and E5 for multilingual content, I lean towards omitting this point here.
@miguelgrinberg Added to the existing Makefile: 4f6dfaa
Add notebook to show users how tokenization works in the context of semantic search and explain why and how users need to chunk text. (Follow-up on a forum thread.)
TODO
- Add reference to overlap documentation
- Comment on the fact that it can happen that chunks start with a ##foo token (see the sketch below).
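On that second point, a quick illustration (assuming the notebook's bert_tokenizer; the example word is arbitrary): a fixed-size split over token IDs can cut through a word, so a chunk may begin with a continuation piece.

pieces = bert_tokenizer.tokenize("tokenization")
print(pieces)      # e.g. ['token', '##ization']
print(pieces[1:])  # a chunk whose boundary falls here would start with '##ization'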