
Tokenization notebook #173

Merged
merged 9 commits into elastic:main on Jan 26, 2024

Conversation

@maxjakob (Contributor) commented Jan 25, 2024

Add a notebook to show users how tokenization works in the context of semantic search, and explain why and how users need to chunk text. (Follow-up on a forum thread.)

TODO

  • Refine recommendation at the end
  • Remove output of some cells
  • Read text file from the web
  • Add reference to overlap documentation
  • Comment on the fact that a chunk can start with a `##foo` continuation token (see the sketch below).
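
A minimal sketch of those continuation tokens, assuming a Hugging Face BERT WordPiece tokenizer (the `bert-base-uncased` checkpoint and the example word are illustrative assumptions, not necessarily what the notebook uses):

```python
# Illustrative only: WordPiece splits rare words into sub-word pieces, and every
# piece after the first is prefixed with "##". If we chunk a token sequence by
# position, a chunk boundary can therefore land right before such a "##" piece.
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

print(bert_tokenizer.tokenize("tokenization"))
# typically something like: ['token', '##ization']
```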

@maxjakob maxjakob marked this pull request as ready for review January 25, 2024 11:02
@joshdevins (Member) left a comment:

It would be good to add a stride/overlap when we chunk since this is a best practice. With ELSER, we recommend 50% overlap/256 token stride.

notebooks/search/lorem-ipsum.txt (review thread resolved)
notebooks/search/tokenization.ipynb (review thread resolved)
@maxjakob maxjakob requested review from joshdevins and removed request for davidkyle January 25, 2024 14:59
"\n",
"Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw.\n",
"\n",
"Furthermore, it is best practice to make the chunks overlap (**TODO add reference**). With ELSER, we recommend 50% token overlap (i.e. a 256 token stride)."
@maxjakob (Contributor, Author):

@joshdevins Do you have a reference that we can link here?

@joshdevins (Member):

I do not. Maybe @qherreros does? I don't think it's in public docs anywhere, just in our internal benchmarking.

Reply from another participant:

The best reference we have on this subject is in internal documentation.
I found this paper, which explicitly measures improvements with overlaps, but it's in an encoder/decoder architecture, fusing chunks in the decoder. I still think their conclusion can be applied to more than just the FiD architecture.

@maxjakob (Contributor, Author) commented Jan 26, 2024:

Alright, for now I went without a reference and used this generic statement: "in practice we often see improved performance when using overlapping chunks".

@miguelgrinberg (Collaborator):

How do you feel about making this notebook run cleanly under nbtest?

@maxjakob (Contributor, Author):

> How do you feel about making this notebook run cleanly under nbtest?

Definitely. Runs fine for me locally:

```
$ nbtest notebooks/search/tokenization.ipynb
Running notebooks/search/tokenization.ipynb... OK
```

@miguelgrinberg can you confirm that nbtest always expects me to set env variables for Elastic Cloud access even if my notebook does not require it?

@miguelgrinberg (Collaborator) commented Jan 25, 2024:

There shouldn't be a need to set env vars if they are not used by the notebook. Are you getting any error(s) that I can see?

Also, great that the notebook runs cleanly! What you need to do now is add a Makefile, and then reference the Makefile in the top-level Makefile, so that this gets included when I run make test from the top!

@davidkyle (Member) left a comment:

LGTM

This is brilliant; we should have done it ages ago.

"name": "stdout",
"output_type": "stream",
"text": [
"The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\n",
Reply (Member):

Pulp Fiction! What do I win?

@maxjakob (Contributor, Author):

A Royale with cheese 🍔

notebooks/search/tokenization.ipynb (review thread resolved)
"source": [
"tokens = bert_tokenizer.encode(long_text)[1:-1] # exclude special tokens at the beginning and end\n",
"chunked = [\n",
" bert_tokenizer.decode(tokens_chunk)\n",
@davidkyle (Member) commented Jan 25, 2024:

Encoding the text involves a step that normalises the text to Unicode NFD form and then strips non-spacing marks (Unicode category Mn).

Decoding does a reverse lookup on the BERT vocabulary file and joins the compound elements (those starting with ##).

Decoding isn't perfect: those non-spacing marks are lost, so the result may appear different from the original text.

This might be a level of detail too far for most readers, but it's worth mentioning. Maybe something like:

"Tokenisation involves a normalisation step and strips non-spacing marks. If decoding is implemented as a reverse lookup from token IDs to vocabulary entries, those stripped marks will not be recovered, resulting in decoded text that could be slightly different from the original."

Again, I'm not sure we need this level of detail.
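
For readers who want to see the normalisation effect davidkyle describes, a minimal illustration (not the tokenizer's internal code) of NFD normalisation followed by stripping non-spacing marks, using the same Latvian example as the reply below:

```python
import unicodedata

def strip_non_spacing_marks(text: str) -> str:
    # Decompose to NFD, then drop combining characters in Unicode category Mn.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_non_spacing_marks("Da\u0305vis"))  # -> "Davis": the combining overline is gone
```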

@maxjakob (Contributor, Author):

Interesting point. This seems to be a problem only for BERT, though. The E5 tokenizer preserves the non-spacing marks:

```python
latvian_name = "Da\u0305vis"
print(latvian_name)
print(bert_tokenizer.decode(bert_tokenizer.encode(latvian_name)))
print(e5_tokenizer.decode(e5_tokenizer.encode(latvian_name)))
```

Output:

```
Da̅vis
[CLS] davis [SEP]
<s> Da̅vis</s>
```

If we recommend ELSER (i.e. the BERT tokenizer) for English-only text and E5 for multilingual text, I lean towards omitting this point here.

@maxjakob (Contributor, Author):

> What you need to do now is add a Makefile, and then reference the Makefile in the top-level Makefile, so that this gets included when I run make test from the top!

@miguelgrinberg Added to the existing Makefile: 4f6dfaa

@maxjakob maxjakob merged commit 5369ab6 into elastic:main Jan 26, 2024
2 checks passed