
Tokenization notebook #173

Merged
merged 9 commits into elastic:main on Jan 26, 2024

Conversation

@maxjakob (Contributor) commented Jan 25, 2024

Add a notebook to show users how tokenization works in the context of semantic search, and explain why and how users need to chunk text. (Follow-up on a forum thread.)

TODO

  • Refine recommendation at the end
  • Remove output of some cells
  • Read text file from the web
  • Add reference to overlap documentation
  • Comment on the fact that a chunk can start with a `##foo` continuation token (see the sketch below).
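
A minimal sketch of those continuation tokens, assuming a Hugging Face BERT WordPiece tokenizer (the `bert-base-uncased` checkpoint and the example word are illustrative assumptions, not necessarily what the notebook uses):

```python
# Illustrative only: WordPiece splits rare words into sub-word pieces, and every
# piece after the first is prefixed with "##". If we chunk a token sequence by
# position, a chunk boundary can therefore land right before such a "##" piece.
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

print(bert_tokenizer.tokenize("tokenization"))
# typically something like: ['token', '##ization']
```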

@maxjakob maxjakob marked this pull request as ready for review January 25, 2024 11:02
@joshdevins (Member) left a comment:

It would be good to add a stride/overlap when we chunk since this is a best practice. With ELSER, we recommend 50% overlap/256 token stride.

notebooks/search/lorem-ipsum.txt (review thread resolved)
notebooks/search/tokenization.ipynb (review thread resolved)
@maxjakob maxjakob requested review from joshdevins and removed request for davidkyle January 25, 2024 14:59
"\n",
"Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw.\n",
"\n",
"Furthermore, it is best practice to make the chunks overlap (**TODO add reference**). With ELSER, we recommend 50% token overlap (i.e. a 256 token stride)."
@maxjakob (Contributor, Author):

@joshdevins Do you have a reference that we can link here?

@joshdevins (Member):

I do not. Maybe @qherreros does? I don't think it's in public docs anywhere, just in our internal benchmarking.

Reply from another participant:

The best reference we have on this subject is in internal documentation.
I found this paper, which explicitly measures improvements with overlaps, but it's in an encoder/decoder architecture, fusing chunks in the decoder. I still think their conclusion can be applied to more than just the FiD architecture.

@maxjakob (Contributor, Author) commented Jan 26, 2024:

Alright, for now I went without a reference and used this generic statement: "in practice we often see improved performance when using overlapping chunks".

@miguelgrinberg (Collaborator):

How do you feel about making this notebook run cleanly under nbtest?

@maxjakob (Contributor, Author):

> How do you feel about making this notebook run cleanly under nbtest?

Definitely. Runs fine for me locally:

```
$ nbtest notebooks/search/tokenization.ipynb
Running notebooks/search/tokenization.ipynb... OK
```

@miguelgrinberg can you confirm that nbtest always expects me to set env variables for Elastic Cloud access even if my notebook does not require it?

@miguelgrinberg (Collaborator) commented Jan 25, 2024:

There shouldn't be a need to set env vars if they are not used by the notebook. Are you getting any error(s) that I can see?

Also, great that the notebook runs cleanly! What you need to do now is add a Makefile, and then reference the Makefile in the top-level Makefile, so that this gets included when I run make test from the top!

@davidkyle (Member) left a comment:

LGTM

This is brilliant; we should have done it ages ago.

"name": "stdout",
"output_type": "stream",
"text": [
"The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\n",
Reply (Member):

Pulp Fiction! What do I win?

@maxjakob (Contributor, Author):

A Royale with cheese 🍔

notebooks/search/tokenization.ipynb (review thread resolved)
"source": [
"tokens = bert_tokenizer.encode(long_text)[1:-1] # exclude special tokens at the beginning and end\n",
"chunked = [\n",
" bert_tokenizer.decode(tokens_chunk)\n",
@davidkyle (Member) commented Jan 25, 2024:

Encoding the text involves a step that normalises the text to Unicode NFD form and then strips non-spacing marks (Unicode category Mn).

Decoding does a reverse lookup on the BERT vocabulary file and joins the compound elements (those starting with ##).

Decoding isn't perfect: those non-spacing marks are lost, so the result may appear different from the original text.

This might be a level of detail too far for most readers, but it's worth mentioning. Maybe something like:

"Tokenisation involves a normalisation step and strips non-spacing marks. If decoding is implemented as a reverse lookup from token IDs to vocabulary entries, those stripped marks will not be recovered, resulting in decoded text that could be slightly different from the original."

Again, I'm not sure we need this level of detail.
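
For readers who want to see the normalisation effect davidkyle describes, a minimal illustration (not the tokenizer's internal code) of NFD normalisation followed by stripping non-spacing marks, using the same Latvian example as the reply below:

```python
import unicodedata

def strip_non_spacing_marks(text: str) -> str:
    # Decompose to NFD, then drop combining characters in Unicode category Mn.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_non_spacing_marks("Da\u0305vis"))  # -> "Davis": the combining overline is gone
```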

@maxjakob (Contributor, Author):

Interesting point. This seems to be a problem only for BERT, though. The E5 tokenizer preserves the non-spacing marks:

```python
latvian_name = "Da\u0305vis"
print(latvian_name)
print(bert_tokenizer.decode(bert_tokenizer.encode(latvian_name)))
print(e5_tokenizer.decode(e5_tokenizer.encode(latvian_name)))
```

Output:

```
Da̅vis
[CLS] davis [SEP]
<s> Da̅vis</s>
```

If we recommend ELSER (i.e. the BERT tokenizer) for English-only text and E5 for multilingual text, I lean towards omitting this point here.

@maxjakob (Contributor, Author):

> What you need to do now is add a Makefile, and then reference the Makefile in the top-level Makefile, so that this gets included when I run make test from the top!

@miguelgrinberg Added to the existing Makefile: 4f6dfaa

@maxjakob maxjakob merged commit 5369ab6 into elastic:main Jan 26, 2024
2 checks passed