From 0ca0e689352f10968c8a69aa170df3f7af816def Mon Sep 17 00:00:00 2001 From: Max Jakob Date: Thu, 25 Jan 2024 11:38:04 +0100 Subject: [PATCH 1/9] Tokenization notebook --- notebooks/search/lorem-ipsum.txt | 1 + notebooks/search/tokenization.ipynb | 324 ++++++++++++++++++++++++++++ 2 files changed, 325 insertions(+) create mode 100644 notebooks/search/lorem-ipsum.txt create mode 100644 notebooks/search/tokenization.ipynb diff --git a/notebooks/search/lorem-ipsum.txt b/notebooks/search/lorem-ipsum.txt new file mode 100644 index 00000000..c69ddaad --- /dev/null +++ b/notebooks/search/lorem-ipsum.txt @@ -0,0 +1 @@ +Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Est pellentesque elit ullamcorper dignissim. Sit amet cursus sit amet dictum sit amet. Enim neque volutpat ac tincidunt vitae semper quis lectus. Nulla facilisi etiam dignissim diam quis enim lobortis. Id velit ut tortor pretium. Ut tortor pretium viverra suspendisse potenti nullam ac tortor. Senectus et netus et malesuada fames ac. Sed faucibus turpis in eu. Maecenas ultricies mi eget mauris pharetra. In iaculis nunc sed augue. Sit amet cursus sit amet dictum. Sit amet luctus venenatis lectus magna. Adipiscing tristique risus nec feugiat. Nisi quis eleifend quam adipiscing vitae proin sagittis nisl rhoncus. Scelerisque varius morbi enim nunc faucibus a. Purus semper eget duis at tellus at. Cursus metus aliquam eleifend mi. Tristique senectus et netus et malesuada fames. Netus et malesuada fames ac. Viverra aliquet eget sit amet tellus cras. Hac habitasse platea dictumst vestibulum rhoncus est pellentesque elit. Molestie ac feugiat sed lectus vestibulum mattis. Etiam erat velit scelerisque in dictum non. Dolor sit amet consectetur adipiscing elit duis tristique sollicitudin nibh. Diam vulputate ut pharetra sit amet aliquam id. Arcu non sodales neque sodales ut etiam sit. Neque vitae tempus quam pellentesque nec nam. Amet porttitor eget dolor morbi non arcu risus quis. Vitae semper quis lectus nulla at volutpat diam ut. Blandit volutpat maecenas volutpat blandit aliquam. Lobortis elementum nibh tellus molestie nunc. Lectus arcu bibendum at varius vel pharetra vel turpis nunc. In hac habitasse platea dictumst. Vitae suscipit tellus mauris a diam maecenas. Mi eget mauris pharetra et. Habitant morbi tristique senectus et netus. Eu lobortis elementum nibh tellus molestie nunc non. Scelerisque varius morbi enim nunc faucibus a. Tincidunt arcu non sodales neque sodales ut etiam sit amet. Tellus integer feugiat scelerisque varius. Magna fermentum iaculis eu non diam phasellus vestibulum lorem. Eget nunc lobortis mattis aliquam faucibus. Dignissim sodales ut eu sem integer vitae justo eget. Urna id volutpat lacus laoreet. Mauris nunc congue nisi vitae suscipit tellus mauris a diam. Scelerisque in dictum non consectetur a erat nam at lectus. Neque sodales ut etiam sit amet nisl. Blandit cursus risus at ultrices. Scelerisque mauris pellentesque pulvinar pellentesque habitant morbi tristique senectus et. Cursus vitae congue mauris rhoncus aenean vel elit scelerisque. Lobortis feugiat vivamus at augue eget arcu dictum. Sagittis orci a scelerisque purus semper eget duis at. Ornare suspendisse sed nisi lacus sed viverra tellus in hac. Massa sapien faucibus et molestie. Vulputate odio ut enim blandit volutpat maecenas volutpat. Mauris rhoncus aenean vel elit scelerisque mauris pellentesque pulvinar pellentesque. Massa sapien faucibus et molestie ac. 
Orci porta non pulvinar neque laoreet suspendisse interdum consectetur. Mauris commodo quis imperdiet massa. Volutpat consequat mauris nunc congue nisi vitae suscipit. Malesuada fames ac turpis egestas maecenas pharetra convallis. Cursus risus at ultrices mi tempus imperdiet. Non enim praesent elementum facilisis leo vel fringilla est. Felis bibendum ut tristique et. Felis donec et odio pellentesque diam volutpat commodo sed egestas. Ut porttitor leo a diam sollicitudin tempor id eu. Dolor purus non enim praesent. Tortor aliquam nulla facilisi cras. Rhoncus dolor purus non enim. Sed vulputate odio ut enim blandit volutpat maecenas. Consequat semper viverra nam libero justo laoreet. Eget nunc scelerisque viverra mauris. Id cursus metus aliquam eleifend mi in nulla. Mattis molestie a iaculis at erat pellentesque adipiscing. Enim nec dui nunc mattis. Hendrerit gravida rutrum quisque non tellus orci ac. Fermentum iaculis eu non diam phasellus vestibulum lorem sed. Adipiscing diam donec adipiscing tristique risus. Sit amet commodo nulla facilisi nullam vehicula ipsum. Amet consectetur adipiscing elit ut aliquam purus sit. Id diam vel quam elementum pulvinar etiam non quam. Nulla pharetra diam sit amet nisl suscipit adipiscing bibendum. Massa tempor nec feugiat nisl pretium fusce id. \ No newline at end of file diff --git a/notebooks/search/tokenization.ipynb b/notebooks/search/tokenization.ipynb new file mode 100644 index 00000000..95fb92b0 --- /dev/null +++ b/notebooks/search/tokenization.ipynb @@ -0,0 +1,324 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "s49gpkvZ7q53" + }, + "source": [ + "# Tokenization for Semantic Search (ELSER and E5)\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/tokenization.ipynb)\n", + "\n", + "Elasticsearch offers some [semantic search](https://www.elastic.co/what-is/semantic-search) models, most notably [ELSER](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) and [E5](https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model), to search through documents in a _meaningful_ way. Part of the process is breaking up texts (both for indexing documents and for queries) into tokens. Tokens are commonly thought of as words, but this is not accurate. Other substrings in the text also carry meaning to the semantic models and therefore have to be split out separately. For ELSER, our English-only model, this is done with the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) tokenizer.\n", + "\n", + "For Elasticsearch users it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the 512th will not be represented in your semantic search. Hence it is valuable to know the number of tokens for your input texts.\n", + "\n", + "Currently it is not possible to get the token count information via the API, so we share the code for calculating token counts here. 
This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing, which has to be done by the user (as of version 8.12, future version will remove the necessity and auto-chunk behind the scenes).\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gaTFHLJC-Mgi" + }, + "source": [ + "# Install packages\n", + "\n", + "As stated above, ELSER uses [BERT](https://huggingface.co/blog/bert-101)'s tokenizer internally. Here we install the `transformers` package that gives us an interface to this tokenizer. (We install the `tabulate` package to be able to print a nice table for comparison later on.)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K9Q1p2C9-wce", + "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" + }, + "outputs": [], + "source": [ + "!pip install -qU tabulate transformers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we import everything we need. You can ignore a potential warning on models not being available because we only need the tokenizer here." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/maxjakob/.pyenv/versions/3.11.7/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n" + ] + } + ], + "source": [ + "import json\n", + "from urllib.request import urlopen\n", + "\n", + "from tabulate import tabulate\n", + "from transformers import AutoTokenizer, BertTokenizer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Define tokenizers\n", + "\n", + "Now we are ready to initialize the BERT tokenizer that ELSER uses and the E5 tokenizer for the multilingual semantic search. We also define a whitespace tokenizer in order to compare this naive way of creating tokens with the two model tokenizers." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", + "e5_tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')\n", + "\n", + "def whitespace_tokenize(text):\n", + " return text.split()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load example data\n", + "\n", + "Download the movies example data that is also used in the other search examples." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/movies.json\"\n", + "response = urlopen(url)\n", + "movies = json.load(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Compare token counts\n", + "\n", + "Compare the token counts of the different tokenization methods for the descriptions of the movies."
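Before running the comparison over all movie descriptions, a quick side-by-side on a single sentence shows what each method produces. This is a minimal sketch that only assumes the `bert_tokenizer`, `e5_tokenizer` and `whitespace_tokenize` definitions from the cells above:

```python
# Quick sanity check of the three tokenization methods defined above.
sentence = "Two imprisoned men bond over a number of years."

print(whitespace_tokenize(sentence))     # naive split on whitespace
print(bert_tokenizer.tokenize(sentence))  # WordPiece subword tokens, no special tokens
print(e5_tokenizer.tokenize(sentence))    # SentencePiece subword tokens

# encode() additionally wraps the sequence in special tokens
# ([CLS] ... [SEP] for BERT), so its length is tokenize() plus 2:
assert len(bert_tokenizer.encode(sentence)) == len(bert_tokenizer.tokenize(sentence)) + 2
```

This difference between `tokenize()` and `encode()` is also why the token counts in the next cell include the two special tokens.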
+ ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " whitespace BERT E5 text\n", + "------------ ------ ---- -----------------------------------------------------------------------------------\n", + " 16 21 30 An organized crime dynasty's aging patriarch transfers control of his clandestin...\n", + " 19 25 32 Two imprisoned men bond over a number of years, finding solace and eventual rede...\n", + " 20 25 34 Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven ...\n", + " 20 33 36 An insomniac office worker and a devil-may-care soapmaker form an underground fi...\n", + " 22 28 27 An undercover cop and a mole in the police attempt to identify each other while ...\n", + " 23 26 31 A computer hacker learns from mysterious rebels about the true nature of his rea...\n", + " 26 36 42 A thief who steals corporate secrets through the use of dream-sharing technology...\n", + " 27 36 42 The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of din...\n", + " 27 40 48 A young F.B.I. cadet must receive the help of an incarcerated and manipulative c...\n", + " 30 35 43 A sole survivor tells of the twisty events leading up to a horrific gun battle o...\n", + " 33 39 44 When the menace known as the Joker wreaks havoc and chaos on the people of Gotha...\n", + " 33 40 44 The story of Henry Hill and his life in the mob, covering his relationship with ...\n" + ] + } + ], + "source": [ + "def count_tokens(text):\n", + " whitespace_tokens = len(whitespace_tokenize(text))\n", + " bert_tokens = len(bert_tokenizer.encode(text))\n", + " e5_tokens = len(e5_tokenizer.encode(text))\n", + " return [whitespace_tokens, bert_tokens, e5_tokens, f\"{text[:80]}...\"]\n", + "\n", + "counts = [count_tokens(movie[\"plot\"]) for movie in movies]\n", + "\n", + "print(tabulate(sorted(counts), [\"whitespace\", \"BERT\", \"E5\", \"text\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that both the BERT and the E5 tokenizers yield more tokens in every example, in some cases even twice as many. Why is that? Let's look at an example:" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\n", + "\n", + "['[CLS]', 'the', 'lives', 'of', 'two', 'mob', 'hit', '##men', ',', 'a', 'boxer', ',', 'a', 'gangster', 'and', 'his', 'wife', ',', 'and', 'a', 'pair', 'of', 'diner', 'bandits', 'inter', '##t', '##wine', 'in', 'four', 'tales', 'of', 'violence', 'and', 'redemption', '.', '[SEP]']\n" + ] + } + ], + "source": [ + "example_movie = movies[0][\"plot\"]\n", + "print(example_movie)\n", + "print()\n", + "\n", + "movie_tokens = bert_tokenizer.encode(example_movie)\n", + "print(str([bert_tokenizer.decode([t]) for t in movie_tokens]))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can observe:\n", + "- There are special tokens `[CLS]` and `[SEP]` to model the beginning and end of the text. 
These two extra tokens will become relevant below.\n", + "- All tokens are lower-cased.\n", + "- Punctuations are they own tokens.\n", + "- Compound words are split into multiple tokens, for example `hitmen` becomes `hit` and `##men`.\n", + "\n", + "Given this behavior, it is easy to see how longer texts yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Handling long texts\n", + "\n", + "We saw how to count the number of tokens using the tokenizers from different models. ELSER uses the BERT tokenizer, so when using `.elser_model_2` it internally splits the text with this method.\n", + "\n", + "Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "SEMANTIC_SEARCH_TOKEN_LIMIT = 510 # 512 minus space for the 2 special tokens\n", + "\n", + "def chunk(tokens, chunk_size=SEMANTIC_SEARCH_TOKEN_LIMIT):\n", + " for i in range(0, len(tokens), chunk_size):\n", + " yield tokens[i:i+chunk_size]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Loading a longer example text:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [], + "source": [ + "# url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/lorem-ipsum.txt\"\n", + "# response = urlopen(url)\n", + "response = open(\"./lorem-ipsum.txt\") # TODO remove in favor of download\n", + "long_text = response.read()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we tokenize the long text, create chunks of size 510 tokens and map the tokens back to text. Notice that on the first run the BERT tokenizer itself is warning us about the 512 tokens limitation." + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['[CLS] lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. est pellentesque elit ullamcorper dignissim. sit amet cursus sit amet dictum sit amet. enim neque volutpat ac tincidunt vitae semper quis lectus. nulla facilisi etiam dignissim diam quis enim lobortis. id velit ut tortor pretium. ut tortor pretium viverra suspendisse potenti nullam ac tortor. senectus et netus et malesuada fames ac. sed faucibus turpis in eu. maecenas ultricies mi eget mauris pharetra. in iaculis nunc sed augue. sit amet cursus sit amet dictum. sit amet luctus venenatis lectus magna. adipiscing tristique risus nec feugiat. nisi quis eleifend quam adipiscing vitae proin sagittis nisl rhoncus. scelerisque varius morbi enim nunc faucibus a. purus semper eget duis at tellus at. cursus metus aliquam eleifend mi. tristique senectus et netus et malesuada fames. netus et malesuada fames ac. viverra aliquet eget sit amet tellus cras. hac habitasse platea dictumst vestibulum rhoncus est pellentesque elit. molestie ac feugiat sed lectus vestibulum mattis. etiam erat velit scelerisque in dictum non. 
dolor sit amet consectetur adipiscing elit duis tristique sollicitudin nibh. diam vulputate ut pharetra sit amet aliquam id. arcu non sodales neque sodales ut etiam sit. neque vitae tempus quam pellentesque nec nam. amet porttitor eget dolor morbi non arcu risus quis. vitae semper qui',\n", + " '##s lectus nulla at volutpat diam ut. blandit volutpat maecenas volutpat blandit aliquam. lobortis elementum nibh tellus molestie nunc. lectus arcu bibendum at varius vel pharetra vel turpis nunc. in hac habitasse platea dictumst. vitae suscipit tellus mauris a diam maecenas. mi eget mauris pharetra et. habitant morbi tristique senectus et netus. eu lobortis elementum nibh tellus molestie nunc non. scelerisque varius morbi enim nunc faucibus a. tincidunt arcu non sodales neque sodales ut etiam sit amet. tellus integer feugiat scelerisque varius. magna fermentum iaculis eu non diam phasellus vestibulum lorem. eget nunc lobortis mattis aliquam faucibus. dignissim sodales ut eu sem integer vitae justo eget. urna id volutpat lacus laoreet. mauris nunc congue nisi vitae suscipit tellus mauris a diam. scelerisque in dictum non consectetur a erat nam at lectus. neque sodales ut etiam sit amet nisl. blandit cursus risus at ultrices. scelerisque mauris pellentesque pulvinar pellentesque habitant morbi tristique senectus et. cursus vitae congue mauris rhoncus aenean vel elit scelerisque. lobortis feugiat vivamus at augue eget arcu dictum. sagittis orci a scelerisque purus semper eget duis at. ornare suspendisse sed nisi lacus sed viverra tellus in hac. massa sapien faucibus et molestie. vulputate odio ut enim blandit volutpat maecenas volutpat. mauris rhoncus aenean vel elit scelerisque mauris pellentesque',\n", + " 'pulvinar pellentesque. massa sapien faucibus et molestie ac. orci porta non pulvinar neque laoreet suspendisse interdum consectetur. mauris commodo quis imperdiet massa. volutpat consequat mauris nunc congue nisi vitae suscipit. malesuada fames ac turpis egestas maecenas pharetra convallis. cursus risus at ultrices mi tempus imperdiet. non enim praesent elementum facilisis leo vel fringilla est. felis bibendum ut tristique et. felis donec et odio pellentesque diam volutpat commodo sed egestas. ut porttitor leo a diam sollicitudin tempor id eu. dolor purus non enim praesent. tortor aliquam nulla facilisi cras. rhoncus dolor purus non enim. sed vulputate odio ut enim blandit volutpat maecenas. consequat semper viverra nam libero justo laoreet. eget nunc scelerisque viverra mauris. id cursus metus aliquam eleifend mi in nulla. mattis molestie a iaculis at erat pellentesque adipiscing. enim nec dui nunc mattis. hendrerit gravida rutrum quisque non tellus orci ac. fermentum iaculis eu non diam phasellus vestibulum lorem sed. adipiscing diam donec adipiscing tristique risus. sit amet commodo nulla facilisi nullam vehicula ipsum. amet consectetur adipiscing elit ut aliquam purus sit. id diam vel quam elementum pulvinar etiam non quam. nulla pharetra diam sit amet nisl suscipit adipiscing bibendum. massa tempor nec feugiat nisl pretium fusce id. [SEP]']" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chunked = [\n", + " bert_tokenizer.decode(tokens_chunk)\n", + " for tokens_chunk in chunk(bert_tokenizer.encode(long_text))\n", + "]\n", + "chunked" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now these chunks can be indexed and we can be sure the semantic search model consideres our whole text." 
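The indexing step itself is not part of this notebook, so the following is only a rough sketch of what it could look like. It assumes a reachable Elasticsearch cluster with the ELSER model deployed and an ingest pipeline, here hypothetically named `elser-pipeline`, that runs inference on the `text` field; the index, pipeline and field names are all illustrative, not prescriptive:

```python
# Hypothetical indexing of the chunks computed above. Assumes an Elasticsearch
# cluster with ELSER deployed and an ingest pipeline "elser-pipeline" that
# writes the sparse token weights for the "text" field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your deployment

for i, chunk_text in enumerate(chunked):
    es.index(
        index="my-chunked-index",
        id=f"long-text-{i}",        # one document per 510-token chunk
        pipeline="elser-pipeline",  # runs ELSER inference at ingest time
        document={"text": chunk_text},
    )
```

With one document per chunk, a query against the resulting sparse vectors matches whichever chunk is semantically closest, so no part of the long text falls outside the model's window.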
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + }, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 3158306171b30c46fe8e320ac7c1702f6f95518b Mon Sep 17 00:00:00 2001 From: Max Jakob Date: Thu, 25 Jan 2024 12:11:38 +0100 Subject: [PATCH 2/9] remove comment --- notebooks/search/tokenization.ipynb | 1 - 1 file changed, 1 deletion(-) diff --git a/notebooks/search/tokenization.ipynb b/notebooks/search/tokenization.ipynb index 95fb92b0..809f1b60 100644 --- a/notebooks/search/tokenization.ipynb +++ b/notebooks/search/tokenization.ipynb @@ -201,7 +201,6 @@ "source": [ "We can observe:\n", "- There are special tokens `[CLS]` and `[SEP]` to model the beginning and end of the text. These two extra tokens will become relevant below.\n", - "- All tokens are lower-cased.\n", "- Punctuations are they own tokens.\n", "- Compound words are split into multiple tokens, for example `hitmen` becomes `hit` and `##men`.\n", "\n", From e24a7b58d4afd88084d8f490768d49e2714700c3 Mon Sep 17 00:00:00 2001 From: Max Jakob Date: Thu, 25 Jan 2024 14:01:19 +0100 Subject: [PATCH 3/9] exclude special tokens before decoding --- notebooks/search/tokenization.ipynb | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/notebooks/search/tokenization.ipynb b/notebooks/search/tokenization.ipynb index 809f1b60..61958075 100644 --- a/notebooks/search/tokenization.ipynb +++ b/notebooks/search/tokenization.ipynb @@ -254,31 +254,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next we tokenize the long text, create chunks of size 510 tokens and map the tokens back to text. Notice that on the first run the BERT tokenizer itself is warning us about the 512 tokens limitation." + "Next we tokenize the long text, exclude the special tokens, create chunks of size 510 tokens and map the tokens back to text. Notice that on the first run the BERT tokenizer itself is warning us about the 512 tokens limitation." ] }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "['[CLS] lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. est pellentesque elit ullamcorper dignissim. sit amet cursus sit amet dictum sit amet. enim neque volutpat ac tincidunt vitae semper quis lectus. nulla facilisi etiam dignissim diam quis enim lobortis. id velit ut tortor pretium. ut tortor pretium viverra suspendisse potenti nullam ac tortor. senectus et netus et malesuada fames ac. sed faucibus turpis in eu. maecenas ultricies mi eget mauris pharetra. in iaculis nunc sed augue. sit amet cursus sit amet dictum. sit amet luctus venenatis lectus magna. adipiscing tristique risus nec feugiat. nisi quis eleifend quam adipiscing vitae proin sagittis nisl rhoncus. scelerisque varius morbi enim nunc faucibus a. purus semper eget duis at tellus at. cursus metus aliquam eleifend mi. tristique senectus et netus et malesuada fames. netus et malesuada fames ac. viverra aliquet eget sit amet tellus cras. 
hac habitasse platea dictumst vestibulum rhoncus est pellentesque elit. molestie ac feugiat sed lectus vestibulum mattis. etiam erat velit scelerisque in dictum non. dolor sit amet consectetur adipiscing elit duis tristique sollicitudin nibh. diam vulputate ut pharetra sit amet aliquam id. arcu non sodales neque sodales ut etiam sit. neque vitae tempus quam pellentesque nec nam. amet porttitor eget dolor morbi non arcu risus quis. vitae semper qui',\n", - " '##s lectus nulla at volutpat diam ut. blandit volutpat maecenas volutpat blandit aliquam. lobortis elementum nibh tellus molestie nunc. lectus arcu bibendum at varius vel pharetra vel turpis nunc. in hac habitasse platea dictumst. vitae suscipit tellus mauris a diam maecenas. mi eget mauris pharetra et. habitant morbi tristique senectus et netus. eu lobortis elementum nibh tellus molestie nunc non. scelerisque varius morbi enim nunc faucibus a. tincidunt arcu non sodales neque sodales ut etiam sit amet. tellus integer feugiat scelerisque varius. magna fermentum iaculis eu non diam phasellus vestibulum lorem. eget nunc lobortis mattis aliquam faucibus. dignissim sodales ut eu sem integer vitae justo eget. urna id volutpat lacus laoreet. mauris nunc congue nisi vitae suscipit tellus mauris a diam. scelerisque in dictum non consectetur a erat nam at lectus. neque sodales ut etiam sit amet nisl. blandit cursus risus at ultrices. scelerisque mauris pellentesque pulvinar pellentesque habitant morbi tristique senectus et. cursus vitae congue mauris rhoncus aenean vel elit scelerisque. lobortis feugiat vivamus at augue eget arcu dictum. sagittis orci a scelerisque purus semper eget duis at. ornare suspendisse sed nisi lacus sed viverra tellus in hac. massa sapien faucibus et molestie. vulputate odio ut enim blandit volutpat maecenas volutpat. mauris rhoncus aenean vel elit scelerisque mauris pellentesque',\n", - " 'pulvinar pellentesque. massa sapien faucibus et molestie ac. orci porta non pulvinar neque laoreet suspendisse interdum consectetur. mauris commodo quis imperdiet massa. volutpat consequat mauris nunc congue nisi vitae suscipit. malesuada fames ac turpis egestas maecenas pharetra convallis. cursus risus at ultrices mi tempus imperdiet. non enim praesent elementum facilisis leo vel fringilla est. felis bibendum ut tristique et. felis donec et odio pellentesque diam volutpat commodo sed egestas. ut porttitor leo a diam sollicitudin tempor id eu. dolor purus non enim praesent. tortor aliquam nulla facilisi cras. rhoncus dolor purus non enim. sed vulputate odio ut enim blandit volutpat maecenas. consequat semper viverra nam libero justo laoreet. eget nunc scelerisque viverra mauris. id cursus metus aliquam eleifend mi in nulla. mattis molestie a iaculis at erat pellentesque adipiscing. enim nec dui nunc mattis. hendrerit gravida rutrum quisque non tellus orci ac. fermentum iaculis eu non diam phasellus vestibulum lorem sed. adipiscing diam donec adipiscing tristique risus. sit amet commodo nulla facilisi nullam vehicula ipsum. amet consectetur adipiscing elit ut aliquam purus sit. id diam vel quam elementum pulvinar etiam non quam. nulla pharetra diam sit amet nisl suscipit adipiscing bibendum. massa tempor nec feugiat nisl pretium fusce id. [SEP]']" + "['lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. est pellentesque elit ullamcorper dignissim. sit amet cursus sit amet dictum sit amet. enim neque volutpat ac tincidunt vitae semper quis lectus. 
nulla facilisi etiam dignissim diam quis enim lobortis. id velit ut tortor pretium. ut tortor pretium viverra suspendisse potenti nullam ac tortor. senectus et netus et malesuada fames ac. sed faucibus turpis in eu. maecenas ultricies mi eget mauris pharetra. in iaculis nunc sed augue. sit amet cursus sit amet dictum. sit amet luctus venenatis lectus magna. adipiscing tristique risus nec feugiat. nisi quis eleifend quam adipiscing vitae proin sagittis nisl rhoncus. scelerisque varius morbi enim nunc faucibus a. purus semper eget duis at tellus at. cursus metus aliquam eleifend mi. tristique senectus et netus et malesuada fames. netus et malesuada fames ac. viverra aliquet eget sit amet tellus cras. hac habitasse platea dictumst vestibulum rhoncus est pellentesque elit. molestie ac feugiat sed lectus vestibulum mattis. etiam erat velit scelerisque in dictum non. dolor sit amet consectetur adipiscing elit duis tristique sollicitudin nibh. diam vulputate ut pharetra sit amet aliquam id. arcu non sodales neque sodales ut etiam sit. neque vitae tempus quam pellentesque nec nam. amet porttitor eget dolor morbi non arcu risus quis. vitae semper quis',\n", + " 'lectus nulla at volutpat diam ut. blandit volutpat maecenas volutpat blandit aliquam. lobortis elementum nibh tellus molestie nunc. lectus arcu bibendum at varius vel pharetra vel turpis nunc. in hac habitasse platea dictumst. vitae suscipit tellus mauris a diam maecenas. mi eget mauris pharetra et. habitant morbi tristique senectus et netus. eu lobortis elementum nibh tellus molestie nunc non. scelerisque varius morbi enim nunc faucibus a. tincidunt arcu non sodales neque sodales ut etiam sit amet. tellus integer feugiat scelerisque varius. magna fermentum iaculis eu non diam phasellus vestibulum lorem. eget nunc lobortis mattis aliquam faucibus. dignissim sodales ut eu sem integer vitae justo eget. urna id volutpat lacus laoreet. mauris nunc congue nisi vitae suscipit tellus mauris a diam. scelerisque in dictum non consectetur a erat nam at lectus. neque sodales ut etiam sit amet nisl. blandit cursus risus at ultrices. scelerisque mauris pellentesque pulvinar pellentesque habitant morbi tristique senectus et. cursus vitae congue mauris rhoncus aenean vel elit scelerisque. lobortis feugiat vivamus at augue eget arcu dictum. sagittis orci a scelerisque purus semper eget duis at. ornare suspendisse sed nisi lacus sed viverra tellus in hac. massa sapien faucibus et molestie. vulputate odio ut enim blandit volutpat maecenas volutpat. mauris rhoncus aenean vel elit scelerisque mauris pellentesque pu',\n", + " '##lvinar pellentesque. massa sapien faucibus et molestie ac. orci porta non pulvinar neque laoreet suspendisse interdum consectetur. mauris commodo quis imperdiet massa. volutpat consequat mauris nunc congue nisi vitae suscipit. malesuada fames ac turpis egestas maecenas pharetra convallis. cursus risus at ultrices mi tempus imperdiet. non enim praesent elementum facilisis leo vel fringilla est. felis bibendum ut tristique et. felis donec et odio pellentesque diam volutpat commodo sed egestas. ut porttitor leo a diam sollicitudin tempor id eu. dolor purus non enim praesent. tortor aliquam nulla facilisi cras. rhoncus dolor purus non enim. sed vulputate odio ut enim blandit volutpat maecenas. consequat semper viverra nam libero justo laoreet. eget nunc scelerisque viverra mauris. id cursus metus aliquam eleifend mi in nulla. mattis molestie a iaculis at erat pellentesque adipiscing. enim nec dui nunc mattis. 
hendrerit gravida rutrum quisque non tellus orci ac. fermentum iaculis eu non diam phasellus vestibulum lorem sed. adipiscing diam donec adipiscing tristique risus. sit amet commodo nulla facilisi nullam vehicula ipsum. amet consectetur adipiscing elit ut aliquam purus sit. id diam vel quam elementum pulvinar etiam non quam. nulla pharetra diam sit amet nisl suscipit adipiscing bibendum. massa tempor nec feugiat nisl pretium fusce id.']" ] }, - "execution_count": 44, + "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "tokens = bert_tokenizer.encode(long_text)[1:-1] # exclude special tokens at beginning and end\n", "chunked = [\n", " bert_tokenizer.decode(tokens_chunk)\n", - " for tokens_chunk in chunk(bert_tokenizer.encode(long_text))\n", + " for tokens_chunk in chunk(tokens)\n", "]\n", "chunked" ] From 41741c3a517d3e02fc325cfecf3e039d20c2ceb5 Mon Sep 17 00:00:00 2001 From: Max Jakob Date: Thu, 25 Jan 2024 14:27:03 +0100 Subject: [PATCH 4/9] improve descriptions --- notebooks/search/tokenization.ipynb | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/notebooks/search/tokenization.ipynb b/notebooks/search/tokenization.ipynb index 61958075..e66f7cfc 100644 --- a/notebooks/search/tokenization.ipynb +++ b/notebooks/search/tokenization.ipynb @@ -11,11 +11,11 @@ "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/tokenization.ipynb)\n", "\n", - "Elasticsearch offers some [semantic search](https://www.elastic.co/what-is/semantic-search) models, most notably [ELSER](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) and [E5](https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model), to search through documents in a _menaningful_ way. Part of the process is breaking up texts (both for indexing documents and for queries) into tokens. Tokens are commonly thought of as words, but this is not accurate. Other substrings in the text also carry meaning to the semantic models and therefore have to be split out separately. For ELSER, our English-only model, this is done with the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) tokenizer.\n", + "Elasticsearch offers [semantic search](https://www.elastic.co/what-is/semantic-search) models, most notably [ELSER](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) and [E5](https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model), to search through documents in a way that takes the text's meaning into account. Part of the semantic search process is breaking up texts into tokens (both for documents and for queries). Tokens are commonly thought of as words, but this is not completely accurate. Different semantic models use different concepts of tokens. Many treat punctuation separately and some break up compound words. For example ELSER (our English language model) uses the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) tokenizer.\n", "\n", - "For Elasticsearch users it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. 
This means that when you index longer texts, all tokens after the 512th are ignored in your semantic search. Hence it is valuable to know the number of tokens for your input texts before choosing the right model and indexing method.\n", "\n", "Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing. Currently (as of version 8.12) this has to be done by the user. Future versions will remove this necessity and Elasticsearch will automatically create chunks behind the scenes." ] }, { @@ -192,7 +192,7 @@ "print()\n", "\n", "movie_tokens = bert_tokenizer.encode(example_movie)\n", - "print(str([bert_tokenizer.decode([t]) for t in movie_tokens]))\n" + "print(str([bert_tokenizer.decode([t]) for t in movie_tokens]))" ] }, { @@ -201,7 +201,7 @@ "source": [ "We can observe:\n", "- There are special tokens `[CLS]` and `[SEP]` to model the beginning and end of the text. These two extra tokens will become relevant below.\n", - "- Punctuations are they own tokens.\n", + "- Punctuations are their own tokens.\n", "- Compound words are split into multiple tokens, for example `hitmen` becomes `hit` and `##men`.\n", "\n", "Given this behavior, it is easy to see how longer texts yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above." @@ -215,7 +215,7 @@ "\n", "We saw how to count the number of tokens using the tokenizers from different models. ELSER uses the BERT tokenizer, so when using `.elser_model_2` it internally splits the text with this method.\n", "\n", "Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw."
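To make the limit concrete, a small check (assuming only the `bert_tokenizer` and `movies` objects defined earlier) can flag input texts that would be silently truncated:

```python
# Flag texts whose BERT token count (including [CLS] and [SEP]) exceeds
# the 512-token window that ELSER considers per field.
def exceeds_token_limit(text, limit=512):
    return len(bert_tokenizer.encode(text)) > limit

too_long = [movie["plot"] for movie in movies if exceeds_token_limit(movie["plot"])]
print(f"{len(too_long)} of {len(movies)} plots would lose tokens at indexing time")
```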
] }, { @@ -235,7 +235,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Loading a longer example text:" + "Here we load a longer example text:" ] }, { @@ -254,7 +254,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next we tokenize the long text, exclude the special tokens, create chunks of size 510 tokens and map the tokens back to text. Notice that on the first run the BERT tokenizer itself is warning us about the 512 tokens limitation." + "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts. Notice that on the first run of this cell the BERT tokenizer itself is warning us about the 512 tokens limitation of the model." ] }, { @@ -288,8 +288,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now these chunks can be indexed and we can be sure the semantic search model consideres our whole text." + "---\n", + "And there we go. Now these chunks can be indexed together on the same document in a [nested field](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and we can be sure the semantic search model considers our whole text." ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] } ], "metadata": { From f22766a63ef6127e074952ae39b36e4e5a180e56 Mon Sep 17 00:00:00 2001 From: Max Jakob Date: Thu, 25 Jan 2024 15:58:59 +0100 Subject: [PATCH 5/9] add overlap; remove text file --- notebooks/search/lorem-ipsum.txt | 1 - notebooks/search/tokenization.ipynb | 60 +++++++++++++---------------- 2 files changed, 26 insertions(+), 35 deletions(-) delete mode 100644 notebooks/search/lorem-ipsum.txt diff --git a/notebooks/search/lorem-ipsum.txt b/notebooks/search/lorem-ipsum.txt deleted file mode 100644 index c69ddaad..00000000 --- a/notebooks/search/lorem-ipsum.txt +++ /dev/null @@ -1 +0,0 @@ -Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Est pellentesque elit ullamcorper dignissim. Sit amet cursus sit amet dictum sit amet. Enim neque volutpat ac tincidunt vitae semper quis lectus. Nulla facilisi etiam dignissim diam quis enim lobortis. Id velit ut tortor pretium. Ut tortor pretium viverra suspendisse potenti nullam ac tortor. Senectus et netus et malesuada fames ac. Sed faucibus turpis in eu. Maecenas ultricies mi eget mauris pharetra. In iaculis nunc sed augue. Sit amet cursus sit amet dictum. Sit amet luctus venenatis lectus magna. Adipiscing tristique risus nec feugiat. Nisi quis eleifend quam adipiscing vitae proin sagittis nisl rhoncus. Scelerisque varius morbi enim nunc faucibus a. Purus semper eget duis at tellus at. Cursus metus aliquam eleifend mi. Tristique senectus et netus et malesuada fames. Netus et malesuada fames ac. Viverra aliquet eget sit amet tellus cras. Hac habitasse platea dictumst vestibulum rhoncus est pellentesque elit. Molestie ac feugiat sed lectus vestibulum mattis. Etiam erat velit scelerisque in dictum non. Dolor sit amet consectetur adipiscing elit duis tristique sollicitudin nibh. Diam vulputate ut pharetra sit amet aliquam id. Arcu non sodales neque sodales ut etiam sit. Neque vitae tempus quam pellentesque nec nam. Amet porttitor eget dolor morbi non arcu risus quis. Vitae semper quis lectus nulla at volutpat diam ut. Blandit volutpat maecenas volutpat blandit aliquam. Lobortis elementum nibh tellus molestie nunc. Lectus arcu bibendum at varius vel pharetra vel turpis nunc. In hac habitasse platea dictumst. 
Vitae suscipit tellus mauris a diam maecenas. Mi eget mauris pharetra et. Habitant morbi tristique senectus et netus. Eu lobortis elementum nibh tellus molestie nunc non. Scelerisque varius morbi enim nunc faucibus a. Tincidunt arcu non sodales neque sodales ut etiam sit amet. Tellus integer feugiat scelerisque varius. Magna fermentum iaculis eu non diam phasellus vestibulum lorem. Eget nunc lobortis mattis aliquam faucibus. Dignissim sodales ut eu sem integer vitae justo eget. Urna id volutpat lacus laoreet. Mauris nunc congue nisi vitae suscipit tellus mauris a diam. Scelerisque in dictum non consectetur a erat nam at lectus. Neque sodales ut etiam sit amet nisl. Blandit cursus risus at ultrices. Scelerisque mauris pellentesque pulvinar pellentesque habitant morbi tristique senectus et. Cursus vitae congue mauris rhoncus aenean vel elit scelerisque. Lobortis feugiat vivamus at augue eget arcu dictum. Sagittis orci a scelerisque purus semper eget duis at. Ornare suspendisse sed nisi lacus sed viverra tellus in hac. Massa sapien faucibus et molestie. Vulputate odio ut enim blandit volutpat maecenas volutpat. Mauris rhoncus aenean vel elit scelerisque mauris pellentesque pulvinar pellentesque. Massa sapien faucibus et molestie ac. Orci porta non pulvinar neque laoreet suspendisse interdum consectetur. Mauris commodo quis imperdiet massa. Volutpat consequat mauris nunc congue nisi vitae suscipit. Malesuada fames ac turpis egestas maecenas pharetra convallis. Cursus risus at ultrices mi tempus imperdiet. Non enim praesent elementum facilisis leo vel fringilla est. Felis bibendum ut tristique et. Felis donec et odio pellentesque diam volutpat commodo sed egestas. Ut porttitor leo a diam sollicitudin tempor id eu. Dolor purus non enim praesent. Tortor aliquam nulla facilisi cras. Rhoncus dolor purus non enim. Sed vulputate odio ut enim blandit volutpat maecenas. Consequat semper viverra nam libero justo laoreet. Eget nunc scelerisque viverra mauris. Id cursus metus aliquam eleifend mi in nulla. Mattis molestie a iaculis at erat pellentesque adipiscing. Enim nec dui nunc mattis. Hendrerit gravida rutrum quisque non tellus orci ac. Fermentum iaculis eu non diam phasellus vestibulum lorem sed. Adipiscing diam donec adipiscing tristique risus. Sit amet commodo nulla facilisi nullam vehicula ipsum. Amet consectetur adipiscing elit ut aliquam purus sit. Id diam vel quam elementum pulvinar etiam non quam. Nulla pharetra diam sit amet nisl suscipit adipiscing bibendum. Massa tempor nec feugiat nisl pretium fusce id. \ No newline at end of file diff --git a/notebooks/search/tokenization.ipynb b/notebooks/search/tokenization.ipynb index e66f7cfc..01dc421f 100644 --- a/notebooks/search/tokenization.ipynb +++ b/notebooks/search/tokenization.ipynb @@ -54,19 +54,9 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/maxjakob/.pyenv/versions/3.11.7/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n", - "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. 
Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n" - ] - } - ], + "outputs": [], "source": [ "import json\n", "from urllib.request import urlopen\n", @@ -128,7 +118,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -173,7 +163,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -215,19 +205,23 @@ "\n", "We saw how to count the number of tokens using the tokenizers from different models. ELSER uses the BERT tokenizer, so when using `.elser_model_2` it internally splits the text with this method.\n", "\n", - "Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw." + "Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw.\n", + "\n", + "Furthermore, it is best practice to make the chunks overlap (**TODO add reference**). With ELSER, we recommend 50% token overlap (i.e. a 256 token stride)." ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "SEMANTIC_SEARCH_TOKEN_LIMIT = 510 # 512 minus space for the 2 special tokens\n", "\n", "def chunk(tokens, chunk_size=SEMANTIC_SEARCH_TOKEN_LIMIT):\n", - " for i in range(0, len(tokens), chunk_size):\n", + " step_size = round(chunk_size * .5) # 50% token overlap between chunks is recommended for ELSER\n", + "\n", + " for i in range(0, len(tokens), step_size):\n", " yield tokens[i:i+chunk_size]" ] }, @@ -240,43 +234,46 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ - "# url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/lorem-ipsum.txt\"\n", - "# response = urlopen(url)\n", - "response = open(\"./lorem-ipsum.txt\") # TODO remove in favor of download\n", - "long_text = response.read()" + "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/book_summaries_1000_chunked.json\"\n", + "response = urlopen(url)\n", + "book_summaries = json.load(response)\n", + "\n", + "long_text = book_summaries[0][\"synopsis\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts. Notice that on the first run of this cell the BERT tokenizer itself is warning us about the 512 tokens limitation of the model." + "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts." 
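To see the effect of the 50% overlap independent of any model, the same slicing logic can be run with toy numbers (chunk size 6, step size 3):

```python
# Toy illustration of the overlapping chunk() logic with small numbers.
tokens = list(range(10))
chunk_size, step_size = 6, 3  # 50% overlap, mirroring the function above
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), step_size)]
print(chunks)
# [[0, 1, 2, 3, 4, 5], [3, 4, 5, 6, 7, 8], [6, 7, 8, 9], [9]]
```

Note that the trailing chunks get progressively shorter; in practice one might drop or merge very short trailing chunks so that the last piece still carries enough context on its own.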
] }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "['lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. est pellentesque elit ullamcorper dignissim. sit amet cursus sit amet dictum sit amet. enim neque volutpat ac tincidunt vitae semper quis lectus. nulla facilisi etiam dignissim diam quis enim lobortis. id velit ut tortor pretium. ut tortor pretium viverra suspendisse potenti nullam ac tortor. senectus et netus et malesuada fames ac. sed faucibus turpis in eu. maecenas ultricies mi eget mauris pharetra. in iaculis nunc sed augue. sit amet cursus sit amet dictum. sit amet luctus venenatis lectus magna. adipiscing tristique risus nec feugiat. nisi quis eleifend quam adipiscing vitae proin sagittis nisl rhoncus. scelerisque varius morbi enim nunc faucibus a. purus semper eget duis at tellus at. cursus metus aliquam eleifend mi. tristique senectus et netus et malesuada fames. netus et malesuada fames ac. viverra aliquet eget sit amet tellus cras. hac habitasse platea dictumst vestibulum rhoncus est pellentesque elit. molestie ac feugiat sed lectus vestibulum mattis. etiam erat velit scelerisque in dictum non. dolor sit amet consectetur adipiscing elit duis tristique sollicitudin nibh. diam vulputate ut pharetra sit amet aliquam id. arcu non sodales neque sodales ut etiam sit. neque vitae tempus quam pellentesque nec nam. amet porttitor eget dolor morbi non arcu risus quis. vitae semper quis',\n", - " 'lectus nulla at volutpat diam ut. blandit volutpat maecenas volutpat blandit aliquam. lobortis elementum nibh tellus molestie nunc. lectus arcu bibendum at varius vel pharetra vel turpis nunc. in hac habitasse platea dictumst. vitae suscipit tellus mauris a diam maecenas. mi eget mauris pharetra et. habitant morbi tristique senectus et netus. eu lobortis elementum nibh tellus molestie nunc non. scelerisque varius morbi enim nunc faucibus a. tincidunt arcu non sodales neque sodales ut etiam sit amet. tellus integer feugiat scelerisque varius. magna fermentum iaculis eu non diam phasellus vestibulum lorem. eget nunc lobortis mattis aliquam faucibus. dignissim sodales ut eu sem integer vitae justo eget. urna id volutpat lacus laoreet. mauris nunc congue nisi vitae suscipit tellus mauris a diam. scelerisque in dictum non consectetur a erat nam at lectus. neque sodales ut etiam sit amet nisl. blandit cursus risus at ultrices. scelerisque mauris pellentesque pulvinar pellentesque habitant morbi tristique senectus et. cursus vitae congue mauris rhoncus aenean vel elit scelerisque. lobortis feugiat vivamus at augue eget arcu dictum. sagittis orci a scelerisque purus semper eget duis at. ornare suspendisse sed nisi lacus sed viverra tellus in hac. massa sapien faucibus et molestie. vulputate odio ut enim blandit volutpat maecenas volutpat. mauris rhoncus aenean vel elit scelerisque mauris pellentesque pu',\n", - " '##lvinar pellentesque. massa sapien faucibus et molestie ac. orci porta non pulvinar neque laoreet suspendisse interdum consectetur. mauris commodo quis imperdiet massa. volutpat consequat mauris nunc congue nisi vitae suscipit. malesuada fames ac turpis egestas maecenas pharetra convallis. cursus risus at ultrices mi tempus imperdiet. non enim praesent elementum facilisis leo vel fringilla est. felis bibendum ut tristique et. felis donec et odio pellentesque diam volutpat commodo sed egestas. 
ut porttitor leo a diam sollicitudin tempor id eu. dolor purus non enim praesent. tortor aliquam nulla facilisi cras. rhoncus dolor purus non enim. sed vulputate odio ut enim blandit volutpat maecenas. consequat semper viverra nam libero justo laoreet. eget nunc scelerisque viverra mauris. id cursus metus aliquam eleifend mi in nulla. mattis molestie a iaculis at erat pellentesque adipiscing. enim nec dui nunc mattis. hendrerit gravida rutrum quisque non tellus orci ac. fermentum iaculis eu non diam phasellus vestibulum lorem sed. adipiscing diam donec adipiscing tristique risus. sit amet commodo nulla facilisi nullam vehicula ipsum. amet consectetur adipiscing elit ut aliquam purus sit. id diam vel quam elementum pulvinar etiam non quam. nulla pharetra diam sit amet nisl suscipit adipiscing bibendum. massa tempor nec feugiat nisl pretium fusce id.']" + "['old major, the old boar on the manor farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song,\\'beasts of england \\'. when major dies, two young pigs, snowball and napoleon, assume command and turn his dream into a philosophy. the animals revolt and drive the drunken and irresponsible mr jones from the farm, renaming it \" animal farm \". they adopt seven commandments of animal - ism, the most important of which is, \" all animals are equal \". snowball attempts to teach the animals reading and writing ; food is plentiful, and the farm runs smoothly. the pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. napoleon takes the pups from the farm dogs and trains them privately. napoleon and snowball struggle for leadership. when snowball announces his plans to build a windmill, napoleon has his dogs chase snowball away and declares himself leader. napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs, who will run the farm. using a young pig named squealer as a \" mouthpiece \", napoleon claims credit for the windmill idea. the animals work harder with the promise of easier lives with the windmill. after a violent storm, the animals find the windmill annihilated. napoleon and squealer convince the animals that snowball destroyed it, although the scorn of the neighbouring farmers suggests that its walls were too thin. once snowball becomes a scapegoat, napoleon begins purging the farm with his dogs, killing animals he accuses of consorting with his old rival. he and the pigs abuse their power, imposing more control while reserving privileges for themselves and rewriting history, villainising snowball and glorifying napoleon. squealer justifies every statement napoleon makes, even the pigs\\'alteration of the seven commandments of animalism to benefit themselves.\\'beasts of england\\'is replaced by an anthem glorifying napoleon, who appears to be adopting the lifestyle of a man. the animals remain convinced that they are better off than they were when under mr jones. squealer abuses the animals\\'poor memories and invents numbers to show their improvement. mr frederick, one of the neighbouring farmers, attacks the farm, using blasting powder to blow up the restored windmill. though the animals win the battle, they do so at great',\n", + " 'windmill idea. the animals work harder with the promise of easier lives with the windmill. after a violent storm, the animals find the windmill annihilated. 
napoleon and squealer convince the animals that snowball destroyed it, although the scorn of the neighbouring farmers suggests that its walls were too thin. once snowball becomes a scapegoat, napoleon begins purging the farm with his dogs, killing animals he accuses of consorting with his old rival. he and the pigs abuse their power, imposing more control while reserving privileges for themselves and rewriting history, villainising snowball and glorifying napoleon. squealer justifies every statement napoleon makes, even the pigs\\'alteration of the seven commandments of animalism to benefit themselves.\\'beasts of england\\'is replaced by an anthem glorifying napoleon, who appears to be adopting the lifestyle of a man. the animals remain convinced that they are better off than they were when under mr jones. squealer abuses the animals\\'poor memories and invents numbers to show their improvement. mr frederick, one of the neighbouring farmers, attacks the farm, using blasting powder to blow up the restored windmill. though the animals win the battle, they do so at great cost, as many, including boxer the workhorse, are wounded. despite his injuries, boxer continues working harder and harder, until he collapses while working on the windmill. napoleon sends for a van to take boxer to the veterinary surgeon\\'s, explaining that better care can be given there. benjamin, the cynical donkey, who \" could read as well as any pig \", notices that the van belongs to a knacker, and attempts to mount a rescue ; but the animals\\'attempts are futile. squealer reports that the van was purchased by the hospital and the writing from the previous owner had not been repainted. he recounts a tale of boxer\\'s death in the hands of the best medical care. years pass, and the pigs learn to walk upright, carry whips and wear clothes. the seven commandments are reduced to a single phrase : \" all animals are equal, but some animals are more equal than others \". napoleon holds a dinner party for the pigs and the humans of the area, who congratulate napoleon on having the hardest - working but least fed animals in the country. napoleon announces an alliance with the humans, against the labouring classes of both \" worlds \". he abolishes practices and traditions related to the revolution,',\n", + " 'cost, as many, including boxer the workhorse, are wounded. despite his injuries, boxer continues working harder and harder, until he collapses while working on the windmill. napoleon sends for a van to take boxer to the veterinary surgeon\\'s, explaining that better care can be given there. benjamin, the cynical donkey, who \" could read as well as any pig \", notices that the van belongs to a knacker, and attempts to mount a rescue ; but the animals\\'attempts are futile. squealer reports that the van was purchased by the hospital and the writing from the previous owner had not been repainted. he recounts a tale of boxer\\'s death in the hands of the best medical care. years pass, and the pigs learn to walk upright, carry whips and wear clothes. the seven commandments are reduced to a single phrase : \" all animals are equal, but some animals are more equal than others \". napoleon holds a dinner party for the pigs and the humans of the area, who congratulate napoleon on having the hardest - working but least fed animals in the country. napoleon announces an alliance with the humans, against the labouring classes of both \" worlds \". 
he abolishes practices and traditions related to the revolution, and changes the name of the farm to \" the manor farm \". the animals, overhearing the conversation, notice that the faces of the pigs have begun changing. during a poker match, an argument breaks out between napoleon and mr pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. the pigs snowball, napoleon, and squealer adapt old major\\'s ideas into an actual philosophy, which they formally name animalism. soon after, napoleon and squealer indulge in the vices of humans ( drinking alcohol, sleeping in beds, trading ). squealer is employed to alter the seven commandments to account for this humanisation, an allusion to the soviet government\\'s revising of history in order to exercise control of the people\\'s beliefs about themselves and their society. the original commandments are : # whatever goes upon two legs is an enemy. # whatever goes upon four legs, or has wings, is a friend. # no animal shall wear clothes. # no animal shall sleep in a bed. # no animal shall drink alcohol. # no animal shall kill any other animal. # all animals are equal.',\n", + " 'and changes the name of the farm to \" the manor farm \". the animals, overhearing the conversation, notice that the faces of the pigs have begun changing. during a poker match, an argument breaks out between napoleon and mr pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. the pigs snowball, napoleon, and squealer adapt old major\\'s ideas into an actual philosophy, which they formally name animalism. soon after, napoleon and squealer indulge in the vices of humans ( drinking alcohol, sleeping in beds, trading ). squealer is employed to alter the seven commandments to account for this humanisation, an allusion to the soviet government\\'s revising of history in order to exercise control of the people\\'s beliefs about themselves and their society. the original commandments are : # whatever goes upon two legs is an enemy. # whatever goes upon four legs, or has wings, is a friend. # no animal shall wear clothes. # no animal shall sleep in a bed. # no animal shall drink alcohol. # no animal shall kill any other animal. # all animals are equal. later, napoleon and his pigs secretly revise some commandments to clear them of accusations of law - breaking ( such as \" no animal shall drink alcohol \" having \" to excess \" appended to it and \" no animal shall sleep in a bed \" with \" with sheets \" added to it ). the changed commandments are as follows, with the changes bolded : * 4 no animal shall sleep in a bed with sheets. * 5 no animal shall drink alcohol to excess. * 6 no animal shall kill any other animal without cause. eventually these are replaced with the maxims, \" all animals are equal, but some animals are more equal than others \", and \" four legs good, two legs better! \" as the pigs become more human. this is an ironic twist to the original purpose of the seven commandments, which were supposed to keep order within animal farm by uniting the animals together against the humans, and prevent animals from following the humans\\'evil habits. 
through the revision of the commandments, orwell demonstrates how simply political dogma can be turned into malleable propaganda.',\n",
       " 'later, napoleon and his pigs secretly revise some commandments to clear them of accusations of law - breaking ( such as \" no animal shall drink alcohol \" having \" to excess \" appended to it and \" no animal shall sleep in a bed \" with \" with sheets \" added to it ). the changed commandments are as follows, with the changes bolded : * 4 no animal shall sleep in a bed with sheets. * 5 no animal shall drink alcohol to excess. * 6 no animal shall kill any other animal without cause. eventually these are replaced with the maxims, \" all animals are equal, but some animals are more equal than others \", and \" four legs good, two legs better! \" as the pigs become more human. this is an ironic twist to the original purpose of the seven commandments, which were supposed to keep order within animal farm by uniting the animals together against the humans, and prevent animals from following the humans\\'evil habits. through the revision of the commandments, orwell demonstrates how simply political dogma can be turned into malleable propaganda.']"
      ]
     },
-    "execution_count": 50,
+    "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "tokens = bert_tokenizer.encode(long_text)[1:-1] # exclude special tokens at beginning and end\n",
+    "tokens = bert_tokenizer.encode(long_text)[1:-1] # exclude special tokens at the beginning and end\n",
     "chunked = [\n",
     "    bert_tokenizer.decode(tokens_chunk)\n",
     "    for tokens_chunk in chunk(tokens)\n",
@@ -291,11 +288,6 @@
     "---\n",
     "And there we go. Now these chunks can be indexed together on the same document in a [nested field](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and we can be sure the semantic search model considers our whole text."
    ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": []
   }
  ],
  "metadata": {

From 4f6dfaab5a8f457c544b758ea7599ff0a829186f Mon Sep 17 00:00:00 2001
From: Max Jakob
Date: Fri, 26 Jan 2024 09:45:02 +0100
Subject: [PATCH 6/9] run nbtest through Makefile

---
 notebooks/search/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/notebooks/search/Makefile b/notebooks/search/Makefile
index 7b9f9755..4b68f1c0 100644
--- a/notebooks/search/Makefile
+++ b/notebooks/search/Makefile
@@ -6,7 +6,8 @@ NOTEBOOKS = \
 	03-ELSER.ipynb \
 	04-multilingual.ipynb \
 	05-query-rules.ipynb \
-	06-synonyms-api.ipynb
+	06-synonyms-api.ipynb \
+	tokenization.ipynb
 
 .PHONY: all $(NOTEBOOKS)

From c956be6d7da5a7ff730bb6a7fecabc1acf71e87a Mon Sep 17 00:00:00 2001
From: Max Jakob
Date: Fri, 26 Jan 2024 10:38:12 +0100
Subject: [PATCH 7/9] reformulate recommendation

---
 notebooks/search/tokenization.ipynb | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/notebooks/search/tokenization.ipynb b/notebooks/search/tokenization.ipynb
index 01dc421f..05a2ce83 100644
--- a/notebooks/search/tokenization.ipynb
+++ b/notebooks/search/tokenization.ipynb
@@ -54,7 +54,7 @@
    },
    {
     "cell_type": "code",
-    "execution_count": null,
+    "execution_count": 2,
     "metadata": {},
     "outputs": [],
     "source": [
@@ -192,9 +192,9 @@
     "We can observe:\n",
     "- There are special tokens `[CLS]` and `[SEP]` to model the beginning and end of the text. 
These two extra tokens will become relevant below.\n",
     "- Punctuation marks are their own tokens.\n",
-    "- Compounds words are split into two tokens, for example `hitmen` becomes `hit` and `##men`.\n",
+    "- Compound words are split into two tokens, for example `hitmen` becomes `hit` and `##men`.\n",
     "\n",
-    "Given this behavior, it is easy to see how longer tests yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above."
+    "Given this behavior, it is easy to see how longer texts yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above."
    ]
   },
   {
@@ -207,7 +207,7 @@
     "\n",
     "Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw.\n",
     "\n",
-    "Furthermore, it is best practice to make the chunks overlap (**TODO add reference**). With ELSER, we recommend 50% token overlap (i.e. a 255 token stride)."
+    "Furthermore, in practice we often see improved performance when using overlapping chunks. With ELSER, we recommend 50% token overlap (i.e. a 255 token stride)."
    ]
   },
   {
@@ -249,14 +249,23 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts."
+    "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts.\n",
+    "\n",
+    "Side note: Be aware that tokenisation involves a normalisation step that strips away [nonspacing marks](https://www.fileformat.info/info/unicode/category/Mn/list.htm). If decoding is implemented as a reverse lookup from token IDs to vocabulary entries, those stripped marks will not be recovered, resulting in decoded text that could be slightly different to the original."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Token indices sequence length is longer than the specified maximum sequence length for this model (1242 > 512). Running this sequence through the model will result in indexing errors\n"
+     ]
+    },
     {
      "data": {
       "text/plain": [
@@ -267,7 +276,7 @@
       " 'later, napoleon and his pigs secretly revise some commandments to clear them of accusations of law - breaking ( such as \" no animal shall drink alcohol \" having \" to excess \" appended to it and \" no animal shall sleep in a bed \" with \" with sheets \" added to it ). the changed commandments are as follows, with the changes bolded : * 4 no animal shall sleep in a bed with sheets. * 5 no animal shall drink alcohol to excess. * 6 no animal shall kill any other animal without cause. eventually these are replaced with the maxims, \" all animals are equal, but some animals are more equal than others \", and \" four legs good, two legs better! \" as the pigs become more human. this is an ironic twist to the original purpose of the seven commandments, which were supposed to keep order within animal farm by uniting the animals together against the humans, and prevent animals from following the humans\\'evil habits. 
through the revision of the commandments, orwell demonstrates how simply political dogma can be turned into malleable propaganda.']"
      ]
     },
-    "execution_count": 10,
+    "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }

From 5c8313b907de17c256cec202914df4e013c7a9ec Mon Sep 17 00:00:00 2001
From: Max Jakob
Date: Fri, 26 Jan 2024 10:44:55 +0100
Subject: [PATCH 8/9] remove paragraph I wanted to omit

---
 notebooks/search/tokenization.ipynb | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/notebooks/search/tokenization.ipynb b/notebooks/search/tokenization.ipynb
index 05a2ce83..cab24aa1 100644
--- a/notebooks/search/tokenization.ipynb
+++ b/notebooks/search/tokenization.ipynb
@@ -249,9 +249,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts.\n",
-    "\n",
-    "Side note: Be aware that tokenisation involves a normalisation step that strips away [nonspacing marks](https://www.fileformat.info/info/unicode/category/Mn/list.htm). If decoding is implemented as a reverse lookup from token IDs to vocabulary entries, those stripped marks will not be recovered, resulting in decoded text that could be slightly different to the original."
+    "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts."
    ]
   },
   {

From f53bbbf45b2d15c8860a6916acbde862a37fb526 Mon Sep 17 00:00:00 2001
From: Max Jakob
Date: Fri, 26 Jan 2024 11:07:29 +0100
Subject: [PATCH 9/9] move to `document-chunking` folder

---
 notebooks/document-chunking/Makefile                       | 1 +
 notebooks/{search => document-chunking}/tokenization.ipynb | 0
 notebooks/search/Makefile                                  | 3 +--
 3 files changed, 2 insertions(+), 2 deletions(-)
 rename notebooks/{search => document-chunking}/tokenization.ipynb (100%)

diff --git a/notebooks/document-chunking/Makefile b/notebooks/document-chunking/Makefile
index bcd601f2..8704a788 100644
--- a/notebooks/document-chunking/Makefile
+++ b/notebooks/document-chunking/Makefile
@@ -1,5 +1,6 @@
 NBTEST = ../../bin/nbtest
 NOTEBOOKS = \
+	tokenization.ipynb \
 	with-index-pipelines.ipynb \
 	with-langchain-splitters.ipynb
 
diff --git a/notebooks/search/tokenization.ipynb b/notebooks/document-chunking/tokenization.ipynb
similarity index 100%
rename from notebooks/search/tokenization.ipynb
rename to notebooks/document-chunking/tokenization.ipynb
diff --git a/notebooks/search/Makefile b/notebooks/search/Makefile
index 4b68f1c0..7b9f9755 100644
--- a/notebooks/search/Makefile
+++ b/notebooks/search/Makefile
@@ -6,8 +6,7 @@ NOTEBOOKS = \
 	03-ELSER.ipynb \
 	04-multilingual.ipynb \
 	05-query-rules.ipynb \
-	06-synonyms-api.ipynb \
-	tokenization.ipynb
+	06-synonyms-api.ipynb
 
 .PHONY: all $(NOTEBOOKS)
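
The `chunked = [...]` cell shown in the diffs above calls a `chunk(tokens)` helper whose definition sits in an earlier notebook cell that none of these hunks touch. For reference, here is a minimal sketch consistent with the 510-token chunks and the 255-token stride recommended in the notebook text; the function name matches the call site, but the signature and defaults are assumptions, not the notebook's exact code:

```python
from typing import Iterator, List


def chunk(tokens: List[int], chunk_size: int = 510, stride: int = 255) -> Iterator[List[int]]:
    """Hypothetical sketch of the notebook's chunking helper.

    Steps through the token list by `stride`, so consecutive chunks
    overlap by `chunk_size - stride` tokens (50% with these defaults).
    The final chunk may be shorter than `chunk_size`.
    """
    for start in range(0, len(tokens), stride):
        yield tokens[start : start + chunk_size]
```

Applied to the 1242-token example above (1240 tokens once `[CLS]` and `[SEP]` are stripped), a helper like this yields the five overlapping chunks seen in the cell output, each small enough to fit the 512-token window after the two special tokens are added back during inference.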