
Commit

Chunking notebooks: mention semantic_text
maxjakob committed Jun 25, 2024
1 parent 2d06d26 commit 48efe85
Showing 3 changed files with 21 additions and 3 deletions.
8 changes: 7 additions & 1 deletion notebooks/document-chunking/tokenization.ipynb
@@ -15,7 +15,13 @@
"\n",
"For users of Elasticsearch it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the 512th are ignored in your semantic search. Hence it is valuable to know the number of tokens for your input texts before choosing the right model and indexing method.\n",
"\n",
-"Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing. Currently (as of version 8.12) this has to be done by the user. Future versions will remove this necessity and Elasticsearch will automatically create chunks behind the scenes."
+"Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing.\n",
+"\n",
+"# Prefer the `semantic_text` field type\n",
+"\n",
+"**Elasticsearch version 8.14 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into this:**\n",
+"\n",
+"**<https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text>**"
]
},
{
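
For readers of this diff, the chunking idea described in the tokenization notebook's introduction can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: `whitespace_tokenize` is a hypothetical stand-in for a real model tokenizer (which would produce different counts), and the `overlap` parameter is an assumption added for illustration. Only the 512-token limit comes from the text above.

```python
from typing import Callable, List

MAX_TOKENS = 512  # per-field limit cited in the notebook text


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer; a real model tokenizer yields different counts."""
    return text.split()


def chunk_by_tokens(
    text: str,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    max_tokens: int = MAX_TOKENS,
    overlap: int = 0,
) -> List[List[str]]:
    """Split text into consecutive token windows of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    tokens = tokenize(text)
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(1200))
    for i, chunk in enumerate(chunk_by_tokens(sample)):
        print(i, len(chunk))
```

With a real tokenizer, the token count of a text is simply `len(tokenize(text))`, so the same function can be used to check whether a field exceeds the limit before indexing.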
8 changes: 7 additions & 1 deletion notebooks/document-chunking/with-index-pipelines.ipynb
@@ -13,7 +13,13 @@
"This interactive notebook will:\n",
"- load the model \"sentence-transformers__all-minilm-l6-v2\" from Hugging Face and into Elasticsearch ML Node\n",
"- create an index and ingest pipeline that will chunk large fields into smaller passages and vectorize them using the model\n",
-"- perform a search and return docs with the most relevant passages"
+"- perform a search and return docs with the most relevant passages\n",
+"\n",
+"# Prefer the `semantic_text` field type\n",
+"\n",
+"**Elasticsearch version 8.14 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into this:**\n",
+"\n",
+"**<https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text>**"
]
},
{
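
As a rough illustration of what the added note recommends, an index mapping that uses `semantic_text` might look like the sketch below. The field name `content` and the endpoint ID `my-inference-endpoint` are hypothetical placeholders, and the sketch assumes an inference endpoint with that ID has already been created. It only builds and prints the request body; it does not contact a cluster.

```python
import json


def semantic_text_mapping(field: str, inference_id: str) -> dict:
    """Build an index-creation body with a single semantic_text field.

    Both arguments are placeholders chosen by the caller; the inference
    endpoint named by inference_id is assumed to exist already.
    """
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "semantic_text",
                    "inference_id": inference_id,
                }
            }
        }
    }


if __name__ == "__main__":
    body = semantic_text_mapping("content", "my-inference-endpoint")
    print(json.dumps(body, indent=2))
```

The printed body could then be sent as the payload of an index-creation request, after which chunking and embedding of the field happen on the Elasticsearch side rather than in an ingest pipeline written by the user.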
8 changes: 7 additions & 1 deletion notebooks/document-chunking/with-langchain-splitters.ipynb
@@ -12,7 +12,13 @@
"This interactive notebook will:\n",
"- load the model \"sentence-transformers__all-minilm-l6-v2\" from Hugging Face and into Elasticsearch ML Node\n",
"- Use LangChain splitters to chunk the passages into sentences and index them into Elasticsearch with nested dense vector\n",
-"- perform a search and return docs with the most relevant passages"
+"- perform a search and return docs with the most relevant passages\n",
+"\n",
+"# Prefer the `semantic_text` field type\n",
+"\n",
+"**Elasticsearch version 8.14 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into this:**\n",
+"\n",
+"**<https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text>**"
]
},
{
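
The manual approach this notebook describes (sentence-level chunks stored as nested dense vectors) could be sketched along these lines. This is not the notebook's code: a simple regular expression stands in for the LangChain splitter, and the field names `passages`, `text`, and `vector` are hypothetical. The 384 dimensions match the output size of all-MiniLM-L6-v2; the vectors themselves would be computed separately and are left out here.

```python
import re
from typing import Dict, List

# Nested mapping: one parent document holds many passage sub-documents.
# Field names are illustrative assumptions, not taken from the notebook.
NESTED_MAPPING: Dict = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "passages": {
                "type": "nested",
                "properties": {
                    "text": {"type": "text"},
                    "vector": {"type": "dense_vector", "dims": 384},
                },
            },
        }
    }
}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; a stand-in for a LangChain text splitter."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_nested_doc(title: str, body: str) -> Dict:
    """Build a parent document whose passages are the sentences of body."""
    return {
        "title": title,
        "passages": [{"text": s} for s in split_sentences(body)],
    }


if __name__ == "__main__":
    print(to_nested_doc("Example", "First sentence. Second one! A third?"))
```

Each passage would still need a `vector` value from the embedding model before indexing, which is the step `semantic_text` takes over.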
