Update notebook to use semantic text search
markjhoy committed Nov 7, 2024
1 parent d706402 commit fcd1391
Showing 1 changed file with 70 additions and 84 deletions.
154 changes: 70 additions & 84 deletions notebooks/enterprise-search/app-search-engine-exporter.ipynb
@@ -129,7 +129,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 4,
"metadata": {
"id": "kpV8K5jHvRK6"
},
@@ -284,12 +284,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we'll start by defining our source and destination indices. We'll also ensure that if the destination index exists, to delete it first so we start fresh."
"First, we'll start by defining our source and destination indices. We'll also ensure that the destination index is deleted if it already exists, so that we start fresh."
]
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
@@ -323,7 +323,7 @@
},
{
"cell_type": "code",
"execution_count": 34,
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
@@ -379,7 +379,7 @@
},
{
"cell_type": "code",
"execution_count": 35,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@@ -469,7 +469,7 @@
},
{
"cell_type": "code",
"execution_count": 36,
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
@@ -499,7 +499,7 @@
},
{
"cell_type": "code",
"execution_count": 37,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
@@ -550,7 +550,8 @@
" \"index_options\": \"freqs\",\n",
" \"analyzer\": \"i_text_base\",\n",
" \"search_analyzer\": \"q_text_base\",\n",
" }"
" }\n",
"\n"
]
},
{
@@ -576,13 +577,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Add `sparse_vector` fields for semantic search (optional)\n",
"# Add semantic text fields for semantic search (optional)\n",
"\n",
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add a `sparse_vector` field to our index, set up an ingest pipeline, and reindex our data.\n",
"One of the advantages of having our exported index directly in Elasticsearch is that we can easily perform semantic search with ELSER. To do this, we'll need to add an inference endpoint that uses ELSER, and a `semantic_text` field to our index that uses that endpoint.\n",
"\n",
"Note that to use this feature, your cluster must have at least one ML node set up with enough resources allocated to it.\n",
"\n",
"Let's first start by adding `sparse_vector` fields to our new index mapping."
"If you have not already, be sure that your ELSER v2 model is [set up and deployed](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html).\n",
"\n",
"Let's start by creating our inference endpoint using the [Create inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html)."
]
},
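The `inference.put` call in the next cell boils down to sending a small JSON body. The endpoint id `elser_inference_endpoint` and the one-allocation/one-thread sizing are just this notebook's choices, so adjust them for your cluster; here is a minimal sketch of building and sanity-checking that body offline:

```python
import json


def elser_endpoint_config(num_allocations=1, num_threads=1):
    """Build the request body for an ELSER sparse_embedding inference endpoint."""
    return {
        "service": "elasticsearch",
        "service_settings": {
            "model_id": ".elser_model_2_linux-x86_64",
            "num_allocations": num_allocations,
            "num_threads": num_threads,
        },
    }


config = elser_endpoint_config()
print(json.dumps(config, indent=2))
```

Raising `num_allocations` trades ML-node memory and CPU for higher embedding throughput during the reindex.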
{
@@ -591,31 +594,34 @@
"metadata": {},
"outputs": [],
"source": [
"# by default we are adding a `sparse_vector` field for all text fields in our engine\n",
"# feel free to modify this list to only include the fields that are relevant\n",
"SPARSE_VECTOR_FIELDS = [\n",
" field_name + \"_semantic\" for field_name in schema if schema[field_name] == \"text\"\n",
"]\n",
"\n",
"sparse_vector_fields = {}\n",
"for field_name in SPARSE_VECTOR_FIELDS:\n",
" # this is added so we can use semantic search with ELSER\n",
" sparse_vector_fields[field_name] = {\"type\": \"sparse_vector\"}\n",
"\n",
"elasticsearch.indices.put_mapping(index=DEST_INDEX, properties=sparse_vector_fields)"
"# delete our inference endpoint if it already exists (ignore the 404 if it does not)\n",
"elasticsearch.options(ignore_status=404).inference.delete(\n",
"    inference_id=\"elser_inference_endpoint\"\n",
")\n",
"\n",
"# and create our endpoint using the ELSER v2 model\n",
"elasticsearch.inference.put(\n",
" inference_id='elser_inference_endpoint',\n",
" inference_config={\n",
" \"service\": \"elasticsearch\",\n",
" \"service_settings\": {\n",
" \"model_id\": \".elser_model_2_linux-x86_64\",\n",
" \"num_allocations\": 1,\n",
" \"num_threads\": 1\n",
" }\n",
" },\n",
" task_type=\"sparse_embedding\"\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup an ingest pipeline using ELSER\n",
"\n",
"> If you have not already deployed ELSER, follow this [guide](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) on how to download and deploy the model. Without this step, you will receive errors below when you run the `reindex` command.\n",
"## Using semantic text fields for ingest and query\n",
"\n",
"Assuming you have downloaded and deployed ELSER in your deployment, we can now define an ingest pipeline that will enrich the documents with the `sparse_vector` fields that can be used with semantic search.\n",
"Next, we'll augment our text fields with `semantic_text` fields in our index. We'll do this by creating a `semantic_text` field and adding a `copy_to` directive on the original source field so its text is copied into the semantic text field.\n",
"\n",
"Also - check to ensure that your ELSER model is deployed and started completely. Your `model_id` below may differ from the model in your cluster, so ensure it is correct before proceeding."
"In the example below, we are using the `description` and `title` fields from our example index to add semantic search on those fields."
]
},
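The loop in the next cell can be factored into a small helper. This is a sketch against a toy mapping; the `_semantic` suffix and the `elser_inference_endpoint` id mirror this notebook's conventions rather than required values:

```python
def add_semantic_fields(mapping, field_names, inference_id):
    """Return a copy of `mapping` with a `semantic_text` companion for each field,
    wired up via `copy_to` so ingested text is embedded automatically."""
    updated = {name: dict(props) for name, props in mapping.items()}
    for field_name in field_names:
        semantic_name = field_name + "_semantic"
        # companion field that holds the inferred embeddings
        updated[semantic_name] = {
            "type": "semantic_text",
            "inference_id": inference_id,
        }
        # copy the original text into the semantic field at ingest time
        updated[field_name]["copy_to"] = semantic_name
    return updated


mapping = {"title": {"type": "text"}, "description": {"type": "text"}}
mapping = add_semantic_fields(
    mapping, ["title", "description"], "elser_inference_endpoint"
)
```

The resulting dict is what gets passed to `put_mapping` in the cell below.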
{
@@ -624,62 +630,43 @@
"metadata": {},
"outputs": [],
"source": [
"PIPELINE = \"elser-ingest-pipeline-\" + ENGINE_NAME\n",
"# by default we are adding a `semantic_text` field for the \"description\" and \"title\" fields in our schema\n",
"# feel free to modify this list to only include the fields that are relevant\n",
"SEMANTIC_TEXT_FIELDS = [\"description\", \"title\"]\n",
"\n",
"# add the semantic_text field to our mapping for each field defined\n",
"for field_name in SEMANTIC_TEXT_FIELDS:\n",
" semantic_field_name = field_name + \"_semantic\"\n",
" mapping[semantic_field_name] = {\n",
" \"type\": \"semantic_text\",\n",
" \"inference_id\": \"elser_inference_endpoint\",\n",
" }\n",
"\n",
"processors = []\n",
"# and for our text fields, add a \"copy_to\" directive to copy the text to the semantic_text field\n",
"for field_name in SEMANTIC_TEXT_FIELDS:\n",
" semantic_field_name = field_name + \"_semantic\"\n",
" mapping[field_name].update({ \"copy_to\": semantic_field_name })\n",
"\n",
"for output_field in SPARSE_VECTOR_FIELDS:\n",
" input_field = output_field.removesuffix(\"_semantic\")\n",
" processors.append(\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \".elser_model_2_linux-x86_64\",\n",
" \"input_output\": [\n",
" {\"input_field\": input_field, \"output_field\": output_field}\n",
" ],\n",
" \"on_failure\": [\n",
" {\n",
" \"append\": {\n",
" \"field\": \"_source._ingest.inference_errors\",\n",
" \"allow_duplicates\": False,\n",
" \"value\": [\n",
" {\n",
" \"message\": \"Processor failed for field '\"\n",
" + input_field\n",
" + \"' with message '{{ _ingest.on_failure_message }}'\",\n",
" \"timestamp\": \"{{{ _ingest.timestamp }}}\",\n",
" }\n",
" ],\n",
" }\n",
" }\n",
" ],\n",
" }\n",
" }\n",
" )\n",
"\n",
"# create the ingest pipeline\n",
"elasticsearch.ingest.put_pipeline(\n",
" id=PIPELINE, description=\"Ingest pipeline for ELSER\", processors=processors\n",
")"
"elasticsearch.indices.put_mapping(index=DEST_INDEX, properties=mapping)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reindex the data\n",
"Now that we have created the Elasticsearch index and the ingest pipeline, it's time to reindex our data in the new index. The pipeline definition we created above will create a field for each of the `SPARSE_VECTOR_FIELDS` we defined with a `_semantic` suffix, and then infer the sparse vector values from ELSER as the reindex takes place."
"Now that we have created the Elasticsearch index, it's time to reindex our data into the new index. If you are using the `semantic_text` fields defined above with a `_semantic` suffix, the reindexing process will automatically infer the sparse vector values from ELSER as the reindex takes place."
]
},
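Because the reindex below is started with `wait_for_completion=False`, the client returns a task handle instead of blocking, and one way to wait is to poll the tasks API until the task reports `completed`. A sketch of that polling loop; against a live cluster you would pass the real client and the returned task id, so the stub here only illustrates the response shape:

```python
import time


def wait_for_task(client, task_id, poll_seconds=5):
    """Poll the tasks API until the given task reports completed: True."""
    while True:
        status = client.tasks.get(task_id=task_id)
        if status.get("completed"):
            return status
        time.sleep(poll_seconds)


# stub client so the loop can be exercised offline; the dicts mimic
# the relevant part of a tasks-API response
class _StubTasks:
    def __init__(self, responses):
        self._responses = list(responses)

    def get(self, task_id):
        return self._responses.pop(0)


class _StubClient:
    def __init__(self, responses):
        self.tasks = _StubTasks(responses)


client = _StubClient(
    [{"completed": False}, {"completed": True, "response": {"total": 42}}]
)
final = wait_for_task(client, "node:12345", poll_seconds=0)
```

The returned status carries the reindex summary (document counts, failures) under `response`.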
{
"cell_type": "code",
"execution_count": 41,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"reindex_task = elasticsearch.reindex(\n",
" source={\"index\": SOURCE_INDEX},\n",
" dest={\"index\": DEST_INDEX, \"pipeline\": PIPELINE},\n",
" dest={\"index\": DEST_INDEX},\n",
" wait_for_completion=False,\n",
")\n",
"\n",
@@ -739,7 +726,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
@@ -768,7 +755,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
@@ -878,13 +865,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### How to do semantic search using ELSER\n",
"### How to do semantic search using ELSER with semantic text fields\n",
"\n",
"If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n",
"For each `spare_vector` we will generate a `text_expansion` query. These `text_expansion` queries will be added as `should` clauses to a top-level `bool` query.\n",
"We also use `min_score` because we want to exclude less relevant results. \n",
"\n",
"Again note here our ELSER model id. Ensure that the `model_id` matches the one you have used in your pipeline above."
"For each `semantic_text` field, we can use a [semantic query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-semantic-query.html) to easily perform a semantic search on that field.\n"
]
},
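The cell below builds one `semantic` clause per field and wraps them in a `bool`/`should`. The same construction as a reusable function, sketched here in pure Python; the field list and the `_semantic` suffix follow this notebook's conventions:

```python
def build_semantic_query(fields, query_string, suffix="_semantic"):
    """Build a bool/should query with one `semantic` clause per semantic_text field."""
    clauses = [
        {"semantic": {"field": field + suffix, "query": query_string}}
        for field in fields
    ]
    return {"bool": {"should": clauses}}


query = build_semantic_query(["description", "title"], "best sunset view")
```

The resulting dict can be passed directly as the `query` argument of `elasticsearch.search`.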
{
Expand All @@ -894,19 +878,21 @@
"outputs": [],
"source": [
"# replace with your own\n",
"QUERY_STRING = \"Which national park has dangerous wild animals?\"\n",
"text_expansion_queries = []\n",
"QUERY_STRING = \"best sunset view\"\n",
"semantic_text_queries = []\n",
"\n",
"for field_name in SPARSE_VECTOR_FIELDS:\n",
" text_expansion_queries.append(\n",
"for field_name in SEMANTIC_TEXT_FIELDS:\n",
" semantic_field_name = field_name + \"_semantic\"\n",
" semantic_text_queries.append(\n",
" {\n",
" \"text_expansion\": {\n",
" field_name: {\"model_id\": \".elser_model_2_linux-x86_64\", \"model_text\": QUERY_STRING}\n",
" \"semantic\": {\n",
" \"field\": semantic_field_name,\n",
" \"query\": QUERY_STRING,\n",
" }\n",
" }\n",
" )\n",
"\n",
"semantic_query = {\"bool\": {\"should\": text_expansion_queries}}\n",
"semantic_query = {\"bool\": {\"should\": semantic_text_queries}}\n",
"print(f\"Elasticsearch query:\\n{json.dumps(semantic_query, indent=2)}\\n\")"
]
},
@@ -916,15 +902,15 @@
"metadata": {},
"outputs": [],
"source": [
"results = elasticsearch.search(index=DEST_INDEX, query=semantic_query, min_score=20)\n",
"results = elasticsearch.search(index=DEST_INDEX, query=semantic_query, min_score=1)\n",
"print(f\"Query results:\\n{json.dumps(results.body, indent=2)}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How to combine App Search queries with ELSER\n",
"### How to combine App Search queries with Semantic Text\n",
"\n",
"We will now provide an example on how to combine the previous two queries into a single query that applies both BM25 search and semantic search.\n",
"In the previous examples, we have a `bool` query with `should` clauses.\n",
@@ -963,9 +949,9 @@
"source": [
"payload = app_search_query_payload.copy()\n",
"\n",
"for text_expansion_query in text_expansion_queries:\n",
"for semantic_text_query in semantic_text_queries:\n",
" payload[\"query\"][\"rule\"][\"organic\"][\"bool\"][\"should\"].append(\n",
" text_expansion_query\n",
" semantic_text_query\n",
" )\n",
"\n",
"print(f\"Elasticsearch payload:\\n{json.dumps(payload, indent=2)}\\n\")"
