diff --git a/notebooks/enterprise-search/app-search-engine-exporter.ipynb b/notebooks/enterprise-search/app-search-engine-exporter.ipynb index 9bfab3f7..15b465cc 100644 --- a/notebooks/enterprise-search/app-search-engine-exporter.ipynb +++ b/notebooks/enterprise-search/app-search-engine-exporter.ipynb @@ -66,7 +66,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -100,7 +100,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -129,7 +129,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "metadata": { "id": "kpV8K5jHvRK6" }, @@ -147,6 +147,22 @@ " )" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's take a quick look at the synonyms we've migrated. We'll do this via the `GET _synonyms` endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(json.dumps(elasticsearch.synonyms.get_synonym(id=ENGINE_NAME).body, indent=2))" + ] + }, { "cell_type": "markdown", "metadata": { @@ -186,8 +202,7 @@ " }\n", " )\n", "\n", - "\n", - "elasticsearch.query_ruleset.put(ruleset_id=ENGINE_NAME, rules=query_rules)" + "elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)\n" ] }, { "cell_type": "markdown", "metadata": { @@ -265,9 +280,16 @@ "Also note that below, we set up variables for our `SOURCE_INDEX` and `DEST_INDEX`. If you want your destination index to be named differently, you can edit it here as these variables are used throughout the rest of the notebook." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, we'll define our source and destination indices. If the destination index already exists, we'll delete it first so we start fresh." + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 33, "metadata": {}, "outputs": [], "source": [ @@ -277,8 +299,210 @@ "\n", "# delete the index if it's already created\n", "if elasticsearch.indices.exists(index=DEST_INDEX):\n", - " elasticsearch.indices.delete(index=DEST_INDEX)\n", + " elasticsearch.indices.delete(index=DEST_INDEX)\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we'll create our settings, which include the filters and analyzers to use for our text fields.\n", + "\n", + "These are similar to the Elasticsearch analyzers we use for App Search. The main difference is that we are also adding a synonyms filter so that we can\n", + "leverage the Elasticsearch synonym set we created in a previous step. If you want a different mapping for text fields, feel free to modify it.\n", + "\n", + "To start with, we'll define a number of filters that we can reuse in our analyzers. These include:\n", + "* `front_ngram`: defines a front-loaded [edge n-gram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html) that can help create prefixes for terms.\n", + "* `bigram_max_size`: defines a [maximum length](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html) for any bigram.
In our example, we exclude any bigrams larger than 16 characters.\n", + "* `en-stem-filter`: defines [a stemmer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html) for use with English text.\n", + "* `bigram_joiner_unigrams`: a filter that [adds word n-grams](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html) into our token stream. This helps to expand the query to capture more context.\n", + "* `delimiter`: a [word delimiter graph token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html) with the rules we've set on how to explicitly split tokens in our input.\n", + "* `en-stop-words-filter`: a default [stop token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html) to remove common English terms from our input.\n", + "* `synonyms-filter`: a [synonym graph token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html) that allows us to reuse the synonym set that we've defined above.\n" ] }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [], + "source": [ + "settings_analysis_filters = {\n", + " \"front_ngram\": {\"type\": \"edge_ngram\", \"min_gram\": \"1\", \"max_gram\": \"12\"},\n", + " \"bigram_joiner\": {\n", + " \"max_shingle_size\": \"2\",\n", + " \"token_separator\": \"\",\n", + " \"output_unigrams\": \"false\",\n", + " \"type\": \"shingle\",\n", + " },\n", + " \"bigram_max_size\": {\"type\": \"length\", \"max\": \"16\", \"min\": \"0\"},\n", + " \"en-stem-filter\": {\"name\": \"light_english\", \"type\": \"stemmer\"},\n", + " \"bigram_joiner_unigrams\": {\n", + " \"max_shingle_size\": \"2\",\n", + " \"token_separator\": \"\",\n", + " \"output_unigrams\": \"true\",\n", + " \"type\": \"shingle\",\n", + " },\n", + " \"delimiter\": {\n", + " \"split_on_numerics\": \"true\",\n", + " \"generate_word_parts\": \"true\",\n", + " \"preserve_original\": \"false\",\n", + " \"catenate_words\": \"true\",\n", + " \"generate_number_parts\": \"true\",\n", + " \"catenate_all\": \"true\",\n", + " \"split_on_case_change\": \"true\",\n", + " \"type\": \"word_delimiter_graph\",\n", + " \"catenate_numbers\": \"true\",\n", + " \"stem_english_possessive\": \"true\",\n", + " },\n", + " \"en-stop-words-filter\": {\"type\": \"stop\", \"stopwords\": \"_english_\"},\n", + " \"synonyms-filter\": {\n", + " \"type\": \"synonym_graph\",\n", + " \"synonyms_set\": ENGINE_NAME,\n", + " \"updateable\": True,\n", + " },\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we'll create the analyzers that use these filters. They will be used in different parts of our field mappings for text, and will help us index and query our text in different ways. These include:\n", + "\n", + "* `iq_text_delimiter` is used for tokenizing and searching terms split on the delimiters we've specified in our text.\n", + "* `i_prefix` and `q_prefix` define our indexing and query analyzers for creating prefix versions of our terms.\n", + "* `iq_text_stem` is used to create and query on stemmed versions of our tokens.\n", + "* `i_text_bigram` and `q_text_bigram` define our indexing and query analyzers for creating bigram terms.\n", + "* `i_text_base` and `q_text_base` define the indexing and query analysis rules for general text."
+ ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "settings_analyzer = {\n", + " \"i_prefix\": {\n", + " \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\", \"front_ngram\"],\n", + " \"tokenizer\": \"standard\",\n", + " },\n", + " \"iq_text_delimiter\": {\n", + " \"filter\": [\n", + " \"delimiter\",\n", + " \"cjk_width\",\n", + " \"lowercase\",\n", + " \"asciifolding\",\n", + " \"en-stop-words-filter\",\n", + " \"en-stem-filter\",\n", + " ],\n", + " \"tokenizer\": \"whitespace\",\n", + " },\n", + " \"q_prefix\": {\n", + " \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\"],\n", + " \"tokenizer\": \"standard\",\n", + " },\n", + " \"i_text_base\": {\n", + " \"filter\": [\n", + " \"cjk_width\",\n", + " \"lowercase\",\n", + " \"asciifolding\",\n", + " \"en-stop-words-filter\",\n", + " ],\n", + " \"tokenizer\": \"standard\",\n", + " },\n", + " \"q_text_base\": {\n", + " \"filter\": [\n", + " \"cjk_width\",\n", + " \"lowercase\",\n", + " \"asciifolding\",\n", + " \"en-stop-words-filter\",\n", + " \"synonyms-filter\",\n", + " ],\n", + " \"tokenizer\": \"standard\",\n", + " },\n", + " \"iq_text_stem\": {\n", + " \"filter\": [\n", + " \"cjk_width\",\n", + " \"lowercase\",\n", + " \"asciifolding\",\n", + " \"en-stop-words-filter\",\n", + " \"en-stem-filter\",\n", + " ],\n", + " \"tokenizer\": \"standard\",\n", + " },\n", + " \"i_text_bigram\": {\n", + " \"filter\": [\n", + " \"cjk_width\",\n", + " \"lowercase\",\n", + " \"asciifolding\",\n", + " \"en-stem-filter\",\n", + " \"bigram_joiner\",\n", + " \"bigram_max_size\",\n", + " ],\n", + " \"tokenizer\": \"standard\",\n", + " },\n", + " \"q_text_bigram\": {\n", + " \"filter\": [\n", + " \"cjk_width\",\n", + " \"lowercase\",\n", + " \"asciifolding\",\n", + " \"synonyms-filter\",\n", + " \"en-stem-filter\",\n", + " \"bigram_joiner_unigrams\",\n", + " \"bigram_max_size\",\n", + " ],\n", + " \"tokenizer\": \"standard\",\n", + " },\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we'll combine our filters and analyzers into a settings object that we can use to define our destination index's settings.\n", + "\n", + "More information on creating custom analyzers can be found in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html)." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "settings = {\n", + " \"analysis\": {\n", + " \"filter\": settings_analysis_filters,\n", + " \"analyzer\": settings_analyzer,\n", + " }\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have our analysis settings built, we'll get the current schema from our App Search engine and use it to build the mappings for the destination index we'll be migrating the data into.\n", + "\n", + "For any text fields, we'll explicitly define the mappings for how we want these fields to be stored. We define a number of fields here to emulate what App Search does under the hood. These include:\n", + "* A `keyword` field that ignores any token greater than 2048 characters in length.\n", + "* A `delimiter` field that captures tokens split on the delimiters we've defined in the `delimiter` analysis above.\n", + "* A `joined` field that uses our bigram analysis from above.
This will create pairs of joined tokens that can be used for phrase queries.\n", + "* A `prefix` field that uses our prefix analysis from above. This is used for prefix wildcard to allow for partial matches as well as autocomplete queries.\n", + "* A `stem` field that captures the stemmed versions of our tokens.\n", + "\n", + "Finally, the overall text field will be fully stored and analyzed using our base analyzer that we've defined above." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ "# get the App Search engine schema\n", "schema = app_search.get_schema(engine_name=ENGINE_NAME)\n", "\n", @@ -326,125 +550,22 @@ " \"index_options\": \"freqs\",\n", " \"analyzer\": \"i_text_base\",\n", " \"search_analyzer\": \"q_text_base\",\n", - " }\n", - "\n", - "# These are similar to the Elasticsearch analyzers we use for App Search.\n", - "# The main difference is that we are also adding a synonyms filter so that we can\n", - "# leverage the Elasticsearch synonym set we created in a previous step.\n", - "# If you want a different mapping for text fields, feel free to modify.\n", - "settings = {\n", - " \"analysis\": {\n", - " \"filter\": {\n", - " \"front_ngram\": {\"type\": \"edge_ngram\", \"min_gram\": \"1\", \"max_gram\": \"12\"},\n", - " \"bigram_joiner\": {\n", - " \"max_shingle_size\": \"2\",\n", - " \"token_separator\": \"\",\n", - " \"output_unigrams\": \"false\",\n", - " \"type\": \"shingle\",\n", - " },\n", - " \"bigram_max_size\": {\"type\": \"length\", \"max\": \"16\", \"min\": \"0\"},\n", - " \"en-stem-filter\": {\"name\": \"light_english\", \"type\": \"stemmer\"},\n", - " \"bigram_joiner_unigrams\": {\n", - " \"max_shingle_size\": \"2\",\n", - " \"token_separator\": \"\",\n", - " \"output_unigrams\": \"true\",\n", - " \"type\": \"shingle\",\n", - " },\n", - " \"delimiter\": {\n", - " \"split_on_numerics\": \"true\",\n", - " \"generate_word_parts\": \"true\",\n", - " \"preserve_original\": \"false\",\n", - " \"catenate_words\": \"true\",\n", - " \"generate_number_parts\": \"true\",\n", - " \"catenate_all\": \"true\",\n", - " \"split_on_case_change\": \"true\",\n", - " \"type\": \"word_delimiter_graph\",\n", - " \"catenate_numbers\": \"true\",\n", - " \"stem_english_possessive\": \"true\",\n", - " },\n", - " \"en-stop-words-filter\": {\"type\": \"stop\", \"stopwords\": \"_english_\"},\n", - " \"synonyms-filter\": {\n", - " \"type\": \"synonym_graph\",\n", - " \"synonyms_set\": ENGINE_NAME,\n", - " \"updateable\": True,\n", - " },\n", - " },\n", - " \"analyzer\": {\n", - " \"i_prefix\": {\n", - " \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\", \"front_ngram\"],\n", - " \"tokenizer\": \"standard\",\n", - " },\n", - " \"iq_text_delimiter\": {\n", - " \"filter\": [\n", - " \"delimiter\",\n", - " \"cjk_width\",\n", - " \"lowercase\",\n", - " \"asciifolding\",\n", - " \"en-stop-words-filter\",\n", - " \"en-stem-filter\",\n", - " ],\n", - " \"tokenizer\": \"whitespace\",\n", - " },\n", - " \"q_prefix\": {\n", - " \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\"],\n", - " \"tokenizer\": \"standard\",\n", - " },\n", - " \"i_text_base\": {\n", - " \"filter\": [\n", - " \"cjk_width\",\n", - " \"lowercase\",\n", - " \"asciifolding\",\n", - " \"en-stop-words-filter\",\n", - " ],\n", - " \"tokenizer\": \"standard\",\n", - " },\n", - " \"q_text_base\": {\n", - " \"filter\": [\n", - " \"cjk_width\",\n", - " \"lowercase\",\n", - " \"asciifolding\",\n", - " \"en-stop-words-filter\",\n", - " \"synonyms-filter\",\n", - " 
],\n", - " \"tokenizer\": \"standard\",\n", - " },\n", - " \"iq_text_stem\": {\n", - " \"filter\": [\n", - " \"cjk_width\",\n", - " \"lowercase\",\n", - " \"asciifolding\",\n", - " \"en-stop-words-filter\",\n", - " \"en-stem-filter\",\n", - " ],\n", - " \"tokenizer\": \"standard\",\n", - " },\n", - " \"i_text_bigram\": {\n", - " \"filter\": [\n", - " \"cjk_width\",\n", - " \"lowercase\",\n", - " \"asciifolding\",\n", - " \"en-stem-filter\",\n", - " \"bigram_joiner\",\n", - " \"bigram_max_size\",\n", - " ],\n", - " \"tokenizer\": \"standard\",\n", - " },\n", - " \"q_text_bigram\": {\n", - " \"filter\": [\n", - " \"cjk_width\",\n", - " \"lowercase\",\n", - " \"asciifolding\",\n", - " \"synonyms-filter\",\n", - " \"en-stem-filter\",\n", - " \"bigram_joiner_unigrams\",\n", - " \"bigram_max_size\",\n", - " ],\n", - " \"tokenizer\": \"standard\",\n", - " },\n", - " },\n", - " }\n", - "}\n", - "\n", + " }" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now, we create our destination index that uses our mappings and analysis settings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# and actually create our index\n", "elasticsearch.indices.create(\n", " index=DEST_INDEX, mappings={\"properties\": mapping}, settings=settings\n", @@ -492,7 +613,9 @@ "\n", "> If you have not already deployed ELSER, follow this [guide](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) on how to download and deploy the model. Without this step, you will receive errors below when you run the `reindex` command.\n", "\n", - "Assuming you have downloaded and deployed ELSER in your deployment, we can now define an ingest pipeline that will enrich the documents with the `sparse_vector` fields that can be used with semantic search." + "Assuming you have downloaded and deployed ELSER in your deployment, we can now define an ingest pipeline that will enrich the documents with the `sparse_vector` fields that can be used with semantic search.\n", + "\n", + "Also - check to ensure that your ELSER model is deployed and started completely. Your `model_id` below may differ from the model in your cluster, so ensure it is correct before proceeding." ] }, { @@ -510,7 +633,7 @@ " processors.append(\n", " {\n", " \"inference\": {\n", - " \"model_id\": \".elser_model_2\",\n", + " \"model_id\": \".elser_model_2_linux-x86_64\",\n", " \"input_output\": [\n", " {\"input_field\": input_field, \"output_field\": output_field}\n", " ],\n", @@ -550,7 +673,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 41, "metadata": {}, "outputs": [], "source": [ @@ -606,7 +729,12 @@ "}'\n", "```\n", "\n", - "From the output of the API call above, we can see the actual Elasticsearch query that will be used. Below, we are using this query as a base to build our own App Search like query using query rules and our Elasticsearch synonyms. The query is further enhanced by augmentation with the built-in App Search multifield types for such things as stemming and prefix matching." + "From the output of the API call above, we can see the actual Elasticsearch query that will be used. Below, we are using this query as a base to build our own App Search like query using query rules and our Elasticsearch synonyms. 
The query is further enhanced by augmenting it with the built-in App Search multi-field types for things like stemming and prefix matching.\n", + "\n", + "To walk through a bit of what is happening in the query below: first, we gather some preliminary information about the fields we want to query and return.\n", + "1) We gather the fields we want for our results. This includes all the keys in the schema from above.\n", + "2) Next, we gather all of the text fields in our schema.\n", + "3) And finally, we gather the \"best fields\", which are those we want to query on using our stemmer." ] }, { "cell_type": "code", @@ -620,8 +748,30 @@ "result_fields = list(schema.keys())\n", "\n", "text_fields = [field_name for field_name in schema if schema[field_name] == \"text\"]\n", - "best_fields = [field_name + \".stem\" for field_name in text_fields]\n", + "best_fields = [field_name + \".stem\" for field_name in text_fields]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, from our text fields, we create a set of fields with specified weights for our various analyzers.\n", + "\n", + "* For the text field itself, we weight this as neutral, with a `1.0`.\n", + "* For any stem fields, we weight these _slightly_ less to pull in closely stemmed words in the query.\n", + "* For any prefixes, we apply a minimal weight to ensure these do not dominate our scoring.\n", + "* For any potential bigram phrase matches, we weight these as well with a `0.75`.\n", + "* Finally, for our delimiter-analyzed terms, we weight these somewhere in the middle.\n", + "\n", + "These are the default weightings that App Search uses. Feel free to experiment with these values to find a balance that works for you." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cross_fields = []\n", "\n", "for text_field in text_fields:\n", @@ -629,11 +779,32 @@ " cross_fields.append(text_field + \".stem^0.95\")\n", " cross_fields.append(text_field + \".prefix^0.1\")\n", " cross_fields.append(text_field + \".joined^0.75\")\n", - " cross_fields.append(text_field + \".delimiter^0.4\")\n", + " cross_fields.append(text_field + \".delimiter^0.4\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we're ready to create the actual payload for our query. This is analogous to the query that App Search uses when querying.\n", + "\n", + "Within this query, we first set an organic query rule. Under the hood, this defines a boolean query that allows a match to be found and scored either in the cross fields we defined above or in the \"best fields\" as defined.\n", + "\n", + "For the results, we sort on our score descending as the primary sort, with the document id as the secondary.\n", + "\n", + "We apply highlights to our results, request a return size of the top 10 hits, and for each hit, return the result fields."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "\n", "app_search_query_payload = {\n", " \"query\": {\n", - " \"rule_query\": {\n", + " \"rule\": {\n", " \"organic\": {\n", " \"bool\": {\n", " \"should\": [\n", @@ -658,7 +829,7 @@ " ]\n", " }\n", " },\n", - " \"ruleset_id\": ENGINE_NAME,\n", + " \"ruleset_ids\": ENGINE_NAME,\n", " \"match_criteria\": {\"user_query\": QUERY_STRING},\n", " }\n", " },\n", @@ -711,7 +882,9 @@ "\n", "If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n", "For each `spare_vector` we will generate a `text_expansion` query. These `text_expansion` queries will be added as `should` clauses to a top-level `bool` query.\n", - "We also use `min_score` because we want to exclude less relevant results. " + "We also use `min_score` because we want to exclude less relevant results. \n", + "\n", + "Again note here our ELSER model id. Ensure that the `model_id` matches the one you have used in your pipeline above." ] }, { @@ -728,7 +901,7 @@ " text_expansion_queries.append(\n", " {\n", " \"text_expansion\": {\n", - " field_name: {\"model_id\": \".elser_model_2\", \"model_text\": QUERY_STRING}\n", + " field_name: {\"model_id\": \".elser_model_2_linux-x86_64\", \"model_text\": QUERY_STRING}\n", " }\n", " }\n", " )\n", @@ -791,7 +964,7 @@ "payload = app_search_query_payload.copy()\n", "\n", "for text_expansion_query in text_expansion_queries:\n", - " payload[\"query\"][\"rule_query\"][\"organic\"][\"bool\"][\"should\"].append(\n", + " payload[\"query\"][\"rule\"][\"organic\"][\"bool\"][\"should\"].append(\n", " text_expansion_query\n", " )\n", "\n", @@ -823,7 +996,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3.12.3 64-bit", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -837,12 +1010,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.3" - }, - "vscode": { - "interpreter": { - "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" - } + "version": "3.11.9" } }, "nbformat": 4,
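For reference, here is a minimal sketch (not part of the diff above) of how the migrated query might be executed once the notebook cells have run. It assumes the `elasticsearch` client, `DEST_INDEX`, and `app_search_query_payload` variables defined in the cells above are in scope:

```python
# Minimal sketch: run the migrated App Search-style query against the
# destination index and print the id and score of each hit.
# Assumes `elasticsearch`, `DEST_INDEX`, and `app_search_query_payload`
# are defined as in the notebook cells above.
response = elasticsearch.search(
    index=DEST_INDEX,
    query=app_search_query_payload["query"],
    size=10,
)

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```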