## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation

This PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.

The hybrid search combines (from a single request to Elasticsearch):

- Vector search results from ELSER that vary depending on the query specified by the user
- Terms search results that return a set of Knowledge Base (KB) documents marked as "required" for a topic

When provided as context to an LLM, the hybrid search results improve the quality of generated `ES|QL` queries: they combine `ES|QL` parser grammar and documentation relevant to the user's question with additional examples of valid `ES|QL` queries that aren't specific to that question.
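
Conceptually, the request is a single `msearch` whose two entries target the same KB index: an ELSER query scored against the user's question, plus a `bool` filter that selects the required documents. The sketch below is illustrative only; the real query DSL is produced by the helpers described under Details, and the `text_expansion` query type, field names, and sizes are assumptions based on the sample document and constants shown later in this description:

```typescript
// Illustrative shape of the two msearch entries (header + body pairs); NOT the exact
// DSL built by the PR's helpers. Field names assume the KB documents store ELSER
// tokens under `vector.tokens` and the flags under `metadata.*`.
const userQuery = 'Generate an ES|QL query that ...'; // the user's question

const requiredDocTerms = [
  { term: { 'metadata.kbResource': 'esql' } },
  { term: { 'metadata.required': true } },
];

const hybridSearches = [
  // 1) Vector half: ELSER text_expansion scored against the user's question,
  //    excluding the required docs so they aren't returned twice.
  { index: '.kibana-elastic-ai-assistant-kb' },
  {
    size: 4, // `k` from LangChain's retriever
    query: {
      bool: {
        must: [
          {
            text_expansion: {
              'vector.tokens': { model_id: '.elser_model_2', model_text: userQuery },
            },
          },
        ],
        must_not: requiredDocTerms,
      },
    },
  },
  // 2) Terms half: every KB document marked as required for the `esql` resource.
  { index: '.kibana-elastic-ai-assistant-kb' },
  { query: { bool: { filter: requiredDocTerms } } },
];
```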

## Details

### Indexing additional `metadata`

The `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid examples, and one invalid example, of `ES|QL` queries:

```typescript
    const rawExampleQueries = await exampleQueriesLoader.load();

    // Add additional metadata to the example queries that indicates they are required KB documents:
    const requiredExampleQueries = addRequiredKbResourceMetadata({
      docs: rawExampleQueries,
      kbResource: ESQL_RESOURCE,
    });
```
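
The `exampleQueriesLoader` above is created elsewhere in `loadESQL`; a minimal sketch of how such a loader could be constructed with LangChain's filesystem document loaders is shown below (the directory path and loader configuration are assumptions, not the PR's exact code):

```typescript
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { TextLoader } from 'langchain/document_loaders/fs/text';

// Hypothetical construction of the example-queries loader; the real path and
// loader wiring live in esql_loader.ts.
const exampleQueriesLoader = new DirectoryLoader(
  'x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries',
  {
    '.asciidoc': (path) => new TextLoader(path),
  }
);

// Inside an async function:
// const rawExampleQueries = await exampleQueriesLoader.load();
```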

The `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:

- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`
- `required` - a `boolean` field that, when `true`, indicates the document should be returned in all searches for the `kbResource`
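
A minimal sketch of what a metadata-tagging helper along these lines looks like, assuming LangChain `Document` objects (illustrative only; the PR's actual implementation may differ in detail):

```typescript
import { Document } from 'langchain/document';

// Sketch only: copy each document, preserving its existing metadata and adding the
// two fields that mark it as a required KB resource for the given topic.
export const addRequiredKbResourceMetadata = ({
  docs,
  kbResource,
}: {
  docs: Document[];
  kbResource: string; // e.g. 'esql'
}): Document[] =>
  docs.map(
    (doc) =>
      new Document({
        pageContent: doc.pageContent,
        metadata: {
          ...doc.metadata,
          kbResource,
          required: true, // always returned by the terms half of the hybrid search
        },
      })
  );
```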

The additional metadata fields are shown in the following abridged sample document:

```
{
  "_index": ".kibana-elastic-ai-assistant-kb",
  "_id": "e297e2d9-fb0e-4638-b4be-af31d1b31b9f",
  "_version": 1,
  "_seq_no": 129,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "metadata": {
      "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc",
      "required": true,
      "kbResource": "esql"
    },
    "vector": {
      "tokens": {
        "serial": 0.5612584,
        "syntax": 0.006727545,
        "user": 1.1184403,
        // ...additional tokens
      },
      "model_id": ".elser_model_2"
    },
    "text": """[[esql-example-queries]]

The following is an example ES|QL query:

\`\`\`
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) by user.name, host.name
| ENRICH ldap_lookup_new ON user.name
| WHERE group.name IS NOT NULL
| EVAL follow_up = CASE(
    destcount >= 100, "true",
     "false")
| SORT destcount desc
| KEEP destcount, host.name, user.name, group.name, follow_up
\`\`\`
"""
  }
}
```
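
For the terms search to filter on these fields, `metadata.kbResource` and `metadata.required` need `keyword` and `boolean` mappings respectively. The fragment below only illustrates that shape; the actual mappings for `.kibana-elastic-ai-assistant-kb` are managed by the assistant's `ElasticsearchStore`:

```typescript
// Illustrative mapping fragment for the two metadata fields described above; the
// real index mappings are created by the plugin, not by this snippet.
const metadataMappings = {
  properties: {
    metadata: {
      properties: {
        kbResource: { type: 'keyword' },
        required: { type: 'boolean' },
      },
    },
  },
} as const;
```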

### Hybrid search

The `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.

A single [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html) request to Elasticsearch performs the hybrid search, combining the vector and terms queries:

```typescript
    // requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:
    const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);

    // The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:
    const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;

    // build a vector search query:
    const vectorSearchQuery = getVectorSearchQuery({
      filter,
      modelId: this.model,
      mustNotTerms: requiredDocs,
      query,
    });

    // build a (separate) terms search query:
    const termsSearchQuery = getTermsSearchQuery(requiredDocs);

    // combine the vector search query and the terms search queries into a single multi-search query:
    const mSearchQueryBody = getMsearchQueryBody({
      index: this.index,
      termsSearchQuery,
      termsSearchQuerySize: TERMS_QUERY_SIZE,
      vectorSearchQuery,
      vectorSearchQuerySize,
    });

    try {
      // execute both queries via a single multi-search request:
      const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);

      // flatten the results of the combined queries into a single array of hits:
      const results: FlattenedHit[] = result.responses.flatMap((response) =>
      // ...
```
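
The flattening at the end of the excerpt is elided; the sketch below shows one way the combined `msearch` responses could be turned into LangChain `Document`s (illustrative only, using the Elasticsearch JS client's response types; the PR's `FlattenedHit` handling may differ):

```typescript
import { Document } from 'langchain/document';
import type { MsearchResponse } from '@elastic/elasticsearch/lib/api/types';

interface KbSource {
  text: string;
  metadata: Record<string, unknown>;
}

// Sketch only: map every hit from both sub-searches to a LangChain Document,
// skipping any sub-search that returned an error instead of hits.
export const flattenMsearchResponses = (result: MsearchResponse<KbSource>): Document[] =>
  result.responses.flatMap((response) =>
    'hits' in response
      ? response.hits.hits.map(
          (hit) =>
            new Document({
              pageContent: hit._source?.text ?? '',
              metadata: hit._source?.metadata ?? {},
            })
        )
      : []
  );
```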

## Desk testing

1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:

```
DELETE .kibana-elastic-ai-assistant-kb
```

2. In the Security Solution, open the Elastic AI Assistant

3. In the assistant, click the `Settings` gear

4. Click the `Knowledge Base` icon to view the KB settings

5. Toggle the `Knowledge Base` setting `off` if it's already on

6. Toggle the `Knowledge Base` setting `on` to load the KB documents

7. Click the `Save` button to close settings

8. Enter the following prompt, then press Enter:

```
Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
```

**Expected result**

A response similar to the following is returned:

```
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) BY user.name
| ENRICH ldap_lookup ON user.name
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount DESC
| KEEP destcount, user.name, group.name, follow_up
```

### Reference: Annotated `verbose: true` output

The following output, annotated with `// comments`, was generated by setting `verbose: true` in the following code in `x-pack/plugins/elastic_assistant/server/lib/langchain/execute_custom_llm_chain/index.ts`:

```typescript
  const executor = await initializeAgentExecutorWithOptions(tools, llm, {
    agentType: 'chat-conversational-react-description',
    memory,
    verbose: true, // <--
  });
```

<details>
  <summary>Annotated verbose output</summary>

```json
// The chain starts with just the input from the user: a system prompt, plus the user's input:

[chain/start] [1:chain:AgentExecutor] Entering Chain run with input: {
  "input": "You are a helpful, expert assistant who answers questions about Elastic Security. Do not answer questions unrelated to Elastic Security.\nIf you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. This means you should only ever include the \"filter\" portion of the query.\nUse the following context to answer questions:\n\n\n\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.",
  "chat_history": []
}

// The input from the previous step is unchanged in this one:

[chain/start] [1:chain:AgentExecutor > 2:chain:LLMChain] Entering Chain run with input: {
  "input": "You are a helpful, expert assistant who answers questions about Elastic Security. Do not answer questions unrelated to Elastic Security.\nIf you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. This means you should only ever include the \"filter\" portion of the query.\nUse the following context to answer questions:\n\n\n\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.",
  "chat_history": [],
  "agent_scratchpad": [],
  "stop": [
    "Observation:"
  ]
}

// The "prompts" array below contains content written by LangChain inform the LLM about the available tools, including the ES|QL knowledge base, and "teach" it how to use them:

[llm/start] [1:chain:AgentExecutor > 2:chain:LLMChain > 3:llm:ActionsClientLlm] Entering LLM run with input: {
  "prompts": [
    "[{\"lc\":1,\"type\":\"constructor\",\"id\":[\"langchain\",\"schema\",\"SystemMessage\"],\"kwargs\":{\"content\":\"Assistant is a large language model trained by OpenAI.\\n\\nAssistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.\\n\\nAssistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.\\n\\nOverall, Assistant is a powerful system that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist. However, above all else, all responses must adhere to the format of RESPONSE FORMAT INSTRUCTIONS.\",\"additional_kwargs\":{}}},{\"lc\":1,\"type\":\"constructor\",\"id\":[\"langchain\",\"schema\",\"HumanMessage\"],\"kwargs\":{\"content\":\"TOOLS\\n------\\nAssistant can ask the user to use tools to look up information that may be helpful in answering the users original question. The tools the human can use are:\\n\\nesql-language-knowledge-base: Call this for knowledge on how to build an ESQL query, or answer questions about the ES|QL query language.\\n\\nRESPONSE FORMAT INSTRUCTIONS\\n----------------------------\\n\\nOutput a JSON markdown code snippet containing a valid JSON object in one of two formats:\\n\\n**Option 1:**\\nUse this if you want the human to use a tool.\\nMarkdown code snippet formatted in the following schema:\\n\\n```json\\n{\\n    \\\"action\\\": string, // The action to take. Must be one of [esql-language-knowledge-base]\\n    \\\"action_input\\\": string // The input to the action. May be a stringified object.\\n}\\n```\\n\\n**Option #2:**\\nUse this if you want to respond directly and conversationally to the human. Markdown code snippet formatted in the following schema:\\n\\n```json\\n{\\n    \\\"action\\\": \\\"Final Answer\\\",\\n    \\\"action_input\\\": string // You should put what you want to return to use here and make sure to use valid json newline characters.\\n}\\n```\\n\\nFor both options, remember to always include the surrounding markdown code snippet delimiters (begin with \\\"```json\\\" and end with \\\"```\\\")!\\n\\n\\nUSER'S INPUT\\n--------------------\\nHere is the user's input (remember to respond with a markdown code snippet of a json blob with a single action, and NOTHING else):\\n\\nYou are a helpful, expert assistant who answers questions about Elastic Security. Do not answer questions unrelated to Elastic Security.\\nIf you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. 
This means you should only ever include the \\\"filter\\\" portion of the query.\\nUse the following context to answer questions:\\n\\n\\n\\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\",\"additional_kwargs\":{}}}]"
  ]
}

// The LLM then uses the prompt above to generate a response (below), which is then passed to the Chain:

[llm/end] [1:chain:AgentExecutor > 2:chain:LLMChain > 3:llm:ActionsClientLlm] [5.48s] Exiting LLM run with output: {
  "generations": [
    [
      {
        "text": "```json\n{\n  \"action\": \"esql-language-knowledge-base\",\n  \"action_input\": \"Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\"\n}\n```"
      }
    ]
  ]
}

// It's worth noting that the LLM **ONLY** provided the actual question posed by the user. The LLM correctly omitted all the other instructions, including the system prompt, because the question asked by the user is the most relevant piece of information for the LLM to use to generate a response.

[chain/end] [1:chain:AgentExecutor > 2:chain:LLMChain] [5.49s] Exiting Chain run with output: {
  "text": "```json\n{\n  \"action\": \"esql-language-knowledge-base\",\n  \"action_input\": \"Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\"\n}\n```"
}

// In this step, the `AgentExecutor` takes the output from the previous step, and passes it to the `ChainTool`:

[agent/action] [1:chain:AgentExecutor] Agent selected action: {
  "tool": "esql-language-knowledge-base",
  "toolInput": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.",
  "log": "```json\n{\n  \"action\": \"esql-language-knowledge-base\",\n  \"action_input\": \"Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\"\n}\n```"
}

// The `ChainTool` then passes the input to the `RetrievalQAChain`:

[tool/start] [1:chain:AgentExecutor > 4:tool:ChainTool] Entering Tool run with input: "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names."

// The `RetrievalQAChain` then passes the input to the `VectorStoreRetriever`:

[chain/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain] Entering Chain run with input: {
  "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names."
}

// The `VectorStoreRetriever` then passes the input to the `ElasticsearchStore`, and calls the `similaritySearch` method, in this example with a `k` value of `4`, which means that the `ElasticsearchStore` will return the top 4 results:

[retriever/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 6:retriever:VectorStoreRetriever] Entering Retriever run with input: {
  "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names."
}

// The `VectorStoreRetriever` returned 18 results. The first 4 results are from ELSER, because the LangChain `RetrievalQAChain` is configured to return 4 results. The other 14 results matched a terms query where "metadata.kbResource": "esql" AND "metadata.required": true:

[retriever/end] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 6:retriever:VectorStoreRetriever] [23ms] Exiting Retriever run with output: {
  "documents": [
    {
      "pageContent": "[[esql]]\n= {esql}\n\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\n\n[partintro]\n--\n\npreview::[]\n\nThe {es} Query Language ({esql}) is a query language that enables the iterative\nexploration of data.\n\nAn {esql} query consists of a series of commands, separated by pipes. Each query\nstarts with a <<esql-source-commands,source command>>. A source command produces\na table, typically with data from {es}.\n\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\"center\"]\n\nA source command can be followed by one or more\n<<esql-processing-commands,processing commands>>. Processing commands change an\ninput table by adding, removing, or changing rows and columns.\n\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\"center\"]\n\nYou can chain processing commands, separated by a pipe character: `|`. Each\nprocessing command works on the output table of the previous command.\n\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\"center\"]\n\nThe result of a query is the table produced by the final processing command.\n\n[discrete]\n[[esql-console]]\n=== Run an {esql} query\n\n[discrete]\n==== The {esql} API\n\nUse the `_query` endpoint to run an {esql} query:\n\n[source,console]\n----\nPOST /_query\n{\n  \"query\": \"\"\"\n    FROM library\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n    | STATS MAX(page_count) BY year\n    | SORT year\n    | LIMIT 5\n  \"\"\"\n}\n----\n// TEST[setup:library]\n\nThe results come back in rows:\n\n[source,console-result]\n----\n{\n  \"columns\": [\n    { \"name\": \"MAX(page_count)\", \"type\": \"integer\"},\n    { \"name\": \"year\"           , \"type\": \"date\"}\n  ],\n  \"values\": [\n    [268, \"1932-01-01T00:00:00.000Z\"],\n    [224, \"1951-01-01T00:00:00.000Z\"],\n    [227, \"1953-01-01T00:00:00.000Z\"],\n    [335, \"1959-01-01T00:00:00.000Z\"],\n    [604, \"1965-01-01T00:00:00.000Z\"]\n  ]\n}\n----\n\nBy default, results are returned as JSON. To return results formatted as text,\nCSV, or TSV, use the `format` parameter:\n\n[source,console]\n----\nPOST /_query?format=txt\n{\n  \"query\": \"\"\"\n    FROM library\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n    | STATS MAX(page_count) BY year\n    | SORT year\n    | LIMIT 5\n  \"\"\"\n}\n----\n// TEST[setup:library]\n\n[discrete]\n==== {kib}\n\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. 
Next, in\nDiscover or Lens, from the data view dropdown, select *{esql}*.\n\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\nwith the time filter.\n\n[discrete]\n[[esql-limitations]]\n=== Limitations\n\n{esql} currently supports the following <<mapping-types,field types>>:\n\n- `alias`\n- `boolean`\n- `date`\n- `double` (`float`, `half_float`, `scaled_float` are represented as `double`)\n- `ip`\n- `keyword` family including `keyword`, `constant_keyword`, and `wildcard`\n- `int` (`short` and `byte` are represented as `int`)\n- `long`\n- `null`\n- `text`\n- `unsigned_long`\n- `version`\n--\n\ninclude::esql-get-started.asciidoc[]\n\ninclude::esql-syntax.asciidoc[]\n\ninclude::esql-source-commands.asciidoc[]\n\ninclude::esql-processing-commands.asciidoc[]\n\ninclude::esql-functions.asciidoc[]\n\ninclude::aggregation-functions.asciidoc[]\n\ninclude::multivalued-fields.asciidoc[]\n\ninclude::task-management.asciidoc[]\n\n:esql-tests!:\n:esql-specs!:\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/index.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/source_commands/from.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-agg-count]]\n=== `COUNT`\nCounts field values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=count]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=count-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\nNOTE: There isn't yet a `COUNT(*)`. Please count a single valued field if you\n      need a count of rows.\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-agg-count-distinct]]\n=== `COUNT_DISTINCT`\nThe approximate number of distinct values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\n==== Counts are approximate\n\nComputing exact counts requires loading values into a set and returning its\nsize. This doesn't scale when working on high-cardinality sets and/or large\nvalues as the required memory usage and the need to communicate those\nper-shard sets between nodes would utilize too many resources of the cluster.\n\nThis `COUNT_DISTINCT` function is based on the\nhttps://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]\nalgorithm, which counts based on the hashes of the values with some interesting\nproperties:\n\ninclude::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]\n\n==== Precision is configurable\n\nThe `COUNT_DISTINCT` function takes an optional second parameter to configure the\nprecision discussed previously.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]\n|===\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count_distinct.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n    destcount >= 100, \"true\",\n     \"false\")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| grok dns.question.name \"%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}\"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0002.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0003.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.category == \"file\" and event.action == \"creation\"\n| stats filecount = count(file.name) by process.name,host.name\n| dissect process.name \"%{process}.%{extension}\"\n| eval proclength = length(process.name)\n| where proclength > 10\n| sort filecount,proclength desc\n| limit 10\n| keep host.name,process.name,filecount,process,extension,fullproc,proclength\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0004.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where process.name == \"curl.exe\"\n| stats bytes = sum(destination.bytes) by destination.address\n| eval kb =  bytes/1024\n| sort kb desc\n| limit 10\n| keep kb,destination.address\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0005.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metrics-apm*\n| WHERE metricset.name == \"transaction\" AND metricset.interval == \"1m\"\n| EVAL bucket = AUTO_BUCKET(transaction.duration.histogram, 50, <start-date>, <end-date>)\n| STATS avg_duration = AVG(transaction.duration.histogram) BY bucket\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0006.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM packetbeat-*\n| STATS doc_count = COUNT(destination.domain) BY destination.domain\n| SORT doc_count DESC\n| LIMIT 10\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0007.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM employees\n| EVAL hire_date_formatted = DATE_FORMAT(hire_date, \"MMMM yyyy\")\n| SORT hire_date\n| KEEP emp_no, hire_date_formatted\n| LIMIT 5\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0008.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is NOT an example of an ES|QL query:\n\n```\nPagination is not supported\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0009.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE @timestamp >= NOW() - 15 minutes\n| EVAL bucket = DATE_TRUNC(1 minute, @timestamp)\n| STATS avg_cpu = AVG(system.cpu.total.norm.pct) BY bucket, host.name\n| LIMIT 10\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0010.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM traces-apm*\n| WHERE @timestamp >= NOW() - 24 hours\n| EVAL successful = CASE(event.outcome == \"success\", 1, 0),\n  failed = CASE(event.outcome == \"failure\", 1, 0)\n| STATS success_rate = AVG(successful),\n  avg_duration = AVG(transaction.duration),\n  total_requests = COUNT(transaction.id) BY service.name\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0011.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metricbeat*\n| EVAL cpu_pct_normalized = (system.cpu.user.pct + system.cpu.system.pct) / system.cpu.cores\n| STATS AVG(cpu_pct_normalized) BY host.name\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0012.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM postgres-logs\n| DISSECT message \"%{} duration: %{query_duration} ms\"\n| EVAL query_duration_num = TO_DOUBLE(query_duration)\n| STATS avg_duration = AVG(query_duration_num)\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0013.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM nyc_taxis\n| WHERE DATE_EXTRACT(drop_off_time, \"hour\") >= 6 AND DATE_EXTRACT(drop_off_time, \"hour\") < 10\n| LIMIT 10\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0014.asciidoc"
      }
    }
  ]
}

// The search results are then transformed into documents:

[chain/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 7:chain:StuffDocumentsChain] Entering Chain run with input: {
  "question": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.",
  "input_documents": [
    {
      "pageContent": "[[esql]]\n= {esql}\n\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\n\n[partintro]\n--\n\npreview::[]\n\nThe {es} Query Language ({esql}) is a query language that enables the iterative\nexploration of data.\n\nAn {esql} query consists of a series of commands, separated by pipes. Each query\nstarts with a <<esql-source-commands,source command>>. A source command produces\na table, typically with data from {es}.\n\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\"center\"]\n\nA source command can be followed by one or more\n<<esql-processing-commands,processing commands>>. Processing commands change an\ninput table by adding, removing, or changing rows and columns.\n\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\"center\"]\n\nYou can chain processing commands, separated by a pipe character: `|`. Each\nprocessing command works on the output table of the previous command.\n\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\"center\"]\n\nThe result of a query is the table produced by the final processing command.\n\n[discrete]\n[[esql-console]]\n=== Run an {esql} query\n\n[discrete]\n==== The {esql} API\n\nUse the `_query` endpoint to run an {esql} query:\n\n[source,console]\n----\nPOST /_query\n{\n  \"query\": \"\"\"\n    FROM library\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n    | STATS MAX(page_count) BY year\n    | SORT year\n    | LIMIT 5\n  \"\"\"\n}\n----\n// TEST[setup:library]\n\nThe results come back in rows:\n\n[source,console-result]\n----\n{\n  \"columns\": [\n    { \"name\": \"MAX(page_count)\", \"type\": \"integer\"},\n    { \"name\": \"year\"           , \"type\": \"date\"}\n  ],\n  \"values\": [\n    [268, \"1932-01-01T00:00:00.000Z\"],\n    [224, \"1951-01-01T00:00:00.000Z\"],\n    [227, \"1953-01-01T00:00:00.000Z\"],\n    [335, \"1959-01-01T00:00:00.000Z\"],\n    [604, \"1965-01-01T00:00:00.000Z\"]\n  ]\n}\n----\n\nBy default, results are returned as JSON. To return results formatted as text,\nCSV, or TSV, use the `format` parameter:\n\n[source,console]\n----\nPOST /_query?format=txt\n{\n  \"query\": \"\"\"\n    FROM library\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n    | STATS MAX(page_count) BY year\n    | SORT year\n    | LIMIT 5\n  \"\"\"\n}\n----\n// TEST[setup:library]\n\n[discrete]\n==== {kib}\n\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. 
Next, in\nDiscover or Lens, from the data view dropdown, select *{esql}*.\n\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\nwith the time filter.\n\n[discrete]\n[[esql-limitations]]\n=== Limitations\n\n{esql} currently supports the following <<mapping-types,field types>>:\n\n- `alias`\n- `boolean`\n- `date`\n- `double` (`float`, `half_float`, `scaled_float` are represented as `double`)\n- `ip`\n- `keyword` family including `keyword`, `constant_keyword`, and `wildcard`\n- `int` (`short` and `byte` are represented as `int`)\n- `long`\n- `null`\n- `text`\n- `unsigned_long`\n- `version`\n--\n\ninclude::esql-get-started.asciidoc[]\n\ninclude::esql-syntax.asciidoc[]\n\ninclude::esql-source-commands.asciidoc[]\n\ninclude::esql-processing-commands.asciidoc[]\n\ninclude::esql-functions.asciidoc[]\n\ninclude::aggregation-functions.asciidoc[]\n\ninclude::multivalued-fields.asciidoc[]\n\ninclude::task-management.asciidoc[]\n\n:esql-tests!:\n:esql-specs!:\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/index.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/source_commands/from.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-agg-count]]\n=== `COUNT`\nCounts field values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=count]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=count-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\nNOTE: There isn't yet a `COUNT(*)`. Please count a single valued field if you\n      need a count of rows.\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-agg-count-distinct]]\n=== `COUNT_DISTINCT`\nThe approximate number of distinct values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\n==== Counts are approximate\n\nComputing exact counts requires loading values into a set and returning its\nsize. This doesn't scale when working on high-cardinality sets and/or large\nvalues as the required memory usage and the need to communicate those\nper-shard sets between nodes would utilize too many resources of the cluster.\n\nThis `COUNT_DISTINCT` function is based on the\nhttps://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]\nalgorithm, which counts based on the hashes of the values with some interesting\nproperties:\n\ninclude::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]\n\n==== Precision is configurable\n\nThe `COUNT_DISTINCT` function takes an optional second parameter to configure the\nprecision discussed previously.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]\n|===\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count_distinct.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n    destcount >= 100, \"true\",\n     \"false\")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| grok dns.question.name \"%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}\"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0002.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0003.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.category == \"file\" and event.action == \"creation\"\n| stats filecount = count(file.name) by process.name,host.name\n| dissect process.name \"%{process}.%{extension}\"\n| eval proclength = length(process.name)\n| where proclength > 10\n| sort filecount,proclength desc\n| limit 10\n| keep host.name,process.name,filecount,process,extension,fullproc,proclength\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0004.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where process.name == \"curl.exe\"\n| stats bytes = sum(destination.bytes) by destination.address\n| eval kb =  bytes/1024\n| sort kb desc\n| limit 10\n| keep kb,destination.address\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0005.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metrics-apm*\n| WHERE metricset.name == \"transaction\" AND metricset.interval == \"1m\"\n| EVAL bucket = AUTO_BUCKET(transaction.duration.histogram, 50, <start-date>, <end-date>)\n| STATS avg_duration = AVG(transaction.duration.histogram) BY bucket\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0006.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM packetbeat-*\n| STATS doc_count = COUNT(destination.domain) BY destination.domain\n| SORT doc_count DESC\n| LIMIT 10\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0007.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM employees\n| EVAL hire_date_formatted = DATE_FORMAT(hire_date, \"MMMM yyyy\")\n| SORT hire_date\n| KEEP emp_no, hire_date_formatted\n| LIMIT 5\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0008.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is NOT an example of an ES|QL query:\n\n```\nPagination is not supported\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0009.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE @timestamp >= NOW() - 15 minutes\n| EVAL bucket = DATE_TRUNC(1 minute, @timestamp)\n| STATS avg_cpu = AVG(system.cpu.total.norm.pct) BY bucket, host.name\n| LIMIT 10\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0010.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM traces-apm*\n| WHERE @timestamp >= NOW() - 24 hours\n| EVAL successful = CASE(event.outcome == \"success\", 1, 0),\n  failed = CASE(event.outcome == \"failure\", 1, 0)\n| STATS success_rate = AVG(successful),\n  avg_duration = AVG(transaction.duration),\n  total_requests = COUNT(transaction.id) BY service.name\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0011.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metricbeat*\n| EVAL cpu_pct_normalized = (system.cpu.user.pct + system.cpu.system.pct) / system.cpu.cores\n| STATS AVG(cpu_pct_normalized) BY host.name\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0012.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM postgres-logs\n| DISSECT message \"%{} duration: %{query_duration} ms\"\n| EVAL query_duration_num = TO_DOUBLE(query_duration)\n| STATS avg_duration = AVG(query_duration_num)\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0013.asciidoc"
      }
    },
    {
      "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM nyc_taxis\n| WHERE DATE_EXTRACT(drop_off_time, \"hour\") >= 6 AND DATE_EXTRACT(drop_off_time, \"hour\") < 10\n| LIMIT 10\n```\n",
      "metadata": {
        "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0014.asciidoc"
      }
    }
  ],
  "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names."
}

// The `pageContent`, but not the `metadata`, is then passed back to the `LLMChain`:

[chain/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 7:chain:StuffDocumentsChain > 8:chain:LLMChain] Entering Chain run with input: {
  "question": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.",
  "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.",
  "context": "[[esql]]\n= {esql}\n\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\n\n[partintro]\n--\n\npreview::[]\n\nThe {es} Query Language ({esql}) is a query language that enables the iterative\nexploration of data.\n\nAn {esql} query consists of a series of commands, separated by pipes. Each query\nstarts with a <<esql-source-commands,source command>>. A source command produces\na table, typically with data from {es}.\n\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\"center\"]\n\nA source command can be followed by one or more\n<<esql-processing-commands,processing commands>>. Processing commands change an\ninput table by adding, removing, or changing rows and columns.\n\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\"center\"]\n\nYou can chain processing commands, separated by a pipe character: `|`. Each\nprocessing command works on the output table of the previous command.\n\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\"center\"]\n\nThe result of a query is the table produced by the final processing command.\n\n[discrete]\n[[esql-console]]\n=== Run an {esql} query\n\n[discrete]\n==== The {esql} API\n\nUse the `_query` endpoint to run an {esql} query:\n\n[source,console]\n----\nPOST /_query\n{\n  \"query\": \"\"\"\n    FROM library\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n    | STATS MAX(page_count) BY year\n    | SORT year\n    | LIMIT 5\n  \"\"\"\n}\n----\n// TEST[setup:library]\n\nThe results come back in rows:\n\n[source,console-result]\n----\n{\n  \"columns\": [\n    { \"name\": \"MAX(page_count)\", \"type\": \"integer\"},\n    { \"name\": \"year\"           , \"type\": \"date\"}\n  ],\n  \"values\": [\n    [268, \"1932-01-01T00:00:00.000Z\"],\n    [224, \"1951-01-01T00:00:00.000Z\"],\n    [227, \"1953-01-01T00:00:00.000Z\"],\n    [335, \"1959-01-01T00:00:00.000Z\"],\n    [604, \"1965-01-01T00:00:00.000Z\"]\n  ]\n}\n----\n\nBy default, results are returned as JSON. To return results formatted as text,\nCSV, or TSV, use the `format` parameter:\n\n[source,console]\n----\nPOST /_query?format=txt\n{\n  \"query\": \"\"\"\n    FROM library\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n    | STATS MAX(page_count) BY year\n    | SORT year\n    | LIMIT 5\n  \"\"\"\n}\n----\n// TEST[setup:library]\n\n[discrete]\n==== {kib}\n\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. 
Next, in\nDiscover or Lens, from the data view dropdown, select *{esql}*.\n\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\nwith the time filter.\n\n[discrete]\n[[esql-limitations]]\n=== Limitations\n\n{esql} currently supports the following <<mapping-types,field types>>:\n\n- `alias`\n- `boolean`\n- `date`\n- `double` (`float`, `half_float`, `scaled_float` are represented as `double`)\n- `ip`\n- `keyword` family including `keyword`, `constant_keyword`, and `wildcard`\n- `int` (`short` and `byte` are represented as `int`)\n- `long`\n- `null`\n- `text`\n- `unsigned_long`\n- `version`\n--\n\ninclude::esql-get-started.asciidoc[]\n\ninclude::esql-syntax.asciidoc[]\n\ninclude::esql-source-commands.asciidoc[]\n\ninclude::esql-processing-commands.asciidoc[]\n\ninclude::esql-functions.asciidoc[]\n\ninclude::aggregation-functions.asciidoc[]\n\ninclude::multivalued-fields.asciidoc[]\n\ninclude::task-management.asciidoc[]\n\n:esql-tests!:\n:esql-specs!:\n\n\n[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n\n\n[[esql-agg-count]]\n=== `COUNT`\nCounts field values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=count]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=count-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\nNOTE: There isn't yet a `COUNT(*)`. Please count a single valued field if you\n      need a count of rows.\n\n\n[[esql-agg-count-distinct]]\n=== `COUNT_DISTINCT`\nThe approximate number of distinct values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\n==== Counts are approximate\n\nComputing exact counts requires loading values into a set and returning its\nsize. 
This doesn't scale when working on high-cardinality sets and/or large\nvalues as the required memory usage and the need to communicate those\nper-shard sets between nodes would utilize too many resources of the cluster.\n\nThis `COUNT_DISTINCT` function is based on the\nhttps://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]\nalgorithm, which counts based on the hashes of the values with some interesting\nproperties:\n\ninclude::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]\n\n==== Precision is configurable\n\nThe `COUNT_DISTINCT` function takes an optional second parameter to configure the\nprecision discussed previously.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]\n|===\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n    destcount >= 100, \"true\",\n     \"false\")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| grok dns.question.name \"%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}\"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.category == \"file\" and event.action == \"creation\"\n| stats filecount = count(file.name) by process.name,host.name\n| dissect process.name \"%{process}.%{extension}\"\n| eval proclength = length(process.name)\n| where proclength > 10\n| sort filecount,proclength desc\n| limit 10\n| keep host.name,process.name,filecount,process,extension,fullproc,proclength\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where process.name == \"curl.exe\"\n| stats bytes = sum(destination.bytes) by destination.address\n| eval kb =  bytes/1024\n| sort kb desc\n| limit 10\n| keep kb,destination.address\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metrics-apm*\n| WHERE metricset.name == \"transaction\" AND metricset.interval == \"1m\"\n| EVAL bucket = AUTO_BUCKET(transaction.duration.histogram, 50, <start-date>, <end-date>)\n| STATS avg_duration = AVG(transaction.duration.histogram) BY bucket\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM packetbeat-*\n| STATS doc_count = 
COUNT(destination.domain) BY destination.domain\n| SORT doc_count DESC\n| LIMIT 10\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM employees\n| EVAL hire_date_formatted = DATE_FORMAT(hire_date, \"MMMM yyyy\")\n| SORT hire_date\n| KEEP emp_no, hire_date_formatted\n| LIMIT 5\n```\n\n\n[[esql-example-queries]]\n\nThe following is NOT an example of an ES|QL query:\n\n```\nPagination is not supported\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE @timestamp >= NOW() - 15 minutes\n| EVAL bucket = DATE_TRUNC(1 minute, @timestamp)\n| STATS avg_cpu = AVG(system.cpu.total.norm.pct) BY bucket, host.name\n| LIMIT 10\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM traces-apm*\n| WHERE @timestamp >= NOW() - 24 hours\n| EVAL successful = CASE(event.outcome == \"success\", 1, 0),\n  failed = CASE(event.outcome == \"failure\", 1, 0)\n| STATS success_rate = AVG(successful),\n  avg_duration = AVG(transaction.duration),\n  total_requests = COUNT(transaction.id) BY service.name\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metricbeat*\n| EVAL cpu_pct_normalized = (system.cpu.user.pct + system.cpu.system.pct) / system.cpu.cores\n| STATS AVG(cpu_pct_normalized) BY host.name\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM postgres-logs\n| DISSECT message \"%{} duration: %{query_duration} ms\"\n| EVAL query_duration_num = TO_DOUBLE(query_duration)\n| STATS avg_duration = AVG(query_duration_num)\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM nyc_taxis\n| WHERE DATE_EXTRACT(drop_off_time, \"hour\") >= 6 AND DATE_EXTRACT(drop_off_time, \"hour\") < 10\n| LIMIT 10\n```\n"
}

// The `LLMChain` then generates a new prompt based on the `pageContent` and passes it to the `ActionsClientLlm`, so the LLM can produce the final answer:

[llm/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 7:chain:StuffDocumentsChain > 8:chain:LLMChain > 9:llm:ActionsClientLlm] Entering LLM run with input: {
  "prompts": [
    "[{\"lc\":1,\"type\":\"constructor\",\"id\":[\"langchain\",\"schema\",\"SystemMessage\"],\"kwargs\":{\"content\":\"Use the following pieces of context to answer the users question. \\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\\n----------------\\n[[esql]]\\n= {esql}\\n\\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\\n\\n[partintro]\\n--\\n\\npreview::[]\\n\\nThe {es} Query Language ({esql}) is a query language that enables the iterative\\nexploration of data.\\n\\nAn {esql} query consists of a series of commands, separated by pipes. Each query\\nstarts with a <<esql-source-commands,source command>>. A source command produces\\na table, typically with data from {es}.\\n\\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\\\"center\\\"]\\n\\nA source command can be followed by one or more\\n<<esql-processing-commands,processing commands>>. Processing commands change an\\ninput table by adding, removing, or changing rows and columns.\\n\\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\\\"center\\\"]\\n\\nYou can chain processing commands, separated by a pipe character: `|`. Each\\nprocessing command works on the output table of the previous command.\\n\\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\\\"center\\\"]\\n\\nThe result of a query is the table produced by the final processing command.\\n\\n[discrete]\\n[[esql-console]]\\n=== Run an {esql} query\\n\\n[discrete]\\n==== The {esql} API\\n\\nUse the `_query` endpoint to run an {esql} query:\\n\\n[source,console]\\n----\\nPOST /_query\\n{\\n  \\\"query\\\": \\\"\\\"\\\"\\n    FROM library\\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\\n    | STATS MAX(page_count) BY year\\n    | SORT year\\n    | LIMIT 5\\n  \\\"\\\"\\\"\\n}\\n----\\n// TEST[setup:library]\\n\\nThe results come back in rows:\\n\\n[source,console-result]\\n----\\n{\\n  \\\"columns\\\": [\\n    { \\\"name\\\": \\\"MAX(page_count)\\\", \\\"type\\\": \\\"integer\\\"},\\n    { \\\"name\\\": \\\"year\\\"           , \\\"type\\\": \\\"date\\\"}\\n  ],\\n  \\\"values\\\": [\\n    [268, \\\"1932-01-01T00:00:00.000Z\\\"],\\n    [224, \\\"1951-01-01T00:00:00.000Z\\\"],\\n    [227, \\\"1953-01-01T00:00:00.000Z\\\"],\\n    [335, \\\"1959-01-01T00:00:00.000Z\\\"],\\n    [604, \\\"1965-01-01T00:00:00.000Z\\\"]\\n  ]\\n}\\n----\\n\\nBy default, results are returned as JSON. To return results formatted as text,\\nCSV, or TSV, use the `format` parameter:\\n\\n[source,console]\\n----\\nPOST /_query?format=txt\\n{\\n  \\\"query\\\": \\\"\\\"\\\"\\n    FROM library\\n    | EVAL year = DATE_TRUNC(1 YEARS, release_date)\\n    | STATS MAX(page_count) BY year\\n    | SORT year\\n    | LIMIT 5\\n  \\\"\\\"\\\"\\n}\\n----\\n// TEST[setup:library]\\n\\n[discrete]\\n==== {kib}\\n\\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. Next, in\\nDiscover or Lens, from the data view dropdown, select *{esql}*.\\n\\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\\nwith the time filter.\\n\\n[discrete]\\n[[esql-limitations]]\\n=== Limitations\\n\\n{esql} currently supports the following <<mapping-types,field types>>:\\n\\n- `alias`\\n- `boolean`\\n- `date`\\n- `double` (`float`, `half_f…
andrew-goldstein committed Oct 16, 2023
1 parent ef32c99 commit 06e2eb0
Showing 49 changed files with 1,673 additions and 114 deletions.
@@ -0,0 +1,74 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import { Document } from 'langchain/document';

/**
* Mock LangChain `Document`s from `knowledge_base/esql/docs`, loaded from a LangChain `DirectoryLoader`
*/
export const mockEsqlDocsFromDirectoryLoader: Document[] = [
{
pageContent:
'[[esql-agg-avg]]\n=== `AVG`\nThe average of a numeric field.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=avg]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=avg-result]\n|===\n\nThe result is always a `double` not matter the input type.\n',
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/avg.asciidoc',
},
},
];

/**
* Mock LangChain `Document`s from `knowledge_base/esql/language_definition`, loaded from a LangChain `DirectoryLoader`
*/
export const mockEsqlLanguageDocsFromDirectoryLoader: Document[] = [
{
pageContent:
"lexer grammar EsqlBaseLexer;\n\nDISSECT : 'dissect' -> pushMode(EXPRESSION);\nDROP : 'drop' -> pushMode(SOURCE_IDENTIFIERS);\nENRICH : 'enrich' -> pushMode(SOURCE_IDENTIFIERS);\nEVAL : 'eval' -> pushMode(EXPRESSION);\nEXPLAIN : 'explain' -> pushMode(EXPLAIN_MODE);\nFROM : 'from' -> pushMode(SOURCE_IDENTIFIERS);\nGROK : 'grok' -> pushMode(EXPRESSION);\nINLINESTATS : 'inlinestats' -> pushMode(EXPRESSION);\nKEEP : 'keep' -> pushMode(SOURCE_IDENTIFIERS);\nLIMIT : 'limit' -> pushMode(EXPRESSION);\nMV_EXPAND : 'mv_expand' -> pushMode(SOURCE_IDENTIFIERS);\nPROJECT : 'project' -> pushMode(SOURCE_IDENTIFIERS);\nRENAME : 'rename' -> pushMode(SOURCE_IDENTIFIERS);\nROW : 'row' -> pushMode(EXPRESSION);\nSHOW : 'show' -> pushMode(EXPRESSION);\nSORT : 'sort' -> pushMode(EXPRESSION);\nSTATS : 'stats' -> pushMode(EXPRESSION);\nWHERE : 'where' -> pushMode(EXPRESSION);\nUNKNOWN_CMD : ~[ \\r\\n\\t[\\]/]+ -> pushMode(EXPRESSION);\n\nLINE_COMMENT\n : '//' ~[\\r\\n]* '\\r'? '\\n'? -> channel(HIDDEN)\n ;\n\nMULTILINE_COMMENT\n : '/*' (MULTILINE_COMMENT|.)*? '*/' -> channel(HIDDEN)\n ;\n\nWS\n : [ \\r\\n\\t]+ -> channel(HIDDEN)\n ;\n\n\nmode EXPLAIN_MODE;\nEXPLAIN_OPENING_BRACKET : '[' -> type(OPENING_BRACKET), pushMode(DEFAULT_MODE);\nEXPLAIN_PIPE : '|' -> type(PIPE), popMode;\nEXPLAIN_WS : WS -> channel(HIDDEN);\nEXPLAIN_LINE_COMMENT : LINE_COMMENT -> channel(HIDDEN);\nEXPLAIN_MULTILINE_COMMENT : MULTILINE_COMMENT -> channel(HIDDEN);\n\nmode EXPRESSION;\n\nPIPE : '|' -> popMode;\n\nfragment DIGIT\n : [0-9]\n ;\n\nfragment LETTER\n : [A-Za-z]\n ;\n\nfragment ESCAPE_SEQUENCE\n : '\\\\' [tnr\"\\\\]\n ;\n\nfragment UNESCAPED_CHARS\n : ~[\\r\\n\"\\\\]\n ;\n\nfragment EXPONENT\n : [Ee] [+-]? DIGIT+\n ;\n\nSTRING\n : '\"' (ESCAPE_SEQUENCE | UNESCAPED_CHARS)* '\"'\n | '\"\"\"' (~[\\r\\n])*? '\"\"\"' '\"'? '\"'?\n ;\n\nINTEGER_LITERAL\n : DIGIT+\n ;\n\nDECIMAL_LITERAL\n : DIGIT+ DOT DIGIT*\n | DOT DIGIT+\n | DIGIT+ (DOT DIGIT*)? EXPONENT\n | DOT DIGIT+ EXPONENT\n ;\n\nBY : 'by';\n\nAND : 'and';\nASC : 'asc';\nASSIGN : '=';\nCOMMA : ',';\nDESC : 'desc';\nDOT : '.';\nFALSE : 'false';\nFIRST : 'first';\nLAST : 'last';\nLP : '(';\nIN: 'in';\nIS: 'is';\nLIKE: 'like';\nNOT : 'not';\nNULL : 'null';\nNULLS : 'nulls';\nOR : 'or';\nPARAM: '?';\nRLIKE: 'rlike';\nRP : ')';\nTRUE : 'true';\nINFO : 'info';\nFUNCTIONS : 'functions';\n\nEQ : '==';\nNEQ : '!=';\nLT : '<';\nLTE : '<=';\nGT : '>';\nGTE : '>=';\n\nPLUS : '+';\nMINUS : '-';\nASTERISK : '*';\nSLASH : '/';\nPERCENT : '%';\n\n// Brackets are funny. We can happen upon a CLOSING_BRACKET in two ways - one\n// way is to start in an explain command which then shifts us to expression\n// mode. Thus, the two popModes on CLOSING_BRACKET. The other way could as\n// the start of a multivalued field constant. 
To line up with the double pop\n// the explain mode needs, we double push when we see that.\nOPENING_BRACKET : '[' -> pushMode(EXPRESSION), pushMode(EXPRESSION);\nCLOSING_BRACKET : ']' -> popMode, popMode;\n\n\nUNQUOTED_IDENTIFIER\n : LETTER (LETTER | DIGIT | '_')*\n // only allow @ at beginning of identifier to keep the option to allow @ as infix operator in the future\n // also, single `_` and `@` characters are not valid identifiers\n | ('_' | '@') (LETTER | DIGIT | '_')+\n ;\n\nQUOTED_IDENTIFIER\n : '`' ( ~'`' | '``' )* '`'\n ;\n\nEXPR_LINE_COMMENT\n : LINE_COMMENT -> channel(HIDDEN)\n ;\n\nEXPR_MULTILINE_COMMENT\n : MULTILINE_COMMENT -> channel(HIDDEN)\n ;\n\nEXPR_WS\n : WS -> channel(HIDDEN)\n ;\n\n\n\nmode SOURCE_IDENTIFIERS;\n\nSRC_PIPE : '|' -> type(PIPE), popMode;\nSRC_OPENING_BRACKET : '[' -> type(OPENING_BRACKET), pushMode(SOURCE_IDENTIFIERS), pushMode(SOURCE_IDENTIFIERS);\nSRC_CLOSING_BRACKET : ']' -> popMode, popMode, type(CLOSING_BRACKET);\nSRC_COMMA : ',' -> type(COMMA);\nSRC_ASSIGN : '=' -> type(ASSIGN);\nAS : 'as';\nMETADATA: 'metadata';\nON : 'on';\nWITH : 'with';\n\nSRC_UNQUOTED_IDENTIFIER\n : SRC_UNQUOTED_IDENTIFIER_PART+\n ;\n\nfragment SRC_UNQUOTED_IDENTIFIER_PART\n : ~[=`|,[\\]/ \\t\\r\\n]+\n | '/' ~[*/] // allow single / but not followed by another / or * which would start a comment\n ;\n\nSRC_QUOTED_IDENTIFIER\n : QUOTED_IDENTIFIER\n ;\n\nSRC_LINE_COMMENT\n : LINE_COMMENT -> channel(HIDDEN)\n ;\n\nSRC_MULTILINE_COMMENT\n : MULTILINE_COMMENT -> channel(HIDDEN)\n ;\n\nSRC_WS\n : WS -> channel(HIDDEN)\n ;\n",
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/language_definition/esql_base_lexer.g4',
},
},
{
pageContent:
"DISSECT=1\nDROP=2\nENRICH=3\nEVAL=4\nEXPLAIN=5\nFROM=6\nGROK=7\nINLINESTATS=8\nKEEP=9\nLIMIT=10\nMV_EXPAND=11\nPROJECT=12\nRENAME=13\nROW=14\nSHOW=15\nSORT=16\nSTATS=17\nWHERE=18\nUNKNOWN_CMD=19\nLINE_COMMENT=20\nMULTILINE_COMMENT=21\nWS=22\nEXPLAIN_WS=23\nEXPLAIN_LINE_COMMENT=24\nEXPLAIN_MULTILINE_COMMENT=25\nPIPE=26\nSTRING=27\nINTEGER_LITERAL=28\nDECIMAL_LITERAL=29\nBY=30\nAND=31\nASC=32\nASSIGN=33\nCOMMA=34\nDESC=35\nDOT=36\nFALSE=37\nFIRST=38\nLAST=39\nLP=40\nIN=41\nIS=42\nLIKE=43\nNOT=44\nNULL=45\nNULLS=46\nOR=47\nPARAM=48\nRLIKE=49\nRP=50\nTRUE=51\nINFO=52\nFUNCTIONS=53\nEQ=54\nNEQ=55\nLT=56\nLTE=57\nGT=58\nGTE=59\nPLUS=60\nMINUS=61\nASTERISK=62\nSLASH=63\nPERCENT=64\nOPENING_BRACKET=65\nCLOSING_BRACKET=66\nUNQUOTED_IDENTIFIER=67\nQUOTED_IDENTIFIER=68\nEXPR_LINE_COMMENT=69\nEXPR_MULTILINE_COMMENT=70\nEXPR_WS=71\nAS=72\nMETADATA=73\nON=74\nWITH=75\nSRC_UNQUOTED_IDENTIFIER=76\nSRC_QUOTED_IDENTIFIER=77\nSRC_LINE_COMMENT=78\nSRC_MULTILINE_COMMENT=79\nSRC_WS=80\nEXPLAIN_PIPE=81\n'dissect'=1\n'drop'=2\n'enrich'=3\n'eval'=4\n'explain'=5\n'from'=6\n'grok'=7\n'inlinestats'=8\n'keep'=9\n'limit'=10\n'mv_expand'=11\n'project'=12\n'rename'=13\n'row'=14\n'show'=15\n'sort'=16\n'stats'=17\n'where'=18\n'by'=30\n'and'=31\n'asc'=32\n'desc'=35\n'.'=36\n'false'=37\n'first'=38\n'last'=39\n'('=40\n'in'=41\n'is'=42\n'like'=43\n'not'=44\n'null'=45\n'nulls'=46\n'or'=47\n'?'=48\n'rlike'=49\n')'=50\n'true'=51\n'info'=52\n'functions'=53\n'=='=54\n'!='=55\n'<'=56\n'<='=57\n'>'=58\n'>='=59\n'+'=60\n'-'=61\n'*'=62\n'/'=63\n'%'=64\n']'=66\n'as'=72\n'metadata'=73\n'on'=74\n'with'=75\n",
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/language_definition/esql_base_lexer.tokens',
},
},
];

/**
* Mock LangChain `Document`s from `knowledge_base/esql/example_queries`, loaded from a LangChain `DirectoryLoader`
*/
export const mockExampleQueryDocsFromDirectoryLoader: Document[] = [
{
pageContent:
'[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n destcount >= 100, "true",\n "false")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n',
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc',
},
},
{
pageContent:
'[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nfrom logs-*\n| grok dns.question.name "%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n',
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0002.asciidoc',
},
},
{
pageContent:
'[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n',
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0003.asciidoc',
},
},
];
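The mocks above mirror the shape of documents returned by a LangChain `DirectoryLoader`: one `Document` per file, with the absolute file path in `metadata.source`. As a rough, hypothetical sketch (the directory path and extension mapping are illustrative, not the plugin's actual loader code):

```typescript
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { TextLoader } from 'langchain/document_loaders/fs/text';

// Hypothetical loader, for illustration: each `.asciidoc` file becomes one
// `Document` whose `metadata.source` is the absolute file path, matching the
// shape of the mocks above.
export const loadKbDocs = async (directory: string) => {
  const loader = new DirectoryLoader(directory, {
    '.asciidoc': (path) => new TextLoader(path),
  });

  return loader.load(); // resolves to Document[]
};
```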
75 changes: 75 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/msearch_query.ts
@@ -0,0 +1,75 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { QueryDslTextExpansionQuery } from '@elastic/elasticsearch/lib/api/types';

import type { MsearchQueryBody } from '../lib/langchain/elasticsearch_store/helpers/get_msearch_query_body';

/**
* This mock Elasticsearch msearch request body contains two queries:
* - The first query is a similarity (vector) search
* - The second query is a required KB document (terms) search
*/
export const mSearchQueryBody: MsearchQueryBody = {
  body: [
    {
      index: '.kibana-elastic-ai-assistant-kb',
    },
    {
      query: {
        bool: {
          must_not: [
            {
              term: {
                'metadata.kbResource': 'esql',
              },
            },
            {
              term: {
                'metadata.required': true,
              },
            },
          ],
          must: [
            {
              text_expansion: {
                'vector.tokens': {
                  model_id: '.elser_model_2',
                  model_text:
                    'Generate an ESQL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.',
                },
              } as unknown as QueryDslTextExpansionQuery,
            },
          ],
        },
      },
      size: 1,
    },
    {
      index: '.kibana-elastic-ai-assistant-kb',
    },
    {
      query: {
        bool: {
          must: [
            {
              term: {
                'metadata.kbResource': 'esql',
              },
            },
            {
              term: {
                'metadata.required': true,
              },
            },
          ],
        },
      },
      size: 1,
    },
  ],
};
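The mock above pairs a vector (`text_expansion`) query that excludes the required KB documents with a terms query that returns only the required KB documents, so both result sets come back from a single round trip. A hypothetical usage sketch, assuming the Elasticsearch client accepts the request in this `body` shape (as the mock's `MsearchQueryBody` type suggests):

```typescript
import { Client } from '@elastic/elasticsearch';

import { mSearchQueryBody } from './msearch_query';

// Hypothetical usage sketch: fire both searches in one round trip. The first
// response contains the vector hits, the second the required KB documents.
export const runHybridSearch = async (esClient: Client) => {
  const result = await esClient.msearch(mSearchQueryBody);

  const [vectorHits, requiredKbHits] = result.responses;
  return { vectorHits, requiredKbHits };
};
```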
101 changes: 101 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/msearch_response.ts
@@ -0,0 +1,101 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { MsearchResponse } from '@elastic/elasticsearch/lib/api/types';

/**
* This mock response from an Elasticsearch msearch contains two hits, where
* the first hit is from a similarity (vector) search, and the second hit is
* from a required KB document (terms) search.
*/
export const mockMsearchResponse: MsearchResponse = {
took: 142,
responses: [
{
took: 142,
timed_out: false,
_shards: {
total: 1,
successful: 1,
skipped: 0,
failed: 0,
},
hits: {
total: {
value: 129,
relation: 'eq',
},
max_score: 21.658352,
hits: [
{
_index: '.kibana-elastic-ai-assistant-kb',
_id: 'fa1c8ba1-25c9-4404-9736-09b7eb7124f8',
_score: 21.658352,
_ignored: ['text.keyword'],
_source: {
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/source_commands/from.asciidoc',
},
vector: {
tokens: {
wild: 1.2001507,
// truncated for mock
},
model_id: '.elser_model_2',
},
text: "[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n",
},
},
],
},
status: 200,
},
{
took: 3,
timed_out: false,
_shards: {
total: 1,
successful: 1,
skipped: 0,
failed: 0,
},
hits: {
total: {
value: 14,
relation: 'eq',
},
max_score: 0.034783483,
hits: [
{
_index: '.kibana-elastic-ai-assistant-kb',
_id: '280d4882-0f64-4471-a268-669a3f8c958f',
_score: 0.034783483,
_ignored: ['text.keyword'],
_source: {
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc',
required: true,
kbResource: 'esql',
},
vector: {
tokens: {
user: 1.1084619,
// truncated for mock
},
model_id: '.elser_model_2',
},
text: '[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n destcount >= 100, "true",\n "false")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n',
},
},
],
},
status: 200,
},
],
};
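Hits like the two above are ultimately surfaced to the chain as LangChain `Document`s. A hypothetical mapping sketch (the real conversion lives in the plugin's `ElasticsearchStore` and is not shown in this file):

```typescript
import { Document } from 'langchain/document';

import { mockMsearchResponse } from './msearch_response';

// Hypothetical mapping sketch: flatten the hits from both msearch responses
// into LangChain `Document`s, mirroring the `{ text, metadata }` shape stored
// in the `.kibana-elastic-ai-assistant-kb` index.
export const mockHitsToDocuments = (): Document[] =>
  mockMsearchResponse.responses.flatMap((response) =>
    'hits' in response
      ? response.hits.hits.map(
          (hit) =>
            new Document({
              pageContent: (hit._source as { text: string }).text,
              metadata: (hit._source as { metadata: Record<string, unknown> }).metadata,
            })
        )
      : []
  );
```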
28 changes: 28 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/query_text.ts
@@ -0,0 +1,28 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

/**
* This mock query text is an example of a prompt that might be passed to
* the `ElasticsearchStore`'s `similaritySearch` function, as the `query`
* parameter.
*
* In the real world, an LLM extracted the `mockQueryText` from the
* following prompt, which includes a system prompt:
*
* ```
* You are a helpful, expert assistant who answers questions about Elastic Security. Do not answer questions unrelated to Elastic Security.
* If you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. This means you should only ever include the "filter" portion of the query.
*
* Use the following context to answer questions:
*
* Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
* ```
*
* In the example above, the LLM omitted the system prompt, such that only `mockQueryText` is passed to the `similaritySearch` function.
*/
export const mockQueryText =
'Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called follow_up that contains a value of true, otherwise, it should contain false. The user names should also be enriched with their respective group names.';
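A hypothetical call sketch showing where this query text ends up; `esStore` is a placeholder for the knowledge base vector store, and `4` is an arbitrary illustrative `k`:

```typescript
import type { VectorStore } from 'langchain/vectorstores/base';

import { mockQueryText } from './query_text';

// Hypothetical call sketch: the retriever ultimately invokes something like
// this, returning the hybrid (vector + required terms) documents as context.
export const getKbContext = async (esStore: VectorStore) => {
  return esStore.similaritySearch(mockQueryText, 4);
};
```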
28 changes: 28 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/terms.ts
@@ -0,0 +1,28 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { Field, FieldValue, QueryDslTermQuery } from '@elastic/elasticsearch/lib/api/types';

/**
* These (mock) terms may be used in multiple queries.
*
* For example, it may be used in a vector search to exclude the required `esql` KB docs.
*
* It may also be used in a terms search to find all of the required `esql` KB docs.
*/
export const mockTerms: Array<Partial<Record<Field, QueryDslTermQuery | FieldValue>>> = [
  {
    term: {
      'metadata.kbResource': 'esql',
    },
  },
  {
    term: {
      'metadata.required': true,
    },
  },
];
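A hypothetical composition sketch showing how the same terms can serve both roles described in the comment above (exclusion via `must_not` in the vector search, selection via `must` in the required-docs search):

```typescript
import { mockTerms } from './terms';

// Hypothetical composition sketch: the same terms exclude the required KB docs
// from the vector search (`must_not`) and select only the required KB docs in
// the terms search (`must`), mirroring the two queries in `mSearchQueryBody`.
export const excludeRequiredKbDocs = {
  bool: { must_not: mockTerms },
};

export const onlyRequiredKbDocs = {
  bool: { must: mockTerms },
};
```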
@@ -0,0 +1,28 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { QueryDslQueryContainer } from '@elastic/elasticsearch/lib/api/types';

/**
* This Elasticsearch query DSL is a terms search for required `esql` KB docs
*/
export const mockTermsSearchQuery: QueryDslQueryContainer = {
  bool: {
    must: [
      {
        term: {
          'metadata.kbResource': 'esql',
        },
      },
      {
        term: {
          'metadata.required': true,
        },
      },
    ],
  },
};