Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ES|QL] Combine Disjunctive CIDRMatch #111501

Merged
merged 11 commits into from
Aug 8, 2024

Conversation

fang-xing-esql
Copy link
Member

@fang-xing-esql fang-xing-esql commented Aug 1, 2024

Resolves #105143

This PR leverages CombineDisjunctionsToIn as it is very similar with combining multiple Equalss or Ins into In, and efficient if do it in one pass.

@elasticsearchmachine
Copy link
Collaborator

Hi @fang-xing-esql, I've created a changelog YAML for you.

@fang-xing-esql fang-xing-esql marked this pull request as ready for review August 1, 2024 15:28
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 1, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@alex-spies alex-spies self-requested a review August 2, 2024 07:35
Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit on the fence about reusing/extending this rule, but it does make some sense as "generic" OR-combiner; and then also rename to drop the ...ToIn.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok re-using this rule to also cover CIDRs. The final result of these combinations (either == or IN or CIDR) have a common result - the use of a terms query.

But:

  • the javadoc of the class must be updated to reflect the new reality
  • other comments in the class itself need update or new comments added
  • an additional test I think it's needed in CombineDisjunctionsToInTests that cover more complex queries: WHERE host == "alpha" OR host == "gamma" OR CIDR_MATCH(ip1, "127.0.0.2/32") OR CIDR_MATCH(ip1, "127.0.0.3/32") OR card IN ("eth0", "eth1") OR card == "lo0" OR CIDR_MATCH(ip0, "127.0.0.1") OR CIDR_MATCH(ip0, "fe80::cae2:65ff:fece:feb9") OR host == "beta"
  • use the same test as above in the csv-spec file
  • look into why the query above is translated into
{
  "bool": {
    "should": [
      {
        "esql_single_value": {
          "field": "host",
          "next": {
            "terms": {
              "host": [
                "alpha",
                "gamma",
                "beta"
              ],
              "boost": 1
            }
          },
          "source": "host@1:20"
        }
      },
      {
        "esql_single_value": {
          "field": "card",
          "next": {
            "terms": {
              "card": [
                "eth0",
                "eth1",
                "lo0"
              ],
              "boost": 1
            }
          },
          "source": "card@1:128"
        }
      },
      {
        "bool": {
          "should": [
            {
              "esql_single_value": {
                "field": "ip1",
                "next": {
                  "terms": {
                    "ip1": [
                      "127.0.0.2/32",
                      "127.0.0.3/32"
                    ],
                    "boost": 1
                  }
                },
                "source": "ip1@1:69"
              }
            },
            {
              "esql_single_value": {
                "field": "ip0",
                "next": {
                  "terms": {
                    "ip0": [
                      "127.0.0.1",
                      "fe80::cae2:65ff:fece:feb9"
                    ],
                    "boost": 1
                  }
                },
                "source": "ip0@1:184"
              }
            }
          ],
          "boost": 1
        }
      }
    ],
    "boost": 1
  }
}

There is an additional bool query inside but I don't think it should be there (there should be only one bool query with 4 should statements).

  • this query has a strange behavior, maybe we can improve it: WHERE host == "alpha" OR host == "gamma" OR ip1 IN ("127.0.0.2/32"::ip) OR CIDR_MATCH(ip1, "127.0.0.3/32") OR card IN ("eth0", "eth1") OR card == "lo0" OR ip0 IN ("127.0.0.1"::ip) OR CIDR_MATCH(ip0, "fe80::cae2:65ff:fece:feb9") OR host == "beta"

is translated into (ip0 and ip1 are not included in the same terms query, even though they could; yes, they are coming from ip0 IN .... and cidr_match(ip0...) but the resulting queries could be combined in a better way):

{
  "bool": {
    "should": [
      {
        "esql_single_value": {
          "field": "card",
          "next": {
            "terms": {
              "card": [
                "eth0",
                "eth1",
                "lo0"
              ],
              "boost": 1
            }
          },
          "source": "card@1:124"
        }
      },
      {
        "esql_single_value": {
          "field": "ip0",
          "next": {
            "term": {
              "ip0": {
                "value": "127.0.0.1"
              }
            }
          },
          "source": "ip0@1:169"
        }
      },
      {
        "esql_single_value": {
          "field": "host",
          "next": {
            "terms": {
              "host": [
                "alpha",
                "gamma",
                "beta"
              ],
              "boost": 1
            }
          },
          "source": "host@1:20"
        }
      },
      {
        "bool": {
          "should": [
            {
              "esql_single_value": {
                "field": "ip1",
                "next": {
                  "terms": {
                    "ip1": [
                      "127.0.0.3/32"
                    ],
                    "boost": 1
                  }
                },
                "source": "ip1@1:100"
              }
            },
            {
              "esql_single_value": {
                "field": "ip0",
                "next": {
                  "terms": {
                    "ip0": [
                      "fe80::cae2:65ff:fece:feb9"
                    ],
                    "boost": 1
                  }
                },
                "source": "ip0@1:208"
              }
            }
          ],
          "boost": 1
        }
      }
    ],
    "boost": 1
  }
}

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this looks good to me. Bogdan pointed out some good additional edge cases that I also think we should add tests for.

One thing I think would be great is tests that show that the combined CIDR_MATCH now gets pushed down to Lucene; I added one in #105061, it'd be great to have another one like that that comes with different kinds of disjunctions (and conjunctions, for good measure) of CIDR_MATCHes.

I think previously, we were not able to push down multiple CIDR_MATCHes (or at least, not as a single terms query), but combined this should now be possible.

@fang-xing-esql
Copy link
Member Author

fang-xing-esql commented Aug 6, 2024

Thanks to @astefan @bpintea and @alex-spies for reviewing! These are really good test cases!!

The disjunctive or(tree) we build is not a left or right deep tree, it is a balance tree in stead, ExpressionTranslator's BinaryLogic can be improved to handle this better. Today if both LHS and RHS of an Or predicate are BoolQuerys, they may not be combined into one BoolQuery, the BinaryLogic is updated to allow it in this PR.

The CombineDisjuctions rule is also updated to combine expressions(Equals, In and CIDRMatch) on IP field.

@fang-xing-esql
Copy link
Member Author

There is an additional bool query inside but I don't think it should be there (there should be only one bool query with 4 should statements).

  • this query has a strange behavior, maybe we can improve it: WHERE host == "alpha" OR host == "gamma" OR ip1 IN ("127.0.0.2/32"::ip) OR CIDR_MATCH(ip1, "127.0.0.3/32") OR card IN ("eth0", "eth1") OR card == "lo0" OR ip0 IN ("127.0.0.1"::ip) OR CIDR_MATCH(ip0, "fe80::cae2:65ff:fece:feb9") OR host == "beta"

Thank you for catching these interesting problems! Both queries are added to either CsvTests or QueryTranslatorTests. CombineDisjunctions and ExpressionTranslators.BinaryLogic() are updated to support them.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Left only few minor remarks.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I have a minor testing suggestion, but this is already good as-is IMO.

@alex-spies
Copy link
Contributor

Thank you for the iterations @fang-xing-esql !

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fang-xing-esql fang-xing-esql merged commit 9c6a2e0 into elastic:main Aug 8, 2024
15 checks passed
@fang-xing-esql fang-xing-esql deleted the CombineCIDRMatch branch August 8, 2024 16:04
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Aug 9, 2024
* upstream/main: (22 commits)
  Prune changelogs after 8.15.0 release
  Bump versions after 8.15.0 release
  EIS integration (elastic#111154)
  Skip LOOKUP/INLINESTATS cases unless on snapshot (elastic#111755)
  Always enforce strict role validation (elastic#111056)
  Mute org.elasticsearch.xpack.esql.analysis.VerifierTests testUnsupportedAndMultiTypedFields elastic#111753
  [ML] Force time shift integration test (elastic#111620)
  ESQL: Add tests for sort, where with unsupported type (elastic#111737)
  [ML] Force time shift documentation (elastic#111668)
  Fix remote cluster credential secure settings reload   (elastic#111535)
  ESQL: Fix for overzealous validation in case of invalid mapped fields (elastic#111475)
  Pass allow security manager flag in gradle test policy setup plugin (elastic#111725)
  Rename streamContent/Separator to bulkContent/Separator (elastic#111716)
  Mute org.elasticsearch.tdigest.ComparisonTests testSparseGaussianDistribution elastic#111721
  Remove 8.14 from branches.json
  Only emit product origin in deprecation log if present (elastic#111683)
  Forward port release notes for v8.15.0 (elastic#111714)
  [ES|QL] Combine Disjunctive CIDRMatch (elastic#111501)
  ESQL: Remove qualifier from attrs (elastic#110581)
  Force using the last centroid during merging (elastic#111644)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceNamedWriteablesProvider.java
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
* support CIDRMatch in CombineDisjunctions
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024
* support CIDRMatch in CombineDisjunctions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: Optimize multiple CIDR_MATCHes
6 participants