
GROUP BY size restriction from second aggregation and up #1081

Closed
alexferl opened this issue Mar 29, 2021 · 4 comments
Labels: enhancement (New feature or request), SQL

Comments

@alexferl

It seems like issue #276 is back, or I am doing something wrong.

On a 1.13.1 cluster, explaining the following query:

SELECT field1, field2, field3 FROM `2021-03-25-index` WHERE field0 = 'value' GROUP BY field1, field2, field3

returns the following:

{
  "from": 0,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "must": [
              {
                "term": {
                  "field0": {
                    "value": "value",
                    "boost": 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "_source": {
    "includes": [
      "field1",
      "field2",
      "field3"
    ],
    "excludes": []
  },
  "stored_fields": [
    "field1",
    "field2",
    "field3"
  ],
  "aggregations": {
    "field1": {
      "terms": {
        "field": "field1",
        "size": 200,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "field2": {
          "terms": {
            "field": "field2",
            "size": 10,
            "min_doc_count": 1,
            "shard_min_doc_count": 0,
            "show_term_doc_count_error": false,
            "order": [
              {
                "_count": "desc"
              },
              {
                "_key": "asc"
              }
            ]
          },
          "aggregations": {
            "field3": {
              "terms": {
                "field": "field3",
                "size": 10,
                "min_doc_count": 1,
                "shard_min_doc_count": 0,
                "show_term_doc_count_error": false,
                "order": [
                  {
                    "_count": "desc"
                  },
                  {
                    "_key": "asc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
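
For reference, an explain request along these lines produces the output above (the endpoint is the SQL plugin's explain API; the query is the one quoted earlier):

POST _opendistro/_sql/_explain
{
  "query": "SELECT field1, field2, field3 FROM `2021-03-25-index` WHERE field0 = 'value' GROUP BY field1, field2, field3"
}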

I have the following settings on the cluster:

{
  "persistent": {},
  "transient": {
    "opendistro": {
      "query": {
        "size_limit": "10000"
      },
      "sql": {
        "cursor": {
          "enabled": "true",
          "fetch_size": "10000",
          "keep_alive": "5m"
        },
        "engine": {
          "new": {
            "enabled": "true"
          }
        },
        "query": {
          "analysis": {
            "enabled": "false"
          }
        }
      }
    }
  }
}
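
For reference, transient settings like these are applied with the standard cluster settings API, using the flattened form of the keys shown above, e.g.:

PUT _cluster/settings
{
  "transient": {
    "opendistro.query.size_limit": "10000"
  }
}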

According to #730, the bucket size should be 1000 instead of 10. My query size (10000) is also not honored, but it seems that issue is already reported in #1039.

Any ideas?

Thank you!

@chloe-zh
Member

chloe-zh commented Mar 29, 2021

Hi @alexferl, thanks for reporting this issue. It looks like your query fell back to the old engine, which is why the explanation differs from the current behavior. Here is what it looks like in the new engine:

POST _opendistro/_sql/_explain
{
  "query": "SELECT OriginCountry, DestCountry, Carrier FROM kibana_sample_data_flights WHERE OriginWeather = 'Sunny' GROUP BY OriginCountry, DestCountry, Carrier"
}

{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[OriginCountry, DestCountry, Carrier]"
    },
    "children": [
      {
        "name": "ElasticsearchIndexScan",
        "description": {
          "request": """ElasticsearchQueryRequest(indexName=kibana_sample_data_flights, sourceBuilder={"from":0,"size":0,"timeout":"1m","query":{"term":{"OriginWeather":{"value":"Sunny","boost":1.0}}},"sort":[{"_doc":{"order":"asc"}}],"aggregations":{"composite_buckets":{"composite":{"size":1000,"sources":[{"OriginCountry":{"terms":{"field":"OriginCountry","missing_bucket":true,"order":"asc"}}},{"DestCountry":{"terms":{"field":"DestCountry","missing_bucket":true,"order":"asc"}}},{"Carrier":{"terms":{"field":"Carrier","missing_bucket":true,"order":"asc"}}}]}}}}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

Your query syntax should be accepted by the new engine, so I am not sure whether your new-engine switch is off. Could you take a look at your plugin settings to double-check? If the switch is off, please turn on the new engine following this manual: https://github.com/opendistro-for-elasticsearch/sql/blob/develop/docs/user/admin/settings.rst#opendistro-sql-engine-new-enabled Thanks!
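
A minimal sketch of that request, assuming the transient setting key from the manual section linked above:

PUT _cluster/settings
{
  "transient": {
    "opendistro.sql.engine.new.enabled": "true"
  }
}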

@alexferl
Author

Hi @chloe-zh,

I've tried this and it's still using the old engine.

I get the following response when I PUT the setting, so it seems like it should work:

{
  "acknowledged": true,
  "persistent": {},
  "transient": {
    "opendistro": {
      "sql": {
        "engine": {
          "new": {
            "enabled": "true"
          }
        }
      }
    }
  }
}

but I still get the same old-style query with nested aggregations, and I get missing data when I actually run the query because of the size=10 limit.

Thank you!

@chloe-zh
Member

Sorry, I missed your info saying that the new engine is already enabled. Is this the exact query you are using? Could you provide some sample data so that we can try to reproduce and locate the issue on our side?

@dai-chen
Member

@alexferl @chloe-zh It seems the fallback will happen if the cursor is enabled: https://github.com/opendistro-for-elasticsearch/sql/blob/develop/legacy/src/main/java/com/amazon/opendistroforelasticsearch/sql/legacy/plugin/RestSqlAction.java#L151

I think this check was added because, once the cursor is enabled, a request is treated as a cursor request even if it has no fetch_size field. I am trying to figure out whether we can make the new engine usable while the cursor is enabled without breaking anything. Thanks!
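
If that check is indeed the cause, a possible workaround in the meantime (untested, and it turns off pagination for all SQL queries) would be disabling the cursor transiently:

PUT _cluster/settings
{
  "transient": {
    "opendistro.sql.cursor.enabled": "false"
  }
}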
