Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] MAX/MIN functions does not work properly with array values #3138

Open
normanj-bitquill opened this issue Oct 29, 2024 · 9 comments
Open
Labels
bug Something isn't working

Comments

@normanj-bitquill
Copy link
Contributor

What is the bug?
If MAX or MIN are applied to a field with array values, then the MAX or MIN of any element in any of the array values is returned.

Consider an index with the following data:

{"x": 1, "y": [1, 2]}
{"x": 2, "y": [3, 4]}
{"x": 3, "y": [1, 5]}
{"x": 4, "y": [1, 2]}
{"x": 5, "y": [2, 3]}

and this query:

SELECT MAX(y) FROM test3;
5

or this query:

SELECT MIN(y) FROM test3;
1

For MAX, the expectd result is [3, 4] and for MIN, the expected result is [1, 2]. The comparison should be performed element by element and preserve the entire array value.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create a new index and load the data above
  2. Run the query above

What is the expected behavior?
Use array comparisons to compare the array values and return the MAX or MIN array.

What is your host/environment?

  • OS: Mac OS X
  • Version: 3.0 code base
  • Plugins: SQL plugin

Do you have any screenshots?
N/A

Do you have any additional context?
Issue #1300 had a change recently merged in that allows array values to be used in query evaluation and in the result set.

@normanj-bitquill normanj-bitquill added bug Something isn't working untriaged labels Oct 29, 2024
@normanj-bitquill
Copy link
Contributor Author

This appears to be due to how the OpenSearch engine calculates the min and max for arrays values.

POST test3/_search
{
  "size": 0,
  "aggs": {
    "y_min": {
      "min": {
        "field": "y"
      }
    }
  }
}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "y_min": {
      "value": 1
    }
  }
}
POST test3/_search
{
  "size": 0,
  "aggs": {
    "y_max": {
      "max": {
        "field": "y"
      }
    }
  }
}
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "y_max": {
      "value": 5
    }
  }
}

@normanj-bitquill
Copy link
Contributor Author

For reference, this is how PostgreSQL behaves:

SELECT * FROM with_array;
 x |   y
---+-------
 1 | {1,2}
 2 | {3,4}
 3 | {1,5}
 4 | {1,2}
(4 rows)
SELECT MIN(y) FROM with_array;
  min
-------
 {1,2}
(1 row)
SELECT MAX(y) FROM with_array;
  max
-------
 {3,4}
(1 row)

@penghuo
Copy link
Collaborator

penghuo commented Oct 30, 2024

Thanks for rasing issue. OpenSearch/Lucene does not support ARRAY data type. Lucene allows adding multiple values for the same field name. for example. when calcualte min/max of field y, OpenSeach read all the value of y, and calcuate metrics, the result is min(y)=1, max(y)=5.

field,value
y,1
y,2
y,3
y,4
y,1
y,5
y,1
y,2
y,2
y,3

@penghuo
Copy link
Collaborator

penghuo commented Oct 30, 2024

Related issue in opensearch-core. opensearch-project/OpenSearch#16420

@penghuo
Copy link
Collaborator

penghuo commented Oct 30, 2024

One solution is if array_field is used in aggregation, we should do post-processing, instead of rewrite as DSL aggregation query.

@normanj-bitquill
Copy link
Contributor Author

@penghuo I'm not sure the SQL plugin can easily know that the field is an array field. By default, the mapping will only contain information about the element types.

GET my_index/mapping
{
  "properties": {
    "id": {
      "type": "long"
    },
    "array_field": {
      "type": "long"
    }
  }
}

If the user adds something to the mapping, we could know that it is an array.
https://trino.io/docs/current/connector/elasticsearch.html#array-types

Is there anything similar currently in OpenSearch?

@penghuo
Copy link
Collaborator

penghuo commented Oct 31, 2024

@normanj-bitquill agree, opensearch-project/OpenSearch#16420 discuss similar approach.

@andrross
Copy link
Member

[Catch All Triage - 1, 2, 3, 4, 5]

@YANG-DB
Copy link
Member

YANG-DB commented Dec 16, 2024

@normanj-bitquill @acarbonetto I've seen the approach suggested here.

IMO we can utilize mapping._meta#field.name.type:array location for the user to explicitly hint the specific mapping field is expected to be treated and an array.

indexing: ingestion a mismatched value will not cause an error (if this field only contains a single value) but the query engine (PPL/SQL) should treat this value as a single list.

IMO in the first iteration we can have the plugin be aware of the mapping._meta#field.name.type:array and actually do the aggregation in the client's scope (not pushing down).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants