Support widening of numeric types in union-types #112610
Conversation
Only two lines of this PR are the actual fix. All the rest is updating the CSV-spec testing infrastructure to make it easier to test this, and adding the tests. The refactoring also involved some cleanup and simplifications. This update allows us to add alternative mappings of existing data files without copying the files and changing the header line. Some of the existing union-types test files were deleted as a result, which is a step in the right direction.
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @craigtaverner, I've created a changelog YAML for you.
I love the improvements to the csv data loader.
I think we could increase test coverage a little; and, while we're at it, tidy up the usage of widened data types in field attributes a little, since I think there are some footguns in place in the form of unwritten invariants that we might as well make explicit.
But this is a nice PR and can also be merged as-is IMHO.
@@ -63,7 +63,7 @@ protected AbstractConvertFunction(StreamInput in) throws IOException {
      * Build the evaluator given the evaluator a multivalued field.
      */
     protected final ExpressionEvaluator.Factory evaluator(ExpressionEvaluator.Factory fieldEval) {
-        DataType sourceType = field().dataType();
+        DataType sourceType = field().dataType().widenSmallNumeric();
Uhh, how did that field not end up being widened in the first place? In Analyzer.mappingAsAttributes, line 240, each field attribute's field should already have been widened, including the contained EsField's type.
What fails if we do not widen here? I'd like to figure out if this is really necessary.
Also, to avoid confusion in the future, do you think we should add an assert to FieldAttribute's constructor to ensure it only ever gets built with already widened data types?
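A minimal sketch of what such a constructor assertion could look like, assuming the constructor receives the underlying EsField in a parameter named field; the rest of the constructor and its exact signature are elided and illustrative, not the real code:

    // Sketch only: assert the invariant that a FieldAttribute is never built
    // with a small numeric type, i.e. its type already equals the result of
    // widenSmallNumeric().
    assert field.getDataType() == field.getDataType().widenSmallNumeric()
        : "FieldAttribute built with non-widened data type [" + field.getDataType() + "]";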
I was surprised this failed too. It fails later in planning; I don't remember the exact stack trace, but the error was that this function was being called with an unsupported type (the narrow type). I understood the evaluator function to be called much later, so I was surprised by this too. I could re-investigate, because perhaps there is a code path to this that is happening unnecessarily early?
I think this deserves a follow-up, because there are starting to be many places where widening is necessary - and we only find out via failed queries/tests :/ I opened #112691.
I investigated, and this exception is thrown during query execution (i.e. after local physical planning), during the setup of the type-converting block loader at https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/EsPhysicalOperationProviders.java#L384.
This is well past the analysis and planning phases, so it is clear that the types are not widened in general, but only in specific cases. A follow-up investigation makes a lot of sense.
For completeness, this is the stack trace:
org.elasticsearch.xpack.esql.EsqlIllegalArgumentException: illegal data type [short]
at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:43)
at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:39)
at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.evaluator(AbstractConvertFunction.java:69)
at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.toEvaluator(AbstractConvertFunction.java:128)
at org.elasticsearch.xpack.esql.planner.EsPhysicalOperationProviders.<init>(EsPhysicalOperationProviders.java:384)
at org.elasticsearch.xpack.esql.planner.EsPhysicalOperationProviders.getBlockLoaderFor(EsPhysicalOperationProviders.java:143)
at org.elasticsearch.xpack.esql.planner.EsPhysicalOperationProviders.lambda(EsPhysicalOperationProviders.java:123)
at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.newShard(ValuesSourceReaderOperator.java:470)
at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.positionFieldWork(ValuesSourceReaderOperator.java:195)
at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.loadFromSingleLeaf(ValuesSourceReaderOperator.java:220)
at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.process(ValuesSourceReaderOperator.java:143)
at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:76)
at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)
at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)
at org.elasticsearch.compute.operator.Driver.doRun(Driver.java:378)
     // TODO: Shouldn't we perform widening of small numerical types here?
-    if (supportedTypes.contains(type)) {
+    if (supportedTypes.contains(type.widenSmallNumeric())) {
Contrary to the funny TODO that I left here, haha, maybe we should actually widen at the point where the InvalidMappedField gets created? That'd be consistent with how we widen the types of regular EsFields in Analyzer.mappingAsAttributes. If we do so, we could enforce this as an invariant of FieldAttribute (so that when it contains an IMF, it has to be widened).
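A hedged sketch of that widen-at-creation idea, mirroring the widen-then-construct pattern used for regular fields in Analyzer.mappingAsAttributes; the variable names and the EsField constructor arguments below are illustrative, not the actual InvalidMappedField creation code:

    // Sketch only: widen once, where the field object is created, so that
    // downstream consumers (FieldAttribute, evaluators, block loaders) only
    // ever see widened numeric types such as integer and double.
    DataType widened = originalType.widenSmallNumeric();
    EsField widenedField = new EsField(fieldName, widened, properties, isAggregatable);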
required_capability: union_types_numeric_widening

FROM apps, apps_short METADATA _index
| EVAL id = id::integer
This is cool!
However, to reduce risk some more (some data types can be quirky, and stuff could happen at block loading time, I guess, with types like float and byte...), how about we add to this/replace this with a parameterized integration test that exercises all widened data types?
RestEsqlTestCase is quite nice and, like the csv tests, is executed in multiple environments (single/multi node etc.).
Tests in RestEsqlTestCase also have the benefit that we can more easily test cases with more than 2 different types becoming a union type, e.g. float, byte and, I dunno, keyword being stuffed into to_string.
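A hedged sketch of what such a parameterization could look like; every helper below is a placeholder for whatever the chosen test base class (e.g. RestEsqlTestCase) actually provides for indexing documents and running ES|QL queries, so treat this as an outline rather than working test code:

    // Sketch only: loop over the narrow numeric types that widenSmallNumeric()
    // collapses, index the same sample data under each mapping, and verify that
    // a union-typed query against the canonical integer-typed index succeeds.
    for (String narrowType : List.of("byte", "short", "half_float", "float", "scaled_float")) {
        String index = "apps_" + narrowType;
        createIndexWithIdType(index, narrowType);   // placeholder: "id" mapped as narrowType
        indexSampleDocs(index);                     // placeholder: same docs as the "apps" index
        Map<String, Object> result = runEsqlQuery(  // placeholder: run via the REST test helpers
            "FROM apps, " + index + " METADATA _index | EVAL id = id::integer | KEEP _index, id"
        );
        assertWidenedIdsMatch(result);              // placeholder: compare against the integer baseline
    }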
private static final TestsDataset EMPLOYEES = new TestsDataset("employees", "mapping-default.json", "employees.csv").noSubfields();
private static final TestsDataset HOSTS = new TestsDataset("hosts");
private static final TestsDataset APPS = new TestsDataset("apps");
private static final TestsDataset APPS_SHORT = APPS.withIndex("apps_short").withTypeMapping(Map.of("id", "short"));
Neat!
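For illustration only, the same pattern could presumably be used to cover further narrow types without duplicating the CSV data; the dataset below is a hypothetical example, not part of this PR:

    // Hypothetical: reuse apps.csv, overriding only the mapping of the "id" field.
    private static final TestsDataset APPS_BYTE = APPS.withIndex("apps_byte").withTypeMapping(Map.of("id", "byte"));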
…112821)
* Fix union-types where one index is missing the field (#111932)
When none of the indexes has the field, a validation error is correctly thrown, and when all indexes have the field, union-types works as normal. But when some indexes have the field and some do not, we were getting an internal error. We treat this case similarly to when some documents are missing the field, in which case `null` values are produced. So now a multi-index query where some indexes are missing the field will produce nulls for the documents coming from those indexes.
* Update docs/changelog/111932.yaml
* Added capability for this fix (missing-field)
* Support widening of numeric types in union-types (#112610)
* Update docs/changelog/112610.yaml
* Link capability to PR
Fixes #111277