Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support widening of numeric types in union-types #112610

Merged

Conversation

craigtaverner
Copy link
Contributor

Only two lines of this PR are the actual fix.

All the rest is updating the CSV-spec testing infrastructure to make it easier to test this, and adding the tests.

The refactoring also involved some cleanup and simplifications. This update allows us to add alternative mappings of existing data files without copying the files and changing the header line. Some of the existing union-types test files were deleted as a result, which is a step in the right direction.

Fixes #111277

Only two lines of this PR are the actual fix.
All the rest is updating the CSV-spec testing infrastructure to make it easier to test this, and adding the tests.
The refactoring involve some cleanup and simplifications also.
This update allows us to add alternative mappings of existing data files without copying the files and changing the header line.
Some of the existing union-types test files were deleted as a result, which is a step in the right direction.
@craigtaverner craigtaverner added >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Sep 6, 2024
@craigtaverner craigtaverner requested a review from a team as a code owner September 6, 2024 17:16
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the improvements to the csv data loader.

I think we could increase test coverage a little; and, while we're at it, tidy up usage of widened data types in field attributes a little, since I think there are some footguns in place in the form of unwritten invariants, that we may very well make written.

But this is a nice PR and can also be merged as-is IMHO.

@@ -63,7 +63,7 @@ protected AbstractConvertFunction(StreamInput in) throws IOException {
* Build the evaluator given the evaluator a multivalued field.
*/
protected final ExpressionEvaluator.Factory evaluator(ExpressionEvaluator.Factory fieldEval) {
DataType sourceType = field().dataType();
DataType sourceType = field().dataType().widenSmallNumeric();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhh, how did that field not end up being widened in the first place? In Analyzer.mappingAsAttributes, line 240, each field attribute's field should already have been widened, including the contained EsField's type.

What fails if we do not widen here? I'd like to figure out if this is really necessary.

Also, to avoid confusion in the future, do you think we should add an assert to FieldAttribute's constructor to ensure it only ever gets built with already widened data types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised this failed too. It fails later in planning, I don't remember the exact stack trace, but the error was that this function was being called with an unsupported type (the narrow type). I understood the evaluator function to be called much later, so I was surprised by this too. I could re-investigate because perhaps there is a code path to this that is happening unnecessarily early?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a follow up, because there begin to be many places were widening is necessary - and we only find out by failed queries/tests :/ I opened #112691.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated, and this exception is thrown down during query execution (ie. after local physical planning) during the setup of the type converting block loader at https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/EsPhysicalOperationProviders.java#L384.

This is well past any analyzing, planning phases, so it is clear that the types are not widened in general, but only in specific cases. A followup investigation makes a lot of sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, this is the stack trace:

org.elasticsearch.xpack.esql.EsqlIllegalArgumentException: illegal data type [short]
	at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:43)
	at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:39)
	at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.evaluator(AbstractConvertFunction.java:69)
	at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.toEvaluator(AbstractConvertFunction.java:128)
	at org.elasticsearch.xpack.esql.planner.EsPhysicalOperationProviders.<init>(EsPhysicalOperationProviders.java:384)
	at org.elasticsearch.xpack.esql.planner.EsPhysicalOperationProviders.getBlockLoaderFor(EsPhysicalOperationProviders.java:143)
	at org.elasticsearch.xpack.esql.planner.EsPhysicalOperationProviders.lambda(EsPhysicalOperationProviders.java:123)
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.newShard(ValuesSourceReaderOperator.java:470)
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.positionFieldWork(ValuesSourceReaderOperator.java:195)
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.loadFromSingleLeaf(ValuesSourceReaderOperator.java:220)
	at org.elasticsearch.compute.lucene.ValuesSourceReaderOperator.process(ValuesSourceReaderOperator.java:143)
	at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:76)
	at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:258)
	at org.elasticsearch.compute.operator.Driver.run(Driver.java:189)
	at org.elasticsearch.compute.operator.Driver.doRun(Driver.java:378)

Comment on lines -1226 to +1225
// TODO: Shouldn't we perform widening of small numerical types here?
if (supportedTypes.contains(type)) {
if (supportedTypes.contains(type.widenSmallNumeric())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contrary to the funny TODO that I left here, haha, maybe we should actually widen at the point where the InvalidMappedField gets created? That'd be consistent with how we widen the types of regular EsFields in Analyzer.mappingAsAttributes. If we do so, we could enforce this as an invariant of FieldAttribute (so that when it contains an IMF, it has to be widened).

required_capability: union_types_numeric_widening

FROM apps, apps_short METADATA _index
| EVAL id = id::integer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool!

However, to reduce risk some more (some data types can be quirky and stuff could happen at block loading time, I guess, to types like float and byte ...); how about we add to this/replace this by a parameterized integration test that exercises all widened data types?

RestEsqlTestCase is quite nice and, like csv tests, is executed in multiple environments (single/multi node etc.).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests in RestEsqlTestCase also have the benefit that we can more easily test cases with more than 2 different types becoming a union type. E.g. float, byte and, I dunno, keyword being stuffed into to_string.

private static final TestsDataset EMPLOYEES = new TestsDataset("employees", "mapping-default.json", "employees.csv").noSubfields();
private static final TestsDataset HOSTS = new TestsDataset("hosts");
private static final TestsDataset APPS = new TestsDataset("apps");
private static final TestsDataset APPS_SHORT = APPS.withIndex("apps_short").withTypeMapping(Map.of("id", "short"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

@craigtaverner craigtaverner merged commit bb872e6 into elastic:main Sep 11, 2024
15 checks passed
v1v added a commit to v1v/elasticsearch that referenced this pull request Sep 12, 2024
…tion-ironbank-ubi

* upstream/main: (302 commits)
  Deduplicate BucketOrder when deserializing (elastic#112707)
  Introduce test utils for ingest pipelines (elastic#112733)
  [Test] Account for auto-repairing for shard gen file (elastic#112778)
  Do not throw in task enqueued by CancellableRunner (elastic#112780)
  Mute org.elasticsearch.script.StatsSummaryTests testEqualsAndHashCode elastic#112439
  Mute org.elasticsearch.repositories.blobstore.testkit.integrity.RepositoryVerifyIntegrityIT testTransportException elastic#112779
  Use a dedicated test executor in MockTransportService (elastic#112748)
  Estimate segment field usages (elastic#112760)
  (Doc+) Inference Pipeline ignores Mapping Analyzers (elastic#112522)
  Fix verifyVersions task (elastic#112765)
  (Doc+) Terminating Exit Codes (elastic#112530)
  (Doc+) CAT Nodes default columns (elastic#112715)
  [DOCS] Augment installation warnings (elastic#112756)
  Mute org.elasticsearch.repositories.blobstore.testkit.integrity.RepositoryVerifyIntegrityIT testCorruption elastic#112769
  Bump Elasticsearch to a minimum of JDK 21 (elastic#112252)
  ESQL: Compute support for filtering ungrouped aggs (elastic#112717)
  Bump Elasticsearch version to 9.0.0 (elastic#112570)
  add CDR related data streams to kibana_system priviliges (elastic#112655)
  Support widening of numeric types in union-types (elastic#112610)
  Introduce data stream options and failure store configuration classes (elastic#109515)
  ...
davidkyle pushed a commit that referenced this pull request Sep 12, 2024
* Support widening of numeric types in union-types

Only two lines of this PR are the actual fix.
All the rest is updating the CSV-spec testing infrastructure to make it easier to test this, and adding the tests.
The refactoring involve some cleanup and simplifications also.
This update allows us to add alternative mappings of existing data files without copying the files and changing the header line.
Some of the existing union-types test files were deleted as a result, which is a step in the right direction.

* Update docs/changelog/112610.yaml

* Link capability to PR
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Sep 12, 2024
* Support widening of numeric types in union-types

Only two lines of this PR are the actual fix.
All the rest is updating the CSV-spec testing infrastructure to make it easier to test this, and adding the tests.
The refactoring involve some cleanup and simplifications also.
This update allows us to add alternative mappings of existing data files without copying the files and changing the header line.
Some of the existing union-types test files were deleted as a result, which is a step in the right direction.

* Update docs/changelog/112610.yaml

* Link capability to PR
craigtaverner added a commit that referenced this pull request Sep 13, 2024
…112821)

* Fix union-types where one index is missing the field (#111932)

* Fix union-types where one index is missing the field

When none of the indexes has the field, a validation error is correctly thrown, and when all indexes have the field, union-types works as normal.
But when some indexes have the field and some do not, we were getting and internal error.
We treat this case similarly to when some documents are missing the field, in which case `null` values are produced.
So now a multi-index query where some indexes are missing the field will produce nulls for the documents coming from those indexes.

* Update docs/changelog/111932.yaml

* Added capability for this fix (missing-field)

* Support widening of numeric types in union-types (#112610)

* Support widening of numeric types in union-types

Only two lines of this PR are the actual fix.
All the rest is updating the CSV-spec testing infrastructure to make it easier to test this, and adding the tests.
The refactoring involve some cleanup and simplifications also.
This update allows us to add alternative mappings of existing data files without copying the files and changing the header line.
Some of the existing union-types test files were deleted as a result, which is a step in the right direction.

* Update docs/changelog/112610.yaml

* Link capability to PR
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: union types not treating widened data types correctly
3 participants