ES|QL: Add multi-node Spec tests #98849

Merged 13 commits on Sep 8, 2023

Conversation

luigidellaquila
Contributor

@luigidellaquila luigidellaquila commented Aug 24, 2023

Adding a module to run multi-node Spec tests, see #98731

The first run spotted a small serialization problem: InvalidMappedField was not properly mapped (fixed in this PR).

WIP: there seem to be more serialization problems that make the tests fail:

  • DocBlock is not mapped for serialization
  • assertion failures on block name checks

@ChrisHegarty
Contributor

> assertion failures on block name checks

PR #98938 will fix this.

@ChrisHegarty
Contributor

@luigidellaquila you wanna separate out the changes to PlanNamedTypes ? It would be good to add a unit test to PlanNamedTypesTests, for this too.

@luigidellaquila
Contributor Author

> @luigidellaquila you wanna separate out the changes to PlanNamedTypes ?

Yes definitely, this PR is mostly to outline possible problems in multi-node execution, but the single fixes will be in separate PRs

@luigidellaquila
Contributor Author

Current failures are due to non-deterministic results for queries without SORT, see #99045

@luigidellaquila
Contributor Author

Problems related to non-deterministic CSV tests seem to be fixed.

Current failures are due to serialization of DocBlock or to ClassCastException (DocBlock cannot be cast to <something>Block)

@luigidellaquila luigidellaquila marked this pull request as ready for review September 6, 2023 13:36
@elasticsearchmachine elasticsearchmachine added the Team:QL (Deprecated) Meta label for query languages team label Sep 6, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@dnhatn
Member

dnhatn commented Sep 6, 2023

@luigidellaquila @costin

I believe we are serializing DocBlock because the LocalExecutionPlanner is either selecting the wrong projection mask or skipping the projection incorrectly. I haven't dug into the details yet. Below are two examples of this failure.

ExchangeSinkExec[[emp_no{f}#8, rehired{r}#4, is_rehired{f}#9]]
\_FragmentExec[filter=null, estimatedRowSize=0, fragment=[<>
Project[[emp_no{f}#8, rehired{r}#4, is_rehired{f}#9]]
\_TopN[[Order[emp_no{f}#8,ASC,LAST]],5[INTEGER]]
  \_Eval[[TOSTRING(is_rehired{f}#9) AS rehired]]
    \_EsRelation[employees][emp_no{f}#8, is_rehired{f}#9]<>]]


[2023-09-06T09:29:35,844][INFO ][o.e.x.e.p.ComputeService ] [javaRestTest-2] Local physical plan:
ExchangeSinkExec[[emp_no{f}#8, rehired{r}#4, is_rehired{f}#9]]
\_ProjectExec[[emp_no{f}#8, rehired{r}#4, is_rehired{f}#9]]
  \_TopNExec[[Order[emp_no{f}#8,ASC,LAST]],5[INTEGER],0]
    \_FieldExtractExec[emp_no{f}#8]
      \_EvalExec[[TOSTRING(is_rehired{f}#9) AS rehired]]
        \_FieldExtractExec[is_rehired{f}#9]
          \_EsQueryExec[employees], query[][_doc{f}#4], limit[], sort[] estimatedRowSize[59]


[2023-09-06T09:29:35,863][INFO ][o.e.x.e.p.ComputeService ] [javaRestTest-2] Local execution plan:
DriverFactory(instances = 1, type = DATA_PARALLELISM)
\_LuceneSourceOperator[dataPartitioning = SEGMENT, maxPageSize = 4443, limit = 2147483647]
\_ValuesSourceReaderOperator[field = is_rehired]
\_EvalOperator[evaluator=ToString[field=Attribute[channel=1]]]
\_ValuesSourceReaderOperator[field = emp_no]
\_TopNOperator[count = 5, sortOrders = [SortOrder[channel=3, asc=true, nullsFirst=false, encoder=DefaultEncoder]]]
\_ProjectOperator[mask = {0, 1, 3}]
\_ExchangeSinkOperator

The correct projection mask should be {1, 2, 3} instead of {0, 1, 3}. In the case below, the projection was mistakenly removed.

[2023-09-06T12:29:00,220][INFO ][o.e.x.e.p.ComputeService ] [javaRestTest-1] Received physical plan:
ExchangeSinkExec[[emp_no{f}#2069, job_positions{f}#2070]]
\_FragmentExec[filter=null, estimatedRowSize=0, fragment=[<>
Project[[emp_no{f}#2069, job_positions{f}#2070]]
\_TopN[[Order[emp_no{f}#2069,ASC,LAST]],6[INTEGER]]
  \_Filter[NOT(job_positions{f}#2070 < [43][KEYWORD])]
    \_EsRelation[employees][emp_no{f}#2069, job_positions{f}#2070]<>]]


[2023-09-06T12:29:00,220][INFO ][o.e.x.e.p.ComputeService ] [javaRestTest-1] Local physical plan:
ExchangeSinkExec[[emp_no{f}#2069, job_positions{f}#2070]]
\_ProjectExec[[emp_no{f}#2069, job_positions{f}#2070]]
  \_FieldExtractExec[emp_no{f}#2069]
    \_EsQueryExec[employees], query[{"esql_single_value":{"field":"job_positions","next":{"bool":{"must_not":[{"range":{"job_positions":{"lt":"C","boost":1.0}}}],"boost":1.0}}}}][_doc{f}#2070], limit[6], sort[[FieldSort[field=emp_no{f}#2069, direction=ASC, nulls=LAST]]] estimatedRowSize[20]
[2023-09-06T12:29:00,220][INFO ][o.e.x.e.p.LocalExecutionPlanner] [javaRestTest-1] --> query [_doc{f}#2070]
[2023-09-06T12:29:00,221][WARN ][o.e.x.e.p.LocalExecutionPlanner] [javaRestTest-1] --> projection [emp_no{f}#2069, job_positions{f}#2070] layout BlockLayout{layout={2069=1, 2070=0}, numberOfChannels=2}
[2023-09-06T12:29:00,221][INFO ][o.e.x.e.p.LocalExecutionPlanner] [javaRestTest-1] --> selected field emp_no{f}#2069 2069
[2023-09-06T12:29:00,221][INFO ][o.e.x.e.p.LocalExecutionPlanner] [javaRestTest-1] --> selected field job_positions{f}#2070 2070


[2023-09-06T12:29:00,221][INFO ][o.e.x.e.p.ComputeService ] [javaRestTest-1] Local execution plan:
DriverFactory(instances = 1, type = DATA_PARALLELISM)
\_LuceneTopNSourceOperator[dataPartitioning = SEGMENT, maxPageSize = 13107, limit = 6, sorts = [{"emp_no":{"order":"asc","missing":"_last","unmapped_type":"integer"}}]]
\_ValuesSourceReaderOperator[field = emp_no]
\_ExchangeSinkOperator

I won't be able to spend more time on this until tomorrow.

@dnhatn
Member

dnhatn commented Sep 6, 2023

\_ProjectExec[[emp_no{f}#2069, job_positions{f}#2070]]
-> query [_doc{f}#2070]

It seems that a conflicting nameId is the issue here: the _doc attribute (i.e., the DocBlock) has the same nameId as the job_positions field.
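A minimal sketch of how such a clash could surface, assuming the layout is a map from name id to channel, as the `BlockLayout{layout={2069=1, 2070=0}, numberOfChannels=2}` log line suggests. The class `LayoutCollisionSketch`, the record `Attr`, and the methods `layoutOf` and `resolve` are hypothetical names for illustration, not the actual ES|QL planner code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not the real ES|QL classes.
final class LayoutCollisionSketch {
    record Attr(String name, long nameId) {}

    // Map each output attribute's name id to the channel (position) of its block.
    static Map<Long, Integer> layoutOf(List<Attr> outputs) {
        Map<Long, Integer> layout = new HashMap<>();
        for (int channel = 0; channel < outputs.size(); channel++) {
            layout.put(outputs.get(channel).nameId(), channel);
        }
        return layout;
    }

    // Resolve each projected attribute to a channel purely by name id.
    static List<Integer> resolve(Map<Long, Integer> layout, List<Attr> projection) {
        List<Integer> channels = new ArrayList<>();
        for (Attr attr : projection) {
            channels.add(layout.get(attr.nameId()));
        }
        return channels;
    }

    public static void main(String[] args) {
        // Blocks actually produced on the data node: the doc block, then emp_no.
        Attr doc = new Attr("_doc", 2070);
        Attr empNo = new Attr("emp_no", 2069);
        Map<Long, Integer> layout = layoutOf(List.of(doc, empNo));

        // Projection received from the coordinator; job_positions reuses id 2070.
        Attr jobPositions = new Attr("job_positions", 2070);
        System.out.println(resolve(layout, List.of(empNo, jobPositions)));
        // prints [1, 0]: job_positions resolves to channel 0, which holds the doc block
    }
}
```

The first example in the previous comment shows the same pattern: `rehired{r}#4` and `_doc{f}#4` share id 4, which would be consistent with channel 0 appearing in the mask `{0, 1, 3}`.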

@costin
Member

costin commented Sep 7, 2023

Kudos to @dnhatn for figuring out the issue: it relates to the deserialization of a NameId in the plan sent from the coordinator to the data nodes. When running in a multi-node cluster, each node has its own counter and, depending on how the queries get executed, the NameIds sometimes clash.

Currently the serialization saves the id as a long, so when it is deserialized on the other node it can and will collide with the NameId of newly created attributes.
In practice this means the wrong block is used which leads to different errors:

  • unexpected value encountered if the type is the same
  • CCE if the types are different
  • serialization of DocBlock

Here's an example - a plan is sent to the data node.
The node EsQueryExec is instantiated which creates a new FieldAttribute which gets assigned a new NameId - say 1000.
A Project is deserialized and, along with it, an attribute that happens to have NameId 1000 on the coordinator node. The same id is used on the data node, which makes the two attributes equivalent even though they have different names and data types, resulting in 💥
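The scenario above can be modelled in a few lines. This is a sketch under stated assumptions, not the actual ES|QL code: `Node`, `Attr`, `readRaw`, and `RemappingReader` are hypothetical names, the counters start at 1000 only to mirror the example, and the remapping reader is one possible remedy rather than a description of the fix in #99295.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model for illustration only; names do not mirror the real ES|QL classes.
final class NameIdClashSketch {
    record Attr(String name, String type, long id) {}

    // Each node (JVM) owns an independent id counter.
    static final class Node {
        private final AtomicLong counter;

        Node(long firstId) {
            this.counter = new AtomicLong(firstId);
        }

        // Create an attribute locally with a fresh id from this node's counter.
        Attr newAttr(String name, String type) {
            return new Attr(name, type, counter.getAndIncrement());
        }

        // Naive deserialization: reuse the raw id written by the sending node.
        Attr readRaw(String name, String type, long wireId) {
            return new Attr(name, type, wireId);
        }
    }

    // Assumed remedy: map every id seen on the wire to a fresh local id, so repeated
    // references stay equal but cannot clash with locally created attributes.
    static final class RemappingReader {
        private final Node node;
        private final Map<Long, Attr> byWireId = new HashMap<>();

        RemappingReader(Node node) {
            this.node = node;
        }

        Attr read(String name, String type, long wireId) {
            return byWireId.computeIfAbsent(wireId, id -> node.newAttr(name, type));
        }
    }

    public static void main(String[] args) {
        Node coordinator = new Node(1000);
        Node dataNode = new Node(1000);

        Attr sent = coordinator.newAttr("job_positions", "keyword"); // id 1000 on the coordinator
        Attr local = dataNode.newAttr("_doc", "doc");                // id 1000 on the data node

        Attr naive = dataNode.readRaw(sent.name(), sent.type(), sent.id());
        System.out.println(naive.id() == local.id()); // true: two different attributes share an id

        Attr remapped = new RemappingReader(dataNode).read(sent.name(), sent.type(), sent.id());
        System.out.println(remapped.id() == local.id()); // false: remapped to a fresh local id
    }
}
```

Keying the remap table by the wire id keeps references to the same remote attribute consistent within one deserialized plan, which any remapping scheme would need to preserve.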

@dnhatn
Member

dnhatn commented Sep 7, 2023

Our integration tests (i.e., EsqlActionIT) don't identify this issue because the test Elasticsearch nodes run in the same JVM and share a single global NameId counter.
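For contrast, a small sketch of why a shared counter hides the problem, assuming ids are drawn from one JVM-wide counter. `SharedCounterSketch`, `nextId`, and `distinctIds` are hypothetical names, not part of the test framework.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the real test infrastructure.
final class SharedCounterSketch {
    // One counter per JVM: every in-process "node" draws its ids from it.
    private static final AtomicLong GLOBAL = new AtomicLong();

    static long nextId() {
        return GLOBAL.getAndIncrement();
    }

    // Simulate several in-process nodes creating attributes and count the distinct ids.
    static int distinctIds(int nodes, int attrsPerNode) {
        Set<Long> ids = new HashSet<>();
        for (int n = 0; n < nodes; n++) {
            for (int a = 0; a < attrsPerNode; a++) {
                ids.add(nextId());
            }
        }
        return ids.size();
    }

    public static void main(String[] args) {
        // Three simulated nodes, 100 attributes each: no id is ever handed out twice.
        System.out.println(distinctIds(3, 100)); // prints 300
    }
}
```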

@ChrisHegarty
Contributor

Thank you so much @dnhatn and @costin - great sleuthing!!! The following PR has been raised to fix this issue - #99295

@ChrisHegarty
Contributor

With the NameId issue resolved, I still see two randomly failing tests, which appear to be unrelated to NameId or serialisation. They are:

2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:esql:qa:server:multi-node:javaRestTest' --tests "org.elasticsearch.xpack.esql.qa.multi_node.EsqlSpecIT" -Dtests.method="test {stats.ByUnmentionedIntAndLong}" -Dtests.seed=4F37F967DF2BCC04 -Dtests.locale=es-SV -Dtests.timezone=Europe/Busingen -Druntime.java=20
2> org.junit.ComparisonFailure: expected:<[null]> but was:<[1]>
  at __randomizedtesting.SeedInfo.seed([4F37F967DF2BCC04:C763C6BD71D7A1FC]:0)
  at org.junit.Assert.assertEquals(Assert.java:115)
  at org.junit.Assert.assertEquals(Assert.java:144)
  at org.elasticsearch.xpack.esql.CsvAssert.assertData(CsvAssert.java:208)
  at org.elasticsearch.xpack.esql.qa.rest.EsqlSpecTestCase.doTest(EsqlSpecTestCase.java:103)
  at org.elasticsearch.xpack.esql.qa.rest.EsqlSpecTestCase.test(EsqlSpecTestCase.java:84)
  ...
2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:esql:qa:server:multi-node:javaRestTest' --tests "org.elasticsearch.xpack.esql.qa.multi_node.EsqlSpecIT" -Dtests.method="test {date.In}" -Dtests.seed=4F37F967DF2BCC04 -Dtests.locale=es-SV -Dtests.timezone=Europe/Busingen -Druntime.java=20
2> org.junit.ComparisonFailure: expected:<1995-[01-27]T00:00:00.000Z> but was:<1995-[12-15]T00:00:00.000Z>
  at __randomizedtesting.SeedInfo.seed([4F37F967DF2BCC04:C763C6BD71D7A1FC]:0)
  at org.junit.Assert.assertEquals(Assert.java:115)
  at org.junit.Assert.assertEquals(Assert.java:144)
  at org.elasticsearch.xpack.esql.CsvAssert.assertData(CsvAssert.java:208)
  at org.elasticsearch.xpack.esql.qa.rest.EsqlSpecTestCase.doTest(EsqlSpecTestCase.java:103)
  at org.elasticsearch.xpack.esql.qa.rest.EsqlSpecTestCase.test(EsqlSpecTestCase.java:84)
  ...

Reproducible locally by running the tests in a loop, e.g.

export x=; while ./gradlew :x-pack:plugin:esql:qa:server:multi-node:javaRestTest; do echo $x | wc -c; export x=x$x; done

@luigidellaquila
Contributor Author

Thanks Chris, I'll check them; they are probably just non-deterministic tests.

@luigidellaquila
Contributor Author

@ChrisHegarty your fix for NameId worked great.
I also fixed the non-deterministic CSV tests, so the build is green now.

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@luigidellaquila luigidellaquila merged commit d507295 into elastic:main Sep 8, 2023
Labels
:Analytics/ES|QL (AKA ESQL), Team:QL ((Deprecated) Meta label for query languages team), >test (Issues or PRs that are addressing/adding tests), v8.11.0
5 participants