Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] HeapAttackIT testHugeConcat failing #100288

Closed
Tracked by #100147
ChrisHegarty opened this issue Oct 4, 2023 · 12 comments
Closed
Tracked by #100147

[CI] HeapAttackIT testHugeConcat failing #100288

ChrisHegarty opened this issue Oct 4, 2023 · 12 comments
Labels
:Analytics/ES|QL AKA ESQL Team:QL (Deprecated) Meta label for query languages team >test-failure Triaged test failures from CI

Comments

@ChrisHegarty
Copy link
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/raslhxar3ujes/tests/:x-pack:plugin:esql:qa:server:single-node:javaRestTest/org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT/testHugeConcat
Reproduction line:

./gradlew ':x-pack:plugin:esql:qa:server:single-node:javaRestTest' --tests "org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT.testHugeConcat" -Dtests.seed=C4EA0D658E691F34 -Dtests.locale=ar-SD -Dtests.timezone=Etc/GMT+11 -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Yes

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT&tests.test=testHugeConcat
Failure excerpt:

junit.framework.AssertionFailedError: Unexpected exception type, expected ResponseException but got java.io.IOException: Connection reset

  at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2869)
  at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2850)
  at org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT.assertCircuitBreaks(HeapAttackIT.java:71)
  at org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT.testHugeConcat(HeapAttackIT.java:163)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

  Caused by: java.io.IOException: Connection reset

    at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:935)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:300)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:303)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:288)
    at org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT.concat(HeapAttackIT.java:177)
    at org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT.lambda$testHugeConcat$1(HeapAttackIT.java:163)
    at org.apache.lucene.tests.util.LuceneTestCase._expectThrows(LuceneTestCase.java:3022)
    at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2859)
    at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2850)
    at org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT.assertCircuitBreaks(HeapAttackIT.java:71)
    at org.elasticsearch.xpack.esql.qa.single_node.HeapAttackIT.testHugeConcat(HeapAttackIT.java:163)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke(Method.java:580)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
    at java.lang.Thread.run(Thread.java:1583)

    Caused by: java.net.SocketException: Connection reset

      at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434)
      at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:231)
      at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136)
      at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241)
      at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:87)
      at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:40)
      at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
      at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
      at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
      at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
      at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
      at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
      at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
      at java.lang.Thread.run(Thread.java:1583)

@ChrisHegarty ChrisHegarty added the >test-failure Triaged test failures from CI label Oct 4, 2023
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Oct 4, 2023
@ChrisHegarty
Copy link
Contributor Author

In fact, these all fail:

Test failures:
HeapAttackIT » testHugeConcat
HeapAttackIT » testSortByManyLongsTooMuchMemory
HeapAttackIT » testSmallConcat
RestEsqlIT » testUseUnknownIndex
RestEsqlIT » testTextMode
RestEsqlIT » testCSVMode
RestEsqlIT » testErrorMessageForInvalidTypeInParams
RestEsqlIT » testGetAnswer
RestEsqlIT » testBasicEsql
RestEsqlIT » testColumnarMode
RestEsqlIT » testErrorMessageForArrayValuesInParams
RestEsqlIT » testErrorMessageForMissingTypeInParams
RestEsqlIT » testErrorMessageForMissingValueInParams
RestEsqlIT » testInvalidPragma
RestEsqlIT » testErrorMessageForInvalidParams
RestEsqlIT » testErrorMessageForLiteralDateMathOverflowOnNegation
RestEsqlIT » testNullInAggs
RestEsqlIT » testTSVMode
RestEsqlIT » testCSVNoHeaderMode
RestEsqlIT » testPragmaNotAllowed

@ChrisHegarty ChrisHegarty added Team:QL (Deprecated) Meta label for query languages team :Analytics/ES|QL AKA ESQL labels Oct 4, 2023
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Oct 4, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@nik9000
Copy link
Member

nik9000 commented Oct 5, 2023

Are you sure? Or do they just fail because they run after and the breaker isn't reset?

@nik9000
Copy link
Member

nik9000 commented Oct 5, 2023

I mean, both are totally possible at this point.

@nik9000
Copy link
Member

nik9000 commented Oct 5, 2023

Or, rather, do they fail because we knock the node over in the huge concat test?

@ChrisHegarty
Copy link
Contributor Author

Or, rather, do they fail because we knock the node over in the huge concat test?

That could very well be the case.

@ChrisHegarty
Copy link
Contributor Author

ChrisHegarty commented Oct 5, 2023

Running testHugeConcat by itself I see no issues - it passes successfully many many times.

If I modify testHugeConcat to loop 10 times - so just do what it is doing 10 times sequentially, I see failures. When I expect to see none. This is odd and kinda makes me think that we're holding on to something from a previous query execution - a real memory leak?

Anyway, debugging the x10. run I see failure come mostly from the BytesRefVectorBuilder as it tries to deep copy humungous byte[]'s. A BytesRefArray with a single large entry is hard to extract, since it needs to copy out the internal byte data into a new byte[]. Internally, the BytesRefArray uses bigArrays rather than byte[].

Some screenshots of the failures I see.
Screenshot 2023-10-05 at 10 08 41

@ChrisHegarty
Copy link
Contributor Author

Options. In BytesRefVectorBuilder:

  1. only copy when the single entry is small(ish), otherwise just create a regular (non-constant), BytesRefArrayBlock with the already created BytesRefArray that contains a single entry.
  2. Rather than deepCopy, use the BreakingBytesRefBuilder to create the new BytesRef

@nik9000
Copy link
Member

nik9000 commented Oct 5, 2023

What a lovely catch! go go tests!

I think if we're going to deep copy here we should use the BreakingBytesRefBuilder. But I might just not deep copy at all for now and call it ok.

@ChrisHegarty
Copy link
Contributor Author

well.. fixing this one place (by just removing the optimisation - just always create a BytesRefArrayVector even for an array with a single value), moves the problem :-)

BigByteArray::get allocates a new byte[] to assign to the destination BytesRef, if the get crosses a page boundary. This allocation does not use the circuit breaker. I guess because BytesRef is not Accountable or Releasable.

Screenshot 2023-10-05 at 14 09 28

elasticsearchmachine pushed a commit that referenced this issue Oct 9, 2023
This adds a limit on the bytes length that a concatenated string can get
to. The limit is set to 1MB (per entry). When the limit is hit, an
exception is returned to the caller (similar to an accounting circuit
breaking) and execution halted.

Related: #100288.
bpintea added a commit to bpintea/elasticsearch that referenced this issue Oct 9, 2023
This adds a limit on the bytes length that a concatenated string can get
to. The limit is set to 1MB (per entry). When the limit is hit, an
exception is returned to the caller (similar to an accounting circuit
breaking) and execution halted.

Related: elastic#100288.
@bpintea
Copy link
Contributor

bpintea commented Oct 9, 2023

Adressed in #100360.

@bpintea bpintea closed this as completed Oct 9, 2023
elasticsearchmachine pushed a commit that referenced this issue Oct 9, 2023
This adds a limit on the bytes length that a concatenated string can get
to. The limit is set to 1MB (per entry). When the limit is hit, an
exception is returned to the caller (similar to an accounting circuit
breaking) and execution halted.

Related: #100288.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL Team:QL (Deprecated) Meta label for query languages team >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

4 participants