Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] DownsampleActionIT testRollupNonTSIndex failing #103981

Closed
kingherc opened this issue Jan 5, 2024 · 11 comments · Fixed by #109787
Closed

[CI] DownsampleActionIT testRollupNonTSIndex failing #103981

kingherc opened this issue Jan 5, 2024 · 11 comments · Fixed by #109787
Assignees
Labels
low-risk An open issue or test failure that is a low risk to future releases :StorageEngine/Downsampling Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data Team:StorageEngine >test-failure Triaged test failures from CI

Comments

@kingherc
Copy link
Contributor

kingherc commented Jan 5, 2024

Build scan:
https://gradle-enterprise.elastic.co/s/ruzhsbyskkuk6/tests/:x-pack:plugin:ilm:qa:multi-node:javaRestTest/org.elasticsearch.xpack.ilm.actions.DownsampleActionIT/testRollupNonTSIndex

Reproduction line:

./gradlew ':x-pack:plugin:ilm:qa:multi-node:javaRestTest' --tests "org.elasticsearch.xpack.ilm.actions.DownsampleActionIT.testRollupNonTSIndex" -Dtests.seed=25F9F6ED0F69AEE9 -Dtests.locale=ar-DZ -Dtests.timezone=Europe/Monaco -Druntime.java=19

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.xpack.ilm.actions.DownsampleActionIT#testRollupNonTSIndex

Failure excerpt:

java.lang.AssertionError: 
Expected: <{"phase":"warm","action":"complete","name":"complete"}>
     but: was <{"phase":"null","action":"null","name":"null"}>

  at __randomizedtesting.SeedInfo.seed([25F9F6ED0F69AEE9:B9F4C90CB6DD48DE]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.xpack.ilm.actions.DownsampleActionIT.lambda$testRollupNonTSIndex$16(DownsampleActionIT.java:404)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1281)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1254)
  at org.elasticsearch.xpack.ilm.actions.DownsampleActionIT.testRollupNonTSIndex(DownsampleActionIT.java:404)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:578)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1589)

@kingherc kingherc added :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data >test-failure Triaged test failures from CI Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Jan 5, 2024
@elasticsearchmachine elasticsearchmachine added blocker needs:triage Requires assignment of a team area label and removed Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Jan 5, 2024
@wchaparro wchaparro added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Jan 5, 2024
DaveCTurner added a commit that referenced this issue Jan 8, 2024
@martijnvg martijnvg self-assigned this Jan 11, 2024
@martijnvg martijnvg added low-risk An open issue or test failure that is a low risk to future releases :StorageEngine/Downsampling Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data and removed blocker :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data labels Jan 11, 2024
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 5, 2024
and add more logging for when test fails next time.

Relates to elastic#103981
martijnvg added a commit that referenced this issue Feb 5, 2024
and add more logging for when test fails next time.

Relates to #103981
@DaveCTurner
Copy link
Contributor

Failed again at https://gradle-enterprise.elastic.co/s/4hzqteph2qrpo, reinstating the mute.

DaveCTurner added a commit that referenced this issue Feb 9, 2024
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 5, 2024
increase ILM logging to TRACE.

Relates to elastic#103981
martijnvg added a commit that referenced this issue Mar 5, 2024
increase ILM logging to TRACE.

Relates to #103981
@martijnvg
Copy link
Member

No reported failures since the last ~2 weeks, which was when #106563 was merged. I think that change stabilized this test as well.

elasticsearchmachine pushed a commit that referenced this issue Apr 15, 2024
@kkrik-es kkrik-es assigned kkrik-es and unassigned martijnvg Apr 16, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@kkrik-es
Copy link
Contributor

There was a single failure over the past week, over more than 2k runs. Re-enabling the test for coverage, keeping an eye for further failures.

@kkrik-es
Copy link
Contributor

No more failures over the past week. Closing it for now.

@ldematte
Copy link
Contributor

@ldematte ldematte reopened this Apr 24, 2024
@kkrik-es
Copy link
Contributor

Seems like a race condition:

1> [2024-04-24T19:24:50,027][INFO ][o.e.x.i.a.DownsampleActionIT] [testRollupNonTSIndex] --> running [testRollupNonTSIndex] with index [index-kugukislys], data stream [ds-cwycolhzpu], alias [alias-ppQRH] and policy [policy-UuNay]
  1> [2024-04-24T19:25:00,353][WARN ][o.e.x.i.a.DownsampleActionIT] [testRollupNonTSIndex] --> original index name is [index-kugukislys], rollup index name is NULL, possible explanation: {index=index-kugukislys, managed=true, policy=policy-UuNay, index_creation_date_millis=1713968690115, time_since_index_creation=10.23s, step_info={type=illegal_argument_exception, reason=policy [policy-UuNay] does not exist}}
» [2024-04-24T14:25:00,549][ERROR][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] policy [policy-UuNay] for index [index-kugukislys] failed on step [{"phase":"warm","action":"migrate","name":"migrate"}]. Moving to ERROR step [index-kugukislys/zdP8uRrlTiWztDwfcKEEBA] org.elasticsearch.index.IndexNotFoundException: no such index [index-kugukislys]

Waiting for longer should allow for things to work.

elasticsearchmachine pushed a commit that referenced this issue Apr 24, 2024
look_ahead_time is 1m, need to wait for longer than that.

Related to #103981
@davidkyle
Copy link
Member

Failed on 8.14 https://gradle-enterprise.elastic.co/s/ns3igizthf3eq

@kkrik-es was the fix backported?

@kkrik-es
Copy link
Contributor

Yeah I saw the failure.. I didn't backport the fix, will do so if I don't see another failure in main this week.

@kkrik-es
Copy link
Contributor

kkrik-es commented May 1, 2024

No more failures, closing (and backporting the fix) for now.

@kkrik-es kkrik-es closed this as completed May 1, 2024
elasticsearchmachine pushed a commit that referenced this issue Jun 17, 2024
This PR fixes two test failures
#103981 &
#105437 and refactors the
code a bit to make things more explicit.

**What was the issue** These tests were creating an index with a policy
before that policy was created. This could cause an issue if ILM would
run after the index was created but before the policy was created. 

When ILM runs before the policy is added, the following happen:

-  the index encounters an error the ILM state sets that the current step is `null`, which makes sense since there is no policy to retrieve a step from. 
- A `null` step does not qualify to be executed periodically, which also makes sense because probably nothing changed, so chances are the index will remain in this state.
- The test keeps waiting for something to happen, but this is not happening because no cluster state updates are coming like they would have if this was a "real" cluster. 
- Until the test tear down starts, then the index gets updates with the ILM policy but it's a bit too late.

The previous scenario is confirmed by the logging too.

```
----> The index gets created referring a policy that does not exist yet, ILM runs at least twice before the policy is there
[2024-06-12T20:14:28,857][....] [index-sanohmhwxl] creating index, ......
[2024-06-12T20:14:28,870][....] [index-sanohmhwxl] retrieved current step key: null
[2024-06-12T20:14:28,871][....] unable to retrieve policy [policy-tohpA] for index [index-sanohmhwxl], recording this in step_info for this index java.lang.IllegalArgumentException: policy [policy-tohpA] does not exist

-----> Only now the policy is added
[2024-06-12T20:14:29,024][....] adding index lifecycle policy [policy-tohpA]

-----> ILM is running periodically but because the current step is null it ignores it
[2024-06-12T20:15:23,791][....] job triggered: ilm, 1718223323790, 1718223323790
[2024-06-12T20:15:23,791][....] retrieved current step key: null
[2024-06-12T20:15:23,791][....] maybe running periodic step (InitializePolicyContextStep) with current step {"phase":"new","action":"init","name":"init"}
```

This can also be locally reproduced by adding a 5s thread sleep before
adding the policy. 

**The fix** Adding a non existing policy to an index is a not a
supported path. For this reason, we refactored the test to reflect a
more realistic scenario. 

- We add the policy as an argument in `private void createIndex(String index, String alias, String policy, boolean isTimeSeries)`. This way it's clear that a policy could be added.
- We created the policy before adding the index, it does not appear that adding the policy later is crucial for the test, so simplifying it sounded like a good idea.
- Simplified `testRollupIndexInTheHotPhaseWithoutRollover` that ensures that a downsampling action cannot be added in the hot phase without rollover. An index is not necessary for this test, so again simplifying it makes the purpose of the test more clear.

Fixes: #103981 Fixes:
#105437
gmarouli added a commit to gmarouli/elasticsearch that referenced this issue Jun 17, 2024
…9787)

This PR fixes two test failures
elastic#103981 &
elastic#105437 and refactors the
code a bit to make things more explicit.

**What was the issue** These tests were creating an index with a policy
before that policy was created. This could cause an issue if ILM would
run after the index was created but before the policy was created. 

When ILM runs before the policy is added, the following happen:

-  the index encounters an error the ILM state sets that the current step is `null`, which makes sense since there is no policy to retrieve a step from. 
- A `null` step does not qualify to be executed periodically, which also makes sense because probably nothing changed, so chances are the index will remain in this state.
- The test keeps waiting for something to happen, but this is not happening because no cluster state updates are coming like they would have if this was a "real" cluster. 
- Until the test tear down starts, then the index gets updates with the ILM policy but it's a bit too late.

The previous scenario is confirmed by the logging too.

```
----> The index gets created referring a policy that does not exist yet, ILM runs at least twice before the policy is there
[2024-06-12T20:14:28,857][....] [index-sanohmhwxl] creating index, ......
[2024-06-12T20:14:28,870][....] [index-sanohmhwxl] retrieved current step key: null
[2024-06-12T20:14:28,871][....] unable to retrieve policy [policy-tohpA] for index [index-sanohmhwxl], recording this in step_info for this index java.lang.IllegalArgumentException: policy [policy-tohpA] does not exist

-----> Only now the policy is added
[2024-06-12T20:14:29,024][....] adding index lifecycle policy [policy-tohpA]

-----> ILM is running periodically but because the current step is null it ignores it
[2024-06-12T20:15:23,791][....] job triggered: ilm, 1718223323790, 1718223323790
[2024-06-12T20:15:23,791][....] retrieved current step key: null
[2024-06-12T20:15:23,791][....] maybe running periodic step (InitializePolicyContextStep) with current step {"phase":"new","action":"init","name":"init"}
```

This can also be locally reproduced by adding a 5s thread sleep before
adding the policy. 

**The fix** Adding a non existing policy to an index is a not a
supported path. For this reason, we refactored the test to reflect a
more realistic scenario. 

- We add the policy as an argument in `private void createIndex(String index, String alias, String policy, boolean isTimeSeries)`. This way it's clear that a policy could be added.
- We created the policy before adding the index, it does not appear that adding the policy later is crucial for the test, so simplifying it sounded like a good idea.
- Simplified `testRollupIndexInTheHotPhaseWithoutRollover` that ensures that a downsampling action cannot be added in the hot phase without rollover. An index is not necessary for this test, so again simplifying it makes the purpose of the test more clear.

Fixes: elastic#103981 Fixes:
elastic#105437
elasticsearchmachine pushed a commit that referenced this issue Jun 17, 2024
…109791)

This PR fixes two test failures
#103981 &
#105437 and refactors the
code a bit to make things more explicit.

**What was the issue** These tests were creating an index with a policy
before that policy was created. This could cause an issue if ILM would
run after the index was created but before the policy was created. 

When ILM runs before the policy is added, the following happen:

-  the index encounters an error the ILM state sets that the current step is `null`, which makes sense since there is no policy to retrieve a step from. 
- A `null` step does not qualify to be executed periodically, which also makes sense because probably nothing changed, so chances are the index will remain in this state.
- The test keeps waiting for something to happen, but this is not happening because no cluster state updates are coming like they would have if this was a "real" cluster. 
- Until the test tear down starts, then the index gets updates with the ILM policy but it's a bit too late.

The previous scenario is confirmed by the logging too.

```
----> The index gets created referring a policy that does not exist yet, ILM runs at least twice before the policy is there
[2024-06-12T20:14:28,857][....] [index-sanohmhwxl] creating index, ......
[2024-06-12T20:14:28,870][....] [index-sanohmhwxl] retrieved current step key: null
[2024-06-12T20:14:28,871][....] unable to retrieve policy [policy-tohpA] for index [index-sanohmhwxl], recording this in step_info for this index java.lang.IllegalArgumentException: policy [policy-tohpA] does not exist

-----> Only now the policy is added
[2024-06-12T20:14:29,024][....] adding index lifecycle policy [policy-tohpA]

-----> ILM is running periodically but because the current step is null it ignores it
[2024-06-12T20:15:23,791][....] job triggered: ilm, 1718223323790, 1718223323790
[2024-06-12T20:15:23,791][....] retrieved current step key: null
[2024-06-12T20:15:23,791][....] maybe running periodic step (InitializePolicyContextStep) with current step {"phase":"new","action":"init","name":"init"}
```

This can also be locally reproduced by adding a 5s thread sleep before
adding the policy. 

**The fix** Adding a non existing policy to an index is a not a
supported path. For this reason, we refactored the test to reflect a
more realistic scenario. 

- We add the policy as an argument in `private void createIndex(String index, String alias, String policy, boolean isTimeSeries)`. This way it's clear that a policy could be added.
- We created the policy before adding the index, it does not appear that adding the policy later is crucial for the test, so simplifying it sounded like a good idea.
- Simplified `testRollupIndexInTheHotPhaseWithoutRollover` that ensures that a downsampling action cannot be added in the hot phase without rollover. An index is not necessary for this test, so again simplifying it makes the purpose of the test more clear.

Fixes: #103981 Fixes:
#105437
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
low-risk An open issue or test failure that is a low risk to future releases :StorageEngine/Downsampling Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data Team:StorageEngine >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants