
[CI] DownsampleActionSingleNodeTests testCannotDownsampleWhileOtherDownsampleInProgress failing #106403

Closed
davidkyle opened this issue Mar 18, 2024 · 5 comments
Assignee: martijnvg
Labels: low-risk, :StorageEngine/TSDB, Team:StorageEngine, >test-failure

Comments

@davidkyle (Member)

Build scan:
https://gradle-enterprise.elastic.co/s/hjehcw3ak4ets/tests/:x-pack:plugin:downsample:test/org.elasticsearch.xpack.downsample.DownsampleActionSingleNodeTests/testCannotDownsampleWhileOtherDownsampleInProgress

Reproduction line:

./gradlew ':x-pack:plugin:downsample:test' --tests "org.elasticsearch.xpack.downsample.DownsampleActionSingleNodeTests.testCannotDownsampleWhileOtherDownsampleInProgress" -Dtests.seed=8E5AEEE1E9924404 -Dtests.locale=zh-Hans-CN -Dtests.timezone=AST -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.xpack.downsample.DownsampleActionSingleNodeTests#testCannotDownsampleWhileOtherDownsampleInProgress

Failure excerpt:

org.elasticsearch.ElasticsearchException: downsample task [downsample-downsample-ynbrxltcjhvoej-0-13d] failed

  at __randomizedtesting.SeedInfo.seed([8E5AEEE1E9924404:AE3519837F3E6A11]:0)
  at org.elasticsearch.xpack.downsample.TransportDownsampleAction$2.onResponse(TransportDownsampleAction.java:425)
  at org.elasticsearch.xpack.downsample.TransportDownsampleAction$2.onResponse(TransportDownsampleAction.java:417)
  at org.elasticsearch.persistent.PersistentTasksService$1.onNewClusterState(PersistentTasksService.java:213)
  at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:379)
  at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:230)
  at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560)
  at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:547)
  at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505)
  at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
  at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  at java.lang.Thread.run(Thread.java:1583)

@davidkyle davidkyle added :StorageEngine/TSDB You know, for Metrics >test-failure Triaged test failures from CI labels Mar 18, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@martijnvg martijnvg self-assigned this Mar 18, 2024
@martijnvg martijnvg added low-risk An open issue or test failure that is a low risk to future releases and removed blocker labels Mar 18, 2024
@martijnvg (Member)

The cause of the downsampling failure:

[2024-03-17T20:39:16,407][WARN ][o.e.p.AllocatedPersistentTask] [node_s_0] task [downsample-downsample-ynbrxltcjhvoej-0-13d] failed with an exception	
org.elasticsearch.xpack.downsample.DownsampleShardIndexerException: Downsampling task [downsample-downsample-ynbrxltcjhvoej-0-13d] on shard [ynbrxltcjhvoej][0] failed indexing [452]	
	at org.elasticsearch.xpack.downsample.DownsampleShardIndexer.execute(DownsampleShardIndexer.java:202) ~[main/:?]	
	at org.elasticsearch.xpack.downsample.DownsampleShardPersistentTaskExecutor$1.doRun(DownsampleShardPersistentTaskExecutor.java:220) ~[main/:?]	
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.0-SNAPSHOT.jar:8.14.0-SNAPSHOT]	
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.0-SNAPSHOT.jar:8.14.0-SNAPSHOT]	
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]	
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]

I think the logging here should be improved, so that we can get insight into how the indexing failed.
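
A minimal sketch of what that improvement could look like, assuming the `BulkResponse` that populated the downsample index is still in scope at the failure site; the helper class and method here are hypothetical, while the `org.elasticsearch.action.bulk` types and log4j logger are the real ones:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkResponse;

final class DownsampleBulkFailureLogger {
    private static final Logger logger = LogManager.getLogger(DownsampleBulkFailureLogger.class);

    // Hypothetical helper: instead of only counting failed items, group them
    // by failure message so each distinct cause (e.g. a ClusterBlockException)
    // shows up in the logs once, with its count.
    static void logFailures(BulkResponse response) {
        if (response.hasFailures() == false) {
            return;
        }
        Map<String, Long> failuresByCause = Arrays.stream(response.getItems())
            .filter(BulkItemResponse::isFailed)
            .collect(Collectors.groupingBy(BulkItemResponse::getFailureMessage, Collectors.counting()));
        failuresByCause.forEach(
            (cause, count) -> logger.warn("downsample indexing failed [{}] time(s): [{}]", count, cause)
        );
    }
}
```

Grouping by message keeps the log readable even when hundreds of documents fail for the same reason, as happened here (452 identical failures).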

@martijnvg (Member)

Indexing failed because the target index had started blocking writes:

[2024-03-17T20:39:16,345][ERROR][o.e.x.d.DownsampleShardIndexer] [node_s_0] Shard [[ynbrxltcjhvoej][0]] failed to populate downsample index. Failures: [{null=org.elasticsearch.cluster.block.ClusterBlockException: index [downsample-ynbrxltcjhvoej] blocked by: [FORBIDDEN/8/index write (api)];}]
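
For context, `FORBIDDEN/8/index write (api)` is the cluster block that the add-index-block API leaves on an index. A hedged sketch of testing for it against cluster state; the helper class is illustrative, while `ClusterState` and `ClusterBlockLevel` are the real types:

```java
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.block.ClusterBlockLevel;

final class WriteBlockCheck {
    // Illustrative helper: true when the named index carries any write-level
    // block, such as the "index write (api)" block (id 8) that the first
    // downsample run leaves on its target index.
    static boolean writesBlocked(ClusterState state, String index) {
        return state.blocks().indexBlocked(ClusterBlockLevel.WRITE, index);
    }
}
```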

@martijnvg (Member)

Looking more closely at the test logs, the first downsample attempt completes successfully, but just after its persistent task completes, the duplicate downsample operation starts downsampling. It fails later because the initial downsample operation made the target index read only. I think a check should be added just before the downsample shard operation starts, verifying whether index.downsample.status has already been set to success (a sketch of such a check follows the log excerpt below). There is a similar pre-check in TransportDownsampleAction, but in this case that isn't enough.

[2024-03-17T20:39:16,127][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Downsampling task [downsample-downsample-ynbrxltcjhvoej-0-13d on shard [ynbrxltcjhvoej][0] started	
[2024-03-17T20:39:16,132][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Shard [ynbrxltcjhvoej][0] processed [1034] docs, created [452] downsample buckets	
[2024-03-17T20:39:16,215][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Shard [[ynbrxltcjhvoej][0]] successfully sent [1034], received source doc [452], indexed downsampled doc [452], failed [0], took [0s]	
[2024-03-17T20:39:16,216][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Downsampling task [downsample-downsample-ynbrxltcjhvoej-0-13d on shard [ynbrxltcjhvoej][0] completed	
[2024-03-17T20:39:16,236][INFO ][o.e.x.d.TransportDownsampleAction] [node_s_0] Downsampling task [downsample-downsample-ynbrxltcjhvoej-0-13d completed for shard [ynbrxltcjhvoej][0]	
[2024-03-17T20:39:16,236][INFO ][o.e.x.d.TransportDownsampleAction] [node_s_0] All downsampling tasks completed [1]	
[2024-03-17T20:39:16,329][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Downsampling task [downsample-downsample-ynbrxltcjhvoej-0-13d on shard [ynbrxltcjhvoej][0] started	
[2024-03-17T20:39:16,334][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Shard [ynbrxltcjhvoej][0] processed [1034] docs, created [452] downsample buckets	
[2024-03-17T20:39:16,345][ERROR][o.e.x.d.DownsampleShardIndexer] [node_s_0] Shard [[ynbrxltcjhvoej][0]] failed to populate downsample index. Failures: [{null=org.elasticsearch.cluster.block.ClusterBlockException: index [downsample-ynbrxltcjhvoej] blocked by: [FORBIDDEN/8/index write (api)];}]	
[2024-03-17T20:39:16,346][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Shard [[ynbrxltcjhvoej][0]] successfully sent [1034], received source doc [452], indexed downsampled doc [452], failed [452], took [0s]	
[2024-03-17T20:39:16,350][INFO ][o.e.x.d.DownsampleShardIndexer] [node_s_0] Downsampling task [downsample-downsample-ynbrxltcjhvoej-0-13d] on shard [ynbrxltcjhvoej][0] failed indexing [452]	
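
A minimal sketch of that proposed guard, assuming the shard-level executor can read the target index's `IndexMetadata` before indexing starts; the helper name is hypothetical, and the setting key and success value follow the index.downsample.status setting mentioned above:

```java
import org.elasticsearch.cluster.metadata.IndexMetadata;

final class DownsampleStatusCheck {
    // Hypothetical guard for the shard-level task: if a previous downsample
    // run already marked the target index as complete, skip re-indexing
    // instead of failing against the now write-blocked index.
    static boolean targetAlreadyDownsampled(IndexMetadata targetIndexMetadata) {
        String status = targetIndexMetadata.getSettings().get("index.downsample.status");
        return "success".equals(status);
    }
}
```

The fix merged as #106563 takes a similar approach: whether the target index is ready is also checked just before starting the downsample persistent tasks.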

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 20, 2024
This is relevant in the case where multiple downsample API invocations have been executed for the same source index, target index, and fixed interval. Whether the target index is ready is now also checked just before starting the downsample persistent tasks.

Relates to elastic#106403
martijnvg added a commit that referenced this issue Mar 21, 2024
This is relevant in the case where multiple downsample API invocations have been executed for the same source index, target index, and fixed interval. Whether the target index is ready is now also checked just before starting the downsample persistent tasks.

Relates to #106403
@martijnvg (Member)

No failures reported in the last 2 weeks, since #106563 was merged. I think that change stabilized this test.
