Task fails silently during parallel _reindex operation #33764
Pinging @elastic/es-distributed
At first glance, I would say that you increased the number of parallel reindex tasks a bit too much. Running 16 at the same time causes rejections, as you have found out, rather than just delays, and that is expected. The way the indexing threadpool works is that we queue up a certain number of requests, but at some point (as in your case) the queue may reach its threshold, which causes rejections. I will let somebody from @elastic/es-distributed comment before closing, though, just to make sure there is nothing we can do.
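For anyone trying to see this happen, here is a minimal sketch (assuming the cluster is reachable on localhost:9206 as in this issue, with no authentication, and the Python `requests` package) of watching the write thread pool queue and rejection counts while the parallel reindex tasks run:

```python
# Sketch: poll the _cat/thread_pool API for the "write" pool while reindexing runs.
# Assumptions: Elasticsearch on localhost:9206 (the port mapping from this issue), no auth.
import time
import requests

ES = "http://localhost:9206"

for _ in range(10):
    resp = requests.get(
        f"{ES}/_cat/thread_pool/write",
        params={"v": "true", "h": "node_name,name,active,queue,rejected"},
    )
    print(resp.text)  # a growing "rejected" column means bulk requests are being rejected
    time.sleep(5)
```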
This looks like a bug to me. I think the same retry policy should be used for storing the task result as for sending the bulk requests. @nik9000 what are your thoughts on this?
I admit that 16 parallel tasks is a lot and I don't need to run that many. I created this issue to point out that the way ES handles the rejection does not allow the client to react. If, instead of removing the failed task, it set the task result to something like "rejected" and let the client back off and retry later, that would be perfectly fine.
I think we should be fairly relentless about storing the task result, yes. This is more fuel for the "don't stick task results into an index" argument.
If we didn't accidentally throw the task result away, you'd get something like that. It'd be a failure recorded with the reason set to "rejected" or something along those lines, and it'd be up to you to figure out how to handle it. How you handle it is kind of dependent on the data that you were reindexing. Sometimes it'd be fine to just retry the reindex.
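Until something like that is stored reliably, here is a rough client-side sketch of the "back off and retry the slice" approach discussed above. It is an illustration under stated assumptions, not the fix: `start_slice_reindex` is a hypothetical callback that resubmits one manual slice and returns the new task id.

```python
# Sketch: treat a 404 on GET _tasks/<id> as "task result lost" and retry that slice.
# `start_slice_reindex` is a hypothetical helper supplied by the caller; whether a
# retry is safe depends on the data being reindexed (see the comment above).
import time
import requests

ES = "http://localhost:9206"

def wait_for_slice(task_id, start_slice_reindex, max_retries=3):
    for attempt in range(max_retries + 1):
        while True:
            resp = requests.get(f"{ES}/_tasks/{task_id}")
            if resp.status_code == 404:
                break  # stored task result is gone; fall through and resubmit the slice
            body = resp.json()
            if body.get("completed"):
                return body
            time.sleep(10)
        if attempt < max_retries:
            time.sleep(30 * (attempt + 1))  # back off before resubmitting
            task_id = start_slice_reindex()
    raise RuntimeError(f"slice still lost after {max_retries} retries")
```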
Adds about a minute's worth of backoffs and retries to saving task results, so it is *much* more likely that a busy cluster won't lose task results. This isn't an ideal solution to losing task results, but it is an incremental improvement. If all of the retries fail, we still log the task result, but that is far from ideal. Closes elastic#33764
Elasticsearch version: 6.4.0 (Build: default/tar/595516e/2018-08-17T23:18:47.308994Z, JVM: 10.0.2). Note: this is the official Docker container docker.elastic.co/elasticsearch/elasticsearch:6.4.0.
Plugins installed: [og9AEaa ingest-geoip (6.4.0), og9AEaa ingest-user-agent (6.4.0)]
JVM version: 10.0.2
OS version: Linux 45660be5076f 4.15.0-34-generic #37-Ubuntu SMP Mon Aug 27 15:21:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
I'm running a reindex operation from a remote ES 1.4 cluster via the Reindex API. Since remote reindex does not support slicing, I'm doing manual slicing using `query`, limiting the remote search by `min < document_timestamp && document_timestamp < max`, and using `wait_for_completion=false` to generate multiple parallel tasks. When I use up to 8 manual slices everything seems to be working fine, but when I double that to 16 slices (the source index has 100,000 documents, so each slice has 6,250 documents), the target ES 6.4 at one point loses track of a reindex task.
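(For illustration, a sketch of how one such manual slice could be submitted follows; the remote host name, index names, and exact timestamp bounds are placeholders, not the actual script from this report.)

```python
# Sketch: submit one manual "slice" of the remote reindex as its own task.
# Assumptions/placeholders: remote ES 1.4 at http://es-remote:9200, indices
# source_index/dest_index, and a document_timestamp field used for the range split.
# The remote host must also be listed in reindex.remote.whitelist on the 6.4 cluster.
import requests

ES = "http://localhost:9206"

def start_slice_reindex(ts_min, ts_max):
    body = {
        "source": {
            "remote": {"host": "http://es-remote:9200"},
            "index": "source_index",
            "query": {
                "range": {"document_timestamp": {"gt": ts_min, "lt": ts_max}}
            },
        },
        "dest": {"index": "dest_index"},
    }
    resp = requests.post(
        f"{ES}/_reindex",
        params={"pretty": "true", "refresh": "true", "wait_for_completion": "false"},
        json=body,
    )
    return resp.json()["task"]  # task id such as "HI4YEIRMRuSdwsE4bCK7Bg:197"
```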
It first correctly creates 16 reindexing tasks and returns their IDs, and it correctly gives me the status of all 16 tasks when I ask for it via `http://localhost:9206/_tasks/<task:id>` (I have container port 9200 mapped to local 9206), but after I ask for the status a few times, one of those 16 tasks returns 404 - not found. Sample of the reindexing script log (the problematic task is `HI4YEIRMRuSdwsE4bCK7Bg:197`):

When I look at the ES log I see that the given task failed (full log at the bottom):
As it looks to me (without knowing much about ES internals), the `write` thread pool is full because of the running reindexing, and when ES tries to do something with the task, that operation is rejected because of the full queue. I would expect the tasks to just take longer to execute, not to fail silently (as seen from the outside). If that happens I have to fail the whole reindexing process and start over, which then fails again.

Steps to reproduce:
1. POST to `http://localhost:9206/_reindex?pretty&refresh&wait_for_completion=false` for each manual slice so that many reindex tasks run in parallel.
2. Poll each task with `GET http://localhost:9206/_tasks/<part_1_of_task_id:number>`.
3. Eventually one of the polls returns 404 with a `resource_not_found_exception`.
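A condensed sketch of these steps (reusing the hypothetical `start_slice_reindex` helper sketched earlier and assuming a list of precomputed timestamp ranges):

```python
# Sketch: start all manual slices, then poll their task ids. On a busy cluster one
# of the GETs eventually returns 404 / resource_not_found_exception even though
# the task was accepted earlier, which is the silent failure described above.
import time
import requests

ES = "http://localhost:9206"

def reproduce(slice_bounds):  # slice_bounds: list of (ts_min, ts_max) tuples, e.g. 16 of them
    pending = {start_slice_reindex(lo, hi) for lo, hi in slice_bounds}
    while pending:
        for task_id in list(pending):
            resp = requests.get(f"{ES}/_tasks/{task_id}")
            if resp.status_code == 404:
                print(f"task {task_id} lost:", resp.json()["error"]["type"])
                pending.discard(task_id)
            elif resp.json().get("completed"):
                pending.discard(task_id)
        time.sleep(10)
```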
ES 6.4 logs: