Upgrade Assistant - Phase 2 - Reindexing #26368
Comments
Pinging @elastic/kibana-operations |
@joshdover whenever you have a UI around this and would like some feedback, let me know and I can take a look. |
I've updated the issue to include our plan to use saved objects + browser-driven polling for the first iteration and how we'll add background polling in the second if time permits. @alexfrancoeur Will do! |
@droberts195 Can you add information to this ticket about any special handling that ML indices need during the reindex? As of right now, I know that ML jobs will need to be paused while reindexing and then resumed.
If there's anything else that needs to be handled, please add that here as well. |
@joshdover we started off along the path of upgrading ML indices without pausing the ML jobs - elastic/elasticsearch#36643. This is more complex but nicer for users who are running real-time anomaly detection and have large ML indices that date back to 5.x. If we were to continue along that path then the Kibana side logic would be not to reindex ML indices using the Kibana functionality but instead call that endpoint. The idea of pausing jobs by cancelling allocations of ML persistent tasks only came up last week. We'll decide within the next couple of days whether to switch to that approach.
They all start with
We have different types of ML indices:
Reindexing For the results indices,
There are no APIs to do this currently. If we decide to switch from online upgrade to pause/resume upgrade then we'll have to add these APIs into 6.7. Given the work that's been done so far I'm not convinced that the pause/resume option is the easiest way forward. To summarise there are two ways forward:
In either case, |
@droberts195 Thanks for writing this up. I think the best course for us right now is to wait on your decision and then jump on a video call to work out the details depending on which path the ML team decides to move forward with. From my perspective, it may actually be simpler for Kibana to use the ML-specific reindexing endpoint rather than pausing/resuming jobs. I think it's most likely too late for this upgrade cycle, but we should probably explore using this approach with other user indices in the 8.0 upgrade cycle. If we can accomplish zero-downtime reindexing, that would be great for many use cases.
This is not something that is handled right now by the Upgrade Assistant and actually something we hadn't considered. I'm going to take a look at this today and see how the current logic would behave when reindexing an index that already has an alias. I agree that moving any aliases should be handled by the Upgrade Assistant. |
@joshdover I spoke to @bleskes this morning and we're going to go with the pause/resume option. We're going to discuss exactly how in the ES distributed team's weekly meeting tomorrow, so I'll update this issue after that. @benwtrent will probably do work for this.
I'm surprised that no customers complained about that in the 5.6 to 6.x upgrade. It should be possible to add arbitrarily many aliases to the new index in the same operation where you delete the old index. It would be similar to what's in the "It is also possible to swap an index with an alias in one operation" example in https://www.elastic.co/guide/en/elasticsearch/reference/6.x/indices-aliases.html, but you can have many |
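For illustration, a rough sketch of the single-request swap described above: it builds a body for Elasticsearch's `POST /_aliases` endpoint that adds aliases to the new index and removes the old index in the same operation. The function name, its parameters, and the example index names are hypothetical, and the sketch assumes plain aliases (no filters or routing), which would need to be carried over separately.

```typescript
// Sketch only: builds the request body for `POST /_aliases`.
// `buildAliasSwapActions` and the index names below are hypothetical.
type AliasAction =
  | { add: { index: string; alias: string } }
  | { remove_index: { index: string } };

function buildAliasSwapActions(
  oldIndex: string,
  newIndex: string,
  existingAliases: string[]
): { actions: AliasAction[] } {
  const actions: AliasAction[] = [
    // Point the old index's name at the new index so existing clients keep working.
    { add: { index: newIndex, alias: oldIndex } },
    // Re-create each alias the old index had, now on the new index.
    ...existingAliases.map((alias): AliasAction => ({ add: { index: newIndex, alias } })),
    // Delete the old index as part of the same request.
    { remove_index: { index: oldIndex } },
  ];
  return { actions };
}

const body = buildAliasSwapActions("logs", "logs-reindexed", ["logs-read", "logs-all"]);
console.log(JSON.stringify(body, null, 2));
```

Because all actions travel in one request, there should be no window in which the old name resolves to nothing.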
@droberts195 @benwtrent Here's the plan I went over with Ben yesterday, written out for clarity:
Note, with this plan, we are not pausing/resuming specific ML jobs, but instead pausing and resuming all ML jobs. If we need to do specific jobs we could, but I'm not sure that optimization is needed at this time. |
Thanks @joshdover that plan sounds good to me. The pause/resume endpoints we're thinking of using at the moment are:
These still aren't implemented so it's possible someone will object to that naming and we'll have to change it, but the difficulty in calling the endpoints will not be any higher than that. |
Great! @benwtrent is there a PR to follow for this? I didn't see one when I briefly poked around the ES repo. Also, with this API will it be guaranteed that Kibana can set indices to read-only as soon as we've gotten a response back from this API? |
@joshdover I am currently writing tests for the API. The PR should be opened this week or early next week. Yes, once the API returns, the Indices can be set to read-only and re-indexing can begin. |
@joshdover PR: elastic/elasticsearch#37837 It's a biggie, lots of stuff going on to enable this change. Should get some reviewers taking a gander tomorrow/Monday and hopefully have it finished early next week :) |
This PR fixes a bug with the |
All the planned work on this is complete. |
As part of Phase 2 of #20890, we need to add a UI and state layer to allow users to reindex old indices (created before 6.x) in order to be compatible with 7.0.
Left to Implement
In first PR:
.watches
In follow up PR(s):
.tasks index (Upgrade Assistant cannot reindex Elasticsearch's .tasks index #29454)
.watches and .triggered-watches indices are not reindexing
.ml_settings deprecations to Cluster tab
_doc in master
Other nice-to-haves:
Details
This feature will be similar in flow to the upgrade assistant in 5.6 and will:
One issue with this flow last time was around persistence. Almost all of this logic was driven by client-side code, so if you left the page in the browser the process would stop.
This time around we want to persist the reindex process into a saved object and leverage the Task Manager (#24356) to poll Elasticsearch's Task API (naming is fun) for the status of the reindex task and to resume the flow once the reindex is done.
We've decided to persist this using saved objects that we will update using optimistic concurrency. We are going to break this work into two parts: first, get this working ONLY when the browser is on the page; then, if we have time, add a worker that could handle this in the background. We should also be able to offer a reindex progress indicator and the ability to abort or reset a reindex process.
Browser-driven iteration
For each reindex operation, we will create a saved object that acts as a state machine to track the steps of the reindex process. To update this object, we will use the version parameter in Elasticsearch to ensure that two browser tabs (or workers) cannot update the object simultaneously.
Reindex flow:
status field to track the state machine.
If the user leaves the page while the browser is polling, the alias switchover will not complete until they return to the upgrade assistant.
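A minimal sketch of the version-checked update, using an in-memory stand-in rather than the real saved objects client. Every name here (`VersionedStore`, `ReindexOp`, the status values) is hypothetical and only illustrates how a stale version causes the second writer to be rejected.

```typescript
// Sketch only: an in-memory stand-in for a versioned saved object store.
// All type names and status values are hypothetical, not the actual Kibana implementation.
type ReindexStatus = "inProgress" | "completed" | "failed";

interface ReindexOp {
  indexName: string;
  status: ReindexStatus;
}

interface Versioned<T> {
  version: number;
  attributes: T;
}

type UpdateResult<T> =
  | { ok: true; doc: Versioned<T> }
  | { ok: false; reason: "not_found" | "version_conflict" };

class VersionedStore<T> {
  private docs = new Map<string, Versioned<T>>();

  create(id: string, attributes: T): Versioned<T> {
    const doc: Versioned<T> = { version: 1, attributes };
    this.docs.set(id, doc);
    return doc;
  }

  get(id: string): Versioned<T> | undefined {
    return this.docs.get(id);
  }

  // Applies the write only if the caller read the latest version.
  update(id: string, expectedVersion: number, attributes: T): UpdateResult<T> {
    const current = this.docs.get(id);
    if (!current) return { ok: false, reason: "not_found" };
    if (current.version !== expectedVersion) return { ok: false, reason: "version_conflict" };
    const next: Versioned<T> = { version: current.version + 1, attributes };
    this.docs.set(id, next);
    return { ok: true, doc: next };
  }
}

const store = new VersionedStore<ReindexOp>();
const created = store.create("reindex:my-index", { indexName: "my-index", status: "inProgress" });

// Two tabs read the same version; only the first write is accepted.
const first = store.update("reindex:my-index", created.version, { indexName: "my-index", status: "completed" });
const second = store.update("reindex:my-index", created.version, { indexName: "my-index", status: "failed" });
console.log(first.ok, second.ok);
```

The rejected writer would then re-read the object and decide whether its step still needs to run.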
Worker-driven iteration
Largely the same flow, but we will have an in-process worker on the server side that will look for in-progress reindex operations and continue to poll for their completion.
To reduce overhead from polling Elasticsearch, we could only boot up this worker if there are any known reindexes in-progress. This check will be done at startup and when a new reindex operation is started.
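As a rough sketch of that start-up rule, the class below starts a worker only when in-progress reindexes are found at startup or when a new reindex request arrives. `ReindexWorker` and `countInProgress` are hypothetical names, and the lookup is synchronous here only to keep the example short; a real lookup would query saved objects asynchronously.

```typescript
// Sketch only: decides when a polling worker should be running.
// `ReindexWorker` and `countInProgress` are hypothetical names.
class ReindexWorker {
  private running = false;

  constructor(private readonly countInProgress: () => number) {}

  // Called once at startup: start only if some reindex is already in progress.
  onStartup(): void {
    if (this.countInProgress() > 0) {
      this.start();
    }
  }

  // Called when this instance receives a request to start a new reindex.
  onReindexStarted(): void {
    this.start();
  }

  isRunning(): boolean {
    return this.running;
  }

  private start(): void {
    // A real worker would begin its poll loop here.
    this.running = true;
  }
}

const idle = new ReindexWorker(() => 0);
idle.onStartup();
console.log(idle.isRunning());

const busy = new ReindexWorker(() => 0);
busy.onStartup();
busy.onReindexStarted();
console.log(busy.isRunning());
```

Note that an instance which found nothing at startup never re-checks in this sketch, so it would not notice work started by another instance.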
Potential problem:
kibana1 starts up, no reindex operations in progress, does not start worker.
kibana2 starts up, receives request to start reindex operation, starts worker.
kibana2 crashes before reindex is complete.
kibana1 never starts worker, reindex operation is not shown as completed (and aliases not swapped over).
We could address this issue by either:
Known Unknowns
index.uuid
index.creation_date
index.version.created
index.version.upgraded
index.provided_name
index.blocks
index.legacy
Possible Improvements