Upgrade Assistant - Phase 2 - Reindexing #26368

joshdover · 2018-11-28T19:32:03Z

As part of Phase 2 of #20890, we need to add a UI and state layer that lets users reindex old indices (created before 6.x) so that they are compatible with 7.0.

Left to Implement

In first PR:

Add confirmation textbox for destructive changes
Handle conflicting index names
Design cleanup
Ensure indices are writable if reindexing fails
Handle pausing ML jobs when reindexing ML indices
Stop/start watcher when reindexing .watches

In follow up PR(s):

Remove warning banner about issues not being up to date in 6.7 (Remove warnings about last minor from Upgrade Assistant #29467)
Handle .tasks index (Upgrade Assistant cannot reindex Elasticsearch's .tasks index #29454)
Handle only resuming watcher when both .watches and .triggered-watches indices are not reindexing
Display ML and Watcher stopping/starting steps in UI
Add security credential checks before
Add upgrade overview API for Cloud
Add .ml_settings deprecations to Cluster tab
Enforce mapping type to be _doc in master

Other nice-to-haves:

Add ability to cancel a reindex operation
Localize UI strings ([upgrade] Localize reindexing flyout in Upgrade Assistant #30432)
Add API documentation ([docs] Add docs for Upgrade Assistant APIs #30330)
Make API human readable
Consolidate reindex warnings and progress UIs into single flyout UI

Details

This feature will be similar in flow to the upgrade assistant in 5.6 and will:

Make the old index read-only
Create new index with the same settings and mappings
Begin the reindexing using the Reindex API
Wait for reindex to finish
Alias old index name to point to new index and delete old index
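The five steps above map onto Elasticsearch REST calls roughly as in the following sketch. It only builds request descriptions as plain data and sends nothing. The `EsRequest` shape, the function names, and the `-reindexed-v7` suffix are hypothetical, and treating "read-only" as a write block (`index.blocks.write`) is an assumption rather than something this issue specifies.

```typescript
// One Elasticsearch REST call, described as plain data so the plan can be inspected.
interface EsRequest {
  method: "GET" | "PUT" | "POST";
  path: string;
  body?: Record<string, unknown>;
}

// Hypothetical naming scheme for the new index; the issue does not specify one.
function newIndexName(oldIndex: string): string {
  return `${oldIndex}-reindexed-v7`;
}

// Steps 1-3: block writes on the old index, create the target, start the reindex.
function startRequests(
  oldIndex: string,
  settings: Record<string, unknown>,
  mappings: Record<string, unknown>
): EsRequest[] {
  const target = newIndexName(oldIndex);
  return [
    { method: "PUT", path: `/${oldIndex}/_settings`, body: { "index.blocks.write": true } },
    { method: "PUT", path: `/${target}`, body: { settings, mappings } },
    {
      method: "POST",
      // Run as a background task so the caller gets a task id to poll.
      path: "/_reindex?wait_for_completion=false",
      body: { source: { index: oldIndex }, dest: { index: target } },
    },
  ];
}

// Step 4: request used to poll the task returned by the reindex call.
function pollRequest(taskId: string): EsRequest {
  return { method: "GET", path: `/_tasks/${taskId}` };
}

// Step 5: add the alias and delete the old index in a single aliases call.
function finishRequest(oldIndex: string): EsRequest {
  return {
    method: "POST",
    path: "/_aliases",
    body: {
      actions: [
        { add: { index: newIndexName(oldIndex), alias: oldIndex } },
        { remove_index: { index: oldIndex } },
      ],
    },
  };
}
```

Keeping the plan as data makes it easy to persist which step was last issued, which matters for the persistence problem described next.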

One issue with this flow last time was persistence. Almost all of the logic was driven by client-side code, so the process stopped if you left the page in the browser. This time we want to persist the reindex process in a saved object and leverage the Task Manager (#24356) to poll Elasticsearch's Task API (naming is fun) for the status of the reindex task and resume the flow once the reindex is done. We've decided to persist this state in saved objects that we will update using optimistic concurrency.

We are going to break this work into two parts: first get this working ONLY while the browser is on the page, and then, if we have time, add a worker that could handle this in the background. We should also be able to offer a reindex progress indicator and the ability to abort or reset a reindex process.

Browser-driven iteration

For each reindex operation, we will create a saved object that acts as a state-machine to track the steps of the reindex process. To update this object, we will utilize the version parameter in Elasticsearch to ensure that there are not two browser tabs (or workers) attempting to update the object simultaneously.
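The version check can be illustrated with a small in-memory stand-in. This is a sketch of optimistic concurrency in general, not the saved objects client: `VersionedStore`, `UpdateResult`, and the document id used in the usage note are all hypothetical names.

```typescript
// A stored document together with the version it was last written at.
interface Versioned<T> {
  version: number;
  attributes: T;
}

type UpdateResult<T> =
  | { ok: true; doc: Versioned<T> }
  | { ok: false; reason: "not_found" | "version_conflict" };

// Hypothetical in-memory store that mimics a version-checked update.
class VersionedStore<T> {
  private docs: { [id: string]: Versioned<T> } = {};

  create(id: string, attributes: T): Versioned<T> {
    const doc: Versioned<T> = { version: 1, attributes };
    this.docs[id] = doc;
    return doc;
  }

  // The write succeeds only if the caller saw the latest version.
  update(id: string, attributes: T, expectedVersion: number): UpdateResult<T> {
    const current = this.docs[id];
    if (current === undefined) {
      return { ok: false, reason: "not_found" };
    }
    if (current.version !== expectedVersion) {
      return { ok: false, reason: "version_conflict" };
    }
    const doc: Versioned<T> = { version: current.version + 1, attributes };
    this.docs[id] = doc;
    return { ok: true, doc };
  }
}
```

If two tabs both read version 1 of a document such as `reindex:logs` and both try to update it, the first update succeeds and bumps the version to 2, while the second is rejected with `version_conflict` and has to re-read before retrying.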

Reindex flow:

User clicks "reindex", browser makes API call to server to begin reindexing for the given index.
Server creates a saved object to track this reindex operation with a status. Begins the first steps of the reindex: set old index as readonly, create new index, start the reindex operation. For each step of the way, we update the saved object's status field to track the state machine.
While the browser tab is on the Upgrade Assistant page, the browser will continue to poll for known reindexes in progress.
Once the reindex has finished, the server will complete the reindex process: alias the new index, delete the old index, mark the reindex operation as completed.
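The status field described in this flow could be modelled as a linear state machine along these lines. The step names and the `advance` helper are hypothetical; the only behaviour taken from the flow above is that everything after "start the reindex operation" waits until the Elasticsearch task has finished.

```typescript
// Hypothetical step names, in the order the flow above performs them.
const STEPS = [
  "created",
  "readonly",
  "newIndexCreated",
  "reindexStarted",
  "reindexCompleted",
  "aliasSwitched",
  "completed",
] as const;

type Step = typeof STEPS[number];

interface ReindexOp {
  indexName: string;
  step: Step;
}

// Move the operation one step forward. While the reindex task is still
// running, the operation stays in "reindexStarted" until a poll reports it done.
function advance(op: ReindexOp, reindexTaskDone: boolean): ReindexOp {
  if (op.step === "reindexStarted" && !reindexTaskDone) {
    return op;
  }
  const next = STEPS[STEPS.indexOf(op.step) + 1];
  // "completed" has no successor, so the operation is returned unchanged.
  return next === undefined ? op : { indexName: op.indexName, step: next };
}
```

Each call to `advance` corresponds to one saved object update, so a reader of the object can always tell which step the server last completed.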

If the user leaves the page while the browser is polling, the alias switchover will not complete until they return to the upgrade assistant.

Worker-driven iteration

Largely the same flow, but we will have an in-process worker on the server side that will look for in-progress reindex operations and continue to poll for their completion.

To reduce overhead from polling Elasticsearch, we could boot up this worker only if there are known reindexes in progress. This check would be done at startup and whenever a new reindex operation is started.

Potential problem:

kibana1 starts up, no reindex operations in progress, does not start worker.
kibana2 starts up, receives request to start reindex operation, starts worker.
kibana2 crashes before reindex is complete
kibana1 never starts worker, reindex operation is not shown as completed (and aliases not swapped over).

We could address this issue by either:

Polling for in-progress reindex operation saved objects on a regular but infrequent basis (say, every 5 minutes). If a new one is found, start polling its progress frequently (every 10s).
Polling for in-progress reindex operation saved objects whenever the user visits the Upgrade Assistant.
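The first option amounts to a two-speed poll. A minimal sketch, using the 5 minute and 10 second figures suggested above; the constant and function names are hypothetical.

```typescript
// Intervals taken from the proposal above: slow discovery, fast progress tracking.
const IDLE_POLL_MS = 5 * 60 * 1000;
const ACTIVE_POLL_MS = 10 * 1000;

// Pick the delay before the next poll based on how many in-progress
// reindex operation saved objects the last poll found.
function nextPollDelayMs(inProgressCount: number): number {
  return inProgressCount > 0 ? ACTIVE_POLL_MS : IDLE_POLL_MS;
}
```

In the kibana1/kibana2 scenario, kibana1 would notice the orphaned operation at its next idle poll, so the worst-case delay before it takes over is one idle interval.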

Known Unknowns

Which settings should be copied from the original index to the new index? So far, I know these cannot be copied:
- index.uuid
- index.creation_date
- index.version.created
- index.version.upgraded
- index.provided_name
- index.blocks
- index.legacy
Can we intelligently block the user from using this tool for large indices? If so, how do we decide this? Can ES's reindex API tell us whether or not this process should succeed?
UI Design
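For the settings question, the list of settings that cannot be copied could be applied as a simple filter. This sketch assumes the settings were fetched in flat form (dotted keys, as returned with `flat_settings=true`); `UNCOPYABLE_SETTINGS` and `copyableSettings` are hypothetical names, and the list contains only the entries known so far.

```typescript
// Settings named in this issue as not copyable to the new index.
const UNCOPYABLE_SETTINGS = [
  "index.uuid",
  "index.creation_date",
  "index.version.created",
  "index.version.upgraded",
  "index.provided_name",
  "index.blocks",
  "index.legacy",
];

// Return a copy of the flat settings without the uncopyable keys.
// A key is dropped if it equals a listed name or is nested under it,
// for example "index.blocks.write" under "index.blocks".
function copyableSettings(flat: { [key: string]: string }): { [key: string]: string } {
  const result: { [key: string]: string } = {};
  Object.keys(flat).forEach((key) => {
    const blocked = UNCOPYABLE_SETTINGS.some(
      (name) => key === name || key.indexOf(name + ".") === 0
    );
    const value = flat[key];
    if (!blocked && value !== undefined) {
      result[key] = value;
    }
  });
  return result;
}
```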

Possible Improvements

Should we offer an option to reindex many small indices in a single action (done in serial, not in parallel)?


elasticmachine · 2018-11-28T19:32:05Z

Pinging @elastic/kibana-operations

alexfrancoeur · 2018-12-10T15:09:32Z

@joshdover whenever you have a UI around this and would like some feedback, let me know and I can take a look.

joshdover · 2018-12-17T23:07:52Z

I've updated the issue to include our plan to use saved objects + browser-driven polling for the first iteration and how we'll add background polling in the second if time permits.

cc @epixa @tylersmalley

@alexfrancoeur Will do!

joshdover · 2019-01-14T16:08:11Z

@droberts195 Can you add information to this ticket about any special handling that ML indices need during the reindex? As of right now, I know that ML jobs will need to be paused while reindexing and then resumed.

How do I identify an ML index?
How do I identify which jobs are indexing into that index?
Which APIs would I use to pause and resume jobs?

If there's anything else that needs to be handled, please add that here as well.

droberts195 · 2019-01-14T16:52:50Z

@joshdover we started off along the path of upgrading ML indices without pausing the ML jobs - elastic/elasticsearch#36643. This is more complex but nicer for users who are running real-time anomaly detection and have large ML indices that date back to 5.x. If we were to continue along that path then the Kibana side logic would be not to reindex ML indices using the Kibana functionality but instead call that endpoint. The idea of pausing jobs by cancelling allocations of ML persistent tasks only came up last week. We'll decide within the next couple of days whether to switch to that approach.

> How do I identify an ML index?

They all start with .ml-.

> How do I identify which jobs are indexing into that index?

We have different types of ML indices:

Results (pattern .ml-anomalies*)
State (.ml-state)
Metadata (.ml-meta)
Notifications (.ml-notifications)
Config (.ml-config - cannot possibly need reindexing in 6.7 as it was only added in 6.6)
Annotations (.ml-annotations - cannot possibly need reindexing in 6.7 as it was only added in 6.6)

.ml-state, .ml-meta and .ml-notifications are shared by all jobs.

.ml-meta and .ml-notifications are small, infrequently written, and failure to write to them won't cause running jobs to fail, so I think they can just be reindexed using the standard migration assistant procedure.

Reindexing .ml-state would require all jobs to be paused while it is reindexed. But if we carry on along the ML upgrade endpoint path instead then the UI should not allow the standard migration procedure to run against it, but instead call the ML upgrade endpoint.

For the results indices, .ml-anomalies*, there is an alias for each job that points to its results index. You could use these aliases to work out which jobs are using each index. (Also, all these aliases need to be switched over to the new index after reindexing is complete - does the migration assistant already switch over arbitrary amounts of existing aliases?) The work already done in elastic/elasticsearch#36643 can handle migration of these indices and aliases while jobs remain running, so if we continue along that path then the UI should not allow the standard migration procedure to run against any index matching .ml-anomalies*, but instead call the ML upgrade endpoint.

> Which APIs would I use to pause and resume jobs?

There are no APIs to do this currently. If we decide to switch from online upgrade to pause/resume upgrade then we'll have to add these APIs into 6.7.

Given the work that's been done so far I'm not convinced that the pause/resume option is the easiest way forward.

To summarise there are two ways forward:

Continue with the ML migration endpoint. Kibana disallows standard migration for .ml-state* and .ml-anomalies* and if either or both is from 5.x then calls the ML migration endpoint instead.
Add ES endpoints to pause and resume ML jobs. Standard Kibana index migration is used for .ml-state* and .ml-anomalies* but pausing jobs before reindex and resuming after.

In either case, .ml-meta and .ml-notifications can be reindexed using the standard procedure.

joshdover · 2019-01-15T15:54:06Z

@droberts195 Thanks for writing this up. I think the best course for us right now is to wait on your decision and then jump on a video call to work out the details, depending on which path the ML team decides to move forward with.

From my perspective, it may actually be simpler for Kibana to use the ML-specific reindexing endpoint rather than pausing/resuming jobs. I think it's most likely too late for this upgrade cycle, but we should probably explore using this approach with other user indices in the 8.0 upgrade cycle. If we can accomplish zero-downtime reindexing that would be great for many use-cases.

> Also, all these aliases need to be switched over to the new index after reindexing is complete - does the migration assistant already switch over arbitrary amounts of existing aliases?

This is not something that is handled right now by the Upgrade Assistant and actually something we hadn't considered. I'm going to take a look at this today and see how the current logic would behave when reindexing an index that already has an alias. I agree that moving any aliases should be handled by the Upgrade Assistant.

droberts195 · 2019-01-15T16:16:25Z

@joshdover I spoke to @bleskes this morning and we're going to go with the pause/resume option. We're going to discuss exactly how in the ES distributed team's weekly meeting tomorrow, so I'll update this issue after that. @benwtrent will probably do the work for this.

> I'm going to take a look at this today and see how the current logic would behave when reindexing an index that already has an alias

I'm surprised that no customers complained about that in the 5.6 to 6.x upgrade. It should be possible to add arbitrarily many aliases to the new index in the same operation where you delete the old index. It would be similar to what's in the "It is also possible to swap an index with an alias in one operation" example in https://www.elastic.co/guide/en/elasticsearch/reference/6.x/indices-aliases.html, but you can have many add actions in the same request so as well as adding an alias to the new index with the same name as the old index you could add additional aliases to the new index to replace all the aliases that the old index had.
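A sketch of the request body this describes: one `add` action for the old index name, one `add` per alias the old index had, and a `remove_index` for the old index, all in a single aliases request. The function name and the index names in the usage note are hypothetical, and the sketch copies alias names only, not alias filters or routing.

```typescript
type AliasAction =
  | { add: { index: string; alias: string } }
  | { remove_index: { index: string } };

// Build the body for a single aliases request that swaps the old index
// for an alias and carries the old index's aliases over to the new index.
function buildAliasSwap(
  oldIndex: string,
  newIndex: string,
  existingAliases: string[]
): { actions: AliasAction[] } {
  const actions: AliasAction[] = [{ add: { index: newIndex, alias: oldIndex } }];
  for (const alias of existingAliases) {
    actions.push({ add: { index: newIndex, alias: alias } });
  }
  actions.push({ remove_index: { index: oldIndex } });
  return { actions: actions };
}
```

For example, `buildAliasSwap("logs", "logs-new", ["logs-read"])` yields three actions: add `logs` as an alias on `logs-new`, add `logs-read` on `logs-new`, and remove the index `logs`.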

joshdover · 2019-01-17T23:44:18Z

@droberts195 @benwtrent Here's the plan I went over with Ben yesterday, written out for clarity:

ML will provide two APIs in Elasticsearch. One will stop/pause all ML jobs and the other will resume/restart all ML jobs.
When Kibana is reindexing any .ml-state* or .ml-anomalies* indices it will:
- Call the ML stop endpoint in ES
- Set the index as read only
- Reindex the data, with no transformations.
- Create an alias from the original index name to the new index name, copy any aliases pointing to the old index over to the new index, and delete the old index. This will all happen in a single atomic Update Aliases call.
- Resume ML jobs only if this is the only ML index still being reindexed. If there are others in progress, this step will be skipped so that only the last index to finish will resume the ML jobs.
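The "last one out resumes ML" rule in the final step can be sketched as a pure function over the names of the indices currently being reindexed. The helper names are hypothetical; the prefixes are the `.ml-state*` and `.ml-anomalies*` patterns named in this plan.

```typescript
// True for the ML indices that, per the plan above, require ML to be stopped.
function needsMlPause(indexName: string): boolean {
  return (
    indexName.indexOf(".ml-state") === 0 || indexName.indexOf(".ml-anomalies") === 0
  );
}

// Decide whether finishing `finishedIndex` should resume ML jobs.
// `inProgress` lists the indices with reindex operations still running and may
// or may not still include `finishedIndex` itself.
function shouldResumeMl(finishedIndex: string, inProgress: string[]): boolean {
  if (!needsMlPause(finishedIndex)) {
    return false;
  }
  const othersRunning = inProgress.some(
    (name) => name !== finishedIndex && needsMlPause(name)
  );
  return !othersRunning;
}
```

Because the check and the resume call are separate steps, two operations finishing at nearly the same time could both skip or both perform the resume, so the check would need the same version-guarded update used for the saved objects.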

Note, with this plan, we are not pausing/resuming specific ML jobs, but instead pausing and resuming all ML jobs. If we need to do specific jobs we could, but I'm not sure that optimization is needed at this time.

droberts195 · 2019-01-18T17:10:19Z

Thanks @joshdover that plan sounds good to me.

The pause/resume endpoints we're thinking of using at the moment are:

_ml/set_upgrade_mode?enabled=true
_ml/set_upgrade_mode?enabled=false

These still aren't implemented so it's possible someone will object to that naming and we'll have to change it, but the difficulty in calling the endpoints will not be any higher than that.

joshdover · 2019-01-23T22:05:52Z

Great! @benwtrent is there a PR to follow for this? I didn't see one when I briefly poked around the ES repo. Also, with this API will it be guaranteed that Kibana can set indices to read-only as soon as we've gotten a response back from this API?

benwtrent · 2019-01-24T13:03:41Z

@joshdover I am currently writing tests for the API. The PR should be opened this week or early next week.

Yes, once the API returns, the indices can be set to read-only and reindexing can begin.

benwtrent · 2019-01-24T23:14:44Z

@joshdover PR: elastic/elasticsearch#37837

It's a biggie, lots of stuff going on to enable this change. Should get some reviewers taking a gander tomorrow/Monday and hopefully have it finished early next week :)

benwtrent · 2019-01-28T20:04:13Z

elastic/elasticsearch#37942

This PR fixes a bug with the set-upgrade-mode API. Apparently I did not account for the situation when there were no tasks to worry about :/

joshdover · 2019-02-20T17:32:13Z

All the planned work on this is complete.

joshdover added Team:Operations and v6.7.0 labels Nov 28, 2018

joshdover self-assigned this Nov 28, 2018

joshdover mentioned this issue Dec 18, 2018

Add reindex feature to Upgrade Assistant #27457

Merged

6 tasks

joshdover changed the title ~~Add reindex feature to Upgrade Assistant~~ Upgrade Assistant - Phase 2 - Reindexing Jan 14, 2019

joshdover mentioned this issue Jan 28, 2019

Upgrade Assistant cannot reindex Elasticsearch's .tasks index #29454

Closed

This was referenced Feb 2, 2019

[upgrade] Add cancel button to reindexing #29913

Merged

Filter out security realm deprecations on Cloud #30018

Merged

[upgrade] Localize reindexing flyout in Upgrade Assistant #30432

Merged

joshdover closed this as completed Feb 20, 2019
