[Segment Replication] Support for mixed cluster versions (Rolling Upgrade) #3881
Based on the proposed solution suggested above, do we need to just ensure we set the Lucene version on the IndexWriter? The only place I can find this being used today is for shrink/split operations during local shard recovery (OpenSearch/server/src/main/java/org/opensearch/index/shard/StoreRecovery.java, lines 206 to 217 in fe0b917).
@Bukhtawar This seems like a possible alternative worth exploring. This setting is not dynamically updatable within LiveIndexWriterConfig, so we'd need to recreate the writer/engine on primaries once all nodes have finished upgrading, which may not be so bad? At a minimum, we could also block any replica from syncing to segments that are ahead. This would slow freshness but still gives a path forward.
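To make the non-dynamic nature of this concrete, here is a minimal, standalone sketch (not OpenSearch code), assuming the mechanism in question is Lucene's `IndexWriterConfig#setIndexCreatedVersionMajor`: the value is fixed when the writer is built, so changing it means closing and recreating the writer, which is why the engine on the primary would have to be rebuilt after the upgrade.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PinnedVersionWriter {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/pinned-version-index"))) {
            IndexWriterConfig iwc = new IndexWriterConfig()
                // Not a live setting: Lucene only accepts LATEST.major or LATEST.major - 1,
                // and changing it requires building a brand new IndexWriter.
                .setIndexCreatedVersionMajor(Version.LATEST.major - 1)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                writer.commit(); // the commit records the pinned created-version major
            }
        }
    }
}
```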
Thanks @mch2. I guess the cluster usually operates in a backward-compatible mode until all nodes have upgraded, and the primary moves to the newer version before the replicas do. So we just need to ensure that, in a mixed-version cluster, upgraded primaries start with the older IW version and replicate as usual to the older-version replicas. Then, once all nodes have upgraded (and therefore all replicas have upgraded), we recreate the IW on the primary. It might not be as bad if we could handle in-flight operations gracefully by buffering them up and re-driving them during the IW config switch (similar to primary relocation handoff).
Thanks Bukhtawar, this is an interesting problem and definitely a blocker for the segment replication release.
Doesn't this imply that nodes on the newer version continue to create segments with the older version until all nodes are upgraded? Or are we saying it is just backward compatible (from a read perspective), but no upgraded node can be downgraded even if the entire cluster is not yet on the latest version?
@muralikpbhat, @Bukhtawar: I understand this idea as the newer version continuing to create segments with the older codec until all nodes are upgraded, but I don't think the cluster operates this way currently. I played around with this a bit with the Lucene demo. Setting …
We are going to start working on this. Writing up some smaller steps.
Next step: |
Looking into it |
Support for mixed cluster versions:
Additionally check the working of:
The POC we have done proves that we can load an older codec to write with. An issue with this approach is that we will need to provide bwc versions of non-default codecs to be loaded by CodecService, and map to them there. Some additional thoughts:
@Poojita-Raj @Bukhtawar @nknize curious about your thoughts on this.
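As a rough illustration of the POC idea (a hedged sketch, not the actual POC code; the codec name is just an example), loading an older codec through Lucene's codec SPI and handing it to the writer config looks roughly like this. Non-default codecs from plugins would need their own bwc variants registered under distinct names for the same lookup to cover them.

```java
import java.nio.file.Paths;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LoadOlderCodec {
    public static void main(String[] args) throws Exception {
        // Resolve a codec by its SPI name; "Lucene92" is an example of an older 9.x codec
        // and requires the backward-codecs module on the classpath.
        Codec olderCodec = Codec.forName("Lucene92");
        IndexWriterConfig iwc = new IndexWriterConfig().setCodec(olderCodec);
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/older-codec-index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
            // Segments flushed from here on use the configured codec; whether the write path
            // is actually available depends on the codec (see the #7698 findings below).
            writer.commit();
        }
    }
}
```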
One of the major friction points for users upgrading to newer versions could be downtime. Trying to understand the behaviour better: during an upgrade, once a primary shard moves over to the new version, at what point can we flip it back to the newer IndexWriter once all its replicas have upgraded? Would it amount to disruption while the index writer is closing and reopening with the new codecs, or can this be handled gracefully, more like allowing older requests to finish concurrently with newer in-flight requests served by the new writer?
Coming from #7698, I see the current Lucene library version is not write compatible when using previous-major Lucene codecs, i.e. using an N-1 codec in the Nth version of Lucene. Any attempt at an indexing operation fails with UnsupportedOperationException.
Step 1. Create an index with an older Lucene codec using the current Lucene version 9x (any of the above 3).
Step 2. Perform an index operation.
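A minimal sketch of the failure described above (assumptions: Lucene 9.x with the backward-codecs module on the classpath, and "Lucene87" as the previous-major codec name; this is an illustration, not the exact #7698 reproduction):

```java
import java.nio.file.Paths;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PreviousMajorCodecRepro {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig()
            // Step 1: configure an N-1 major codec (read-only in Lucene 9.x).
            .setCodec(Codec.forName("Lucene87"));
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/n-minus-1-index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
            Document doc = new Document();
            doc.add(new TextField("field", "value", Field.Store.YES));
            // Step 2: indexing fails (UnsupportedOperationException) because the bwc
            // codec's write formats, e.g. Lucene87StoredFieldsFormat, are read-only.
            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```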
Thanks for this @dreamer-89, this breaks our current downgrade proposal across major versions. I've had some offline discussions with @Bukhtawar and @nknize on the possibility of running two Lucene versions side by side for major version upgrades. The initial idea was to load the old IW version inside of a separate engine implementation and support that with each OS release. Unfortunately this won't work because we would need the same major version of Lucene across all ingest packages, particularly during analysis. So I don't think we have a clear path with this solution. Perhaps as part of a migration component in the future, but we would need to figure out how to essentially load two versions of OpenSearch. I think we can solve this for segrep by preventing this invalid mixed cluster state (primary on an upgraded node copying to replicas on old nodes) rather than supporting it directly. The two ways you would get into this state are:
From our docs, we suggest turning off replica allocation during the upgrade process. In this case we end up with a cluster where all replicas are UNASSIGNED until the setting is flipped back. So IMO nothing is required here?
Switching replicas that have fallen behind to docrep is an idea, but it would be problematic because replicas are relocated directly to the upgraded nodes, so we would have to force recovery from the primary in these cases to have the same segments. Also, if a primary dies and a docrep shard is promoted, we have to reconcile that with the other replicas / remote storage. In this case, we could provide an option to move replicas first if relocation is triggered. We have a setting to move primaries first (#1445), so we could do the inverse. The rationale for that setting was to prevent all copies of any single shard from being relocated before any copies of another shard, so we would lose this guarantee unless we moved a single replica copy for each replication group before moving the rest. This would leave us in a temporary state where all primaries are on old nodes and replicas on new nodes until the upgrade is completed. If the old nodes/primaries are overwhelmed, the primary-first setting could be enabled.
Thank you @mch2 for your comment and for dividing this problem into two separate sub-problems, i.e. rolling upgrades and relocation upgrades, along with possible solutions.

Rolling upgrades - No fix is needed
For the rolling upgrades case, there is a recommendation [1] to have
It is still possible to force-move primary shards to upgraded nodes, but it is not recommended. This needs to be updated in the rolling upgrade documentation. Based on this, the only use case that needs to be solved is relocation-based upgrades (state 2 in the last comment).

Relocation based upgrades
Using the proposals from the previous comment, the possible solutions for relocation-based upgrades are compiled below.

1. No change
Do not change anything around version upgrades. With primary shards sitting on upgraded nodes, there will be replica shard failures on non-upgraded nodes. These replica shards will eventually go to

Pros
Cons
2. Reset codec in mixed cluster state
When the cluster is running in a mixed-version state, reset the codec on the primary to the version running on the non-upgraded nodes so that segment files (now written with the older Lucene codec) can be read on replicas which are still running on non-upgraded nodes. Note that replicas running on upgraded nodes can still read files from the primary. We used an OpenSearch version to Lucene codec name mapping that allows the primary to pick the bwc codec name (see the sketch below); the codec is reset to the latest one once the upgrade is complete. This solution was attempted in #7698 but has the two limitations below.

Custom codecs
Plugins that have custom codec implementations load the current custom version of the Lucene codec, which prevents codec downgrade. Thus, this solution also needs changes in downstream components which override CodecService. One such example is the k-NN plugin.
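A hypothetical sketch of the version-to-codec mapping described in this option (class, method, and version/codec pairs are illustrative, not the actual #7698 change); a plugin codec would need equivalent bwc entries for the lookup to cover it:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public final class BwcCodecLookup {
    // Minimum OpenSearch version present in the cluster (encoded as major * 100 + minor)
    // mapped to the Lucene codec name the primary should write with. Example entries only.
    private static final NavigableMap<Integer, String> CODEC_BY_MIN_VERSION = new TreeMap<>();
    static {
        CODEC_BY_MIN_VERSION.put(2_04, "Lucene94");
        CODEC_BY_MIN_VERSION.put(2_07, "Lucene95");
    }

    /** Pick the newest codec that the oldest node in the cluster can still read. */
    public static String codecForMinNodeVersion(int encodedMinNodeVersion) {
        var entry = CODEC_BY_MIN_VERSION.floorEntry(encodedMinNodeVersion);
        if (entry == null) {
            throw new IllegalArgumentException("No bwc codec known for " + encodedMinNodeVersion);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        // A primary on an upgraded node, with the oldest node at 2.5, falls back to the 2.4-era codec.
        System.out.println(codecForMinNodeVersion(2_05)); // -> Lucene94
    }
}
```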
Major lucene version bumps
For major version bumps this solution doesn't work because Lucene treats previous-major (N-1) codecs as read-only and maintains them only for backward compatibility [1]. Using previous-major codecs with an IndexWriter results in UnsupportedOperationException. For example, in the current Lucene 9x, the Lucene87StoredFieldsFormat class (which defines the format of the files that store data, specific fields and metadata (.fdt, .fdm etc.)) is defined in backward-codecs and prevents initialization of any write-related methods. Any attempt to perform indexing operations results in UnsupportedOperationException.

3. Move replicas first to upgraded nodes
In relocation upgrades, move replica shard copies first to upgraded nodes and move the primary only when all (or a certain %) of its replicas have moved to upgraded nodes. Replica shards on upgraded nodes with updated codecs can read segment files written in the older Lucene version on the non-upgraded node containing the primary shard. This holds true due to Lucene bwc read guarantees, which allow the current major version to read segment files written in all minors of the previous major. For example, data files written with Lucene 8x can be read by a replica shard running on Lucene 9x.

Pros
Cons
4. Use docrep during rolling upgrades
Use document replication in the mixed cluster state, which allows primary and replicas to index and create segment files independently. The primary drives this engine switch only when a replica runs on a non-upgraded node and the primary on an upgraded node. When the node containing a replica shard is upgraded, the primary ignores the event. When the primary shard is upgraded, it pings all replica shards to promote to a writeable engine. Replica shards running on non-upgraded nodes follow this directive and switch to InternalEngine, while replicas already on upgraded nodes ignore the command. Similarly, when the upgrade completes, the primary shard pings all replica shards to switch back to NRTReplicationEngine; only the replica shards which previously switched demote their engine. During this demotion, replica shards perform a force sync from the primary to mirror the primary's files. During this sync, a replica shard ignores segment file diffs (which arise from the new segment files it wrote with InternalEngine). Once the sync completes, the replica shard performs a store cleanup which removes all segment files that don't belong to the latest commit. Example state: Replica 1 and Replica 2 are replica shard copies of Primary; Other Replica is a replica copy belonging to a different index.

Pros
Cons
Proposal
Based on the above, moving replicas first seems to be the cleanest and least risky solution compared to the others, and is thus the preferred solution. Tagging folks for review and feedback: @mch2 @andrross @nknize @Bukhtawar

References
[1] https://github.com/apache/lucene/tree/main/lucene/backward-codecs
Is rolling back a partial upgrade also a concern, and can we solve it here? If 1/3 of the nodes in a cluster have been upgraded to 4.0, I think we want to allow rolling that back, or recovering from / failing over to a replica. I think this gives us actual rollback from failed upgrades in flight!
I don't think rollback is an option given that segments created in higher versions cannot be understood by Lucene in a lower version, though I would ideally like to address this BWC issue in Lucene to support creating segments with IW across major versions, just the way minor versions are supported. Is there a discussion thread on Lucene that we can brainstorm on? I would potentially want us to evaluate a Blue-Green option of updating the replica copies first and then the primaries, to get past the primary-overload issue on the last few non-upgraded nodes.
Thanks @Bukhtawar for your comment and feedback. I opened an issue with Lucene, where I see a similar ask in the past for N-1 writer & reader support. That ask was identified as a bigger task, and it was encouraged to be solved in the distributed system. Please feel free to comment on the issue to initiate more discussion.
Yes, we are proposing to move replica copies first for relocation-based upgrades (the Blue-Green option), and the work is tracked in #8265. For rolling upgrades, the documentation recommends that users disable replica shard allocation using
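For reference, a minimal sketch of toggling replica allocation around a rolling upgrade, assuming the commonly documented `cluster.routing.allocation.enable` setting (verify the exact setting and values against the rolling upgrade documentation):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ToggleReplicaAllocation {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String putClusterSetting(String value) throws Exception {
        // Transient so the setting disappears on a full cluster restart; "primaries" pauses
        // replica allocation during the upgrade, "all" restores the default afterwards.
        String body = "{ \"transient\": { \"cluster.routing.allocation.enable\": \"" + value + "\" } }";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(putClusterSetting("primaries")); // before upgrading nodes
        // ... upgrade nodes one at a time ...
        System.out.println(putClusterSetting("all"));       // after the upgrade completes
    }
}
```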
After the discussion we've had here, the final conclusion when it comes to providing mixed version cluster support is to move replicas first to upgraded nodes. We are introducing a replica first shard movement setting in order to achieve this. To support mixed cluster versions:
…relocation (#8875) (#9153)

When some node or set of nodes is excluded, the shards are moved away in random order. When segment replication is enabled for a cluster, we might end up in a mixed version state where replicas will be on a lower version, unable to read segments sent from higher version primaries, and fail. To avoid this, we could prioritize replica shard movement to avoid entering this situation.

Adding a new setting called shard movement strategy - `SHARD_MOVEMENT_STRATEGY_SETTING` - that will allow us to specify in which order we want to move our shards: `NO_PREFERENCE` (default), `PRIMARY_FIRST` or `REPLICA_FIRST`. The `PRIMARY_FIRST` option will perform the same behavior as the previous setting `SHARD_MOVE_PRIMARY_FIRST_SETTING`, which will now be deprecated in favor of the shard movement strategy setting.

Expected behavior: If `SHARD_MOVEMENT_STRATEGY_SETTING` is changed from its default to either `PRIMARY_FIRST` or `REPLICA_FIRST`, then we perform this behavior whether or not `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled. If `SHARD_MOVEMENT_STRATEGY_SETTING` is still at its default setting of `NO_PREFERENCE` and `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled, we move the primary shards first. This ensures that users still using this setting will not see any changes in behavior.

Reference: #1445
Parent issue: #3881

---------

Signed-off-by: Poojita Raj <[email protected]>
(cherry picked from commit c6e4bcd)
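To adopt the replica-first behaviour during an upgrade, the new strategy can be set through the cluster settings API before relocating shards to upgraded nodes. A sketch, assuming the setting registered by `SHARD_MOVEMENT_STRATEGY_SETTING` is exposed as `cluster.routing.allocation.shard_movement_strategy` (verify the key and accepted values for your version):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetShardMovementStrategy {
    public static void main(String[] args) throws Exception {
        // Assumed setting key; REPLICA_FIRST moves replica copies to new nodes before primaries.
        String body = "{ \"persistent\": { "
            + "\"cluster.routing.allocation.shard_movement_strategy\": \"REPLICA_FIRST\" } }";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```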
If the update requires documentation for 2.10, please create a doc issue. Thanks!
Closing out this issue as all work has been completed. Doc changes are being tracked here: opensearch-project/documentation-website#4827
Is your feature request related to a problem? Please describe.
Once we enable segment based replication for an index, we would need to solve version upgrades leading to mixed version clusters.
We can have an index with its primary on OS 3.x and a replica on OS 4.x; this itself is not a problem, since the next major version has Lucene wire compatibility with its previous major version. At some point during the upgrade process it's possible that the primary moves to OS 4.x while its replica sits on OS 3.x. Now, when segrep tries to replicate segments to a replica sitting on a host with a lower version, the replica would fail to identify the segments, as they might be created on a higher Lucene version which it doesn't know about.
During a rolling upgrade, primary shards assigned to a node running the new version cannot have their replicas on the old version.
This is not a problem in document-based replication, as segments are independently created on nodes of both versions. Say a replica fails on the older version: it cannot be assigned to a node which has a lower version than its primary, nor can shards move from a higher-version node to a lower-version node.
Describe the solution you'd like
We should support mixed-version clusters with segrep. One way to achieve this is to create segments on the lower version even on nodes with the higher version while the upgrade is in progress, i.e. the cluster should operate in a BWC mode.
Describe alternatives you've considered
Segregate ALL primaries on one set of nodes and replicas on a different set of nodes. Upgrade the set of nodes hosting replicas and then upgrade the set of nodes hosting primaries. However, this can cause primary-only nodes to be overloaded in a homogeneous cluster setup.
Additional context
Add any other context or screenshots about the feature request here.