
[Segment Replication] Support for mixed cluster versions (Rolling Upgrade) #3881

Closed
Tracked by #2194
Bukhtawar opened this issue Jul 13, 2022 · 20 comments
Assignees
Labels
>breaking, discuss, distributed framework, enhancement, Indexing:Replication, v2.8.0

Comments

@Bukhtawar
Collaborator

Is your feature request related to a problem? Please describe.
Once we enable segment-based replication for an index, we need to handle version upgrades that lead to mixed-version clusters.
An index can have its primary on OS 3.x while its replica is on OS 4.x. This by itself is not a problem, since each major version has Lucene wire compatibility with the previous major version. However, at some point during the upgrade process it is possible that the primary moves to OS 4.x while its replica still sits on OS 3.x. When segrep then tries to replicate segments to a replica on a lower-version host, the replica fails to read the segments because they may have been created with a higher Lucene version that it does not understand.
During a rolling upgrade, primary shards assigned to a node running the new version cannot have their replicas on the old version.

This is not a problem in document-based replication because segments are created independently on nodes of both versions. If a replica fails on the older version, it cannot be assigned to a node with a lower version than its primary, nor can shards move from a higher-version node to a lower-version node.

Describe the solution you'd like
We should support mixed-version clusters with segrep. One way to achieve this is to create segments on the lower version even on nodes running the higher version while the upgrade is in progress, i.e. the cluster should operate in a BWC mode.

Describe alternatives you've considered
Segregate ALL primaries on one set of nodes and replicas on a different set of nodes. Upgrade the set of nodes hosting replicas first, then upgrade the set of nodes hosting primaries. However, this can cause the primary-only nodes to be overloaded in a homogeneous cluster setup.


@Bukhtawar added the enhancement, untriaged, and >breaking labels and removed the untriaged label on Jul 13, 2022
@Bukhtawar
Collaborator Author

Bukhtawar commented Oct 27, 2022

Based on the proposed solution above, do we just need to ensure we set the Lucene version on the IndexWriter via IndexWriterConfig#setIndexCreatedVersionMajor during peer recovery, so that the IndexWriter keeps using the older Lucene version for segment creation and preserves backward compatibility?

The only place I can find this being used today is for shrink/split operations during local shard recovery:

// From local shard recovery (shrink/split): the target writer is created with the same
// index-created Lucene major version as the source segments.
final int luceneIndexCreatedVersionMajor = Lucene.readSegmentInfos(sources[0]).getIndexCreatedVersionMajor();
final Directory hardLinkOrCopyTarget = new org.apache.lucene.misc.store.HardlinkCopyDirectoryWrapper(target);
IndexWriterConfig iwc = new IndexWriterConfig(null).setSoftDeletesField(Lucene.SOFT_DELETES_FIELD)
    .setCommitOnClose(false)
    // we don't want merges to happen here - we call maybe merge on the engine
    // later once we started it up otherwise we would need to wait for it here
    // we also don't specify a codec here and merges should use the engines for this index
    .setMergePolicy(NoMergePolicy.INSTANCE)
    .setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    .setIndexCreatedVersionMajor(luceneIndexCreatedVersionMajor);

@mch2
Member

mch2 commented Nov 1, 2022

@Bukhtawar This seems like a possible alternative worth exploring. This setting is not dynamically updatable within LiveIndexWriterConfig, so we'd need to recreate the writer/engine on primaries once all nodes have finished upgrading, which may not be so bad?

At a minimum, we could also block any replica from syncing to segments that are ahead of its version. This will slow freshness but still gives a path forward.

@Bukhtawar
Collaborator Author

Thanks @mch2. I guess the cluster usually operates in a backward-compatible mode until all nodes have upgraded, and primaries move to the newer version before replicas do. So we just need to ensure that in a mixed-version cluster, upgraded primaries start with the older IW version and replicate as usual to the older-version replicas. Then, once all nodes have upgraded (and therefore all replicas have upgraded), we recreate the IW on the primary. It might not be too bad if we could handle in-flight operations gracefully by buffering them and re-driving them during the IW config switch (similar to the primary relocation handoff).

@muralikpbhat

Thanks Bukhtawar, this is an interesting problem and definitely a blocker for the segment replication release.

Thanks @mch2 I guess the cluster usually operates in a backward compatible mode till all nodes have upgraded.

Doesn't this imply that nodes on the newer version would continue to create segments with the older version until all nodes are upgraded? Or are we saying it is just backward compatible (from a read perspective), but no node can be downgraded even if the entire cluster is not yet on the latest version?

@mch2
Member

mch2 commented Nov 15, 2022

Doesn't this imply that nodes on the newer version would continue to create segments with the older version until all nodes are upgraded? Or are we saying it is just backward compatible (from a read perspective), but no node can be downgraded even if the entire cluster is not yet on the latest version?

@muralikpbhat, @Bukhtawar. I understand this idea as: the newer version would continue to create segments with the older codec until all nodes are upgraded, but I don't think the cluster operates this way currently. I played around with this a bit using the Lucene demo. Setting setIndexCreatedVersionMajor in the iwc will bypass the check in SegmentInfos.init, but it does not control the codec with which new segments are written. We would need the new shard to understand the codec in use on replicas, ensure it is available, and set it with IndexWriterConfig.setCodec for the segments to be readable.
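
A minimal sketch of those two knobs together (illustrative only; assumes the older codec is resolvable, e.g. via the lucene-backward-codecs module):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriterConfig;

// setIndexCreatedVersionMajor only satisfies the SegmentInfos created-version check;
// the codec used for newly written segments has to be chosen separately, and it must
// be one the old-version replicas can actually read.
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
    .setIndexCreatedVersionMajor(8)              // bypasses the created-version check only
    .setCodec(Codec.forName("Lucene87"));        // assumption: a codec the old replicas understand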

@mch2
Member

mch2 commented Feb 6, 2023

We are going to start working on this. Writing up some smaller steps.

@Poojita-Raj
Contributor

Poojita-Raj commented Mar 10, 2023

We are going to start working on this. Writing up some smaller steps.

* [ ]  [[Segment Replication] - Create a POC proving out mixed clusters with SR enabled.](https://github.com/opensearch-project/OpenSearch/issues/6211)

Next step:

@dreamer-89
Member

Looking into it

@dreamer-89 added the v2.8.0 and discuss labels on Apr 27, 2023
@Poojita-Raj
Contributor

Poojita-Raj commented May 1, 2023

Support for mixed cluster versions:

Additionally, check the behavior of:

  • 2.7 to 2.8 rolling upgrade with no changes - default codec
  • 2.7 to 2.8 rolling upgrade with no changes - non-default codecs

@mch2
Member

mch2 commented May 3, 2023

The POC we have done proves that we can load an older codec to write with. An issue with this approach is that we will need to provide bwc versions of non-default codecs that CodecService can load and map to, including PerFieldMappingPostingFormatCodec and any plugin-provided codec, in particular k-NN.
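
A rough illustration of what such a mapping could look like (purely hypothetical; BwcCodecLookup and its version-to-codec table are illustrative, not existing OpenSearch APIs):

import java.util.Map;
import org.apache.lucene.codecs.Codec;

// Hypothetical helper: pick the codec to write with while the cluster is mixed-version,
// keyed by the oldest OpenSearch version still present in the cluster.
public final class BwcCodecLookup {
    // illustrative mapping only; a real table would be maintained per OpenSearch release
    private static final Map<String, String> BWC_CODECS = Map.of("2.7", "Lucene95");

    public static Codec codecFor(String minNodeVersion, Codec current) {
        String bwcName = BWC_CODECS.get(minNodeVersion);
        // fall back to the current codec when every node is already on the new version
        return bwcName != null ? Codec.forName(bwcName) : current;
    }
}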

Some additional thoughts:

  1. We need to benchmark any impact on recovery/upgrade times.
  2. We could provide an option to skip this route for users who are OK with lagging replicas during an upgrade.
  3. We could provide an option to block/avoid a primary relocation to an upgraded node if a certain % of replicas are live on non-upgraded nodes.

@Poojita-Raj @Bukhtawar @nknize curious of your thoughts on this.

@anasalkouz changed the title from "[Segment Replication: Breaking] Support for mixed cluster versions" to "[Segment Replication: Breaking] Support for mixed cluster versions (Rolling Upgrade)" on May 4, 2023
@mch2 moved this from Todo to In Progress in Segment Replication on May 11, 2023
@Bukhtawar
Collaborator Author

Bukhtawar commented May 15, 2023

One of the major friction points for users upgrading to newer versions could be downtime. Trying to understand the behaviour better: during an upgrade, once a primary shard moves over to the new version, at what point can we flip it back to the newer IndexWriter once all its replicas have upgraded? Would that amount to disruption while the index writer is closed and reopened with the new codecs, or can this be handled gracefully, e.g. by allowing older requests to finish concurrently with newer in-flight requests served by the new writer?

@dreamer-89
Member

dreamer-89 commented Jun 8, 2023

Coming from #7698, I see the current Lucene library is not write-compatible when using previous-major Lucene codecs, i.e. using an N-1 codec in the Nth major version of Lucene. Any indexing operation fails with UnsupportedOperationException. I verified on several Lucene 9.x versions (listed below) that creating an index with the Lucene87 codec fails during indexing. Lucene only allows codecs from the current major (e.g. Lucene90 through Lucene95 for 9.x, part of the core library) for writing; older/bwc codecs are only meant for reading older segments. Thus, it appears that downgrading the Lucene codec on the primary wouldn't work across major Lucene bumps.

  1. Lucene95 - latest main
  2. Lucene92 5358502
  3. Lucene90 006c832

Step 1. Create an index with an older Lucene codec on a current Lucene 9.x version (any of the three above):

{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "codec": "Lucene87"
        }
    }
}

Step 2. Index a document; the operation fails:

{
    "error": {
        "root_cause": [
            {
                "type": "unsupported_operation_exception",
                "reason": "Old codecs may only be used for reading"
            }
        ],
        "type": "unsupported_operation_exception",
        "reason": "Old codecs may only be used for reading"
    },
    "status": 500
}

CC @mch2 @nknize @anasalkouz

@mch2
Member

mch2 commented Jun 14, 2023

Thanks for this @dreamer-89, this breaks our current downgrade proposal across major versions. I've had some offline discussions with @Bukhtawar and @nknize on the possibility of running two Lucene versions side by side for major version upgrades. The initial idea was to load the old IW version inside a separate engine implementation and ship that with each OS release. Unfortunately this won't work, because we would need the same major version of Lucene across all ingest packages, particularly during analysis. So I don't think we have a clear path with this solution. Perhaps as part of a migration component in the future, but we would need to figure out how to load essentially two versions of OpenSearch.

I think we can solve this for segrep by preventing this invalid mixed cluster state (primary on upgraded node copying to replicas on old node) rather than supporting it directly.

The two ways you would get into this state are:

  1. Through a rolling upgrade where each node is restarted in place.

From our docs, we suggest turning off replica allocation during the upgrade process. In that case we end up with a cluster where all replicas are UNASSIGNED until the setting is flipped back. So IMO nothing is required here?

  2. Adding an upgraded node to a cluster and relocating shards to it.

Switching lagging replicas to docrep is an idea, but it would be problematic because replicas are relocated directly to the upgraded nodes, so we would have to force recovery from the primary in those cases to have the same segments. Also, if a primary dies and a docrep shard is promoted, we have to reconcile that with the other replicas / remote storage.

In this case, we could provide an option to move replicas first if relocation is triggered. We have a setting to move primaries first (#1445), so we could do the inverse. The rationale for that setting was to prevent all copies of any single shard from being relocated before any copies of another shard, so we would lose that guarantee unless we moved a single replica copy for each replication group before moving the rest. This would leave us in a temporary state where all primaries are on old nodes and replicas on new nodes until the upgrade is completed. If the old nodes/primaries are overwhelmed, the primary-first setting could be enabled.

@dreamer-89
Member

dreamer-89 commented Jun 16, 2023

(Quoting @mch2's full comment above.)

Thank you @mch2 for your comment and for dividing this problem into two separate sub-problems, rolling upgrades and relocation-based upgrades, along with possible solutions.

Rolling upgrades - No fix is needed

For the rolling-upgrade case, the documentation [2] recommends setting cluster.routing.allocation.enable to primaries (see the example request at the end of this section). This setting prevents replica shard allocation and avoids a replica shard relocation storm caused by the OpenSearch process restarts. As a result, a rolling upgrade cannot reach a state where replica shards are on non-upgraded nodes while the primary shard is on an upgraded node. A node restart will have the following impact:

  • Replica shard
    • The shard is not allowed to allocate due to the cluster.routing.allocation.enable setting and thus remains in the UNASSIGNED state.
  • Primary shard
    • If a replica copy is available on another node, it is promoted as the new primary (failover case).
    • If no replica copy is available due to previous node restarts, the cluster manager may allocate the primary on an upgraded node where a replica copy previously existed. This is a best-effort attempt to keep an available primary shard copy.
    • If no replicas are configured, the shard is allocated on the same node after the restart.

It is still possible to force-move primary shards to upgraded nodes, but it is not recommended; the rolling upgrade documentation needs to be updated to call this out. Based on this, the only case that needs to be solved is relocation-based upgrades (case 2 in the previous comment).
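
For reference, the recommended setting [2] is applied with a cluster settings update like the one below, and re-enabled (for example, set back to all) once all nodes are upgraded:

PUT _cluster/settings
{
    "persistent": {
        "cluster.routing.allocation.enable": "primaries"
    }
}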

Relocation based upgrades

Using the proposals from the previous comment, here is a compiled list of possible solutions for relocation-based upgrades.

1. No change

Do not change anything around version upgrades. With primary shards sitting on upgraded nodes, there will be replica shard failures on non-upgraded nodes. These replica shards will eventually go to the UNASSIGNED state and the cluster will turn yellow. The replica shards will auto-recover once allocation is attempted on upgraded nodes.

Pros

  • No code change, thus the cleanest option

Cons

  • Problematic for customers sensitive to search availability, since replicas are unassigned in the mixed-version cluster

2. Reset codec in mixed cluster state

When the cluster is running in a mixed-version state, reset the codec on the primary to the version running on the non-upgraded nodes so that segment files (now written with the older Lucene codec) can be read on replicas that are still running on non-upgraded nodes. Note that replicas running on upgraded nodes can still read files from the primary. We used an OpenSearch-version-to-Lucene-codec-name mapping that allows the primary to pick the bwc codec name, and we reset the codec to the latest one once the upgrade is complete. This solution was attempted in #7698 but has the two limitations below.

Custom codecs

Plugins that have custom codec implementations load the current version of their custom Lucene codec, which prevents the codec downgrade. Thus, this solution also needs changes in downstream components that override CodecService. One such example is the k-NN plugin:

@Override
public Codec codec(String name) {
    return KNNCodecVersion.current().getKnnCodecSupplier().apply(super.codec(name), mapperService);
}

Major lucene version bumps

For major version bumps this solution doesn't work because Lucene treats previous-major (N-1) codecs as read-only and maintains them only for backward compatibility [1]. Using previous-major codecs with IndexWriter results in UnsupportedOperationException during indexing operations.

For example, in the current Lucene 9.x, the Lucene87StoredFieldsFormat class (which defines the format of the files that store data, specific fields, and metadata, e.g. .fdt, .fdm) is defined in backward-codecs and does not implement any write-related methods. Any attempt to perform indexing operations results in UnsupportedOperationException.
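
A minimal Lucene-level sketch that reproduces this limitation (assumes lucene-backward-codecs 9.x is on the classpath):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ReadOnlyBwcCodecRepro {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory()) {
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
                .setCodec(Codec.forName("Lucene87")); // previous-major codec, read-only in 9.x
            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                Document doc = new Document();
                doc.add(new TextField("field", "value", Field.Store.YES));
                // fails with UnsupportedOperationException: "Old codecs may only be used for reading"
                writer.addDocument(doc);
            }
        }
    }
}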

3. Move replicas first to upgraded nodes

In relocation-based upgrades, move replica shard copies to upgraded nodes first, and move the primary only when all (or a certain %) of its replicas have moved to upgraded nodes. Replica shards on upgraded nodes, with updated codecs, can read segment files written in the older Lucene version by the primary shard on the non-upgraded node. This holds due to Lucene's bwc read guarantees, which allow the current major version to read segment files written by any minor of the previous major. For example, data files written by Lucene 8.x can be read by a replica shard running Lucene 9.x.

Pros

  • Non-intrusive and clean.
  • Proven solution due to Lucene's bwc guarantees

Cons

  • For overwhelmed clusters, risk of running the primary shard on a degraded cluster for an extended time.
  • Using a cluster-level setting means the move-replicas-first behavior applies to all indices

4. Use docrep during rolling upgrades

Use document replication in the mixed-cluster state, which allows primary and replica to index and create segment files independently. The primary performs this engine switch only when a replica runs on a non-upgraded node and the primary on an upgraded node. When the node containing a replica shard is upgraded, the primary ignores the event. When the primary shard is upgraded, it pings all replica shards to promote to a writeable engine; replica shards running on non-upgraded nodes follow this directive and switch to InternalEngine, while replicas already on upgraded nodes ignore the command. Similarly, when the upgrade completes, the primary shard pings all replica shards to switch back to NRTReplicationEngine; only the replica shards that previously switched demote their engine. During the switch back, replica shards perform a force sync from the primary to mirror its files. During this sync, the replica shard ignores segment-file diffs (which arise from new segment files written with InternalEngine). Once the sync completes, the replica shard performs a store cleanup that removes all segment files not belonging to the latest commit.

Example state: Replica 1 and Replica 2 are replica shard copies of Primary. Other Replica is a replica copy belonging to a different index.

[diagram: example shard placement across nodes]

Pros

  • Works equally well for both types of upgrades.

Cons

  • Intrusive, as it involves switching engines.
  • Needs extra effort to implement the segrep ↔ docrep switch, which is not available today.
  • During the switch back, there will be search-related disruption as segment files written with InternalEngine are cleaned up, e.g. for Scroll/PIT queries

Proposal

Based on the above, moving replicas first seems to be the cleanest and least risky solution compared to the others, and is thus the preferred solution. Tagging folks for review and feedback: @mch2 @andrross @nknize @Bukhtawar

References

[1] https://github.com/apache/lucene/tree/main/lucene/backward-codecs
[2] https://opensearch.org/docs/latest/install-and-configure/upgrade-opensearch/index/

@dblock
Member

dblock commented Jun 20, 2023

Is rolling back a partial upgrade also a concern, and can we solve it here? If 1/3 of the nodes in a cluster have been upgraded to 4.0, I think we want to allow rolling that back, or recovering from / failing over to a replica. I think this gives us actual rollback from failed in-flight upgrades!

@Bukhtawar
Collaborator Author

Is rolling back a partial upgrade also a concern and can we solve it here?

I don't think rollback is an option, given that segments created on higher versions cannot be understood by Lucene on a lower version.

Ideally, though, I would like to address this BWC issue in Lucene itself, to support creating segments with IW across major versions just the way minor versions are supported. Is there a discuss thread on Lucene where we can brainstorm this?

I would also want us to evaluate a Blue-Green option of updating the replica copies first and then the primaries, to get past the issue of primary overload on the last few non-upgraded nodes.

@dreamer-89
Member

dreamer-89 commented Jun 27, 2023

Is rolling back a partial upgrade also a concern and can we solve it here?

I don't think rollback is an option, given that segments created on higher versions cannot be understood by Lucene on a lower version.

Ideally, though, I would like to address this BWC issue in Lucene itself, to support creating segments with IW across major versions just the way minor versions are supported. Is there a discuss thread on Lucene where we can brainstorm this?

Thanks @Bukhtawar for your comment and feedback. I opened an issue with Lucene; a similar ask for N-1 writer and reader support was raised in the past, identified as a bigger task, and the guidance was to solve it at the distributed-system level. Please feel free to comment on the issue to initiate more discussion.

I would also want us to evaluate a Blue-Green option of updating the replica copies first and then the primaries, to get past the issue of primary overload on the last few non-upgraded nodes.

Yes, we are proposing to move replica copies first for relocation-based upgrades (the Blue-Green option); this work is tracked in #8265.

For rolling upgrades, the documentation recommends that users disable replica shard allocation using the "cluster.routing.allocation.enable": "primaries" setting (to prevent a recovery storm). In my experimentation so far, with primaries-only allocation enabled, I have not found any state in which a replica ends up on a non-upgraded node while its primary is on an upgraded node. I provided more details in the proposal above. Please let me know if this is not correct.

@Poojita-Raj
Contributor

After the discussion we've had here, the final conclusion for providing mixed-version cluster support is to move replicas to upgraded nodes first. We are introducing a replica-first shard movement setting to achieve this.

To support mixed cluster versions:

  1. Prioritize replica shard movement during shard relocation (for successful blue green deployments). [Segment Replication] Prioritize replica shard movement during shard relocation #8875
  2. Handle failover of primary shards in mixed cluster mode [Segment Replication] Handle failover in mixed cluster mode #9027

tlfeng pushed a commit that referenced this issue Aug 7, 2023
…relocation (#8875) (#9153)

When a node or set of nodes is excluded, shards are moved away in random order. When segment replication is enabled for a cluster, we might end up in a mixed-version state where replicas on a lower version are unable to read segments sent from higher-version primaries and therefore fail.

To avoid this, we could prioritize replica shard movement to avoid entering this situation.

Adding a new setting called shard movement strategy - `SHARD_MOVEMENT_STRATEGY_SETTING` - that will allow us to specify in which order we want to move our shards: `NO_PREFERENCE` (default), `PRIMARY_FIRST` or `REPLICA_FIRST`. 

The `PRIMARY_FIRST` option will perform the same behavior as the previous setting `SHARD_MOVE_PRIMARY_FIRST_SETTING` which will be now deprecated in favor of the shard movement strategy setting. 

Expected behavior: 

If `SHARD_MOVEMENT_STRATEGY_SETTING` is changed from its default behavior to be either `PRIMARY_FIRST` or `REPLICA_FIRST` then we perform this behavior whether or not `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled. 

If `SHARD_MOVEMENT_STRATEGY_SETTING` is still at its default setting of `NO_PREFERENCE` and `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled we move the primary shards first. This ensures that users still using this setting will not see any changes in behavior. 

Reference: #1445

Parent issue: #3881
---------

Signed-off-by: Poojita Raj <[email protected]>
(cherry picked from commit c6e4bcd)
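
For reference, a usage sketch of the setting introduced in the commit above (an assumption: the dynamic key registered for SHARD_MOVEMENT_STRATEGY_SETTING is cluster.routing.allocation.shard_movement_strategy):

PUT _cluster/settings
{
    "persistent": {
        "cluster.routing.allocation.shard_movement_strategy": "REPLICA_FIRST"
    }
}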
@hdhalter

If the update requires documentation for 2.10, please create a doc issue. Thanks!

@Poojita-Raj
Contributor

Closing out this issue as all work has been completed. Doc changes are being tracked here: opensearch-project/documentation-website#4827
