Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute Segment Replication stats for SR backpressure. #6520

Closed
wants to merge 9 commits into from

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Mar 1, 2023

Description

This PR introduces new mechanisms to keep track of the current replicas within a replication group. It is the first step in changes to add backpressure for lagging replicas when SR is enabled.

  • Adds new fields to CheckpointState inside a primary's ReplicationTracker to keep track of the following new metrics:
    • checkpointTimers - a map of ReplicationCheckpoint to a timer. When the replica is caught up to at least that checkpoint, timers are stopped and removed. Timers are added when the primary publishes a new checkpoint. This allows us to compute the bytes behind and ongoing replication lag when requested.
    • visibleReplicationCheckpoint - the currently searchable checkpoint on the replica.
    • lastCompletedReplicationLag - lag for the most recently completed replication event.
  • ReplicationTracker also has a new method getSegmentReplicationStats to fetch these stats per in-sync replica in addition to the amount of bytes the replica is behind. In addition it returns the average replication lag for the whole group.
  • The new stats are also added to NodeStats API.

Example Response:

    "oltX47ZFRfSMY9k5cef3eA" : {
      "timestamp" : 1677792047146,
      "name" : "node-2",
      "roles" : [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "segment_replication" : {
        "[so][5]" : {
          "rejected_requests" : 0,
          "replicas" : {
            "syu-_SmoRmK1HrzuaoIBHQ" : {
              "checkpoints_behind" : 0,
              "bytes_behind" : "0b",
              "current_replication_lag" : "0s",
              "last_completed_replication_lag" : "273ms"
            }
          }
        },
        "[so][2]" : {
          "rejected_requests" : 0,
          "replicas" : {
            "x3BEIA9wQKGDfht4vXHUMA" : {
              "checkpoints_behind" : 0,
              "bytes_behind" : "0b",
              "current_replication_lag" : "0s",
              "last_completed_replication_lag" : "146ms"
            }
          }
        },
        "[so][8]" : {
          "rejected_requests" : 0,
          "replicas" : {
            "pkb-3tEOTJiRZQjlTV3VuQ" : {
              "checkpoints_behind" : 1,
              "bytes_behind" : "14.3mb",
              "current_replication_lag" : "385ms",
              "last_completed_replication_lag" : "300ms"
            }
          }
        }
      }

Issues Resolved

closes #6388
related #4478

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@mch2 mch2 force-pushed the bp branch 2 times, most recently from 4d062a2 to 9e6d73c Compare March 2, 2023 04:03
@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@mch2 mch2 force-pushed the bp branch 2 times, most recently from 5720723 to 4372177 Compare March 2, 2023 18:36
This PR introduces new mechanisms to keep track of the current replicas within a replication group intended to be used
to apply backpressure.  The new stats are also added to NodeStats.

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 2, 2023

Gradle Check (Jenkins) Run Completed with:

Copy link
Member

@dreamer-89 dreamer-89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mch2 Posting comments from older review (PR got updated in between).

@github-actions
Copy link
Contributor

github-actions bot commented Mar 2, 2023

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

codecov-commenter commented Mar 2, 2023

Codecov Report

Merging #6520 (2c69235) into main (ee305d0) will decrease coverage by 0.02%.
The diff coverage is 43.78%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    #6520      +/-   ##
============================================
- Coverage     70.80%   70.78%   -0.02%     
- Complexity    59143    59157      +14     
============================================
  Files          4802     4807       +5     
  Lines        282965   283159     +194     
  Branches      40792    40819      +27     
============================================
+ Hits         200356   200440      +84     
- Misses        66187    66271      +84     
- Partials      16422    16448      +26     
Impacted Files Coverage Δ
...min/cluster/stats/TransportClusterStatsAction.java 69.56% <ø> (ø)
...nsearch/index/SegmentReplicationPerGroupStats.java 0.00% <0.00%> (ø)
.../org/opensearch/index/SegmentReplicationStats.java 0.00% <0.00%> (ø)
...eplication/checkpoint/PublishCheckpointAction.java 23.80% <0.00%> (+0.37%) ⬆️
...search/cluster/MockInternalClusterInfoService.java 0.00% <0.00%> (ø)
.../java/org/opensearch/test/InternalTestCluster.java 58.25% <ø> (-0.09%) ⬇️
...rch/action/admin/cluster/node/stats/NodeStats.java 50.28% <22.22%> (-1.55%) ⬇️
...ensearch/index/SegmentReplicationStatsTracker.java 23.07% <23.07%> (ø)
...opensearch/index/SegmentReplicationShardStats.java 28.12% <28.12%> (ø)
...src/main/java/org/opensearch/node/NodeService.java 69.69% <50.00%> (-0.62%) ⬇️
... and 486 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Signed-off-by: Marc Handalian <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2023

Gradle Check (Jenkins) Run Completed with:

Also update to return an array of replicas vs a map of objects.

Signed-off-by: Marc Handalian <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2023

Gradle Check (Jenkins) Run Completed with:

… segments.

This can happen when there is a new checkpoint published because of a SegmentInfos bump,
but the replica does not need to fetch any files.

Signed-off-by: Marc Handalian <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testCancellation
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2023

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Marc Handalian <[email protected]>
@github-actions

This comment was marked as outdated.

@mch2

This comment was marked as outdated.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2023

Gradle Check (Jenkins) Run Completed with:

*
* @opensearch.internal
*/
public class SegmentReplicationStatsTracker {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can this class be removed with getStats() moved to SegmentReplicationPressureService as I don't see this class maintaining/tracking segrep stats. Please ignore if there is any pending future work which demands this class's existence.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented with this commit - 473cc02. I've opened that up in draft and will open it for review once this lands.

@mch2
Copy link
Member Author

mch2 commented Mar 7, 2023

Would like a second reviewer on this and 473cc02. @Bukhtawar, @sachinpkale, @ashking94 available for review?

Note - 473cc02 implements actual enforcement of pressure from these computed metrics. I'll also be putting up a separate PR to actually check for and fail stale replicas from both the primary & replica sides.

@mch2
Copy link
Member Author

mch2 commented Mar 12, 2023

@dreamer-89 I think we should consolidate these stats into the segment_replication API rather than in NodeStats. Will discard this and open a separate PR.

@mch2 mch2 closed this Mar 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants