[Segment Replication] Implementing cat/segment_replication API #5718

Merged
merged 37 commits into from
Feb 8, 2023

@Rishikesh1159 Rishikesh1159 commented Jan 5, 2023

Signed-off-by: Rishikesh1159 [email protected]

Description

This PR implements the segment_replication API, which returns stats for each segment replication event.

This PR implements the first two points in the comment.

-> This API is modeled on the _cat/recovery API, because segment replication and recovery share many components.

Overview:
The purpose of this API is to return metric information about ongoing and completed segment replication events on replica shards. Metrics are returned per shard, and the API should only be called on indices with segment replication enabled.

Paths:

  • GET /_cat/segment_replication

  • GET /_cat/segment_replication/<index>

  • If you want to get information for more than one index, separate the indices with commas:

GET /_cat/segment_replication/index1,index2,index3

Description:
→ The cat segment_replication API returns metric information about ongoing and completed segment replication events.
→ Segment replication is the process of copying segment files from a primary shard to its replica shards. When the primary sends a checkpoint to the replica shards on a refresh, a new segment replication event is triggered on those replica shards.
→ A segment replication event occurs in the following cases:

  • When a new replica shard is added to the cluster.
  • When segment files change on a primary shard refresh.
  • When replica shards are recovered from the primary shard using peer recovery.

Query Parameters:

  • active_only
    (Optional, Boolean) If true, the response only includes ongoing segment replications. Defaults to false.
  • detailed
    (Optional, Boolean) If true, the response includes additional metrics for each stage of a segment replication event. Defaults to false.
  • shards
    (Optional, string) Comma-separated list of shards to display.
  • format
    (Optional, string) Short version of the HTTP accept header. Valid values include JSON, YAML, etc.
  • h
    (Optional, string) Comma-separated list of column names to display.
  • help
    (Optional, Boolean) If true, the response includes help information. Defaults to false.
  • index
    (Optional, string) Comma-separated list or wildcard expression of index names used to limit the request.
  • time
    (Optional) Unit used to display time values. Defaults to milliseconds.
  • v
    (Optional, Boolean) If true, the response includes column headings. Defaults to false.
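
The paths and query parameters above combine into a single request line. As a rough illustration only, the following sketch assembles such a path; the class `CatSegmentReplicationPath` and its `build` method are hypothetical helpers written for this example and are not part of the PR.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not part of this PR) that assembles a request path for the API. */
final class CatSegmentReplicationPath {

    /**
     * @param indices index names or wildcard expressions; an empty list means all indices
     * @param params  query parameters such as v, detailed, active_only, or shards
     */
    static String build(List<String> indices, Map<String, String> params) {
        StringBuilder path = new StringBuilder("/_cat/segment_replication");
        if (!indices.isEmpty()) {
            // Multiple indices are separated with commas.
            path.append('/').append(String.join(",", indices));
        }
        if (!params.isEmpty()) {
            StringJoiner query = new StringJoiner("&", "?", "");
            for (Map.Entry<String, String> e : params.entrySet()) {
                // Values are percent-encoded; keys are assumed to be plain parameter names.
                query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            }
            path.append(query);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("v", "true");
        params.put("detailed", "true");
        System.out.println(build(List.of("index1", "index2"), params));
    }
}
```

A LinkedHashMap is used in the usage example so the parameters keep the order in which they were added.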

Metric Fields:

  • index | i,idx | index name : Name of the Index
  • shardId | shard Id : Id of a specific shard
  • start_time | start : segment replication start time. Show up only when default=true
  • start_time_millis | start_millis : segment replication start time in epoch milliseconds. Show up only when default=true
  • stop_time | stop : segment replication stop time. Show up only when default=true
  • stop_time_millis | stop_millis : segment replication stop time in epoch milliseconds. Show up only when default=true
  • time | t | ti : time taken to complete segment replication event in milliseconds
  • stage | st : current stage of segment replication event.
  • source_host | shost : source host
  • source_node | snode : source node name
  • target_host | thost : target host
  • target_node | tnode : target node name
  • files_fetched | ff : count of files fetched so far in the segment replication event
  • files_percent | fp : percent of files fetched so far in the segment replication event
  • bytes_fetched | bf : number of bytes fetched so far in the segment replication event
  • bytes_percent | bp : percent of bytes fetched so far in the segment replication event

All metrics listed below are present in the response only when the query parameter detailed=true is set.

  • files | f : count of files that need to be fetched in a segment replication event
  • files_total | tf : total number of files that are part of this recovery, both re-used and recovered
  • bytes | b : number of bytes that need to be fetched in a segment replication event
  • bytes_total | ft : total number of bytes in the shard
  • replication_id : Id of the ongoing/completed segment replication event
  • replicating_stage_time_taken | rstt : Time taken to complete “replicating” stage of segment replication event
  • get_checkpoint_info_stage_time_taken | gcistt : Time taken to complete “get checkpoint info” stage of segment replication event
  • file_diff_stage_time_taken | fdstt : Time taken to complete “file diff” stage of segment replication event
  • get_files_stage_time_taken | gfstt : Time taken to complete “get files” stage of segment replication event
  • finalize_replication_stage_time_taken | frstt : Time taken to complete “finalize replication” stage of segment replication event

Example Response of API:
→ Sample response with no ongoing segment replication events:

curl -X GET "localhost:9200/_cat/segment_replication/test4?v=true"

index shardId time stage source_host source_node target_host target_node files_fetched files_percent bytes_fetched bytes_percent
test4 0       13ms done  127.0.0.1   runTask-0   127.0.0.1   runTask-2   0             0.0%          0             0.0%
test4 1       20ms done  127.0.0.1   runTask-2   127.0.0.1   runTask-1   3             100.0%        3661          100.0%

→ Sample response with query parameter shards=0, which limits the response to shards with ID 0:

curl -X GET "localhost:9200/_cat/segment_replication?v=true&shards=0"

index shardId time stage source_host source_node target_host target_node files_fetched files_percent bytes_fetched bytes_percent
test4 0       13ms done  127.0.0.1   runTask-0   127.0.0.1   runTask-2   0             0.0%          0             0.0%
test6 0       9ms  done  127.0.0.1   runTask-1   127.0.0.1   runTask-2   3             100.0%        3661          100.0%

→ Sample response with query parameter detailed=true, which gives more detailed information on each stage of a segment replication event:

curl -X GET "localhost:9200/_cat/segment_replication?v=true&detailed=true"

index shardId time stage source_host source_node target_host target_node files_fetched files_percent bytes_fetched bytes_percent files files_total bytes bytes_total replication_id replicating_stage_time_taken get_checkpoint_info_stage_time_taken file_diff_stage_time_taken get_files_stage_time_taken finalize_replication_stage_time_taken
test4 0       20ms done  127.0.0.1   runTask-0   127.0.0.1   runTask-1   0             0.0%          0             0.0%          0     0           0     0           2              0s                           7ms                                  0s                         4ms                        7ms
test4 1       18ms done  127.0.0.1   runTask-1   127.0.0.1   runTask-2   0             0.0%          0             0.0%          0     0           0     0           2              0s                           8ms                                  0s                         2ms                        6ms
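
Responses requested with v=true, like the samples above, start with a header row, so a client can key each row by column name. The following is a minimal sketch of that idea. `CatTableParser` is a hypothetical name, and the parser assumes that no cell contains whitespace, which holds for the samples shown here but is not guaranteed in general.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side parser for whitespace-separated _cat output with a header row. */
final class CatTableParser {

    /** Returns one map per data row, keyed by the column names from the first line. */
    static List<Map<String, String>> parse(String body) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] lines = body.strip().split("\\R");
        if (lines.length == 0 || lines[0].isBlank()) {
            return rows; // empty response: no header, no rows
        }
        String[] headers = lines[0].trim().split("\\s+");
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isBlank()) {
                continue; // tolerate blank separator lines
            }
            String[] cells = lines[i].trim().split("\\s+");
            Map<String, String> row = new LinkedHashMap<>();
            // Assumption: cells never contain whitespace, so positions line up with headers.
            for (int c = 0; c < headers.length && c < cells.length; c++) {
                row.put(headers[c], cells[c]);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

For anything beyond quick inspection, requesting format=json and using a JSON parser is likely the more robust choice.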

Issues Resolved

Part of #4554

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions

github-actions bot commented Jan 5, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions

github-actions bot commented Jan 9, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions

Gradle Check (Jenkins) Run Completed with:

@github-actions

Gradle Check (Jenkins) Run Completed with:

@Rishikesh1159 Rishikesh1159 changed the title from "[Segment Replication] Initial Draft for adding segment_replication API" to "[Segment Replication] Implementing cat/segment_replication API" Jan 25, 2023
@Rishikesh1159 Rishikesh1159 marked this pull request as ready for review January 25, 2023 18:09
@github-actions

github-actions bot commented Feb 3, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions

github-actions bot commented Feb 3, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testSearchAggregationWithNetworkDisruption_FailOpenEnabled

@github-actions

github-actions bot commented Feb 6, 2023

Gradle Check (Jenkins) Run Completed with:

* compatible open source license.
*/

package org.opensearch.action.admin.indices.segment_replication;
Member

I didn't catch this on the first run-through. Let's please remove the snake casing in the package name. How about:
org.opensearch.action.admin.indices.replication

/**
*Indices segment replication
*/
ActionFuture<SegmentReplicationStatsResponse> segment_replication(SegmentReplicationStatsRequest request);
Member

Let's please use camel casing here and below: segmentReplication.

@Override
public String getSourceDescription() {
String description = "Host:" + this.sourceNode.getHostName() + ", Node:" + this.sourceNode.getName();
return description;
Member

Nit: no need for a one-time-use variable. return "Host:" + this.sourceNode.getHostName() + ", Node:" + this.sourceNode.getName();

Also I think we only need the node name, not the host name because that can be looked up from other APIs.

/**
* Get the source description
*/
default String getSourceDescription() {
Member

Given this is going to be displayed directly in a REST API, I do not think we should provide a default here; we should require an implementation instead.
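
The two review suggestions above (return the description directly, and drop the default body so implementers must provide one) can be pictured with a small sketch. The names `DescribedSource` and `NodeSource` are hypothetical stand-ins for this example and are not the PR's actual types.

```java
/** Hypothetical stand-in for a replication source; names are illustrative only. */
interface DescribedSource {
    // No default body: each implementation must decide how it appears in the API output.
    String getSourceDescription();
}

/** Hypothetical implementation that describes itself by node name only. */
final class NodeSource implements DescribedSource {
    private final String nodeName;

    NodeSource(String nodeName) {
        this.nodeName = nodeName;
    }

    @Override
    public String getSourceDescription() {
        // Returned directly, without a single-use local variable.
        return nodeName;
    }
}
```

With an abstract method, a new implementation that forgets to describe itself fails at compile time instead of silently inheriting a generic description.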

return timingData;
}

public Stage getStage() {
return stage;
public long getGetCheckpointInfoStageTime() {
Member

These methods can return a TimeValue to reduce some duplication.

builder.startObject(SegmentReplicationState.Fields.INDEX);
index.toXContent(builder, params);
builder.endObject();
builder.field(Fields.REPLICATING_STAGE, new TimeValue(timingData.get("REPLICATING")));
Member

Rather than fetching from the map here directly, you can invoke your methods above: getGetCheckpointInfoStageTime, etc.
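
The two comments above suggest a shape where each stage getter returns a duration object and serialization goes through those getters rather than reading the map. Below is a rough sketch of that shape. It uses `java.time.Duration` as a stand-in for OpenSearch's TimeValue so that it runs without OpenSearch on the classpath. The class `StageTimings`, the `record` and `toFields` methods, and the `GET_CHECKPOINT_INFO` key are assumptions made for this example; only the `REPLICATING` key appears in the diff.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; Duration stands in for OpenSearch's TimeValue. */
final class StageTimings {
    // Stage name -> elapsed milliseconds.
    private final Map<String, Long> timingData = new HashMap<>();

    void record(String stage, long millis) {
        timingData.put(stage, millis);
    }

    // Single lookup shared by all getters; a stage with no recorded time reads as zero.
    private Duration stageTime(String stage) {
        return Duration.ofMillis(timingData.getOrDefault(stage, 0L));
    }

    Duration getReplicatingStageTime() {
        return stageTime("REPLICATING");
    }

    Duration getGetCheckpointInfoStageTime() {
        return stageTime("GET_CHECKPOINT_INFO"); // key name is an assumption
    }

    /** Serialization calls the getters instead of reading timingData directly. */
    Map<String, String> toFields() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("replicating_stage", getReplicatingStageTime().toMillis() + "ms");
        fields.put("get_checkpoint_info_stage", getGetCheckpointInfoStageTime().toMillis() + "ms");
        return fields;
    }
}
```

Keeping the map access in one private method means a renamed stage key has to be changed in only one place.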

* returns SegmentReplicationState of on-going segment replication events.
*/
public SegmentReplicationState getOngoingEventSegmentReplicationState(ShardRouting shardRouting) {
SegmentReplicationTarget target = onGoingReplications.getOngoingReplicationTarget(shardRouting.shardId());
Member

Please add the @Nullable annotation to these method declarations. Also, all three of these methods only need the ShardId, not the entire ShardRouting.

Also nit -

        return Optional.ofNullable(onGoingReplications.getOngoingReplicationTarget(shardRouting.shardId()))
            .map(SegmentReplicationTarget::state)
            .orElse(null);

* returns SegmentReplicationState of on-going if present or completed segment replication events.
*/
public SegmentReplicationState getSegmentReplicationState(ShardRouting shardRouting) {
if (getOngoingEventSegmentReplicationState(shardRouting) == null) {
Member

        return Optional.ofNullable(getOngoingEventSegmentReplicationState(shardRouting))
            .orElseGet(() -> getOngoingEventSegmentReplicationState(shardRouting));
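
The Javadoc on this method says it returns the ongoing state if present, or otherwise the completed one. The snippet as quoted falls back to the same ongoing getter, so the sketch below assumes the intended fallback is a lookup of the latest completed event. All names here (`ReplicationStateLookup`, `getLatestCompletedState`, the int shard IDs, and the String states) are hypothetical stand-ins and not the PR's types.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the "ongoing, else latest completed" lookup, keyed by shard ID only. */
final class ReplicationStateLookup {
    // Stand-ins: shard IDs are ints and states are plain strings.
    private final Map<Integer, String> ongoing = new ConcurrentHashMap<>();
    private final Map<Integer, String> completed = new ConcurrentHashMap<>();

    void markOngoing(int shardId, String state) {
        ongoing.put(shardId, state);
    }

    void markCompleted(int shardId, String state) {
        ongoing.remove(shardId);
        completed.put(shardId, state);
    }

    /** May return null when no event is in flight for the shard. */
    String getOngoingState(int shardId) {
        return ongoing.get(shardId);
    }

    /** May return null when no event has completed for the shard. */
    String getLatestCompletedState(int shardId) {
        return completed.get(shardId);
    }

    /** Ongoing state if present, otherwise the latest completed one; null if neither exists. */
    String getState(int shardId) {
        return Optional.ofNullable(getOngoingState(shardId))
            .orElseGet(() -> getLatestCompletedState(shardId));
    }
}
```

Because `orElseGet` takes a supplier, the completed-state lookup runs only when there is no ongoing event.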

@github-actions

github-actions bot commented Feb 7, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions

github-actions bot commented Feb 7, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue
      1 org.opensearch.indices.replication.SegmentReplicationApiIT.testSegmentReplicationApiResponse

@github-actions

github-actions bot commented Feb 7, 2023

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishikesh1159 <[email protected]>
@github-actions

github-actions bot commented Feb 8, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringQueryPhase

Member

@mch2 mch2 left a comment

LGTM, thanks @Rishikesh1159

@mch2 mch2 merged commit e455f56 into opensearch-project:main Feb 8, 2023
@Rishikesh1159 Rishikesh1159 added the backport 2.x Backport to 2.x branch label Feb 8, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 8, 2023
* Initial Draft for adding segment_replication API

Signed-off-by: Rishikesh1159 <[email protected]>

* Adding bytes transfered in each segrep events and additional metrics.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix broken tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix compile errors

Signed-off-by: Rishikesh1159 <[email protected]>

* Adding Tests and gating logic behind feature flag.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add java docs and enable query parameter detailed.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add temporary documentation URL

Signed-off-by: Rishikesh1159 <[email protected]>

* Fixing failing tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Spotless Apply.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix media type copile check.

Signed-off-by: Rishikesh1159 <[email protected]>

* Revert previous changes and fix failing tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Apply spotless check.

Signed-off-by: Rishikesh1159 <[email protected]>

* Refactoring call to segmentreplicationstate.

Signed-off-by: Rishikesh1159 <[email protected]>

* spotless check

Signed-off-by: Rishikesh1159 <[email protected]>

* Changing invokation of segment replication shard and filtering API response by shard id

Signed-off-by: Rishikesh1159 <[email protected]>

* disable feature flag by default.

Signed-off-by: Rishikesh1159 <[email protected]>

* Apply spotless

Signed-off-by: Rishikesh1159 <[email protected]>

* Address comments on PR.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix gradle check failures

Signed-off-by: Rishikesh1159 <[email protected]>

* fix failing testSegment_ReplicationActionAction()

Signed-off-by: Rishikesh1159 <[email protected]>

* Exclude empty segment replication events in API response.

Signed-off-by: Rishikesh1159 <[email protected]>

* Apply spotless.

Signed-off-by: Rishikesh1159 <[email protected]>

* Address PR comments and add Integ Tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing testSegmentReplicationApiResponse().

Signed-off-by: Rishikesh1159 <[email protected]>

* Refactoring code.

Signed-off-by: Rishikesh1159 <[email protected]>

---------

Signed-off-by: Rishikesh1159 <[email protected]>
(cherry picked from commit e455f56)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
"url":"https://github.com/opensearch-project/documentation-website/issues/2627",
"description":"Returns information about both on-going and latest completed Segment Replication events"
},
"stability":"stable",
Member

@Rishikesh1159 Should probably make this experimental, no?

Member Author

Thanks @andrross for catching this. Yes, I missed this; it should be experimental. Let me make a PR to change this.

dreamer-89 pushed a commit that referenced this pull request Feb 10, 2023
…tion API (#6244)

* [Segment Replication] Implementing cat/segment_replication API (#5718)

* Initial Draft for adding segment_replication API

Signed-off-by: Rishikesh1159 <[email protected]>

* Adding bytes transfered in each segrep events and additional metrics.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix broken tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix compile errors

Signed-off-by: Rishikesh1159 <[email protected]>

* Adding Tests and gating logic behind feature flag.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add java docs and enable query parameter detailed.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add temporary documentation URL

Signed-off-by: Rishikesh1159 <[email protected]>

* Fixing failing tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Spotless Apply.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix media type copile check.

Signed-off-by: Rishikesh1159 <[email protected]>

* Revert previous changes and fix failing tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Apply spotless check.

Signed-off-by: Rishikesh1159 <[email protected]>

* Refactoring call to segmentreplicationstate.

Signed-off-by: Rishikesh1159 <[email protected]>

* spotless check

Signed-off-by: Rishikesh1159 <[email protected]>

* Changing invokation of segment replication shard and filtering API response by shard id

Signed-off-by: Rishikesh1159 <[email protected]>

* disable feature flag by default.

Signed-off-by: Rishikesh1159 <[email protected]>

* Apply spotless

Signed-off-by: Rishikesh1159 <[email protected]>

* Address comments on PR.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix gradle check failures

Signed-off-by: Rishikesh1159 <[email protected]>

* fix failing testSegment_ReplicationActionAction()

Signed-off-by: Rishikesh1159 <[email protected]>

* Exclude empty segment replication events in API response.

Signed-off-by: Rishikesh1159 <[email protected]>

* Apply spotless.

Signed-off-by: Rishikesh1159 <[email protected]>

* Address PR comments and add Integ Tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing testSegmentReplicationApiResponse().

Signed-off-by: Rishikesh1159 <[email protected]>

* Refactoring code.

Signed-off-by: Rishikesh1159 <[email protected]>

---------

Signed-off-by: Rishikesh1159 <[email protected]>
(cherry picked from commit e455f56)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Fix compile error.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix flaky Tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Change stability to experimental in cat.segment_replication.json file.

Signed-off-by: Rishikesh1159 <[email protected]>

---------

Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Rishikesh1159 <[email protected]>
Labels
backport 2.x Backport to 2.x branch
5 participants