Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1680] Introduce ShuffleFallbackCount metrics #2866

Closed
wants to merge 5 commits into from

Conversation

turboFei
Copy link
Member

@turboFei turboFei commented Oct 30, 2024

What changes were proposed in this pull request?

As title, introduce metrics_ShuffleFallbackCount_Value.

Why are the changes needed?

To provide the insights that how many shuffles fallback to spark built-in shuffle service. It is helpful for us to deprecate the ESS progressively.

Currently, we plan to set the celeborn.client.spark.shuffle.fallback.numPartitionsThreshold to fallback the shuffle with too large shuffle partitions number, for example: 50k.

In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health.

Does this PR introduce any user-facing change?

Yes, new metrics.

How was this patch tested?

UT.
image

@turboFei turboFei force-pushed the record_fallback branch 3 times, most recently from caa2287 to 95e2961 Compare October 30, 2024 23:27
@turboFei turboFei changed the title Record fallback Introduce ShuffleFallbackCount metrics Oct 30, 2024
@turboFei turboFei marked this pull request as draft October 30, 2024 23:30
@turboFei turboFei changed the title Introduce ShuffleFallbackCount metrics [CELEBORN-1680] Introduce ShuffleFallbackCount metrics Oct 30, 2024
@turboFei turboFei force-pushed the record_fallback branch 3 times, most recently from e09949f to e929fef Compare October 30, 2024 23:57
@turboFei turboFei marked this pull request as ready for review October 30, 2024 23:58
Copy link
Contributor

@FMX FMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. These metrics can be useful fo monitoring the Celeborn clusters and spark clusters.

metrics

nit

log shuffle fallback count

invalid intend

docs
sb.append(
s"${normalizeKey(nm.name)}FiveMinuteRate$label ${nm.meter.getFiveMinuteRate} $timestamp\n")
sb.append(
s"${normalizeKey(nm.name)}FifteenMinuteRate$label ${nm.meter.getFifteenMinuteRate} $timestamp\n")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

cc @FMX

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, we have meters. That's good news.

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Only one question, why not directly use Counter?

@turboFei
Copy link
Member Author

turboFei commented Nov 6, 2024

why not directly use Counter?

Hi @RexXiong, FYI: #2866 (comment)

@RexXiong
Copy link
Contributor

RexXiong commented Nov 6, 2024

why not directly use Counter?

Hi @RexXiong, FYI: #2866 (comment)

IMO, the speed of fallback is not very significant, because the fallback itself is a somewhat random occurrence. cc @FMX

@turboFei
Copy link
Member Author

turboFei commented Nov 6, 2024

because the fallback itself is a somewhat random occurrence

Hi @RexXiong , I am fine for this, will just change the metrics type and reserve the code for Meter metrics in case we need it in the future.

@RexXiong
Copy link
Contributor

RexXiong commented Nov 7, 2024

because the fallback itself is a somewhat random occurrence

Hi @RexXiong , I am fine for this, will just change the metrics type and reserve the code for Meter metrics in case we need it in the future.

Okay, dashboard.json need change too.

@turboFei
Copy link
Member Author

turboFei commented Nov 7, 2024

Okay, dashboard.json need change too.

Thanks for the reminder

@FMX FMX closed this in f1bda46 Nov 7, 2024
@turboFei turboFei deleted the record_fallback branch November 7, 2024 06:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants