[CELEBORN-1680] Introduce ShuffleFallbackCount metrics #2866
...src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
LGTM. These metrics can be useful for monitoring the Celeborn clusters and Spark clusters.
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala
metrics nit log shuffle fallback count invalid intend docs
sb.append(
  s"${normalizeKey(nm.name)}FiveMinuteRate$label ${nm.meter.getFiveMinuteRate} $timestamp\n")
sb.append(
  s"${normalizeKey(nm.name)}FifteenMinuteRate$label ${nm.meter.getFifteenMinuteRate} $timestamp\n")
cc @FMX
Finally, we have meters. That's good news.
LGTM. Only one question: why not use a Counter directly?
Hi @RexXiong, FYI: #2866 (comment)
IMO, the rate of fallback is not very significant, because fallback itself is a somewhat random occurrence. cc @FMX
Hi @RexXiong, I am fine with this. I will just change the metric type and keep the Meter code in case we need it in the future.
Okay, dashboard.json needs to change too.
Thanks for the reminder.
What changes were proposed in this pull request?
As the title says, introduce metrics_ShuffleFallbackCount_Value.
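For orientation, a sample for this metric in the Prometheus text format might be rendered roughly as in the sketch below. This is not Celeborn's actual sink code: the class `CounterLineSketch`, the method `render`, the label string, and the way the `metrics_` prefix and `_Value` suffix are attached are all assumptions inferred from the metric name above.

```java
// Hypothetical sketch: renders one counter sample as
// "<name><labels> <value> <timestamp>\n". Not taken from the Celeborn code base.
final class CounterLineSketch {
    private CounterLineSketch() {}

    static String render(String name, String labels, long value, long timestampMs) {
        // Assumed naming scheme: "metrics_" prefix and "_Value" suffix,
        // chosen only to match metrics_ShuffleFallbackCount_Value.
        return "metrics_" + name + "_Value" + labels + " " + value + " " + timestampMs + "\n";
    }
}
```

Calling `CounterLineSketch.render("ShuffleFallbackCount", "{role=\"Master\"}", 3L, 1700000000000L)` would produce a single line ending in the counter value and the timestamp; the `role` label here is purely illustrative.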
Why are the changes needed?
To provide insight into how many shuffles fall back to Spark's built-in shuffle service. This helps us deprecate the ESS progressively.
Currently, we plan to set
celeborn.client.spark.shuffle.fallback.numPartitionsThreshold
so that shuffles with too many partitions (for example, 50k) fall back. In the future, we plan to limit the maximum acceptable number of shuffle partitions so that bad jobs are rejected and do not impact Celeborn master health.
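The threshold idea described above can be illustrated with a minimal sketch. Everything here is hypothetical: the class `FallbackPolicySketch`, its method names, the `>=` comparison, and the local counter are assumptions made for illustration, and in the real system the count is reported as a master-side metric rather than kept in the client.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a partition-count fallback check; not Celeborn's actual policy class.
final class FallbackPolicySketch {
    // Stands in for celeborn.client.spark.shuffle.fallback.numPartitionsThreshold.
    private final int numPartitionsThreshold;
    // Stands in for the ShuffleFallbackCount metric (simplified to a local counter).
    private final AtomicLong shuffleFallbackCount = new AtomicLong();

    FallbackPolicySketch(int numPartitionsThreshold) {
        this.numPartitionsThreshold = numPartitionsThreshold;
    }

    // Returns true when the shuffle should use Spark's built-in shuffle instead,
    // and counts each such decision. The ">=" boundary is an assumption.
    boolean shouldFallback(int numPartitions) {
        boolean fallback = numPartitions >= numPartitionsThreshold;
        if (fallback) {
            shuffleFallbackCount.incrementAndGet();
        }
        return fallback;
    }

    long fallbackCount() {
        return shuffleFallbackCount.get();
    }
}
```

With a threshold of 50000, a shuffle with 200 partitions would stay on Celeborn, while one with 60000 partitions would fall back and increment the counter by one.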
Does this PR introduce any user-facing change?
Yes, new metrics.
How was this patch tested?
UT.