[CELEBORN-1679] Estimated ApplicationDiskUsage in cluster should be multiplied by worker count. #2865

Z1Wu · 2024-10-30T16:15:09Z

What changes were proposed in this pull request?

Assumption: an application's shuffle data is distributed evenly across all workers, so the application's disk usage on one worker can be used to estimate its disk usage across the whole cluster.

Logic for estimating application disk usage:

1. Get the application's disk usage on one worker from that worker's heartbeat. This is taken as the expected disk usage on every worker.
2. Multiply the expected per-worker disk usage by the current number of workers to approximate the application's total disk usage across the cluster.
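
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AppDiskUsageEstimate`, `estimateClusterUsage`, `perWorkerUsageBytes`, and `workerCount` are hypothetical names, and the result is only meaningful under the even-distribution assumption stated above.

```scala
object AppDiskUsageEstimate {
  // Hypothetical helper (not a Celeborn API): scale the usage that one worker
  // reported for an application by the number of currently registered workers.
  // Only a reasonable estimate if shuffle data is spread evenly across workers.
  def estimateClusterUsage(perWorkerUsageBytes: Long, workerCount: Int): Long = {
    require(workerCount >= 0, "workerCount must be non-negative")
    perWorkerUsageBytes * workerCount
  }
}
```

For example, `AppDiskUsageEstimate.estimateClusterUsage(1024L, 4)` yields 4096 bytes.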

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?


Z1Wu · 2024-10-31T16:20:50Z

cc @FMX

FMX · 2024-11-01T11:56:05Z

Thanks for this PR, but the assumption is not solid. Every worker reports its disk usage metrics to the master node through the worker heartbeat.
You cannot multiply by the worker count, because all workers already report these metrics.

FMX

This change is incorrect for the following reasons:

The master node collects the total disk usage from all workers, so this value should not be multiplied by the number of workers. Multiplying the usage by the worker count would result in an inflated and inaccurate total, significantly exceeding the actual usage.
Additionally, shuffle data may not be evenly distributed among the workers, particularly with Celeborn workers that support the 'LOADAWARE' slot assignment policy.
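
To make the uneven-distribution concern concrete, here is a small sketch with made-up numbers. The worker names, byte counts, and the `SkewExample` object are all hypothetical; the point is only that scaling a single worker's report by the worker count can land far from the true sum when usage is skewed.

```scala
object SkewExample {
  // Hypothetical usage (bytes) of one application on three workers.
  val usageByWorker: Map[String, Long] =
    Map("worker-A" -> 900L, "worker-B" -> 90L, "worker-C" -> 10L)

  // Actual cluster-wide usage: the sum over all workers.
  val actualTotal: Long = usageByWorker.values.sum

  // Estimate obtained by scaling one worker's report by the worker count.
  def estimateFrom(workerId: String): Long =
    usageByWorker(workerId) * usageByWorker.size
}
```

With these numbers the actual total is 1000 bytes, while the estimate is 2700 bytes when based on `worker-A` and 30 bytes when based on `worker-C`.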

Z1Wu · 2024-11-01T16:09:04Z

Thanks for your review. The two issues you mentioned are reasonable.

But if I've understood correctly, the current implementation appears to treat an application's usage on a single worker as that application's usage across the entire cluster, as shown in the code below:

// org.apache.celeborn.common.meta.AppDiskUsageSnapShot#updateAppDiskUsage
// param: usage -> application disk usage in one worker, such as worker A
def updateAppDiskUsage(appId: String, usage: Long): Unit = {
    // drop the old disk usage entry for this application from topNItems
    val dropIndex = topNItems.indexWhere(usage => usage != null && usage.appId == appId)
    if (dropIndex != -1) {
        drop(dropIndex)
    }
    // find the insert position that preserves the sorted order
    val insertIndex = findInsertPosition(usage)
    // put the application's disk usage on worker A into topNItems as its disk usage in the cluster
    if (insertIndex != -1) {
        shift(insertIndex)
        topNItems(insertIndex) = AppDiskUsage(appId, usage)
    }
}

Because of this, the reported application disk usage would be significantly lower than the application's actual usage across the cluster.
To get an accurate application disk usage for the cluster, the Master would need to maintain a data structure that records each application's usage on every worker. This information can be obtained from the heartbeats sent by workers. Such a data structure would have O(m * n) space complexity, where m is the number of workers and n is the number of currently active applications. WDYT?
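
A rough sketch of the kind of structure described above, to make the O(m * n) cost concrete. This is not Celeborn code: `AppDiskUsageTracker`, `onHeartbeat`, `removeWorker`, and `clusterUsage` are hypothetical names, the heartbeat payload is assumed to be a plain map of appId to bytes, and thread safety is ignored.

```scala
import scala.collection.mutable

// Hypothetical master-side tracker; names and shapes are assumptions.
class AppDiskUsageTracker {
  // workerId -> (appId -> bytes used on that worker).
  // Worst case: m workers * n active applications entries.
  private val usageByWorker = mutable.Map.empty[String, Map[String, Long]]

  // Replace a worker's previous report with the one from its latest heartbeat.
  def onHeartbeat(workerId: String, appUsage: Map[String, Long]): Unit =
    usageByWorker(workerId) = appUsage

  // Forget a worker that was lost or removed from the cluster.
  def removeWorker(workerId: String): Unit = {
    usageByWorker.remove(workerId)
    ()
  }

  // Sum one application's usage over every worker that has reported.
  def clusterUsage(appId: String): Long =
    usageByWorker.values.map(_.getOrElse(appId, 0L)).sum
}
```

Each heartbeat overwrites only the reporting worker's entry, so the per-application total is a sum of the latest reports rather than the last single-worker value.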

FMX · 2024-11-04T07:06:17Z

@Z1Wu Thank you for your enthusiasm! The feature you're interested in has been addressed in this pull request. I recommend removing the AppDiskUsageMetric and the estimatedAppDiskUsage values from the worker's heartbeat, as they are now outdated.

Z1Wu force-pushed the fix/app_disk_usage branch 3 times, most recently from f36b9cc to f1a7c63, on October 30, 2024 16:19

[Feat] Update AppDiskUsage calculation logic, ApplicationDiskUsage in…

b1b6c74

… cluster should be multiplied by worker size.

Z1Wu force-pushed the fix/app_disk_usage branch from f1a7c63 to b1b6c74 on October 30, 2024 16:20

FMX requested changes Nov 1, 2024
