Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1595] Support quota low watermark for checking quota available #2727

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

s0nskar
Copy link
Contributor

@s0nskar s0nskar commented Sep 5, 2024

What changes were proposed in this pull request?

Support quota low watermark for checking quota available. This will not allow new jobs to run on Celeborn if quota used is above lowWatermark.

Why are the changes needed?

Currently we allow jobs to run even if we're just about to breach quota limits. This is not ideal behaviour, ideally we should not allow any new jobs to run on Celeborn after certain threshold (called lowWatermark here). This will ensure current running jobs will use the quota and will not go way beyond quota usage.

I'll also follow up with a PR to throw CelebornIOException, if quota is breached.

Does this PR introduce any user-facing change?

NA

How was this patch tested?

UTs

@s0nskar
Copy link
Contributor Author

s0nskar commented Sep 5, 2024

IMO this configuration can be a dynamic configuration. Wdyt?

@s0nskar s0nskar marked this pull request as ready for review September 5, 2024 14:14
@s0nskar s0nskar changed the title Support quota low watermark for checking quota available [CELEBORN-1595] Support quota low watermark for checking quota available Sep 13, 2024
@s0nskar
Copy link
Contributor Author

s0nskar commented Sep 13, 2024

cc: @SteNicholas @waitinfuture @mridulm

@s0nskar
Copy link
Contributor Author

s0nskar commented Sep 13, 2024

Also, IMO quota should be a hard limit any application breaching quota should get a CelebornIOException. We can improve the application heartbeat to check for current quota.

@s0nskar
Copy link
Contributor Author

s0nskar commented Sep 23, 2024

ping @SteNicholas @waitinfuture

@s0nskar
Copy link
Contributor Author

s0nskar commented Oct 8, 2024

ping @FMX @SteNicholas @waitinfuture wdyt about this.

val quota = getQuota(userIdentifier)
Quota(
(quota.diskBytesWritten * quotaWatermark).toLong,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @s0nskar , I don't quite understand why can't we just config the quota to quota.diskBytesWritten * 1.0 (some user defined factor)? Is it necessary to introduce this new config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waitinfuture Currently Celeborn consider quota as soft limits and just stop allowing new jobs on use Celeborn if the quota is breached while the existing jobs keeps on running while breaching the quota. IMO there are two shortcoming in current quota implementation –

  • It allows new jobs to onboard even if quota is just about to be breached. Ideally it should stop onboarding jobs after a certain threshold to not go overboard with the quota limits (This PR addresses that)
  • Killing jobs if quota for an user is breached (I was planning to implement this but noticed that @leixm raised a PR to handle this today - [CELEBORN-1577][Phase1] Storage quota should support interrupt shuffle. #2801)

Copy link
Contributor Author

@s0nskar s0nskar Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waitinfuture wdyt about this? IMO this should be a must have to avoid interrupting the user/tenant jobs in bulk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants