Scheduler Enhancements #7269

clinta · 2021-09-03T13:39:52Z

This PR adds valuable functionality to the scheduler. It makes the worker aware of cgroup memory limits to prevent overloading of a worker that has been memory limited by cgroups. It changes the GPU resource from a bool to a float so that tasks can be represented as requiring multiple, or a portion of a GPU. And it permits workers to override the resource table with environment variables.

Tests have not yet been added, but this is being opened as a draft to get feedback.

magik6k

Very quick first read through the diff - I like this, though we should probably try to use the config instead of env vars (though we can also change this after the initial prototype lands).

The ffi submodule update may be committed by accident

clinta · 2021-09-03T16:50:32Z

Thanks for the feedback. I've fixed the accidental ffi commit.

As for config file vs environment. Personally I find environment variables easier to deal with, especially when running nodes from systemd and using templated units. Also considering that FIL_PROOFS_MULTICORE_SDR_PRODUCERS is set via an environment variable, using environment variables for worker resources seems more consistent with current expectations. If we add the ability to get these from the config file, I'd like to preserve the ability to get them from the environment as well. If we do both I have no strong opinions about which should have priority.

clinta · 2021-09-03T18:42:22Z

Another thought about having this in a config file.

Currently there is an expectation that a user can change resources.go, and compile just lotus-miner, and all workers will get their new defaults from what is in the compiled miner binary. This PR as written does not break this expectation. If a user does not specify environment variables on their worker, this behavior will continue.

If these settings are available in a worker config file, it would take care to preserve this expectation, as the default config in the compiled worker might override the defaults from the miner.

jennijuju · 2021-09-06T17:52:25Z

Id agree config settings are preferred than env var settings for sealing scheduler and ideally we keep everything in the same section https://docs.filecoin.io/mine/lotus/miner-configuration/#sealing-section

magik6k

Thanks for the PR! Changes look good, I missed the part that we pull these settings from workers, which have no config right now (and most worker settings are passed with env vars anyways), so that part is ok.

We also need to bump the minor API versions of Worker/Miner API in api/version.go since those changes are breaking.

Multi-GPU support is going to be great, especially with big improvements coming to gpu scheduling in rust-proofs soon.

extern/sector-storage/resources.go

extern/sector-storage/worker_local.go

clinta · 2021-09-10T01:40:41Z

Experimenting with this has revealed another issue. Now that the scheduler is respecting my memory limits in cgroups, the scheduler failed to assign tasks that could run. The idea of reporting "reserved memory" has a problem because so called "reserved memory" includes the memory used by running tasks.

To correct this I've changed the worker API to individually report the system used memory and used swap. Then the scheduler, which has knowledge of running tasks and their memory requirements can compare the system's used memory with the memory used by running tasks and operate accordingly.

I've rebased to make this the second commit in the PR.

magik6k

Re-reviewed the changes, everything looks reasonable. Any reason to keep this in draft other than tests?

clinta · 2021-09-30T15:44:47Z

I've been running this for about a week now and am pretty happy with it. So yes, from my view the only thing remaining is adding tests.

…

On Thu, Sep 30, 2021 at 11:28 AM Łukasz Magiera ***@***.***> wrote: ***@***.**** commented on this pull request. Re-reviewed the changes, everything looks reasonable. Any reason to keep this in draft other than tests? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#7269 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAYHXXL7WZBCGVFRDS2HPVLUER63VANCNFSM5DLY7UCA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

jennijuju · 2021-10-05T10:28:14Z

@clinta Thank you for submitting the PR! and we would love to get this tested out with MinerX fellows (lotus early testers). Im wondering if we can get a brief user guide on the env var usages and how to set them?

extern/sector-storage/worker_local.go

Worker processes may have memory limitations imposed by Systemd. But /proc/meminfo shows the entire system memory regardless of these limits. This results in the scheduler believing the worker has the entire system memory avaliable and the worker being allocated too many tasks. This change attempts to read cgroup memory limits for the worker process. It supports cgroups v1 and v2, and compares cgroup limits against the system memory and returns the most conservative values to prevent the worker from being allocated too many tasks and potentially triggering an OOM event.

Attempting to report "memory used by other processes" in the MemReserved field fails to take into account the fact that the system's memory used includes memory used by ongoing tasks. To properly account for this, worker should report the memory and swap used, then the scheduler that is aware of the memory requirements for a task can determine if there is sufficient memory available for a task.

Before this change workers can only be allocated one GPU task, regardless of how much of the GPU resources that task uses, or how many GPUs are in the system. This makes GPUUtilization a float which can represent that a task needs a portion, or multiple GPUs. GPUs are accounted for like RAM and CPUs so that workers with more GPUs can be allocated more tasks. A known issue is that PC2 cannot use multiple GPUs. And even if the worker has multiple GPUs and is allocated multiple PC2 tasks, those tasks will only run on the first GPU. This could result in unexpected behavior when a worker with multiple GPUs is assigned multiple PC2 tasks. But this should not suprise any existing users who upgrade, as any existing users who run workers with multiple GPUs should already know this and be running a worker per GPU for PC2. But now those users have the freedom to customize the GPU utilization of PC2 to be less than one and effectively run multiple PC2 processes in a single worker. C2 is capable of utilizing multiple GPUs, and now workers can be customized for C2 accordingly.

In an environment with heterogenious worker nodes, a universal resource table for all workers does not allow effective scheduling of tasks. Some workers may have different proof cache settings, changing the required memory for different tasks. Some workers may have a different count of CPUs per core-complex, changing the max parallelism of PC1. This change allows workers to customize these parameters with environment variables. A worker could set the environment variable PC1_MIN_MEMORY for example to customize the minimum memory requirement for PC1 tasks. If no environment variables are specified, the resource table on the miner is used, except for PC1 parallelism. If PC1_MAX_PARALLELISM is not specified, and FIL_PROOFS_USE_MULTICORE_SDR is set, PC1_MAX_PARALLELSIM will automatically be set to FIL_PROOFS_MULTICORE_SDR_PRODUCERS + 1.

magik6k · 2021-11-29T15:28:46Z

Hey, sorry for taking so long on this PR.

I've opened #7703 to finally land this (rebased with some tests added, and some smaller changes for maintainability (one notable change is adding sector size to the resource env-vars - this was needed because we want to have shared workers which may work on different sector sizes at the same time))

magik6k · 2021-12-01T00:28:17Z

Merged in #7703!

magik6k self-requested a review September 3, 2021 14:09

magik6k reviewed Sep 3, 2021

View reviewed changes

clinta force-pushed the scheduler-enhancements branch from 002fb7c to 2c45d27 Compare September 3, 2021 16:25

clinta force-pushed the scheduler-enhancements branch from 2c45d27 to 6f2be82 Compare September 3, 2021 18:47

magik6k self-requested a review September 7, 2021 19:19

magik6k reviewed Sep 8, 2021

View reviewed changes

extern/sector-storage/resources.go Outdated Show resolved Hide resolved

extern/sector-storage/worker_local.go Outdated Show resolved Hide resolved

extern/sector-storage/worker_local.go Outdated Show resolved Hide resolved

clinta force-pushed the scheduler-enhancements branch 9 times, most recently from 0dea77f to 3cf6a81 Compare September 10, 2021 01:36

clinta force-pushed the scheduler-enhancements branch 2 times, most recently from 7dcedde to a1d4d8a Compare September 22, 2021 00:32

magik6k reviewed Sep 30, 2021

View reviewed changes

jennijuju added the SPX label Oct 5, 2021

jennijuju added P1 P1: Must be resolved need/community-input Hint: Needs Community Input labels Oct 5, 2021

jennijuju added this to the v1.13.1 milestone Oct 5, 2021

magik6k reviewed Oct 7, 2021

View reviewed changes

extern/sector-storage/worker_local.go Outdated Show resolved Hide resolved

clinta force-pushed the scheduler-enhancements branch 2 times, most recently from c2dfdea to f6a2400 Compare October 29, 2021 14:42

clinta added 4 commits November 29, 2021 08:51

clinta force-pushed the scheduler-enhancements branch from f6a2400 to 22b540b Compare November 29, 2021 13:52

magik6k mentioned this pull request Nov 29, 2021

Scheduler enhancements #7703

Merged

magik6k closed this Dec 1, 2021

TippyFlitsUK removed the need/community-input Hint: Needs Community Input label Mar 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler Enhancements #7269

Scheduler Enhancements #7269

clinta commented Sep 3, 2021

magik6k left a comment

clinta commented Sep 3, 2021 •

edited

Loading

clinta commented Sep 3, 2021

jennijuju commented Sep 6, 2021

magik6k left a comment

clinta commented Sep 10, 2021

magik6k left a comment

clinta commented Sep 30, 2021 via email

jennijuju commented Oct 5, 2021 •

edited

Loading

magik6k commented Nov 29, 2021

magik6k commented Dec 1, 2021

Scheduler Enhancements #7269

Scheduler Enhancements #7269

Conversation

clinta commented Sep 3, 2021

magik6k left a comment

Choose a reason for hiding this comment

clinta commented Sep 3, 2021 • edited Loading

clinta commented Sep 3, 2021

jennijuju commented Sep 6, 2021

magik6k left a comment

Choose a reason for hiding this comment

clinta commented Sep 10, 2021

magik6k left a comment

Choose a reason for hiding this comment

clinta commented Sep 30, 2021 via email

jennijuju commented Oct 5, 2021 • edited Loading

magik6k commented Nov 29, 2021

magik6k commented Dec 1, 2021

clinta commented Sep 3, 2021 •

edited

Loading

jennijuju commented Oct 5, 2021 •

edited

Loading