-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scheduler Enhancements #7269
Scheduler Enhancements #7269
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very quick first read through the diff - I like this, though we should probably try to use the config instead of env vars (though we can also change this after the initial prototype lands).
The ffi submodule update may be committed by accident
002fb7c
to
2c45d27
Compare
Thanks for the feedback. I've fixed the accidental ffi commit. As for config file vs environment. Personally I find environment variables easier to deal with, especially when running nodes from systemd and using templated units. Also considering that |
Another thought about having this in a config file. Currently there is an expectation that a user can change resources.go, and compile just lotus-miner, and all workers will get their new defaults from what is in the compiled miner binary. This PR as written does not break this expectation. If a user does not specify environment variables on their worker, this behavior will continue. If these settings are available in a worker config file, it would take care to preserve this expectation, as the default config in the compiled worker might override the defaults from the miner. |
2c45d27
to
6f2be82
Compare
Id agree config settings are preferred than env var settings for sealing scheduler and ideally we keep everything in the same section https://docs.filecoin.io/mine/lotus/miner-configuration/#sealing-section |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Changes look good, I missed the part that we pull these settings from workers, which have no config right now (and most worker settings are passed with env vars anyways), so that part is ok.
We also need to bump the minor API versions of Worker/Miner API in api/version.go
since those changes are breaking.
Multi-GPU support is going to be great, especially with big improvements coming to gpu scheduling in rust-proofs soon.
0dea77f
to
3cf6a81
Compare
Experimenting with this has revealed another issue. Now that the scheduler is respecting my memory limits in cgroups, the scheduler failed to assign tasks that could run. The idea of reporting "reserved memory" has a problem because so called "reserved memory" includes the memory used by running tasks. To correct this I've changed the worker API to individually report the system used memory and used swap. Then the scheduler, which has knowledge of running tasks and their memory requirements can compare the system's used memory with the memory used by running tasks and operate accordingly. I've rebased to make this the second commit in the PR. |
7dcedde
to
a1d4d8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-reviewed the changes, everything looks reasonable. Any reason to keep this in draft other than tests?
I've been running this for about a week now and am pretty happy with it. So
yes, from my view the only thing remaining is adding tests.
…On Thu, Sep 30, 2021 at 11:28 AM Łukasz Magiera ***@***.***> wrote:
***@***.**** commented on this pull request.
Re-reviewed the changes, everything looks reasonable. Any reason to keep
this in draft other than tests?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#7269 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAYHXXL7WZBCGVFRDS2HPVLUER63VANCNFSM5DLY7UCA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
@clinta Thank you for submitting the PR! and we would love to get this tested out with MinerX fellows (lotus early testers). Im wondering if we can get a brief user guide on the env var usages and how to set them? |
c2dfdea
to
f6a2400
Compare
Worker processes may have memory limitations imposed by Systemd. But /proc/meminfo shows the entire system memory regardless of these limits. This results in the scheduler believing the worker has the entire system memory avaliable and the worker being allocated too many tasks. This change attempts to read cgroup memory limits for the worker process. It supports cgroups v1 and v2, and compares cgroup limits against the system memory and returns the most conservative values to prevent the worker from being allocated too many tasks and potentially triggering an OOM event.
Attempting to report "memory used by other processes" in the MemReserved field fails to take into account the fact that the system's memory used includes memory used by ongoing tasks. To properly account for this, worker should report the memory and swap used, then the scheduler that is aware of the memory requirements for a task can determine if there is sufficient memory available for a task.
Before this change workers can only be allocated one GPU task, regardless of how much of the GPU resources that task uses, or how many GPUs are in the system. This makes GPUUtilization a float which can represent that a task needs a portion, or multiple GPUs. GPUs are accounted for like RAM and CPUs so that workers with more GPUs can be allocated more tasks. A known issue is that PC2 cannot use multiple GPUs. And even if the worker has multiple GPUs and is allocated multiple PC2 tasks, those tasks will only run on the first GPU. This could result in unexpected behavior when a worker with multiple GPUs is assigned multiple PC2 tasks. But this should not suprise any existing users who upgrade, as any existing users who run workers with multiple GPUs should already know this and be running a worker per GPU for PC2. But now those users have the freedom to customize the GPU utilization of PC2 to be less than one and effectively run multiple PC2 processes in a single worker. C2 is capable of utilizing multiple GPUs, and now workers can be customized for C2 accordingly.
In an environment with heterogenious worker nodes, a universal resource table for all workers does not allow effective scheduling of tasks. Some workers may have different proof cache settings, changing the required memory for different tasks. Some workers may have a different count of CPUs per core-complex, changing the max parallelism of PC1. This change allows workers to customize these parameters with environment variables. A worker could set the environment variable PC1_MIN_MEMORY for example to customize the minimum memory requirement for PC1 tasks. If no environment variables are specified, the resource table on the miner is used, except for PC1 parallelism. If PC1_MAX_PARALLELISM is not specified, and FIL_PROOFS_USE_MULTICORE_SDR is set, PC1_MAX_PARALLELSIM will automatically be set to FIL_PROOFS_MULTICORE_SDR_PRODUCERS + 1.
f6a2400
to
22b540b
Compare
Hey, sorry for taking so long on this PR. I've opened #7703 to finally land this (rebased with some tests added, and some smaller changes for maintainability (one notable change is adding sector size to the resource env-vars - this was needed because we want to have shared workers which may work on different sector sizes at the same time)) |
Merged in #7703! |
This PR adds valuable functionality to the scheduler. It makes the worker aware of cgroup memory limits to prevent overloading of a worker that has been memory limited by cgroups. It changes the GPU resource from a bool to a float so that tasks can be represented as requiring multiple, or a portion of a GPU. And it permits workers to override the resource table with environment variables.
Tests have not yet been added, but this is being opened as a draft to get feedback.