Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler Enhancements #7269

Closed
wants to merge 4 commits into from

Conversation

clinta
Copy link
Contributor

@clinta clinta commented Sep 3, 2021

This PR adds valuable functionality to the scheduler. It makes the worker aware of cgroup memory limits to prevent overloading of a worker that has been memory limited by cgroups. It changes the GPU resource from a bool to a float so that tasks can be represented as requiring multiple, or a portion of a GPU. And it permits workers to override the resource table with environment variables.

Tests have not yet been added, but this is being opened as a draft to get feedback.

@magik6k magik6k self-requested a review September 3, 2021 14:09
Copy link
Contributor

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very quick first read through the diff - I like this, though we should probably try to use the config instead of env vars (though we can also change this after the initial prototype lands).

The ffi submodule update may be committed by accident

@clinta clinta force-pushed the scheduler-enhancements branch from 002fb7c to 2c45d27 Compare September 3, 2021 16:25
@clinta
Copy link
Contributor Author

clinta commented Sep 3, 2021

Thanks for the feedback. I've fixed the accidental ffi commit.

As for config file vs environment. Personally I find environment variables easier to deal with, especially when running nodes from systemd and using templated units. Also considering that FIL_PROOFS_MULTICORE_SDR_PRODUCERS is set via an environment variable, using environment variables for worker resources seems more consistent with current expectations. If we add the ability to get these from the config file, I'd like to preserve the ability to get them from the environment as well. If we do both I have no strong opinions about which should have priority.

@clinta
Copy link
Contributor Author

clinta commented Sep 3, 2021

Another thought about having this in a config file.

Currently there is an expectation that a user can change resources.go, and compile just lotus-miner, and all workers will get their new defaults from what is in the compiled miner binary. This PR as written does not break this expectation. If a user does not specify environment variables on their worker, this behavior will continue.

If these settings are available in a worker config file, it would take care to preserve this expectation, as the default config in the compiled worker might override the defaults from the miner.

@clinta clinta force-pushed the scheduler-enhancements branch from 2c45d27 to 6f2be82 Compare September 3, 2021 18:47
@jennijuju
Copy link
Member

Id agree config settings are preferred than env var settings for sealing scheduler and ideally we keep everything in the same section https://docs.filecoin.io/mine/lotus/miner-configuration/#sealing-section

@magik6k magik6k self-requested a review September 7, 2021 19:19
Copy link
Contributor

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Changes look good, I missed the part that we pull these settings from workers, which have no config right now (and most worker settings are passed with env vars anyways), so that part is ok.

We also need to bump the minor API versions of Worker/Miner API in api/version.go since those changes are breaking.

Multi-GPU support is going to be great, especially with big improvements coming to gpu scheduling in rust-proofs soon.

extern/sector-storage/resources.go Outdated Show resolved Hide resolved
extern/sector-storage/worker_local.go Outdated Show resolved Hide resolved
extern/sector-storage/worker_local.go Outdated Show resolved Hide resolved
@clinta clinta force-pushed the scheduler-enhancements branch 9 times, most recently from 0dea77f to 3cf6a81 Compare September 10, 2021 01:36
@clinta
Copy link
Contributor Author

clinta commented Sep 10, 2021

Experimenting with this has revealed another issue. Now that the scheduler is respecting my memory limits in cgroups, the scheduler failed to assign tasks that could run. The idea of reporting "reserved memory" has a problem because so called "reserved memory" includes the memory used by running tasks.

To correct this I've changed the worker API to individually report the system used memory and used swap. Then the scheduler, which has knowledge of running tasks and their memory requirements can compare the system's used memory with the memory used by running tasks and operate accordingly.

I've rebased to make this the second commit in the PR.

@clinta clinta force-pushed the scheduler-enhancements branch 2 times, most recently from 7dcedde to a1d4d8a Compare September 22, 2021 00:32
Copy link
Contributor

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed the changes, everything looks reasonable. Any reason to keep this in draft other than tests?

@clinta
Copy link
Contributor Author

clinta commented Sep 30, 2021 via email

@jennijuju jennijuju added the SPX label Oct 5, 2021
@jennijuju jennijuju added P1 P1: Must be resolved need/community-input Hint: Needs Community Input labels Oct 5, 2021
@jennijuju jennijuju added this to the v1.13.1 milestone Oct 5, 2021
@jennijuju
Copy link
Member

jennijuju commented Oct 5, 2021

@clinta Thank you for submitting the PR! and we would love to get this tested out with MinerX fellows (lotus early testers). Im wondering if we can get a brief user guide on the env var usages and how to set them?

@clinta clinta force-pushed the scheduler-enhancements branch 2 times, most recently from c2dfdea to f6a2400 Compare October 29, 2021 14:42
Worker processes may have memory limitations imposed by Systemd. But
/proc/meminfo shows the entire system memory regardless of these limits.
This results in the scheduler believing the worker has the entire system
memory avaliable and the worker being allocated too many tasks.

This change attempts to read cgroup memory limits for the worker
process. It supports cgroups v1 and v2, and compares cgroup limits
against the system memory and returns the most conservative values to
prevent the worker from being allocated too many tasks and potentially
triggering an OOM event.
Attempting to report "memory used by other processes" in the MemReserved
field fails to take into account the fact that the system's memory used
includes memory used by ongoing tasks.

To properly account for this, worker should report the memory and swap
used, then the scheduler that is aware of the memory requirements for a
task can determine if there is sufficient memory available for a task.
Before this change workers can only be allocated one GPU task,
regardless of how much of the GPU resources that task uses, or how many
GPUs are in the system.

This makes GPUUtilization a float which can represent that a task needs
a portion, or multiple GPUs. GPUs are accounted for like RAM and CPUs so
that workers with more GPUs can be allocated more tasks.

A known issue is that PC2 cannot use multiple GPUs. And even if the
worker has multiple GPUs and is allocated multiple PC2 tasks, those
tasks will only run on the first GPU.

This could result in unexpected behavior when a worker with multiple
GPUs is assigned multiple PC2 tasks. But this should not suprise any
existing users who upgrade, as any existing users who run workers with
multiple GPUs should already know this and be running a worker per GPU
for PC2. But now those users have the freedom to customize the GPU
utilization of PC2 to be less than one and effectively run multiple PC2
processes in a single worker.

C2 is capable of utilizing multiple GPUs, and now workers can be
customized for C2 accordingly.
In an environment with heterogenious worker nodes, a universal resource
table for all workers does not allow effective scheduling of tasks. Some
workers may have different proof cache settings, changing the required
memory for different tasks. Some workers may have a different count of
CPUs per core-complex, changing the max parallelism of PC1.

This change allows workers to customize these parameters with
environment variables. A worker could set the environment variable
PC1_MIN_MEMORY for example to customize the minimum memory requirement
for PC1 tasks. If no environment variables are specified, the resource
table on the miner is used, except for PC1 parallelism.

If PC1_MAX_PARALLELISM is not specified, and
FIL_PROOFS_USE_MULTICORE_SDR is set, PC1_MAX_PARALLELSIM will
automatically be set to FIL_PROOFS_MULTICORE_SDR_PRODUCERS + 1.
@clinta clinta force-pushed the scheduler-enhancements branch from f6a2400 to 22b540b Compare November 29, 2021 13:52
@magik6k magik6k mentioned this pull request Nov 29, 2021
@magik6k
Copy link
Contributor

magik6k commented Nov 29, 2021

Hey, sorry for taking so long on this PR.

I've opened #7703 to finally land this (rebased with some tests added, and some smaller changes for maintainability (one notable change is adding sector size to the resource env-vars - this was needed because we want to have shared workers which may work on different sector sizes at the same time))

@magik6k
Copy link
Contributor

magik6k commented Dec 1, 2021

Merged in #7703!

@magik6k magik6k closed this Dec 1, 2021
@TippyFlitsUK TippyFlitsUK removed the need/community-input Hint: Needs Community Input label Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 P1: Must be resolved SPX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants