
Scheduling #6

Open
wvengen opened this issue Dec 6, 2023 · 10 comments
Assignees
Labels
config Configuration docker Docker enhancement New feature or request k8s Kubernetes

Comments

@wvengen
Member

wvengen commented Dec 6, 2023

scrapyd has scheduling, while this project starts running a spider immediately when it is scheduled.
The idea is to start Kubernetes jobs suspended, and unsuspend them when they can start running.

Follow the way scrapyd is configured when configuring this:

  • max_proc
  • max_proc_per_cpu - ok for Docker, not directly translatable to Kubernetes
@wvengen wvengen added the enhancement New feature or request label Dec 6, 2023
@wvengen
Member Author

wvengen commented Jan 29, 2024

For Kubernetes, jobs can be started suspended. The scheduler can unsuspend them.
For Docker, jobs can use create instead of run. The scheduler can start them.
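The two approaches above could be sketched as follows. The Kubernetes manifest uses the real `spec.suspend` field of batch/v1 Jobs; the helper name and image are hypothetical, and for Docker the equivalent would be `docker create` followed later by `docker start`.

```python
# Sketch: build a Kubernetes Job manifest that starts suspended, plus the
# patch body a scheduler would later apply to let it run.
# (spec.suspend is a real batch/v1 Job field; the helper name is made up.)

def make_suspended_job(name, image):
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "suspend": True,  # job object exists, but no pods are started
            "template": {
                "spec": {
                    "containers": [{"name": "spider", "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }

# Patch the scheduler would send (e.g. via BatchV1Api.patch_namespaced_job)
# once it decides the job may run:
UNSUSPEND_PATCH = {"spec": {"suspend": False}}

job = make_suspended_job("spider-1", "example/spider:latest")
print(job["spec"]["suspend"])  # True
```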

@wvengen wvengen added k8s Kuberenetes docker Docker config Configuration labels Jan 31, 2024
@wvengen wvengen moved this to Todo in Hack day 9 Feb 2024 Jan 31, 2024
@wvengen
Member Author

wvengen commented Aug 29, 2024

Note that #28 may turn out to have a part that watches Kubernetes events. It is likely that the scheduling feature needs this as well (like, when a job finishes, a new job can be started). It could make sense to share some of the watching setup (so that there are not multiple connections to Kubernetes needed to listen for events, if that is cleanly possible).
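Sharing the watching setup, as suggested above, could mean a single watch loop that fans events out to registered handlers. A hypothetical sketch (in the real project events would come from something like `kubernetes.watch.Watch().stream(...)`; here a plain list stands in so only the fan-out logic is shown):

```python
# Sketch: one event source, many consumers, so only a single connection
# to the Kubernetes API is needed for all components that watch events.

class EventFanout:
    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def run(self, events):
        # single loop over the shared event stream
        for event in events:
            for handler in self._handlers:
                handler(event)

fanout = EventFanout()
seen_by_scheduler, seen_by_logs = [], []
fanout.subscribe(seen_by_scheduler.append)  # scheduling component
fanout.subscribe(seen_by_logs.append)       # log-handling component
fanout.run([{"type": "MODIFIED", "job": "spider-1"}])
```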

@vlerkin vlerkin self-assigned this Sep 17, 2024
@vlerkin
Collaborator

vlerkin commented Oct 4, 2024

Hi Willem, just want to ask if I understand the idea correctly.
We want to unsuspend a job when we have enough cluster capacity to complete it, right? I am not sure the pod watcher I added is helpful here, since we would only need to monitor idle CPU/memory. If you had something else in mind, please share.

@wvengen
Member Author

wvengen commented Oct 4, 2024

Well, this is about the scrapyd-idea of scheduling: run a maximum number of jobs in parallel. We're not talking about cluster capacity here (that is handled automatically by Kubernetes).

So here we'd want to start jobs suspended, and have a scheduler loop unsuspend jobs when the number of currently running jobs is lower than the maximum. See max_proc and max_proc_per_cpu (though we don't have to follow this exactly, let's start with just max_proc).
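The max_proc logic described here is simple enough to sketch (hypothetical names; the real scheduler loop would count Kubernetes jobs and patch `spec.suspend` on the ones it starts):

```python
# Sketch: decide which suspended jobs may be unsuspended, so that at most
# max_proc jobs run in parallel.

def schedule(running, suspended, max_proc):
    """Return the suspended jobs that may be unsuspended now."""
    capacity = max_proc - len(running)
    return suspended[:max(capacity, 0)]

# e.g. max_proc=4, 2 jobs running, 3 waiting -> start 2 of the waiting jobs
print(schedule(["a", "b"], ["c", "d", "e"], 4))  # ['c', 'd']
```

The loop would run on a timer and on job-finished events, so a slot freed by a completed job is handed to the next suspended one.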

p.s. you don't need to tackle this together with the log handling. Only when setting up the Kubernetes watcher, it may be useful to realise that another component may want to use watching as well.

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

I think it's good to think about this ticket now, in case I need a watcher and have to refactor the code of the connected issue.

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

Could you please also elaborate a bit on what you expect from max_proc_per_cpu and its usage in this task? Do you want to override the resources defined in the YAML for each job, assigning each process a share of the CPU so that x processes can run in parallel? Say we have 1 CPU and max_proc_per_cpu = 5; then I would assign 0.2 CPU to each job.

Another point: I think we need a master process to control job state. That would be a separate type of watcher, not connected to the one in the log-handling ticket (that one is optional and inactive by default, and we don't want to couple things with different responsibilities). Does that make sense?

@wvengen
Member Author

wvengen commented Oct 7, 2024

The idea is to limit the number of concurrently running jobs run by scrapyd-k8s: no more than max_proc should be running simultaneously.

For max_proc_per_cpu we might limit the number of jobs to the number of CPUs in the cluster, but it's less useful here: Kubernetes has its own load-handling mechanisms, and requests and limits are much more relevant. Let's start with max_proc and then see what other tunables could be useful. max_proc would allow us to avoid filling the cluster with running jobs.
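If max_proc_per_cpu were ever supported, scrapyd's semantics (where max_proc = 0 means "derive the limit from the CPU count") would translate to something like this hypothetical sketch:

```python
# Sketch, following scrapyd's convention: an explicit max_proc wins,
# otherwise the limit is max_proc_per_cpu times the number of CPUs
# (here: CPUs in the cluster rather than on one host).

def effective_max_proc(max_proc, max_proc_per_cpu, cluster_cpus):
    if max_proc:
        return max_proc
    return max_proc_per_cpu * cluster_cpus

print(effective_max_proc(0, 4, 2))  # 8
```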

@wvengen
Member Author

wvengen commented Oct 7, 2024

Note that setting requests and limits (for memory and CPU) is already implemented, so no need to worry about that here.

Ok, if controlling job state is a different kind of watcher, then you can ignore it for the purpose of log handling (though I seem to remember that log handling also involved watching for job changes, but I may be wrong).

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

It watches all pods in the cluster and selects the ones with specific labels and status, yes.

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

On the other hand, we might adopt a publisher/subscriber model, where the pod watcher is the publisher: once it notices specific changes (for example in a status), it sends a message to a subscriber that activates log watching, or to another one that responds to other pod changes.

As you said, better to finish the log PR as it is now and then as part of this issue, I can think of extracting that watcher-publisher logic.
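The publisher/subscriber shape described above could look roughly like this (a sketch; topic and subscriber names are hypothetical):

```python
# Sketch: a pod watcher publishes status changes to topic subscribers, so
# log watching and scheduling react independently to the same pod stream.

class PodWatcherPublisher:
    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for callback in self._subscribers.get(topic, []):
            callback(payload)

pub = PodWatcherPublisher()
logs, sched = [], []
pub.subscribe("pod-running", logs.append)    # e.g. start log watching
pub.subscribe("pod-finished", sched.append)  # e.g. free a scheduling slot
pub.publish("pod-finished", {"pod": "spider-1"})
```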

Projects
No open projects
Status: Todo
Development

No branches or pull requests

2 participants