
Scheduling #6

Open
wvengen opened this issue Dec 6, 2023 · 10 comments
Assignees
Labels
config Configuration docker Docker enhancement New feature or request k8s Kubernetes

Comments

@wvengen
Member

wvengen commented Dec 6, 2023

scrapyd has scheduling, while this project starts running a spider immediately when it is scheduled.
The idea is to start Kubernetes jobs suspended, and unsuspend them when they can start running.

Follow the way scrapyd is configured when configuring this:

  • max_proc
  • max_proc_per_cpu - ok for Docker, not directly translatable to Kubernetes
@wvengen wvengen added the enhancement New feature or request label Dec 6, 2023
@wvengen
Member Author

wvengen commented Jan 29, 2024

For Kubernetes, jobs can be started suspended. The scheduler can unsuspend them.
For Docker, jobs can use create instead of run. The scheduler can start them.
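The two approaches above could be sketched as follows. The Kubernetes manifest uses the real `spec.suspend` field of batch/v1 Jobs; the helper name and image are hypothetical, and for Docker the equivalent would be `docker create` followed later by `docker start`.

```python
# Sketch: build a Kubernetes Job manifest that starts suspended, plus the
# patch body a scheduler would later apply to let it run.
# (spec.suspend is a real batch/v1 Job field; the helper name is made up.)

def make_suspended_job(name, image):
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "suspend": True,  # job object exists, but no pods are started
            "template": {
                "spec": {
                    "containers": [{"name": "spider", "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }

# Patch the scheduler would send (e.g. via BatchV1Api.patch_namespaced_job)
# once it decides the job may run:
UNSUSPEND_PATCH = {"spec": {"suspend": False}}

job = make_suspended_job("spider-1", "example/spider:latest")
print(job["spec"]["suspend"])  # True
```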

@wvengen wvengen added k8s Kuberenetes docker Docker config Configuration labels Jan 31, 2024
@wvengen wvengen moved this to Todo in Hack day 9 Feb 2024 Jan 31, 2024
@wvengen
Member Author

wvengen commented Aug 29, 2024

Note that #28 may turn out to have a part that watches Kubernetes events. It is likely that the scheduling feature needs this as well (like, when a job finishes, a new job can be started). It could make sense to share some of the watching setup (so that there are not multiple connections to Kubernetes needed to listen for events, if that is cleanly possible).
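Sharing the watching setup, as suggested above, could mean a single watch loop that fans events out to registered handlers. A hypothetical sketch (in the real project events would come from something like `kubernetes.watch.Watch().stream(...)`; here a plain list stands in so only the fan-out logic is shown):

```python
# Sketch: one event source, many consumers, so only a single connection
# to the Kubernetes API is needed for all components that watch events.

class EventFanout:
    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def run(self, events):
        # single loop over the shared event stream
        for event in events:
            for handler in self._handlers:
                handler(event)

fanout = EventFanout()
seen_by_scheduler, seen_by_logs = [], []
fanout.subscribe(seen_by_scheduler.append)  # scheduling component
fanout.subscribe(seen_by_logs.append)       # log-handling component
fanout.run([{"type": "MODIFIED", "job": "spider-1"}])
```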

@vlerkin vlerkin self-assigned this Sep 17, 2024
@vlerkin
Collaborator

vlerkin commented Oct 4, 2024

Hi Willem, just want to ask if I understand the idea correctly.
We want to unsuspend a job when we have enough cluster capacity to complete it, right? I am not sure the pod watcher I added is helpful here, since we would only need to monitor idle CPU/memory. If you had something else in mind, please share.

@wvengen
Member Author

wvengen commented Oct 4, 2024

Well, this is about the scrapyd-idea of scheduling: run a maximum number of jobs in parallel. We're not talking about cluster capacity here (that is handled automatically by Kubernetes).

So here we'd want to start jobs suspended, and have a scheduler loop unsuspend jobs when the number of currently running jobs is lower than the maximum. See max_proc and max_proc_per_cpu (though we don't have to follow this exactly, let's start with just max_proc).
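The max_proc logic described here is simple enough to sketch (hypothetical names; the real scheduler loop would count Kubernetes jobs and patch `spec.suspend` on the ones it starts):

```python
# Sketch: decide which suspended jobs may be unsuspended, so that at most
# max_proc jobs run in parallel.

def schedule(running, suspended, max_proc):
    """Return the suspended jobs that may be unsuspended now."""
    capacity = max_proc - len(running)
    return suspended[:max(capacity, 0)]

# e.g. max_proc=4, 2 jobs running, 3 waiting -> start 2 of the waiting jobs
print(schedule(["a", "b"], ["c", "d", "e"], 4))  # ['c', 'd']
```

The loop would run on a timer and on job-finished events, so a slot freed by a completed job is handed to the next suspended one.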

p.s. you don't need to tackle this together with the log handling. Only when setting up the Kubernetes watcher, it may be useful to realise that another component may want to use watching as well.

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

I think it's good to think about this ticket now, in case I need a watcher and have to refactor the code of the connected issue.

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

Could you please also elaborate a bit on what you expect from max_proc_per_cpu and its usage in this task? Do you want to override the resources defined in the YAML for each job, assigning each process a share of the CPU so that x processes can run in parallel? Say we have 1 CPU and max_proc_per_cpu = 5; then I would assign 0.2 CPU to each job.

Another point: I think we need a master process to control job state. That would be a separate type of watcher, not connected to the one in the log-handling ticket (that one is optional and inactive by default, and we don't want to couple things with different responsibilities). Does that make sense?

@wvengen
Member Author

wvengen commented Oct 7, 2024

The idea is to limit the number of concurrently running jobs run by scrapyd-k8s: no more than max_proc should be running simultaneously.

For max_proc_per_cpu we might limit the number of jobs to the number of CPUs in the cluster, but it's less useful here: Kubernetes has its own load-handling mechanisms, and requests and limits are much more relevant. Let's start with max_proc and then see what other tunables could be useful. max_proc would allow us to avoid filling the cluster with running jobs.
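If max_proc_per_cpu were ever supported, scrapyd's semantics (where max_proc = 0 means "derive the limit from the CPU count") would translate to something like this hypothetical sketch:

```python
# Sketch, following scrapyd's convention: an explicit max_proc wins,
# otherwise the limit is max_proc_per_cpu times the number of CPUs
# (here: CPUs in the cluster rather than on one host).

def effective_max_proc(max_proc, max_proc_per_cpu, cluster_cpus):
    if max_proc:
        return max_proc
    return max_proc_per_cpu * cluster_cpus

print(effective_max_proc(0, 4, 2))  # 8
```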

@wvengen
Member Author

wvengen commented Oct 7, 2024

Note that setting requests and limits (for memory and CPU) is already implemented, so no need to worry about that here.

Ok, if controlling job state is a different kind of watcher, then you can ignore it for the purpose of log handling (though I seem to remember that log handling also involved watching for job changes, but I may be wrong).

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

It watches all pods in the cluster and selects the ones with specific labels and status, yes.

@vlerkin
Collaborator

vlerkin commented Oct 7, 2024

On the other hand, we might adopt a publisher/subscriber model, where the pod watcher is the publisher: once it notices specific changes (for example in a status), it sends a message to a subscriber that activates log watching, or to another one that responds to other pod changes.

As you said, better to finish the log PR as it is now and then as part of this issue, I can think of extracting that watcher-publisher logic.
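The publisher/subscriber shape described above could look roughly like this (a sketch; topic and subscriber names are hypothetical):

```python
# Sketch: a pod watcher publishes status changes to topic subscribers, so
# log watching and scheduling react independently to the same pod stream.

class PodWatcherPublisher:
    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for callback in self._subscribers.get(topic, []):
            callback(payload)

pub = PodWatcherPublisher()
logs, sched = [], []
pub.subscribe("pod-running", logs.append)    # e.g. start log watching
pub.subscribe("pod-finished", sched.append)  # e.g. free a scheduling slot
pub.publish("pod-finished", {"pod": "spider-1"})
```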

Projects
No open projects
Status: Todo
Development

No branches or pull requests

2 participants