Scheduling #6
Comments
For Kubernetes, jobs can be started suspended. The scheduler can unsuspend them.
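For reference, a Kubernetes Job can be created in the suspended state via the spec.suspend field (batch/v1). A minimal sketch of such a manifest as a Python dict — the names and image here are illustrative, not scrapyd-k8s's actual ones:

```python
def suspended_job_manifest(name, image, namespace="default"):
    """Build a batch/v1 Job body that starts suspended.

    With spec.suspend set to True the Job is created but its pods are not
    scheduled; a scheduler can later patch spec.suspend to False to start it.
    Hypothetical helper for illustration, not code from scrapyd-k8s.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "suspend": True,  # created, but not started yet
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }
```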
Note that #28 may turn out to have a part that watches Kubernetes events. It is likely that the scheduling feature needs this as well (e.g. when a job finishes, a new job can be started). It could make sense to share some of the watching setup, so that multiple connections to Kubernetes are not needed to listen for events, if that is cleanly possible.
Hi Willem, just want to ask if I understand the idea correctly.
Well, this is about the scrapyd idea of scheduling: run a maximum number of jobs in parallel. We're not talking about cluster capacity here (that is handled automatically by Kubernetes). So here we'd want to start jobs suspended, and have a scheduler loop unsuspend jobs when the number of currently running jobs is lower than the maximum. P.S.: you don't need to tackle this together with the log handling. It's only when setting up the Kubernetes watcher that it may be useful to realise that another component may want to use watching as well.
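The scheduler loop described above could be sketched as a pure decision function (a hypothetical helper, not existing scrapyd-k8s code): given the currently running jobs and the suspended backlog, unsuspend only enough jobs to stay at or below the maximum.

```python
def jobs_to_unsuspend(running, suspended, max_proc):
    """Pick which suspended jobs to start so at most max_proc run at once.

    running and suspended are lists of job names; the loop would call this
    on every pass (or on every relevant Kubernetes event) and patch
    spec.suspend = False on the returned jobs. Illustrative sketch only.
    """
    free_slots = max(0, max_proc - len(running))
    return suspended[:free_slots]
```

Driving this from Kubernetes events (job finished, pod deleted) rather than a polling loop is where the shared watcher mentioned above would come in.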
I think it's good to think about this ticket, in case I need a watcher and need to refactor the code of the connected issue.
Could you please also elaborate a bit on what you expect from max_proc_per_cpu and its usage in this task? Do you want to override the resources that were defined in the yaml for each job, to define what fraction of a CPU we need to assign to a process in order to run x processes in parallel? Say we have 1 CPU and max_proc_per_cpu = 5; then I need to assign 0.2 CPU to each job. Another point: I think we need a master process which controls job state. It is a separate type of watcher, not connected to the one in the log handling ticket (that one is optional and not active by default, and we don't want to couple things with different responsibilities). Does that make sense?
The idea is to limit the number of jobs run concurrently by scrapyd-k8s: no more than max_proc.
Note that setting requests and limits (for memory and cpu) is already implemented, no need to worry about that here. OK, if controlling job state is a different kind of watcher, then you can ignore it for the purpose of log handling (though I seem to remember that log handling also involved watching for job changes, but I may be wrong).
It watches all pods in the cluster and selects the ones with specific labels and status, yes.
On the other hand, we might use some sort of publisher/subscriber model, where a pod watcher is the publisher: once it notices specific changes, for example in a status, it sends a message to one subscriber that activates log watching, or to another that responds to some other change in pods. As you said, better to finish the log PR as it is now; then, as part of this issue, I can think about extracting that watcher-publisher logic.
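The publisher/subscriber idea above could look something like the following minimal sketch — one pod watcher publishes events, and independent subscribers (log handler, scheduler) react only to the changes they care about. Class and event names are made up for illustration:

```python
from collections import defaultdict

class PodEventBus:
    """Minimal publisher/subscriber sketch for pod events.

    A single pod watcher publishes events here, so only one connection to
    Kubernetes is needed while multiple components react to changes.
    """

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        """Register a callback for one event type (e.g. "pod-running")."""
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, pod):
        """Deliver a pod event to every subscriber of that event type."""
        for callback in self._subscribers[event_type]:
            callback(pod)

# usage: the watcher publishes, a log-handling subscriber reacts
bus = PodEventBus()
seen = []
bus.subscribe("pod-running", seen.append)
bus.publish("pod-running", {"name": "spider-1"})
bus.publish("pod-deleted", {"name": "spider-2"})  # no subscriber, ignored
```

This keeps the watcher itself free of log-handling or scheduling responsibilities, matching the decoupling discussed above.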
scrapyd has scheduling, while this project starts running a job immediately when a spider is scheduled.
The idea is to start Kubernetes jobs suspended, and unsuspend them when they can start running.
Configure this the way scrapyd is configured:
max_proc
max_proc_per_cpu
- ok for Docker, not directly translatable to Kubernetes
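For comparison, scrapyd derives the effective limit from the CPU count when max_proc is 0. A sketch of that convention (my reading of scrapyd's behaviour, not scrapyd-k8s code); as noted above, it does not translate directly to Kubernetes because "the CPU count" is ambiguous across a cluster:

```python
def effective_max_proc(max_proc, max_proc_per_cpu, cpu_count):
    """Derive the concurrency limit scrapyd-style.

    An explicit max_proc wins; otherwise the limit scales with the number
    of CPUs on the host. In Kubernetes there is no single host, which is
    why max_proc_per_cpu is hard to map onto a cluster.
    """
    if max_proc:
        return max_proc
    return cpu_count * max_proc_per_cpu
```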