Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* docs(concepts): Add WIP workload lifecycle
* docs: Add card grid on landing page
* docs(concepts): Add section on jobs identifiers
* docs: Fix typo
  Co-authored-by: Nicholas Junge <[email protected]>
* docs: Fix typo
  Co-authored-by: Nicholas Junge <[email protected]>
* docs: Reword confusing terminology
* docs: Improve wording of job completion section
* docs: Improve landing page
* docs: Add high-level architecture overview
---------
Co-authored-by: Nicholas Junge <[email protected]>
1 parent fcbd814 commit aec0761
Showing 6 changed files with 303 additions and 104 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Concepts | ||
|
||
This section covers the basic concepts behind jobq. | ||
|
||
It can help you: | ||
|
||
- Understand the [high-level architecture](architecture.md) of jobq. | ||
- Understand how jobq [identifies jobs](identifiers.md). | ||
- Understand the [lifecycle of a job](lifecycle.md), from its submission to its completion. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
--- | ||
title: Architecture | ||
--- | ||
|
||
# Understanding the jobq architecture | ||
|
||
The jobq high-level architecture consists of two major components: | ||
|
||
1. The [_client-side library_](#client-side-library), which is used to declare and submit jobs to a compute cluster. | ||
2. The [_server-side API_](#server-side-api), which serves as the interface between the client and the compute cluster. | ||
|
||
```mermaid | ||
architecture-beta | ||
group api[Compute Cluster] | ||
service jobq(server)[Jobq API] in api | ||
service kueue(server)[Kueue] in api | ||
service kubernetes(server)[Kubernetes API] in api | ||
service ray(server)[Kuberay] in api | ||
service client(server)[jobq Client] | ||
jobq:R --> L:kueue | ||
jobq:B --> T:kubernetes | ||
kueue:R --> L:ray | ||
kueue:B --> T:kubernetes | ||
client:R --> L:jobq | ||
``` | ||
|
||
## Client-side library | ||
|
||
The client-side Python library provides a high-level interface for declaring and executing jobs, either locally or on a compute cluster. | ||
It is designed to be easy to use and to integrate with other Python libraries and frameworks. | ||
|
||
The library is responsible for: | ||
|
||
- Providing a `@job` decorator to annotate Python functions as jobs. | ||
- Configuring the container image build for a job (through a declarative configuration or explicit `Dockerfile`). | ||
- Setting runtime parameters for a job (e.g., its resource requirements). | ||
- Managing the lifecycle of jobs, including monitoring their status and logs through a command-line interface. | ||
|
||
The library is implemented as a Python package that can be installed using pip. | ||
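
To make the responsibilities listed above more concrete, the following is a minimal sketch of how a decorator-based job declaration could work. It is not jobq's actual API: `JobOptions`, `Job`, the `options` parameter, and the `train` function are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class JobOptions:
    """Hypothetical runtime parameters for a job (not jobq's real option type)."""

    cpu: str = "1"
    memory: str = "512Mi"
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Job:
    """Wraps a plain function together with its runtime parameters."""

    func: Callable[..., Any]
    options: JobOptions

    @property
    def name(self) -> str:
        # Use the function name as the job name.
        return self.func.__name__

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Local execution: run the wrapped function in the current process.
        return self.func(*args, **kwargs)


def job(*, options: Optional[JobOptions] = None) -> Callable[[Callable[..., Any]], Job]:
    """Decorator factory that annotates a function as a job (illustrative only)."""

    def wrap(func: Callable[..., Any]) -> Job:
        return Job(func=func, options=options or JobOptions())

    return wrap


@job(options=JobOptions(cpu="2", memory="4Gi"))
def train(epochs: int) -> str:
    return f"trained for {epochs} epochs"
```

In this sketch, calling `train(3)` runs the function locally, while `train.options` carries the resource requirements that a submission step could forward to the cluster.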
|
||
## Server-side API | ||
|
||
The server-side API is a RESTful API that provides a low-level interface for managing jobs in a compute cluster. | ||
|
||
It builds on top of Kubernetes and the [Kueue framework](https://kueue.sigs.k8s.io/), which provides a high-level abstraction for managing workloads in a Kubernetes cluster (including queueing, priority-based scheduling, preemption, and resource management). | ||
|
||
Kueue itself can manage workloads of various types, including Kubernetes `Job` and Kuberay `RayJob` resources. |
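
As an illustration of what a Kueue-managed workload looks like, the sketch below builds a Kubernetes `Job` manifest as a plain Python dictionary. The `kueue.x-k8s.io/queue-name` label and the `suspend` field are the mechanisms Kueue documents for queueing a `Job`; the helper `build_job_manifest` and its parameters are hypothetical, and this is not necessarily how the jobq server assembles its manifests.

```python
from typing import Any, Dict, List

# Label that Kueue reads to determine the LocalQueue a Job is submitted to.
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"


def build_job_manifest(
    name: str,
    namespace: str,
    queue: str,
    image: str,
    command: List[str],
) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest that Kueue can queue (illustrative helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {QUEUE_NAME_LABEL: queue},
        },
        "spec": {
            # The Job starts suspended; Kueue unsuspends it once it is admitted.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            "command": command,
                            # Resource requests are what Kueue counts against quota.
                            "resources": {"requests": {"cpu": "1", "memory": "512Mi"}},
                        }
                    ],
                }
            },
        },
    }


manifest = build_job_manifest(
    name="example",
    namespace="example",
    queue="user-queue",  # hypothetical LocalQueue name
    image="python:3.12-slim",
    command=["python", "-c", "print('hello')"],
)
```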
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
# Job Identifiers | ||
|
||
## Terminology | ||
|
||
In order to understand how jobq identifies jobs, it is important to understand the conceptual components of a workload: | ||
A job is composed of an abstract definition of the workload (as a [Kueue `Workload`](https://kueue.sigs.k8s.io/docs/concepts/workload/) custom resource) and a set of Kubernetes resources (for example a Kubernetes `Job`, or a custom resource like the [Kuberay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) `RayJob`) that make up the executable part of the workload. | ||
|
||
At first these similar-sounding names can be confusing, so let's establish some terminology: | ||
|
||
- A **workload** or **job** (lowercase "w"/"j") is a set of Kubernetes resources that make up the | ||
executable portion of a job. | ||
- A **Workload** (uppercase "W") refers to the Kueue `Workload` custom resource. | ||
- A **Job** (uppercase "J") refers to the Kubernetes `batch/v1/Job` resource (one of the ways in which code can be submitted through jobq). | ||
|
||
Kueue handles the `Workload` and updates its status to reflect the current state of the workload. | ||
|
||
![Kueue workload components](https://kueue.sigs.k8s.io/images/queueing-components.svg) | ||
|
||
## Identifying workloads | ||
|
||
Every Workload (as managed by Kueue) carries an automatically generated unique identifier (UID) as well as a human-readable name and namespace. | ||
Either of these could serve as an identifier for a Workload. However, a name/namespace combination is not guaranteed to be unique over time (for example, when a resource is deleted and recreated under the same name), whereas UIDs are. | ||
This makes the UID the more reliable choice for identifying a given Workload resource. | ||
|
||
The concrete workload resource carries the same kinds of identifiers: a UID and a name/namespace combination. | ||
|
||
A given job references its associated Workload in a 1:1 fashion (through its `metadata.ownerReferences` field). | ||
|
||
In theory, this allows a job in the cluster to be identified through two different identifiers: | ||
|
||
- the UID of the (concrete) _job_ resource. | ||
- the UID of the (abstract) _Workload_ resource. | ||
|
||
In practice, jobq always uses the **UID of the concrete workload** as the identifier for a job. | ||
All CLI operations return and accept the UID of the concrete workload. | ||
|
||
As an example, imagine the following resources in the cluster after submitting a job: | ||
|
||
```mermaid | ||
graph LR | ||
subgraph "Namespace example" | ||
direction LR | ||
A["`Workload **job-example** <pre>uid-1</pre>`"] --> B["`Job **example** <pre>uid-2</pre>`"] --> C[Pod] | ||
end | ||
``` | ||
|
||
If we want to query the logs of the job, we can do so by calling `jobq logs` with the UID of the concrete workload: | ||
|
||
```console | ||
$ jobq logs uid-2 | ||
[... log output ...] | ||
``` |
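
The relationship between the two UIDs can be sketched with plain dictionaries that mirror the example resources above. This is a simplified illustration rather than jobq's implementation: the helper `workload_for_job` is hypothetical, and the direction of the owner reference (the Workload listing the concrete job as its owner) is an assumption about how the link is stored.

```python
from typing import Any, Dict, List, Optional

# Simplified stand-ins for the resources in the "example" namespace.
job = {
    "kind": "Job",
    "metadata": {"name": "example", "namespace": "example", "uid": "uid-2"},
}
workload = {
    "kind": "Workload",
    "metadata": {
        "name": "job-example",
        "namespace": "example",
        "uid": "uid-1",
        # Assumption: the Workload points at the concrete job as its owner.
        "ownerReferences": [{"kind": "Job", "name": "example", "uid": "uid-2"}],
    },
}


def workload_for_job(
    job_uid: str, workloads: List[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    """Return the Workload whose owner references contain the given job UID."""
    for candidate in workloads:
        owners = candidate["metadata"].get("ownerReferences", [])
        if any(ref.get("uid") == job_uid for ref in owners):
            return candidate
    return None


# The UID of the concrete workload (uid-2) is the identifier a user passes in.
found = workload_for_job(job["metadata"]["uid"], [workload])
```

Because the lookup is keyed on the UID rather than the name, a job that is deleted and recreated under the same name would not be confused with its predecessor.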
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Job Lifecycle | ||
|
||
Since jobq builds on top of the [Kueue](https://kueue.sigs.k8s.io/) job queueing system for scheduling, | ||
the lifecycle of a job is very similar to the lifecycle of a workload in Kueue. | ||
|
||
The remainder of this document uses the terms _job_ and _workload_ interchangeably. | ||
|
||
A workload goes through roughly three phases after its submission: _queueing and scheduling_, _execution_, and _completion_. | ||
|
||
## Queueing and scheduling | ||
|
||
After its submission, a workload is in the `Submitted` state, where it competes with other workloads for available resource quotas. | ||
Once Kueue has reserved quota for it in a cluster queue, it enters the `Pending` state, where it waits for its admission checks to pass. | ||
Alternatively, if the local or cluster queue selected for the workload is stopped or does not exist, the workload enters the `Inadmissible` state until this condition is resolved. | ||
|
||
## Execution | ||
|
||
After all admission checks for the workload have passed, it enters the `Admitted` state and becomes eligible for execution by the cluster. | ||
|
||
## Completion | ||
|
||
When the workload terminates successfully, it enters the terminal `Succeeded` state. | ||
If any unrecoverable error occurs during execution, the workload enters the terminal `Failed` state. This does not necessarily happen on the first abnormal termination of a pod, depending on the type of workload and other factors (such as the retry limit in a `batch/v1/Job`). | ||
|
||
A currently executing workload may be preempted by another workload (e.g., by a newly submitted workload with a higher priority). | ||
In this case, Kueue will terminate any pods associated with the preempted workload and either requeue it for later execution or evict it from the cluster queue. | ||
|
||
## State Diagram | ||
|
||
```mermaid | ||
stateDiagram-v2 | ||
direction LR | ||
[*] --> Submitted | ||
Submitted --> Pending: quotaReserved | ||
Submitted --> Inadmissible | ||
Inadmissible --> Submitted | ||
Pending --> Admitted: admitted | ||
Admitted --> Succeeded: success | ||
Admitted --> Failed: error | ||
Admitted --> Submitted: evicted | ||
Admitted --> Pending: requeued | ||
Succeeded --> [*] | ||
Failed --> [*] | ||
``` |
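
The state diagram above can also be expressed as a small transition table. The following is a minimal sketch that encodes exactly the transitions shown in the diagram; the names `State`, `TRANSITIONS`, `can_transition`, and `is_terminal` are illustrative and not part of jobq.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    SUBMITTED = "Submitted"
    INADMISSIBLE = "Inadmissible"
    PENDING = "Pending"
    ADMITTED = "Admitted"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Allowed transitions, taken one-to-one from the state diagram.
TRANSITIONS: Dict[State, Set[State]] = {
    State.SUBMITTED: {State.PENDING, State.INADMISSIBLE},
    State.INADMISSIBLE: {State.SUBMITTED},
    State.PENDING: {State.ADMITTED},
    # An admitted workload can finish, be evicted, or be requeued.
    State.ADMITTED: {State.SUCCEEDED, State.FAILED, State.SUBMITTED, State.PENDING},
    State.SUCCEEDED: set(),
    State.FAILED: set(),
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if the diagram contains an edge from src to dst."""
    return dst in TRANSITIONS[src]


def is_terminal(state: State) -> bool:
    """A state is terminal if it has no outgoing transitions."""
    return not TRANSITIONS[state]
```

A table like this makes it easy to check, for example, that a workload cannot jump from `Pending` straight to `Succeeded` without being admitted first.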
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,35 @@ | ||
# Jobq, a cluster workflow scheduling tool | ||
--- | ||
title: Home | ||
--- | ||
|
||
This documentation is work in progress. | ||
# | ||
|
||
!!! warning "Work in progress" | ||
|
||
This documentation is work in progress. | ||
Please excuse frequent changes and missing content. | ||
|
||
<div class="grid cards" markdown> | ||
- [:material-thought-bubble:{ .lg .middle } **Concepts**](concepts/_index.md) | ||
|
||
*** | ||
|
||
Learn about the concepts behind jobq | ||
|
||
- [:material-apple-keyboard-command:{ .lg .middle } **Command-line interface**](cli.md) | ||
|
||
*** | ||
|
||
Learn how to use the `jobq` command-line interface | ||
|
||
- [:material-book-open-variant:{ .lg .middle } **API Reference**](reference/SUMMARY.md) | ||
|
||
*** | ||
|
||
Detailed documentation of the jobq Python API | ||
|
||
</div> | ||
|
||
<hr /> | ||
|
||
This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0){: target="\_blank" }. |