Docs: Concepts section (#98)
* docs(concepts): Add WIP workload lifecycle

* docs: Add card grid on landing page

* docs(concepts): Add section on jobs identifiers

* docs: Fix typo

Co-authored-by: Nicholas Junge <[email protected]>

* docs: Fix typo

Co-authored-by: Nicholas Junge <[email protected]>

* docs: Reword confusing terminology

* docs: Improve wording of job completion section

* docs: Improve landing page

* docs: Add high-level architecture overview

---------

Co-authored-by: Nicholas Junge <[email protected]>
AdrianoKF and nicholasjng authored Sep 26, 2024
1 parent fcbd814 commit aec0761
Showing 6 changed files with 303 additions and 104 deletions.
9 changes: 9 additions & 0 deletions client/docs/concepts/_index.md
@@ -0,0 +1,9 @@
# Concepts

This section covers the basic concepts behind jobq.

It can help you:

- Understand the [high-level architecture](architecture.md) of jobq.
- Understand how jobq [identifies jobs](identifiers.md).
- Understand the [lifecycle of a job](lifecycle.md), from its submission to its completion.
52 changes: 52 additions & 0 deletions client/docs/concepts/architecture.md
@@ -0,0 +1,52 @@
---
title: Architecture
---

# Understanding the jobq architecture

The jobq high-level architecture consists of two major components:

1. The [_client-side library_](#client-side-library), which is used to declare and submit jobs to a compute cluster.
2. The [_server-side API_](#server-side-api), which serves as the interface between the client and the compute cluster.

```mermaid
architecture-beta
group api[Compute Cluster]
service jobq(server)[Jobq API] in api
service kueue(server)[Kueue] in api
service kubernetes(server)[Kubernetes API] in api
service ray(server)[Kuberay] in api
service client(server)[jobq Client]
jobq:R --> L:kueue
jobq:B --> T:kubernetes
kueue:R --> L:ray
kueue:B --> T:kubernetes
client:R --> L:jobq
```

## Client-side library

The client-side Python library provides a high-level interface for declaring and executing jobs, either locally or on a compute cluster.
It is designed to be easy to use and to integrate with other Python libraries and frameworks.

The library is responsible for:

- Providing a `@job` decorator to annotate Python functions as jobs.
- Configuring the container image build for a job (through a declarative configuration or explicit `Dockerfile`).
- Setting runtime parameters for a job (e.g., its resource requirements).
- Managing the lifecycle of jobs, including monitoring their status and logs through a command-line interface.

The library is implemented as a Python package that can be installed using pip.
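
To make the decorator-based workflow more concrete, here is a minimal sketch of what a `@job`-style decorator could look like. It is not jobq's implementation: the names `ResourceOptions`, `cpu`, `memory`, and the `resources` attribute are assumptions made up for this illustration, and the real library's options may differ.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ResourceOptions:
    """Hypothetical runtime parameters; not jobq's actual option names."""

    cpu: str = "1"
    memory: str = "512Mi"


def job(*, resources: Optional[ResourceOptions] = None) -> Callable:
    """Toy stand-in for a `@job`-style decorator (illustrative only)."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Local execution: simply call the wrapped function.
            return func(*args, **kwargs)

        # Attach the declared options so a later submission step could read them.
        wrapper.resources = resources or ResourceOptions()
        return wrapper

    return decorator


@job(resources=ResourceOptions(cpu="2", memory="1Gi"))
def add(a: int, b: int) -> int:
    return a + b
```

In this sketch the decorated function still runs locally like any other Python function, while the attached options stay available for whatever component would build and submit the job.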

## Server-side API

The server-side API is a RESTful API that provides a low-level interface for managing jobs in a compute cluster.

It builds on top of Kubernetes and the [Kueue framework](https://kueue.sigs.k8s.io/), which provides a high-level abstraction for managing workloads in a Kubernetes cluster (including queueing, priority-based scheduling, preemption, and resource management).

Kueue itself can manage workloads of various types, including Kubernetes `Jobs` and Kuberay `RayJobs`.
52 changes: 52 additions & 0 deletions client/docs/concepts/identifiers.md
@@ -0,0 +1,52 @@
# Job Identifiers

## Terminology

To understand how jobq identifies jobs, it helps to know the conceptual components of a workload.
A job is composed of an abstract definition of the workload (as a [Kueue `Workload`](https://kueue.sigs.k8s.io/docs/concepts/workload/) custom resource) and a set of Kubernetes resources (for example a Kubernetes `Job`, or a custom resource like the [Kuberay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) `RayJob`) that make up the executable part of the workload.

At first these similar-sounding names can be confusing, so let's establish some terminology:

- A **workload** or **job** (lowercase "w"/"j") is a set of Kubernetes resources that make up the
executable portion of a job.
- A **Workload** (uppercase "W") refers to the Kueue `Workload` custom resource.
- A **Job** (uppercase "J") refers to the Kubernetes `batch/v1/Job` resource (one of the ways in which code can be submitted through jobq).

Kueue handles the `Workload` and updates its status to reflect the current state of the workload.

![Kueue workload components](https://kueue.sigs.k8s.io/images/queueing-components.svg)

## Identifying workloads

Every Workload (as managed by Kueue) carries an automatically generated unique identifier (UID) as well as a human-readable name and namespace.
Either of these could serve as a unique identifier for a Workload. However, a name/namespace combination is not guaranteed to be unique over time (for example, when a resource is deleted and recreated), whereas a UID is.
This makes the UID the more reliable choice for identifying a given Workload resource.
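
The difference between the two kinds of identifier can be illustrated with a small sketch. The `Resource` class below is a stand-in invented for this example, not a Kubernetes or jobq type, and the random UUID only mimics the UID that the Kubernetes API server would assign.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """Minimal stand-in for a Kubernetes object's identifying metadata."""

    namespace: str
    name: str
    # Kubernetes assigns the UID server-side; a random UUID mimics that here.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


first = Resource(namespace="example", name="job-example")
# Simulate deleting and recreating the resource:
# same name and namespace, but a fresh UID.
second = Resource(namespace="example", name="job-example")

same_name = (first.namespace, first.name) == (second.namespace, second.name)
same_uid = first.uid == second.uid
```

Here `same_name` is true while `same_uid` is false, which is why a name/namespace pair cannot tell the original resource apart from its replacement.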

The concrete workload resource has the same kinds of identifiers: a UID and a name/namespace combination.

A job and its associated Workload are linked in a 1:1 fashion: Kueue records the job as the owner of the Workload in the Workload's `metadata.ownerReferences` field.

In theory, this allows a job in the cluster to be identified through two different identifiers:

- the UID of the (concrete) _job_ resource.
- the UID of the (abstract) _Workload_ resource.

In practice, jobq always uses the **UID of the concrete workload** as the identifier for a job.
All CLI operations return and accept the UID of the concrete workload.

As an example, imagine the following resources in the cluster after submitting a job:

```mermaid
graph LR
subgraph "Namespace example"
direction LR
A["`Workload **job-example** <pre>uid-1</pre>`"] --> B["`Job **example** <pre>uid-2</pre>`"] --> C[Pod]
end
```

If we want to query the logs of the job, we can do so by calling `jobq logs` with the UID of the concrete workload:

```console
$ jobq logs uid-2
[... log output ...]
```
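
The lookup rule can be sketched in a few lines of Python. This is only an illustration of the convention described above, using the two resources from the diagram; `ClusterResource`, `RESOURCES`, and `find_job` are hypothetical names and do not reflect how jobq actually queries the cluster.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterResource:
    kind: str
    name: str
    uid: str


# The resources from the diagram above, in namespace "example".
RESOURCES = [
    ClusterResource(kind="Workload", name="job-example", uid="uid-1"),
    ClusterResource(kind="Job", name="example", uid="uid-2"),
]


def find_job(uid: str) -> Optional[ClusterResource]:
    """Return the concrete workload with this UID, ignoring Kueue Workloads."""
    for resource in RESOURCES:
        if resource.kind != "Workload" and resource.uid == uid:
            return resource
    return None
```

With this rule, `find_job("uid-2")` resolves to the `Job` named `example`, while `find_job("uid-1")` returns `None` because `uid-1` belongs to the abstract Workload.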
48 changes: 48 additions & 0 deletions client/docs/concepts/lifecycle.md
@@ -0,0 +1,48 @@
# Job Lifecycle

Since jobq builds on top of the [Kueue](https://kueue.sigs.k8s.io/) job queuing system for scheduling,
the lifecycle of a job is very similar to the lifecycle of a workload in Kueue.

The remainder of this document uses the terms _job_ and _workload_ interchangeably.

After its submission, a workload goes through roughly three phases: _queueing and scheduling_, _execution_, and _completion_.

### Queueing and scheduling

After its submission, a workload is in the `Submitted` state, where it competes with other workloads for available resource quotas.
Once it is admitted to a cluster queue, it enters the `Pending` state, where Kueue will reserve a quota for it.
Alternatively, if the local or cluster queue selected for the workload is stopped or does not exist, the workload enters the `Inadmissible` state until this condition is resolved.

### Execution

After all admission checks for the workload have passed, it enters the `Admitted` state and is now eligible for execution by the cluster.

### Completion

When the workload terminates successfully, it enters the terminal `Succeeded` state.
If an unrecoverable error occurs during execution, the workload enters the terminal `Failed` state. Depending on the type of workload and other factors (such as the retry limit in a `batch/v1/Job`), this does not necessarily happen on the first abnormal termination of a pod.

A currently executing workload may be preempted by another workload (e.g., by a newly submitted workload with a higher priority).
In this case, Kueue will terminate any pods associated with the preempted workload and either requeue it for later execution or evict it from the cluster queue.

## State Diagram

```mermaid
stateDiagram-v2
direction LR
[*] --> Submitted
Submitted --> Pending: quotaReserved
Submitted --> Inadmissible
Inadmissible --> Submitted
Pending --> Admitted: admitted
Admitted --> Succeeded: success
Admitted --> Failed: error
Admitted --> Submitted: evicted
Admitted --> Pending: requeued
Succeeded --> [*]
Failed --> [*]
```
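
The state diagram above can also be written down as a transition table, which makes it easy to check whether a sequence of observed states is consistent with the lifecycle. The sketch below transcribes the diagram's edges one to one; the names `State`, `TRANSITIONS`, `can_transition`, and `is_valid_path` are invented for this example and are not part of jobq or Kueue.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "Submitted"
    INADMISSIBLE = "Inadmissible"
    PENDING = "Pending"
    ADMITTED = "Admitted"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Allowed transitions, copied edge by edge from the state diagram.
TRANSITIONS = {
    State.SUBMITTED: {State.PENDING, State.INADMISSIBLE},
    State.INADMISSIBLE: {State.SUBMITTED},
    State.PENDING: {State.ADMITTED},
    State.ADMITTED: {
        State.SUCCEEDED,  # success
        State.FAILED,     # error
        State.SUBMITTED,  # evicted
        State.PENDING,    # requeued
    },
    State.SUCCEEDED: set(),  # terminal
    State.FAILED: set(),     # terminal
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if the diagram has an edge from `src` to `dst`."""
    return dst in TRANSITIONS[src]


def is_valid_path(path: list) -> bool:
    """Check that a path starts at Submitted and only follows diagram edges."""
    if not path or path[0] is not State.SUBMITTED:
        return False
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))
```

For example, a workload that is preempted and requeued once before succeeding follows the path `Submitted`, `Pending`, `Admitted`, `Pending`, `Admitted`, `Succeeded`, which `is_valid_path` accepts.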
36 changes: 34 additions & 2 deletions client/docs/index.md
@@ -1,3 +1,35 @@
# Jobq, a cluster workflow scheduling tool
---
title: Home
---

This documentation is work in progress.
#

!!! warning "Work in progress"

    This documentation is work in progress.
    Please excuse frequent changes and missing content.

<div class="grid cards" markdown>
- [:material-thought-bubble:{ .lg .middle } **Concepts**](concepts/_index.md)

    ***

    Learn about the concepts behind jobq

- [:material-apple-keyboard-command:{ .lg .middle } **Command-line interface**](cli.md)

    ***

    Learn how to use the `jobq` command-line interface

- [:material-book-open-variant:{ .lg .middle } **API Reference**](reference/SUMMARY.md)

    ***

    Detailed documentation of the jobq Python API

</div>

<hr />

This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0){: target="\_blank" }.