From 68465e7944c241ae9cd48c2ff18cf7c89187395a Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Fri, 25 Oct 2024 08:42:17 -0700
Subject: [PATCH 1/4] [Docs] Update Managed Jobs page.

---
 docs/source/examples/managed-jobs.rst | 83 ++++++++++++++++-----------
 1 file changed, 48 insertions(+), 35 deletions(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index a47b4345b9f..66ec0968de5 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -5,14 +5,20 @@ Managed Jobs

 .. tip::

-  This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines).
+  This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel.

-SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
-It can be used in three modes:
+SkyPilot supports **managed jobs** (:code:`sky jobs`), where "managed" means
+any spot preemptions or hardware failures are auto-recovered by SkyPilot.
+Users can launch:

-#. :ref:`Managed Spot Jobs `: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
-#. :ref:`On-demand `: Jobs run on auto-recovering on-demand instances. This is useful for jobs that require guaranteed resources.
-#. :ref:`Pipelines `: Run pipelines that contain multiple tasks (which can have different resource requirements and ``setup``/``run`` commands). This is useful for running a sequence of tasks that depend on each other, e.g., data processing, training a model, and then running inference on it.
+.. It can be used in three modes:
+
+#. :ref:`Managed spot jobs `: Jobs run on auto-recovering spot instances. This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
+#. :ref:`On-demand or reserved jobs `: Jobs run on auto-recovering on-demand or reserved instances. Useful for jobs that require guaranteed resources.
+#. :ref:`Pipelines `: Run pipelines that contain multiple tasks (which
+   can have different resource requirements and ``setup``/``run`` commands).
+   Useful for running a sequence of tasks that depend on each other, e.g., data
+   processing, training a model, and then running inference on it.


 .. _spot-jobs:
@@ -20,28 +26,12 @@ It can be used in three modes:
 Managed Spot Jobs
 -----------------

-In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
-Any spot preemptions are automatically handled by SkyPilot without user intervention.
-
+In this mode, jobs run on spot instances, and preemptions are auto-recovered by SkyPilot.

-Quick comparison between *unmanaged spot clusters* vs. *managed spot jobs*:
+To launch a managed spot job, use :code:`sky jobs launch --use-spot`.
+SkyPilot automatically finds available spot instances across regions and clouds to maximize availability.
+Any spot preemptions are automatically handled by SkyPilot without user intervention.

-.. list-table::
-   :widths: 30 18 12 35
-   :header-rows: 1
-
-   * - Command
-     - Managed?
-     - SSH-able?
-     - Best for
-   * - :code:`sky launch --use-spot`
-     - Unmanaged spot cluster
-     - Yes
-     - Interactive dev on spot instances (especially for hardware with low preemption rates)
-   * - :code:`sky jobs launch --use-spot`
-     - Managed spot job (auto-recovery)
-     - No
-     - Scaling out long-running jobs (e.g., data processing, training, batch inference)

 Here is an example of a BERT training job failing over different regions across AWS and GCP.
@@ -59,6 +49,25 @@ To use managed spot jobs, there are two requirements:
 #. :ref:`Checkpointing ` (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket `. The program can reload the latest checkpoint when restarted.


+Quick comparison between *managed spot jobs* vs. *launching spot clusters*:
+
+.. list-table::
+   :widths: 30 18 12 35
+   :header-rows: 1
+
+   * - Command
+     - Managed?
+     - SSH-able?
+     - Best for
+   * - :code:`sky jobs launch --use-spot`
+     - Yes, preemptions are auto-recovered
+     - No
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference)
+   * - :code:`sky launch --use-spot`
+     - No, preemptions are not handled
+     - Yes
+     - Interactive dev on spot instances (especially for hardware with low preemption rates)
+
 .. _job-yaml:

 Job YAML
@@ -245,11 +254,11 @@ Real-World Examples

 .. _on-demand:

-Using On-Demand Instances
+On-Demand or Reserved Instances
 --------------------------------

 The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering
-on-demand instances. This is useful to have SkyPilot monitor any underlying
+on-demand or reserved instances. This is useful to have SkyPilot monitor any underlying
 machine failures and transparently recover the job.

 To do so, simply set :code:`use_spot: false` in the :code:`resources` section, or override it with :code:`--use-spot false` in the CLI.
@@ -264,10 +273,10 @@ To do so, simply set :code:`use_spot: false` in the :code:`resources` section, o
   interface, while ``sky launch`` is a cluster interface (that you can launch
   tasks on, albeit not managed).

-Either Spot Or On-Demand
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Either Spot or On-Demand/Reserved
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-You can use ``any_of`` to specify either spot or on-demand instances as
+You can use ``any_of`` to specify either spot or on-demand/reserved instances as
 candidate resources for a job. See documentation :ref:`here ` for more details.


@@ -280,12 +289,17 @@ candidate resources for a job. See documentation :ref:`here
       - use_spot: false

 In this example, SkyPilot will perform cost optimizations to select the resource to use, which almost certainly
-will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand instances.
+will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand/reserved instances.

 More advanced policies for resource selection, such as the `Can't Be Late
 `__ (NSDI'24)
 paper, may be supported in the future.

+Running Many Parallel Jobs
+-------------------------
+
+For batch jobs such as **data processing** or **hyperparameter sweeps**, you can launch many jobs in parallel. See :ref:`many-jobs`.
+
 Useful CLIs
 -----------

@@ -323,7 +337,6 @@ Cancel a managed job:
 If any failure happens for a managed job, you can check :code:`sky jobs queue -a` for the brief reason of the failure. For more details, it would be helpful to check :code:`sky jobs logs --controller `.


-
 .. _pipeline:

 Job Pipelines
@@ -414,8 +427,8 @@ To submit the pipeline, the same command :code:`sky jobs launch` is used. The pi



-Dashboard
----------
+Job Dashboard
+-------------

 Use ``sky jobs dashboard`` to open a dashboard to see all jobs:


From f979d4d153a3e4d4bcfda32cc886932e181cc1c2 Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Fri, 25 Oct 2024 08:48:04 -0700
Subject: [PATCH 2/4] Lint

---
 docs/source/examples/managed-jobs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 66ec0968de5..08a370ab1bf 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -296,7 +296,7 @@ More advanced policies for resource selection, such as the `Can't Be Late
 paper, may be supported in the future.

 Running Many Parallel Jobs
--------------------------
+--------------------------

 For batch jobs such as **data processing** or **hyperparameter sweeps**, you can launch many jobs in parallel. See :ref:`many-jobs`.


From ab0c85c18f75670ba62e5c17e9cc56d3a71bc151 Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Fri, 25 Oct 2024 15:49:58 -0700
Subject: [PATCH 3/4] Updates

---
 docs/source/examples/managed-jobs.rst | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 08a370ab1bf..d85356c936a 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -8,14 +8,14 @@ Managed Jobs
   This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel.

 SkyPilot supports **managed jobs** (:code:`sky jobs`), where "managed" means
-any spot preemptions or hardware failures are auto-recovered by SkyPilot.
-Users can launch:
+if a job's underlying compute experienced any spot preemptions or hardware failures,
+SkyPilot will automatically recover the job.

-.. It can be used in three modes:
+Managed jobs can be used in three modes:

 #. :ref:`Managed spot jobs `: Jobs run on auto-recovering spot instances. This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
-#. :ref:`On-demand or reserved jobs `: Jobs run on auto-recovering on-demand or reserved instances. Useful for jobs that require guaranteed resources.
-#. :ref:`Pipelines `: Run pipelines that contain multiple tasks (which
+#. :ref:`Managed on-demand/reserved jobs `: Jobs run on auto-recovering on-demand or reserved instances. Useful for jobs that require guaranteed resources.
+#. :ref:`Managed pipelines `: Run pipelines that contain multiple tasks (which
    can have different resource requirements and ``setup``/``run`` commands).
    Useful for running a sequence of tasks that depend on each other, e.g., data
    processing, training a model, and then running inference on it.
@@ -254,8 +254,8 @@ Real-World Examples

 .. _on-demand:

-On-Demand or Reserved Instances
---------------------------------
+Managed On-Demand/Reserved Jobs
+-------------------------------

 The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering
 on-demand or reserved instances. This is useful to have SkyPilot monitor any underlying
@@ -339,8 +339,8 @@ Cancel a managed job:

 .. _pipeline:

-Job Pipelines
--------------
+Managed Pipelines
+-----------------

 A pipeline is a managed job that contains a sequence of tasks running one after another.


From 6ff149cc3d3d4424c666c1893d06b1de656a02f6 Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Fri, 25 Oct 2024 16:05:23 -0700
Subject: [PATCH 4/4] reword

---
 docs/source/examples/managed-jobs.rst | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index d85356c936a..993ad361d66 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -7,10 +7,7 @@ Managed Jobs

   This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel.

-SkyPilot supports **managed jobs** (:code:`sky jobs`), where "managed" means
-if a job's underlying compute experienced any spot preemptions or hardware failures,
-SkyPilot will automatically recover the job.
-
+SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any underlying spot preemptions or hardware failures.
 Managed jobs can be used in three modes:

 #. :ref:`Managed spot jobs `: Jobs run on auto-recovering spot instances. This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
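
Note on the checkpointing requirement that appears as context in PATCH 1/4 (user code checkpoints periodically to a mounted cloud bucket and reloads the latest checkpoint when restarted): the pattern can be sketched in Python roughly as below. This is a minimal sketch and not SkyPilot code. The function names, the ``state.json`` file name, and the checkpoint directory are hypothetical; in a real job the directory would be wherever the bucket is mounted. The write-then-rename step is atomic on a local filesystem, but bucket mounts may behave differently.

```python
import json
import os
import tempfile


def load_latest_step(ckpt_dir: str) -> int:
    """Return the last checkpointed step, or 0 if no checkpoint exists yet."""
    path = os.path.join(ckpt_dir, "state.json")  # hypothetical file name
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(ckpt_dir: str, step: int) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, os.path.join(ckpt_dir, "state.json"))


def train(ckpt_dir: str, total_steps: int, save_every: int = 10) -> int:
    """Run (or resume) the loop; return the step the run resumed from."""
    start = load_latest_step(ckpt_dir)
    for step in range(start + 1, total_steps + 1):
        # ... one unit of real work (e.g., a training step) would go here ...
        if step % save_every == 0 or step == total_steps:
            save_step(ckpt_dir, step)
    return start
```

When the job is restarted after a preemption, calling ``train`` again with the same directory picks up after the last saved step instead of starting from step 1.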