Add user toolkits to all sky custom images and fix PyTorch issue on A…
…10 (#4219)

* Add user toolkits to all sky custom images

* address comments
yika-luo authored Oct 30, 2024
1 parent 8568ac4 commit 5dda9cf
Showing 9 changed files with 55 additions and 9 deletions.
16 changes: 16 additions & 0 deletions docs/source/reference/faq.rst
@@ -192,6 +192,22 @@ For example, if you have access to special regions of GCP, add the data to ``~/.
Also, you can update the catalog for a specific cloud by deleting the CSV file (e.g., ``rm ~/.sky/catalogs/<schema-version>/gcp.csv``).
SkyPilot will automatically download the latest catalog in the next run.

Package Installation
---------------------

Unable to import PyTorch in a SkyPilot task.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you use the default SkyPilot images (that is, you do not pass ``--image-id``), ``pip install torch`` should be enough to install `PyTorch <https://pytorch.org/>`_.

However, if you use your own image with an older NVIDIA driver (535.161.08 or lower) and install the default PyTorch build, you may encounter the following error:

.. code-block:: bash

    ImportError: /home/azureuser/miniconda3/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12

You will need to install a PyTorch version that is compatible with your NVIDIA driver, e.g., ``pip install torch --index-url https://download.pytorch.org/whl/cu121``.
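
The choice between the default PyTorch build and the ``cu121`` build can be sketched as a small shell helper. This is a minimal sketch and not part of SkyPilot: the function names ``needs_cu121`` and ``torch_install_cmd`` are hypothetical, the 535.161.08 threshold is the one quoted in this FAQ entry, and the version comparison assumes GNU ``sort -V`` is available.

```shell
#!/bin/bash
# Highest driver version that this FAQ entry reports as needing the cu121 wheels.
MAX_OLD_DRIVER="535.161.08"

# Hypothetical helper: succeeds (exit 0) when the driver version in $1
# is at or below MAX_OLD_DRIVER. Assumes GNU `sort -V` for version ordering.
needs_cu121() {
  local driver="$1"
  local highest
  highest="$(printf '%s\n%s\n' "$driver" "$MAX_OLD_DRIVER" | sort -V | tail -n 1)"
  [ "$highest" = "$MAX_OLD_DRIVER" ]
}

# Hypothetical helper: prints the pip command suited to the driver version in $1.
torch_install_cmd() {
  if needs_cu121 "$1"; then
    echo "pip install torch --index-url https://download.pytorch.org/whl/cu121"
  else
    echo "pip install torch"
  fi
}
```

On a machine with the NVIDIA driver installed, the version can be read with ``nvidia-smi --query-gpu=driver_version --format=csv,noheader`` and passed to ``torch_install_cmd``; the helper only prints the command, so nothing is installed until you run it yourself.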


Miscellaneous
-------------

4 changes: 2 additions & 2 deletions sky/clouds/service_catalog/images/README.md
@@ -24,7 +24,7 @@ FYI time to packer build images:
### GCP
1. Build a single global image.
```bash
-export TYPE=gpu # Update this
+export TYPE=cpu # Update this
export IMAGE=skypilot-gcp-${TYPE}-ubuntu
packer build ${IMAGE}.pkr.hcl
```
@@ -39,7 +39,7 @@ gcloud compute images add-iam-policy-binding ${IMAGE_NAME} --member='allAuthenti
### AWS
1. Generate the source image for a single region.
```bash
-export TYPE=gpu # Update this
+export TYPE=cpu # Update this
export IMAGE=skypilot-aws-${TYPE}-ubuntu
packer build ${IMAGE}.pkr.hcl
```
12 changes: 12 additions & 0 deletions sky/clouds/service_catalog/images/provisioners/user-toolkit.sh
@@ -0,0 +1,12 @@
#!/bin/bash
# This script installs popular toolkits for users to use in the base environment.

eval "$(~/miniconda3/bin/conda shell.bash hook)"
conda activate base
pip install numpy
pip install pandas

if [ "$AZURE_GRID_DRIVER" = 1 ]; then
# Need PyTorch X.X.X+cu121 version to be compatible with older NVIDIA driver (535.161.08 or lower)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
fi
@@ -41,4 +41,7 @@ build {
]
script = "./provisioners/skypilot.sh"
}
provisioner "shell" {
script = "./provisioners/user-toolkit.sh"
}
}
@@ -49,4 +49,7 @@ build {
]
script = "./provisioners/skypilot.sh"
}
provisioner "shell" {
script = "./provisioners/user-toolkit.sh"
}
}
@@ -9,14 +9,12 @@ variable "vm_generation" {
}

locals {
-date = formatdate("YYMMDD", timestamp())
-version = formatdate("YY.MM.DD", timestamp())
+date = formatdate("YYMMDD", timestamp())
+version = formatdate("YY.MM.DD", timestamp())
}

source "azure-arm" "cpu-ubuntu" {
managed_image_resource_group_name = "skypilot-images"
-// TODO(yika): these fields may not be required as we use community images below instead. We need to double-check if these can be removed.
-managed_image_name = "skypilot-azure-cpu-ubuntu-${local.date}"

subscription_id = "59d8c23c-7ef5-42c7-b2f3-a919ad8026a7"
tenant_id = "7c81f068-46f8-4b26-9a46-2fbec2287e3d"
@@ -67,4 +65,7 @@ build {
]
script = "./provisioners/skypilot.sh"
}
provisioner "shell" {
script = "./provisioners/user-toolkit.sh"
}
}
@@ -15,13 +15,12 @@ variable "use_grid_driver" {
}

locals {
-date = formatdate("YYMMDD", timestamp())
-version = formatdate("YY.MM.DD", timestamp())
+date = formatdate("YYMMDD", timestamp())
+version = formatdate("YY.MM.DD", timestamp())
}

source "azure-arm" "gpu-ubuntu" {
managed_image_resource_group_name = "skypilot-images"
-managed_image_name = "skypilot-azure-gpu-ubuntu-${local.date}"

subscription_id = "59d8c23c-7ef5-42c7-b2f3-a919ad8026a7"
tenant_id = "7c81f068-46f8-4b26-9a46-2fbec2287e3d"
@@ -78,4 +77,10 @@ build {
]
script = "./provisioners/skypilot.sh"
}
provisioner "shell" {
environment_vars = [
var.use_grid_driver ? "AZURE_GRID_DRIVER=1" : "AZURE_GRID_DRIVER=0",
]
script = "./provisioners/user-toolkit.sh"
}
}
@@ -27,4 +27,7 @@ build {
]
script = "./provisioners/skypilot.sh"
}
provisioner "shell" {
script = "./provisioners/user-toolkit.sh"
}
}
@@ -40,4 +40,7 @@ build {
]
script = "./provisioners/skypilot.sh"
}
provisioner "shell" {
script = "./provisioners/user-toolkit.sh"
}
}
