- Fine-Tune LLMs with Ray and DeepSpeed on OpenShift AI
Prerequisites:
- Admin access to an OpenShift cluster (CRC is fine)
- OpenDataHub or RHOAI installed, with all Distributed Workloads components enabled
- Go 1.21 installed
The tests read the following environment variables:
- CODEFLARE_TEST_OUTPUT_DIR - Output directory for test logs
- CODEFLARE_TEST_TIMEOUT_SHORT - Timeout duration for short tasks
- CODEFLARE_TEST_TIMEOUT_MEDIUM - Timeout duration for medium tasks
- CODEFLARE_TEST_TIMEOUT_LONG - Timeout duration for long tasks
- CODEFLARE_TEST_RAY_IMAGE (Optional) - Ray image used for RayCluster configuration
- MINIO_CLI_IMAGE (Optional) - MinIO CLI image used for uploading data to and downloading data from the S3 bucket

NOTE: quay.io/modh/ray:2.35.0-py311-cu121 is the default image used for creating a RayCluster resource. If you have a custom Ray image that suits your purposes, specify it in the CODEFLARE_TEST_RAY_IMAGE environment variable.
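As a rough sketch, the common variables could be exported in a shell session before running the tests. The directory path and the timeout values are illustrative assumptions, and so is the idea that timeouts are written as Go duration strings such as 5m; check the test support code for the accepted format.

```shell
# Illustrative values only; adjust for your cluster.
# Assumption: any writable directory works as the log output location.
export CODEFLARE_TEST_OUTPUT_DIR="${TMPDIR:-/tmp}/distributed-workloads-test-output"
mkdir -p "$CODEFLARE_TEST_OUTPUT_DIR"

# Assumption: timeouts are Go duration strings (for example 1m or 15m).
export CODEFLARE_TEST_TIMEOUT_SHORT="1m"
export CODEFLARE_TEST_TIMEOUT_MEDIUM="5m"
export CODEFLARE_TEST_TIMEOUT_LONG="15m"

# Optional: uncomment to set the Ray image explicitly.
# export CODEFLARE_TEST_RAY_IMAGE="quay.io/modh/ray:2.35.0-py311-cu121"
```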
- FMS_HF_TUNING_IMAGE - Image tag used in the PyTorchJob CR for model training
- TEST_NAMESPACE_NAME (Optional) - Existing namespace where the Training operator GPU tests will be executed
- HF_TOKEN - HuggingFace token used to pull models that have limited access
- GPTQ_MODEL_PVC_NAME - Name of the PersistentVolumeClaim containing downloaded GPTQ models
To upload the trained model into S3-compatible storage, set the following environment variables:
- AWS_DEFAULT_ENDPOINT - Storage bucket endpoint to upload the trained model to; if set, the test uploads the model into the S3 bucket
- AWS_ACCESS_KEY_ID - Storage bucket access key
- AWS_SECRET_ACCESS_KEY - Storage bucket secret key
- AWS_STORAGE_BUCKET - Storage bucket name
- AWS_STORAGE_BUCKET_MODEL_PATH (Optional) - Path in the storage bucket where the trained model will be stored
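Because the upload is triggered by AWS_DEFAULT_ENDPOINT being set, a small preflight check can catch a half-configured environment before a long test run. The sketch below uses a hypothetical helper, check_s3_upload_env, which is not part of the test suite. It assumes that the access key, secret key, and bucket name are all required once the endpoint is set, since the list above does not mark them as optional.

```shell
# check_s3_upload_env: hypothetical helper, not part of the test suite.
# Returns 0 when the endpoint is unset or all companion variables are set,
# and 1 when the endpoint is set but a companion variable is missing.
check_s3_upload_env() {
    # Assumption: without an endpoint the tests do not attempt an upload.
    if [ -z "${AWS_DEFAULT_ENDPOINT:-}" ]; then
        echo "S3 upload disabled: AWS_DEFAULT_ENDPOINT is not set"
        return 0
    fi
    missing=""
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_STORAGE_BUCKET; do
        # Indirect lookup of the variable named in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "S3 upload enabled but missing:$missing" >&2
        return 1
    fi
    echo "S3 upload configured for bucket $AWS_STORAGE_BUCKET"
    return 0
}
```

Calling check_s3_upload_env before go test lets a script stop early with a list of the missing variable names.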
- ODH_NAMESPACE - Namespace where ODH components are installed
- NOTEBOOK_USER_NAME - Username of the user running the Workbench
- NOTEBOOK_USER_TOKEN - Login token of the user running the Workbench
- NOTEBOOK_IMAGE - Image used for running the Workbench
To download the MNIST datasets used by the training script from S3-compatible storage, set the following environment variables:
- AWS_DEFAULT_ENDPOINT - Storage bucket endpoint from which to download the MNIST datasets
- AWS_ACCESS_KEY_ID - Storage bucket access key
- AWS_SECRET_ACCESS_KEY - Storage bucket secret key
- AWS_STORAGE_BUCKET - Storage bucket name
- AWS_STORAGE_BUCKET_MNIST_DIR - Storage bucket directory from which to download the MNIST datasets
Execute tests like standard Go unit tests.
go test -timeout 60m ./tests/kfto/
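Since these are ordinary Go tests, the standard go test flags -v (verbose output) and -run (select tests by regular expression) should also apply. The wrapper below is a hypothetical convenience function, not something shipped with the repository; it prints the command instead of running it when DRY_RUN=1 is set.

```shell
# run_kfto_tests: hypothetical wrapper around the go test command above.
# $1 (optional) is a regular expression passed to -run to select tests.
# With DRY_RUN=1 the command is printed instead of executed.
run_kfto_tests() {
    pattern="${1:-}"
    # -v prints each test as it runs; -timeout matches the command above.
    set -- go test -v -timeout 60m
    if [ -n "$pattern" ]; then
        set -- "$@" -run "$pattern"
    fi
    set -- "$@" ./tests/kfto/
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 run_kfto_tests 'TestExample' prints the command that would run, where TestExample is a placeholder rather than a real test name from the suite.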