[GSoC] Update tune API for LLM hyperparameters optimization (#2393)
* update tune api for llm hyperparameters optimization

Signed-off-by: helenxie-bit <[email protected]>

* resolve conflict

Signed-off-by: helenxie-bit <[email protected]>

* fix the problem of dependency

Signed-off-by: helenxie-bit <[email protected]>

* fix the format of import statement

Signed-off-by: helenxie-bit <[email protected]>

* adjust the blank lines

Signed-off-by: helenxie-bit <[email protected]>

* delete the trainer to reuse it in Training Operator

Signed-off-by: helenxie-bit <[email protected]>

* update constants

Signed-off-by: helenxie-bit <[email protected]>

* update metrics format

Signed-off-by: helenxie-bit <[email protected]>

* update the type of  and

Signed-off-by: helenxie-bit <[email protected]>

* update the message of 'ImportError'

Signed-off-by: helenxie-bit <[email protected]>

* add TODO of PVC creation

Signed-off-by: helenxie-bit <[email protected]>

* update the name of pvc

Signed-off-by: helenxie-bit <[email protected]>

* reuse constants from Training Operator

Signed-off-by: helenxie-bit <[email protected]>

* keep 'parameters' and update validation

Signed-off-by: helenxie-bit <[email protected]>

* update for test

Signed-off-by: helenxie-bit <[email protected]>

* reuse 'get_container_spec' and 'get_pod_template_spec' from Training Operator

Signed-off-by: helenxie-bit <[email protected]>

* format with black

Signed-off-by: helenxie-bit <[email protected]>

* fix Lint error

Signed-off-by: helenxie-bit <[email protected]>

* fix Lint errors

Signed-off-by: helenxie-bit <[email protected]>

* delete types

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* update format

Signed-off-by: helenxie-bit <[email protected]>

* update format

Signed-off-by: helenxie-bit <[email protected]>

* fix e2e test error

Signed-off-by: helenxie-bit <[email protected]>

* add TODO

Signed-off-by: helenxie-bit <[email protected]>

* format with max line length

Signed-off-by: helenxie-bit <[email protected]>

* format docstring

Signed-off-by: helenxie-bit <[email protected]>

* update format

Signed-off-by: helenxie-bit <[email protected]>

* add helper functions

Signed-off-by: helenxie-bit <[email protected]>

* update format

Signed-off-by: helenxie-bit <[email protected]>

* update format

Signed-off-by: helenxie-bit <[email protected]>

* run test again

Signed-off-by: helenxie-bit <[email protected]>

* run test again

Signed-off-by: helenxie-bit <[email protected]>

* run test again

Signed-off-by: helenxie-bit <[email protected]>

* fix dict substitution in training_parameters

Signed-off-by: helenxie-bit <[email protected]>

* fix typo

Signed-off-by: helenxie-bit <[email protected]>

* resolve conflicts and add check for case of no parameters

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix flake8 error

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* update isort file to black and fix typo

Signed-off-by: helenxie-bit <[email protected]>

* modify the set of metrics format

Signed-off-by: helenxie-bit <[email protected]>

* update tune API

Signed-off-by: helenxie-bit <[email protected]>

* add types.TrainerResources class

Signed-off-by: helenxie-bit <[email protected]>

* fix flake8 error

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* resolve conflict

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* delete properties of 'TrainerResources'

Signed-off-by: helenxie-bit <[email protected]>

* fix format error

Signed-off-by: helenxie-bit <[email protected]>

* update types

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* add import of 'TrainerResources' in '__init__.py' of katib

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* revert changes and rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* check pvc and pv status of katib deployments

Signed-off-by: helenxie-bit <[email protected]>

* check pvc and pv status of katib deployments

Signed-off-by: helenxie-bit <[email protected]>

* recommit changes

Signed-off-by: helenxie-bit <[email protected]>

* update minikube version when setup

Signed-off-by: helenxie-bit <[email protected]>

* delete the code that disables formatting for the tune function

Signed-off-by: helenxie-bit <[email protected]>

* update according to andrey's feedback

Signed-off-by: helenxie-bit <[email protected]>

* add helper function in utils

Signed-off-by: helenxie-bit <[email protected]>

* fix format

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* move metrics_collector_spec back & update helper functions & add return type for helper functions

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* fix some typos

Signed-off-by: helenxie-bit <[email protected]>

* simplify the definition of 'TrainerResources'

Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
helenxie-bit authored Sep 3, 2024
1 parent a524f33 commit e251a07
Showing 8 changed files with 584 additions and 151 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/template-setup-e2e-test/action.yaml
@@ -37,7 +37,7 @@ runs:
         version: ${{ inputs.kubernetes-version }}

     - name: Setup Minikube Cluster
-      uses: medyagh/setup-minikube@v0.0.16
+      uses: medyagh/setup-minikube@v0.0.18
       with:
         network-plugin: cni
         cni: flannel
4 changes: 4 additions & 0 deletions hack/gen-python-sdk/post_gen.py
@@ -41,6 +41,10 @@ def _rewrite_helper(input_file, output_file, rewrite_rules):
     if output_file == "sdk/python/v1beta1/kubeflow/katib/__init__.py":
         lines.append("# Import Katib API client.\n")
         lines.append("from kubeflow.katib.api.katib_client import KatibClient\n")
+        lines.append("# Import Katib TrainerResources class.\n")
+        lines.append(
+            "from kubeflow.katib.types.trainer_resources import TrainerResources\n"
+        )
         lines.append("# Import Katib report metrics functions\n")
         lines.append("from kubeflow.katib.api.report_metrics import report_metrics\n")
         lines.append("# Import Katib helper functions.\n")
2 changes: 2 additions & 0 deletions sdk/python/v1beta1/kubeflow/katib/__init__.py
@@ -71,6 +71,8 @@

 # Import Katib API client.
 from kubeflow.katib.api.katib_client import KatibClient
+# Import Katib TrainerResources class.
+from kubeflow.katib.types.trainer_resources import TrainerResources
 # Import Katib report metrics functions
 from kubeflow.katib.api.report_metrics import report_metrics
 # Import Katib helper functions.
567 changes: 419 additions & 148 deletions sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions sdk/python/v1beta1/kubeflow/katib/constants/constants.py
@@ -60,3 +60,8 @@
 BASE_IMAGE_MXNET = "docker.io/mxnet/python:1.9.1_native_py3"

 DEFAULT_DB_MANAGER_ADDRESS = "katib-db-manager.kubeflow:6789"
+
+# The default value for dataset and model storage PVC.
+PVC_DEFAULT_SIZE = "10Gi"
+# The default value for PVC access modes.
+PVC_DEFAULT_ACCESS_MODES = ["ReadWriteOnce", "ReadOnlyMany"]
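
This hunk only introduces the two PVC defaults. The code that consumes them lives in `katib_client.py`, whose diff is not rendered on this page. As a rough illustration of what such defaults usually feed into, the sketch below builds a plain PersistentVolumeClaim manifest as a Python dict. The helper name `build_pvc_manifest`, the claim name, and the namespace are hypothetical and are not taken from the commit.

```python
from typing import Any, Dict, List, Optional

# Values copied from the constants added in this commit.
PVC_DEFAULT_SIZE = "10Gi"
PVC_DEFAULT_ACCESS_MODES = ["ReadWriteOnce", "ReadOnlyMany"]


def build_pvc_manifest(
    name: str,
    namespace: str,
    size: str = PVC_DEFAULT_SIZE,
    access_modes: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: return a PVC manifest that falls back to the defaults."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Copy the list so callers cannot mutate the module-level default.
            "accessModes": list(access_modes or PVC_DEFAULT_ACCESS_MODES),
            "resources": {"requests": {"storage": size}},
        },
    }


manifest = build_pvc_manifest("tune-experiment", "kubeflow")
print(manifest["spec"])
```

Passing `size` or `access_modes` explicitly overrides the defaults, which is presumably how a caller would request a larger volume.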
10 changes: 10 additions & 0 deletions sdk/python/v1beta1/kubeflow/katib/types/trainer_resources.py
@@ -0,0 +1,10 @@
+class TrainerResources(object):
+    def __init__(
+        self,
+        num_workers=None,
+        num_procs_per_worker=None,
+        resources_per_worker=None,
+    ):
+        self.num_workers = num_workers
+        self.num_procs_per_worker = num_procs_per_worker
+        self.resources_per_worker = resources_per_worker
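
`TrainerResources` is a plain value holder with three optional fields. A minimal usage sketch follows. The class body is copied from the hunk above so the snippet runs on its own, while the concrete values and the keys inside `resources_per_worker` (`gpu`, `cpu`, `memory`) are assumptions made for illustration, since the diff does not document the expected shape.

```python
class TrainerResources(object):
    """Copied from the diff so the example is self-contained."""

    def __init__(
        self,
        num_workers=None,
        num_procs_per_worker=None,
        resources_per_worker=None,
    ):
        self.num_workers = num_workers
        self.num_procs_per_worker = num_procs_per_worker
        self.resources_per_worker = resources_per_worker


# Illustrative values only; the accepted resource keys are an assumption.
resources = TrainerResources(
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "10G"},
)

# Every field defaults to None when omitted.
empty = TrainerResources()
print(resources.num_workers, empty.num_workers)
```

With the `__init__.py` change in this commit, the class should also be importable as `from kubeflow.katib import TrainerResources`.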
142 changes: 140 additions & 2 deletions sdk/python/v1beta1/kubeflow/katib/utils/utils.py
@@ -12,15 +12,19 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import copy
 import inspect
 import json
+import logging
 import os
 import textwrap
-from typing import Any, Callable
+from typing import Any, Callable, Dict, List, Optional, Union

 from kubeflow.katib import models
 from kubeflow.katib.constants import constants

+logger = logging.getLogger(__name__)
+

 def is_running_in_k8s():
     return os.path.isdir("/var/run/secrets/kubernetes.io/")
@@ -85,7 +89,6 @@ def validate_metrics_value(value: Any):


 def validate_objective_function(objective: Callable):
-
     # Check if objective function is callable.
     if not callable(objective):
         raise ValueError(
@@ -129,3 +132,138 @@ class FakeResponse:

     def __init__(self, obj):
         self.data = json.dumps(obj)
+
+
+class SetEncoder(json.JSONEncoder):
+    def default(self, obj):
+        if isinstance(obj, set):
+            return list(obj)
+        if isinstance(obj, type):
+            return obj.__name__
+        return json.JSONEncoder.default(self, obj)
+
+
+def get_trial_substitutions_from_dict(
+    parameters: Dict[str, Any],
+    experiment_params: List[models.V1beta1ParameterSpec],
+    trial_params: List[models.V1beta1TrialParameterSpec],
+) -> Dict[str, str]:
+    for p_name, p_value in parameters.items():
+        # If input parameter value is Katib Experiment parameter sample.
+        if isinstance(p_value, models.V1beta1ParameterSpec):
+            # Wrap value for the function input.
+            parameters[p_name] = f"${{trialParameters.{p_name}}}"
+
+            # Add value to the Katib Experiment parameters.
+            p_value.name = p_name
+            experiment_params.append(p_value)
+
+            # Add value to the Katib Experiment's Trial parameters.
+            trial_params.append(
+                models.V1beta1TrialParameterSpec(name=p_name, reference=p_name)
+            )
+        else:
+            # Otherwise, add value to the function input.
+            parameters[p_name] = p_value
+
+    return parameters
+
+
+def get_trial_substitutions_from_trainer(
+    parameters: Union["TrainingArguments", "LoraConfig"],  # noqa: F821
+    experiment_params: List[models.V1beta1ParameterSpec],
+    trial_params: List[models.V1beta1TrialParameterSpec],
+) -> Dict[str, str]:
+    from peft import LoraConfig  # noqa: F401
+    from transformers import TrainingArguments  # noqa: F401
+
+    if isinstance(parameters, TrainingArguments):
+        parameters_dict = parameters.to_dict()
+    else:
+        parameters_dict = parameters.__dict__
+
+    for p_name, p_value in parameters_dict.items():
+        if not hasattr(parameters, p_name):
+            logger.warning(f"Training parameter {p_name} is not supported.")
+            continue
+
+        if isinstance(p_value, models.V1beta1ParameterSpec):
+            old_attr = getattr(parameters, p_name, None)
+            if old_attr is not None:
+                value = f"${{trialParameters.{p_name}}}"
+                setattr(parameters, p_name, value)
+            p_value.name = p_name
+            experiment_params.append(p_value)
+            trial_params.append(
+                models.V1beta1TrialParameterSpec(name=p_name, reference=p_name)
+            )
+        elif p_value is not None:
+            old_attr = getattr(parameters, p_name, None)
+            if old_attr is not None:
+                if isinstance(p_value, dict):
+                    # Update the existing dictionary without nesting
+                    value = copy.deepcopy(p_value)
+                else:
+                    value = type(old_attr)(p_value)
+                setattr(parameters, p_name, value)
+
+    if isinstance(parameters, TrainingArguments):
+        parameters = json.dumps(parameters.to_dict())
+    else:
+        parameters = json.dumps(parameters.__dict__, cls=SetEncoder)
+
+    return parameters
+
+
+def get_exec_script_from_objective(
+    objective: Callable,
+    input_params: Dict[str, Any] = None,
+    packages_to_install: Optional[List[str]] = None,
+    pip_index_url: str = "https://pypi.org/simple",
+) -> str:
+    """
+    Get executable script for container args from the given objective function and parameters.
+    """
+    # Validate objective function.
+    validate_objective_function(objective)
+
+    # Extract objective function implementation.
+    objective_code = inspect.getsource(objective)
+
+    # Objective function might be defined in some indented scope
+    # (e.g. in another function). We need to dedent the function code.
+    objective_code = textwrap.dedent(objective_code)
+
+    # Wrap objective function to execute it from the file. For example:
+    # def objective(parameters):
+    #     print(f'Parameters are {parameters}')
+    # objective({
+    #     'lr': '${trialParameters.lr}',
+    #     'epochs': '${trialParameters.epochs}',
+    #     'is_dist': False
+    # })
+    objective_code = f"{objective_code}\n{objective.__name__}({input_params})\n"
+
+    # Prepare execute script template.
+    exec_script = textwrap.dedent(
+        """
+        program_path=$(mktemp -d)
+        read -r -d '' SCRIPT << EOM\n
+        {objective_code}
+        EOM
+        printf "%s" "$SCRIPT" > $program_path/ephemeral_objective.py
+        python3 -u $program_path/ephemeral_objective.py"""
+    )
+
+    # Add objective code to the execute script.
+    exec_script = exec_script.format(objective_code=objective_code)
+
+    # Install Python packages if that is required.
+    if packages_to_install is not None:
+        exec_script = (
+            get_script_for_python_packages(packages_to_install, pip_index_url)
+            + exec_script
+        )
+
+    # Return executable script to execute objective function.
+    return exec_script
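
The dict-based helper and `get_exec_script_from_objective` work together: search-space entries are swapped for `${trialParameters.<name>}` placeholders, and the resulting dict is rendered as the argument of the objective call. The sketch below reproduces that flow without importing Katib. `FakeParameterSpec` is a hypothetical stand-in for `models.V1beta1ParameterSpec`, `substitute` is a simplified variant that returns a new dict instead of mutating its input, and the objective source is written inline rather than obtained through `inspect.getsource`.

```python
import textwrap
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FakeParameterSpec:
    """Hypothetical stand-in for models.V1beta1ParameterSpec."""

    parameter_type: str
    name: Optional[str] = None


def substitute(
    parameters: Dict[str, Any], experiment_params: List[FakeParameterSpec]
) -> Dict[str, Any]:
    """Replace search-space entries with Katib trial placeholders."""
    result = {}
    for p_name, p_value in parameters.items():
        if isinstance(p_value, FakeParameterSpec):
            # Doubled braces emit literal braces: ${trialParameters.<name>}.
            result[p_name] = f"${{trialParameters.{p_name}}}"
            p_value.name = p_name
            experiment_params.append(p_value)
        else:
            # Fixed values are passed through unchanged.
            result[p_name] = p_value
    return result


# Objective source written inline, so the sketch does not rely on
# inspect.getsource() being able to locate a source file.
objective_code = textwrap.dedent(
    """
    def objective(parameters):
        print(f"Parameters are {parameters}")
    """
).strip()

experiment_params: List[FakeParameterSpec] = []
input_params = substitute(
    {"lr": FakeParameterSpec("double"), "is_dist": False}, experiment_params
)

# Same wrapping as in the diff: the dict repr becomes the call argument.
wrapped = f"{objective_code}\nobjective({input_params})\n"
print(wrapped)
```

Because each placeholder sits inside quotes in the rendered call, the objective presumably receives tuned values as strings and has to convert them itself, for example with `float(parameters["lr"])`.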
3 changes: 3 additions & 0 deletions sdk/python/v1beta1/setup.py
@@ -85,4 +85,7 @@
         "Topic :: Software Development :: Libraries :: Python Modules",
     ],
     install_requires=REQUIRES,
+    extras_require={
+        "huggingface": ["kubeflow-training[huggingface]==1.8.0"],
+    },
 )
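
The new `huggingface` extra makes the LLM dependencies opt-in, which fits the lazy `peft` and `transformers` imports inside `get_trial_substitutions_from_trainer`. One common way to surface a missing extra is to turn the `ImportError` into an install hint, roughly as sketched below. The helper `require_optional` and the message wording are hypothetical, and the hint assumes the distribution is published as `kubeflow-katib`; the actual message used in `katib_client.py` is not visible in this diff.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'kubeflow-katib[{extra}]'"
        ) from exc


# A standard-library module imports fine; a missing one produces the hint.
json_module = require_optional("json", "huggingface")
try:
    require_optional("module_that_does_not_exist_xyz", "huggingface")
except ImportError as err:
    message = str(err)
    print(message)
```

Keeping the import inside the function that needs it means users who never tune LLM hyperparameters do not have to install the heavier packages.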
