Kubeflow TensorFlow Training Operator Add Evaluator #1870

Merged on Nov 8, 2023 (31 commits)

Future-Outlier (Member) commented Oct 4, 2023

TL;DR

Enable running a data service in the Kubeflow TensorFlow training operator by using the evaluator section of TF_CONFIG.

Describe your changes

Enable running a data service by using the evaluator section of TF_CONFIG to configure data service worker information, as discussed in this Slack conversation.

Existing use cases do not include an evaluator section, so the evaluator gets a default value to keep those cases working.
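
As a rough illustration of the point above, the sketch below reads an evaluator entry out of a TF_CONFIG value and falls back to an empty list when the section is missing. It assumes the evaluator replicas show up as an `evaluator` key under `cluster`; the host names and the helper `evaluator_addresses` are hypothetical and are not part of this PR.

```python
import json
import os

# Hypothetical TF_CONFIG; in a real TFJob the operator sets this per replica.
example_tf_config = {
    "cluster": {
        "chief": ["chief-0:2222"],
        "worker": ["worker-0:2222"],
        "ps": ["ps-0:2222"],
        "evaluator": ["evaluator-0:2222"],  # assumed location of the new section
    },
    "task": {"type": "evaluator", "index": 0},
}


def evaluator_addresses(tf_config_json: str) -> list:
    """Return the evaluator addresses, or [] when no evaluator section exists."""
    cluster = json.loads(tf_config_json).get("cluster", {})
    return cluster.get("evaluator", [])


# Case 1: TF_CONFIG that includes an evaluator section.
os.environ["TF_CONFIG"] = json.dumps(example_tf_config)
with_evaluator = evaluator_addresses(os.environ["TF_CONFIG"])

# Case 2: an older-style config without the section still parses cleanly.
legacy_config = {
    "cluster": {"worker": ["worker-0:2222"]},
    "task": {"type": "worker", "index": 0},
}
without_evaluator = evaluator_addresses(json.dumps(legacy_config))

print(with_evaluator, without_evaluator)
```

Treating a missing section as an empty list is one way to keep jobs that never declared an evaluator working unchanged.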

Setup Process

I tested it in two ways: with a Dockerfile and with ImageSpec.

Dockerfile

FROM python:3.9-slim-buster
USER root
WORKDIR /root
ENV PYTHONPATH=/root
# Install the build tools and git in a single layer
RUN apt-get update && apt-get install -y build-essential git
# The following line is an example of how to install your modified plugins. In this case, it demonstrates how to install the 'deck' plugin.
# RUN pip install -U git+https://github.com/Yicheng-Lu-llll/flytekit.git@"demo#egg=flytekitplugins-deck-standard&subdirectory=plugins/flytekit-deck-standard" # replace with your own repo and branch
RUN pip install -U git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9#subdirectory=plugins/flytekit-kf-tensorflow

RUN pip install -U git+https://github.com/Future-Outlier/flyte.git@647b8f4eeeab1a65866d19fab13c416ed0e4a07f#subdirectory=flyteidl

RUN pip install -U git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9

Use the code below, saved as kubeflow_tf_evaluator.py to match the run command:

# Only the names used in this example are imported
from flytekit import Resources, task
from flytekitplugins.kftensorflow import PS, Chief, Evaluator, TfJob, Worker

task_config = TfJob(
    worker=Worker(replicas=1),
    chief=Chief(replicas=1),
    ps=PS(replicas=1),
    evaluator=Evaluator(replicas=1),
)


@task(
    task_config=task_config,
    cache=True,
    requests=Resources(cpu="1"),
    cache_version="1",
)
def my_tensorflow_task(x: int, y: str) -> int:
    return x


if __name__ == "__main__":
    print(my_tensorflow_task(x=10, y="hello"))

Run it remotely, so the execution shows up in the Flyte console, with this command:

pyflyte run --remote --image futureoutlier/kubeflow:tfoperator-v2 \
kubeflow_tf_evaluator.py my_tensorflow_task --x 100 --y acc

ImageSpec

# Only the names used in this example are imported
from flytekit import ImageSpec, Resources, task
from flytekitplugins.kftensorflow import PS, Chief, Evaluator, TfJob, Worker

kubeflow_plugin = "git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9#subdirectory=plugins/flytekit-kf-tensorflow"
kubeflow_idl = "git+https://github.com/Future-Outlier/flyte.git@e3d022ae86466632f0b8eeae80bc07441827e403#subdirectory=flyteidl"
flytekit = "git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9"

# base_image="futureoutlier/kubeflow:tfoperator-v2"
image_spec = ImageSpec(
    packages=[flytekit, kubeflow_idl, kubeflow_plugin],
    apt_packages=["git"],
    registry="futureoutlier",
)
# build-essential git
task_config = TfJob(
    worker=Worker(replicas=1),
    chief=Chief(replicas=1),
    ps=PS(replicas=1),
    evaluator=Evaluator(replicas=1),
)


@task(
    task_config=task_config,
    cache=True,
    requests=Resources(cpu="1"),
    cache_version="1",
    container_image=image_spec,
)
def my_tensorflow_task(x: int, y: str) -> int:
    return x


if __name__ == "__main__":
    print(my_tensorflow_task(x=10, y="hello"))

Run it with this command:

pyflyte run --remote kubeflow_tf_evaluator.py my_tensorflow_task --x 20231008 --y AMAZING

Screenshot

Dockerfile: [image]

ImageSpec: [image]

Kubeflow Training Operator Pods: [image]

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

The TFJob task config has no element for evaluators, even though evaluators are part of the TFJob spec.
Let's add one and make it optional!
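
One way to picture an optional evaluator element is a config field with a default, so that configs written before this change still construct. The sketch below uses hypothetical dataclasses (`ReplicaSketch`, `TfJobSketch`) and a hypothetical helper `replica_count`; it is not the plugin's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplicaSketch:
    """Hypothetical stand-in for a replica group such as Worker or Evaluator."""
    replicas: Optional[int] = None


@dataclass
class TfJobSketch:
    """Hypothetical stand-in for the task config; evaluator defaults to 'unset'."""
    worker: ReplicaSketch = field(default_factory=ReplicaSketch)
    evaluator: ReplicaSketch = field(default_factory=ReplicaSketch)


def replica_count(spec: ReplicaSketch) -> int:
    """Treat an unset replica count as zero replicas."""
    return spec.replicas or 0


# A config written before the evaluator existed still works unchanged.
old_style = TfJobSketch(worker=ReplicaSketch(replicas=1))

# A config that opts in to an evaluator.
new_style = TfJobSketch(
    worker=ReplicaSketch(replicas=1),
    evaluator=ReplicaSketch(replicas=1),
)

print(replica_count(old_style.evaluator), replica_count(new_style.evaluator))
```

Using `default_factory` gives each config its own default object, which avoids sharing one mutable instance across task configs.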

Tracking Issue

flyteorg/flyte#4167
flyteorg/flyte#4168


codecov bot commented Oct 4, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (e41ec1e) 94.95% compared to head (7a0f729) 62.83%.
Report is 1 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1870       +/-   ##
===========================================
- Coverage   94.95%   62.83%   -32.12%     
===========================================
  Files         136      307      +171     
  Lines        6165    22998    +16833     
  Branches        0     3490     +3490     
===========================================
+ Hits         5854    14451     +8597     
- Misses        311     8125     +7814     
- Partials        0      422      +422     
Files Coverage Δ
...ensorflow/flytekitplugins/kftensorflow/__init__.py 100.00% <100.00%> (ø)
...ytekit-kf-tensorflow/tests/test_tensorflow_task.py 94.82% <100.00%> (ø)
...kf-tensorflow/flytekitplugins/kftensorflow/task.py 93.33% <92.85%> (-0.08%) ⬇️

... and 171 files with indirect coverage changes


kumare3 (Contributor) commented Oct 5, 2023

Can we write more about what evaluators are used for? Have you found an example of their usage?

@Future-Outlier (Member Author)

> Can we write more about what is the use of evaluators? have you found some example of its usage?

No problem, I will do it.

Future Outlier and others added 7 commits October 6, 2023 17:23
@Future-Outlier (Member Author)

The tests will pass only after this pull request is merged:
flyteorg/flyte#4168

@Future-Outlier (Member Author)

I asked LinkedIn software engineer @yubofredwang about the PR, and he said it is great!

@pingsutw (Member) left a comment

LGTM, will merge this after FlyteIDL is released.

@Future-Outlier changed the title from "Kubeflow TensorFlow Training Operator" to "Kubeflow TensorFlow Training Operator Add Evaluator" on Nov 5, 2023
Signed-off-by: Future Outlier <[email protected]>
pingsutw previously approved these changes Nov 5, 2023
@pingsutw pingsutw merged commit 2c98809 into flyteorg:master Nov 8, 2023
69 of 71 checks passed
ringohoffman pushed a commit to ringohoffman/flytekit that referenced this pull request Nov 24, 2023
---------

Signed-off-by: Future Outlier <[email protected]>
Co-authored-by: Future Outlier <[email protected]>
RRap0so pushed a commit to RRap0so/flytekit that referenced this pull request Dec 15, 2023
---------

Signed-off-by: Future Outlier <[email protected]>
Co-authored-by: Future Outlier <[email protected]>
Signed-off-by: Rafael Raposo <[email protected]>
@Future-Outlier Future-Outlier mentioned this pull request Oct 9, 2024