
Add integration tests for PyTorch, TGI and TEI DLCs #79

Open · wants to merge 82 commits into main
Conversation

@alvarobartt (Member) commented Aug 30, 2024

Description

This PR adds some integration tests for the following Hugging Face DLCs on Google Cloud:

  • TGI only on GPU
  • TEI on both CPU and GPU
  • PyTorch Inference on both CPU and GPU
  • PyTorch Training only on GPU

The inference-related tests exercise different alternatives, and also emulate the Vertex AI environment via the `AIP_` environment variables that Vertex AI exposes and that the Hugging Face DLCs on Google Cloud handle for a seamless integration.
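As an illustration, here is a minimal sketch of that emulation, assuming the `docker` Python SDK; the variable values are illustrative, and the actual tests in this PR may wire things differently:

```python
import docker

client = docker.from_env()

# Vertex AI exposes `AIP_`-prefixed variables to custom containers; the
# Hugging Face DLCs read them to configure the port, routes, and model
# location. All values below are illustrative.
aip_env = {
    "AIP_HTTP_PORT": "8080",
    "AIP_PREDICT_ROUTE": "/predict",
    "AIP_HEALTH_ROUTE": "/health",
    "AIP_STORAGE_URI": "gs://example-bucket/path/to/model",
}

container = client.containers.run(
    "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311",
    environment=aip_env,
    ports={"8080/tcp": 8080},
    detach=True,
)
```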

As it will be reused within the TGI and TEI tests
Pass args via `text_generation_launcher_kwargs` and include the Vertex AI environment emulation via the `AIP_` environment variables.
@alvarobartt changed the title from "[TESTS] Add some integration tests (WIP)" to "Add integration tests for PyTorch, TGI and TEI DLCs" on Sep 2, 2024
@philschmid (Member) left a comment


Great work! Added some minor comments.

Comment on lines 37 to 46:

    - name: Set up uv
      run: |
        curl -LsSf https://astral.sh/uv/install.sh | sh
        export PATH=$HOME/.cargo/bin:$PATH
        uv --version

    - name: Install dependencies
      run: |
        uv venv --python 3.10
        uv pip install -r tests/requirements.txt
Member: should we add a "cache"?

Member (Author): AFAIK the VMs are ephemeral, so the cache would be destroyed after each job is done, and uv is already pretty fast (it downloads those in under 10 seconds).

Comment on lines 37 to 39:

    training-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.transformers.4-42.ubuntu2204.py310
    inference-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311
    tgi-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
Member: Mhm, is there a better way to specify those? It feels like we could easily forget to update them.

Outdated review threads, resolved:

  • tests/pytorch/training/test_trl.py (×2)
  • tests/requirements.txt (×2)
  • tests/tei/test_tei.py (×2)
  • tests/tgi/test_tgi.py
- Capture `container_uri` from an environment variable before running the tests, and remove the default value to prevent issues when testing
- Remove `max_train_epochs=-1` since it is not required when `max_steps` is already specified
- Rename `test_transformers` to `test_huggingface_inference_toolkit`
- Remove the `transformers` and `jinja2` dependencies as not required, as well as the `AutoTokenizer` usage for prompt formatting

Co-authored-by: Philipp Schmid <[email protected]>
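As a sketch, capturing `container_uri` from the environment without a default could look like the following pytest fixture; the `CONTAINER_URI` variable name and the fixture shape are assumptions, not necessarily what the PR implements:

```python
import os

import pytest


@pytest.fixture(scope="session")
def container_uri() -> str:
    # Fail fast when the environment variable is missing instead of
    # silently falling back to a potentially stale default image.
    uri = os.getenv("CONTAINER_URI")
    if not uri:
        raise RuntimeError("the `CONTAINER_URI` environment variable must be set")
    return uri
```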
…ia-smi`

Those dependencies were not needed, were not actively maintained, and added extra complexity; instead, they have been replaced with `subprocess` running `nvidia-smi`.
- The TEI condition on the container port was reversed
- `gpu_available` raises an exception, instead of relying on the `returncode`, if the command doesn't exist
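For reference, a minimal sketch of what such a helper could look like; the exact shape in the PR may differ:

```python
import subprocess


def gpu_available() -> bool:
    # If `nvidia-smi` is not installed, `subprocess.run` raises
    # `FileNotFoundError` instead of hiding the failure behind a
    # non-zero return code.
    process = subprocess.run(
        ["nvidia-smi"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return process.returncode == 0
```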
In most cases, splitting those is for the best and reduces execution time: we tend to update the DLCs one at a time, so it's unlikely for all the containers to change at once.

Pros: easier to manage, more granular, no need for extra `docker pull`s; it just runs what's modified.

Cons: when modifying a bunch of tests it will be slower, as a `docker pull` needs to be done per test since the instances are ephemeral.
The `type: choice` with `options` is only supported for `workflow_dispatch`, i.e. when triggering the GitHub Action manually; not via `workflow_call`, i.e. when the workflow is reused from another workflow.