
Add integration tests for PyTorch, TGI and TEI DLCs #79

Open · wants to merge 82 commits into main
Conversation

@alvarobartt (Member) commented Aug 30, 2024

Description

This PR adds some integration tests for the following Hugging Face DLCs on Google Cloud:

  • TGI only on GPU
  • TEI on both CPU and GPU
  • PyTorch Inference on both CPU and GPU
  • PyTorch Training only on GPU

The inference-related tests exercise different alternatives, and also emulate the Vertex AI environment via the `AIP_` environment variables that Vertex AI exposes and that the Hugging Face DLCs on Google Cloud handle for a seamless integration.
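As an illustration, here is a minimal sketch of that emulation, assuming the `docker` Python SDK; the variable values are illustrative, and the actual tests in this PR may wire things differently:

```python
import docker

client = docker.from_env()

# Vertex AI exposes `AIP_`-prefixed variables to custom containers; the
# Hugging Face DLCs read them to configure the port, routes, and model
# location. All values below are illustrative.
aip_env = {
    "AIP_HTTP_PORT": "8080",
    "AIP_PREDICT_ROUTE": "/predict",
    "AIP_HEALTH_ROUTE": "/health",
    "AIP_STORAGE_URI": "gs://example-bucket/path/to/model",
}

container = client.containers.run(
    "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311",
    environment=aip_env,
    ports={"8080/tcp": 8080},
    detach=True,
)
```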

As it will be reused within the TGI and TEI tests
Pass args via `text_generation_launcher_kwargs` and include the Vertex AI environment emulation via the `AIP_` environment variables.
@alvarobartt changed the title from "[TESTS] Add some integration tests (WIP)" to "Add integration tests for PyTorch, TGI and TEI DLCs" on Sep 2, 2024
@philschmid (Member) left a comment


Great work! Added some minor comments.

Comment on lines 37 to 46:

    - name: Set up uv
      run: |
        curl -LsSf https://astral.sh/uv/install.sh | sh
        export PATH=$HOME/.cargo/bin:$PATH
        uv --version

    - name: Install dependencies
      run: |
        uv venv --python 3.10
        uv pip install -r tests/requirements.txt
Member: should we add a "cache"?

Member (Author): AFAIK the VMs are ephemeral, so the cache would be destroyed after each job is done, and uv is already pretty fast (it downloads those in under 10 seconds).

Comment on lines 37 to 39:

    training-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.transformers.4-42.ubuntu2204.py310
    inference-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311
    tgi-dlc: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
Member: Mhm, is there a better way to specify those? It feels like we could easily forget to update them.

Outdated review threads, resolved:

  • tests/pytorch/training/test_trl.py (×2)
  • tests/requirements.txt (×2)
  • tests/tei/test_tei.py (×2)
  • tests/tgi/test_tgi.py
- Capture `container_uri` from an environment variable before running the tests, and remove the default value to prevent issues when testing
- Remove `max_train_epochs=-1` since it is not required when `max_steps` is already specified
- Rename `test_transformers` to `test_huggingface_inference_toolkit`
- Remove the `transformers` and `jinja2` dependencies as not required, as well as the `AutoTokenizer` usage for prompt formatting

Co-authored-by: Philipp Schmid <[email protected]>
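As a sketch, capturing `container_uri` from the environment without a default could look like the following pytest fixture; the `CONTAINER_URI` variable name and the fixture shape are assumptions, not necessarily what the PR implements:

```python
import os

import pytest


@pytest.fixture(scope="session")
def container_uri() -> str:
    # Fail fast when the environment variable is missing instead of
    # silently falling back to a potentially stale default image.
    uri = os.getenv("CONTAINER_URI")
    if not uri:
        raise RuntimeError("the `CONTAINER_URI` environment variable must be set")
    return uri
```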
…ia-smi`

Those dependencies were not needed, were not actively maintained, and added extra complexity; instead, they have been replaced with `subprocess` running `nvidia-smi`.
- The TEI condition on the container port was reversed
- `gpu_available` raises an exception, instead of relying on the `returncode`, if the command doesn't exist
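For reference, a minimal sketch of what such a helper could look like; the exact shape in the PR may differ:

```python
import subprocess


def gpu_available() -> bool:
    # If `nvidia-smi` is not installed, `subprocess.run` raises
    # `FileNotFoundError` instead of hiding the failure behind a
    # non-zero return code.
    process = subprocess.run(
        ["nvidia-smi"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return process.returncode == 0
```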
In most cases, splitting those is for the best and reduces execution time: we tend to update the DLCs one at a time, so it's unlikely for all the containers to change at once.

Pros: easier to manage, more granular, no need for extra `docker pull`s; it just runs what's modified.

Cons: when modifying a bunch of tests it will be slower, as a `docker pull` needs to be done per test since the instances are ephemeral.
The `type: choice` with `options` is only supported for `workflow_dispatch`, i.e. when triggering the GitHub Action manually; not via `workflow_call`, i.e. when the workflow is reused from another workflow.