Minimum changes to make CI workflows work again #814

mhucka · 2024-11-30T16:45:59Z

The aim of this PR is to stop the current CI workflow failures in ci.yaml and cirq_compatibility.yaml (both in .github/workflows/).

In the case of the failure in ci.yaml, the change is only a stopgap measure: it disables the address sanitizer tests. The failures happen when the workflow runners are updated from Ubuntu 16.04 to 20.04; this update is necessary because GitHub no longer offers the Ubuntu 16 runners.

After spending a ridiculous amount of time testing various combinations of TensorFlow, TensorFlow Quantum, and compiler toolchains on a more recent Linux, my conclusion is that the ASAN failures stem from differences between the toolchain used to produce the copy of TensorFlow 2.15.0 we get from PyPI and the toolchain we get under Ubuntu 20.04 when compiling TFQ on GitHub. This conclusion comes from the fact that if I build a local copy of TensorFlow 2.15.0 and then build TFQ against it, using Clang for everything, the ASAN failures go away.

Given that we can't build TensorFlow as part of this workflow (it takes 2 hours to build using 24 cores on a fast machine), it's not clear what can be done to resolve the ASAN failures correctly. So, I'm temporarily commenting out the leak tests in this workflow so that we can proceed with other updates and release a new version of TFQ. However, this needs to be revisited at some point.
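For readers unfamiliar with the workflow files, the shape of the stopgap is roughly as follows. This is only an illustrative sketch, not the actual contents of ci.yaml: the job name, step name, and checkout action version are assumptions, and the script path is the one named in the commit message.

```yaml
# Hypothetical excerpt of a workflow under .github/workflows/.
# Job and step names are illustrative assumptions, not copied from the repository.
jobs:
  leak-tests:
    # Previously ubuntu-16.04, which GitHub no longer offers.
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v4
      # Temporarily disabled: the ASAN tests fail on the Ubuntu 20.04 runner.
      # Revisit once the toolchain mismatch with the PyPI TensorFlow wheel is understood.
      # - name: Leak tests (ASAN)
      #   run: ./scripts/msan_test.sh
```

Commenting the step out, rather than deleting it, keeps the intended test visible in the file so that it is easy to restore later.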

Ubuntu 16.04 is no longer supported by GitHub. Updated the runner to use Ubuntu 20.04.

The current failures in the Cirq compatibility CI workflow are limited to the Address Sanitizer (ASAN) tests in `scripts/msan_test.sh`. They started happening only when we updated the version of Linux used by the workflow from Ubuntu 16.04 to 20.04, because GitHub no longer offers the Ubuntu 16 runners.

After spending a ridiculous amount of time testing various combinations of TensorFlow, TensorFlow Quantum, and compiler toolchains on a more recent Linux, my conclusion is that the ASAN failures stem from differences between the toolchain used to produce the copy of TensorFlow 2.15.0 we get from PyPI and the current toolchain used to compile TFQ on GitHub. This conclusion comes from the fact that if I build a local copy of TensorFlow and then build TFQ against it, using Clang for everything, the ASAN failures go away.

Given that we can't build TensorFlow as part of this workflow (it takes 2 hours to build using 24 cores on a fast machine), it's not clear what can be done to stop the ASAN failures. I'm temporarily commenting out the leak tests in this workflow so that we can proceed with other updates and release a new version of TFQ. However, this needs to be revisited at some point.

mhucka added 2 commits November 30, 2024 16:15

Update ubuntu version used for runner

3edf568

mhucka self-assigned this Nov 30, 2024

mhucka requested a review from MichaelBroughton November 30, 2024 23:28

mhucka marked this pull request as ready for review November 30, 2024 23:30

MichaelBroughton approved these changes Dec 2, 2024

MichaelBroughton merged commit 605d282 into tensorflow:master Dec 2, 2024
6 of 7 checks passed

mhucka deleted the mhucka-stopgap-ci-fixes branch December 3, 2024 23:51

mhucka restored the mhucka-stopgap-ci-fixes branch December 4, 2024 00:04

mhucka added the area/ci label (Concerns continuous integration workflows and infrastructure) Dec 4, 2024
