Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimum changes to make CI workflows work again #814

Merged
merged 2 commits into from
Dec 2, 2024

Conversation

mhucka
Copy link
Member

@mhucka mhucka commented Nov 30, 2024

The aim of this PR is to stop the current CI workflow failures in ci.yaml and cirq_compatibility.yaml (both in .github/workflows/).

In the case of the failure in ci.yaml, the change is only a stopgap measure: it disables the address sanitizer tests. The failures happen when the workflow runners are updated from Ubuntu 16.04 to 20.04; this update is necessary because GitHub no longer offers the Ubuntu 16 runners.

After spending a ridiculous amount of time testing various combinations of TensorFlow, TensorFlow Quantum, and compiler toolchains on a more recent Linux, my conclusion is that the ASAN failures stem from differences in the toolchains used to produce the copy of TensorFlow 2.15.0 we get from PyPI, and what we get under Ubuntu 20 when compiling TFQ on GitHub. This conclusion comes from the fact if I build a local copy of TensorFlow 2.15.0, and then build TFQ against that, using Clang for everything, the ASAN failures go away.

Given that we can't build TensorFlow as part of this workflow (it takes 2 hours to build using 24 cores on a fast machine), it's not clear what can be done to resolve the ASAN failures correctly. So, I'm temporarily commenting out the leak tests in this workflow so that we can proceed on doing other updates and releasing a new version of TFQ. However, this needs to be revisited at some point.

Ubuntu 16.04 is no longer supported by GitHub. Updated the runner to
use Ubuntu 20.04.
The current failures in the Cirq compatibility CI workflow are limited
to the Address Sanitizer (ASAN) tests in `scripts/msan_test.sh`. They
started happening only when we updated the version of Linux used by
the workflow from Ubuntu 16.04 to 20.04, because GitHub no longer
offers the Ubuntu 16 runners.

After spending a ridiculous amount of time testing various
combinations of TensorFlow, TensorFlow Quantum, and compiler
toolchains on a more recent Linux, my conclusion is that the ASAN
failures stem from differences in the toolchains used to produce the
copy of TensorFlow 2.15.0 we get from PyPI, and the current toolchain
used to compile TFQ on GitHub. This conclusion comes from the fact if
I build a local copy of TensorFlow, and then build TFQ against that,
using Clang for everything, the ASAN failures go away.

Given that we can't build TensorFlow as part of this workflow (it
takes 2 hours to build using 24-cores on a fast machine), it's not
clear what can be done to stop the ASAN failures.

I'm temporarily commenting out the leak tests in this workflow so that
we can proceed on doing other updates and releasing a new version of
TFQ. However, this needs to be revisited at some point.
@mhucka mhucka self-assigned this Nov 30, 2024
@mhucka mhucka marked this pull request as ready for review November 30, 2024 23:30
@MichaelBroughton MichaelBroughton merged commit 605d282 into tensorflow:master Dec 2, 2024
6 of 7 checks passed
@mhucka mhucka deleted the mhucka-stopgap-ci-fixes branch December 3, 2024 23:51
@mhucka mhucka restored the mhucka-stopgap-ci-fixes branch December 4, 2024 00:04
@mhucka mhucka added the area/ci Concerns continuous integration workflows and infrastructure label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ci Concerns continuous integration workflows and infrastructure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants