Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows CI keeps failing (flaky) #520

Closed
jku opened this issue Feb 26, 2023 · 6 comments
Closed

Windows CI keeps failing (flaky) #520

jku opened this issue Feb 26, 2023 · 6 comments
Labels

Comments

@jku
Copy link
Collaborator

jku commented Feb 26, 2023

At least three failures since the CI was enabled on windows, all look roughly like this:

y: commands[1]> python -c "import os; os.environ[\"PYKCS11LIB\"]"
py: commands[2]> coverage run tests/aggregate_tests.py
................................Exporting master key '8465a1e2e0fb2b40adb2478e18fb3f537e0c8a17' including subkeys '6a112fd3390b2e53afc2e57f8fc8e12099aeceea, c5a0abe6ec19d0d65f85e2c39be9df5131d924e9' for passed keyid 'C5A0ABE6EC19D0D65F85E2C39BE9DF5131D924E9'.
..........................E........................s..........................Exporting master key '8465a1e2e0fb2b40adb2478e18fb3f537e0c8a17' including subkeys '6a112fd3390b2e53afc2e57f8fc8e12099aeceea, c5a0abe6ec19d0d65f85e2c39be9df5131d924e9' for passed keyid 'C5A0ABE6EC19D0D65F85E2C39BE9DF5131D924E9'.
...................s............
LoadLibrary() failed with error 126: The specified module could not be found.

======================================================================
ERROR: setUpClass (tests.test_hsm_signer.TestHSM)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\securesystemslib\securesystemslib\tests\test_hsm_signer.py", line 113, in setUpClass
    lib = PYKCS11LIB()
  File "D:\a\securesystemslib\securesystemslib\securesystemslib\signer\_hsm_signer.py", line 76, in PYKCS11LIB
    _PYKCS11LIB.load()
  File "D:\a\securesystemslib\securesystemslib\.tox\py\lib\site-packages\PyKCS11\__init__.py", line 488, in load
    raise PyKCS11Error(rv, pkcs11dll_filename)
PyKCS11.PyKCS11Error: Load (C:\SoftHSM2\lib\softhsm2-x64.dll)
  • "LoadLibrary() failed" line seems to jump around so this could be a timing issue.
  • happened on at least python 3.7 and 3.10

https://github.com/secure-systems-lab/securesystemslib/actions/runs/4240908943/jobs/7370472661

@jku jku added the bug label Feb 26, 2023
@lukpueh
Copy link
Member

lukpueh commented Feb 27, 2023

Thanks for reporting!

happened on at least python 3.7 and 3.10

I've seen it on 3.9 as well.

@lukpueh
Copy link
Member

lukpueh commented Mar 2, 2023

I hope it's okay to make this a catch all flaky tests on Windows ticket... Here are two more issues:

py: commands[0]> python -m tests.check_gpg_available
F
======================================================================
FAIL: test_gpg_available (__main__.TestGpgAvailable)
Test that GPG is available.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\securesystemslib\securesystemslib\tests\check_gpg_available.py", line 39, in test_gpg_available
py: exit 1 ([15](https://github.com/secure-systems-lab/securesystemslib/actions/runs/4281618767/jobs/7454863852#step:6:16).45 seconds) D:\a\securesystemslib\securesystemslib> python -m tests.check_gpg_available pid=2508
    self.assertTrue(securesystemslib.gpg.constants.have_gpg())
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 10.520s
  py: FAIL code 1 (77.[17](https://github.com/secure-systems-lab/securesystemslib/actions/runs/4281618767/jobs/7454863852#step:6:18)=setup[61.72]+cmd[15.45] seconds)

https://github.com/secure-systems-lab/securesystemslib/actions/runs/4281618767/jobs/7454863852

@lukpueh
Copy link
Member

lukpueh commented Mar 2, 2023

py: commands[2]> coverage run tests/aggregate_tests.py
...............................EExporting master key '8465a1e2e0fb2b40adb2478e18fb3f537e0c8a17' including subkeys '6a112fd3390b2e53afc2e57f8fc8e12099aeceea, c5a0abe6ec19d0d65f85e2c39be9df5131d924e9' for passed keyid 'C5A0ABE6EC19D0D65F85E2C39BE9DF5131D924E9'.
...E................................................s..........................Exporting master key '8465a1e2e0fb2b40adb2478e18fb3f537e0c8a17' including subkeys '6a112fd3390b2e53afc2e57f8fc8e12099aeceea, c5a0abe6ec19d0d65f85e2c39be9df5131d924e9' for passed keyid 'C5A0ABE6EC19D0D65F85E2C39BE9DF5131D924E9'.
...................s............
======================================================================
ERROR: test_create_signature_with_expired_key (tests.test_gpg.TestGPGRSA)
Test signing with expired key raises gpg CommandError.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\securesystemslib\securesystemslib\tests\test_gpg.py", line 712, in test_create_signature_with_expired_key
    create_signature(
  File "D:\a\securesystemslib\securesystemslib\securesystemslib\gpg\functions.py", line 129, in create_signature
    gpg_process = subprocess.run(  # nosec
  File "C:\hostedtoolcache\windows\Python\3.10.10\x64\lib\subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "C:\hostedtoolcache\windows\Python\3.10.10\x64\lib\subprocess.py", line 1154, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "C:\hostedtoolcache\windows\Python\3.10.10\x64\lib\subprocess.py", line 1530, in _communicate
    raise TimeoutExpired(self.args, orig_timeout)
subprocess.TimeoutExpired: Command '['gpg', '--detach-sign', '--digest-algo', 'SHA256', '--local-user', 'E8AC80C9[24](https://github.com/secure-systems-lab/securesystemslib/actions/runs/4281618767/jobs/7454864323#step:6:25)116DABB51D4B987CB07D6D2C199C7C', '--homedir', 'rsa']' timed out after 10 seconds

https://github.com/secure-systems-lab/securesystemslib/actions/runs/4281618767/jobs/7454864323

@lukpueh
Copy link
Member

lukpueh commented Mar 14, 2023

I think these two gpg related failures are actually both timeouts, given that have_gpg() also just returns False on timeout. A logging statement would be helpful there.

@jku
Copy link
Collaborator Author

jku commented Mar 14, 2023

Looking at test runs on main branch, 10-15% currently pass completely, almost 90% fail in one or more windows builds.

Should we (partially?) disable windows tests on CI again if flakiness is at this level?

(unless you're planning to invest time into this in near future -- then it makes sense to log as much as possible)

@lukpueh
Copy link
Member

lukpueh commented Mar 20, 2023

#542 works around the HSM LoadLibrary() failure, by disabling HSM tests on Windows, AND likely fixes the gpg timeout failures by increasing gpg subprocess timeout (although only time can prove if a flaky test is indeed fixed)

FWIW, I ssh'ed into a runner that had timed out on a gpg test and was able to successfully run the same test manually. This is no prove either, but at least suggests that the failures were actually timeouts.

Let's close here with #542 and open a new ticket that requests re-instating HSM tests on Windows, including an actual fix for the documented problem.

Btw. I briefly considered switching to the SoftHSM2-for-Windows "portable" ZIP archive instead of using the installer from chocolatey, because this is what the author of PyKCS11 uses. But then I saw that they have the same error message in their build logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants