-
-
Notifications
You must be signed in to change notification settings - Fork 30.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-108342: Make ssl TestPreHandshakeClose more reliable #108370
Conversation
Non-Linux platforms have this issue because of the
Without this change, test_ssl fails with "env changed" on Windows (on an idle machine):
With this change, the test pass. I tested my change by stressing my Windows VM:
I cannot reproduce the issue with this method on Windows anymore with this change. |
This issue also impacts Linux machines, see buildbot failures starting at this comment: #108344 (comment) |
I plan to schedule buildbot jobs on this PR to see if my fix works as expected. I tried to write the smallest change to fix the issue, without putting |
On Windows x64, test_ssl failed with ENV CHANGED, but it's a different test (unrelated to this PR):
|
Oh, on "AMD64 Fedora Stable Clang Installed PR", the bug is still there :-(
Oh. But here, the bug occurred in test_https_client_non_tls_response_ignored(). |
* In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * Replace socket.send() with socket.sendall() * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0].
It also still fails on Windows: |
ff74fd7
to
51e1e09
Compare
Maybe let's not merge this until it really fixes the issue now? |
Yeah, I didn't touch test_https_client_non_tls_response_ignored() whereas this test was also fragile :-/ I rewrote my PR to make the 3 tests of TestPreHandshakeClose more reliable. The overall change is more complicated than expected, so I fixed multiple issues (more or less important). |
I stress-tested the updated PR (commit 51e1e09) on the 3 tests on Linux on Windows. Linux:
Windows:
Results:
|
I don't plan to merge before I'm sure that the test is very reliable :-) |
@@ -4847,28 +4879,33 @@ def call_after_accept(conn_to_client): | |||
server_responding.set() | |||
return True # Tell the server to stop. | |||
|
|||
timeout = 2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is set to 2.0
, all the changes to generalize the timeout in SingleConnectionTestServerThread
aren't really used?
Can we at least set this to, say, 5.0
to give it some breathing room compared to the previous value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is set to 2.0, all the changes to generalize the timeout in SingleConnectionTestServerThread aren't really used?
I increased the timeout from 2 seconds to SHORT_TIMEOUT (at least 30 seconds) in the 2 other tests to make these tests more reliable (2 seconds may be too short on a busy system).
Can we at least set this to, say, 5.0 to give it some breathing room compared to the previous value?
On Windows, the test takes timeout * 2
seconds (4 seconds) to complete :-( I don't understand why the client doesn't fail with a timeout error as soon as the server closes its listener connection!? Right now, I prefer to fix the known issues of dangling threads before attempting to change this fragile timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's also fair to just skip the test on windows if reliability is platform specific. the primary reproducer only happens on Linux anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior on Windows is very different than on Linux. On Windows, test_https_client_non_tls_response_ignored() takes 4 seconds (server times out, then client times out). On Linux, it completes in 100 ms.
I didn't spend too much time to try to understand why/how. I prefer to continue running the test on Windows, unless there is a good reason to no do so. With my change, the test is reliable on Windows.
The challenge with these is that having an environment that the failure can be reliably reproduced in is hard. So "fixed" isn't well defined. I'm satisfied that this PR improves things and that Victor has found reproducers that appear to be healthier. But a concrete "fixed" can really only be declared after weeks of diverse buildbot data even when we're 99% confident "this is the last one for real this time". It'd be fine to do the releases without this or the related preceeding test refleak fix PR in FWIW. Our refleaks checks are pedantry we impose upon ourselves that also catch things that are not actual CPython bugs as is the case here. I found and fixed a one reference cycle from the test that I saw come up during initial development. I had no way to notice or find others given I couldn't use public buildbots and I don't maintain a pile of unusually overloaded diverse OS VMs. |
For me, it was quite easy to reproduce the 2 issues on Windows, in a few seconds:
Moreover, on Linux, using |
I updated my PR to address @gpshead's review. @gpshead @ambv: I propose that you review and approve it, and then merge it, rather than waiting for a 3rd batch of buildbot jobs on this PR. We can use the regular buildbot process: merge the PR and then watch buildbots. Then only add backport labels to this PR once enough buildbots completed successfully. |
OK, let me just run this PR branch without Greg's original fix to see if the changed tests still trigger the issue. |
Yup, it still works:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the https cllient test still proves flaky, disabling or removing it is fine by me. (it wasn't a regression test)
Buildbots are more happy with this change. Only one Windows Refleak buildbot still fails, just because it didn't run with the new code yet. The build 494 is running and it looks good as well ("test_ssl passed")! |
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
GH-108404 is a backport of this pull request to the 3.12 branch. |
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
GH-108405 is a backport of this pull request to the 3.11 branch. |
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
GH-108406 is a backport of this pull request to the 3.10 branch. |
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
GH-108407 is a backport of this pull request to the 3.9 branch. |
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
GH-108408 is a backport of this pull request to the 3.8 branch. |
…8370) (#108404) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
…8370) (#108405) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
…8370) (#108406) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
) (#108407) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
) (#108408) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb)
…ythonGH-108370) (python#108407) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <[email protected]>
invoving SingleConnectionTestServerThread to make sure that the
thread is deleted. Otherwise, the test marks the environment as
altered because the threading module sees a "dangling thread"
(SingleConnectionTestServerThread). This test leak was introduced
by the test added for the fix of issue CVE-2023-40217: Bypass TLS handshake on closed sockets #108310.
timeout.
test_preauth_data_to_tls_client(): the server now waits until the
client connect() completed in call_after_accept().
explicitly.