Commit
Update flaky test script to print more details and detect flaky exceptions and timeouts (envoyproxy#14731)

Update the flaky test script to print more details and detect flaky, unexpected test errors like exceptions and timeouts, with the goal of making the notifications more actionable.

Signed-off-by: Randy Miller <[email protected]>

Additional Description:
The flaky test script, ci/flaky/process_xml.py, is executed on every CI run and delivers a notification to the Slack channel "test-flaky" if there are any flaky test failures. Those notifications aren't as useful as they could be, for a number of reasons:

1) Direct links to the CI run, the related commit, and the related PR are not included.
2) There's no indication of which stage or job experienced the flake.
3) The notifications are not uniformly formatted, so they can be a bit hard to read.
4) Some notifications do not include any information about the flake(s).

The goal of this PR is to make these flaky test notifications more actionable by addressing the four points above. Below is what a notification would look like should this PR get merged. The last 2 flakes are not captured at all by the current state of the script, as those flakes are unexpected test "errors" (e.g., exceptions or timeouts) rather than test "failures" (e.g., a failed test assert).

```
Target: bazel.release
Stage: Windows release
Pull request: envoyproxy#14665
Commmit: envoyproxy@f1184f2
CI results: https://dev.azure.com/cncf/envoy/_build/results?buildId=63454

Origin: https://github.com/rmiller14/envoy
Upstream: https://github.com/envoyproxy/envoy
Latest ref: heads/flaky_test_script

Last commit:
commit f1184f2d74d052942f7484beecf98d7cfde137e0
Author: Randy Miller <[email protected]>
Date: Fri Jan 15 00:58:51 2021 -0800

Update flaky test script to print more actionable details as well as detect flaky, unexpected test errors like exceptions and timeouts.

Signed-off-by: Randy Miller <[email protected]>
---------------------------------------------------------------------------------------------------
Test flake details:
- Test suite: IpVersions/DnsImplTest
- Test case: LocalLookup/IPv4
- Log path: C:/_eb/_bazel_LocalAdmin/sonr4fdz/external/envoy/bazel-testlogs/test/common/network/dns_impl_test/test_attempts/attempt_1.log
- Details:
test/common/network/dns_impl_test.cc:609
Expected equality of these values:
  nullptr
    Which is: NULL
  resolveWithExpectations("localhost", DnsLookupFamily::V4Only, DnsResolver::ResolutionStatus::Success, {"127.0.0.1"}, {"::1"}, absl::nullopt)
    Which is: 0000017500F82190
Stack trace:
  00007FF69688B586: (unknown)
  00007FF6968A4468: (unknown)
  00007FF6968A462D: (unknown)
  00007FF6968A508D: (unknown)
... Google Test internal frames ...
---------------------------------------------------------------------------------------------------
Test flake details:
- Test suite: ThriftConnManagerIntegrationTest
- Test case: IDLException/HeaderCompact
- Log path: C:/_eb/_bazel_LocalAdmin/sonr4fdz/external/envoy/bazel-testlogs/test/extensions/filters/network/thrift_proxy/integration_test/shard_1_of_4/test_attempts/attempt_1.log
- Error: Exited with error code 3 (No such process)
- Relevant snippet:
Traceback (most recent call last):
  File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\envoy\test\extensions\filters\network\thrift_proxy\driver\server.py", line 232, in <module>
    main(cfg)
  File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\envoy\test\extensions\filters\network\thrift_proxy\driver\server.py", line 175, in main
    server.serve()
  File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\thrift_pip3_pypi__thrift_0_13_0\thrift\server\TServer.py", line 121, in serve
    self.serverTransport.listen()
  File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\thrift_pip3_pypi__thrift_0_13_0\thrift\transport\TSocket.py", line 208, in listen
    self.handle.bind(res[4])
OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions
Could not connect to any of [('127.0.0.1', 50670)]
Unhandled Thrift Exception: Could not connect to any of [('127.0.0.1', 50670)]
C:/envoy/test/extensions/filters/network/thrift_proxy/driver/generate_fixture.sh: line 1: kill: (1819) - No such process
Failed bash -c "PYTHONPATH=$(dirname C:/envoy/test/extensions/filters/network/thrift_proxy/driver/generate_fixture.sh) C:/envoy/test/extensions/filters/network/thrift_proxy/driver/generate_fixture.sh idl-exception header compact -H x-header-1=x-value-1,x-header-2=0.6,x-header-3=150,x-header-4=user_id:10,x-header-5=garbage_asdf -T C:/_eb/execroot/envoy/_tmp/2540819d34883b5a5d1e62d549fbcdeb execute "
[2021-01-15 11:35:27.958][5624][critical][assert] [test/test_common/environment.cc:414] assert failure: false.
---------------------------------------------------------------------------------------------------
Test flake details:
- Test suite: TcpProxyIntegrationTest
- Test case: TestCloseOnHealthFailure/IPv6_OriginalConnPool
- Log path: C:/_eb/_bazel_LocalAdmin/sonr4fdz/external/envoy/bazel-testlogs/test/integration/tcp_proxy_integration_test/shard_1_of_2/test_attempts/attempt_1.log
- Error: Exited with error code 142 (Unknown error)
- Note: This error is likely a timeout (test duration == 300, a well known timeout value).
- Last 1 line(s):
[ RUN ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestCloseOnHealthFailure/IPv6_OriginalConnPool
---------------------------------------------------------------------------------------------------
```

Risk Level: N/A for code/test, low for the flaky test script due to the amount of churn.
Testing: Ran locally many, many times. For a portion of those runs, I treated normal failures as flakes to get better coverage on the parsing helpers. Not sure how to test the changes to bazel.yml though.

Signed-off-by: Randy Miller <[email protected]>
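
For readers unfamiliar with the failure-versus-error distinction drawn in the description, here is a minimal sketch of how it might be detected. It assumes JUnit-style test XML in which a failed assertion appears as a `<failure>` child of `<testcase>` and an unexpected exit (exception, timeout) appears as an `<error>` child. The names `classify_testcases` and `KNOWN_TIMEOUTS` and the sample XML are hypothetical; none of this is taken from ci/flaky/process_xml.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of well-known timeout values in seconds. 300 is the only
# value mentioned in the commit message; any others would be assumptions.
KNOWN_TIMEOUTS = {300}


def classify_testcases(xml_text):
    """Return one dict per failed or errored test case in JUnit-style XML."""
    root = ET.fromstring(xml_text)
    results = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            duration = float(case.get("time", "0"))
            results.append({
                "suite": case.get("classname", ""),
                "case": case.get("name", ""),
                "kind": kind,
                "message": node.get("message", ""),
                # An error whose duration equals a known timeout value is
                # flagged as a probable timeout.
                "likely_timeout": kind == "error" and int(duration) in KNOWN_TIMEOUTS,
            })
    return results


# Hypothetical sample input: one assertion failure, one probable timeout,
# and one passing test.
SAMPLE = """<testsuites>
  <testsuite name="demo">
    <testcase classname="SuiteA" name="CaseOne" time="0.5">
      <failure message="Expected equality of these values"/>
    </testcase>
    <testcase classname="SuiteB" name="CaseTwo" time="300">
      <error message="Exited with error code 142"/>
    </testcase>
    <testcase classname="SuiteC" name="CaseThree" time="1.0"/>
  </testsuite>
</testsuites>"""

for flake in classify_testcases(SAMPLE):
    print(flake["suite"], flake["case"], flake["kind"], flake["likely_timeout"])
```

A script that inspects only `<failure>` elements would miss the second test case entirely, which matches the gap the description says the last 2 flakes fall into.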