CI Failure (critical check uploaded.size() != 0 has failed) in test_archival_service_rpfixture
Failing test: test_archival_service_rpfixture.test_manifest_spillover
#13275
Comments
@Lazin was looking into this?
#12726: same issue
It was previously possible for the archiver to be started before the Raft term had been confirmed, in which case the subsequent spillover exited early because the archiver wasn't synced yet. This commit fixes that by syncing the manifest first. Fixes redpanda-data#13275
Reopening, as this issue was seen a few times with recent work to update Seastar. I don't believe these changes are related to the test failures: https://buildkite.com/redpanda/vtools/builds/13472#018f4e3c-aac9-470b-97c2-1e3a8a38ee81
Looking at that build, I'm seeing:
Looks like there was an HTTP client connection timeout when uploading the partition manifest, followed by many failed attempts to connect the HTTP client and list from the cluster metadata uploader, and finally the test fails because the partition manifest upload failed. I briefly chatted with Rob and he said he'd seen this test pass with the Seastar v24.2.x branch and wasn't able to reproduce this locally with the v24.2.x upgrade, so it's not immediately clear that the v24.2.x changes caused this (but they may have... not sure).
@andrwng Super strange that this cannot be replicated locally though, only on CI.
To me it looks like our timeout is hit too early somehow. It is the one defined at redpanda/src/v/net/transport.cc, line 19 (commit a5234c8).
Reproduced by running:

I introduced a long sleep just before the test teardown and verified that the imposter listener is up after a failed run. It also does accept connections fine.

Heavy-duty logging:
If I change

```diff
     return ss::with_timeout(timeout, std::move(f))
-      .handle_exception([socket, address, log](const std::exception_ptr& e) {
-          log->trace("error connecting to {} - {}", address, e);
-          socket->shutdown();
-          return ss::make_exception_future<ss::connected_socket>(e);
-      });
+      .handle_exception(
+        [socket, address, log, timeout](const std::exception_ptr& e) {
+            std::cout << "VVV time in exception handler: "
+                      << (seastar::lowres_clock::now().time_since_epoch())
+                      << "; deadline=" << timeout.time_since_epoch()
+                      << "; port: " << address << std::endl;
+            log->trace("error connecting to {} - {}", address, e);
+            socket->shutdown();
+            return ss::make_exception_future<ss::connected_socket>(e);
+        });
```

it logs something like
This indicates that the current time is 967ms past the deadline (both values are machine uptime in ms).
A difference of 0.015 seconds again. Oversubscribed reactor?
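For context, here is a minimal, self-contained sketch (my own toy program, not Redpanda code, with the stall simulated deliberately) of how an oversubscribed or stalled reactor produces exactly this symptom: the `ss::with_timeout` deadline itself is reasonable, but by the time the exception handler actually runs, `lowres_clock::now()` is already far past it.

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/with_timeout.hh>

#include <chrono>
#include <iostream>
#include <thread>

using namespace std::chrono_literals;

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Arm a 100ms timeout against lowres_clock, guarding a future that
        // will not complete in time.
        auto deadline = seastar::lowres_clock::now() + 100ms;
        auto f = seastar::with_timeout(deadline, seastar::sleep(10s));

        // Simulate an oversubscribed/stalled reactor: block the reactor
        // thread so no timers or continuations can run for a full second.
        std::this_thread::sleep_for(1s);

        // By the time the timeout exception reaches us, now() is roughly a
        // second past the deadline, much like the "967ms past the deadline"
        // observation above.
        return f.handle_exception([deadline](std::exception_ptr) {
            auto late = seastar::lowres_clock::now() - deadline;
            std::cout << "handler ran "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(
                           late).count()
                      << "ms past the deadline\n";
        });
    });
}
```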
Adding extra logging like:

The lowres clock drifts too much at times and the timeouts don't make much sense anymore.
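One plausible reading of that "drift" is that `lowres_clock` is only refreshed periodically by the reactor, so whenever the reactor can't run, the lowres reading falls behind real time. A small harness of my own (not from this issue) to check that hypothesis, assuming you can run it alongside the suspect workload:

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/sleep.hh>

#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

// Compare how much time each clock thinks has elapsed over the same interval.
// A reactor stall shows up as the lowres elapsed time lagging steady_clock.
seastar::future<> watch_clock_drift() {
    const auto steady_start = std::chrono::steady_clock::now();
    const auto lowres_start = seastar::lowres_clock::now();
    for (int i = 0; i < 10; ++i) {
        co_await seastar::sleep(1s);
        auto steady_elapsed = std::chrono::steady_clock::now() - steady_start;
        auto lowres_elapsed = seastar::lowres_clock::now() - lowres_start;
        auto lag = std::chrono::duration_cast<std::chrono::milliseconds>(
          steady_elapsed - lowres_elapsed);
        std::cout << "lowres_clock lags steady_clock by " << lag.count()
                  << "ms\n";
    }
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] { return watch_clock_drift(); });
}
```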
Frequency seems to be up after the Seastar update: https://buildkite.com/redpanda/redpanda/builds/48911#018f5fd7-7ab7-46e3-a790-3cbe98dc38a9
@nvartolomei: could you help chase this? It's failing often given the Seastar changes brought in.
Adding a segment takes ~1ms. Adding multiple of them in a single batch can stall the reactor for a while. This, together with a recent Seastar change[^1], caused some timeouts to fire very frequently[^2]. Fix the test by breaking the batch passed to add_segments into smaller batches so that we execute finer-grained tasks and yield more often to the scheduler. [^1]: scylladb/seastar#2238 [^2]: redpanda-data#13275
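A hedged sketch of the idea behind that fix (the `segment_meta`/`manifest` types and the `add_segment_batch` helper below are illustrative placeholders, not Redpanda's actual API): add the segments in small chunks and yield between chunks, so the reactor can run timers and other tasks instead of stalling for the whole batch.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/coroutine/maybe_yield.hh>

#include <cstddef>
#include <vector>

struct segment_meta {};            // placeholder for real segment metadata
struct manifest {
    void add(const segment_meta&) { /* ~1ms of CPU work per segment */ }
};

// Add segments in small chunks, yielding between chunks so timers (and
// therefore timeouts) keep firing close to schedule.
seastar::future<>
add_segment_batch(manifest& m, const std::vector<segment_meta>& segments) {
    constexpr std::size_t chunk_size = 128; // keep each task short
    std::size_t in_chunk = 0;
    for (const auto& s : segments) {
        m.add(s);
        if (++in_chunk == chunk_size) {
            in_chunk = 0;
            // Suspends only if the task quota is already exhausted;
            // otherwise it is a no-op, so the common case stays cheap.
            co_await seastar::coroutine::maybe_yield();
        }
    }
}
```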
I spotted that in some rpfixture tests the cluster metadata uploader is running with an incorrect port[^1]. Besides logging lots of connection-refused errors, I found that it adds non-negligible overhead to some variations of the test. The overhead is the result of a relatively tight loop in the `client::get_connected` method, which I'm trying to remove in this commit. I don't believe it is necessary; returning the error to the caller is much better. The callers usually have better retry mechanisms in place, with backoff, and can log much more useful messages since they have access to more context about the operation. [^1]: We need to investigate separately why it doesn't get the updated configuration when we configure cloud storage (you can try this test, for example: redpanda-data#13275), or whether it needs to run in these tests at all.
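To make that design choice concrete, here is an illustrative sketch only (the `connect_once` and `connect_with_backoff` helpers are hypothetical, not Redpanda's `client::get_connected`): the low-level call makes a single attempt and propagates the error, while the caller owns the retry policy, the backoff, and the context for logging.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <seastar/net/api.hh>

#include <chrono>

using namespace std::chrono_literals;

// Hypothetical low-level helper: one connection attempt, no internal retry
// loop. Any failure (connection refused, timeout, ...) propagates upward.
seastar::future<seastar::connected_socket>
connect_once(seastar::socket_address addr) {
    co_return co_await seastar::connect(addr);
}

// Hypothetical caller: applies exponential backoff instead of a tight loop
// and knows enough about the operation to log something useful.
seastar::future<seastar::connected_socket>
connect_with_backoff(seastar::socket_address addr, int attempts) {
    auto backoff = 100ms;
    for (;;) {
        try {
            co_return co_await connect_once(addr);
        } catch (...) {
            if (--attempts <= 0) {
                throw; // let the operation above decide what to do next
            }
        }
        co_await seastar::sleep(backoff);
        backoff *= 2;
    }
}
```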
Adding a segment takes ~1ms. Adding multiple segments in a single batch can stall the reactor for a while. This, together with a recent Seastar change[^1], caused some timeouts to fire very frequently[^2]. Fix the problematic test by splitting the batch passed to add_segments so that we execute finer-grained tasks and yield more often to the scheduler. [^1]: scylladb/seastar#2238 [^2]: redpanda-data#13275
https://buildkite.com/redpanda/redpanda/builds/36337#018a5e9c-fcf7-47f9-8fb5-96b3f48b6a88
JIRA Link: CORE-2803