
CI Failure (unknown error) in cloud_storage_rpfixture #12120

Closed
michael-redpanda opened this issue Jul 14, 2023 · 15 comments
Assignees: andrwng
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working), rpunit (unit test ci-failure, not ducktape), sev/low (non-functional paper cuts, e.g. typos, issues in log messages), stale

Comments

@michael-redpanda (Contributor)

https://buildkite.com/redpanda/redpanda/builds/33139#0189548e-5b3f-430d-b310-22a98bb116ed

Analysis of the log did not point to which test case failed within cloud_storage_rpfixture. More analysis required.

Created issue to move #12088 forward.

@michael-redpanda added the kind/bug, ci-failure, and area/cloud-storage labels on Jul 14, 2023
@andijcr added the rpunit label on Jul 14, 2023
@andrwng self-assigned this on Jul 14, 2023
@andrwng (Contributor) commented Jul 14, 2023

Just noting the failure:

unknown location(0): fatal error: in "test_remote_partition_scan_full_truncated_segments": cloud_storage::stuck_reader_exception: remote_segment.cc:1338 - segment_reader is stuck, segment ntp: {kafka/test-topic/42}, _cur_rp_offset: 94, _bytes_consumed: 27627
/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-027502674116270b2-1/redpanda/redpanda/src/v/cloud_storage/tests/util.cc(568): last checkpoint
/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-027502674116270b2-1/redpanda/redpanda/src/v/cloud_storage/tests/remote_partition_test.cc(638): Leaving test case "test_remote_partition_scan_full_truncated_segments"; testing time: 189168us
/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-027502674116270b2-1/redpanda/redpanda/src/v/cloud_storage/tests/remote_partition_test.cc(654): Entering test case "test_remote_partition_scan_first_half"
DEBUG 2023-07-14 14:58:35,095 [shard 0] cloud_storage - cache_service.cc:763 - Starting archival cache service, data directory: "/tmp/9bf3bd.tmp"

@michael-redpanda (Contributor, Author)

Thanks @andrwng !

@andrwng (Contributor) commented Jul 14, 2023

My guess so far is that this is an edge case in the consume path: with exactly the right number of bytes per batch and per segment, we go down the "stuck consumer" path and return.

@andrwng (Contributor) commented Jul 17, 2023

@abhijat I think the side effect of this is that we return early, but it shouldn't block clients from making progress, right?

I'm also wondering if this is an existing bug, or whether it could be related to chunk hydration

@abhijat (Contributor) commented Jul 18, 2023

Looking at the logs, it seems the test worked as expected; I am not sure why it was reported as an error.

> I'm also wondering if this is an existing bug, or whether it could be related to chunk hydration

In these tests chunk hydration does not happen because the impostor does not have the index, so the fallback path is used:

WARN  2023-07-14 14:58:35,042 [shard 0] cloud_storage - [fiber1605~0 kafka/test-topic/42 [0:187]] - remote_segment.cc:1505 - Failed to hydrate index 01d730c9/kafka/test-topic/42_0/0-187-55254-1-v1.log.index: cloud_storage::download_exception (NotFound)
INFO  2023-07-14 14:58:35,042 [shard 0] cloud_storage - [fiber1605~0 kafka/test-topic/42 [0:187]] - remote_segment.cc:925 - failed to download index with error [cloud_storage::download_exception (NotFound)], switching to fallback mode and retrying hydration.

The download request is for the full segment:

TRACE 2023-07-14 14:58:35,045 [shard 0] http - [/01d730c9/kafka/test-topic/42_0/0-187-55254-1-v1.log] - client.cc:90 - client.make_request GET /01d730c9/kafka/test-topic/42_0/0-187-55254-1-v1.log HTTP/1.1
User-Agent: redpanda.vectorized.io
Host: bucket.127.0.0.1
Content-Length: 0
x-amz-date: 20230714T145835Z
x-amz-content-sha256: [secret]
Authorization: [secret]

The segment size is reported as 55254 bytes. The test sets up the segments to be truncated:

    auto segments = setup_s3_imposter(
      *this, 3, 10, manifest_inconsistency::truncated_segments);

This basically cuts the body returned by the impostor in half, so 27627 bytes.
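
For reference, the arithmetic lines up with the error message (a quick illustrative check using only the numbers above, not test code):

    // Illustration only; values taken from the logs quoted in this comment.
    constexpr auto full_segment_size = 55254;              // size reported for the segment
    constexpr auto truncated_body = full_segment_size / 2; // body served by the impostor
    static_assert(truncated_body == 27627);                // matches _bytes_consumed in the error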

Looking at the batch consumer logs, the segment is consumed up to exactly that size (summing the batch sizes: 1808 + 6466 + 2098 + 1498 + 8740 + 7017 = 27627):

TRACE 2023-07-14 14:58:35,062 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1134 - accept_batch_start {header_crc:2054118994, size_bytes:1808, base_offset:{0}, type:batch_type::raft_data, crc:-1336848020, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:5, first_timestamp:{timestamp: 1689346714915}, max_timestamp:{timestamp: 1689346714920}, producer_id:-1, producer_epoch:-1, base_sequence:0, record_count:6, ctx:{term:{-9223372036854775808}, owner_shard:{0}}}, current delta: 0
TRACE 2023-07-14 14:58:35,062 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1198 - consume_batch_start called for 0
DEBUG 2023-07-14 14:58:35,062 [shard 0] cloud_storage - [fiber1605~0~0 kafka/test-topic/42] - remote_segment.cc:1372 - remote_segment_batch_reader::produce
TRACE 2023-07-14 14:58:35,063 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1134 - accept_batch_start {header_crc:1473820545, size_bytes:6466, base_offset:{6}, type:batch_type::raft_data, crc:276715364, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:21, first_timestamp:{timestamp: 1689346714915}, max_timestamp:{timestamp: 1689346714936}, producer_id:-1, producer_epoch:-1, base_sequence:0, record_count:22, ctx:{term:{-9223372036854775808}, owner_shard:{0}}}, current delta: 0
TRACE 2023-07-14 14:58:35,063 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1198 - consume_batch_start called for 6
DEBUG 2023-07-14 14:58:35,063 [shard 0] cloud_storage - [fiber1605~0~0 kafka/test-topic/42] - remote_segment.cc:1372 - remote_segment_batch_reader::produce
TRACE 2023-07-14 14:58:35,064 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1134 - accept_batch_start {header_crc:1316626887, size_bytes:2098, base_offset:{28}, type:batch_type::raft_data, crc:-912728662, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:6, first_timestamp:{timestamp: 1689346714915}, max_timestamp:{timestamp: 1689346714921}, producer_id:-1, producer_epoch:-1, base_sequence:0, record_count:7, ctx:{term:{-9223372036854775808}, owner_shard:{0}}}, current delta: 0
TRACE 2023-07-14 14:58:35,064 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1198 - consume_batch_start called for 28
DEBUG 2023-07-14 14:58:35,064 [shard 0] cloud_storage - [fiber1605~0~0 kafka/test-topic/42] - remote_segment.cc:1372 - remote_segment_batch_reader::produce
TRACE 2023-07-14 14:58:35,065 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1134 - accept_batch_start {header_crc:650709844, size_bytes:1498, base_offset:{35}, type:batch_type::raft_data, crc:-960559717, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:4, first_timestamp:{timestamp: 1689346714915}, max_timestamp:{timestamp: 1689346714919}, producer_id:-1, producer_epoch:-1, base_sequence:0, record_count:5, ctx:{term:{-9223372036854775808}, owner_shard:{0}}}, current delta: 0
TRACE 2023-07-14 14:58:35,065 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1198 - consume_batch_start called for 35
DEBUG 2023-07-14 14:58:35,065 [shard 0] cloud_storage - [fiber1605~0~0 kafka/test-topic/42] - remote_segment.cc:1372 - remote_segment_batch_reader::produce
TRACE 2023-07-14 14:58:35,066 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1134 - accept_batch_start {header_crc:1534074726, size_bytes:8740, base_offset:{40}, type:batch_type::raft_data, crc:-660823431, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:29, first_timestamp:{timestamp: 1689346714915}, max_timestamp:{timestamp: 1689346714944}, producer_id:-1, producer_epoch:-1, base_sequence:0, record_count:30, ctx:{term:{-9223372036854775808}, owner_shard:{0}}}, current delta: 0
TRACE 2023-07-14 14:58:35,066 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1198 - consume_batch_start called for 40
DEBUG 2023-07-14 14:58:35,066 [shard 0] cloud_storage - [fiber1605~0~0 kafka/test-topic/42] - remote_segment.cc:1372 - remote_segment_batch_reader::produce
TRACE 2023-07-14 14:58:35,067 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1134 - accept_batch_start {header_crc:716560967, size_bytes:7017, base_offset:{70}, type:batch_type::raft_data, crc:345388756, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:23, first_timestamp:{timestamp: 1689346714915}, max_timestamp:{timestamp: 1689346714938}, producer_id:-1, producer_epoch:-1, base_sequence:0, record_count:24, ctx:{term:{-9223372036854775808}, owner_shard:{0}}}, current delta: 0
TRACE 2023-07-14 14:58:35,067 [shard 0] cloud_storage - [fiber1605~0~0~0 kafka/test-topic/42] - remote_segment.cc:1198 - consume_batch_start called for 70
DEBUG 2023-07-14 14:58:35,067 [shard 0] cloud_storage - [fiber1605~0~0 kafka/test-topic/42] - remote_segment.cc:1372 - remote_segment_batch_reader::produce

At this point the downloaded segment is fully consumed, and when the next maybe_reset_reader is called, a stuck reader is the expected outcome because there is no more data to read (note that _bytes_consumed in the error message has not changed):

WARN  2023-07-14 14:58:35,069 [shard 0] cloud_storage - [fiber1606 kafka/test-topic/42] - remote_partition.cc:415 - exception thrown while reading from remote_partition: remote_segment.cc:1338 - segment_reader is stuck, segment ntp: {kafka/test-topic/42}, _cur_rp_offset: 94, _bytes_consumed: 27627

The test asserts that an exception should be thrown:

    BOOST_REQUIRE_THROW(
      scan_remote_partition(*this, base, max), std::system_error);

But it fails. Perhaps the exception thrown does not match std::system_error; we did change the type of exception thrown a couple of months ago. However, the new exception type is derived from the old exception type:

class stuck_reader_exception final : public std::runtime_error

so it should not fail because of this change. I will try to run this test locally and see what error is thrown during the test.

@abhijat (Contributor) commented Jul 18, 2023

Running the test locally shows the following error being thrown:

ERROR 2023-07-18 13:39:14,732 [shard 0] storage - parser.cc:99 - Stopping parser, short read. Expected to read 7268 bytes, but read 2948 bytes. context: parser::consume_records
DEBUG 2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber3 kafka/test-topic/42] - remote_partition.cc:554 - maybe_reset_reader called
DEBUG 2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber3 kafka/test-topic/42] - remote_partition.cc:576 - maybe_reset_reader, config start_offset: 84, reader max_offset: 187
DEBUG 2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber3 kafka/test-topic/42] - remote_partition.cc:647 - maybe_reset_reader completed, reader is present: true, is end of stream: false
DEBUG 2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber3 kafka/test-topic/42] - remote_partition.cc:340 - Invoking 'read_some' on current log reader with config: {start_offset:{84}, max_offset:{526}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt, bytes_consumed:24663, over_budget:0, strict_max_bytes:0, skip_batch_cache:0}
DEBUG 2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber3 kafka/test-topic/42] - remote_partition.cc:349 - Error while reading from stream 'storage::parser_errc:3'
DEBUG 2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber2~0~0 kafka/test-topic/42] - remote_segment.cc:1393 - remote_segment_batch_reader::stop
DEBUG 2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber2~0~0 kafka/test-topic/42] - remote_segment.cc:1396 - remote_segment_batch_reader::stop - parser-close
WARN  2023-07-18 13:39:14,732 [shard 0] cloud_storage - [fiber3 kafka/test-topic/42] - remote_partition.cc:415 - exception thrown while reading from remote_partition: parser_errc::input_stream_not_enough_bytes

This error originates from the parser trying to read an iobuf without the expected number of bytes available, which results in the parser returning an error and the partition reader throwing the std::system_error that the test assertion expects.

In the failing unit test, it seems the truncation performed by the test happened to land exactly on a batch boundary, so all the batches in the truncated segment were intact and the std::system_error was never thrown.
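
In other words, the same truncated read can surface in two ways, depending on where the cut lands (summarizing the analysis above):

    cut mid-batch           -> parser short read -> std::system_error
                               (the local run; what the current assertion expects)
    cut on a batch boundary -> all downloaded batches parse, then no more data
                               -> cloud_storage::stuck_reader_exception (the CI failure)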

@abhijat (Contributor) commented Jul 18, 2023

> My guess so far is that this is an edge case in the consume path: with exactly the right number of bytes per batch and per segment, we go down the "stuck consumer" path and return.

This is what seems to be happening in the test. Maybe we should assert for either system error or stuck reader exception in the test.
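
A minimal sketch of what that broader assertion could look like (hypothetical, not the actual fix; it reuses the existing scan_remote_partition call from the test and assumes stuck_reader_exception stays derived from std::runtime_error):

    // Accept either failure mode of the truncated-segment scan.
    try {
        scan_remote_partition(*this, base, max);
        BOOST_FAIL("expected the truncated-segment scan to fail");
    } catch (const cloud_storage::stuck_reader_exception&) {
        // truncation landed exactly on a batch boundary
    } catch (const std::system_error&) {
        // parser short read (parser_errc::input_stream_not_enough_bytes)
    }

Since both exception types derive from std::runtime_error, an even simpler variant would be BOOST_REQUIRE_THROW(scan_remote_partition(*this, base, max), std::runtime_error).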

@BenPope (Member) commented Jul 25, 2023

This might be the same:
https://buildkite.com/redpanda/redpanda/builds/33923#01898cb3-bc49-42be-bc54-d333c749f024

ERROR 2023-07-25 11:52:23,851 [shard 0] cloud_storage - util.cc:71 - Unexpected error std::runtime_error (topic_manifest.cc:306 - Failed to parse topic manifest: Invalid value. at offset 0)

@bharathv (Contributor)

https://buildkite.com/redpanda/redpanda/builds/33591#018971d7-5d7f-4e9c-b6b6-600d21dad304 (23.1.x)

unknown location(0): fatal error: in "test_remote_partition_scan_full_truncated_segments": cloud_storage::stuck_reader_exception: remote_segment.cc:1044 - segment_reader is stuck, segment ntp: {test-ns/test-topic/42}, _cur_rp_offset: 67, _bytes_consumed: 19717

@andijcr (Contributor) commented Jul 28, 2023

@andijcr (Contributor) commented Aug 8, 2023

@andijcr (Contributor) commented Sep 8, 2023

I wonder whether the resetter (#13146) had any effect on this.

@andijcr added the sev/low label on Sep 8, 2023
@abhijat (Contributor) commented Oct 16, 2023

Seen on #14113:

WARN  2023-07-14 14:58:35,069 [shard 0] cloud_storage - [fiber1606 kafka/test-topic/42] - remote_partition.cc:370 - stuck reader: current rp offset: 94, max rp offset: 187
WARN  2023-07-14 14:58:35,069 [shard 0] cloud_storage - [fiber1606 kafka/test-topic/42] - remote_partition.cc:415 - exception thrown while reading from remote_partition: remote_segment.cc:1338 - segment_reader is stuck, segment ntp: {kafka/test-topic/42}, _cur_rp_offset: 94, _bytes_consumed: 27627
...
unknown location(0): fatal error: in "test_remote_partition_scan_full_truncated_segments": cloud_storage::stuck_reader_exception: remote_segment.cc:1338 - segment_reader is stuck, segment ntp: {kafka/test-topic/42}, _cur_rp_offset: 94, _bytes_consumed: 27627


This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.


This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 12, 2024