
CI Failure (RuntimeError: Failed checks: 1; Incomplete checks: 0) in CloudStorageTimingStressTest.test_cloud_storage #10241

Closed
ztlpn opened this issue Apr 20, 2023 · 4 comments · Fixed by #10262
Labels
area/cloud-storage Shadow indexing subsystem ci-disabled-test ci-failure kind/bug Something isn't working

Comments

ztlpn commented Apr 20, 2023

https://buildkite.com/redpanda/redpanda/builds/27621#01879f66-e9e1-41fe-9300-f597bf6f52af

Module: rptest.tests.cloud_storage_timing_stress_test
Class:  CloudStorageTimingStressTest
Method: test_cloud_storage
test_id:    rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage
status:     FAIL
run time:   56.446 seconds


    RuntimeError('Failed checks: 1; Incomplete checks: 0')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 451, in test_cloud_storage
    self.do_checks()
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 436, in do_checks
    raise RuntimeError(
RuntimeError: Failed checks: 1; Incomplete checks: 0
ztlpn added the kind/bug, ci-failure and area/cloud-storage labels Apr 20, 2023
dotnwat mentioned this issue Apr 20, 2023
dotnwat commented Apr 20, 2023

andrwng commented Apr 21, 2023

The failed checks:

[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_size_bytes - lineno:126]: Reported cloud log size does not match manifest: 256993701 != 248568368
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 3384 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_size_bytes - lineno:126]: Reported cloud log size does not match manifest: 256993701 != 248568368
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 3384 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_size_bytes - lineno:126]: Reported cloud log size does not match manifest: 248568335 != 248568368
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 3640 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_size_bytes - lineno:126]: Reported cloud log size does not match manifest: 248568335 != 248568368
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 3640 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_size_bytes - lineno:126]: Reported cloud log size does not match manifest: 252781681 != 248568368
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 3640 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_size_bytes - lineno:126]: Reported cloud log size does not match manifest: 256994342 != 248568368
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 3640 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_size_bytes - lineno:126]: Reported cloud log size does not match manifest: 256994342 != 248568368
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 3640 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 4024 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 4024 != 4028
[INFO  - 2023-04-20 16:12:30,802 - cloud_storage_timing_stress_test - _validate_cloud_log_offsets - lineno:141]: Reported cloud log start does not match manifest: 4024 != 4028
[ERROR - 2023-04-20 16:12:31,003 - cloud_storage_timing_stress_test - do_checks - lineno:422]: Check cloud_storage_status_endpoint threw an exception: Cloud storage partition status did not match the manifest
[INFO  - 2023-04-20 16:12:31,003 - cloud_storage_timing_stress_test - do_checks - lineno:426]: Check cloud_storage_usage completed successfuly
[ERROR - 2023-04-20 16:12:31,003 - cluster - wrapped - lineno:56]: Test failed, doing failure checks on RedpandaService-0-140400403129552...
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 451, in test_cloud_storage
    self.do_checks()
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 436, in do_checks
    raise RuntimeError(
RuntimeError: Failed checks: 1; Incomplete checks: 0

Snippets of the manifest:

 'start_offset': 4070,
              '3682-1-v1.log': {'archiver_term': 1,
                                'base_offset': 3682,
                                'base_timestamp': 1682007119156,
                                'committed_offset': 3810,
                                'delta_offset': 42,
                                'delta_offset_end': 43,
                                'is_compacted': False,
                                'max_timestamp': 1682007119653,
                                'metadata_size_hint': 0,
                                'segment_term': 1,
                                'size_bytes': 4213193,
                                'sname_format': 3},
              '4070-1-v1.log': {'archiver_term': 1,
                                'base_offset': 4070,
                                'base_timestamp': 1682007120656,
                                'committed_offset': 4198,
                                'delta_offset': 46,
                                'delta_offset_end': 47,
                                'is_compacted': False,
                                'max_timestamp': 1682007121153,
                                'metadata_size_hint': 0,
                                'segment_term': 1,
                                'size_bytes': 4213193,

The Kafka start offset reported by Redpanda's admin endpoint is 4024 (the manifest start offset minus the delta_offset of the segment that contains the start offset: 4070 - 46), which does not match what the test computes from the manifest in S3: 4028 (the start offset minus the first segment's delta_offset: 4070 - 42).

The test's logic doesn't match what Redpanda actually does to compute the Kafka start offset: find the segment whose base offset exactly matches the start offset (or the first segment, if its base offset is higher than the start offset), and use that segment's delta offset to do the translation.
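For illustration, a minimal sketch of the two translations using the manifest values above (the segments dict, the field subset, and the function names are hypothetical, not the test's actual helpers):

# Minimal sketch (hypothetical, not the test's actual code) of the two ways
# the raw start offset 4070 can be translated to a Kafka offset, using the
# delta_offset values from the manifest snippet above.
segments = {
    3682: {'base_offset': 3682, 'delta_offset': 42},
    4070: {'base_offset': 4070, 'delta_offset': 46},
}
start_offset = 4070  # raw start offset from the manifest


def kafka_start_from_first_segment(start_offset, segments):
    # What the test's bucket view did: always subtract the first
    # segment's delta_offset.
    first = segments[min(segments)]
    return start_offset - first['delta_offset']


def kafka_start_like_redpanda(start_offset, segments):
    # What Redpanda does: subtract the delta_offset of the segment whose
    # base_offset equals the start offset, or of the first segment whose
    # base_offset is above it.
    for base in sorted(segments):
        if base >= start_offset:
            return start_offset - segments[base]['delta_offset']
    return None


print(kafka_start_from_first_segment(start_offset, segments))  # 4028 (test)
print(kafka_start_like_redpanda(start_offset, segments))       # 4024 (Redpanda)

With the snippet's values the two approaches disagree by 4, which is exactly the 4024 != 4028 mismatch in the failed check above.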

andrwng added a commit to andrwng/redpanda that referenced this issue Apr 21, 2023
The bucket view previously always used the first segment's delta offset
to translate the start offset. This is no longer how Redpanda reports
the start offset.

Instead, it finds the segment whose base offset matches the start
offset, or the first segment if its base offset is higher than the start
offset, and uses its delta to do the translation[1].

[1] https://github.com/redpanda-data/redpanda/blob/e133de2fe76ae34d8e9972eb8a225dd346d3628a/src/v/cloud_storage/partition_manifest.cc#L308

Fixes redpanda-data#10241
andrwng commented Apr 21, 2023

It isn't clear why this started failing more frequently recently, given that the code that reports the Kafka start offset hasn't changed in a couple of weeks. Maybe there are multiple failure cases here. I put up #10247 to address the start offset issue.

ztlpn commented Oct 25, 2023

open-close to make pandatriage happy (issue is stale, last occurrence 2023-07-03)

ztlpn closed this as completed Oct 25, 2023