Optimize replicated_partition::validate_fetch_offset
#19161
Conversation
new failures in https://buildkite.com/redpanda/redpanda/builds/50476#01903418-7120-4b30-b2fc-11e6c48e8ccc:
new failures in https://buildkite.com/redpanda/redpanda/builds/50476#01903418-7122-49e6-8663-1ae0fd2b8bd8:
new failures in https://buildkite.com/redpanda/redpanda/builds/50476#01903418-7125-4db8-9513-f870f193b115:
new failures in https://buildkite.com/redpanda/redpanda/builds/50476#01903418-7124-4acc-9ca8-75fcf4bc962f:
new failures in https://buildkite.com/redpanda/redpanda/builds/50476#01903430-a59d-40a9-a552-7453cdffe574:
new failures in https://buildkite.com/redpanda/redpanda/builds/50476#01903430-a59b-4401-9e3d-75ece1181001:
new failures in https://buildkite.com/redpanda/redpanda/builds/50476#01903430-a59e-496a-a227-3386af8c544f:
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/50476#01903418-7120-4b30-b2fc-11e6c48e8ccc
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/50476#01903418-7125-4db8-9513-f870f193b115
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51291#01909ad2-263d-436a-bf65-db67c46f603f
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51291#01909ad2-2641-4548-9081-bd06a44334ae
src/v/cluster/partition.cc
Outdated
// - The topic has been remotely recovered and the log_eviction_stm won't
//   have any state.
// - The start offset override hasn't been set and the log_eviction_stm is
//   returning model::offset{} to indicate that.
nit: or this is a read replica
src/v/cluster/partition.h
Outdated
std::optional<model::offset> _cached_start_offset_raft,
  _cached_start_offset_kafka;
nit: please put these on separate lines
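That is, the requested form would be:

    std::optional<model::offset> _cached_start_offset_raft;
    std::optional<model::offset> _cached_start_offset_kafka;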
src/v/cluster/partition.cc
Outdated
// The eviction STM only keeps track of DeleteRecords truncations
// as Raft offsets. Translate if possible.
if (
-  offset_res.value() != model::offset{}
+  offset_override != model::offset{}
  && _raft->start_offset() < offset_res.value()) {
nit: use offset_override here too?
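Presumably something like:

    if (
      offset_override != model::offset{}
      && _raft->start_offset() < offset_override) {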
src/v/cluster/partition.cc
Outdated
// For the first case if we've read the start_offset_override from the
// archival_meta_stm then it won't change until the log_eviction_stm has
// local state once more. Hence we can just return the offset we read
// without re-syncing.
if (_cached_start_offset_kafka && offset_override == model::offset{}) {
    co_return *_cached_start_offset_kafka;
}
This seems kind of off -- can't we get here without any syncing at all e.g. for read replicas?
Yeah, forgot to account for read replicas initially. In update I just pushed I have it so we're always syncing the archival stm and getting the start offset override from it afterwards for read replicas.
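A minimal sketch of that read-replica path, assuming the names used in the snippets above (the control flow and the error value are illustrative, not the actual patch):

    // Sketch only: read replicas have no usable log_eviction_stm state, so
    // always sync the archival_meta_stm and take the start offset override
    // from its manifest afterwards.
    if (is_read_replica_mode_enabled()) {
        if (!co_await _archival_meta_stm->sync(timeout)) {
            co_return errc::timeout; // placeholder error code
        }
        co_return kafka::offset_cast(
          _archival_meta_stm->manifest().get_start_kafka_offset_override());
    }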
src/v/cluster/partition.cc
Outdated
// Check if the translation has been cached.
if (
  _cached_start_offset_raft
  && offset_override == *_cached_start_offset_raft) {
    co_return *_cached_start_offset_kafka;
}
Should this happen regardless of raft->start_offset() < offset_res.value()? Then I don't think we'd need the check at L1281?
Yeah, removed the duplicate check in the updated version.
src/v/cluster/partition.cc
Outdated
ss::future<result<model::offset, std::error_code>>
partition::sync_kafka_start_offset_override(
  model::timeout_clock::duration timeout) {
    model::offset offset_override;

    if (_log_eviction_stm && !is_read_replica_mode_enabled()) {
        auto offset_res
          = co_await _log_eviction_stm->sync_start_offset_override(timeout);
        if (offset_res.has_failure()) {
            co_return offset_res.as_failure();
        }

        offset_override = offset_res.value();
I tried taking a step back to understand how this looks / should look with these new cached variables. I think this is roughly what we have here, but there are some nuances. It seems like the cached values are basically caching the translation, but in a potentially imperfect way.
raft_start_override = co_await sync_log_eviction_stm()
if raft_start_override.has_value():
    if raft_start_override == raft_start_cache:
        # Easy case, we've translated before.
        return kafka_start_cache
    # We haven't translated before, but data is still in local log.
    if we can translate:
        translated_override = translate(raft_start_override)
        raft_start_cache = raft_start_override
        kafka_start_cache = translated_override
    # Can't translate because a race has likely occurred: start override has
    # fallen out of local log, and there is no cached value to fall back on.
    # Fall through.

# We may or may not have already synced above, but we sync again to ensure the
# value we get from archival is up-to-date.
kafka_start_override = co_await sync_archival_stm()
if raft_start_override.has_value() && kafka_start_override.has_value():
    # Note that even if there is an override from the log_eviction_stm AND from
    # the archival_stm, they don't necessarily correspond to the same record
    # (e.g. consider a race where DeleteRecords comes in after the first sync).
    #
    # If this is the case though, subsequent calls will resync the log eviction
    # stm and not use this cached value.
    raft_start_cache = raft_start_override
    kafka_start_cache = kafka_start_override
return kafka_start_override
Lmk if this is more or less what you're thinking here. If so, maybe we can incorporate the structure or comments into the implementation?
I've hopefully simplified the caching logic and added clearer comments in the updated version as per our conversations on Slack. Let me know if they're still lacking though.
Force-pushed from 57dbe14 to 59ee771.
src/v/cluster/partition.cc
Outdated
auto start_kafka_offset
  = _archival_meta_stm->manifest().get_start_kafka_offset_override();
if (start_kafka_offset != kafka::offset{}) {
    co_return kafka::offset_cast(start_kafka_offset);
}
_cached_kafka_start_override = kafka::offset_cast(start_kafka_offset);
Doesn't this mean that we'll never do the archival sync after we've cached a start override?
Yes, that's the hope.
The log eviction stm is now caching the kafka offset in every prefix truncate batch that is applied against it, even if they don't land in local storage. So the only cases where we'd need to get the offset override from the archival stm are where no prefix truncate batches have been applied to the log eviction stm since the RP broker started. In those cases we only need to sync the archival stm once to get the most up-to-date prefix truncation from cloud storage. After that, if any additional prefix truncate batches come in, the log eviction stm will be aware of them.
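A rough sketch of the flow being described, assuming the names from the snippets in this review (the return type of cached_kafka_start_offset() and the error value are assumptions):

    // 1) If the log_eviction_stm has applied a prefix truncate batch since
    //    startup, its cached kafka offset is authoritative.
    auto cached = _log_eviction_stm->cached_kafka_start_offset();
    if (cached != kafka::offset{}) {
        co_return kafka::offset_cast(cached);
    }
    // 2) Otherwise sync the archival_meta_stm once and remember its override;
    //    any later prefix truncate batches will be seen by the eviction stm.
    if (!_cached_kafka_start_override) {
        if (!co_await _archival_meta_stm->sync(timeout)) {
            co_return errc::timeout; // placeholder error code
        }
        _cached_kafka_start_override = kafka::offset_cast(
          _archival_meta_stm->manifest().get_start_kafka_offset_override());
    }
    co_return *_cached_kafka_start_override;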
src/v/cluster/partition.cc
Outdated
// Once it has been sync'd we can cache the override value as it won't
// change until `_log_eviction_stm->cached_kafka_start_offset()` does.
if (!_cached_kafka_start_override || !_log_eviction_stm) {
    auto term = _raft->term();
    if (!co_await _archival_meta_stm->sync(timeout)) {
nit: may be clearer as something like:

if (_cached_kafka_start_override.has_value() && _log_eviction_stm) {
    return *_cached_kafka_start_override;
}
auto term = _raft->term();
...
Actually I'm thinking it might make sense for us to return very early if !_log_eviction_stm, in which case we can omit these other checks for _log_eviction_stm.
It's probably worth checking if _log_eviction_stm can ever be removed -- I don't think it can, but maybe someone from the replication team will have a clearer understanding, since that code has changed in the last few months.
Currently there are a couple of ntps that don't have a _log_eviction_stm: the controller, the consumer offsets, the tx manager... Luckily none of these are ktps and none are consumable in the normal kafka manner. So for any partitions we're concerned about here the log eviction stm will always exist. I think I should still check for its existence though, as I think it makes things clearer.
Luckily none of these are ktps and none are consumable in the normal kafka manner
__consumer_offsets is a kafka partition and AFAIK is consumable in the normal kafka way? IIRC this was a change we made a couple of years ago: it used to be in the Redpanda namespace and hence invisible to Kafka, but for compatibility we moved it into the kafka namespace with the same name as Apache Kafka so it would work in the same way for anyone who was accessing it directly (Enterprise team would probably have the most context here).
Good point, then let's assume there are kafka-consumable partitions without the log eviction stm (like the consumer offsets one you've mentioned). Then there are two paths the function can take:
Either !_archival_meta_stm, in which case it returns model::offset{}, which is correct since there can be no overridden offset without the log eviction stm.
Or _archival_meta_stm, in which case it'll sync the archival stm once and then return _archival_meta_stm->manifest().get_start_kafka_offset_override() for every call afterwards. The result of this function should be model::offset{} as well, since we shouldn't technically be able to apply delete-records batches to partitions without the log eviction stm.
Either way the function returns the correct value.
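In rough code, the two paths described above amount to something like (sketch only; the error value is a placeholder):

    // Partition without a log_eviction_stm.
    if (!_archival_meta_stm) {
        // No archival stm either: there can be no start offset override.
        co_return model::offset{};
    }
    // Archival stm present: sync it (once in practice, thanks to the cached
    // override) and serve the manifest value, which should also be
    // model::offset{} since DeleteRecords can't apply without the eviction stm.
    if (!co_await _archival_meta_stm->sync(timeout)) {
        co_return errc::timeout; // placeholder error code
    }
    co_return kafka::offset_cast(
      _archival_meta_stm->manifest().get_start_kafka_offset_override());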
SG.
@andrwng any other concerns on this thread of discussion?
Yea this looks good. Thanks for the ping Travis, and Brandon for updating!
This allows the log_eviction_stm to know what the current kafka start offset override is even if it doesn't reside in the local log. However, since this cached value is not snapshotted by the stm it may not be recovered when Redpanda restarts. In that case one would need to fall back on the archival stm to find the kafka start offset override.
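A conceptual sketch of that caching inside the stm (the method signature and member name here are assumptions based on the description, not the actual code):

    // Sketch only: remember the kafka-level start offset override from every
    // applied prefix truncate batch, even when the corresponding raft offset
    // has already fallen out of the local log. This member is not part of the
    // stm snapshot, so it is lost across restarts (hence the archival fallback).
    void log_eviction_stm::apply_prefix_truncate(kafka::offset new_start) {
        _cached_kafka_start_offset_override = new_start;
    }

    kafka::offset log_eviction_stm::cached_kafka_start_offset() const {
        return _cached_kafka_start_offset_override;
    }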
The most common path for validate_fetch_offset results in a number of short-lived coroutines. In these cases the allocation/deallocation of the coroutine's frame ends up dominating the runtime of the function. This commit removes the coroutines in favor of then() chains, which can avoid the allocation if the task quota hasn't been met.
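A generic illustration of the kind of transformation described (not the actual patch; cache_get() is a hypothetical helper standing in for a lookup that usually completes immediately, such as a batch-cache hit):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>

    namespace ss = seastar;

    // Hypothetical helper that usually returns an already-ready future.
    static ss::future<int> cache_get(int key) {
        return ss::make_ready_future<int>(key * 2);
    }

    // Coroutine version: every call allocates a coroutine frame, even when the
    // awaited future is already ready.
    ss::future<int> lookup_coro(int key) {
        auto cached = co_await cache_get(key);
        co_return cached + 1;
    }

    // then()-chain version: with a ready future the continuation can run inline
    // as long as the task quota hasn't been used up, so the hot path avoids the
    // frame allocation entirely.
    ss::future<int> lookup_chained(int key) {
        return cache_get(key).then([](int cached) { return cached + 1; });
    }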
All CI failures seem to be known issues:
/ci-repeat 1
The failure in the latest CI run was a known issue: #20570
This looks good to me. I am wondering, do we have a test where we do delete records with an offset that is in the range included only in cloud?
Looking around, I actually had a hard time finding one that does this explicitly. It should be fairly easy to add one to cloud_storage/tests/cloud_storage_e2e_test.cc or cloud_storage/tests/delete_records_e2e_test.cc (both have full access to the partition, and we can easily force an aggressive local housekeeping run before deleting).
The changes LGTM, but good callout from Michal about ensuring test coverage for cloud-only deletes. I'm okay with that going in separately, but also happy to wait on such a test to merge (and help if needed)
I will add a test to do that today. I think that should resolve the last blocker for this PR.
Created a new PR that adds the discussed test: #21382.
Thanks, @ballard26; merging this one then.
/backport v24.1.x |
/backport v23.3.x |
/backport v23.2.x |
Failed to create a backport PR to v23.3.x branch. I tried:
Failed to create a backport PR to v23.2.x branch. I tried:
In the common fetch path where Redpanda is reading batches from the batch cache, validate_fetch_offset can consume ~33% (or 3.8% of overall reactor utilization) of the total time spent reading from a given NTP. After applying the changes in this PR this is reduced to ~9% (0.9% of overall reactor utilization).
Backports Required
Release Notes