Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v22.3.x] archival: remove timeouts from archival_metadata_stm #12690

Merged

Commits on Aug 10, 2023

  1. raft: let state_machine::wait take an abort source

    The underlying wait already takes one, this is just
    a pass-through.
    
    (cherry picked from commit 7be3f80)
    jcsp authored and andrwng committed Aug 10, 2023
    Configuration menu
    Copy the full SHA
    1080d1e View commit details
    Browse the repository at this point in the history
  2. cluster: update signature of persisted_stm::wait_no_throw

    There's an upcoming change to remove the timeout from the
    archival_metadata_stm. Making that change will be much simpler if the
    signature of persisted_stm::wait_no_throw() is a timeout instead of a
    duration, since the underlying raft::state_machine::wait() call uses the
    model::no_timeout deadline value to determine whether to set a timer.
    
    Along the way I added an optional abort source reference that is
    currently unused.
    
    (cherry picked from commit cac4ac9)
    andrwng committed Aug 10, 2023
    Configuration menu
    Copy the full SHA
    0bca7d4 View commit details
    Browse the repository at this point in the history
  3. archival: remove timeouts from archival_metadata_stm

    At a high level, the archival loop is implemented with the following
    steps:
    
    ```
    while not aborted:
      wait until node is leader
      sync that offsets from previous terms are applied
      while node is leader:
        upload segments based on what is in the manifest
        replicate changes to manifest via archival stm
        wait for archival stm update to be applied (updates manifest)
    ```
    
    The archival loop currently expects that updates to the manifest are
    either driven by it (or the housekeeping service, though ignoring this
    for the sake of example).
    
    One case where this invariant is broken is the case where the
    replication/apply of archival batches times out -- the actual apply of
    the op may proceed as a part of the raft::state_machine background
    fiber, and the archival loop itself may proceed without waiting, meaning
    the updates to the manifest may race with a subsequent iteration of the
    archival loop.
    
    This commit remediates this by removing the timeout from the
    archival_metadata_stm::do_replicate_commands() call, forcing the caller
    to wait indefinitely for the op to complete.
    
    (cherry picked from commit 0dd1dc8)
    andrwng committed Aug 10, 2023
    Configuration menu
    Copy the full SHA
    9de0d54 View commit details
    Browse the repository at this point in the history
  4. archival: silence broken_named_semaphore in upload loop

    CONFLICT:
    - ntp_archiver_service has changed a bunch since 22.3, but the effective
      change is the same here: catch broken semaphore exceptions
    
    Replication may throw a ss::broken_named_semaphore when met with a
    shutting down error:
    
    ```
    DEBUG 2023-03-24 22:54:01,854 [shard 0] archival - [fiber4 kafka/panda-topic/7] - ntp_archiver_service.cc:494 - Updating the archival_meta_stm in read-replica mode, in-sync offset: -9223372036854775808, last uploaded offset: 0, last compacted offset: -9223372036854775808
    DEBUG 2023-03-24 22:54:01,854 [shard 0] cluster - ntp: {kafka/panda-topic/7} - archival_metadata_stm.cc:188 - command_batch_builder::replicate called
    WARN  2023-03-24 22:54:01,854 [shard 0] cluster - ntp: {kafka/panda-topic/7} - archival_metadata_stm.cc:333 - error on replicating remote segment metadata: raft::errc:19
    ERROR 2023-03-24 22:54:01,854 [shard 0] archival - [fiber4 kafka/panda-topic/7] - ntp_archiver_service.cc:252 - upload loop error: seastar::broken_named_semaphore (Semaphore broken: mutex)
    ```
    
    This commit handles the exception the same we handle other shutdown exceptions:
    by doing nothing.
    
    (cherry picked from commit cdb2b39)
    andrwng committed Aug 10, 2023
    Configuration menu
    Copy the full SHA
    c94ef84 View commit details
    Browse the repository at this point in the history