
Error on retrieving write buffer from log stream #10141

Closed
Zelldon opened this issue Aug 22, 2022 · 7 comments · Fixed by #23371
Assignees: entangled90
Labels: component/partition-transitions, component/zeebe, good first issue, kind/bug, severity/low, version:8.4.13, version:8.5.9, version:8.6.4, version:8.7.0-alpha1

Comments

@Zelldon (Member) commented Aug 22, 2022

Describe the bug

On becoming leader, the LogStream seems to already be closed and throws an exception when the CommandAPI tries to create a new writer.

To Reproduce

Not 100% sure, but it looks like it is related to concurrent transitions, i.e. becoming leader while going inactive.
In general, I can't really see an impact, since the cluster was deleted almost immediately afterward.

[Screenshots: metrics, metrics2]

After the error occurs we also see a lot of occurrences of #8606, which might be related.

Expected behavior

No error; the case should be handled more gracefully.

Log/Stacktrace

Full Stacktrace

java.lang.RuntimeException: Actor is closed
	at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.newLogStreamWriter(LogStreamImpl.java:203) ~[zeebe-logstreams-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.newLogStreamRecordWriter(LogStreamImpl.java:181) ~[zeebe-logstreams-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.broker.transport.commandapi.CommandApiServiceImpl.lambda$onBecomingLeader$3(CommandApiServiceImpl.java:112) ~[zeebe-broker-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorControl.lambda$call$0(ActorControl.java:123) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:86) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]

Logs: https://drive.google.com/drive/u/0/folders/1DKM8gPL92xcdWeVZNfLQYWEdCHKGij4r

Error group: https://console.cloud.google.com/errors/detail/COaxsKTpuvHoOQ;service=zeebe;time=P7D?project=camunda-cloud-240911

Environment:

  • OS: ultrachaos
  • Zeebe Version: 8.1.0-alpha4
  • Configuration: trial
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user labels Aug 22, 2022
@megglos (Contributor) commented Sep 2, 2022

Triage: It shouldn't be a severe issue; the cluster should recover from this. It might still indicate an underlying problem, though.
We can close it and wait for the error reporting to reopen it.

@megglos megglos closed this as completed Sep 2, 2022
@remcowesterhoud (Contributor) commented Sep 28, 2022

This occurred again on 8.1.0-alpha5. @megglos Do you want me to reopen the issue now that it has reappeared?
https://console.cloud.google.com/errors/detail/COaxsKTpuvHoOQ;service=zeebe;time=P7D?project=camunda-cloud-240911

@megglos megglos reopened this Sep 28, 2022
@Zelldon (Member, Author) commented Nov 28, 2022

@Zelldon (Member, Author) commented Dec 27, 2022

@deepthidevaki (Contributor) commented:
Happened in 8.1.13 during shutdown. No impact; the shutdown completed successfully.

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
@npepinpe npepinpe added the good first issue Marks an issue as simple enough for first time contributors label May 13, 2024
@npepinpe (Member) commented:
Here's a possible reproducer:

/*
 * Copyright Camunda Services GmbH and/or licensed to Camunda Services GmbH under
 * one or more contributor license agreements. See the NOTICE file distributed
 * with this work for additional information regarding copyright ownership.
 * Licensed under the Camunda License 1.0. You may not use this file
 * except in compliance with the Camunda License 1.0.
 */
package io.camunda.zeebe.broker.transport.commandapi;

import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.broker.system.configuration.QueryApiCfg;
import io.camunda.zeebe.engine.state.QueryService;
import io.camunda.zeebe.logstreams.impl.flowcontrol.RateLimit;
import io.camunda.zeebe.logstreams.util.ListLogStorage;
import io.camunda.zeebe.logstreams.util.SyncLogStream;
import io.camunda.zeebe.scheduler.future.ActorFuture;
import io.camunda.zeebe.scheduler.testing.ControlledActorSchedulerExtension;
import io.camunda.zeebe.stream.api.StreamClock;
import io.camunda.zeebe.transport.ServerTransport;
import java.time.Duration;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.RegisterExtension;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoSettings;
import org.mockito.quality.Strictness;

@MockitoSettings(strictness = Strictness.STRICT_STUBS)
final class CommandApiServiceImplTest {
  @RegisterExtension
  private final ControlledActorSchedulerExtension scheduler =
      new ControlledActorSchedulerExtension();

  @Mock private ServerTransport transport;
  @Mock private QueryService queryService;

  @Test
  void shouldThrowException() {
    // given
    final var logStream =
        SyncLogStream.builder()
            .withActorSchedulingService(scheduler.scheduler())
            .withClock(StreamClock.system())
            .withMaxFragmentSize(1024 * 1024)
            .withLogStorage(new ListLogStorage())
            .withPartitionId(1)
            .withWriteRateLimit(RateLimit.disabled())
            .withLogName("test")
            .build();
    final var service =
        new CommandApiServiceImpl(transport, scheduler.scheduler(), new QueryApiCfg());
    scheduler.submitActor(service);
    scheduler.workUntilDone();

    // when - `onBecomingLeader` enqueues a call to the underlying actor, but the log stream 
    // is closed before the actor is scheduled (via workUntilDone), so an exception is thrown
    final ActorFuture<?> result = service.onBecomingLeader(1, 1, logStream, queryService);
    logStream.close();
    scheduler.workUntilDone();

    // then
    assertThat(result).failsWithin(Duration.ofSeconds(1));
  }
}

As for solving the issue, we have two options:

  1. Call onBecomingLeader not with a LogStream, but with the writer already created. This is now possible since the LogStream is entirely synchronous. However, this is not easily backportable.
  2. Make the caller of CommandApiServiceImpl#onBecomingLeader handle errors properly: if it fails, there may be a certain class of exceptions (e.g. the log stream is closed) that are OK to ignore. For this, though, we need to propagate the error from the log stream all the way to the caller.

Additionally, in CommandApiServiceImpl#onBecomingLeader (and possibly onBecomingFollower?), an uncaught exception is not handled, and the returned future is never completed. We should also ensure that the returned future is always completed.
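
A sketch of what option 2 could look like on the caller side, assuming the error is propagated through the returned future. Everything here except CommandApiServiceImpl#onBecomingLeader and the "Actor is closed" message from the stack trace above is a hypothetical stand-in for the real transition code:

void registerCommandApi(
    final CommandApiServiceImpl service,
    final int partitionId,
    final long term,
    final LogStream logStream,
    final QueryService queryService,
    final CompletableActorFuture<Void> transitionFuture) {
  service
      .onBecomingLeader(partitionId, term, logStream, queryService)
      .onComplete(
          (nothing, error) -> {
            if (error == null) {
              transitionFuture.complete(null);
            } else if (isLogStreamClosed(error)) {
              // the transition was cancelled concurrently and the log stream
              // closed underneath us; tolerable, a follow-up transition retries
              transitionFuture.complete(null);
            } else {
              transitionFuture.completeExceptionally(error);
            }
          });
}

// hypothetical classifier; a dedicated exception type would be cleaner
private static boolean isLogStreamClosed(final Throwable error) {
  return error instanceof RuntimeException
      && "Actor is closed".equals(error.getMessage());
}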

@entangled90 entangled90 self-assigned this Oct 9, 2024
@entangled90 (Contributor) commented Oct 9, 2024

Solution 2 is probably the most general and will likely be useful for other issues as well, while solution 1 is more or less valid only for this specific situation. However, solution 2 requires being more careful about which exceptions are thrown, etc.
I see that a couple of exceptions are already used for classifying recoverable vs. unrecoverable errors:

  • UnrecoverableException
  • RecoverablePartitionTransitionException

Regarding the last point, the current code creates a CompletableActorFuture that is manually completed at the end of the task (if no exception is thrown), ignoring the future returned by actor.call():

public ActorFuture<Void> onBecomingLeader(
      final int partitionId,
      final long term,
      final LogStream logStream,
      final QueryService queryService) {
    final CompletableActorFuture<Void> future = new CompletableActorFuture<>();
    actor.call(
        () -> {
          leadPartitions.add(partitionId);
          queryHandler.addPartition(partitionId, queryService);
          serverTransport.subscribe(partitionId, RequestType.QUERY, queryHandler);

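          // throws "Actor is closed" if the log stream was closed
          // concurrently; the future above is then never completed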
          final var logStreamWriter = logStream.newLogStreamWriter();
          commandHandler.addPartition(partitionId, logStreamWriter);
          serverTransport.subscribe(partitionId, RequestType.COMMAND, commandHandler);
          future.complete(null);
        });
    return future;
  }

If we eliminate the first future and return the one from actor.call() instead, that future is always completed, even on exceptional errors (the actor takes care of that).
Can the first future be safely removed?
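
For illustration, a minimal sketch of the method with the manual future removed, returning the future from actor.call() directly (assuming, as described above, that the actor fails that future on an uncaught exception). Creating the writer first means a closed log stream fails the future before any subscriptions are registered:

public ActorFuture<Void> onBecomingLeader(
      final int partitionId,
      final long term,
      final LogStream logStream,
      final QueryService queryService) {
    // return the actor's own future: it is failed automatically if the
    // runnable throws, so the caller always observes completion or failure
    return actor.call(
        () -> {
          // create the writer first; if the log stream is already closed this
          // throws and no subscriptions are registered
          final var logStreamWriter = logStream.newLogStreamWriter();

          leadPartitions.add(partitionId);
          queryHandler.addPartition(partitionId, queryService);
          serverTransport.subscribe(partitionId, RequestType.QUERY, queryHandler);

          commandHandler.addPartition(partitionId, logStreamWriter);
          serverTransport.subscribe(partitionId, RequestType.COMMAND, commandHandler);
        });
  }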

entangled90 added a commit that referenced this issue Oct 11, 2024
When a transition to leader is cancelled, the logStream is closed, which makes creating a writer throw an exception.

The returned future is now failed when such an error happens instead of never completing.
Moreover, the writer is instantiated as the first step, so in case of errors no subscriptions to the other components are made.
closes #10141
entangled90 added further commits that referenced this issue on Oct 18 and Oct 21, 2024, all with the same commit message as above.
github-merge-queue bot pushed a commit that referenced this issue Oct 21, 2024
…23371)

## Description
When a transition to leader is cancelled, the logStream is closed, which makes creating a writer throw an exception.

To avoid concurrency issues, the `CommandApiServiceImpl` is no longer a `PartitionListener` but a `PartitionTransitionStep`: transition steps are executed serially, which should prevent the race.

## Related issues
closes #10141
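
For context on the fix above: a rough sketch of the shape of a partition transition step. Method names and signatures here are illustrative approximations, not copied from the Zeebe sources:

// Transition steps are invoked one after another by the transition process,
// so a step like the command API registration cannot race with the log
// stream being closed by a concurrent transition.
interface PartitionTransitionStep {
  // called before the transition, e.g. to tear down state from the old role
  ActorFuture<Void> prepareTransition(
      PartitionTransitionContext context, long term, Role targetRole);

  // called to set up the step for the new role (leader, follower, inactive)
  ActorFuture<Void> transitionTo(
      PartitionTransitionContext context, long term, Role targetRole);

  String getName();
}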
backport-action pushed a commit that referenced this issue Oct 21, 2024, with the same commit message as the entangled90 commits above (cherry picked from commit a00267f).
github-merge-queue bot pushed a commit that referenced this issue Oct 21, 2024
… change is cancelled (#23775)

# Description
Backport of #23371 to `stable/8.6`.

relates to #10141
original author: @entangled90
github-merge-queue bot pushed a commit that referenced this issue Oct 22, 2024
… change is cancelled (#23774)

# Description
Backport of #23371 to `stable/8.5`.

relates to #10141
original author: @entangled90
github-merge-queue bot pushed a commit that referenced this issue Oct 23, 2024
… change is cancelled (#23824)

# Description
Backport of #23774 to `stable/8.4`.

relates to #23371 #10141
original author: @backport-action
@camundait camundait added the version:8.7.0-alpha1 label Nov 5, 2024
8 participants