
Deadlock with parallel processing of single partition stream #852

Closed
vladimirkl opened this issue May 18, 2023 · 16 comments · Fixed by #1123
Labels: bug (Something isn't working)

Comments

@vladimirkl
Contributor

Hi, I need to take n records from a single partition, process them in parallel batches, and commit. Unfortunately, this code causes a deadlock in the Consumer on commit:

    val consumerLayer: ZLayer[Kafka, Throwable, Consumer] =
      ZLayer.scoped(ZIO.serviceWithZIO[Kafka] { kafka =>
        Consumer.make(
          ConsumerSettings(kafka.bootstrapServers)
            .withGroupId("group1")
            .withOffsetRetrieval(Consumer.OffsetRetrieval.Auto(AutoOffsetStrategy.Earliest))
        )
      })

...

Producer.produceChunk(Chunk.fromIterable(1 to 1000).map(n => new ProducerRecord(topic, n, n.toString)), Serde.int, Serde.string) *>
  Consumer.plainStream(Subscription.topics(topic), Serde.int, Serde.string)
    .take(100)
    .groupedWithin(10, 100.millis)
    .mapZIOPar(2)(c => ZIO.debug(c.size) as c.map(_.offset))
    .map(OffsetBatch.apply)
    .debug("Offset")
    .mapZIO(_.commit)
    .debug("Commit")
    .runDrain zipPar Fiber.dumpAll.delay(20.seconds)

Replacing groupedWithin with grouped makes this code work. Alternatively, I can leave groupedWithin in place and remove take: then no deadlock occurs, but I need to terminate the stream after exactly n records. It looks like a race condition on stream termination with take. Tested with zio-kafka 2.3.0, Scala 2.13.10, and embedded Kafka.
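
For reference, the variant that does not deadlock swaps only the grouping operator:

    Consumer.plainStream(Subscription.topics(topic), Serde.int, Serde.string)
      .take(100)
      .grouped(10) // count-based grouping instead of the time-based groupedWithin
      .mapZIOPar(2)(c => ZIO.debug(c.size) as c.map(_.offset))
      .map(OffsetBatch.apply)
      .mapZIO(_.commit)
      .runDrain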

@erikvanoosten added the bug label on May 19, 2023
@guizmaii
Member

Thanks for your report @vladimirkl 🙂

I'm wondering if this is a bug in zio-kafka or in zio-streams 🤔
For example, replacing groupedWithin with grouped, as you did, has nothing to do with zio-kafka.
Why do you think it comes from zio-kafka?

Sorry for the questions. I'm trying to better understand your issue.

@vladimirkl
Contributor Author

It may be a zio-streams issue, but I have never encountered it without zio-kafka, even though I have similar aggregations in other places. It locks exactly on commit. If you wait some time after the deadlock, you will see a RunloopTimeout exception. Without the commit, take terminates the stream correctly.

@vladimirkl
Contributor Author

I even tried replacing commit with something non-interruptible like ZIO.attemptBlocking(Thread.sleep(2000)), and it works correctly. So it's very likely a Consumer issue.
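
Concretely, with the commit swapped out, the pipeline runs to completion:

    // Replacing the commit with a non-interruptible blocking sleep:
    // no deadlock, which points at the commit itself.
    .mapZIO(_ => ZIO.attemptBlocking(Thread.sleep(2000)))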

@guizmaii
Member

Thanks for the additional details :)

@erikvanoosten
Collaborator

erikvanoosten commented May 21, 2023

I used this program to test the issue: https://gist.github.com/erikvanoosten/5e9f34d8ff43de32b583c021c858e309

This problem is caused by an immediate unsubscribe after the stream ends (see https://github.com/zio/zio-kafka/blob/master/zio-kafka/src/main/scala/zio/kafka/consumer/Consumer.scala#L254). Since we are no longer subscribed, we no longer poll. When we stop polling, the commit callbacks are never invoked and all progress halts.

Only programs that do not consume the entire stream (the example here uses take()) are affected by this bug.

Potential solutions:

  • Delay the unsubscribe until there are no more pending commits (a rough sketch of this option follows the list).
  • Upon unsubscribe, cancel all pending commits.
  • Wait for Await commits during revoke #830 and use the new rebalanceSafeCommits mode. With this mode enabled we await commit completion during partition revocation. An unsubscribe also causes such a revocation, so this program would work fine with the mode enabled. (Note that this PR needs an as-of-yet unreleased Kafka, at least 3.6.0 if all goes well.)
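
A rough sketch of the first option. All names here (pendingCommits, doPoll, doUnsubscribe) are hypothetical stand-ins for the actual Runloop internals, not real zio-kafka API:

    import zio._

    // Keep polling (so commit callbacks can still fire) until every pending
    // commit has completed, and only then unsubscribe.
    def unsubscribeWhenCommitsDone(
      pendingCommits: Ref[Int],   // commits still awaiting their callback
      doPoll: Task[Unit],         // one poll() round; may complete callbacks
      doUnsubscribe: Task[Unit]
    ): Task[Unit] =
      pendingCommits.get.flatMap {
        case 0 => doUnsubscribe
        case _ => doPoll *> unsubscribeWhenCommitsDone(pendingCommits, doPoll, doUnsubscribe)
      }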

@svroonland WDYT?

@vladimirkl
Contributor Author

Why does the same code work with grouped instead of groupedWithin? Perhaps the order of finalization is different?

@erikvanoosten
Collaborator

Why does the same code work with grouped instead of groupedWithin? Perhaps the order of finalization is different?

groupedWithin is time-based. This causes a fiber split: everything above the groupedWithin runs on a different fiber than what follows. I suspect this causes the unsubscribe to happen fractionally later than with grouped.
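
A minimal, Kafka-free sketch that makes the split visible: with groupedWithin in the pipeline, the upstream and downstream taps print different fiber ids.

    import zio._
    import zio.stream._

    object FiberSplitDemo extends ZIOAppDefault {
      val run =
        ZStream
          .fromIterable(1 to 10)
          .tap(_ => ZIO.fiberId.flatMap(id => Console.printLine(s"upstream:   $id")))
          .groupedWithin(3, 100.millis)
          .tap(_ => ZIO.fiberId.flatMap(id => Console.printLine(s"downstream: $id")))
          .runDrain
    }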

@vladimirkl
Contributor Author

Things are much worse with zio-streams 2.0.14. An even simpler example hangs forever:

    Consumer.plainStream(Subscription.topics(topic), Serde.int, Serde.string)
      .take(100)
      .map(_.offset)
      .aggregateAsync(Consumer.offsetBatches)
      .debug("Offset")
      .mapZIO(_.commit)
      .debug("Commit")
      .runDrain

This code works perfectly with zio-streams 2.0.13. Unfortunately, take has become dangerous for zio-kafka streams in general.

@svroonland
Collaborator

@erikvanoosten I believe your analysis is correct: the take ends the stream and unsubscribes. I have had similar issues with take and finalizer race conditions in zio-kinesis.

Regarding the possible solutions: if it's a race condition, then I'm not sure we always have pending commits to await before unsubscribing.

We could look into a usage pattern that uses graceful shutdown to end the stream but keep the subscription.
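
As a very rough sketch of that direction (assuming Consumer.stopConsumption is usable for this; the record counting via zipWithIndex is only illustrative):

    // Sketch: request a graceful stop after 100 records instead of cutting
    // the stream off with take(100).
    Consumer.plainStream(Subscription.topics(topic), Serde.int, Serde.string)
      .zipWithIndex
      .tap { case (_, i) => Consumer.stopConsumption.when(i >= 99) }
      .map { case (record, _) => record.offset }
      .aggregateAsync(Consumer.offsetBatches)
      .mapZIO(_.commit)
      .runDrain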

@guizmaii
Member

guizmaii commented Jun 3, 2023

@svroonland Have a look at #890. The issue seems to be with commitAsync.

@svroonland
Collaborator

As in, commitAsync requires poll calls to complete? Yeah

@vladimirkl
Contributor Author

I encountered a few more issues with hanging commits in other scenarios: when the broker dies, the KafkaConsumer dies too, but the async commit hangs. I compared this with the fs2-kafka implementation: it behaves similarly, but it uses a timeout for the commit operation (15 seconds by default). A zio-kafka user can add timeouts to commits everywhere in their code, but I think it's a good idea to add a default timeout for safety, similar to fs2-kafka. I can create a PR. What do you think?
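
For example, a sketch of what user code can do today (the 15-second default mirrors fs2-kafka; TimeoutException is just an illustrative error type):

    import java.util.concurrent.TimeoutException
    import zio._
    import zio.kafka.consumer.OffsetBatch

    // Fail the commit instead of letting it hang forever.
    def commitOrTimeout(batch: OffsetBatch): Task[Unit] =
      batch.commit.timeoutFail(new TimeoutException("commit timed out"))(15.seconds)

    // usage: .mapZIO(commitOrTimeout) instead of .mapZIO(_.commit)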

@guizmaii
Member

I can create a PR. What do you think?

Please do :)

@guizmaii
Member

guizmaii commented Aug 5, 2023

Isn't this issue fixed by #982? Can we close it?

@vladimirkl
Contributor Author

Not sure. A timeout is definitely better than a deadlock, but the original issue with async grouping still remains. However, if we cannot handle it all, we can close this for now.

@erikvanoosten
Collaborator

erikvanoosten commented Nov 29, 2023

As of zio-kafka 2.7.0 there should be no more deadlocks while committing. See #1109.

In addition, since zio-kafka 2.7.1 there is a workaround. See #1123.
