OS Platform and Distribution (e.g., Linux Ubuntu 20.0): Mariner 5.15.111.1-1.cm2
JDK version: 17
Describe the problem
Partitions move to the error state when a ZooKeeper (zk) update is interrupted unexpectedly. We need a strategy that makes this more resilient: temporary disconnects should be tolerable. Exception trace below:
Tracking information
2023/07/10 17:49:59.569 ERROR [IngestionNotificationDispatcher for [ Topic: venice_system_store_participant_store_cluster_cert-1_v196 ] ] [venice-shared-consumer-for-kafka.venice.kafka.ei-ltx1.atd.stg.linkedin.com:16637-t1] [venice-server-war] [] Error reporting status to notifier class com.linkedin.davinci.notifier.PushMonitorNotifier
com.linkedin.venice.exceptions.ZkDataAccessException: Can not do operation:compare and update on path: /cert-1/OfflinePushes/venice_system_store_participant_store_cluster_cert-1_v196/2 after retry:3 times
at com.linkedin.venice.utils.HelixUtils.compareAndUpdate(HelixUtils.java:235) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.utils.HelixUtils.compareAndUpdate(HelixUtils.java:220) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.helix.VeniceOfflinePushMonitorAccessor.compareAndUpdateReplicaStatus(VeniceOfflinePushMonitorAccessor.java:290) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.helix.VeniceOfflinePushMonitorAccessor.updateReplicaStatus(VeniceOfflinePushMonitorAccessor.java:250) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.davinci.notifier.PushMonitorNotifier.started(PushMonitorNotifier.java:48) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.notifier.VeniceNotifier.started(VeniceNotifier.java:17) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.lambda$reportStarted$1(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:73) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:96) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.reportStarted(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.lambda$reportStarted$1(StatusReportAdapter.java:84) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.maybeReportStatus(StatusReportAdapter.java:234) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.recordSubPartitionStatus(StatusReportAdapter.java:220) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:141) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:132) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.reportStarted(StatusReportAdapter.java:84) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.processStartOfPush(StoreIngestionTask.java:2290) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.produceToStoreBufferServiceOrKafka(StoreIngestionTask.java:991) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:75) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:17) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.ConsumptionTask.run(ConsumptionTask.java:143) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023/07/10 17:49:59.569 WARN [ZKHelixManager] [venice-shared-consumer-for-kafka.venice.kafka.ei-ltx1.atd.stg.linkedin.com:16637-t1] [venice-server-war] [] zkClient to zk-ltx1-venice.stg.linkedin.com:2622/venice is not connected, wait for 10000ms.
2023/07/10 17:49:59.569 ERROR [IngestionNotificationDispatcher for [ Topic: venice_system_store_participant_store_cluster_cert-1_v196 ] ] [venice-shared-consumer-for-kafka.venice.kafka.ei-ltx1.atd.stg.linkedin.com:16637-t1] [venice-server-war] [] Error reporting status to notifier class com.linkedin.davinci.notifier.PartitionPushStatusNotifier
org.apache.helix.zookeeper.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
at org.apache.helix.zookeeper.zkclient.ZkClient.acquireEventLock(ZkClient.java:1942) ~[org.apache.helix.helix-common-1.0.4.jar:?]
at org.apache.helix.zookeeper.zkclient.ZkClient.waitForKeeperState(ZkClient.java:1919) ~[org.apache.helix.helix-common-1.0.4.jar:?]
at org.apache.helix.zookeeper.zkclient.ZkClient.waitUntilConnected(ZkClient.java:1910) ~[org.apache.helix.helix-common-1.0.4.jar:?]
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:411) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at org.apache.helix.manager.zk.ZKHelixManager.getHelixDataAccessor(ZKHelixManager.java:681) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at org.apache.helix.customizedstate.CustomizedStateProvider.updateCustomizedState(CustomizedStateProvider.java:67) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at org.apache.helix.customizedstate.CustomizedStateProvider.updateCustomizedState(CustomizedStateProvider.java:58) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at com.linkedin.venice.helix.HelixPartitionStateAccessor.updateReplicaStatus(HelixPartitionStateAccessor.java:34) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.helix.HelixPartitionStatusAccessor.updateReplicaStatus(HelixPartitionStatusAccessor.java:31) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.davinci.notifier.PartitionPushStatusNotifier.started(PartitionPushStatusNotifier.java:20) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.notifier.VeniceNotifier.started(VeniceNotifier.java:17) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.lambda$reportStarted$1(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:73) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:96) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.reportStarted(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.lambda$reportStarted$1(StatusReportAdapter.java:84) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.maybeReportStatus(StatusReportAdapter.java:234) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.recordSubPartitionStatus(StatusReportAdapter.java:220) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:141) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:132) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.reportStarted(StatusReportAdapter.java:84) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.processStartOfPush(StoreIngestionTask.java:2290) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.produceToStoreBufferServiceOrKafka(StoreIngestionTask.java:991) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:75) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:17) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.ConsumptionTask.run(ConsumptionTask.java:143) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.ReentrantLock$Sync.lockInterruptibly(ReentrantLock.java:159) ~[?:?]
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:372) ~[?:?]
at org.apache.helix.zookeeper.zkclient.ZkClient.acquireEventLock(ZkClient.java:1940) ~[org.apache.helix.helix-common-1.0.4.jar:?]
... 30 more
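One possible shape for the resilience asked for above, as a minimal sketch only. The first trace shows `HelixUtils.compareAndUpdate` giving up after 3 retries; whether those retries wait between attempts is not visible from the log. The sketch wraps a status update in a retry loop with capped exponential backoff, so a disconnect shorter than the total backoff window would not fail the update. `RetryingReporter` and `callWithRetry` are hypothetical names used for illustration, not Venice or Helix APIs, and the simulated failure in `main` stands in for a real ZK call. Whether an interrupt should itself be retried is left open here: the sketch stops retrying and restores the interrupt flag if the thread is interrupted while waiting.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper, not part of Venice or Helix.
class RetryingReporter {

  /**
   * Runs {@code op}, retrying on RuntimeException with capped exponential backoff.
   * Rethrows the last failure once {@code maxAttempts} is exhausted, or as soon as
   * the calling thread is interrupted while waiting between attempts.
   */
  static <T> T callWithRetry(Supplier<T> op, int maxAttempts, long initialDelayMs, long maxDelayMs) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long delayMs = initialDelayMs;
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.get();
      } catch (RuntimeException e) {
        last = e;
        if (attempt == maxAttempts) {
          break; // no point sleeping after the final attempt
        }
        try {
          Thread.sleep(delayMs);
        } catch (InterruptedException ie) {
          // Preserve the interrupt for callers and stop retrying.
          Thread.currentThread().interrupt();
          break;
        }
        // Double the wait, but never exceed the cap.
        delayMs = Math.min(delayMs * 2, maxDelayMs);
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    // Simulated status update: fails twice, then succeeds.
    String result = callWithRetry(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new IllegalStateException("simulated transient ZK disconnect");
      }
      return "reported";
    }, 5, 10L, 100L);
    System.out.println(result + " after " + calls.get() + " attempts");
  }
}
```

With the values in `main`, the update succeeds on the third attempt after waiting 10 ms and then 20 ms. In a real deployment the attempt count and delays would need to be sized against the ZK session timeout, which this sketch does not model.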
Code to reproduce bug
No response
What component(s) does this bug affect?
Controller: This is the control-plane for Venice. Used to create/update/query stores and their metadata.
Router: This is the stateless query-routing layer for serving read requests.
Server: This is the component that persists all the store data.
VenicePushJob: This is the component that pushes derived data from Hadoop to Venice backend.
VenicePulsarSink: This is a Sink connector for Apache Pulsar that pushes data from Pulsar into Venice.
Thin Client: This is a stateless client users use to query Venice Router for reading store data.
Fast Client: This is a stateful client users use to query Venice Server for reading store data.
Da Vinci Client: This is an embedded, stateful client that materializes store data locally.
Alpini: This is the framework that fast-client and routers use to route requests to the storage nodes that have the data.
Samza: This is the library users use to make nearline updates to store data.
Admin Tool: This is the stand-alone client used for ad-hoc operations on Venice.
Scripts: These are the various ops scripts in the repo.
Willingness to contribute
No. I cannot contribute a bug fix at this time.
Venice version
0.4.139