I'm using clickhouse-operator version 0.24.0 and I've encountered the following issue:
When applying a new change to a ClickHouseKeeper cluster, the operator does not ensure that a ClickHouseKeeper pod is Running and Ready before proceeding with the restart of the next pod (it moves on even though the previous pod is still being created).
Let's look at the status of the pods after I changed the ClickHouseKeeperInstallation:
Cluster is applying the new change:
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Terminating 0 6m10s
chk-extended-cluster1-0-1-0 1/1 Running 0 6m10s
chk-extended-cluster1-0-2-0 1/1 Running 0 5m25s
(...)
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 0/1 ContainerCreating 0 65s
chk-extended-cluster1-0-1-0 1/1 Running 0 7m19s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m34s
(...)
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 78s
chk-extended-cluster1-0-1-0 1/1 Running 0 7m32s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m47s
So far so good, but let's see what happens next:
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 81s
chk-extended-cluster1-0-1-0 1/1 Terminating 0 7m35s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m50s
(...)
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 82s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m51s
And here is the problem:
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 86s
chk-extended-cluster1-0-1-0 0/1 ContainerCreating 0 1s
chk-extended-cluster1-0-2-0 1/1 Terminating 0 6m55s
As you can see, pod chk-extended-cluster1-0-1-0 is still in the ContainerCreating state, but the operator has already decided to terminate pod chk-extended-cluster1-0-2-0.
This caused the cluster to briefly lose quorum, which ClickHouse did not like, resulting in the following error:
"error": "(CreateMemoryTableQueryOnCluster) Error when executing query: code: 999, message: All connection tries failed while connecting to ZooKeeper. nodes: 10.233.71.16:9181, 10.233.81.20:9181, 10.233.70.35:9181\nCode: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or is stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.8.2.3 (official build)), 10.233.71.16:9181\nPoco::Exception. Code: 1000, e.code() = 111, Connection refused (version 24.8.2.3 (official build))
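To make the quorum arithmetic concrete: Keeper uses Raft, so a 3-node ensemble needs a strict majority of 2 live members. With chk-extended-cluster1-0-1-0 still in ContainerCreating and 0-2-0 terminating, only one member is up, so quorum is lost. A minimal sketch (hypothetical helper names, not the operator's actual code) of the readiness gate I would expect before each termination:

```python
# Sketch of the check the operator could apply before terminating the
# next Keeper pod during a rolling restart. Function names are
# illustrative only, not taken from clickhouse-operator.

def quorum_size(ensemble_size: int) -> int:
    """Raft quorum: a strict majority of the ensemble."""
    return ensemble_size // 2 + 1

def safe_to_restart_next(ready_pods: int, ensemble_size: int) -> bool:
    """Terminating one more pod must still leave a quorum of ready members."""
    return ready_pods - 1 >= quorum_size(ensemble_size)

# The situation from the pod listings above: a 3-node ensemble where
# chk-extended-cluster1-0-1-0 is still ContainerCreating (not ready).
assert quorum_size(3) == 2
assert safe_to_restart_next(3, 3) is True   # all ready: one restart is safe
assert safe_to_restart_next(2, 3) is False  # one pod creating: restarting another loses quorum
```

Had the operator waited for 0-1-0 to become Ready before terminating 0-2-0, two members would have stayed up at every step and the KEEPER_EXCEPTION above would not have occurred.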
I was expecting the ClickHouse Keeper cluster to apply new changes without any disruption to the ClickHouse cluster.
mandreasik changed the title from "Operator does not ensure that a ClickHouseKeeper pod is running before proceeding with the restart of another pod" to "Operator does not ensure that a clickhouse keeper pod is running before proceeding with the restart of another pod" on Dec 12, 2024