Scale out Clickhouse to a multinode cluster #3494

Merged (69 commits) on Sep 5, 2023

Conversation

karencfv
Contributor

@karencfv karencfv commented Jul 5, 2023

Replication

This PR implements an initial ClickHouse setup with 2 replicas and 3 coordinator (ClickHouse Keeper) nodes.

I've settled on this initial lean architecture because I want to avoid cluttering the deployment with potentially unnecessary nodes and using up our customers' resources. As we gauge the system alongside our first customers, we can decide whether we really need more replicas. Inserting an additional replica is very straightforward: we only need to make a few changes to the templates/service count and restart the ClickHouse services.
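
As a quick sanity check once the zones are up, the cluster layout can be inspected from any ClickHouse server node. This is a minimal sketch; oximeter_cluster is the cluster name used throughout this PR, and the column subset is assumed to be available in this ClickHouse version:

-- List the shards and replicas ClickHouse knows about for our cluster.
SELECT cluster, shard_num, replica_num, host_name, is_local
FROM system.clusters
WHERE cluster = 'oximeter_cluster';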

Sharding

Sharding can prove to be very resource intensive, and we have yet to fully understand our customers' needs. I'd like to avoid prematurely optimising when we have so many unknowns, and we have not yet had time to perform long-running testing. See the official ClickHouse recommendations.

Like additional replicas, we can add shards later if we find them necessary.

Testing

I have left most tests as a single-node setup; it feels unnecessary to spin up so many things constantly. If people disagree, I can modify this.

I have run many, many manual tests, starting and stopping services, and so far the setup has held up.

An Omicron build with these changes is currently running on sn21, so please feel free to poke and prod it :)
UPDATE: The build is borked because some things changed recently and I can't get Omicron running on the gimlet the way I used to :(

NB: I wasn't able to get Nexus running on the gimlet :P sorry. I tried and failed miserably.

Using a ClickHouse client:

root@oxz_clickhouse_af08dce0-41ce-4922-8d51-0f546f23ff3e:~# ifconfig
<redacted>
oxControlService13:1: flags=21002000841<UP,RUNNING,MULTICAST,IPv6,FIXEDMTU> mtu 9000 index 2
	inet6 fd00:1122:3344:101::f/64 
root@oxz_clickhouse_af08dce0-41ce-4922-8d51-0f546f23ff3e:~# cd /opt/oxide/clickhouse/
root@oxz_clickhouse_af08dce0-41ce-4922-8d51-0f546f23ff3e:/opt/oxide/clickhouse# ./clickhouse client --host fd00:1122:3344:101::f
ClickHouse client version 22.8.9.1.
Connecting to fd00:1122:3344:101::f:9000 as user default.
Connected to ClickHouse server version 22.8.9 revision 54460.

oximeter_cluster node 2 :) SELECT * FROM oximeter.fields_i64

SELECT *
FROM oximeter.fields_i64

Query id: dedbfbba-d949-49bd-9f9c-0f81a1240798

┌─timeseries_name───┬───────timeseries_key─┬─field_name─┬─field_value─┐
│ data_link:enabled │  9572423277405807617 │ link_id    │           0 │
│ data_link:enabled │ 12564290087547100823 │ link_id    │           0 │
│ data_link:enabled │ 16314114164963669893 │ link_id    │           0 │
│ data_link:link_up │  9572423277405807617 │ link_id    │           0 │
│ data_link:link_up │ 12564290087547100823 │ link_id    │           0 │
│ data_link:link_up │ 16314114164963669893 │ link_id    │           0 │
└───────────────────┴──────────────────────┴────────────┴─────────────┘

6 rows in set. Elapsed: 0.003 sec. 
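
Replication health can also be checked from either replica with a quick query against system.replicas. This is a sketch; it applies to the replicated build, where the oximeter tables use Replicated* engines:

-- Per-table replication state as seen from this node.
SELECT database, table, is_leader, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'oximeter';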

To retrieve information about the keepers, you can use the provided commands within each of the keeper zones.

Example:

root@oxz_clickhouse_keeper_9b70b23c-a7c4-4102-a7a1-525537dcf463:~# ifconfig
<redacted>
oxControlService17:1: flags=21002000841<UP,RUNNING,MULTICAST,IPv6,FIXEDMTU> mtu 9000 index 2
	inet6 fd00:1122:3344:101::11/64 
root@oxz_clickhouse_keeper_9b70b23c-a7c4-4102-a7a1-525537dcf463:~# echo mntr | nc fd00:1122:3344:101::11 9181
zk_version	v22.8.9.1-lts-ac5a6cababc9153320b1892eece62e1468058c26
zk_avg_latency	0
zk_max_latency	1
zk_min_latency	0
zk_packets_received	60
zk_packets_sent	60
zk_num_alive_connections	1
zk_outstanding_requests	0
zk_server_state	leader
zk_znode_count	6
zk_watch_count	1
zk_ephemerals_count	0
zk_approximate_data_size	1271
zk_key_arena_size	4096
zk_latest_snapshot_size	0
zk_followers	2
zk_synced_followers	2
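
The same coordination state can also be inspected from a ClickHouse server node over SQL. A sketch, assuming the server's keeper configuration is in place; system.zookeeper requires a path filter and should work against ClickHouse Keeper as well, since it speaks the ZooKeeper protocol:

-- List the znodes registered under /clickhouse on the keepers.
SELECT name, numChildren
FROM system.zookeeper
WHERE path = '/clickhouse';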

Follow-up work

  • Write up a troubleshooting document
  • Implement zone resource quotas?
  • We'll have to have some sort of "ops" APIs for maintenance tasks at some point.
  • Generate config templates dynamically.

Closes: #2158

Update

As mentioned previously, we agreed at the last control plane meeting that the software installed on the racks should not diverge because of replicated ClickHouse. This means that while ClickHouse replication is functional in this PR, it has been disabled in the last commit in the following manner:

  • The method_script.sh for the clickhouse service is set to run in single-node mode by default, but can be switched to replicated mode by setting a variable (single_node) to false. When we migrate all racks to a replicated ClickHouse setup, all logic related to running on a single node will be removed from that file.
  • The number of zones defined through RSS will stay the same. Instructions on how to tweak them to launch in replicated mode have been left as comments.

Testing

I ran the full CI testing suite in both replicated and single-node modes. You can find the replicated test results here, and the single-node (replication disabled) results here.

Additionally, I have added tests that validate the replicated db_init file here, and incorporated checks into existing tests that validate whether a ClickHouse instance is part of a cluster or not.

Next steps

To keep this PR compact (if you can call 2,000 lines compact), I have created several issues from the review comments to tackle after this PR is merged. In prioritised order, these are:

@karencfv karencfv added the database label (Related to database access) on Jul 5, 2023
@karencfv
Contributor Author

Found a bug :'( oxidecomputer/garbage-compactor#12

@karencfv karencfv marked this pull request as ready for review August 2, 2023 12:53
@karencfv karencfv requested review from smklein and bnaecker August 2, 2023 12:54

@karencfv karencfv requested a review from bnaecker August 31, 2023 01:28
schema/crdb/4.0.0/up.sql (outdated review thread, resolved)
Collaborator

@smklein smklein left a comment

I'm stoked we could get tests for this, and that we got the deployment job to pass.

This looks solid - I have a couple comments below, but the structure of this makes sense to me. Thanks for pushing through this work -- it's hard enough by itself, but harder on a moving target where we're supporting existing platforms.

I'd check in with @bnaecker to see if he has additional feedback before merging, but modulo my last set of comments, this LGTM!

@@ -0,0 +1,24 @@
-- CRDB documentation recommends the following:
Collaborator

FWIW in my original recommendation, I figured it would have been the following, within a single file named 4.0.0/up.sql, but what you have done works too.

BEGIN;
SELECT CAST(
    IF(
        (
            SELECT version = '3.0.3' and target_version = '4.0.0'
            FROM omicron.public.db_metadata WHERE singleton = true
        ),
        'true',
        'Invalid starting version for schema change'
    ) AS BOOL
);

ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;

BEGIN;
SELECT CAST(
    IF(
        (
            SELECT version = '3.0.3' and target_version = '4.0.0'
            FROM omicron.public.db_metadata WHERE singleton = true
        ),
        'true',
        'Invalid starting version for schema change'
    ) AS BOOL
);

ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;

Contributor Author

thnx!

Collaborator

@bnaecker bnaecker left a comment

Thanks for all the work here @karencfv, this is really a monster PR. I've a few questions about how we're actually going to roll this out, and what can go wrong when we do. Those are mostly not that big a deal, given that it seems like there is some time before we'll actually deploy this.

One bigger question is around whether we should use distributed tables now. I think it would be a good idea, but it's also possible we can defer it a little bit. There's a lot of work here that is independent of the exact DB organization we end up deploying. But I think we want to revisit that question prior to actually pulling the trigger, to minimize the chances of making another breaking change in the future.

ORDER BY (timeseries_name, timeseries_key, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;
--
CREATE TABLE IF NOT EXISTS oximeter.measurements_f64 ON CLUSTER oximeter_cluster
Collaborator

So for the measurement tables, we should consider making these Distributed now.

Here's my thinking. It seems like we're probably going to just lose some data when we make the change to the replicated tables. I'd like to avoid that now, but I think we need to avoid that in the future. So, perhaps we could make each of these tables the "local" version, e.g., append _local to the name, and then create a table with these names that's actually Distributed.

For example:

CREATE TABLE IF NOT EXISTS oximeter.measurements_f64_local ON CLUSTER oximeter_cluster
(
 ...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{share}/measurements_f64_local', '{replica}')
...

And then later:

CREATE TABLE IF NOT EXISTS oximeter.measurements_f64 ON CLUSTER oximeter_cluster
(
    ...
)
ENGINE = Distributed('oximeter_cluster', 'oximeter', 'measurements_f64_local', timeseries_name)
...

The advantages of this are (see the usage sketch after this list):

  • if or when we do get to sharding the data, this will happen without changes to the SQL. New data will be automatically distributed among the new shards. We can reshard if we want to separately.
  • queries also always query all the data, and take advantage of the sharding to split the workload
  • we don't need to load-balance ourselves, ClickHouse does that
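
For illustration, a rough sketch of how the two layers would be addressed once the wrapper exists (table names as proposed above; inserts into the Distributed table get routed to the shards' _local tables, and reads against it fan out across all shards):

-- Reads against the Distributed table fan out to every shard's local table.
SELECT count() FROM oximeter.measurements_f64;

-- The replicated data itself lives in the *_local table on each node.
SELECT count() FROM oximeter.measurements_f64_local;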

Contributor Author

That's a great idea! I've made the change

# TEMPORARY: Racks will be set up with single node ClickHouse until
# Nexus provisions services so there is no divergence between racks
# https://github.com/oxidecomputer/omicron/issues/732
single_node=true
Collaborator

How will this be changed? We'll make a commit which changes it?

At that point, when we boot this new zone...what happens? There already is a database oximeter. Now, we can delete that ourselves manually, by just removing the files. But if we don't do that, we'll then run CREATE DATABASE oximeter ON CLUSTER oximeter_cluster. Does that conflict? When we create those tables, do those conflict?

Contributor Author

I've been mulling it over, but tbh I can't say I have a clear answer as to which steps we'll take for the migration (or perhaps we remove everything and start from scratch?). I was planning on having that discussion here -> #4000. I think a lot of it also depends on "how" Nexus will provision services.

If we are able to perform a one-off job for the migration, we may be able to use remote() after renaming the old tables, or something like that.
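
For the record, a rough sketch of what such a one-off copy could look like (the _old table name and the source address are hypothetical, and assume the existing single-node tables are renamed first):

-- Hypothetical backfill: copy rows from the renamed single-node table into
-- the new replicated local table via the remote() table function.
INSERT INTO oximeter.measurements_f64_local
SELECT * FROM remote('[fd00:1122:3344:101::f]:9000', 'oximeter', 'measurements_f64_old');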

Comment on lines +151 to +153
// There seems to be a bug using ipv6 with a replicated set up
// when installing all servers and coordinator nodes on the same
// server. For this reason we will be using ipv4 for testing.
Collaborator

What's the nature of this bug? Will it affect us if we happen to deploy two replicas on the same host?

Contributor Author

It only appears when using localhost on the same server. It shouldn't be a problem for any rack installation.

test-utils/src/dev/clickhouse.rs (outdated review thread, resolved)
.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.env("CLICKHOUSE_WATCHDOG_ENABLE", "0")
Collaborator

We should probably do .env_clear() here to be safe.

Contributor Author

Unfortunately that removes the environment variables just before the config files are populated with them, so I had to leave it as is.

.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.env("CLICKHOUSE_WATCHDOG_ENABLE", "0")
Collaborator

We should probably do .env_clear() here to be safe.

Contributor Author

same

@karencfv
Contributor Author

karencfv commented Sep 1, 2023

Thanks for taking the time to review, @smklein @bnaecker! I know this isn't a very easy PR to review.

@bnaecker I've addressed your comments. On the matter of implementing distributed tables, I think it's a great idea! I've changed the SQL file with your suggestions, and we can refine things if needed in #3982, which is what I'm planning to tackle next.

@karencfv
Contributor Author

karencfv commented Sep 2, 2023

sigh

I am really at a loss as to why the build-and-test (ubuntu-20.04) test is failing with:

--- STDERR:              oximeter-db client::tests::test_build_replicated ---
--
5282 | 2023-09-02T00:36:53.298Z | thread 'client::tests::test_build_replicated' panicked at 'Failed to initialize timeseries
 database: Database("Code: 62. DB::Exception: There was an error on [127.0.0.1:9000]: Code: 62. DB::Exception: No macro
 'share' in config while processing substitutions in '/clickhouse/tables/{share}/measurements_i64_local' at '20' or macro is
 not supported here. (SYNTAX_ERROR) (version 22.8.9.24 (official build)). (SYNTAX_ERROR) (version 22.8.9.24 (official 
build))\n")', oximeter/db/src/client.rs:761:14

On atrium the test runs just fine:

karen@atrium ~/src/omicron $ cargo nextest run oximeter-db client::tests::test_build_replicated --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 2.56s
    Starting 1 test across 94 binaries (824 skipped)
       START             oximeter-db client::tests::test_build_replicated

running 1 test
test client::tests::test_build_replicated ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 48 filtered out; finished in 15.33s

        PASS [  15.367s] oximeter-db client::tests::test_build_replicated
------------
     Summary [  15.370s] 1 test run: 1 passed, 824 skipped

@karencfv
Contributor Author

karencfv commented Sep 2, 2023

gargh, ran this on Linux and it passes as well!!! I don't know what's going on here

coatlicue@pop-os:~/src/omicron$ cargo nextest run oximeter-db client::tests::test_build_replicated --nocapture
   Compiling serde_json v1.0.105
<...>
    Finished test [unoptimized + debuginfo] target(s) in 8m 40s
    Starting 1 test across 94 binaries (811 skipped)
       START             oximeter-db client::tests::test_build_replicated

running 1 test
test client::tests::test_build_replicated ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 48 filtered out; finished in 10.76s

        PASS [  10.768s] oximeter-db client::tests::test_build_replicated
------------
     Summary [  10.769s] 1 test run: 1 passed, 811 skipped

Anybody have any ideas on what I should do about this?

@bnaecker
Collaborator

bnaecker commented Sep 3, 2023

--- STDERR:              oximeter-db client::tests::test_build_replicated ---
--
5282 | 2023-09-02T00:36:53.298Z | thread 'client::tests::test_build_replicated' panicked at 'Failed to initialize timeseries
 database: Database("Code: 62. DB::Exception: There was an error on [127.0.0.1:9000]: Code: 62. DB::Exception: No macro
 'share' in config while processing substitutions in '/clickhouse/tables/{share}/measurements_i64_local' at '20' or macro is
 not supported here. (SYNTAX_ERROR) (version 22.8.9.24 (official build)). (SYNTAX_ERROR) (version 22.8.9.24 (official 
build))\n")', oximeter/db/src/client.rs:761:14

This says there is no macro named share. Indeed, if you look at the configuration file, the macro is named shard. You'll need to change the interpolated variable here (and the other similar table engine definitions) to refer to the shard macro.

I think it's worth understanding why the tests pass on atrium in this case. Presumably the files are the same, so let's track down why that's not failing. Is the macro share defined somewhere in that test case? It also seems plausible that the files (either XML or SQL) are slightly different when run on a native host (atrium or a Linux machine). Perhaps those environments have some macro share defined in them somehow, or the SQL file is different?
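
In other words, the fix is to rename the interpolated macro in each affected engine definition (a sketch, using the table from the error above):

-- Before: refers to a 'share' macro that the configuration file never defines.
-- ENGINE = ReplicatedMergeTree('/clickhouse/tables/{share}/measurements_i64_local', '{replica}')
-- After: uses the 'shard' macro that the configuration actually defines.
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/measurements_i64_local', '{replica}')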

@karencfv
Contributor Author

karencfv commented Sep 5, 2023

Thanks @bnaecker! It was really strange because I couldn't replicate that error on a Linux machine either 🤷‍♀️. All the configuration and SQL files are the same.

Made the change, keeping my fingers crossed!

@karencfv
Contributor Author

karencfv commented Sep 5, 2023

Woohoo! I'm still not sure why the previous run passed in both of my local environments (Helios/Linux) but not in Ubuntu CI. I'm going to look deeper into this and improve testing as part of #4001.

Labels
database (Related to database access), Metrics
Development

Successfully merging this pull request may close these issues.

Initialize multi-node ClickHouse rather than single-node
3 participants