Scale out Clickhouse to a multinode cluster #3494

Merged (69 commits) on Sep 5, 2023

Conversation

karencfv
Contributor

@karencfv karencfv commented Jul 5, 2023

Replication

This PR implements an initial ClickHouse setup with 2 replicas and 3 coordinator (ClickHouse Keeper) nodes.

I've settled on this initial lean architecture because I want to avoid cluttering the deployment with potentially unnecessary nodes and using up our customers' resources. As we gauge the system alongside our first customers, we can decide whether we really need more replicas. Inserting an additional replica is very straightforward: we only need to make a few changes to the templates/service count and restart the ClickHouse services.
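
As a quick sanity check once the zones are up, the cluster layout can be inspected from any ClickHouse server node. This is a minimal sketch; oximeter_cluster is the cluster name used throughout this PR, and the column subset is assumed to be available in this ClickHouse version:

-- List the shards and replicas ClickHouse knows about for our cluster.
SELECT cluster, shard_num, replica_num, host_name, is_local
FROM system.clusters
WHERE cluster = 'oximeter_cluster';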

Sharding

Sharding can prove to be very resource intensive, and we have yet to fully understand our customers' needs. I'd like to avoid prematurely optimising when we have so many unknowns, and we have not yet had time to perform long-running testing. See the official ClickHouse recommendations.

Like additional replicas, we can add shards later if we find them necessary.

Testing

I have left most tests as a single-node setup; it feels unnecessary to spin up so many things constantly. If people disagree, I can modify this.

I have run many, many manual tests, starting and stopping services, and so far the setup has held up.

An Omicron build with these changes is currently running on sn21, so please feel free to poke and prod it :)
UPDATE: The build is borked because some things changed recently and I can't get Omicron running on the gimlet the way I used to :(

NB: I wasn't able to get Nexus running on the gimlet :P sorry. I tried and failed miserably.

Using a ClickHouse client:

root@oxz_clickhouse_af08dce0-41ce-4922-8d51-0f546f23ff3e:~# ifconfig
<redacted>
oxControlService13:1: flags=21002000841<UP,RUNNING,MULTICAST,IPv6,FIXEDMTU> mtu 9000 index 2
	inet6 fd00:1122:3344:101::f/64 
root@oxz_clickhouse_af08dce0-41ce-4922-8d51-0f546f23ff3e:~# cd /opt/oxide/clickhouse/
root@oxz_clickhouse_af08dce0-41ce-4922-8d51-0f546f23ff3e:/opt/oxide/clickhouse# ./clickhouse client --host fd00:1122:3344:101::f
ClickHouse client version 22.8.9.1.
Connecting to fd00:1122:3344:101::f:9000 as user default.
Connected to ClickHouse server version 22.8.9 revision 54460.

oximeter_cluster node 2 :) SELECT * FROM oximeter.fields_i64

SELECT *
FROM oximeter.fields_i64

Query id: dedbfbba-d949-49bd-9f9c-0f81a1240798

┌─timeseries_name───┬───────timeseries_key─┬─field_name─┬─field_value─┐
│ data_link:enabled │  9572423277405807617 │ link_id    │           0 │
│ data_link:enabled │ 12564290087547100823 │ link_id    │           0 │
│ data_link:enabled │ 16314114164963669893 │ link_id    │           0 │
│ data_link:link_up │  9572423277405807617 │ link_id    │           0 │
│ data_link:link_up │ 12564290087547100823 │ link_id    │           0 │
│ data_link:link_up │ 16314114164963669893 │ link_id    │           0 │
└───────────────────┴──────────────────────┴────────────┴─────────────┘

6 rows in set. Elapsed: 0.003 sec. 
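
Replication health can also be checked from either replica with a quick query against system.replicas. This is a sketch; it applies to the replicated build, where the oximeter tables use Replicated* engines:

-- Per-table replication state as seen from this node.
SELECT database, table, is_leader, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'oximeter';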

To retrieve information about the keepers, you can use the provided commands within each of the keeper zones.

Example:

root@oxz_clickhouse_keeper_9b70b23c-a7c4-4102-a7a1-525537dcf463:~# ifconfig
<redacted>
oxControlService17:1: flags=21002000841<UP,RUNNING,MULTICAST,IPv6,FIXEDMTU> mtu 9000 index 2
	inet6 fd00:1122:3344:101::11/64 
root@oxz_clickhouse_keeper_9b70b23c-a7c4-4102-a7a1-525537dcf463:~# echo mntr | nc fd00:1122:3344:101::11 9181
zk_version	v22.8.9.1-lts-ac5a6cababc9153320b1892eece62e1468058c26
zk_avg_latency	0
zk_max_latency	1
zk_min_latency	0
zk_packets_received	60
zk_packets_sent	60
zk_num_alive_connections	1
zk_outstanding_requests	0
zk_server_state	leader
zk_znode_count	6
zk_watch_count	1
zk_ephemerals_count	0
zk_approximate_data_size	1271
zk_key_arena_size	4096
zk_latest_snapshot_size	0
zk_followers	2
zk_synced_followers	2
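
The same coordination state can also be inspected from a ClickHouse server node over SQL. A sketch, assuming the server's keeper configuration is in place; system.zookeeper requires a path filter and should work against ClickHouse Keeper as well, since it speaks the ZooKeeper protocol:

-- List the znodes registered under /clickhouse on the keepers.
SELECT name, numChildren
FROM system.zookeeper
WHERE path = '/clickhouse';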

Follow-up work

  • Write up a troubleshooting document
  • Implement zone resource quotas?
  • We'll have to have some sort of "ops" APIs for maintenance tasks at some point.
  • Generate config templates dynamically.

Closes: #2158

Update

As mentioned previously, we agreed at the last control plane meeting that the software installed on the racks should not diverge because of replicated ClickHouse. This means that while ClickHouse replication is functional in this PR, it has been disabled in the last commit in the following manner:

  • The method_script.sh for the clickhouse service is set to run in single-node mode by default, but can be switched to replicated mode by setting a variable (single_node) to false. When we migrate all racks to a replicated ClickHouse setup, all logic related to running on a single node will be removed from that file.
  • The number of zones defined through RSS will stay the same. Instructions on how to tweak them to launch in replicated mode have been left as comments.

Testing

I ran the full CI testing suite in both replicated and single-node modes. You can find the replicated test results here, and the single-node (replication disabled) results here.

Additionally, I have added tests that validate the replicated db_init file here, and incorporated checks into existing tests that validate whether a ClickHouse instance is part of a cluster or not.

Next steps

To keep this PR compact (if you can call 2,000 lines compact), I have created several issues from the review comments to tackle after this PR is merged. In prioritised order, these are:

@karencfv karencfv added the database label (Related to database access) on Jul 5, 2023
@karencfv
Contributor Author

Found a bug :'( oxidecomputer/garbage-compactor#12

@karencfv karencfv marked this pull request as ready for review August 2, 2023 12:53
@karencfv karencfv requested review from smklein and bnaecker August 2, 2023 12:54

@karencfv karencfv requested a review from bnaecker August 31, 2023 01:28
schema/crdb/4.0.0/up.sql (outdated review thread, resolved)
Collaborator

@smklein smklein left a comment

I'm stoked we could get tests for this, and that we got the deployment job to pass.

This looks solid - I have a couple comments below, but the structure of this makes sense to me. Thanks for pushing through this work -- it's hard enough by itself, but harder on a moving target where we're supporting existing platforms.

I'd check in with @bnaecker to see if he has additional feedback before merging, but modulo my last set of comments, this LGTM!

@@ -0,0 +1,24 @@
-- CRDB documentation recommends the following:
Collaborator

FWIW in my original recommendation, I figured it would have been the following, within a single file named 4.0.0/up.sql, but what you have done works too.

BEGIN;
SELECT CAST(
    IF(
        (
            SELECT version = '3.0.3' and target_version = '4.0.0'
            FROM omicron.public.db_metadata WHERE singleton = true
        ),
        'true',
        'Invalid starting version for schema change'
    ) AS BOOL
);

ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;

BEGIN;
SELECT CAST(
    IF(
        (
            SELECT version = '3.0.3' and target_version = '4.0.0'
            FROM omicron.public.db_metadata WHERE singleton = true
        ),
        'true',
        'Invalid starting version for schema change'
    ) AS BOOL
);

ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;

Contributor Author

thnx!

Collaborator

@bnaecker bnaecker left a comment

Thanks for all the work here @karencfv, this is really a monster PR. I've a few questions about how we're actually going to roll this out, and what can go wrong when we do. Those are mostly not that big a deal, given that it seems like there is some time before we'll actually deploy this.

One bigger question is around whether we should use distributed tables now. I think it would be a good idea, but it's also possible we can defer it a little bit. There's a lot of work here that is independent of the exact DB organization we end up deploying. But I think we want to revisit that question prior to actually pulling the trigger, to minimize the chances of making another breaking change in the future.

ORDER BY (timeseries_name, timeseries_key, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;
--
CREATE TABLE IF NOT EXISTS oximeter.measurements_f64 ON CLUSTER oximeter_cluster
Collaborator

So for the measurement tables, we should consider making these Distributed now.

Here's my thinking. It seems like we're probably going to just lose some data when we make the change to the replicated tables. I'd like to avoid that now, but I think we need to avoid that in the future. So, perhaps we could make each of these tables the "local" version, e.g., append _local to the name, and then create a table with these names that's actually Distributed.

For example:

CREATE TABLE IF NOT EXISTS oximeter.measurements_f64_local ON CLUSTER oximeter_cluster
(
 ...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{share}/measurements_f64_local', '{replica}')
...

And then later:

CREATE TABLE IF NOT EXISTS oximeter.measurements_f64 ON CLUSTER oximeter_cluster
(
    ...
)
ENGINE = Distributed('oximeter_cluster', 'oximeter', 'measurements_f64_local', timeseries_name)
...

The advantages of this are (see the usage sketch after this list):

  • if or when we do get to sharding the data, this will happen without changes to the SQL. New data will be automatically distributed among the new shards. We can reshard if we want to separately.
  • queries also always query all the data, and take advantage of the sharding to split the workload
  • we don't need to load-balance ourselves, ClickHouse does that
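
For illustration, a rough sketch of how the two layers would be addressed once the wrapper exists (table names as proposed above; inserts into the Distributed table get routed to the shards' _local tables, and reads against it fan out across all shards):

-- Reads against the Distributed table fan out to every shard's local table.
SELECT count() FROM oximeter.measurements_f64;

-- The replicated data itself lives in the *_local table on each node.
SELECT count() FROM oximeter.measurements_f64_local;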

Contributor Author

That's a great idea! I've made the change

# TEMPORARY: Racks will be set up with single node ClickHouse until
# Nexus provisions services so there is no divergence between racks
# https://github.com/oxidecomputer/omicron/issues/732
single_node=true
Collaborator

How will this be changed? We'll make a commit which changes it?

At that point, when we boot this new zone...what happens? There already is a database oximeter. Now, we can delete that ourselves manually, by just removing the files. But if we don't do that, we'll then run CREATE DATABASE oximeter ON CLUSTER oximeter_cluster. Does that conflict? When we create those tables, do those conflict?

Contributor Author

I've been mulling it over, but tbh I can't say I have a clear answer as to which steps we'll take for the migration (or perhaps we remove everything and start from scratch?). I was planning on having that discussion here -> #4000. I think a lot of it also depends on "how" Nexus will provision services.

If we are able to perform a one-off job for the migration, we may be able to use remote() after renaming the old tables, or something like that.
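
For the record, a rough sketch of what such a one-off copy could look like (the _old table name and the source address are hypothetical, and assume the existing single-node tables are renamed first):

-- Hypothetical backfill: copy rows from the renamed single-node table into
-- the new replicated local table via the remote() table function.
INSERT INTO oximeter.measurements_f64_local
SELECT * FROM remote('[fd00:1122:3344:101::f]:9000', 'oximeter', 'measurements_f64_old');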

Comment on lines +151 to +153
// There seems to be a bug using ipv6 with a replicated set up
// when installing all servers and coordinator nodes on the same
// server. For this reason we will be using ipv4 for testing.
Collaborator

What's the nature of this bug? Will it affect us if we happen to deploy two replicas on the same host?

Contributor Author

It only appears when using localhost on the same server. It shouldn't be a problem for any rack installation.

test-utils/src/dev/clickhouse.rs (outdated review thread, resolved)
.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.env("CLICKHOUSE_WATCHDOG_ENABLE", "0")
Collaborator

We should probably do .env_clear() here to be safe.

Contributor Author

Unfortunately that removes the environment variables just before the config files are populated with them, so I had to leave it as is.

.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.env("CLICKHOUSE_WATCHDOG_ENABLE", "0")
Collaborator

We should probably do .env_clear() here to be safe.

Contributor Author

same

@karencfv
Contributor Author

karencfv commented Sep 1, 2023

Thanks for taking the time to review, @smklein @bnaecker! I know this isn't a very easy PR to review.

@bnaecker I've addressed your comments. On the matter of implementing distributed tables, I think it's a great idea! I've changed the SQL file with your suggestions, and we can refine things if needed in #3982, which is what I'm planning to tackle next.

@karencfv
Contributor Author

karencfv commented Sep 2, 2023

sigh

I am really at a loss as to why the build-and-test (ubuntu-20.04) test is failing with:

--- STDERR:              oximeter-db client::tests::test_build_replicated ---
--
5282 | 2023-09-02T00:36:53.298Z | thread 'client::tests::test_build_replicated' panicked at 'Failed to initialize timeseries
 database: Database("Code: 62. DB::Exception: There was an error on [127.0.0.1:9000]: Code: 62. DB::Exception: No macro
 'share' in config while processing substitutions in '/clickhouse/tables/{share}/measurements_i64_local' at '20' or macro is
 not supported here. (SYNTAX_ERROR) (version 22.8.9.24 (official build)). (SYNTAX_ERROR) (version 22.8.9.24 (official 
build))\n")', oximeter/db/src/client.rs:761:14

On atrium the test runs just fine:

karen@atrium ~/src/omicron $ cargo nextest run oximeter-db client::tests::test_build_replicated --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 2.56s
    Starting 1 test across 94 binaries (824 skipped)
       START             oximeter-db client::tests::test_build_replicated

running 1 test
test client::tests::test_build_replicated ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 48 filtered out; finished in 15.33s

        PASS [  15.367s] oximeter-db client::tests::test_build_replicated
------------
     Summary [  15.370s] 1 test run: 1 passed, 824 skipped

@karencfv
Contributor Author

karencfv commented Sep 2, 2023

gargh, ran this on Linux and it passes as well!!! I don't know what's going on here

coatlicue@pop-os:~/src/omicron$ cargo nextest run oximeter-db client::tests::test_build_replicated --nocapture
   Compiling serde_json v1.0.105
<...>
    Finished test [unoptimized + debuginfo] target(s) in 8m 40s
    Starting 1 test across 94 binaries (811 skipped)
       START             oximeter-db client::tests::test_build_replicated

running 1 test
test client::tests::test_build_replicated ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 48 filtered out; finished in 10.76s

        PASS [  10.768s] oximeter-db client::tests::test_build_replicated
------------
     Summary [  10.769s] 1 test run: 1 passed, 811 skipped

Anybody have any ideas on what I should do about this?

@bnaecker
Collaborator

bnaecker commented Sep 3, 2023

--- STDERR:              oximeter-db client::tests::test_build_replicated ---
--
5282 | 2023-09-02T00:36:53.298Z | thread 'client::tests::test_build_replicated' panicked at 'Failed to initialize timeseries
 database: Database("Code: 62. DB::Exception: There was an error on [127.0.0.1:9000]: Code: 62. DB::Exception: No macro
 'share' in config while processing substitutions in '/clickhouse/tables/{share}/measurements_i64_local' at '20' or macro is
 not supported here. (SYNTAX_ERROR) (version 22.8.9.24 (official build)). (SYNTAX_ERROR) (version 22.8.9.24 (official 
build))\n")', oximeter/db/src/client.rs:761:14

This says there is no macro named share. Indeed, if you look at the configuration file, the macro is named shard. You'll need to change the interpolated variable here (and the other similar table engine definitions) to refer to the shard macro.

I think it's worth understanding why the tests pass on atrium in this case. Presumably the files are the same, so let's track down why that's not failing. Is the macro share defined somewhere in that test case? It also seems plausible that the files (either XML or SQL) are slightly different when run on a native host (atrium or a Linux machine). Perhaps those environments have some macro share defined in them somehow, or the SQL file is different?
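
In other words, the fix is to rename the interpolated macro in each affected engine definition (a sketch, using the table from the error above):

-- Before: refers to a 'share' macro that the configuration file never defines.
-- ENGINE = ReplicatedMergeTree('/clickhouse/tables/{share}/measurements_i64_local', '{replica}')
-- After: uses the 'shard' macro that the configuration actually defines.
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/measurements_i64_local', '{replica}')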

@karencfv
Contributor Author

karencfv commented Sep 5, 2023

Thanks @bnaecker! It was really strange because I couldn't replicate that error on a Linux machine either 🤷‍♀️. All the configuration and SQL files are the same.

Made the change, keeping my fingers crossed!

@karencfv
Contributor Author

karencfv commented Sep 5, 2023

Woohoo! I'm still not sure why the previous run passed in both of my local environments (Helios/Linux) but not in Ubuntu CI. I'm going to look deeper into this and improve testing as part of #4001.

Labels
database (Related to database access), Metrics
Development

Successfully merging this pull request may close these issues.

Initialize multi-node ClickHouse rather than single-node
3 participants