Scale out Clickhouse to a multinode cluster #3494
Conversation
I'm stoked we could get tests for this, and that we got the deployment job to pass.
This looks solid - I have a couple comments below, but the structure of this makes sense to me. Thanks for pushing through this work -- it's hard enough by itself, but harder on a moving target where we're supporting existing platforms.
I'd check in with @bnaecker to see if he has additional feedback before merging, but modulo my last set of comments, this LGTM!
schema/crdb/4.0.1/up.sql (Outdated)
@@ -0,0 +1,24 @@
-- CRDB documentation recommends the following:
FWIW, in my original recommendation I figured it would have been the following, within a single file named 4.0.0/up.sql, but what you have done works too.
BEGIN;
SELECT CAST(
IF(
(
SELECT version = '3.0.3' and target_version = '4.0.0'
FROM omicron.public.db_metadata WHERE singleton = true
),
'true',
'Invalid starting version for schema change'
) AS BOOL
);
ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;
BEGIN;
SELECT CAST(
IF(
(
SELECT version = '3.0.3' and target_version = '4.0.0'
FROM omicron.public.db_metadata WHERE singleton = true
),
'true',
'Invalid starting version for schema change'
) AS BOOL
);
ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;
thnx!
Thanks for all the work here @karencfv, this is really a monster PR. I've a few questions about how we're actually going to roll this out, and what can go wrong when we do. Those are mostly not that big a deal, given that it seems like there is some time before we'll actually deploy this.
One bigger question is around whether we should use distributed tables now. I think it would be a good idea, but it's also possible we can defer it a little bit. There's a lot of work here that is independent of the exact DB organization we end up deploying. But I think we want to revisit that question prior to actually pulling the trigger, to minimize the chances of making another breaking change in the future.
ORDER BY (timeseries_name, timeseries_key, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;
--
CREATE TABLE IF NOT EXISTS oximeter.measurements_f64 ON CLUSTER oximeter_cluster
So for the measurement tables, we should consider making these Distributed now.

Here's my thinking. It seems like we're probably going to just lose some data when we make the change to the replicated tables. I'd like to avoid that now, but I think we need to avoid it in the future. So perhaps we could make each of these tables the "local" version, e.g., append _local to the name, and then create a table with those names that's actually Distributed.
For example:
CREATE TABLE IF NOT EXISTS oximeter.measurements_f64_local ON CLUSTER oximeter_cluster
(
...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/measurements_f64_local', '{replica}')
...
And then later:
CREATE TABLE IF NOT EXISTS oximeter.measurements_f64 ON CLUSTER oximeter_cluster
(
...
)
ENGINE = Distributed('oximeter_cluster', 'oximeter', 'measurements_f64_local', cityHash64(timeseries_name))
...
The advantages of this are:
- if or when we do get to sharding the data, this will happen without changes to the SQL. New data will be automatically distributed among the new shards, and we can reshard separately if we want to.
- queries always see all the data, and take advantage of the sharding to split the workload
- we don't need to load-balance ourselves; ClickHouse does that
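To make the shape of that suggestion concrete, here is a hedged combined sketch. The column definitions are assumptions based on the existing single-node oximeter schema (they may not match exactly); the ORDER BY and TTL clauses come from the definitions above, and the sharding key is hashed because the Distributed engine expects an integer expression:

-- Hypothetical combined sketch: the replicated "local" table plus the
-- Distributed table that fans inserts and queries out across the cluster.
-- Column definitions are assumed from the single-node oximeter schema.
CREATE TABLE IF NOT EXISTS oximeter.measurements_f64_local ON CLUSTER oximeter_cluster
(
    timeseries_name String,
    timeseries_key UInt64,
    timestamp DateTime64(9, 'UTC'),
    datum Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/measurements_f64_local', '{replica}')
ORDER BY (timeseries_name, timeseries_key, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;

-- The Distributed table stores no data itself; it routes to the _local tables.
-- cityHash64() is used because the sharding key must evaluate to an integer.
CREATE TABLE IF NOT EXISTS oximeter.measurements_f64 ON CLUSTER oximeter_cluster
AS oximeter.measurements_f64_local
ENGINE = Distributed('oximeter_cluster', 'oximeter', 'measurements_f64_local', cityHash64(timeseries_name));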
That's a great idea! I've made the change
# TEMPORARY: Racks will be set up with single node ClickHouse until
# Nexus provisions services so there is no divergence between racks
# https://github.com/oxidecomputer/omicron/issues/732
single_node=true
How will this be changed? We'll make a commit which changes it?

At that point, when we boot this new zone... what happens? There already is a database oximeter. Now, we can delete that ourselves manually, by just removing the files. But if we don't do that, we'll then run CREATE DATABASE oximeter ON CLUSTER oximeter_cluster. Does that conflict? When we create those tables, do those conflict?
I've been mulling it over, but tbh I can't say I have a clear answer as to which steps we'll take for the migration (or perhaps we remove everything and start from scratch?). I was planning on having that discussion here -> #4000. I think a lot of it also depends on "how" Nexus will provision services.

If we are able to perform a one-off job for the migration, we may be able to use remote() after renaming the old tables, or something like that.
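As a very rough sketch of that remote() idea (the host name and the _old suffix below are placeholders for this illustration, not anything that exists in this PR):

-- Hypothetical one-off backfill, run against the new replicated cluster,
-- assuming the old single-node tables have first been renamed out of the way.
-- 'old-clickhouse-host' is a placeholder for wherever the old data lives.
INSERT INTO oximeter.measurements_f64
SELECT * FROM remote('old-clickhouse-host:9000', 'oximeter', 'measurements_f64_old');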
// There seems to be a bug using ipv6 with a replicated set up
// when installing all servers and coordinator nodes on the same
// server. For this reason we will be using ipv4 for testing.
What's the nature of this bug? Will it affect us if we happen to deploy two replicas on the same host?
It only appears when all of the nodes run on localhost on the same server. It shouldn't be a problem for any rack installation.
.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.env("CLICKHOUSE_WATCHDOG_ENABLE", "0")
We should probably do .env_clear() here to be safe.
Unfortunately that removes the environment variables just before the config files are populated with them, so I had to leave it as is.
.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.env("CLICKHOUSE_WATCHDOG_ENABLE", "0")
We should probably do .env_clear() here to be safe.
same
Thanks for taking the time to review @smklein @bnaecker! I know this isn't a very easy PR to review. @bnaecker I've addressed your comments. On the matter of implementing distributed tables, I think it's a great idea! I've changed the SQL file with your suggestions, and we can refine things if needed on #3982, which is what I'm planning on tackling next.
sigh I am really at a loss as to why the build-and-test (ubuntu-20.04) test is failing with:
On atrium the test runs just fine.
gargh, ran this on Linux and it passes as well!!! I don't know what's going on here.

Anybody have any ideas on what I should do about this?
This says there is no macro with that name defined in the server configuration. I think it's worth understanding why the tests pass on atrium and locally but not in the Ubuntu CI job.
Thanks @bnaecker! It was really strange because I couldn't replicate that error on a Linux machine either 🤷‍♀️. All the configuration and SQL files are the same. Made the change, keeping my fingers crossed!
Woohoo! I'm still not sure why the previous run passed in both of my local environments (Helios/Linux) but not in Ubuntu CI. I'm going to look deeper into this and improve testing as part of #4001.
Replication
This PR implements an initial 2-replica, 3-coordinator ClickHouse setup.

I've settled on this initial lean architecture because I want to avoid cluttering the system with what may be unnecessary additional nodes and using up our customers' resources. As we gauge the system alongside our first customers, we can decide whether we really do need more replicas. Adding a replica is very straightforward, as we only need to make a few changes to the templates/service count and restart the ClickHouse services.
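For what it's worth, a quick way to confirm what the cluster looks like after adding a replica is to ask any node for its view of the topology. This is just an illustrative query; the cluster name comes from the templates in this PR:

-- List the shards and replicas this server knows about for oximeter_cluster.
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'oximeter_cluster';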
Sharding
Sharding can prove to be very resource intensive, and we have yet to fully understand our customers' needs. I'd like to avoid a situation where we are prematurely optimising when we have so many unknowns. We also have not had time to perform long-running testing. See the official ClickHouse recommendations.

Like additional replicas, we can add additional shards if we find them to be necessary down the track.
Testing
I have left most tests as a single-node setup. It feels unnecessary to spin up so many things constantly. If people disagree, I can modify this.

I have run many, many manual tests, starting and stopping services, and so far the setup has held up.

An Omicron build with these changes is currently running on sn21; please feel free to poke it and prod it :)
UPDATE: The build is borked because some things changed recently and I can't get Omicron running on the gimlet the way I used to :(
NB: I wasn't able to get nexus running on the gimlet :P sorry. I tried and failed miserably
Using a ClickHouse client:
To retrieve information about the keepers you can use the provided commands within each of the keeper zones.
Example:
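(The original example output isn't captured here. As one additional, purely illustrative check, a clickhouse client connected to any replica can also query the system.zookeeper table to see what the keepers are coordinating; the path below assumes the /clickhouse/tables prefix used in the table definitions.)

-- Illustrative only: list the znodes the replicated tables have registered
-- with the keepers under the assumed coordination path.
SELECT name, numChildren
FROM system.zookeeper
WHERE path = '/clickhouse/tables';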
Follow up work
Closes: #2158
Update
As mentioned previously, we came to an agreement at the last control plane meeting that software installed on the racks should not diverge due to replicated ClickHouse. This means that while ClickHouse replication is functional in this PR, it has been disabled in the last commit in the following manner:
The method_script.sh for the clickhouse service is set to run in single-node mode by default, but can be switched to replicated mode by swapping a variable to false. When we migrate all racks to a replicated ClickHouse setup, all logic related to running on a single node will be removed from that file.

Testing
I ran the full CI testing suite on both replicated and single-node modes. You can find the replicated test results here, and the single-node (replication disabled) results here.
Additionally, I have added tests that validate the replicated db_init file here, and incorporated checks in tests that validate whether a CH instance is part of a cluster or not.
Next steps
To keep this PR compact (if you can call 2000 lines compact), I have created several issues from the review comments to tackle after this PR is merged. In prioritised order, these are: