Scale out Clickhouse to a multinode cluster #3494

Merged on Sep 5, 2023 (69 commits)
Changes from 1 commit
Commits
0b21df2
Initial functional 2 replica 3 coordinator cluster
karencfv Jul 5, 2023
3281cbd
Create config templates and pseudocode for updated init config
karencfv Jul 5, 2023
b628be6
Dynamically build configs for servers and keepers
karencfv Jul 6, 2023
cdfb1d1
Create a separate service for keepers
karencfv Jul 10, 2023
41518c8
Update manifest and file location
karencfv Jul 11, 2023
795327e
clean up
karencfv Jul 11, 2023
73e5e22
make linter happy
karencfv Jul 11, 2023
83449a2
Zone image is clickhouse-keeper.tar.gz not clickhouse_keeper.tar.gz
karencfv Jul 11, 2023
38fb86d
Merge branch 'main' into ch-replicated-engine
karencfv Jul 19, 2023
06b8f21
Only use underscores to simplify
karencfv Jul 20, 2023
b5dd484
Merge remote-tracking branch 'upstream' into ch-replicated-engine
karencfv Jul 20, 2023
ca7ad33
Create composite packages to include internal-dns tar
karencfv Jul 24, 2023
bbbbd28
Get internal DNS working
karencfv Jul 25, 2023
7b7b245
Add datastore to keeper service
karencfv Jul 26, 2023
72cc038
Append default and custom configs
karencfv Jul 31, 2023
34f370b
Give keepers dynamic discoverable IDs
karencfv Jul 31, 2023
79dd329
Clean up scripts and configs
karencfv Aug 1, 2023
94f8376
Clean up
karencfv Aug 1, 2023
cea0612
First pass at making tests pass
karencfv Aug 1, 2023
95be228
gargh linter
karencfv Aug 1, 2023
e24a2dd
Add additional zpools for dev envs
karencfv Aug 2, 2023
f6aac77
Add flag to internal-dns-cli to output host name only
karencfv Aug 2, 2023
758fd39
Revert testing configuration and clean up
karencfv Aug 2, 2023
ef914b1
Run oximeter on replicated or single node set ups
karencfv Aug 2, 2023
7eb06dd
fmt
karencfv Aug 2, 2023
1abe9dd
Merge branch 'main' into ch-replicated-engine
karencfv Aug 2, 2023
e2a4060
Small fix after merge with main branch
karencfv Aug 3, 2023
9c759e6
expectoration
karencfv Aug 3, 2023
bc33e97
Address comments
karencfv Aug 4, 2023
1ebcf14
fmt
karencfv Aug 4, 2023
23df4ef
address review comments
karencfv Aug 7, 2023
80eb1d1
Merge branch 'main' into ch-replicated-engine
karencfv Aug 8, 2023
2b1edd9
save config env vars to file
karencfv Aug 8, 2023
5fd1e75
fix scripts and configuration for bench gimlet
karencfv Aug 9, 2023
3bda6b3
Explicitly declare if a database is single node or replicated
karencfv Aug 9, 2023
2492ae8
foundation to test replicated nodes
karencfv Aug 9, 2023
8541d17
Testing utils
karencfv Aug 10, 2023
148dda9
Test replicated nodes
karencfv Aug 10, 2023
cb0cd66
First try at testing
karencfv Aug 11, 2023
b9e64cd
Keeper doesn't like absolute paths :(
karencfv Aug 11, 2023
9d8d019
Get test keepers going
karencfv Aug 14, 2023
81ab2ad
Make the test work
karencfv Aug 14, 2023
a8a02d4
Correct way to check whether a replicated server is ready for connect…
karencfv Aug 14, 2023
c116370
Clean up
karencfv Aug 14, 2023
28354be
Rename test config directories
karencfv Aug 14, 2023
9562f0e
fmt
karencfv Aug 15, 2023
9520449
fix tests
karencfv Aug 15, 2023
8872b09
Refine testing
karencfv Aug 16, 2023
3af0769
Revert bench gimlet configuration and fmt
karencfv Aug 16, 2023
f621b80
Bump clickhouse readyness testing timeout and make clippy happy
karencfv Aug 16, 2023
e51bd0f
Merge branch 'main' into ch-replicated-engine
karencfv Aug 17, 2023
298ca4e
Give end to end tests more time to bring up nexus
karencfv Aug 21, 2023
1930e14
Merge branch 'main' into ch-replicated-engine
karencfv Aug 21, 2023
e7a4635
Automatically detect whether ClickHouse set up is replicated or singl…
karencfv Aug 23, 2023
b8ccf29
Works on my machine, increase timeout
karencfv Aug 23, 2023
d83dc85
Merge branch 'main' into ch-replicated-engine
karencfv Aug 29, 2023
691d9d5
Update CRDB with new service enums
karencfv Aug 29, 2023
4a1c179
Disable replicated ClickHouse
karencfv Aug 30, 2023
fe124fd
Make clippy happy
karencfv Aug 30, 2023
7f67c7b
Merge branch 'main' into ch-replicated-engine
karencfv Aug 31, 2023
251df8a
Small fix after merge
karencfv Aug 31, 2023
c58c8fe
Revert e2e timeout duration
karencfv Aug 31, 2023
4832dda
Address review comments
karencfv Aug 31, 2023
c7e3598
make the linter happy
karencfv Aug 31, 2023
0adffb7
Address comments
karencfv Sep 1, 2023
98c705b
Create distributed tables
karencfv Sep 1, 2023
77a3492
Stop forgetting to run cargo fmt before pushing the commit
karencfv Sep 1, 2023
5933f51
Also don't forget about clippy :facepalm:
karencfv Sep 1, 2023
6c82c8a
Small fix to referenced macro in SQL
karencfv Sep 5, 2023
2 changes: 1 addition & 1 deletion oximeter/db/src/client.rs
@@ -761,7 +761,7 @@ mod tests {
 .expect("Failed to initialize timeseries database");

 // Wait to make sure data has been synchronised.
-// TODO: Waiting for 5 secs is a bit sloppy,
+// TODO(https://github.com/oxidecomputer/omicron/issues/4001): Waiting for 5 secs is a bit sloppy,
 // come up with a better way to do this.
 sleep(Duration::from_secs(5)).await;

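One possible direction for the TODO above, as a sketch only: replace the fixed 5-second sleep with a bounded poll that returns as soon as the condition holds. The helper below uses only the standard library and blocks the calling thread. `wait_until` and its parameters are hypothetical names, not part of omicron, and an async test would want a `tokio`-based equivalent instead (the test utilities import a `poll` module, whose API is not shown in this diff).

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `check` every `interval` until it returns
/// `true` or `timeout` has elapsed. Returns `true` if the condition was
/// met in time, `false` otherwise.
fn wait_until<F>(mut check: F, timeout: Duration, interval: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        // Check first, so an already-satisfied condition returns at once.
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

In the test, `check` would be whatever query shows that the inserted rows are visible on the second replica. That query is an assumption here and is left to the caller.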
2 changes: 1 addition & 1 deletion oximeter/db/src/configs/replica_config.xml
@@ -315,7 +315,7 @@

 <remote_servers replace="true">
 <oximeter_cluster>
-<!-- TODO: secret handling TBD -->
+<!-- TODO(https://github.com/oxidecomputer/omicron/issues/3823): secret handling TBD -->
 <secret>mysecretphrase</secret>
 <shard>
 <internal_replication>true</internal_replication>
1 change: 0 additions & 1 deletion schema/crdb/4.0.0/up.sql
@@ -20,6 +20,5 @@ SELECT CAST(
 );

 ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
-ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';

 COMMIT;
24 changes: 24 additions & 0 deletions schema/crdb/4.0.1/up.sql
@@ -0,0 +1,24 @@
+-- CRDB documentation recommends the following:
Review comment (Collaborator):

FWIW in my original recommendation, I figured it would have been the following, within a single file named 4.0.0/up.sql, but what you have done works too.

BEGIN;
SELECT CAST(
    IF(
        (
            SELECT version = '3.0.3' and target_version = '4.0.0'
            FROM omicron.public.db_metadata WHERE singleton = true
        ),
        'true',
        'Invalid starting version for schema change'
    ) AS BOOL
);

ALTER TYPE omicron.public.service_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;

BEGIN;
SELECT CAST(
    IF(
        (
            SELECT version = '3.0.3' and target_version = '4.0.0'
            FROM omicron.public.db_metadata WHERE singleton = true
        ),
        'true',
        'Invalid starting version for schema change'
    ) AS BOOL
);

ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
COMMIT;

Reply (Contributor Author):

thnx!

+-- "Execute schema changes either as single statements (as an implicit transaction),
+-- or in an explicit transaction consisting of the single schema change statement."
+--
+-- For each schema change, we transactionally:
+-- 1. Check the current version
+-- 2. Apply the idempotent update
+
+BEGIN;
+
+SELECT CAST(
+    IF(
+        (
+            SELECT version = '4.0.0' and target_version = '4.0.1'
+            FROM omicron.public.db_metadata WHERE singleton = true
+        ),
+        'true',
+        'Invalid starting version for schema change'
+    ) AS BOOL
+);
+
+ALTER TYPE omicron.public.dataset_kind ADD VALUE IF NOT EXISTS 'clickhouse_keeper';
+
+COMMIT;
2 changes: 1 addition & 1 deletion schema/crdb/dbinit.sql
@@ -2562,7 +2562,7 @@ INSERT INTO omicron.public.db_metadata (
 version,
 target_version
 ) VALUES
-( TRUE, NOW(), NOW(), '4.0.0', NULL)
+( TRUE, NOW(), NOW(), '4.0.1', NULL)
 ON CONFLICT DO NOTHING;

 COMMIT;
4 changes: 2 additions & 2 deletions sled-agent/src/rack_setup/plan/service.rs
@@ -56,11 +56,11 @@ const CRDB_COUNT: usize = 5;
 const OXIMETER_COUNT: usize = 1;
 // TODO(https://github.com/oxidecomputer/omicron/issues/732): Remove
 // when Nexus provisions Clickhouse.
-// TODO: Set to 2 once we enable replicated ClickHouse
+// TODO(https://github.com/oxidecomputer/omicron/issues/4000): Set to 2 once we enable replicated ClickHouse
 const CLICKHOUSE_COUNT: usize = 1;
 // TODO(https://github.com/oxidecomputer/omicron/issues/732): Remove
 // when Nexus provisions Clickhouse keeper.
-// TODO: Set to 3 once we enable replicated ClickHouse
+// TODO(https://github.com/oxidecomputer/omicron/issues/4000): Set to 3 once we enable replicated ClickHouse
 const CLICKHOUSE_KEEPER_COUNT: usize = 0;
 // TODO(https://github.com/oxidecomputer/omicron/issues/732): Remove.
 // when Nexus provisions Crucible.
2 changes: 1 addition & 1 deletion smf/clickhouse/config_replica.xml
@@ -37,7 +37,7 @@

 <remote_servers replace="true">
 <oximeter_cluster>
-<!-- TODO: secret handling TBD -->
+<!-- TODO(https://github.com/oxidecomputer/omicron/issues/3823): secret handling TBD -->
 <secret>mysecretphrase</secret>
 <shard>
 <internal_replication>true</internal_replication>
7 changes: 4 additions & 3 deletions smf/clickhouse/method_script.sh
@@ -34,7 +34,7 @@ route get -inet6 default -inet6 "$GATEWAY" || route add -inet6 default -inet6 "$
 single_node=true
Review comment (Collaborator):

How will this be changed? We'll make a commit which changes it?

At that point, when we boot this new zone...what happens? There already is a database oximeter. Now, we can delete that ourselves manually, by just removing the files. But if we don't do that, we'll then run CREATE DATABASE oximeter ON CLUSTER oximeter_cluster. Does that conflict? When we create those tables, do those conflict?

Reply (Contributor Author):

I've been mulling it over, but tbh I can't say I have a clear answer as to which steps we'll take for the migration (or perhaps remove everything and start from scratch?). I was planning on having that discussion in #4000. I think a lot of it also depends on "how" Nexus will provision services.

If we are able to perform a one-off job for the migration, we may be able to use remote() after renaming the old tables, or something like that.


 command=()
-# TODO: Remove single node mode once all racks are running in replicated mode
+# TODO((https://github.com/oxidecomputer/omicron/issues/4000)): Remove single node mode once all racks are running in replicated mode
 if $single_node
 then
 command+=(
@@ -94,8 +94,9 @@ else
 fi

 # Identify the node type this is as this will influence how the config is constructed
-# TODO: There are probably much better ways to do this service discovery, but this works
-# for now. The services contain the same IDs as the hostnames.
+# TODO(https://github.com/oxidecomputer/omicron/issues/3824): There are probably much
+# better ways to do this service discovery, but this works for now.
+# The services contain the same IDs as the hostnames.
 CLICKHOUSE_SVC="$(zonename | tr -dc [:digit:])"
 REPLICA_IDENTIFIER_01="$( echo "${REPLICA_HOST_01}" | tr -dc [:digit:])"
 REPLICA_IDENTIFIER_02="$( echo "${REPLICA_HOST_02}" | tr -dc [:digit:])"
2 changes: 1 addition & 1 deletion smf/clickhouse_keeper/method_script.sh
@@ -69,7 +69,7 @@ KEEPER_ID_02="$( echo "${KEEPER_HOST_02}" | tr -dc [:digit:] | cut -c1-7)"
 KEEPER_ID_03="$( echo "${KEEPER_HOST_03}" | tr -dc [:digit:] | cut -c1-7)"

 # Identify the node type this is as this will influence how the config is constructed
-# TODO: There are probably much better ways to do this service name lookup, but this works
+# TODO(https://github.com/oxidecomputer/omicron/issues/3824): There are probably much better ways to do this service name lookup, but this works
 # for now. The services contain the same IDs as the hostnames.
 KEEPER_SVC="$(zonename | tr -dc [:digit:] | cut -c1-7)"
 if [[ $KEEPER_ID_01 == $KEEPER_SVC ]]
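To make the ID matching in the two method scripts easier to follow, here is a rough Rust equivalent of the `tr -dc [:digit:]` and `cut -c1-7` pipeline. It is an illustration only: `digit_id`, `same_node`, and the example names in the comments are hypothetical and do not come from omicron. Separately, the unquoted `[:digit:]` in the scripts is subject to shell filename expansion, so quoting it as `'[:digit:]'` would be the safer spelling.

```rust
/// Keeps only the ASCII digits of `name` (like `tr -dc '[:digit:]'`) and,
/// if `max_len` is given, only the first `max_len` of them (like `cut -c1-N`).
fn digit_id(name: &str, max_len: Option<usize>) -> String {
    let digits = name.chars().filter(|c| c.is_ascii_digit());
    match max_len {
        Some(n) => digits.take(n).collect(),
        None => digits.collect(),
    }
}

/// Returns true when the zone name and the host name reduce to the same
/// non-empty digit ID, which is the comparison the scripts perform to
/// decide which replica or keeper the current zone is.
fn same_node(zone_name: &str, host_name: &str, max_len: Option<usize>) -> bool {
    let id = digit_id(zone_name, max_len);
    !id.is_empty() && id == digit_id(host_name, max_len)
}

// Example with made-up names:
//   digit_id("keeper-12ab34cd5678ef90", Some(7)) yields "1234567"
//   same_node("zone_1a2b3c4d", "1a2b3c4d.host.example", Some(7)) is true
```

The sketch also shows the weakness the TODO hints at: two different names whose digits happen to coincide would be treated as the same node.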
3 changes: 2 additions & 1 deletion test-utils/src/dev/clickhouse.rs
@@ -20,7 +20,8 @@ use tokio::{
 use crate::dev::poll;

 // Timeout used when starting up ClickHouse subprocess.
-const CLICKHOUSE_TIMEOUT: Duration = Duration::from_secs(60);
+// build-and-test (ubuntu-20.04) needs a little longer to get going
+const CLICKHOUSE_TIMEOUT: Duration = Duration::from_secs(90);

 /// A `ClickHouseInstance` is used to start and manage a ClickHouse single node server process.
 #[derive(Debug)]