[reconfigurator] clickhouse_server SMF service and oximeter replicated mode #6343

Merged
merged 11 commits into oxidecomputer:main from clickhouse-server-smf on Aug 20, 2024

Conversation

karencfv
Contributor

@karencfv karencfv commented Aug 15, 2024

Overview

This commit introduces a few changes:

  • a new clickhouse_server SMF service, which runs what was previously the "replicated" mode of the clickhouse service
  • a new replicated field in the oximeter configuration file, consumed by the oximeter binary, which determines whether the replicated or single-node SQL is run against the database. The collector now connects to the listen address resolved from ServiceName::ClickhouseServer or ServiceName::Clickhouse, depending on which zone has been deployed (see the sketch after this list).
  • a new --clickhouse-topology build target flag, which builds artifacts for either a single-node or replicated-cluster setup. The difference between the two is whether the oximeter SMF service runs the collector in replicated mode. CAVEAT: it's still necessary to manually change the RSS node count constants to match the chosen ClickHouse topology. This requirement will be short-lived, as we are moving to use reconfigurator.
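
In rough terms, the collector now chooses its database address as in the sketch below. This is only a sketch: it assumes the surrounding oximeter agent code from this change (the internal-DNS resolver, the db_config read from the TOML file, and the new replicated setting), and it elides error handling and exact type conversions.

let db_address = if let Some(address) = db_config.address {
    // An explicit address in the config file still takes precedence.
    address
} else if replicated {
    // Replicated cluster: resolve the clickhouse_server zones from internal DNS.
    SocketAddr::V6(
        resolver.lookup_socket_v6(ServiceName::ClickhouseServer).await?,
    )
} else {
    // Single node: resolve the plain clickhouse zone.
    SocketAddr::V6(
        resolver.lookup_socket_v6(ServiceName::Clickhouse).await?,
    )
};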

Usage

To run single-node ClickHouse, nothing changes; artifacts can be built the same way as before.

To run replicated ClickHouse, set the node count constants accordingly and set the build target in the following manner:

$ cargo run --locked --release --bin omicron-package -- -t <NAME> target create -i standard -m non-gimlet -s softnpu -r single-sled -c replicated-cluster
    Finished `release` profile [optimized] target(s) in 1.03s
     Running `target/release/omicron-package -t <NAME> target create -i standard -m non-gimlet -s softnpu -r single-sled -c replicated-cluster`
Logging to: /home/coatlicue/src/omicron/out/LOG
Created new build target '<NAME>' and set it as active
$ cargo run --locked --release --bin omicron-package -- -t <NAME> package
<...>
$ pfexec ./target/release/omicron-package -t <NAME> install

Purpose

As laid out in RFD 468, rolling out replicated ClickHouse requires the ability to deploy either replicated or single-node ClickHouse for an undetermined amount of time. This commit is a step in that direction: we need separate services for running replicated and single-node ClickHouse servers.

Testing

Deploying omicron on a helios box with both modes.

Single node:

$ cargo run --locked --release --bin omicron-package -- -t centzon target create -i standard -m non-gimlet -s softnpu -r single-sled
    Finished `release` profile [optimized] target(s) in 0.94s
     Running `target/release/omicron-package -t centzon target create -i standard -m non-gimlet -s softnpu -r single-sled`
Logging to: /home/coatlicue/src/omicron/out/LOG
Created new build target 'centzon' and set it as active
$ cargo run --locked --release --bin omicron-package -- -t centzon package
<...>
$ pfexec ./target/release/omicron-package -t centzon install
Logging to: /home/coatlicue/src/omicron/out/LOG
$ zoneadm list | grep clickhouse
oxz_clickhouse_7ce86c8b-2c9e-4d02-a857-269cb0a99c2e
root@oxz_clickhouse_7ce86c8b:~# /opt/oxide/clickhouse/clickhouse client --host fd00:1122:3344:101::e
ClickHouse client version 23.8.7.1.
Connecting to fd00:1122:3344:101::e:9000 as user default.
Connected to ClickHouse server version 23.8.7 revision 54465.

oxz_clickhouse_7ce86c8b-2c9e-4d02-a857-269cb0a99c2e.local :) SHOW TABLES FROM oximeter

SHOW TABLES FROM oximeter

Query id: 5e91fafb-4d70-4a27-a188-75fb83bb7e5e

┌─name───────────────────────┐
│ fields_bool                │
│ fields_i16                 │
│ fields_i32                 │
│ fields_i64                 │
│ fields_i8                  │
│ fields_ipaddr              │
│ fields_string              │
│ fields_u16                 │
│ fields_u32                 │
│ fields_u64                 │
│ fields_u8                  │
│ fields_uuid                │
│ measurements_bool          │
│ measurements_bytes         │
│ measurements_cumulativef32 │
│ measurements_cumulativef64 │
│ measurements_cumulativei64 │
│ measurements_cumulativeu64 │
│ measurements_f32           │
│ measurements_f64           │
│ measurements_histogramf32  │
│ measurements_histogramf64  │
│ measurements_histogrami16  │
│ measurements_histogrami32  │
│ measurements_histogrami64  │
│ measurements_histogrami8   │
│ measurements_histogramu16  │
│ measurements_histogramu32  │
│ measurements_histogramu64  │
│ measurements_histogramu8   │
│ measurements_i16           │
│ measurements_i32           │
│ measurements_i64           │
│ measurements_i8            │
│ measurements_string        │
│ measurements_u16           │
│ measurements_u32           │
│ measurements_u64           │
│ measurements_u8            │
│ timeseries_schema          │
│ version                    │
└────────────────────────────┘

41 rows in set. Elapsed: 0.014 sec.

oxz_clickhouse_7ce86c8b-2c9e-4d02-a857-269cb0a99c2e.local :) SELECT * FROM oximeter.fields_i64

SELECT *
FROM oximeter.fields_i64

Query id: 4bbcec72-101f-4cf4-9966-680381f5b62c

┌─timeseries_name────────────────────────┬───────timeseries_key─┬─field_name──┬─field_value─┐
│ http_service:request_latency_histogram │  8326032694586838023 │ status_code │         200 │
<...>

$ pfexec zlogin oxz_oximeter_b235200f-f0ad-4218-9184-d995df5acaf0
[Connected to zone 'oxz_oximeter_b235200f-f0ad-4218-9184-d995df5acaf0' pts/3]
The illumos Project     helios-2.0.22784        July 2024
root@oxz_oximeter_b235200f:~# cat /var/svc/manifest/site/oximeter/config.toml 
# Example configuration file for running an oximeter collector server

[db]
batch_size = 1000
batch_interval = 5 # In seconds
replicated = false

[log]
level = "debug"
mode = "file"
path = "/dev/stdout"
if_exists = "append"

Replicated cluster:

$ cargo run --locked --release --bin omicron-package -- -t centzon target create -i standard -m non-gimlet -s softnpu -r single-sled -c replicated-cluster
    Finished `release` profile [optimized] target(s) in 1.03s
     Running `target/release/omicron-package -t centzon target create -i standard -m non-gimlet -s softnpu -r single-sled -c replicated-cluster`
Logging to: /home/coatlicue/src/omicron/out/LOG
Created new build target 'centzon' and set it as active
$ cargo run --locked --release --bin omicron-package -- -t centzon package
<...>
$ pfexec ./target/release/omicron-package -t centzon install
Logging to: /home/coatlicue/src/omicron/out/LOG
$ zoneadm list | grep clickhouse
oxz_clickhouse_keeper_73e7fda2-20af-4a90-9a61-c89ceed47c1a
oxz_clickhouse_server_74876663-5337-4d9b-85cb-99d1e88bdf8a
oxz_clickhouse_keeper_8eaac4f9-d9e0-4d56-b269-eab7da0c73a3
oxz_clickhouse_keeper_01f3a6af-5249-4dff-b9a4-f1076e467c9a
oxz_clickhouse_server_bc6010bf-507c-4b5a-ad4c-3a7af889a6c0
$ pfexec zlogin oxz_clickhouse_server_74876663-5337-4d9b-85cb-99d1e88bdf8a
[Connected to zone 'oxz_clickhouse_server_74876663-5337-4d9b-85cb-99d1e88bdf8a' pts/3]
The illumos Project     helios-2.0.22784        July 2024
root@oxz_clickhouse_server_74876663:~# /opt/oxide/clickhouse_server/clickhouse client --host fd00:1122:3344:101::e
ClickHouse client version 23.8.7.1.
Connecting to fd00:1122:3344:101::e:9000 as user default.
Connected to ClickHouse server version 23.8.7 revision 54465.

oximeter_cluster node 1 :) SHOW TABLES FROM oximeter

SHOW TABLES FROM oximeter

Query id: a5603063-1cbc-41a5-bfbd-33c986764e92

┌─name─────────────────────────────┐
│ fields_bool                      │
│ fields_bool_local                │
│ fields_i16                       │
│ fields_i16_local                 │
│ fields_i32                       │
│ fields_i32_local                 │
│ fields_i64                       │
│ fields_i64_local                 │
│ fields_i8                        │
│ fields_i8_local                  │
│ fields_ipaddr                    │
│ fields_ipaddr_local              │
│ fields_string                    │
│ fields_string_local              │
│ fields_u16                       │
│ fields_u16_local                 │
│ fields_u32                       │
│ fields_u32_local                 │
│ fields_u64                       │
│ fields_u64_local                 │
│ fields_u8                        │
│ fields_u8_local                  │
│ fields_uuid                      │
│ fields_uuid_local                │
│ measurements_bool                │
│ measurements_bool_local          │
│ measurements_bytes               │
│ measurements_bytes_local         │
│ measurements_cumulativef32       │
│ measurements_cumulativef32_local │
│ measurements_cumulativef64       │
│ measurements_cumulativef64_local │
│ measurements_cumulativei64       │
│ measurements_cumulativei64_local │
│ measurements_cumulativeu64       │
│ measurements_cumulativeu64_local │
│ measurements_f32                 │
│ measurements_f32_local           │
│ measurements_f64                 │
│ measurements_f64_local           │
│ measurements_histogramf32        │
│ measurements_histogramf32_local  │
│ measurements_histogramf64        │
│ measurements_histogramf64_local  │
│ measurements_histogrami16        │
│ measurements_histogrami16_local  │
│ measurements_histogrami32        │
│ measurements_histogrami32_local  │
│ measurements_histogrami64        │
│ measurements_histogrami64_local  │
│ measurements_histogrami8         │
│ measurements_histogrami8_local   │
│ measurements_histogramu16        │
│ measurements_histogramu16_local  │
│ measurements_histogramu32        │
│ measurements_histogramu32_local  │
│ measurements_histogramu64        │
│ measurements_histogramu64_local  │
│ measurements_histogramu8         │
│ measurements_histogramu8_local   │
│ measurements_i16                 │
│ measurements_i16_local           │
│ measurements_i32                 │
│ measurements_i32_local           │
│ measurements_i64                 │
│ measurements_i64_local           │
│ measurements_i8                  │
│ measurements_i8_local            │
│ measurements_string              │
│ measurements_string_local        │
│ measurements_u16                 │
│ measurements_u16_local           │
│ measurements_u32                 │
│ measurements_u32_local           │
│ measurements_u64                 │
│ measurements_u64_local           │
│ measurements_u8                  │
│ measurements_u8_local            │
│ timeseries_schema                │
│ timeseries_schema_local          │
│ version                          │
└──────────────────────────────────┘

81 rows in set. Elapsed: 0.010 sec. 

oximeter_cluster node 1 :) SELECT * FROM oximeter.fields_i64

SELECT *
FROM oximeter.fields_i64

Query id: 14f07468-0e33-4de1-8893-df3e11eb7660

┌─timeseries_name────────────────────────┬───────timeseries_key─┬─field_name──┬─field_value─┐
│ http_service:request_latency_histogram │   436117616059041516 │ status_code │         200 │
<...>

$ pfexec zlogin oxz_oximeter_bcba1c06-1ca5-49cf-b277-8c2387975274
[Connected to zone 'oxz_oximeter_bcba1c06-1ca5-49cf-b277-8c2387975274' pts/3]
The illumos Project     helios-2.0.22784        July 2024
root@oxz_oximeter_bcba1c06:~# cat /var/svc/manifest/site/oximeter/config.toml
# Example configuration file for running an oximeter collector server

[db]
batch_size = 1000
batch_interval = 5 # In seconds
replicated = true

[log]
level = "debug"
mode = "file"
path = "/dev/stdout"
if_exists = "append"

Related: #5999

@karencfv karencfv marked this pull request as ready for review August 16, 2024 05:47
Contributor Author

I have made no changes here. This is simply extracted from smf/clickhouse/method_script.sh. This script will be superseded by config file generation via clickward's libraries in a follow up PR.

@karencfv karencfv changed the title from "[reconfigurator] WIP configure clickhouse_server SMF service" to "[reconfigurator] clickhouse_server SMF service and oximeter --replicated mode" on Aug 16, 2024
Collaborator

@bnaecker bnaecker left a comment

Thanks for putting this together. I have a few questions and suggestions, but looks pretty good overall!

oximeter/collector/src/agent.rs (outdated review thread, resolved)
@@ -49,6 +49,12 @@ enum Args {
/// The socket address at which `oximeter`'s HTTP server runs.
#[clap(short, long, action)]
address: SocketAddrV6,

// TODO (https://github.com/oxidecomputer/omicron/issues/4148): This flag
// should be removed once single node functionality is removed.
Collaborator

I'll keep saying it: I believe we'll always want to retain the ability to run a single-node, non-replicated ClickHouse database. I'm not sure whether this particular argument will change; I'm referring to the functionality itself.

Contributor Author

This is stage three of the plan of action to roll out replicated ClickHouse in RFD 468. Can we move the discussion of whether to keep or remove single-node functionality there?

In the meantime, I'd like to keep these TODOs to keep track of the places that would need to be modified if we decide to remove single-node functionality (I'll rephrase to say "if" instead of "once"). If we decide that single-node functionality will never be removed, I'll make sure to remove all of these comments :)

oximeter/collector/src/bin/oximeter.rs (outdated review thread, resolved)
oximeter/collector/src/bin/oximeter.rs (outdated review thread, resolved)
</method_environment>
</method_context>
<exec_method type='method' name='start'
exec='ctrun -l child -o noorphan,regent /opt/oxide/oximeter-collector/bin/oximeter run /var/svc/manifest/site/oximeter/config.toml --address %{config/address} --id %{config/id} --replicated &amp;'
Collaborator

Just to make sure I understand, this whole file is duplicated so that we can pass the --replicated flag to this invocation? If so, is there a way to simplify that at all? E.g., can we provide this as an SMF property or in the TOML configuration file?

Contributor Author

Hm, let me think about how I can make this nicer

Contributor Author

I've modified the code so that the replicated setting is now a field in the oximeter configuration file instead of a flag in the CLI.

Example:

[db]
batch_size = 1000
batch_interval = 5 # In seconds
replicated = false

[log]
level = "debug"
mode = "file"
path = "/dev/stdout"
if_exists = "append"

We still need two configuration files, as there really isn't a way to pass some sort of environment variable via omicron-package. But it's now much more obvious what the difference between the files is, and it's cleaner.
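
For illustration only, such a field can be picked up when the DB section of the config file is deserialized; the sketch below assumes serde and only approximates the real struct in the oximeter collector crate, which may differ in detail.

#[derive(Debug, serde::Deserialize)]
pub struct DbConfig {
    /// Optional explicit ClickHouse address; resolved from internal DNS when absent.
    pub address: Option<std::net::SocketAddr>,
    /// Number of samples to batch before inserting.
    pub batch_size: usize,
    /// Interval, in seconds, on which to insert regardless of batch size.
    pub batch_interval: u64,
    /// Whether to run against a replicated cluster rather than a single-node server.
    #[serde(default)]
    pub replicated: bool,
}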

package/src/bin/omicron-package.rs (outdated review thread, resolved)
package/src/target.rs (outdated review thread, resolved)
@@ -58,14 +58,23 @@ const OXIMETER_COUNT: usize = 1;
// when Nexus provisions Clickhouse.
// TODO(https://github.com/oxidecomputer/omicron/issues/4000): Use
// omicron_common::policy::CLICKHOUSE_SERVER_REDUNDANCY once we enable
// replicated ClickHouse
// replicated ClickHouse.
// Set to 0 when testing replicated ClickHouse.
Collaborator

It seems like this is the number of single-node ClickHouse servers. If that's right, can we make that clear with some documentation?

I think it might be even better to have one single const here, to avoid confusion and accidentally starting something we didn't intend. For example:

enum ClickHouseSetup {
    Replicated { n_servers: usize, n_keepers: usize, },
    SingleNode { n_servers: usize },
}
// We're using single-node for now.
const CLICKHOUSE_SETUP: ClickHouseSetup = ClickHouseSetup::SingleNode { n_servers: 1 };

// If you want to test replication, uncomment this:
// const CLICKHOUSE_SETUP: ClickHouseSetup = ClickHouseSetup::Replicated { n_servers: 2, n_keepers: 3 };

Then later you can match on that type and figure out how many of each kind of zone to start.
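
For example (an illustrative continuation of the suggestion above, with made-up variable names), RSS could derive the per-zone counts from that single constant:

let (n_single_node, n_clickhouse_servers, n_keepers) = match CLICKHOUSE_SETUP {
    ClickHouseSetup::SingleNode { n_servers } => (n_servers, 0, 0),
    ClickHouseSetup::Replicated { n_servers, n_keepers } => (0, n_servers, n_keepers),
};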

Contributor Author

We still haven't 100% ruled out implementing stage one of RFD 468. In that scenario we'd want the ability to deploy both alongside each other. I'd like to keep these constants as they are unless we decide that we will not implement stage 1.

Contributor Author

Just want to add that these variables are going to go away very soon anyway as part of the reconfigurator work.

Contributor

It's unfortunate that we have to generate a Blueprint from a service plan, rather than a service plan from an initial blueprint. Otherwise, I'd expect this all to be somewhat simpler. In any case, as @karencfv says, these variables are going away soon.

sled-agent/src/rack_setup/plan/service.rs (outdated review thread, resolved)
sled-agent/src/services.rs (outdated review thread, resolved)
Contributor Author

@karencfv karencfv left a comment

Thanks for taking the time to review @bnaecker 🙇‍♀️ I've addressed all your comments except for https://github.com/oxidecomputer/omicron/pull/6343/files#r1720042719. Will play around with this to see if I can come up with something better.

@karencfv karencfv changed the title from "[reconfigurator] clickhouse_server SMF service and oximeter --replicated mode" to "[reconfigurator] clickhouse_server SMF service and oximeter replicated mode" on Aug 20, 2024
@karencfv
Contributor Author

@bnaecker @andrewjstone , I think I've addressed all the comments above. Please let me know if you think there's anything missing!

@karencfv karencfv requested a review from bnaecker August 20, 2024 01:30
@andrewjstone
Contributor

@bnaecker @andrewjstone , I think I've addressed all the comments above. Please let me know if you think there's anything missing!

Sorry @karencfv I haven't had a chance to look yet. I'll take a look a bit later tonight.

Collaborator

@bnaecker bnaecker left a comment

Thanks for addressing the comments! Looks good, modulo one small fix using lookup_socket_v6(). Thanks!

let mut address = resolver
.lookup_socket_v6(ServiceName::ClickhouseServer)
.await?;
address.set_port(CLICKHOUSE_PORT);
Collaborator

We don't need to set the port here. The call to lookup_socket_v6() has a port in it, we don't want to overwrite it!

)
let mut address =
resolver.lookup_socket_v6(ServiceName::Clickhouse).await?;
address.set_port(CLICKHOUSE_PORT);
Collaborator

Same here, please just use the return value as-is.
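
In other words, a sketch of the requested change: take the resolved socket address as-is, since the SRV lookup already carries the service's port.

let address =
    resolver.lookup_socket_v6(ServiceName::Clickhouse).await?;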

Contributor Author

Ugh, github got rid of my comment, I'll copy it here so it doesn't get lost:

I have made no changes here. This is simply extracted from smf/clickhouse/method_script.sh. This script will be superseded by config file generation via clickward's libraries in a follow up PR.

Contributor

@andrewjstone andrewjstone left a comment

Hi Karen,

Thanks for taking this on. It's nice to see work done to package up the new zones. However, I think a large emphasis has been put on distinguishing replicated from single-node setups when it isn't needed. I left specific comments around this. Without that distinction, I think the PR shrinks quite a bit, and less of it will have to be ripped out in the next week or two.

};
let id = OmicronZoneUuid::new_v4();
let ip = sled.addr_alloc.next().expect("Not enough addrs");
// TODO: This may need to be a different port if/when to have single node
Contributor

I don't think this is necessary, as they should not be running in the same zones and will therefore have different IP addresses.

Contributor Author

So we're definitely not doing stage 1 of RFD 468 then?

Contributor

I don't think this rules out doing stage 1. I don't want to do that right now.

@@ -394,11 +395,17 @@ impl OximeterAgent {
// database.
let db_address = if let Some(address) = db_config.address {
address
} else if replicated {
Contributor

I don't understand why we need a replicated flag at all here. Why not always perform lookups from internal DNS? We just have to ensure that the single node ClickHouse also publishes a DNS entry. Oximeter is always only going to talk to one replica, right? If so it can just look up that replica from DNS. It doesn't have to know if there is only one of them or not because the client interface doesn't change. If there is a failure, then oximeter should perform another lookup so that it can find a healthy server.

Contributor Author

Please correct me if I'm in the wrong here, but my understanding is that internal DNS entries for clickhouse and clickhouse_server will look quite different

assert_eq!(ServiceName::Clickhouse.dns_name(), "_clickhouse._tcp",);
assert_eq!(
    ServiceName::ClickhouseKeeper.dns_name(),
    "_clickhouse-keeper._tcp",
);
assert_eq!(
    ServiceName::ClickhouseServer.dns_name(),
    "_clickhouse-server._tcp",
);

Even if we were to use internal DNS we'd have to differentiate between the two zone types no?

Contributor

Yeah, this was me assuming that single node and multi-node wouldn't be running at the same time. We can't actually guarantee that as you pointed out above. I don't want to entirely rule out step 1 of RFD 468 yet. Sorry for the confusion here. My mind was clearly elsewhere!

// TODO (https://github.com/oxidecomputer/omicron/issues/4148): This field
// should be removed if single node functionality is removed.
/// Whether ClickHouse is running as a replicated cluster or
/// single-node server.
Contributor

As stated above, I'm unclear why Oximeter has to know anything about whether there are one or more clickhouse servers. I think things would be much simpler without this flag.

Collaborator

Oximeter has to initialize the database. It can't do that by asking whether it's talking to a replicated or single-node ClickHouse server; it has to set up the server. Doesn't that mean we need some kind of information like this?

Collaborator

Unless, based on your other comment, that oximeter isn't going to set up the databases in the long run. Is that the job of the clickhouse-admin server in the future? If so, then I agree with you :)

Contributor

Ah, I get it. I totally forgot about this dichotomy in the setup SQL.

It could certainly be done via clickhouse-admin, but that was not my intent. I still think oximeter should likely run the SQL to configure the database, but it should be told what to do by reconfigurator, especially when the old zone containing the single-node setup gets torn down and a replicated setup put in its place. For now though, I guess that's all done via the oximeter config file. That negates a lot of my comments ;)

Contributor Author

Yeah, like @bnaecker says, we need to know which SQL file to use to initialize the database. Even if we use clickhouse-admin to set up the server in the future, we don't have that mechanism in place yet, so we need a way to differentiate between the two in the meantime. WDYT?
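
Loosely, the dichotomy under discussion is just a choice of initialization SQL at setup time, something like the sketch below (the paths are illustrative, not the exact files in the repo):

// Hypothetical sketch: pick the schema-initialization SQL based on the mode.
let init_sql_path = if replicated {
    "/opt/oxide/oximeter/schema/replicated/db-init.sql"
} else {
    "/opt/oxide/oximeter/schema/single-node/db-init.sql"
};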

Collaborator

Ok, that makes sense. One could imagine providing something like the same configuration that we give to clickhouse-admin to oximeter as well. For example, it already needs to register itself with Nexus. That request could provide information about which DB type to set up, for example, or oximeter could have an API through which Nexus / Reconfigurator provide that configuration.

listen_addr.to_string(),
)
.add_property("listen_port", "astring", listen_port)
.add_property("store", "astring", "/data");
Contributor

This is fine for testing, but this is not how I anticipate the system working. When the zone is launched, a clickhouse server will not be running. The clickhouse-admin dropshot server instead will be waiting for a ReplicaConfig to be pushed to it so that it can generate a config file and start the service. In the case of an existing dataset, the config file may already be there, along with a cached version of the ReplicaConfig input and a generation number, so that only updates generate new config files. Once there is a valid config file, then clickhouse-admin will start the server.

The only properties passed through should be those needed for starting the clickhouse-admin server. This allows us to use a self-assembling zone and not go through sled-agent when the configuration changes.
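
As a rough illustration of that flow (the types and names below are made up, not the real clickhouse-admin API), the generation number is what lets the service ignore stale pushes and only rewrite the config when something actually changed:

#[derive(Clone)]
struct ReplicaConfig {
    generation: u64,
    // ... cluster membership, macros, data paths, etc.
}

/// Returns true if the incoming config is newer than what is cached on disk.
fn should_rewrite(cached: Option<&ReplicaConfig>, incoming: &ReplicaConfig) -> bool {
    match cached {
        Some(c) => incoming.generation > c.generation,
        None => true,
    }
}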

Contributor Author

ah sweet! Should I leave this code as is for now until we've modified clickhouse-admin to do what you mention? I'd like to leave replicated clickhouse working with this PR even if we remove things later

Contributor

ah sweet! Should I leave this code as is for now until we've modified clickhouse-admin to do what you mention? I'd like to leave replicated clickhouse working with this PR even if we remove things later

Absolutely leave it for now. Sorry again for the confusion. Everything in this PR works, so no reason to break it for some future code! Thanks again for hanging with me.

@@ -0,0 +1,124 @@
#!/bin/bash
Contributor

I know this file was mostly just moved over, but I don't expect it to end up looking like this at all in the very near future. It will only start clickhouse-admin which will then wait for a ReplicaConfig. If a clickhouse-config.xml already exists then it should start clickhouse with that config. clickhouse-admin will continue waiting for updates indefinitely and rewrite clickhouse-config.xml when the configuration changes. Clickhouse will automatically reload this configuration.

Contributor Author

ah yeah, definitely! I expect this to be gone soon.

#[derive(Clone, Debug, strum::EnumString, strum::Display, ValueEnum)]
#[strum(serialize_all = "kebab-case")]
#[clap(rename_all = "kebab-case")]
pub enum ClickhouseTopology {
Contributor

I don't believe this is strictly needed. Why do we need to build differently depending upon the way we run ClickHouse? I think you can always just build Clickhouse, ClickhouseServer and ClickhouseKeeper zones and package them up, but only deploy zones as decided at runtime. Isn't that exactly what we plan to do for production?

You already have to change the hardcoded variables in RSS to run multi-node, so we can just leave that for now until Reconfigurator is up and running. We'll only start regular Clickhouse zones when configured in the RSS Plan, and start replicated ones otherwise.

Contributor Author

The idea here is to have this ClickhouseTopology for as long as we have the ability to deploy single-node and replicated clickhouse services in the repo. My view is that since the hardcoding of constants is temporary, we'd only have to manually change them until reconfigurator takes over.

When we swap over to using replicated cluster as the default installation, we can change the default setting to replicated-cluster. If we decide to remove all single-node functionality, we can get rid of ClickhouseTopology altogether. Does this make sense?

Contributor

I'm not sure I follow here. This is just about building packages, not about what gets run via RSS, right? If so, why can't we build all the zone packages without specifying here? Then if we chose to remove all single-node functionality, we just remove that zone type. We don't also have to remove this option.

What am I missing?

Contributor

@andrewjstone andrewjstone left a comment

@karencfv As @bnaecker reminded me, we need to run different initialization SQL to set up replicated and non-replicated ClickHouse. I don't think that necessitates having ClickhouseTopology in omicron-package, nor does it prevent always using DNS. But I'm also fine cleaning that up in post :)

Contributor Author

@karencfv karencfv left a comment

Thanks for taking the time to review @andrewjstone 🙇‍♀️ ! I've left some comments, I hope it all makes sense

@karencfv karencfv enabled auto-merge (squash) August 20, 2024 04:30
@karencfv karencfv merged commit 6bd999b into oxidecomputer:main Aug 20, 2024
16 checks passed
@karencfv karencfv deleted the clickhouse-server-smf branch August 20, 2024 06:08