Start integrating clickhouse clusters to blueprints #6627

Merged: 20 commits merged into main from wire-up-clickhouse-to-blueprint, Oct 2, 2024

Conversation

@andrewjstone (Contributor) commented Sep 21, 2024

This PR integrates clickhouse keeper and server node allocation via ClickhouseAllocator into the BlueprintBuilder, and tests that allocation works as expected via the planner. There are many more specific tests for the ClickhouseAllocator; the tests here mainly check that everything fits together and that the same results occur when run through the planner.

I've left a few comments regarding things discovered while implementing this and am very open to suggestions.

This code is safe to merge because it is currently inert. There is no way to enable the ClickhousePolicy outside of tests yet. This will come in one or two follow-up PRs, where we add an internal Nexus endpoint for enabling the policy and then an OMDB command to trigger the endpoint. Further OMDB support will be added for monitoring. We expect that for the foreseeable future we will always deploy with ClickhousePolicy::deploy_with_standalone = true. This is stage 1 of RFD 468, where we run replicated clickhouse and the existing single-node clickhouse together. The OMDB command will not allow us to disable this at first. It's still an open question whether we want to be able to disable the policy altogether and expunge any clickhouse zones. I can add that functionality and a test to this PR or a future one if desired, as mentioned in a comment below.
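For context, here is a rough sketch of the shape such a policy might take. Only deploy_with_standalone is named above; the other fields are illustrative guesses, not the real ClickhousePolicy definition:

// Hypothetical sketch; only `deploy_with_standalone` is taken from the
// description above, the other fields are illustrative guesses.
pub struct ClickhousePolicy {
    // Stage 1 of RFD 468: run replicated clickhouse alongside the
    // existing single-node clickhouse.
    pub deploy_with_standalone: bool,
    // Target counts the allocator would aim for (names made up).
    pub target_servers: u8,
    pub target_keepers: u8,
}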

The inertness of this PR will not necessarily change if we decide to deploy zones via RSS. I'm currently leaning against doing this as it adds complexity; please see the related comment below. One thing that I realized while writing that out, and that should have been relatively obvious, is that deploying keepers will currently take a long time without some extra work to trigger inventory collection. This is because keepers must be added or removed one at a time, and we must retrieve the keeper membership status reflecting each change before moving on. However, this status lives in inventory, which is currently only collected every 10 minutes. I think we have a way to force collection in OMDB, and so we'll have to interleave that with our blueprint reconfigurations, unless that's already done automatically; I honestly don't remember at this point.
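To make the one-at-a-time constraint concrete, here is a minimal sketch of the decision that has to be made on each planning pass. The type and function names are made up for illustration; this is not the actual ClickhouseAllocator API:

use std::collections::BTreeSet;

// Hypothetical keeper id type for illustration.
type KeeperId = u64;

enum KeeperStep {
    AddOne(KeeperId),
    RemoveOne(KeeperId),
    // The previous change hasn't shown up in inventory yet (or we have no
    // inventory at all), so leave the keeper config alone.
    Wait,
}

// Decide at most one keeper membership change per planning pass.
fn next_keeper_step(
    desired: &BTreeSet<KeeperId>,
    committed: &BTreeSet<KeeperId>,
    observed: Option<&BTreeSet<KeeperId>>,
) -> Option<KeeperStep> {
    // No inventory for keepers: we can't tell whether a reconfiguration
    // is still in flight, so wait.
    let Some(observed) = observed else {
        return Some(KeeperStep::Wait);
    };
    // Inventory hasn't caught up with the last committed change yet.
    if observed != committed {
        return Some(KeeperStep::Wait);
    }
    // Remove at most one member that is no longer wanted...
    if let Some(&gone) = committed.difference(desired).next() {
        return Some(KeeperStep::RemoveOne(gone));
    }
    // ...otherwise add at most one new member.
    if let Some(&added) = desired.difference(committed).next() {
        return Some(KeeperStep::AddOne(added));
    }
    // Membership already matches the target; nothing to do.
    None
}

This is why each keeper change has to be interleaved with an inventory collection: the next step is only taken once the previous one is visible in the collected membership.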

Lastly, there will be other PRs to plug in the actual inventory collection and execution phases for clickhouse cluster reconfiguration. We shouldn't bother implementing the OMDB policy enablement until all of that is complete, as it just won't work otherwise.

@andrewjstone changed the title from "Start integrating clickhouse clusters to blueprints" to "WIP: Start integrating clickhouse clusters to blueprints" Sep 21, 2024
@@ -143,9 +137,27 @@ impl ClickhouseAllocator {
// If we fail to retrieve any inventory for keepers in the current
// collection then we must not modify our keeper config, as we don't
// know whether a configuration is ongoing or not.
//
// There is an exception to this rule: on *new* clusters that have
// keeper zones deployed but do not have any keepers running we must
@andrewjstone (author) commented:

This actually has me thinking that we should gate removal of the last ClickhouseKeeper node by not allowing it to be expunged. However, that seems a lot easier said than done. I think in general this is not really a problem as we will have 5 of them and we won't expunge them all at once. Each time one is expunged a new one will be provisioned to reach the target in the same way we deal with CRDB.

@@ -808,6 +808,8 @@ impl<'a, N: NexusServer> ControlPlaneTestContextBuilder<'a, N> {
cockroachdb_fingerprint: String::new(),
cockroachdb_setting_preserve_downgrade:
CockroachDbPreserveDowngrade::DoNotModify,
// TODO(ajs): Should we create this in RSS? it relies on policy
@andrewjstone (author) commented:

I believe we have three options here:

  1. Start the clickhouse clusters by enabling the policy with OMDB (not implemented yet) once the control plane boots. Reconfigurator will then bring the clusters up via manual OMDB blueprint generation. This requires no change to RSS, but will take a few rounds to bring up all nodes, as shown in the tests in this PR. We'll have to do this for deployed systems anyway, as they won't rerun RSS.
  2. Add a policy knob to rss_config.toml and manually start all the zones with the right configuration and put them in the blueprint if the policy is enabled. It's actually unclear to me how this will be better, since Nexus still has to generate and push the configs down to clickhouse-admin inside the ClickhouseServer and ClickhouseKeeper zones.
  3. Always enable the policy and do the zone config as in option 2.

I'm currently leaning towards 1 as it's the least amount of code and we have to do the manual work to bring up the nodes anyway via reconfigurator. If we wanted to get fancy, we may be able to figure out a way to generate the configuration for the clickhouse keeper and server nodes and trigger clickhouse-admin from RSS. That would at least make the startup automatic. But it would duplicate the work and complicate things a bit. Once we automate reconfigurator it will also be unnecessary.

@jgallagher commented:

1 seems reasonable to me, with the caveat of needing to document / notify folks that once we're in a multinode clickhouse world, RSS alone isn't enough to set up the rack and manual omdb steps are required after initialization. (Assuming we don't automate reconfigurator before that, and assuming I'm understanding correctly?)

@andrewjstone (author) commented:

So, last week I actually changed the reconfigurator bits such that new cluster configurations can be brought up all at once rather than keeper node by keeper node. This will reduce the iterations.

@@ -1486,6 +1486,9 @@ pub(crate) fn build_initial_blueprint_from_sled_configs(
cockroachdb_fingerprint: String::new(),
cockroachdb_setting_preserve_downgrade:
CockroachDbPreserveDowngrade::DoNotModify,
// TODO: Allocate keepers and servers from the plan if there is a policy,
@andrewjstone (author) commented:

See my other comment regarding RSS.

@andrewjstone (author) commented:

We decided not to configure clusters in RSS. I updated the comment.

@andrewjstone changed the title from "WIP: Start integrating clickhouse clusters to blueprints" to "Start integrating clickhouse clusters to blueprints" Sep 25, 2024
@andrewjstone marked this pull request as ready for review September 25, 2024 22:44
@karencfv (Contributor) left a comment:

Really excited to see this coming along!

nexus/reconfigurator/planning/src/planner.rs (two outdated comments, resolved)
Comment on lines +2548 to +2550
// The planner should expunge a zone based on the sled being expunged. Since this
// is a clickhouse keeper zone, the clickhouse keeper configuration should change
// to reflect this.
@karencfv commented:

Just for my own understanding. Let's say we want to move a keeper node (from a cluster of 3) from one zone to another newly created zone. In this scenario we would:

  1. Change the XML configuration files to a 2 keeper node cluster and deploy it.
  2. Change the XML configuration files to a 3 keeper node cluster (including the host address of the new zone and a new keeper ID) and deploy it.

This means we would not:

  1. Change the XML configuration files to a 3 keeper node cluster by removing the old keeper node's host address and ID and adding the new keeper node's host address and ID, and deploying that

Is this accurate?

@andrewjstone (author) commented:

Correct. However, I'm still not sure if we'll end up using file configuration or the reconfig command for the keepers. I need to do some more testing once I get to the execution phase.

@karencfv commented:

Sadly, it appears that our current version of the keeper cli does not have the reconfig command :(

$ clickhouse keeper-client -h localhost -p 20001 --q help
cd [path] -- Change the working path (default `.`)
create <path> <value> [mode] -- Creates new node with the set value
delete_stale_backups -- Deletes ClickHouse nodes used for backups that are now inactive
find_big_family [path] [n] -- Returns the top n nodes with the biggest family in the subtree (default path = `.` and n = 10)
find_super_nodes <threshold> [path] -- Finds nodes with number of children larger than some threshold for the given path (default `.`)
flwc <command> -- Executes four-letter-word command
get <path> -- Returns the node's value
get_stat [path] -- Returns the node's stat (default `.`)
help -- Prints this message
ls [path] -- Lists the nodes for the given path (default: cwd)
rm <path> -- Remove the node
rmr <path> -- Recursively deletes path. Confirmation required
set <path> <value> [version] -- Updates the node's value. Only update if version matches (default: -1)
touch <path> -- Creates new node with an empty string as value. Doesn't throw an exception if the node already exists

$ clickhouse keeper-client -h localhost -p 20001 --q reconfig
Syntax error: failed at position 1 ('reconfig'):

reconfig

Expected one of: Keeper client query, cd, create, delete_stale_backups, find_big_family, find_super_nodes, flwc, get, get_stat, help, ls, rm, rmr, set, touch, lgif, csnp, dump, wchp, rqld, wchc, isro, crst, dirs, cons, srst, envi, conf, stat, ruok, srvr, wchs, mntr

We may have to use the configuration files for now until we're able to get onto a newer version

@andrewjstone (author) commented:

Good to know. Thanks for pointing this out @karencfv.

nexus/types/src/deployment/zone_type.rs (outdated comment, resolved)
@karencfv (Contributor) left a comment:

Thanks for addressing my comments! Everything looks good from my side, but I'd love it if someone who has a better grasp of reconfigurator than me took a look as well.

@jgallagher (Contributor) left a comment:

The reconfigurator bits LGTM. Just a couple questions, mostly out of clickhouse ignorance.

// Updating the inventory to reflect the keepers
// should result in the same state, except for the
// `highest_seen_keeper_leader_committed_log_index`
@jgallagher commented:

Just making sure I understand this - does this mean we expect the blueprint contents to change regularly in a way that doesn't really matter while the system is in a steady state? I think we kind of already have something like that in {internal,external}_dns_version (which can change in response to silo creation or deletion, for example, even in a steady state). I don't think there's anything wrong here; I want to understand for when we start talking about automating blueprint generation and deciding when to set a new target (e.g., if the only thing that's changed from blueprint A to blueprint B is highest_seen_keeper_leader_committed_log_index, then we'd say there's no need to advance the target from A to B?).

@andrewjstone (author) commented Oct 1, 2024:

does this mean we expect the blueprint contents to change regularly in a way that doesn't really matter while the system is in a steady state?

Yep. The keeper will continue to commit entries to the raft log and inventory will continue to pick up the update.

I want to understand for when we start talking about automating blueprint generation and deciding when to set a new target (e.g., if the only thing that's changed from blueprint A to blueprint B is highest_seen_keeper_leader_committed_log_index, then we'd say there's no need to advance the target from A to B?).

Yes, this log index is mainly used as an inventory generation number that maps to the configuration. It solely allows us to know when a changed configuration is newer than another configuration, since we pull configurations from multiple nodes and different nodes can be online or offline at a given time. If the configuration in the inventory isn't different from what's in the blueprint then we don't need to generate a new blueprint.
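A rough sketch of that decision, with made-up types rather than the real inventory structures: the log index only selects the freshest observation, and the membership comparison is what decides whether a new blueprint is worth generating.

use std::collections::BTreeSet;

// Hypothetical shapes for illustration only.
struct KeeperObservation {
    leader_committed_log_index: u64,
    raft_members: BTreeSet<u64>,
}

struct BlueprintKeeperConfig {
    raft_members: BTreeSet<u64>,
}

// True only if the freshest observed membership differs from what the
// blueprint already records; a higher log index by itself is not a reason
// to advance the target blueprint.
fn membership_changed(
    observations: &[KeeperObservation],
    blueprint: &BlueprintKeeperConfig,
) -> bool {
    observations
        .iter()
        .max_by_key(|o| o.leader_committed_log_index)
        .map(|newest| newest.raft_members != blueprint.raft_members)
        .unwrap_or(false)
}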

@andrewjstone enabled auto-merge (squash) October 1, 2024 22:48
@andrewjstone merged commit 6d65e1e into main Oct 2, 2024
18 checks passed
@andrewjstone deleted the wire-up-clickhouse-to-blueprint branch October 2, 2024 03:40
hawkw pushed a commit that referenced this pull request Oct 2, 2024
This PR integrates clickhouse keeper and server node
allocation via `ClickhouseAllocator` into the `BlueprintBuilder`. It
goes about testing that allocation works as expected via the planner.
There are many more specific tests for the `ClickhouseAllocator` and the
tests here are mainly checking that everything fits together and the
same results occur when run through the planner. There is an additional
test and code to allow us to disable the cluster policy and expunge 
all clickhouse keeper and server zones in one shot.

This code is safe to merge because it is currently inert. There is no
way to enable the `ClickhousePolicy` outside of tests yet. This will
come in one or two follow up PRs where we add an internal nexus endpoint
for enabling the policy and then an OMDB command to trigger the
endpoint. Further OMDB support will be added for monitoring. We expect
that for the foreseeable future we will always deploy with
`ClickhousePolicy::deploy_with_standalone = true`. This is stage 1 of
RFD 468 where we run replicated clickhouse and the existing single node
clickhouse together. 

Lastly, there will be other PRs to plug in the actual inventory
collection and execution phases for clickhouse cluster reconfiguration.
We shouldn't bother even implementing the OMDB policy enablement until
all that is complete as it just won't work.
iliana added a commit that referenced this pull request Oct 2, 2024
iliana added a commit that referenced this pull request Oct 3, 2024
#6758 moved `ClickhouseKeeperClusterMembership` from `nexus-types` to
`clickhouse-admin-types`, but #6627 added new references to the
`ClickhouseKeeperClusterMembership` in `nexus-types` earlier.
andrewjstone added a commit that referenced this pull request Oct 3, 2024
Clickhouse clusters are only provisioned via reconfigurator as of
#6627.

This drastically simplifies things since we don't have to put more
policy knobs in RSS and wicket. It also matches the fact that these
zones always need nexus running in order to get their configurations
and run the appropriate processes anyway.

This commit removes the vestigial zone creation that was never actually
used.
andrewjstone added a commit that referenced this pull request Oct 3, 2024