Start integrating clickhouse clusters to blueprints #6627
Conversation
@@ -143,9 +137,27 @@ impl ClickhouseAllocator {
// If we fail to retrieve any inventory for keepers in the current
// collection then we must not modify our keeper config, as we don't
// know whether a configuration is ongoing or not.
//
// There is an exception to this rule: on *new* clusters that have
// keeper zones deployed but do not have any keepers running we must
This actually has me thinking that we should gate removal of the last `ClickhouseKeeper` node by not allowing it to be expunged. However, that seems a lot easier said than done. I think in general this is not really a problem, as we will have 5 of them and we won't expunge them all at once. Each time one is expunged a new one will be provisioned to reach the target, in the same way we deal with CRDB.
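If we ever did want that guard, a minimal sketch of what it could look like in the planner, assuming hypothetical `ZoneKind`/`Disposition` types (the real blueprint types are richer than this):

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real blueprint zone types.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ZoneKind {
    ClickhouseKeeper,
    Other,
}

#[derive(Clone, Copy, PartialEq, Eq)]
enum Disposition {
    InService,
    Expunged,
}

/// Refuse to expunge the last in-service `ClickhouseKeeper` zone, since
/// doing so would destroy the keeper cluster rather than shrink it.
fn check_keeper_expunge(
    zones: &BTreeMap<u32, (ZoneKind, Disposition)>,
    zone_to_expunge: u32,
) -> Result<(), String> {
    let live_keepers = zones
        .values()
        .filter(|(kind, disp)| {
            *kind == ZoneKind::ClickhouseKeeper
                && *disp == Disposition::InService
        })
        .count();
    let target_is_live_keeper = matches!(
        zones.get(&zone_to_expunge),
        Some((ZoneKind::ClickhouseKeeper, Disposition::InService))
    );
    if target_is_live_keeper && live_keepers == 1 {
        return Err(
            "refusing to expunge the last in-service keeper zone".to_string()
        );
    }
    Ok(())
}

fn main() {
    let mut zones = BTreeMap::new();
    zones.insert(1, (ZoneKind::ClickhouseKeeper, Disposition::InService));
    zones.insert(2, (ZoneKind::Other, Disposition::InService));
    assert!(check_keeper_expunge(&zones, 1).is_err());
}
```

As noted above, the expunge-and-replace behavior makes this mostly a belt-and-suspenders check.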
nexus/test-utils/src/lib.rs
@@ -808,6 +808,8 @@ impl<'a, N: NexusServer> ControlPlaneTestContextBuilder<'a, N> {
cockroachdb_fingerprint: String::new(),
cockroachdb_setting_preserve_downgrade:
    CockroachDbPreserveDowngrade::DoNotModify,
// TODO(ajs): Should we create this in RSS? it relies on policy
I believe we have three options here:

1. Start the clickhouse clusters by enabling the policy with OMDB (not implemented yet) once the control plane boots. Reconfigurator will then bring the clusters up via manual OMDB blueprint generation. This requires no change to RSS, but will take a few rounds to bring up all nodes, as shown by the tests in this PR. We'll have to do this for deployed systems anyway, as they won't rerun RSS.
2. Add a policy knob to `rss_config.toml`, manually start all the zones with the right configuration, and put them in the blueprint if the policy is enabled. It's actually unclear to me how this would be better, since Nexus still has to generate and push the configs down to `clickhouse-admin` inside the `ClickhouseServer` and `ClickhouseKeeper` zones.
3. Always enable the policy and do the zone config as in option 2.

I'm currently leaning towards 1 as it's the least amount of code and we have to do the manual work to bring up the nodes anyway via reconfigurator. If we wanted to get fancy, we may be able to figure out a way to generate the configuration for the clickhouse keeper and server nodes and trigger `clickhouse-admin` from RSS. That would at least make the startup automatic, but it would duplicate the work and complicate things a bit. Once we automate reconfigurator it will also be unnecessary.
1 seems reasonable to me, with the caveat of needing to document / notify folks that once we're in a multinode clickhouse world, RSS alone isn't enough to set up the rack and manual omdb steps are required after initialization. (Assuming we don't automate reconfigurator before that, and assuming I'm understanding correctly?)
So, last week I actually changed the reconfigurator bits so that new cluster configurations can be brought up all at once rather than keeper node by keeper node. This will reduce the iterations.
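A minimal sketch of that planning rule, assuming a hypothetical `keepers_to_add` helper with bare integer IDs in place of the real allocator types:

```rust
use std::collections::BTreeSet;

// Hypothetical keeper identifier; the real allocator uses richer types.
type KeeperId = u64;

/// Decide which keepers to add in the next planning round. Raft membership
/// changes must normally happen one node at a time, but a brand-new cluster
/// (no current members) can start with its full target membership at once.
fn keepers_to_add(
    current: &BTreeSet<KeeperId>,
    target: &BTreeSet<KeeperId>,
) -> Vec<KeeperId> {
    let missing: Vec<KeeperId> = target.difference(current).copied().collect();
    if current.is_empty() {
        // New cluster: bring up everything in one round.
        missing
    } else {
        // Existing cluster: at most one membership change per round.
        missing.into_iter().take(1).collect()
    }
}

fn main() {
    let target: BTreeSet<KeeperId> = (1..=5).collect();
    // New cluster: all five keepers in one round.
    assert_eq!(keepers_to_add(&BTreeSet::new(), &target).len(), 5);
    // Existing cluster: only one keeper per round.
    let current: BTreeSet<KeeperId> = (1..=3).collect();
    assert_eq!(keepers_to_add(&current, &target).len(), 1);
}
```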
sled-agent/src/rack_setup/service.rs
@@ -1486,6 +1486,9 @@ pub(crate) fn build_initial_blueprint_from_sled_configs(
cockroachdb_fingerprint: String::new(),
cockroachdb_setting_preserve_downgrade:
    CockroachDbPreserveDowngrade::DoNotModify,
// TODO: Allocate keepers and servers from the plan if there is a policy,
See my other comment regarding RSS.
We decided not to configure clusters in RSS. I updated the comment.
Really excited to see this coming along!
// The planner should expunge a zone based on the sled being expunged. Since this
// is a clickhouse keeper zone, the clickhouse keeper configuration should change
// to reflect this.
Just for my own understanding. Let's say we want to move a keeper node (from a cluster of 3) from one zone to another newly created zone. In this scenario we would:

1. Change the XML configuration files to a 2 keeper node cluster and deploy it.
2. Change the XML configuration files to a 3 keeper node cluster (including the host address of the new zone and a new keeper ID) and deploy it.

This means we would not:

- Change the XML configuration files to a 3 keeper node cluster by removing the old keeper node's host address and ID and adding the new keeper node's host address and ID, and deploy that in one step.

Is this accurate?
Correct. However, I'm still not sure if we'll end up using file configuration or the `reconfig` command for the keepers. I need to do some more testing once I get to the execution phase.
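For concreteness, the move above is two discrete membership states, each confirmed via inventory before the next; a toy sketch with made-up keeper IDs:

```rust
use std::collections::BTreeSet;

fn main() {
    // Moving keeper 1 to a new zone that gets a fresh keeper ID (4).
    let start: BTreeSet<u64> = BTreeSet::from([1, 2, 3]);
    // Step 1: shrink to 2 nodes by removing the old keeper, deploy the new
    // config, and wait for inventory to confirm the raft membership change.
    let step1: BTreeSet<u64> = BTreeSet::from([2, 3]);
    // Step 2: grow back to 3 nodes with the new keeper ID, deploy, confirm.
    let step2: BTreeSet<u64> = BTreeSet::from([2, 3, 4]);
    // Each step changes membership by exactly one node.
    assert_eq!(start.symmetric_difference(&step1).count(), 1);
    assert_eq!(step1.symmetric_difference(&step2).count(), 1);
}
```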
Sadly, it appears that our current version of the keeper CLI does not have the `reconfig` command :(
```console
$ clickhouse keeper-client -h localhost -p 20001 --q help
cd [path] -- Change the working path (default `.`)
create <path> <value> [mode] -- Creates new node with the set value
delete_stale_backups -- Deletes ClickHouse nodes used for backups that are now inactive
find_big_family [path] [n] -- Returns the top n nodes with the biggest family in the subtree (default path = `.` and n = 10)
find_super_nodes <threshold> [path] -- Finds nodes with number of children larger than some threshold for the given path (default `.`)
flwc <command> -- Executes four-letter-word command
get <path> -- Returns the node's value
get_stat [path] -- Returns the node's stat (default `.`)
help -- Prints this message
ls [path] -- Lists the nodes for the given path (default: cwd)
rm <path> -- Remove the node
rmr <path> -- Recursively deletes path. Confirmation required
set <path> <value> [version] -- Updates the node's value. Only update if version matches (default: -1)
touch <path> -- Creates new node with an empty string as value. Doesn't throw an exception if the node already exists

$ clickhouse keeper-client -h localhost -p 20001 --q reconfig
Syntax error: failed at position 1 ('reconfig'):

reconfig

Expected one of: Keeper client query, cd, create, delete_stale_backups, find_big_family, find_super_nodes, flwc, get, get_stat, help, ls, rm, rmr, set, touch, lgif, csnp, dump, wchp, rqld, wchc, isro, crst, dirs, cons, srst, envi, conf, stat, ruok, srvr, wchs, mntr
```
We may have to use the configuration files for now until we're able to get onto a newer version.
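If we do fall back to config files, every membership change means regenerating the keeper's `<raft_configuration>` stanza. A rough sketch of that generation, with a hypothetical `KeeperNode` type (real keeper configs carry many more settings, and in omicron this would presumably live behind `clickhouse-admin`):

```rust
use std::fmt::Write;

// Hypothetical representation of one keeper in the cluster.
struct KeeperNode {
    id: u64,
    hostname: String,
    raft_port: u16,
}

/// Render the `<raft_configuration>` stanza of a keeper config file.
fn render_raft_configuration(nodes: &[KeeperNode]) -> String {
    let mut xml = String::from("<raft_configuration>\n");
    for node in nodes {
        writeln!(
            xml,
            "    <server>\n        <id>{}</id>\n        \
             <hostname>{}</hostname>\n        <port>{}</port>\n    </server>",
            node.id, node.hostname, node.raft_port
        )
        .expect("writing to a String cannot fail");
    }
    xml.push_str("</raft_configuration>\n");
    xml
}

fn main() {
    let nodes = vec![KeeperNode {
        id: 1,
        hostname: "keeper1.example".to_string(),
        raft_port: 9234, // keeper's default raft port
    }];
    println!("{}", render_raft_configuration(&nodes));
}
```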
Good to know. Thanks for pointing this out @karencfv.
Thanks for addressing my comments! Everything looks good from my side, but I'd love it if someone who has a better grasp of reconfigurator than me took a look as well
The reconfigurator bits LGTM. Just a couple questions, mostly out of clickhouse ignorance.
// Updating the inventory to reflect the keepers
// should result in the same state, except for the
// `highest_seen_keeper_leader_committed_log_index`
Just making sure I understand this - does this mean we expect the blueprint contents to change regularly in a way that doesn't really matter while the system is in a steady state? I think we kind of already have something like that in `{internal,external}_dns_version` (which can change in response to silo creation or deletion, for example, even in a steady state). I don't think there's anything wrong here; I want to understand for when we start talking about automating blueprint generation and deciding when to set a new target (e.g., if the only thing that's changed from blueprint A to blueprint B is `highest_seen_keeper_leader_committed_log_index`, then we'd say there's no need to advance the target from A to B?).
> does this mean we expect the blueprint contents to change regularly in a way that doesn't really matter while the system is in a steady state?

Yep. The keeper will continue to commit entries to the raft log and inventory will continue to pick up the update.

> I want to understand for when we start talking about automating blueprint generation and deciding when to set a new target (e.g., if the only thing that's changed from blueprint A to blueprint B is `highest_seen_keeper_leader_committed_log_index`, then we'd say there's no need to advance the target from A to B?).

Yes, this log index is mainly used as an inventory generation number that maps to the configuration. It solely allows us to know when a changed configuration is newer than another configuration, since we pull configurations from multiple nodes and different nodes can be online or offline at a given time. If the configuration in the inventory isn't different from what's in the blueprint, then we don't need to generate a new blueprint.
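To make that concrete, a sketch of "ignore the log index when diffing," using a hypothetical pared-down struct (the real blueprint type has more fields):

```rust
/// Hypothetical, pared-down view of the clickhouse cluster state carried
/// in a blueprint.
#[derive(Clone, PartialEq, Eq, Debug)]
struct ClickhouseClusterConfig {
    generation: u64,
    highest_seen_keeper_leader_committed_log_index: u64,
    keepers: Vec<u64>,
    servers: Vec<u64>,
}

/// Does blueprint `b` differ meaningfully from blueprint `a`? The committed
/// log index advances constantly in a healthy cluster, so compare everything
/// *except* that field before deciding to advance the target.
fn needs_new_target(
    a: &ClickhouseClusterConfig,
    b: &ClickhouseClusterConfig,
) -> bool {
    let mut b_masked = b.clone();
    b_masked.highest_seen_keeper_leader_committed_log_index =
        a.highest_seen_keeper_leader_committed_log_index;
    *a != b_masked
}

fn main() {
    let a = ClickhouseClusterConfig {
        generation: 1,
        highest_seen_keeper_leader_committed_log_index: 100,
        keepers: vec![1, 2, 3],
        servers: vec![1, 2],
    };
    let mut b = a.clone();
    b.highest_seen_keeper_leader_committed_log_index = 250;
    // Only the log index moved: no need to advance the target.
    assert!(!needs_new_target(&a, &b));
}
```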
This PR integrates clickhouse keeper and server node allocation via `ClickhouseAllocator` into the `BlueprintBuilder`. It tests that allocation works as expected via the planner. There are many more specific tests for the `ClickhouseAllocator`; the tests here mainly check that everything fits together and the same results occur when run through the planner. There is an additional test and code to allow us to disable the cluster policy and expunge all clickhouse keeper and server zones in one shot. This code is safe to merge because it is currently inert. There is no way to enable the `ClickhousePolicy` outside of tests yet. This will come in one or two follow-up PRs where we add an internal nexus endpoint for enabling the policy and then an OMDB command to trigger the endpoint. Further OMDB support will be added for monitoring. We expect that for the foreseeable future we will always deploy with `ClickhousePolicy::deploy_with_standalone = true`. This is stage 1 of RFD 468, where we run replicated clickhouse and the existing single node clickhouse together. Lastly, there will be other PRs to plug in the actual inventory collection and execution phases for clickhouse cluster reconfiguration. We shouldn't bother even implementing the OMDB policy enablement until all that is complete, as it just won't work.
Clickhouse clusters are only provisioned via reconfigurator as of #6627. This drastically simplifies things, since we don't have to put more policy knobs in RSS and wicket. It also matches the fact that these zones always need nexus running in order to get their configurations and run the appropriate processes anyway. This commit removes the vestigial zone creation that was never actually used.
This PR integrates clickhouse keeper and server node allocation via `ClickhouseAllocator` into the `BlueprintBuilder`. It tests that allocation works as expected via the planner. There are many more specific tests for the `ClickhouseAllocator`; the tests here mainly check that everything fits together and the same results occur when run through the planner. I've left a few comments regarding things discovered while implementing this and am very open to suggestions.

This code is safe to merge because it is currently inert. There is no way to enable the `ClickhousePolicy` outside of tests yet. This will come in one or two follow-up PRs where we add an internal nexus endpoint for enabling the policy and then an OMDB command to trigger the endpoint. Further OMDB support will be added for monitoring. We expect that for the foreseeable future we will always deploy with `ClickhousePolicy::deploy_with_standalone = true`. This is stage 1 of RFD 468, where we run replicated clickhouse and the existing single node clickhouse together. The OMDB command will not allow us to disable this at first. It's still an open question whether we want to be able to disable the policy altogether and expunge any clickhouse zones. I can add that functionality and a test to this PR or a future one if desired, as mentioned in a comment below.

The inertness of this PR will not necessarily change if we decide to deploy zones via RSS. I'm currently leaning against doing this, as it adds complexity; please see the related comment below. One thing I did realize while writing that out, and that should have been relatively obvious, is that deploying keepers will currently take a long time without some extra work to trigger inventory collection. This is because keepers must be added or removed one at a time, and we must retrieve the keeper membership status to reflect that before moving on. However, this status lives in inventory, which is currently only collected every 10 minutes. I think we have a way to force collection in OMDB, so we'll have to interleave that with our blueprint reconfigurations, unless that's already done automatically. I honestly don't remember at this point.
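In other words, bring-up looks like a loop where inventory collection gates each round; a schematic sketch with hypothetical stub types and functions (nothing here is omicron's real API):

```rust
// Hypothetical stand-ins for the real planner inputs and outputs.
struct Inventory {
    keepers_seen: usize,
}
struct Blueprint {
    keepers_wanted: usize,
}

fn collect_inventory(round: usize) -> Inventory {
    // Stub: in reality this is the inventory task (every ~10 minutes unless
    // a collection is forced), reporting observed keeper membership.
    Inventory { keepers_seen: round }
}

fn plan(inv: &Inventory) -> Blueprint {
    // Stub: grow membership by at most one keeper per round.
    Blueprint { keepers_wanted: (inv.keepers_seen + 1).min(5) }
}

fn execute(_bp: &Blueprint) {
    // Stub: push configs down to clickhouse-admin in the keeper zones.
}

fn main() {
    // Each membership change must be observed in inventory before the next
    // planning round, so wall-clock time is dominated by collection cadence.
    for round in 0usize.. {
        let inventory = collect_inventory(round);
        let blueprint = plan(&inventory);
        execute(&blueprint);
        if inventory.keepers_seen == 5 {
            break;
        }
    }
}
```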
Lastly, there will be other PRs to plug in the actual inventory collection and execution phases for clickhouse cluster reconfiguration. We shouldn't bother even implementing the OMDB policy enablement until all of that is complete, as it just won't work.