-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto-master detection for storages, rebalancer location selection, and bug fixes #439
Auto-master detection for storages, rebalancer location selection, and bug fixes #439
Conversation
SENT bucket could be deleted by GC before bucket_send() is completed. It could lead to a problem that bucket_send() would notice that in the end and treat as an error. The bucket should not be deleted while its transfer is still in progress. The bug is fixed now because it started manifesting frequently in new tests for the next commits. Closes #433 NO_DOC=bugfix
f3e863c
to
45ee7e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great patchset! Here're some comments. Feel free to resolve any nits without changing the code
SENDING bucket could be recovered back to ACTIVE state if the destination instance didn't yet apply the first bucket_recv(). Could happen if the recovery was just faster than the bucket sending procedure. That shouldn't happen. The patch makes the recovery skip the local buckets which are being transferred right now, regardless of their state. NO_DOC=bugfix
The old name looked like it meant the object itself is an auto assigned master. But in fact it is a replicaset flag which means the master role should be auto-discovered. The patch makes the name clearer. The reason for it now is that the flag is going to be used more in the next commits. Part of #429 NO_DOC=refactoring
The master check is going to be used in more than one place in the next commits. Better do it via one function which this commit introduces. Part of #429 NO_DOC=refactoring NO_TEST=refactoring
The function used to return 'false' on error. But all functions which can fail are supposed to return nil on error (exception on critical errors such as misusage). The patch makes the function conform with the rules. This is done as a preparation before the function would start returning another error in a next commit. Part of #429 NO_DOC=not visible NO_TEST=already covered
A few places in storage code still used fiber_cond:wait() directly. It is not safe for negative timeouts and is too easy to forget to convert an error to vshard format. The patch replaces those usages with util.fiber_cond_wait. This was noticed while fixed another bug related to sleeps and fiber cancellation. NO_DOC=not visible
Apparently storage service fibers could live quite a while after their cancellation, because all what they are doing is wrapped into pcalls (built-in cancellation points didn't throw), and the cancel wasn't checked often enough. This commit adds some more cancellation points to recovery, gc, rebalancer. It is fixed now because in the next commits it was getting problematic to change master role. When a master was switched off and then back on, then its new services couldn't be started right away until the old ones finally died. It would clog the logs a lot. NO_DOC=bugfix NO_TEST=covered in next commits
It is moved into its own section where the next commits will almost fully rewrite it. The move is done separately, because some parts will remain unchanged and it should be easier to see them in the diff this way. Part of #429 NO_DOC=refactoring NO_TEST=refactoring
It was inside storage_cfg, but it is going to grow too big to contain it in the cfg code further. The patch moves it to the rebalancer management section. Part of #429 NO_DOC=refactoring NO_TEST=refactoring
It was almost entirely done in a single function, was too difficult to extend in next commits. This patch splits the config code into several functions doing a certain step of the configuration. The patch also introduces a config context. It is an object where one time per cfg call are calculated and saved all the most useful config details such as the current replica and replicaset UUID, whether the instance was a master before and is now, config of this particular instance and replicaset, and more. The context is introduced because these attributes are going to be used more widely across several functions, and passing them as arguments everywhere would be just unhandy. As well as calculating them again in each function. For example, it allows storage_cfg_master_prepare() and storage_cfg_master_commit() just use the already provided was_master_in_cfg and is_master_in_cfg and concentrate on the switch logic instead of figuring out how did the state change. Part of #429 NO_DOC=refactoring NO_TEST=refactoring
It was done after schema upgrade, but now is done before. It makes sense because master commit should enable read-write mode which is needed for the schema changes. It worked so far, because in manual master config the read-only cfg update is done at the box.cfg stage. I.e. before the schema update. But soon the master role will be allowed to turn on and off automatically. And then the old order wouldn't work anymore. At the same time not everything can be done before the schema is installed. Because some services, like recovery and bucket GC do need the schema (`_bucket` space at least). The solution is to make those services depend on whether the first configuration is performed. They are started only after that. Part of #429 NO_DOC=refactoring NO_TEST=refactoring
There is this_is_master() function, but it wasn't used in all places, and it used to rely on the config. Soon the config won't be telling who the master is when the role is set to be automatic in next commits. The patch introduces a flag is_master which reflects the current actual state, not the config. For now it is only managed by the config, but in the next commit it won't be so anymore. Vtest also now won't rely on the config. It uses the actual master state. Part of #429 NO_DOC=internal change
It is now storage_call_test.lua. The reason is that it contains just vshard.storage.call() tests. There are going to be added some new files and this one was getting more confusing. Needed for #429 NO_DOC=refactoring
45ee7e2
to
feddae2
Compare
Replicaset config gets a new option 'master' with the only possible value (besides nil) 'auto'. That makes the nodes of this replicaset decide who of them is the master by looking at their local box.info.ro value. Such replicaset is yet unable to run the rebalancer, and other replicasets and routers can't make rw requests to it. This is all supported in next commits. Closes #429 NO_DOC=covered by a next commit
bucket_recv, bucket_send, rebalancer_apply_routes, and rebalancer_request_state only make sense on the master instance. Now it is checked explicitly. This becomes necessary for the next commits which will enable master discovery at the storage level, and the likelihood of making those calls on a non-master instances rises up. Part of #430 NO_DOC=internal change
The new function is used by bucket recovery instead of bucket_stat(). The problem of the latter is that it allows to be invoked on a non-master instance which is not ok for recovery - it needs to get the most up to date state of the bucket. On the other hand it wasn't possible to just ban bucket_stat() on non-master instances, because it is a public function and it was never documented to work exclusively on master instances. Part of #430 NO_DOC=internal change
It wraps replicaset:callrw(). The purpose is to handle the NON_MASTER error in a single place. It is going to become much more likely to get after the next commits which will enable master discovery at the storage level. It will be important to notice that the node, which was thought to be the master, is not in that role anymore, and a new discovery is needed. Part of #430 NO_DOC=internal change NO_TEST=covered in next commits
Those functions replace the old split() function. It is needed mainly for tests which are going to pass read_only options per-instance in the global config. But it also looks like an actually useful feature even outside of tests. In tests that is needed for auto-master cases. There is already one place - auto_master_2_test.lua, but for that one the patch in vtest was enough and helped only for bootstrap. The current patch allows to reconfigure the instances after bootstrap via their own private config sections. Needed for #430 @TarantoolBot document Title: vshard: per-instance box options in the config Previously the config could include box options only on the root level like so: ```Lua { -- VShard options: sharding = ..., weights = ..., bucket_count = ..., ... -- Box options: replication_timeout = ..., memtx_use_mvcc_engine = ..., net_msg_max = ..., } ``` Those box-options are propagated to `box.cfg{}` on each instance using this config. But there was no way to pass different options for difference instances via one config. Now it is possible via their config sections in `sharding`: ```Lua { sharding = { [replicaset_uuid] = { replicas = { [replica_uuid] = { replication_timeout = ..., net_msg_max = ..., read_only = ..., } }, ... }, ... }, ... -- Box options: replication_timeout = ..., memtx_use_mvcc_engine = ..., net_msg_max = ..., } ``` For each instance its resulting `box.cfg` is merged from the root box options and from the instance's section in `sharding`. If some options are specified in both places, then the priority is for the replica-local options. For example: ``` { sharding = { [replicaset_uuid] = { replicas = { [replica_uuid] = { replication_timeout = 100, net_msg_max = 200, } }, }, }, replication_timeout = 50, txn_timeout = 300, } ``` The resulting `box.cfg` config for the given replica will be: ```Lua { replication_timeout = 100, net_msg_max = 200, txn_timeout = 300, } ```
Many of them used to use the default timeout (replica.net_timeout). It is working good enough, but isn't really suitable for storages - this auto-tuning timeout thing was designed for routers mostly. Besides, soon in master_call() it will be needed to know the timeout before doing the call itself. Getting the net_timeout out of the master object would work for now, but soon this won't be possible when auto-master will be enabled with discovery. That all means either trying schemas like the router does for the same job, or simply always pass an explicit timeout to all master calls. The patch follows the latter way. The need to know the timeout will appear for doing retries inside master_call() when the master is not known right now, but chances are high is going to be discovered in a few milliseconds. For the waiting a deadline is needed, and for that a timeout is necessary. Part of #430 NO_DOC=internal change NO_TEST=covered
locate_master() would not even try to create connections to the instances, which didn't have an already existing one. As a result, if the actual master wasn't ever contacted by any calls, then it would stay undiscovered. Part of #430 NO_DOC=bugfix
feddae2
to
bff0d2f
Compare
Until now only master discovery of own replicaset was supported on the storage level. But now the storages can find masters of other replicasets. The design is similar to the router's master discovery, except that it is not proactive. Storages rarely need to talk to each other which makes pointless to constantly maintain all the connections and re-check the instances' roles to keep the master's location info up to date. Instead, the discovery is reactive. It only works for the replicasets which are being used for any calls right now. Part of #430 @TarantoolBot document Title: vshard: master discovery on storage level Routers do already support `master = 'auto'` setting at the replicaset level in the config. It makes them able to automatically find the master instance location in the so configured replicaset. On the storages it wasn't supported at all. But now it is. The storages can specify in their configs the same option `master = 'auto'` for their own replicaset and for any other replicasets too. If the storage has `master = 'auto'` for its own replicaset, then it will constantly monitor `box.info.ro`. When it becomes `true`, the instance considers itself a replica. When false - considers itself a master. It also means that such automatic masters will start and stop all needed background fibers automatically as well, depending on the actual state of the instance, not on the config. However it is worth keeping in mind that this option doesn't mean master election. It only means master **detection**. I.e. if this option is used, then the user's code is supposed to assign the masters using `box.cfg.read_only` or any other way. If the storage sees `master = 'auto'` for any other replicasets than its own, then it will be able to automatically discover their master instance for internal operations, such as bucket recovery and rebalancing.
Previous commit added support for full master discovery on the storages, but the rebalancer didn't work on replicasets having auto master enabled for themselves. Now it works on those too. The priority is firstly given to manual masters. If none are found in the entire cluster, then the rebalancer appears on the auto-assigned master with the minimal replicaset UUID among all the other auto-masters. Closes #430 NO_DOC=documented later with new rebalancer options
Automatic master discovery on the storages means, that the instances are going to establish more connections between replicasets to locate the masters of each other. That might easily lead to fullmesh of the whole cluster - it means lots of connections majority of which are not even used most of the time. Need to get rid of them somehow if they are not used for a while. This commit makes the storages drop the too long unused connections. Also the automatically discovered master locations are erased by the same no-activity timeout. The logic behind it is that if an instance wasn't contacted anyhow for a very long time, then it might be not a master anymore. It is better not to treat it as a master then and let it be re-discovered when a master connection is needed again. Follow-up #429 Follow-up #430 NO_DOC=internal
The flags allows to explicitly assign the rebalancer role to a specific instance or a replicaset. Or forbid its automatic assignment to instances or whole replicasets. Part of #432 NO_DOC=in the final commit with rebalancer options
Closes #432 @TarantoolBot document Title: vshard: rebalancer flag and mode So far it was impossible to specify which instance should run the rebalancer. It was always automatically assigned using some internal rules based on UUIDs. Now the users can choose: - Which specific instance should run the rebalancer. Can be a replica or a master - won't matter. - In which replicaset the instance to run the rebalancer should be selected automatically. - Which instances and whole replicasets should not run the rebalancer even when it is selected automatically. For that there are 2 new options: `rebalancer = <bool>` and `rebalancer_mode = <name>`. The `rebalancer` flag can be either omitted, or set to true, or false. It can be set for replicasets and for specific instances. There can be only one `rebalancer = true` in the whole config. But can be many `rebalancer = false`. * `rebalancer = true` assigned to an instance means that this instance is guaranteed to run the rebalancer service on it. The instance role doesn't matter - it can be a replica or a master. Will run the rebalancer anyway. * `rebalancer = true` assigned to a replicaset means that the service will run only on the master of this replicaset. Can be combined with `master = 'auto'` on the given replicaset. * `rebalancer = false` assigned to an instance means that it will not run the rebalancer. * `rebalancer = false` assigned to a replicaset means that all the instances of this replicaset will not run the rebalancer. * `rebalancer = nil` (same as omitted, default) means that the instance/replicaset will be eligible to run the rebalancer only if `rebalancer_mode = 'auto'` is set and there are no `rebalancer = true` anywhere. The option `rebalancer_mode` should be specified in the root of the config. It can have one of those values: * `'auto'` - default. Means that the rebalancer service location is chosen automatically among all master instances in the cluster. Excluding those which have `rebalancer = false` on them or on their replicaset. If there are any `rebalancer = true`, then this mode works the same as `'manual'`. * `'manual'`. The rebalancer will run only if there is at least one `rebalancer = true` in the config. And only on the given instance / replicaset (depending on at which level the flag was specified - for a specific instance or for a whole replicaset). * `'off'`. The rebalancer will not run anywhere, regardless of all the `rebalancer = true/false` specified in the config.
bff0d2f
to
201f1ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I have no objections anymore.
end | ||
|
||
local function master_on_disable() | ||
log.info("Stepping down from the master role") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we disable master through config, we sync
our replicaset:
box.cfg({read_only = true})
sync(cfgctx.new_cfg.sync_timeout)
When master is automatically disabled, then no sync happens. Is it ok?
Frankly speaking, I don't understand the motivation of such sync on manual master switch too, it increases the time, in which several rebalancers can exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is intentional. I totally agree with you that the sync is probably not needed in both cases. However I didn't want to change the old behaviour even in this small detail. In the new behaviour I dropped the sync on purpose.
Thanks for the review! |
#429, #430, #432, #433, and more. All the things needed to make vshard compatible with the config module in the core.