Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-master detection for storages, rebalancer location selection, and bug fixes #439

Merged

Conversation

Gerold103
Copy link
Collaborator

#429, #430, #432, #433, and more. All the things needed to make vshard compatible with the config module in the core.

SENT bucket could be deleted by GC before bucket_send() is
completed. It could lead to a problem that bucket_send() would
notice that in the end and treat as an error. The bucket should
not be deleted while its transfer is still in progress.

The bug is fixed now because it started manifesting frequently in
new tests for the next commits.

Closes #433

NO_DOC=bugfix
@Gerold103 Gerold103 force-pushed the gerold103/gh-278-429-430-432-auto-master-and-rebalancer branch 2 times, most recently from f3e863c to 45ee7e2 Compare November 3, 2023 22:12
Copy link
Contributor

@Serpentian Serpentian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great patchset! Here're some comments. Feel free to resolve any nits without changing the code

vshard/storage/init.lua Show resolved Hide resolved
vshard/storage/init.lua Show resolved Hide resolved
vshard/storage/init.lua Outdated Show resolved Hide resolved
test/router/master_discovery.result Show resolved Hide resolved
vshard/storage/init.lua Show resolved Hide resolved
vshard/storage/init.lua Show resolved Hide resolved
vshard/storage/init.lua Show resolved Hide resolved
vshard/storage/init.lua Show resolved Hide resolved
vshard/replicaset.lua Show resolved Hide resolved
@Gerold103 Gerold103 self-assigned this Nov 14, 2023
SENDING bucket could be recovered back to ACTIVE state if the
destination instance didn't yet apply the first bucket_recv().
Could happen if the recovery was just faster than the bucket
sending procedure.

That shouldn't happen. The patch makes the recovery skip the
local buckets which are being transferred right now, regardless of
their state.

NO_DOC=bugfix
The old name looked like it meant the object itself is an auto
assigned master. But in fact it is a replicaset flag which means
the master role should be auto-discovered.

The patch makes the name clearer. The reason for it now is that
the flag is going to be used more in the next commits.

Part of #429

NO_DOC=refactoring
The master check is going to be used in more than one place in the
next commits. Better do it via one function which this commit
introduces.

Part of #429

NO_DOC=refactoring
NO_TEST=refactoring
The function used to return 'false' on error. But all functions
which can fail are supposed to return nil on error (exception on
critical errors such as misusage).

The patch makes the function conform with the rules. This is done
as a preparation before the function would start returning another
error in a next commit.

Part of #429

NO_DOC=not visible
NO_TEST=already covered
A few places in storage code still used fiber_cond:wait()
directly. It is not safe for negative timeouts and is too easy to
forget to convert an error to vshard format. The patch replaces
those usages with util.fiber_cond_wait.

This was noticed while fixed another bug related to sleeps and
fiber cancellation.

NO_DOC=not visible
Apparently storage service fibers could live quite a while after
their cancellation, because all what they are doing is wrapped
into pcalls (built-in cancellation points didn't throw), and the
cancel wasn't checked often enough.

This commit adds some more cancellation points to recovery, gc,
rebalancer.

It is fixed now because in the next commits it was getting
problematic to change master role. When a master was switched off
and then back on, then its new services couldn't be started
right away until the old ones finally died. It would clog the logs
a lot.

NO_DOC=bugfix
NO_TEST=covered in next commits
It is moved into its own section where the next commits will
almost fully rewrite it. The move is done separately, because some
parts will remain unchanged and it should be easier to see them in
the diff this way.

Part of #429

NO_DOC=refactoring
NO_TEST=refactoring
It was inside storage_cfg, but it is going to grow too big to
contain it in the cfg code further. The patch moves it to the
rebalancer management section.

Part of #429

NO_DOC=refactoring
NO_TEST=refactoring
It was almost entirely done in a single function, was too
difficult to extend in next commits. This patch splits the config
code into several functions doing a certain step of the
configuration.

The patch also introduces a config context. It is an object where
one time per cfg call are calculated and saved all the most useful
config details such as the current replica and replicaset UUID,
whether the instance was a master before and is now, config of
this particular instance and replicaset, and more.

The context is introduced because these attributes are going to be
used more widely across several functions, and passing them as
arguments everywhere would be just unhandy. As well as calculating
them again in each function.

For example, it allows storage_cfg_master_prepare() and
storage_cfg_master_commit() just use the already provided
was_master_in_cfg and is_master_in_cfg and concentrate on the
switch logic instead of figuring out how did the state change.

Part of #429

NO_DOC=refactoring
NO_TEST=refactoring
It was done after schema upgrade, but now is done before. It makes
sense because master commit should enable read-write mode which is
needed for the schema changes.

It worked so far, because in manual master config the read-only
cfg update is done at the box.cfg stage. I.e. before the schema
update.

But soon the master role will be allowed to turn on and off
automatically. And then the old order wouldn't work anymore.

At the same time not everything can be done before the schema is
installed. Because some services, like recovery and bucket GC do
need the schema (`_bucket` space at least). The solution is to
make those services depend on whether the first configuration is
performed. They are started only after that.

Part of #429

NO_DOC=refactoring
NO_TEST=refactoring
There is this_is_master() function, but it wasn't used in all
places, and it used to rely on the config. Soon the config won't
be telling who the master is when the role is set to be automatic
in next commits.

The patch introduces a flag is_master which reflects the current
actual state, not the config. For now it is only managed by the
config, but in the next commit it won't be so anymore.

Vtest also now won't rely on the config. It uses the actual master
state.

Part of #429

NO_DOC=internal change
It is now storage_call_test.lua. The reason is that it contains
just vshard.storage.call() tests. There are going to be added some
new files and this one was getting more confusing.

Needed for #429

NO_DOC=refactoring
@Gerold103 Gerold103 force-pushed the gerold103/gh-278-429-430-432-auto-master-and-rebalancer branch from 45ee7e2 to feddae2 Compare November 15, 2023 23:05
Replicaset config gets a new option 'master' with the only
possible value (besides nil) 'auto'. That makes the nodes of this
replicaset decide who of them is the master by looking at their
local box.info.ro value.

Such replicaset is yet unable to run the rebalancer, and other
replicasets and routers can't make rw requests to it. This is all
supported in next commits.

Closes #429

NO_DOC=covered by a next commit
bucket_recv, bucket_send, rebalancer_apply_routes, and
rebalancer_request_state only make sense on the master instance.
Now it is checked explicitly.

This becomes necessary for the next commits which will enable
master discovery at the storage level, and the likelihood of
making those calls on a non-master instances rises up.

Part of #430

NO_DOC=internal change
The new function is used by bucket recovery instead of
bucket_stat(). The problem of the latter is that it allows to be
invoked on a non-master instance which is not ok for recovery - it
needs to get the most up to date state of the bucket.

On the other hand it wasn't possible to just ban bucket_stat() on
non-master instances, because it is a public function and it was
never documented to work exclusively on master instances.

Part of #430

NO_DOC=internal change
It wraps replicaset:callrw(). The purpose is to handle the
NON_MASTER error in a single place. It is going to become much
more likely to get after the next commits which will enable master
discovery at the storage level.

It will be important to notice that the node, which was thought to
be the master, is not in that role anymore, and a new discovery is
needed.

Part of #430

NO_DOC=internal change
NO_TEST=covered in next commits
Those functions replace the old split() function. It is needed
mainly for tests which are going to pass read_only options
per-instance in the global config. But it also looks like an
actually useful feature even outside of tests.

In tests that is needed for auto-master cases. There is already
one place - auto_master_2_test.lua, but for that one the patch in
vtest was enough and helped only for bootstrap. The current patch
allows to reconfigure the instances after bootstrap via their
own private config sections.

Needed for #430

@TarantoolBot document
Title: vshard: per-instance box options in the config

Previously the config could include box options only on the root
level like so:

```Lua
{
    -- VShard options:
    sharding = ...,
    weights = ...,
    bucket_count = ...,
    ...
    -- Box options:
    replication_timeout = ...,
    memtx_use_mvcc_engine = ...,
    net_msg_max = ...,
}
```
Those box-options are propagated to `box.cfg{}` on each instance
using this config. But there was no way to pass different options
for difference instances via one config.

Now it is possible via their config sections in `sharding`:
```Lua
{
    sharding = {
        [replicaset_uuid] = {
            replicas = {
                [replica_uuid] = {
                    replication_timeout = ...,
                    net_msg_max = ...,
                    read_only = ...,
                }
            },
            ...
        },
        ...
    },
    ...
    -- Box options:
    replication_timeout = ...,
    memtx_use_mvcc_engine = ...,
    net_msg_max = ...,
}
```

For each instance its resulting `box.cfg` is merged from the root
box options and from the instance's section in `sharding`. If some
options are specified in both places, then the priority is for the
replica-local options.

For example:
```
{
    sharding = {
        [replicaset_uuid] = {
            replicas = {
                [replica_uuid] = {
                    replication_timeout = 100,
                    net_msg_max = 200,
                }
            },
        },
    },
    replication_timeout = 50,
    txn_timeout = 300,
}
```
The resulting `box.cfg` config for the given replica will be:
```Lua
{
    replication_timeout = 100,
    net_msg_max = 200,
    txn_timeout = 300,
}
```
Many of them used to use the default timeout
(replica.net_timeout). It is working good enough, but isn't really
suitable for storages - this auto-tuning timeout thing was
designed for routers mostly.

Besides, soon in master_call() it will be needed to know the
timeout before doing the call itself. Getting the net_timeout
out of the master object would work for now, but soon this won't
be possible when auto-master will be enabled with discovery.

That all means either trying schemas like the router does for the
same job, or simply always pass an explicit timeout to all
master calls. The patch follows the latter way.

The need to know the timeout will appear for doing retries inside
master_call() when the master is not known right now, but chances
are high is going to be discovered in a few milliseconds. For the
waiting a deadline is needed, and for that a timeout is necessary.

Part of #430

NO_DOC=internal change
NO_TEST=covered
locate_master() would not even try to create connections to the
instances, which didn't have an already existing one. As a result,
if the actual master wasn't ever contacted by any calls, then it
would stay undiscovered.

Part of #430

NO_DOC=bugfix
@Gerold103 Gerold103 force-pushed the gerold103/gh-278-429-430-432-auto-master-and-rebalancer branch from feddae2 to bff0d2f Compare November 15, 2023 23:12
Until now only master discovery of own replicaset was supported on
the storage level. But now the storages can find masters of other
replicasets.

The design is similar to the router's master discovery, except
that it is not proactive. Storages rarely need to talk to each
other which makes pointless to constantly maintain all the
connections and re-check the instances' roles to keep the master's
location info up to date.

Instead, the discovery is reactive. It only works for the
replicasets which are being used for any calls right now.

Part of #430

@TarantoolBot document
Title: vshard: master discovery on storage level
Routers do already support `master = 'auto'` setting at the
replicaset level in the config. It makes them able to
automatically find the master instance location in the so
configured replicaset.

On the storages it wasn't supported at all. But now it is. The
storages can specify in their configs the same option
`master = 'auto'` for their own replicaset and for any other
replicasets too.

If the storage has `master = 'auto'` for its own replicaset, then
it will constantly monitor `box.info.ro`. When it becomes `true`,
the instance considers itself a replica. When false - considers
itself a master. It also means that such automatic masters will
start and stop all needed background fibers automatically as well,
depending on the actual state of the instance, not on the config.

However it is worth keeping in mind that this option doesn't mean
master election. It only means master **detection**. I.e. if this
option is used, then the user's code is supposed to assign the
masters using `box.cfg.read_only` or any other way.

If the storage sees `master = 'auto'` for any other replicasets
than its own, then it will be able to automatically discover their
master instance for internal operations, such as bucket recovery
and rebalancing.
Previous commit added support for full master discovery on the
storages, but the rebalancer didn't work on replicasets having
auto master enabled for themselves. Now it works on those too.

The priority is firstly given to manual masters. If none are found
in the entire cluster, then the rebalancer appears on the
auto-assigned master with the minimal replicaset UUID among all
the other auto-masters.

Closes #430

NO_DOC=documented later with new rebalancer options
Automatic master discovery on the storages means, that the
instances are going to establish more connections between
replicasets to locate the masters of each other.

That might easily lead to fullmesh of the whole cluster - it means
lots of connections majority of which are not even used most of
the time. Need to get rid of them somehow if they are not used for
a while.

This commit makes the storages drop the too long unused
connections. Also the automatically discovered master locations
are erased by the same no-activity timeout. The logic behind it is
that if an instance wasn't contacted anyhow for a very long time,
then it might be not a master anymore. It is better not to treat
it as a master then and let it be re-discovered when a master
connection is needed again.

Follow-up #429
Follow-up #430

NO_DOC=internal
The flags allows to explicitly assign the rebalancer role to a
specific instance or a replicaset. Or forbid its automatic
assignment to instances or whole replicasets.

Part of #432

NO_DOC=in the final commit with rebalancer options
Closes #432

@TarantoolBot document
Title: vshard: rebalancer flag and mode
So far it was impossible to specify which instance should run the
rebalancer. It was always automatically assigned using some
internal rules based on UUIDs.

Now the users can choose:
- Which specific instance should run the rebalancer. Can be a
    replica or a master - won't matter.
- In which replicaset the instance to run the rebalancer should be
    selected automatically.
- Which instances and whole replicasets should not run the
    rebalancer even when it is selected automatically.

For that there are 2 new options: `rebalancer = <bool>` and
`rebalancer_mode = <name>`.

The `rebalancer` flag can be either omitted, or set to true, or
false. It can be set for replicasets and for specific instances.
There can be only one `rebalancer = true` in the whole config. But
can be many `rebalancer = false`.

* `rebalancer = true` assigned to an instance means that this
    instance is guaranteed to run the rebalancer service on it.
    The instance role doesn't matter - it can be a replica or a
    master. Will run the rebalancer anyway.

* `rebalancer = true` assigned to a replicaset means that the
    service will run only on the master of this replicaset. Can be
    combined with `master = 'auto'` on the given replicaset.

* `rebalancer = false` assigned to an instance means that it will
    not run the rebalancer.

* `rebalancer = false` assigned to a replicaset means that all the
    instances of this replicaset will not run the rebalancer.

* `rebalancer = nil` (same as omitted, default) means that the
    instance/replicaset will be eligible to run the rebalancer
    only if `rebalancer_mode = 'auto'` is set and there are no
    `rebalancer = true` anywhere.

The option `rebalancer_mode` should be specified in the root of
the config. It can have one of those values:

* `'auto'` - default. Means that the rebalancer service location
    is chosen automatically among all master instances in the
    cluster. Excluding those which have `rebalancer = false` on
    them or on their replicaset. If there are any
    `rebalancer = true`, then this mode works the same as
    `'manual'`.

* `'manual'`. The rebalancer will run only if there is at least
    one `rebalancer = true` in the config. And only on the given
    instance / replicaset (depending on at which level the flag
    was specified - for a specific instance or for a whole
    replicaset).

* `'off'`. The rebalancer will not run anywhere, regardless of all
    the `rebalancer = true/false` specified in the config.
@Gerold103 Gerold103 force-pushed the gerold103/gh-278-429-430-432-auto-master-and-rebalancer branch from bff0d2f to 201f1ee Compare November 15, 2023 23:25
@Gerold103 Gerold103 requested a review from Serpentian November 15, 2023 23:33
Copy link
Contributor

@Serpentian Serpentian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I have no objections anymore.

end

local function master_on_disable()
log.info("Stepping down from the master role")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we disable master through config, we sync our replicaset:

box.cfg({read_only = true})
sync(cfgctx.new_cfg.sync_timeout)

When master is automatically disabled, then no sync happens. Is it ok?

Frankly speaking, I don't understand the motivation of such sync on manual master switch too, it increases the time, in which several rebalancers can exist.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is intentional. I totally agree with you that the sync is probably not needed in both cases. However I didn't want to change the old behaviour even in this small detail. In the new behaviour I dropped the sync on purpose.

@Gerold103 Gerold103 merged commit 9fda354 into master Nov 17, 2023
12 checks passed
@Gerold103
Copy link
Collaborator Author

Thanks for the review!

@Gerold103 Gerold103 deleted the gerold103/gh-278-429-430-432-auto-master-and-rebalancer branch November 17, 2023 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants