Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failed to update Store after split: no system config available #2566

Closed
tbg opened this issue Sep 19, 2015 · 9 comments
Closed

failed to update Store after split: no system config available #2566

tbg opened this issue Sep 19, 2015 · 9 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered).

Comments

@tbg
Copy link
Member

tbg commented Sep 19, 2015

The following test appears to have failed:

#7681:

    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:736 +0x4d
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask.func1(0xc8203f9360, 0xc820017500, 0xc8203f93a0, 0x1a)
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:129 +0x2b
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:131 +0x271
FAIL    github.com/cockroachdb/cockroach/storage    24.499s
=== RUN   TestBatchBasics
I0919 18:19:45.026663 936 storage/engine/rocksdb.go:122  closing in-memory rocksdb instance
--- PASS: TestBatchBasics (0.01s)
=== RUN   TestBatchGet
I0919 18:19:45.031143 936 storage/engine/rocksdb.go:122  closing in-memory rocksdb instance
--- PASS: TestBatchGet (0.01s)
=== RUN   TestBatchMerge
I0919 18:19:45.037866 936 storage/engine/rocksdb.go:122  closing in-memory rocksdb instance
--- PASS: TestBatchMerge (0.00s)
=== RUN   TestBatchProto

Please assign, take a look and update the issue accordingly.

@tbg tbg added the C-test-failure Broken test (automatically or manually discovered). label Sep 19, 2015
@kkaneda
Copy link
Contributor

kkaneda commented Sep 20, 2015

The error was

F0919 18:20:00.524122 879 storage/replica_command.go:1273  failed to update Store after split: no system config available

, which is coming from Replica.updateRangeInfo.

func (r *Replica) updateRangeInfo() error {
    // RangeMaxBytes should be updated by looking up Zone Config in two cases:
    // 1. After snapshot applying, if no updating of zone config
    // for this key range, then maxBytes of this range will not
    // be updated.
    // 2. After a new range is created by range splition, just
    // copying maxBytes from the original range does not work
    // since the original range and the new range might belong
    // to different zones.
    // Load the system config.
    cfg := r.rm.Gossip().GetSystemConfig()
    if cfg == nil {
        return fmt.Errorf("no system config available")
    }

Should we relax the condition (i.e., do nothing if the system config is not available) or should we disallow split to start until the system config is gossiped?

@tbg tbg changed the title Test failure in CI build 7681 failed to update Store after split: no system config available Sep 21, 2015
@tbg
Copy link
Member Author

tbg commented Sep 21, 2015

@mberhault, could you take a look?

It's dangerous to rely on gossip for operations which must be successful (here, it's that updateRangeInfo is called as a defer to a batch, so it's too late to abort when the error happens).
Can the update be skipped as @kkaneda suggested? After all, there must be a loop somewhere, updating the configs. If that is the case, maybe during a split the new range can just take the config of the old one?

Btw, I don't see why updateRangeInfo sits in replica_raftstorage.go. We should move it to replica.go.

@mberhault
Copy link
Contributor

I'm looking. there may be a few tests that rely on gossip callbacks having run. Either we shouldn't care (this is the case for all the queues, since they are repeatable), or we should wait for the config in tests (we do this in a few places, but not everywhere).
In this particular case, all we do is update the "max bytes". We already do this for all replicas when the gossip config changes (storage/store.go:systemGossipUpdate which is the system config callback) , so this should be safe to skip.

@tbg
Copy link
Member Author

tbg commented Sep 21, 2015

My concern is a bit more general in that I'm not sure when we can reasonably rely on things being present in Gossip. For example, the configs are probably gossiped without expiry now, but at some point they might get a TTL (online schema changes, ...) which won't noticeably change things in testing/etc but sets us up for all sorts of production-scenario panics. If we have certain keys that are to always hold a value once a Node has bootstrapped, I'd sleep better having some explicit mechanisms for this.

@tbg
Copy link
Member Author

tbg commented Sep 21, 2015

isn't updateRangeInfo pointless anyways? It's called either after snapshot or after split, but in both cases we can just take the info from the old range (or hold the latest config on the Store, which would allow avoiding races).

@mberhault
Copy link
Contributor

we'll need to harden things in the absence of gossip information. at the very least, we need to fail gracefully.
As for updateRangeInfo, the comment in there is accurate: the config might differ if the split is due to zone boundaries. It is nice to have an accurate MaxBytes, but not critical. In the worst case, we'll split too early (or not enough). Doing it best-effort here (just log error and exit) seems like the right thing to do. The right value will be set when the gossip callback runs.

@tbg
Copy link
Member Author

tbg commented Sep 21, 2015

assuming we skip the update on the split, and there's never a follow-up config update, will the callback ever run or is the range just going to have MaxBytes=0 (and could that in itself blow something up)?

@mberhault
Copy link
Contributor

isn't the replica initialized from the pre-split one? If not, we should make sure we propagate at least the MaxBytes

@mberhault
Copy link
Contributor

actually, even if it isn't, 0 gets ignored when incrementing (replica.go:maybeAddToSplitQueue) and the split queue itself will skip the range if it can't find the zone for that range. So either way, lazily setting it will be fine.

mberhault pushed a commit that referenced this issue Sep 22, 2015
Fixes #2566

When the system config is not loaded, MaxBytes will remain 0 and
be ignored for split purposes.
It will be properly updated the next time the system config callback
runs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered).
Projects
None yet
Development

No branches or pull requests

3 participants