ui: no data available but cluster appears to be running normally #18324
Comments
Screenshots of the pages for ranges 1, 61, and 3, please (or we can probably look at this cluster directly, right?). @nvanbenschoten, this version has the log spy, so the Raft problems I anticipate should be straightforward to debug. Perhaps #17524 again. |
We diagnosed #18327 as one of the problems in this cluster. @dianasaur323 to bump the node version to see if problems persist. |
The under-replicated ranges are in fact under-replicated; there are only two replicas for ~30 ranges. The reason the third replica was removed is always "store dead". The allocator simulation always suggests adding a store, yet this doesn't seem to be happening (@dianasaur323 mentioned seeing refused snapshots, which could be part of the problem). There may be more problems, but that is definitely one of them. |
There's also still #18339 which I hear is going to be fixed soon, so we'll have a better idea about the "no lease" ranges which looked healthy when I looked. |
Seeing this on the lease holder of one of the underreplicated ranges:
The cluster is running with defaults, so a Raft snapshot should be capped at 8 MB/s. |
Logs show many snapshot problems. It looks like all nodes are routinely refusing snapshots, and nothing ever gets done because the stores are throttled; even if they weren't, they'd refuse most things. I've also seen this:

E170910 02:38:29.352546 109 storage/queue.go:656 [raftsnapshot,n2,s2,r754/2:/Table/166/7/9{419/2…-879/2…}] snapshot failed: log truncation during snapshot removed sideloaded SSTable

@dianasaur323 I'm curious, did you by any chance manually delete anything from the data directory? You would expect that error only in the wild when actively ingesting a backup. After that, the only explanation I have is that someone deleted the |
A rough idea about the frequency of these deletions. Note how it's always the same ranges -- that sounds like files were deleted manually.
|
Oh, monster duh -- this [the sideloading stuff] is expected with this change: #17787. Basically, the old file that was written at |
I've made an attempt at copying the files to their new location, using this snippet:
#!/usr/bin/env bash
# Copy one legacy sideloaded SSTable into the new sharded layout.
# Usage: <script> <filename-starting-with-numeric-id>
set -euo pipefail

id=$(echo "$1" | sed -E 's/^([0-9]+).*/\1/')  # leading numeric ID of the file
shard=$((id % 1000))                          # the new layout shards by ID % 1000
dest="sideloading/${shard}/${1}"
mkdir -p "$(dirname "$dest")"
cp -p "$1" "$dest"

Will check tomorrow whether the sideloading messages persist. If they do, I assume it's because I didn't copy into the right location. But hopefully they're gone and we can check whether they were somehow connected to this (pretty likely, since persistently failing snapshots could clog things up if enough ranges are affected). |
Hmm. Seems to continue, but the odd thing I'm noticing only now is that the replicas for which I see that error already have a new-style sideloading directory, and there's (usually) one SSTable in it. They don't even have the old-style directory, which makes it seem like they would be unaffected by the problem described in my last post. (Sample of 2, so this might not hold for all of them.) It's definitely awful that the error doesn't give us so much as a clue about which file it's looking for. I'll make sure to fix that for 1.1. But, regardless, it looks more as if the code is looking in the old place while the file is really in the new location. Or we deleted something during truncation we weren't supposed to delete - a one-off would be most likely here, but I didn't find one. |
@dianasaur323 I'd be interested in when you imported into this cluster and which version you were running. The change above was merged at

Hmm, I need to check what we are doing when a range splits. We should be copying (parts of) the sideloaded storage around. Pretty sure we don't; perhaps this is at work here (and even if it is, it's a pretty bad omission). |
@tschottdorf wow, you did a lot of work in the last couple hours.... I haven't been manually deleting anything, although our cluster right now is fully automated. I believe @nstewart has some cron jobs that spin up nodes if they die, and also pull the most recent binary on a daily basis. Should be the keep-alive.sh and upgrade-and-kill.sh files located in the root directory. I'm noticing that the upgrade-and-kill.sh file kills all cockroach nodes versus doing a rolling upgrade. Could that be causing a problem? |
Eh, too late to write comments. The RHS of a split starts with an empty log so the above is false. Still, I'm lacking a sensible theory as to what's going on with these missing files. |
No, or rather, if it is, then that's something that would need to be fixed. Fairly certain it shouldn't though. |
@dianasaur323 killall is local to the node and the cron for that command is offset by 5 mins so it should be a rolling upgrade |
@nstewart ah, excellent! thanks for clarifying that. |
(btw, if you're wondering how I made that screenshot, it's |
For r676, I checked the sideloading directory: a) it has only the new-style one, and b) there's only one SSTable in there, at index 21/term 7. Looking at the screenshot above, the snapshot for the trailing replica would contain that SSTable. Thinking that maybe it was looking for the wrong term, I made a few copies of the file at all relevant terms, but no impact. Well, #18405 will tell us what the problem is. |
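As context for "copies of the file at all relevant terms": sideloaded SSTables are presumably addressed by both log index and term, so a lookup under a different term misses the file. A small Go sketch of that idea; the directory layout and filename format here are assumptions for illustration, not taken from the issue:

package main

import (
	"fmt"
	"path/filepath"
)

// sideloadedFilename sketches a plausible on-disk name for a sideloaded
// SSTable, keyed by both raft log index and term. If the term used for
// the lookup differs from the term the file was written under, the file
// is effectively missing, which is why copying it at every candidate
// term was a reasonable thing to try.
func sideloadedFilename(dir string, index, term uint64) string {
	return filepath.Join(dir, fmt.Sprintf("i%d.t%d", index, term))
}

func main() {
	// The single SSTable mentioned above sits at index 21, term 7.
	fmt.Println(sideloadedFilename("sideloading/676", 21, 7))
}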
Bumped the cluster to include #18405. |
Ok, here's the updated message:
This is interesting. As I mentioned above, the file we have in that replica's sideloaded storage is |
Node 2, the other member of the Raft group which is up to date, also only has |
This is surprising. The empty entry at 22 indicates that the Raft leader proposed an empty entry here (as it does after leader election). A sideloaded proposal looks like the one at index 21. So the conjecture is that the actual sideloaded proposal is

This looks like we're seeing an empty entry but somehow mistaking it for a sideloaded entry. This would happen if it had the wrong command encoding version, but an empty entry has an empty version. |
My conjecture is that the fact that there is a sideloaded entry at 21 and then an empty one at 22 somehow causes the bug. For example, this would happen if we reuse the same |
Uh, actually:

func TestFoo(t *testing.T) {
	// Demonstrates that proto Unmarshal merges into the receiver instead
	// of resetting it: decoding an empty message leaves previously set
	// Data intact.
	var ent raftpb.Entry
	ent.Data = []byte("foo")
	if err := ent.Unmarshal([]byte{}); err != nil {
		t.Fatal(err)
	}
	// --- FAIL: TestFoo (0.00s)
	// store_test.go:61: {0 0 EntryNormal [102 111 111] []}
	if len(ent.Data) != 0 {
		t.Fatal(ent)
	}
}

That does explain it, because of this code:

cockroach/pkg/storage/store.go, lines 3430 to 3437 (at 1fd366d)
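For what it's worth, here is a minimal, self-contained sketch of that failure mode and of one way to avoid it (resetting the reused Entry before each Unmarshal). This is an illustration only -- the import path and the loops are assumptions for the sketch, not the actual store.go code or the eventual fix:

package main

import (
	"fmt"
	"log"

	"github.com/coreos/etcd/raft/raftpb" // etcd raft protobufs, as vendored at the time
)

func main() {
	withData, err := (&raftpb.Entry{Index: 21, Term: 7, Data: []byte("sideloaded")}).Marshal()
	if err != nil {
		log.Fatal(err)
	}
	// The zero-length encoding stands in for the empty entry proposed
	// after leader election (index 22 above).
	encoded := [][]byte{withData, {}}

	// Buggy pattern: one Entry is reused across decodes, so the second,
	// empty decode inherits the Data of the first.
	var ent raftpb.Entry
	for i, b := range encoded {
		if err := ent.Unmarshal(b); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("buggy decode %d: len(Data)=%d\n", i, len(ent.Data))
	}

	// Fixed pattern: reset the Entry before each decode, so the empty
	// entry comes out with empty Data.
	for i, b := range encoded {
		ent = raftpb.Entry{}
		if err := ent.Unmarshal(b); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("fixed decode %d: len(Data)=%d\n", i, len(ent.Data))
	}
}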
|
Having restarted with the fix, r767 is now operational. The system appears to be busy applying (actually applying) snapshots. Not seeing any more sideloading snapshot errors in any of the logs. There are still a bunch of occurrences of #18339, so that issue should be addressed ASAP. |
Everything replicated back up, but there's one single holdover:
... oops, didn't post this for an hour, and now it has upreplicated. Well, problems fixed, then, I suppose? Though it's not clear why the stores were throttled for that long. Any ideas @a-robinson? |
There are 4 reasons that a store may be rejecting snapshots:
1. It's draining before shutting down
2. It's currently receiving/applying a snapshot
3. It's currently removing a replica
4. It's too far behind on too many raft ranges.
Options 2, 3, and 4 could all easily be true on a node that just came up, and 3 and 4 might take a while before they stop being true. |
It took on the order of 3 hours before the upreplication got through. Wonder what we can do to make it easier to see why it wasn't getting through earlier. Could we add a reason for the store declaring itself throttled? Or perhaps it's discernible from the graphs. Either way, I just couldn't tell, so we should probably improve something.
|
We can't easily distinguish between 2 and 3 (and we do already include an extra message for 1), but differentiating 4 couldn't hurt. We could probably figure it out from graphs, but we don't need to require that extra step. |
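As a rough illustration of what "differentiating 4" could look like, a hedged Go sketch; the enum, names, and message format below are made up for the example, not CockroachDB's actual snapshot-reservation code:

package main

import "fmt"

// rejectionReason enumerates why a store might decline an incoming
// snapshot, mirroring the four reasons listed above. The point is only
// that reason 4 (too far behind on too many ranges) gets its own text,
// so it can be told apart from a generic "store throttled".
type rejectionReason int

const (
	reasonDraining rejectionReason = iota
	reasonApplyingSnapshot
	reasonRemovingReplica
	reasonTooManyBehindRanges
)

func rejectionMessage(r rejectionReason, behindCount int) string {
	switch r {
	case reasonDraining:
		return "store is draining"
	case reasonApplyingSnapshot, reasonRemovingReplica:
		// As noted above, these two are hard to tell apart cheaply,
		// so they share a message in this sketch.
		return "store is busy applying a snapshot or removing a replica"
	case reasonTooManyBehindRanges:
		return fmt.Sprintf("store is behind on %d raft ranges", behindCount)
	default:
		return "snapshot rejected"
	}
}

func main() {
	fmt.Println(rejectionMessage(reasonTooManyBehindRanges, 42))
}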
^- this is probably #18055

I found two more issues related to sideloading in this cluster.
|
Hmm, 2. is not clear to me. We do create the sideloaded storage a little early, but we replace it every time the replicaID changes (from

Update now that I have WiFi: ok, figured it out, I think. The code which created the sideloaded storage in |
When a preemptive snapshot is applied, we (have to) write "fat" entries into the log because we don't have a sideloaded storage location yet (no replicaID). However, when that node itself later tries to create a snapshot, it ends up in the situation in which an entry has the sideloading raft command version but isn't actually on disk (because it is already inlined). We weren't handling that case correctly, resulting in (the first half of) cockroachdb#18324 (comment). Remedied that and updated TestRaftSSTableSideloadingInline so that it fails before the fix and passes after.
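A hedged sketch of the situation that commit message describes: an entry whose command version says "sideloaded" but whose payload is already inline because it arrived via a preemptive snapshot. The types and helper below are simplified stand-ins for illustration, not the actual CockroachDB code or fix:

package main

import "fmt"

// entry is a simplified stand-in for a raft log entry whose command
// payload may either reference an SSTable on disk (the usual sideloaded
// case) or already carry the SSTable bytes inline (the "fat" entries
// written while applying a preemptive snapshot, before a replicaID and
// hence a sideloaded storage location exist).
type entry struct {
	sideloadedVersion bool   // command encoding says "sideloaded"
	payload           []byte // inline bytes, if any
}

// inlineForSnapshot returns the bytes to ship in an outgoing snapshot.
// The failure mode was roughly: assuming a sideloaded-version entry
// always has its payload on disk, and erroring out when it doesn't.
// This sketch instead passes already-inlined payloads through unchanged.
func inlineForSnapshot(e entry, loadFromDisk func() ([]byte, error)) ([]byte, error) {
	if !e.sideloadedVersion {
		return e.payload, nil
	}
	if len(e.payload) > 0 {
		// Already "fat": written during preemptive snapshot application.
		return e.payload, nil
	}
	return loadFromDisk()
}

func main() {
	fat := entry{sideloadedVersion: true, payload: []byte("sst bytes")}
	b, _ := inlineForSnapshot(fat, func() ([]byte, error) {
		return nil, fmt.Errorf("no sideloaded file on disk")
	})
	fmt.Printf("shipped %d bytes without touching disk\n", len(b))
}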
With #18462 in, I believe these problems have been fixed, though I had to restart one node which also moved Raft leadership to other nodes that didn't have the problem in the first place. The cluster now has node2 decommissioned:
|
(I'll recommission it just for fun) |
This issue has been pretty productive, so please keep doing what you're doing, @dianasaur323. Going to close this one, though, since we seem to be all good. |
BUG REPORT
This is running on the PM cluster (4 nodes, 2 in NYC, 1 in SF, 1 in AMS).
NYC1:
NYC2:
SF:
AMS:
What did you do? Loaded up the admin UI.
What did you expect to see? Expected to see data
What did you see instead? Screenshots below
The interesting thing is that we've had under-replicated ranges for a while. I'm not sure if this is a new issue, or an existing one just showing itself in a more extreme form.
cc @nstewart