Corrupt zone bundles #4010
Comments
There are many log files in that nexus zone, maybe a log rotation occurred in the middle of creating the tar.gz?
The last line of the zone bundle's log is:
which corresponds to a line in the middle of a rotated log file in the zone:
Paging @bnaecker
I'm not sure what's going on. I'd have expected that the open file descriptor the tar process has would preclude actual removal of the file, even if the file is rotated or archived to the U.2 concurrently. This is what happens if one process I'll look into what
So we're using
Note looking at the
Thanks @rmustacc that makes sense. I assumed the file was renamed or copied and then deleted. @jmpesp do you happen to have either the sled-agent log or the timestamps of the rotated files? I think with one or both of those, and the timestamp from the
Yeah, copying is not immune to this, that's true. I think it'll be better than what we have though. The tarball is sad, I'm guessing because the metadata for the log file doesn't match the actual size. That would be the TOCTOU in the implementation I linked above. So while we can't prevent this from happening, we could at least prevent it from breaking the tar file. We can copy the log files into a tempdir, and then use the same approach. It's possible the file will be truncated compared to the original, but this will make sure the tar header and actual file metadata match. This whole process is best-effort, so a truncated file is better than a broken tarball.
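To make the copy-then-insert idea concrete, here is a minimal sketch in Python (the real bundler is not written in Python, and `bundle_logs` is a hypothetical name used only for illustration). It assumes nothing else writes to the staging directory, so the size recorded in each tar header is taken from a file that can no longer change underneath us.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def bundle_logs(log_paths, bundle_path):
    """Best-effort bundle: stage each log in a private tempdir, then tar it.

    Returns the archive names that were actually included.
    """
    included = []
    with tempfile.TemporaryDirectory() as tmp, \
            tarfile.open(bundle_path, "w:gz") as tar:
        for index, src in enumerate(log_paths):
            src = Path(src)
            staged = Path(tmp) / f"{index}-{src.name}"
            try:
                # The copy may be truncated relative to the live file if a
                # rotation races with us, but it is stable once it exists.
                shutil.copyfile(src, staged)
            except FileNotFoundError:
                # The file was rotated or archived away: skip it.
                continue
            # Header metadata is read from the staged copy, so the size in
            # the header matches the bytes written to the archive.
            tar.add(staged, arcname=src.name)
            included.append(src.name)
    return included
```

The trade-off is the one described above: a staged copy might be shorter than the original, but the archive itself stays readable.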
I'm sorry, I don't have either. I can try and reproduce this again today if that helps.
If you can, that'd be very helpful, thanks. What we see now is consistent with your hypothesis of concurrent rotation / truncation and bundling, it'd just be nice to prove it. I think we can do the approach mentioned above, of copying and then inserting, even if we don't have that confirmation. We may still end up with a truncated log file if we copy it while it's being rotated, but the tar header metadata and the file should at least match in that case.
Got it:
note that the time from
After some more prompting from @rmustacc, I looked into the implementation of
So we can cooperate with this from the zone bundler, though I'd need to think about exactly how. So a proposal for fixing this:
I did a bit more digging after we were hitting the same

We noted a lot of errors like this, when pulling zone bundle files:

That's an error coming from here in the

    if (chksum != checksum(&dblock)) {
        if (chksum != checksum_signed(&dblock)) {
            (void) fprintf(stderr, gettext(
                "tar: directory checksum error\n"));
            if (iflag) {

So the checksum in the file doesn't match the one we actually expect. Which file are we failing on? I ran this DTrace invocation, which looks for the filename we expect when we call

The

So looking at the tar header itself, using Python's

    In [102]: member
    Out[102]: <TarInfo 'oxide-clickhouse:default.log' at 0x1316b1100>
    In [103]: member.offset_data
    Out[103]: 86410752
    In [104]: member.size
    Out[104]: 85882899
    In [105]: last = member.offset_data + member.size
    In [106]: buf[last - 100: last + 100]
    Out[106]: b'ments_i64 (5fe342cc-6366-4e12-bec2-cf863eadb378): Removing part from filesystem all_983825_983825_0\n2023.09.28 16:37:30.403686 [ 3 ] {} <Trace> HTTPHandler-factory: HTTP Request for HTTPHandler-factor'

So the

    In [110]: len(buf) - last
    Out[110]: 46061

Looking at the tar header structure, the "filename" of the next entry is the garbage

    In [144]: import math
    In [146]: start_next = math.ceil(last / 512) * 512
    In [147]: buf[start_next:][:100]
    Out[147]: b" user 'default' from [fd00:1122:3344:10a::3]:33787\n2023.09.28 16:37:30.404030 [ 3 ] {} <Debug> HTTP-"

So tar is just reinterpreting the bytes of the logfile itself as a header, which obviously fails the checksum comparison. To summarize, I think this issue is a bit broader than handling truncation of the log files correctly as
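The manual inspection above can be automated. The following is a rough sketch, not the tool used in this thread: it walks an uncompressed tar buffer block by block and reports the offset of the first 512-byte block that should be a header but fails the checksum. `first_bad_header` is a hypothetical helper name, and the sketch assumes plain octal size fields (it does not handle the GNU base-256 encoding used for very large members).

```python
import math

BLOCK = 512  # tar archives are a sequence of 512-byte blocks


def header_checksum(block: bytes) -> int:
    """Unsigned header checksum: the sum of all header bytes, with the
    8-byte checksum field (offsets 148..155) counted as ASCII spaces."""
    return sum(block[:148]) + 8 * ord(" ") + sum(block[156:BLOCK])


def parse_octal(field: bytes) -> int:
    """Parse a NUL/space-terminated octal field; raises ValueError on junk."""
    text = field.split(b"\0", 1)[0].strip()
    return int(text, 8) if text else 0


def first_bad_header(buf: bytes):
    """Return the offset of the first invalid header, or None if all pass."""
    offset = 0
    while offset + BLOCK <= len(buf):
        block = buf[offset:offset + BLOCK]
        if block == bytes(BLOCK):
            return None  # all-zero block marks the end of the archive
        try:
            stored = parse_octal(block[148:156])   # checksum field
            size = parse_octal(block[124:136])     # member size field
        except ValueError:
            return offset  # not octal at all, e.g. log text read as a header
        if stored != header_checksum(block):
            return offset
        # Skip the header plus the member data, rounded up to a full block.
        offset += BLOCK + math.ceil(size / BLOCK) * BLOCK
    return None
```

For a gzipped bundle, decompress first (for example with `gzip.decompress`) and pass the resulting bytes. A returned offset that lands inside log text is consistent with a member whose data is longer than the size recorded in its header.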
I've been reading more of the One thing we'll want to ensure is that we don't keep these snapshots around forever. I think it's enough to delete them if they exist when the sled-agent starts up, and then when the zone-bundling is finished, either successfully or not. If we crash during that process, the snapshot will not persist indefinitely, since it'll be deleted when the sled agent restarts. I'll look more into this today, and hopefully make some progress.
- Fixes #4010 - Previously, we copied log files directly out of their original locations, which meant we contended with several other components: `logadm` rotating the log file; the log archiver moving the file to longer-term storage; and the program writing to the file itself. This commit changes the operation of the bundler to first create a ZFS snapshot of the filesystem(s) containing the log files, clone them, and then copy files out of the clones. We destroy those clones / snapshots after completing, and again when the sled-agent starts, to help with crash-safety.
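The snapshot / clone / destroy lifecycle described in that commit message can be sketched as a context manager. This is an illustration only, not the sled-agent code: the dataset and clone names are hypothetical, and `run` is injectable so the ordering can be exercised without a ZFS pool. It relies only on the standard `zfs snapshot`, `zfs clone`, and `zfs destroy` subcommands.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def snapshot_clone(dataset, snap_name, clone_name, run=subprocess.check_call):
    """Snapshot `dataset`, clone the snapshot, and yield the clone's name.

    The clone and the snapshot are destroyed on exit, whether the body
    succeeded or raised. A crash of the whole process would still leak
    them, which is why cleanup at startup is also needed.
    """
    snapshot = f"{dataset}@{snap_name}"
    run(["zfs", "snapshot", snapshot])
    try:
        run(["zfs", "clone", snapshot, clone_name])
        try:
            yield clone_name
        finally:
            # The clone depends on the snapshot, so it must go first.
            run(["zfs", "destroy", clone_name])
    finally:
        run(["zfs", "destroy", snapshot])
```

A caller would copy log files out of the clone's mountpoint inside the `with` block; because the clone is a frozen view, rotation or archiving of the live files cannot change what is being read.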
After running
/opt/oxide/sled-agent/zone-bundle bundle-all
and pulling that to my workstation, tar is reporting errors unpacking the nexus bundle:
I'm not sure what file it's not liking, it seems to stop after the default log:
The actual gzip seems ok: