Corrupt zone bundles #4010

Closed
jmpesp opened this issue Sep 1, 2023 · 13 comments · Fixed by #4225

Comments

@jmpesp
Contributor

jmpesp commented Sep 1, 2023

After running /opt/oxide/sled-agent/zone-bundle bundle-all and pulling the result to my workstation, tar reports errors unpacking the nexus bundle:

jwm@duster ~/zone_bundles [] [] $ ls dinnerbone-2023-09-01T18-38-13.tar.gz 
dinnerbone-2023-09-01T18-38-13.tar.gz
jwm@duster ~/zone_bundles [] [] $ cd dinnerbone/
jwm@duster ~/zone_bundles/dinnerbone [] [] $ tar xzf ../dinnerbone-2023-09-01T18-38-13.tar.gz 
jwm@duster ~/zone_bundles/dinnerbone [] [] $ cd oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9/
jwm@duster ~/zone_bundles/dinnerbone/oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9 [] [] $ ls
b2e05ce2-5d30-4bc9-998b-5339a338f58e.tar.gz
jwm@duster ~/zone_bundles/dinnerbone/oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9 [] [] $ tar xzf b2e05ce2-5d30-4bc9-998b-5339a338f58e.tar.gz 
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

I'm not sure which file it doesn't like; it seems to stop after the default log:

$ tar tvf b2e05ce2-5d30-4bc9-998b-5339a338f58e.tar.gz
-r--r--r-- 0/0             205 2023-09-01 14:42 metadata.toml
-r--r--r-- 0/0             616 2023-09-01 14:42 ptree
-r--r--r-- 0/0              84 2023-09-01 14:42 uptime
-r--r--r-- 0/0             107 2023-09-01 14:42 last
-r--r--r-- 0/0              17 2023-09-01 14:42 who
-r--r--r-- 0/0            3761 2023-09-01 14:42 svcs
-r--r--r-- 0/0           27599 2023-09-01 14:42 netstat
-r--r--r-- 0/0           61806 2023-09-01 14:42 pfiles.18054
-r--r--r-- 0/0           59396 2023-09-01 14:42 pstack.18054
-r--r--r-- 0/0             189 2023-09-01 14:42 pargs.18054
-rw-r--r-- 0/0       167331462 2023-09-01 14:42 oxide-nexus:default.log
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

The actual gzip seems ok:

jwm@duster ~/zone_bundles/dinnerbone/oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9 [] [] $ gunzip -t b2e05ce2-5d30-4bc9-998b-5339a338f58e.tar.gz
jwm@duster ~/zone_bundles/dinnerbone/oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9 [] [] $ echo $?
0
@jmpesp
Contributor Author

jmpesp commented Sep 1, 2023

There are many log files in that nexus zone; maybe a log rotation occurred in the middle of creating the tar.gz?

root@oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9:~# ls $(svcs -L nexus)*
/var/svc/log/oxide-nexus:default.log    /var/svc/log/oxide-nexus:default.log.2
/var/svc/log/oxide-nexus:default.log.0  /var/svc/log/oxide-nexus:default.log.3
/var/svc/log/oxide-nexus:default.log.1

The last line of the zone bundle's log is:

jwm@duster ~/zone_bundles/dinnerbone/oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9 [] [] $ tail -n1 oxide-nexus\:default.log 
{"msg":"Region requested, not yet created. Retrying in 1.371705339s","v":0,"name":"nexus","level":40,"time":"2023-09-01T18:42:07.060269723Z","hostname":"oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9","pid":18054,"saga_id":"91e50f2b-1be8-431a-b344-efc9f923cf1a","saga_name":"snapshot-create","component":"nexus","component":"ServerContext","name":"787d4782-8185-4e2f-85bf-64afecd7a3a9","file":"nexus/src/app/sagas/common_storage.rs:74"}

which corresponds to a line in the middle of a rotated log file in the zone:

root@oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9:~# grep '{"msg":"Region requested, not yet created. Retrying in 1.371705339s","v":0,"name":"nexus","level":40,"time":"2023-09-01T18:42:07.060269723Z","hostname":"oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9","pid":18054,"saga_id":"91e50f2b-1be8-431a-b344-efc9f923cf1a","saga_name":"snapshot-create","component":"nexus","component":"ServerContext","name":"787d4782-8185-4e2f-85bf-64afecd7a3a9","file":"nexus/src/app/sagas/common_storage.rs:74"}' /var/svc/log/oxide-nexus:default.log.0
{"msg":"Region requested, not yet created. Retrying in 1.371705339s","v":0,"name":"nexus","level":40,"time":"2023-09-01T18:42:07.060269723Z","hostname":"oxz_nexus_787d4782-8185-4e2f-85bf-64afecd7a3a9","pid":18054,"saga_id":"91e50f2b-1be8-431a-b344-efc9f923cf1a","saga_name":"snapshot-create","component":"nexus","component":"ServerContext","name":"787d4782-8185-4e2f-85bf-64afecd7a3a9","file":"nexus/src/app/sagas/common_storage.rs:74"}

@jmpesp
Contributor Author

jmpesp commented Sep 1, 2023

Paging @bnaecker

@bnaecker
Collaborator

bnaecker commented Sep 1, 2023

I'm not sure what's going on. I'd have expected that the open file descriptor the tar process has would preclude actual removal of the file, even if the file is rotated or archived to the U.2 concurrently. This is what happens if one process open(2)s a file and another unlink(2)s it concurrently. But I guess I don't know how the tar crate is built. It could do something like stat the file to determine the size, and then open it.

I'll look into what tar does here, and we could consider copying the files to /tmp before archiving them to make sure no one can yank them out from under us.

@bnaecker
Collaborator

bnaecker commented Sep 1, 2023

So we're using append_path_with_name to insert the log files. We do that because we want to set the name of the entry in the archive, to avoid copying the entire directory structure. That appears to stat the file and then open it, which looks like a TOCTOU.
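
For reference, here's a minimal sketch of how a bundler might use the tar crate's append_path_with_name, with the TOCTOU window noted in comments. The paths and the gzip writer setup are illustrative, not the actual sled-agent code:

use std::fs::File;
use flate2::{write::GzEncoder, Compression};
use tar::Builder;

// Illustrative only; the real bundler's paths and writer differ.
fn append_log(out: File) -> std::io::Result<()> {
    let mut builder = Builder::new(GzEncoder::new(out, Compression::default()));
    // append_path_with_name stats the file to build the tar header (capturing
    // its size at that instant), then opens and reads it to EOF. The gap
    // between the stat and the read is the TOCTOU window.
    builder.append_path_with_name(
        "/var/svc/log/oxide-nexus:default.log",
        "oxide-nexus:default.log",
    )?;
    builder.into_inner()?.finish()?;
    Ok(())
}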

@rmustacc

rmustacc commented Sep 1, 2023

I'm not sure what's going on. I'd have expected that the open file descriptor the tar process has would preclude actual removal of the file, even if the file is rotated or archived to the U.2 concurrently. This is what happens if one process open(2)s a file and another unlink(2)s it concurrently. But I guess I don't know how the tar crate is built. It could do something like stat the file to determine the size, and then open it.

Note that, looking at the smf/logadm/logadm.conf entry, the way this works is that we've used the -c option, which rotates the log by opening it, copying it, and then truncating it. So if you're reading the original file while logadm is operating, it can definitely be truncated out from under you. The same is true with the cp approach mentioned above. I believe that logadm will use mandatory locking at some point as well. I don't know if that's what's going on here, but I wanted to highlight it as a possibility and something to consider.

@bnaecker
Collaborator

bnaecker commented Sep 2, 2023

Thanks @rmustacc, that makes sense. I assumed the file was renamed or copied and then deleted. @jmpesp do you happen to have either the sled-agent log or the timestamps of the rotated files? I think with one or both of those, and the timestamp from the metadata.toml file, we can determine whether the rotation happened near or during the bundle creation.

The same is true with the cp approach mentioned above.

Yeah, copying is not immune to this, that's true. I think it'll be better than what we have, though. The tarball is sad, I'm guessing, because the header metadata for the log file doesn't match its actual size. That would be the TOCTOU in the implementation I linked above.

So while we can't prevent this from happening, we could at least prevent it from breaking the tar file. We can copy the log files into a tempdir, and then use the same approach. It's possible the file will be truncated compared to the original, but this will make sure the tar header and actual file metadata match. This whole process is best-effort, so a truncated file is better than a broken tarball.
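
As a rough sketch of the copy-first idea (the helper name, paths, and tempdir handling here are made up, not the actual bundler code):

use std::fs;
use std::path::Path;
use tar::Builder;

// Sketch only: copy the live log into a private directory first, so the size
// recorded in the tar header and the bytes appended come from the same,
// stable copy. The copy may still be truncated relative to the original.
fn copy_then_append<W: std::io::Write>(
    builder: &mut Builder<W>,
    log: &Path,
    tempdir: &Path,
) -> std::io::Result<()> {
    let name = log.file_name().expect("log path has a file name");
    let copy = tempdir.join(name);
    fs::copy(log, &copy)?;
    builder.append_path_with_name(&copy, name)
}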

@jmpesp
Contributor Author

jmpesp commented Sep 8, 2023

I'm sorry, I don't have either. I can try and reproduce this again today if that helps.

@bnaecker
Collaborator

bnaecker commented Sep 8, 2023

If you can, that'd be very helpful, thanks. What we see now is consistent with your hypothesis of concurrent rotation / truncation and bundling; it'd just be nice to prove it. I think we can take the approach mentioned above, of copying and then inserting, even if we don't have that confirmation. We may still end up with a truncated log file if we copy it while it's being rotated, but the tar header metadata and the file should at least match in that case.

@jmpesp
Contributor Author

jmpesp commented Sep 8, 2023

Got it:

jwm@duster ~/zone_bundles/frostypaws/oxz_nexus_f1dbb3c1-4bdc-4ef3-bc0d-47deae0545b8 [] [] $ tar xzf fe4bd044-4291-4aa0-be60-80870e493985.tar.gz
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
- exit 2

jwm@duster ~/zone_bundles/frostypaws/oxz_nexus_f1dbb3c1-4bdc-4ef3-bc0d-47deae0545b8 [] [] $ ls -al 
total 17088
drwxrwxr-x  2 jwm jwm     4096 Sep  8 13:06 .
drwxrwxr-x 10 jwm jwm     4096 Sep  8 13:05 ..
-r--r--r--  1 jwm jwm   447741 Sep  8 13:04 fe4bd044-4291-4aa0-be60-80870e493985.tar.gz
-r--r--r--  1 jwm jwm      107 Sep  8 13:04 last
-r--r--r--  1 jwm jwm      205 Sep  8 13:04 metadata.toml
-r--r--r--  1 jwm jwm     3837 Sep  8 13:04 netstat
-rw-r--r--  1 jwm jwm 16935780 Sep  8 13:04 oxide-nexus:default.log
-r--r--r--  1 jwm jwm      187 Sep  8 13:04 pargs.5051
-r--r--r--  1 jwm jwm     5106 Sep  8 13:04 pfiles.5051
-r--r--r--  1 jwm jwm    59354 Sep  8 13:04 pstack.5051
-r--r--r--  1 jwm jwm      616 Sep  8 13:04 ptree
-r--r--r--  1 jwm jwm     3761 Sep  8 13:04 svcs
-r--r--r--  1 jwm jwm       87 Sep  8 13:04 uptime
-r--r--r--  1 jwm jwm       17 Sep  8 13:04 who

root@oxz_nexus_f1dbb3c1-4bdc-4ef3-bc0d-47deae0545b8:~# ls -al /var/svc/log/oxide-nexus:default.log*
-rw-r--r--   1 root     root     17061876 Sep  8 17:06 /var/svc/log/oxide-nexus:default.log
-rw-r--r--   1 root     root      528301 Sep  8 16:59 /var/svc/log/oxide-nexus:default.log.0

@jmpesp
Contributor Author

jmpesp commented Sep 8, 2023

Note that the time from root@oxz_nexus_f1dbb3c1-4bdc-4ef3-bc0d-47deae0545b8 is UTC.

@bnaecker
Collaborator

bnaecker commented Sep 8, 2023

After some more prompting from @rmustacc, I looked into the implementation of logadm to see if / when there is actual locking of the file. And sure enough:

https://github.com/illumos/illumos-gate/blob/f7ce6a749c79a107447356b130df647b6b60d449/usr/src/cmd/logadm/main.c#L1244-L1262

logadm first copies sections of the file starting from the beginning. When it nears the end (less than 10KiB remaining, I think), it sets the mandatory-locking mode bit and then lockf()s the remainder of the file. (The strategy of only locking the end is to avoid blocking verbose log writers for a long time by locking the entire file.) At that point, it copies the remainder of the file, and then truncates the original to zero length.

So we can cooperate with this from the zone bundler, though I'd need to think about exactly how. logadm calls lockf(fi, F_LOCK, 0), which has the effect of locking from the current file position to the end of the file, even if the end of the file changes. That means that the writer is blocked for as long as it takes logadm to copy the remainder. So, we could put a lock on that last section of the file when the bundler first opens it for copying. This does run the risk of blocking the writer for a long time, exactly what logadm is trying to avoid. However, it also means logadm can make some progress if it happens to run during this time, by copying the prefix of the file until it gets to the section where we hold the lock.

So a proposal for fixing this (a rough sketch follows the list):

  • open the file O_RDWR so we can set mode bits to make the lock mandatory
  • lock the last byte of the file (or page, doesn't really matter I think?). This will prevent the service from continuing to emit logs. logadm can rotate out this file concurrently until it hits the locked section. We know this section won't grow because the service can't write to it until we drop the lock.
  • copy the file into the tarball, or probably to temporary storage and then the tarball, so we don't have the overhead of gzipping the file while we hold this lock
  • unlock the file and reset the mode bits
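
A rough sketch of the locking step from that list, calling lockf through the libc crate. The helper name is made up, and setting the mode bits that make the lock mandatory (setgid set, group-execute clear) is elided:

use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom};
use std::os::unix::io::AsRawFd;

// Sketch only: lock from the last byte of the file through EOF, mirroring
// logadm's lockf(fd, F_LOCK, 0) on the file's tail. The caller would copy
// the file while holding the lock, then unlock and restore the mode bits.
fn lock_tail(path: &str) -> std::io::Result<std::fs::File> {
    let mut f = OpenOptions::new().read(true).write(true).open(path)?;
    let len = f.metadata()?.len();
    f.seek(SeekFrom::Start(len.saturating_sub(1)))?;
    if unsafe { libc::lockf(f.as_raw_fd(), libc::F_LOCK, 0) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(f)
}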

@bnaecker
Collaborator

bnaecker commented Oct 2, 2023

I did a bit more digging after we were hitting the same tar error pretty frequently last week. It's not specific to the truncation issue, and can happen basically anytime the log file is being written to while we're taking a bundle. The TL;DR is the TOCTOU in append_path_with_name() mentioned above, where the tar crate stats the file; creates a tar header with that size; and then later appends until EOF. If the file changes in the meantime, such as being written to by the actual daemon, we'll insert data past the end of the captured size.

Details

We noticed a lot of errors like this when pulling zone bundle files:

bnaecker@shale : ~/tmp $ tar -tf 2388914d-a679-4099-bf9c-5d80d40371f3.tar 1> /dev/null
tar: directory checksum error

That's an error coming from here in the tar implementation on Helios:

	if (chksum != checksum(&dblock)) {
		if (chksum != checksum_signed(&dblock)) {
			(void) fprintf(stderr, gettext(
			    "tar: directory checksum error\n"));

So the checksum stored in the header doesn't match the one we compute. Which file are we failing on? I ran this DTrace invocation, which dumps the start of the header block passed to checksum() (its first field is the entry's filename):

bnaecker@shale : ~/tmp $ pfexec dtrace -n 'pid$target::checksum:entry { dp = copyin(arg0, 100); tracemem(dp, 100);  }' -c '/usr/bin/tar tf 2388914d-a679-4099-bf9c-5d80d40371f3.tar 1> /dev/null'
...
  1  96743                   checksum:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 6f 78 69 64 65 2d 63 6c 69 63 6b 68 6f 75 73 65  oxide-clickhouse
        10: 3a 64 65 66 61 75 6c 74 2e 6c 6f 67 00 00 00 00  :default.log....
        20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        60: 00 00 00 00                                      ....

  1  96743                   checksum:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 20 75 73 65 72 20 27 64 65 66 61 75 6c 74 27 20   user 'default'
        10: 66 72 6f 6d 20 5b 66 64 30 30 3a 31 31 32 32 3a  from [fd00:1122:
        20: 33 33 34 34 3a 31 30 61 3a 3a 33 5d 3a 33 33 37  3344:10a::3]:337
        30: 38 37 0a 32 30 32 33 2e 30 39 2e 32 38 20 31 36  87.2023.09.28 16
        40: 3a 33 37 3a 33 30 2e 34 30 34 30 33 30 20 5b 20  :37:30.404030 [
        50: 33 20 5d 20 7b 7d 20 3c 44 65 62 75 67 3e 20 48  3 ] {} <Debug> H
        60: 54 54 50 2d                                      TTP-

The checksum() function is called with a pointer to a struct dblock, which really just starts with the filename itself as a char[100]. The first entry there shows an expected entry, the ClickHouse log file name. The last shows...junk. Specifically, we're somewhere in the actual ClickHouse logfile.

So looking at the tar header itself, using Python's tarfile module in the stdlib for simplicity, here's the basic information about that last valid entry:

In [102]: member
Out[102]: <TarInfo 'oxide-clickhouse:default.log' at 0x1316b1100>

In [103]: member.offset_data
Out[103]: 86410752

In [104]: member.size
Out[104]: 85882899

In [105]: last = member.offset_data + member.size

In [106]: buf[last - 100: last + 100]
Out[106]: b'ments_i64 (5fe342cc-6366-4e12-bec2-cf863eadb378): Removing part from filesystem all_983825_983825_0\n2023.09.28 16:37:30.403686 [ 3 ] {} <Trace> HTTPHandler-factory: HTTP Request for HTTPHandler-factor'

So the size field recorded in the header is bogus. About 46K bytes were added between when this file was stat(2)d and when the tar crate actually read to EOF:

In [110]: len(buf) - last
Out[110]: 46061

Looking at the tar header structure, the "filename" of the next entry is the garbage string "user 'default' from [fd00:1122:3344:10a::3]:33787\n2023.09.28 16:37:30.404030 [ 3 ] {} <Debug> HTTP-". Where does that come from? Tar specifies 512-byte blocks, so we can take the expected end of the data from the last-valid header and round up to the next multiple of 512, which is where the next header would start, if it exists:

In [144]: import math

In [146]: start_next = math.ceil(last / 512) * 512

In [147]: buf[start_next:][:100]
Out[147]: b" user 'default' from [fd00:1122:3344:10a::3]:33787\n2023.09.28 16:37:30.404030 [ 3 ] {} <Debug> HTTP-"

So tar is just reinterpreting the bytes of the logfile itself as a header, which obviously fails the checksum comparison.

To summarize, I think this issue is a bit broader than handling truncation of the log files correctly as logadm rotates them. That's part of it, but we also need to work around this TOCTOU in the tar crate, potentially by modifying the size recorded in the tar header to match what we actually inserted.
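
One hedged sketch of that last idea, using the tar crate's append_data so the header size is taken from the bytes we actually hand it; reading the whole log into memory here is purely for illustration:

use std::fs;
use tar::{Builder, Header};

// Sketch only: build the header from a buffer we already hold, so the size
// field and the appended data cannot disagree. A real implementation would
// more likely stream from a stable copy than read the log into memory.
fn append_fixed_size<W: std::io::Write>(
    builder: &mut Builder<W>,
    path: &str,
    name: &str,
) -> std::io::Result<()> {
    let buf = fs::read(path)?;
    let mut header = Header::new_gnu();
    header.set_size(buf.len() as u64);
    header.set_mode(0o444);
    header.set_cksum();
    builder.append_data(&mut header, name, buf.as_slice())
}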

@bnaecker
Collaborator

bnaecker commented Oct 3, 2023

I've been reading more of the logadm and log-archival code, and I think that any path which involves locking files at the right time is going to be tricky to prove correct. I talked with @davepacheco and @ahl a bit about this, and I think that taking a ZFS snapshot of the datasets with log-files, and copying them out of that snapshot, seems like a viable approach. This will guarantee that we're not fighting with logadm (other than whatever atomicity the snapshot requires). It'll also ensure that we have a consistent point-in-time view of all the logs. If we take the snapshot of the M.2 before the U.2, then we also know that we'll capture any log messages that exist at the time we take the snapshot.

One thing we'll want to ensure is that we don't keep these snapshots around forever. I think it's enough to delete them if they exist when the sled-agent starts up, and then when the zone-bundling is finished, either successfully or not. If we crash during that process, the snapshot will not persist indefinitely, since it'll be deleted when the sled agent restarts.
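
A minimal sketch of that flow, shelling out to zfs(8); the dataset, snapshot, and clone names here are invented for illustration:

use std::process::Command;

// Sketch only: run a zfs subcommand and turn a non-zero exit into an error.
fn zfs(args: &[&str]) -> std::io::Result<()> {
    let status = Command::new("/usr/sbin/zfs").args(args).status()?;
    if status.success() {
        Ok(())
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("zfs {:?} failed", args),
        ))
    }
}

// Snapshot the log dataset, clone it, copy files out of the clone's
// mountpoint, then destroy the clone and snapshot. The same cleanup would
// also run at sled-agent startup for crash-safety.
fn snapshot_and_clone(dataset: &str) -> std::io::Result<()> {
    let snap = format!("{dataset}@zone-bundle");
    let clone = format!("{dataset}_zone_bundle_clone");
    zfs(&["snapshot", snap.as_str()])?;
    zfs(&["clone", snap.as_str(), clone.as_str()])?;
    // ... copy log files from the clone's mountpoint into the bundle ...
    zfs(&["destroy", clone.as_str()])?;
    zfs(&["destroy", snap.as_str()])?;
    Ok(())
}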

I'll look more into this today, and hopefully make some progress.

bnaecker added a commit that referenced this issue Oct 6, 2023
- Fixes #4010
- Previously, we copied log files directly out of their original
  locations, which meant we contended with several other components:
  `logadm` rotating the log file; the log archiver moving them to
  longer-term storage; and the program writing to the file itself. This
  commit changes the operation of the bundler to first create a ZFS
  snapshot of the filesystem(s) containing the log files, clone them,
  and then copy files out of the clones. We destroy those clones /
  snapshots after completing, and when the sled-agent starts, to help
  with crash-safety.
bnaecker added a commit that referenced this issue Oct 7, 2023
bnaecker added a commit that referenced this issue Oct 16, 2023