/var/lib/swupd/staged is retaining too many files #378

Closed
ahkok opened this issue Feb 12, 2018 · 34 comments

@ahkok
Contributor

ahkok commented Feb 12, 2018

There is a reasonable amount of storage needed to keep updates efficiently, such that files that will be needed on the next release are kept until they are no longer needed. However, it appears that too many files are kept, resulting in wasted disk space.

I've poked at it, and since I have several systems where /var/lib/swupd/staged is on the order of 7-9 GB currently, I've got some data to discuss potential solutions.

why doesn't systemd-tmpfiles work?

Before I start with the solutions, I'd like to discuss the problems with systemd-tmpfiles, and why it doesn't work. Most of this is due to a combination of (1) our way of timestamping Python files, (2) tmpfiles' way of looking at atime/mtime, and (3) a small dose of us shipping files that are -r--r--r-- root root, which tmpfiles entirely ignores.

The result is that systemd-tmpfiles is not usable for any reasonable amount of pruning.

We have 2 straightforward solutions, however.

1 - prune what we know is old

Given the manifests that are on the system, we can discover all hashes for files that are relevant for the current state of the system. Since we can compare this to the files in /staged, we know which files are no longer related to any known manifest on the system. This gives us a list of files that "definitely" can be deleted without ever needing to be redownloaded.
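
A rough sketch of the idea in shell (assumptions: manifests live under /var/lib/swupd/$VERSION/, file entries are tab-separated with the content hash in the second field, and staged files are named by their hash):

awk -F'\t' 'NF >= 4 { print $2 }' /var/lib/swupd/*/Manifest.* | sort -u > /tmp/known-hashes
ls /var/lib/swupd/staged | sort > /tmp/staged-hashes
# remove every staged file whose hash no manifest on the system references
comm -23 /tmp/staged-hashes /tmp/known-hashes | sed 's|^|/var/lib/swupd/staged/|' | xargs -r rm -f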

In my local system, this would recover 3708879078 bytes from a total of 9354271423 (39%)

This is not a "cheap" operation. This solution would be costly for a shell script, requiring many seconds of processing, even up to a minute or so, to finish. If done inside swupd itself, it could probably be done much more efficiently.

2 - prune based on mtime

Since we mostly create files in staged and rarely touch them afterwards, we could decide to use mtime to distinguish files that are considered "aged" from files used recently.

The benefit of this method is that it's really quick and can be done in a shell script. The downside is that it will specifically delete lots of Python files that may still be relevant, causing fullfiles to be re-downloaded.
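
A minimal sketch of this approach (the three-day cutoff mirrors the conclusion below):

find /var/lib/swupd/staged -type f -mtime +3 -delete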

Here's what the data looks like:

mtime (days)   rm size (1)   rm size (2)
none           9354271423    3801303314
1              9294391574    3759230964
2              8704640059    3423811441
3              2880597007     795056079
4              2774649062     794839002
5              2685577506     794782008

As to why the gap exists between 2 and 3 days: this must be due to actual update content pushing out a significant rebuild in that window.

Conclusion

I'm tempted to swing a broad axe, rely on the availability of data over the network, and pick solution 2, since it would have the most effect and downloads are reasonably cheap. However, that method would cause a significant spike in redownloads from the CDN and could increase traffic significantly. Given the size of the spool in use, I think -mtime +3 is reasonable, but we may want -mtime +2 in the long run.

@bryteise
Member

I vaguely remember talking with somebody about just culling all files that are not manifests in /var/lib/swupd as my heavy-axe approach. In general the file cache doesn't seem to save time outside of bundle-add X -> bundle-remove X -> bundle-add X. I guess there are some cases where files are outside of a pack but haven't been altered and so wouldn't need to be downloaded again, but I don't know if that case is particularly impactful.

@matthewrsj
Contributor

@bryteise you were probably talking to me. That is my preferred approach.

@ahkok
Contributor Author

ahkok commented Feb 12, 2018

So how about just culling -mtime +1 in a post-update in some way?

@ahkok
Contributor Author

ahkok commented Feb 12, 2018

We could just do this with

ExecStartPost=-/usr/bin/find /var/lib/swupd/staged -type f -mtime +1 -exec rm -f {} \;

in swupd-update.service.

@matthewrsj
Contributor

What about just swupd_rm everything in /var/lib/swupd that isn't in a version folder? This would keep the manifests around and solve a lot of the issues we've seen (and been fixing) with swupd mishandling the cached files there. Really simplifies things a lot.

@matthewrsj
Contributor

@phmccarty @tmarcu

@ahkok
Contributor Author

ahkok commented Feb 12, 2018

Deleting everything would likely destroy any valuable files that we may want to inspect for debugging purposes. Hence keeping them for 24h or so doesn't seem like a bad idea.

@matthewrsj
Contributor

But I'm also thinking about the benefit to swupd of having the staged directory clear whenever we do our operations. We've had to do a few fixes in the last few months because that hasn't always been handled correctly, and I can't be 100% certain it is being handled correctly currently.

@bryteise
Member

Hrm, having an ability to disable cleaning the state dir with an option seems fine to me for the usual developer case. Other than that I'm unsure.

@icarus-sparry

I agree with @ahkok - better to leave clues around for debugging.

If we want the staged directory empty because we don't have confidence in our code, then perhaps moving the staged directory to a datestamped version, using the hypothetical option @bryteise suggests, would work.

@phmccarty
Contributor

I would be okay with having an option to clean the state directory, but by default I think swupd-client should correctly manage its own state directory, and if there are bugs in that management, we should fix those bugs.

@matthewrsj
Contributor

@phmccarty that isn't the only reason to wipe it. We don't get much benefit from it being there and it takes up space on the system.

@ahkok
Contributor Author

ahkok commented Feb 12, 2018

#define correct management

Given the output of method 1 above in my text, I am dubious that "correct" management will do any significant cleaning, and will simply retain way too many files.

@ahkok
Contributor Author

ahkok commented Feb 12, 2018

BTW, with the modified unit proposed above, all you'd need to do to debug issues and keep all files around is disable auto-update. That seems entirely reasonable.

@icarus-sparry

One thing that we have talked about briefly for swupd is related to A/B updates. There we can regenerate the staged directory from the existing A, to reduce download bandwidth. Strictly speaking we can do this even for the non A/B case. The Manifest gives us the required hash and the filename of the thing that is supposed to have that hash value, so we can copy that file to the staged directory and then validate it (order is important). If it fails to validate, we can delete it and fall back to downloading the file.
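
A hedged sketch of that copy-then-validate flow (swupd's real hash covers more than the file contents, so sha256sum below is only a stand-in, and the paths and variables are illustrative):

want_hash=$1    # hash recorded in the Manifest
src=$2          # path of the candidate file on the existing A side
staged="/var/lib/swupd/staged/$want_hash"
cp "$src" "$staged"
# validate after copying (order is important); on mismatch, fall back to downloading
if [ "$(sha256sum "$staged" | cut -d' ' -f1)" != "$want_hash" ]; then
    rm -f "$staged"
fi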

@cmarcelo
Contributor

We could just do this with

ExecStartPost=-/usr/bin/find /var/lib/swupd/staged -type f -mtime +1 -exec rm -f {} \;

in swupd-update.service.

swupd has a lock per state directory for certain operations, so this one might conflict with a manual execution of swupd by the user. What about adding a subcommand to swupd that achieves the same effect but holds the lock, and calling it from the service file?
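
For illustration, the service file could then invoke the subcommand instead of the raw find call (using the name the next comment ends up introducing):

ExecStartPost=-/usr/bin/swupd clean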

@cmarcelo
Contributor

I gave the subcommand approach a shot to see how it looks. The result is in PR #383. It adds a "swupd clean" that drops old files, but keeps enough manifest files to not harm the search functionality. So it is a hybrid between only caring about dates and being very precise about what we want to keep. For ignoring versions and ages, use "swupd clean --all".
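
Usage would look like:

swupd clean          # drop old files, but keep manifests needed for search
swupd clean --all    # ignore versions and ages, remove everything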

@matthewrsj
Contributor

@ahkok PR #383 has now been merged, so we could talk about whether we want this to be invoked periodically behind the scenes, or after an update.

@otaviobp
Contributor

otaviobp commented Jun 6, 2018

Why was mtime used to decide if a file is old? Mtime is the time the file was modified, i.e., the time it was generated on the server, not the time it was extracted on the system. So if we do a bundle-add && clean && bundle-remove && bundle-add, it's very likely we are going to download everything again. Is there any use case where using mtime to decide if the file is old is better than atime or ctime?
If we want to keep files that were recently downloaded, we should use ctime, but I think that atime makes more sense, because if a file was used recently, it's more likely that it's going to be used again (at least for the cases where I think a cache would be useful).

Another question: if the size of the cache is a problem, why don't we keep tarballs instead of staged files?

Btw, I'm missing an option to clean everything except Manifests. That's what I would use after each update on my system.

@cmarcelo
Contributor

cmarcelo commented Jun 6, 2018

Why was mtime used to decide if a file is old?

See "Conclusion" right in the first comment for the general idea. I'm not familiar with how reliable atime/ctime are, but would make sense to try them out for the staged files and compare (either tweaking clean or performing tests with find). It should be easy to swap which one is used in swupd clean.

Another question: if the size of the cache is a problem, why don't we keep tarballs instead of staged files?

I can say why I haven't done it: that is a larger change. The main issue we had was that nothing was being cleaned up at all, so even the "hammer" solution of mtime would already improve things a lot with a low investment.

Note: I'm not sure if swupd clean is being automatically called already, but that was the plan. @matthewrsj have we enabled something like that?

Btw, I'm missing an option to clean everything except Manifests. That's what I would use after each update on my system.

I'd try as much as possible to avoid lots of options on this command. I've given it a default behavior (that makes sense to call periodically) and a "clean everything" behavior. Maybe we should just tweak the default behavior.

If the reason to keep only the manifests is making swupd search happy, I think there are plans to make it use another data source, so swupd clean --all would be what you want.

@matthewrsj
Contributor

Note: I'm not sure if swupd clean is being automatically called already, but that was the plan. @matthewrsj have we enabled something like that?

We have not, and that is the only reason I've been leaving this option open. I would like to run swupd clean (without --all) after every update. Even if mtime is too aggressive, it will still keep the manifests around, right?

@cmarcelo
Contributor

cmarcelo commented Jun 7, 2018

swupd clean will remove all manifests that are not mentioned by the current version running.

The way it's done, it might keep some extra manifests, but never fewer. So it doesn't interfere with swupd search in negative ways.

@matthewrsj
Contributor

swupd clean will remove all manifests that are not mentioned by the current version running.

This is desired behavior after an update. We don't need the old manifests around.

@otaviobp
Contributor

otaviobp commented Jun 7, 2018

See "Conclusion" right in the first comment for the general idea. I'm not familiar with how reliable atime/ctime are, but would make sense to try them out for the staged files and compare (either tweaking clean or performing tests with find). It should be easy to swap which one is used in swupd clean.

As far as I understand, the idea is to clean files older than x days, and I'm OK with that. The problem is the definition of "old".
If we use mtime, we are going to remove all files that were built more than x days ago.
If we use atime, we are going to remove all files that weren't accessed on the system for more than x days.
If we use ctime, we are going to remove all files that were extracted more than x days ago.
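
For comparison, the three cutoffs expressed with find (the three-day cutoff is illustrative):

find /var/lib/swupd/staged -type f -mtime +3    # modified (built) more than 3 days ago
find /var/lib/swupd/staged -type f -atime +3    # not accessed for more than 3 days
find /var/lib/swupd/staged -type f -ctime +3    # inode changed (extracted) more than 3 days ago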

I commented in this issue because I'm running some tests and I realized that swupd clean was cleaning more files than I thought it would clean. For example:

# ls /var/lib/swupd/staged/|wc -l
0
# swupd bundle-add shells
Downloading packs...

Extracting shells pack for version 22690
...100%
Installing bundle(s) files...
	...100%
Calling post-update helper scripts.
Bundle(s) installation done.
# ls /var/lib/swupd/staged/|wc -l
2784
# swupd clean 
2784 files removed.
# ls /var/lib/swupd/staged/|wc -l
0

Is this expected?

@cmarcelo
Contributor

cmarcelo commented Jun 7, 2018

Is this expected?

No. I wasn't explicit in my first reply; I agree this should be fixed.

If we use atime, we are going to remove all files that weren't accessed on the system for more than x days.
If we use ctime, we are going to remove all files that were extracted more than x days ago.

I'd prefer going with ctime here. In many cases we are using that file but we don't access it (because the update says it did not change, so we have no reason to touch it if the hash matches).

@icarus-sparry

Just to add to the fun, we play games with the times of files in the build process. The "export SOURCE_DATE_EPOCH=12345" stuff in the spec files drives this. This is necessary (but not sufficient) to get reproducible builds, as the rpm files (which have cpio archives inside them) contain the dates of artifacts.

You don't want to see the games played with '*.pyc' files (hinted at by "delete lots of Python files" above).

So this pretty much rules out mtime.

So we have ctime, which is pretty much the last time the file was written to (yes, chmod will change ctime), or atime, which should be the last time the file was opened but often isn't, because filesystems are mounted with the relatime option rather than strictatime. Fortunately, relatime these days does update atime at least once a day.

@ahkok
Contributor Author

ahkok commented Jun 7, 2018

I've entirely reversed my earlier assessment. Given the lack of data, I think we should drastically change our strategy:

  • stop using tmpfiles, entirely
  • after swupd update, delete all

That's it. We can try to be a little smart and retain just the latest manifests, but everything else (staged, $VER folders) should just go.

Right now we're hurting users badly with 3-5 GB of wasted storage that is only going to be used in the worst case, and even then we know verify is going to be slow anyway.

@mbelluzzo

mbelluzzo commented Jun 7, 2018

Except for:

stop using tmpfiles, entirely

and only because I don't know how it's being used and why. Otherwise, I'm with ahkok on that.

@otaviobp
Contributor

otaviobp commented Jun 7, 2018

after swupd update, delete all

+1

@ahkok
Contributor Author

ahkok commented Jun 7, 2018

And only because I don't know how it's being used and why

We have a tmpfiles.d entry for swupd:

d /var/lib/swupd 0700 root root 4d -

However, due to the above issues, this is entirely without effect.
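
For reference, the fields in that line follow tmpfiles.d(5):

# type  path            mode  user  group  age  argument
d       /var/lib/swupd  0700  root  root   4d   -

The d entry creates the directory if missing, and the 4d age asks systemd-tmpfiles to clean out contents older than four days based on their file timestamps, which is exactly the mechanism shown above to be unreliable for our staged files.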

@matthewrsj
Contributor

after swupd update, delete all

Eh, having the manifests for the current version stay around is a big win, especially for the swupd search and bundle-add/remove code. I'd be happy with a clean option that deleted all content but left the manifests referenced in the current MoM behind.

@cmarcelo
Contributor

cmarcelo commented Jun 8, 2018

@matthewrsj after #476 swupd clean will do exactly what you are suggesting (and what Auke suggested).

  • swupd clean deletes everything except the relevant manifests (and any other metadata we think is worth keeping in the future)
  • swupd clean --all deletes everything

@matthewrsj
Contributor

@cmarcelo OK, that's fine. I'm just responding that we don't want to run swupd clean --all after an update and delete everything, which is what I understood from "after swupd update, delete all".

@cmarcelo
Contributor

cmarcelo commented Jun 8, 2018

@matthewrsj Got it!
