flux-shutdown: add --gc garbage collection option #4303

Merged
mergify merged 14 commits into flux-framework:master from the content_truncate branch
May 2, 2022

Conversation

garlick
Member

@garlick garlick commented Apr 26, 2022

This adds some logic to rc1 and rc3 to enable offline garbage collection of a system instance using flux-dump(1) and flux-restore(1). If requested by flux shutdown --gc, a dump is produced in ${statedir} during shutdown with a RESTORE symlink pointed to it. On startup, if RESTORE exists, the current backing store is truncated and content is restored from the archive file. Then the RESTORE symlink is removed.

For example on my test system:

$ sudo flux shutdown --gc
flux-shutdown: shutdown will dump KVS (this may take some time)
broker.info[0]: cleanup.0: flux queue stop --quiet Exited (rc=0) 0.0s
broker.info[0]: cleanup.1: flux job cancelall --user=all --quiet -f --states RUN Exited (rc=0) 0.0s
broker.info[0]: cleanup.2: flux queue idle --quiet Exited (rc=0) 0.0s
broker.info[0]: cleanup-success: cleanup->shutdown 0.104717s
broker.info[0]: children-none: shutdown->finalize 0.179421ms
broker.info[0]: rc3.0: dumping content to /var/lib/flux/dump-20220425_174130.tgz
broker.info[0]: rc3.0: /usr/local/etc/flux/rc3 Exited (rc=0) 0.6s
broker.info[0]: rc3-success: finalize->goodbye 0.578407s
$ sudo ls -l /var/lib/flux
total 39684
-rw-r--r-- 1 flux flux  1073152 Apr 25 17:41 content.sqlite
-rw-r--r-- 1 flux flux    48183 Apr 25 17:41 dump-20220425_174130.tgz
-rw-r--r-- 1 flux flux 39510016 Apr 24 18:09 job-archive.sqlite
lrwxrwxrwx 1 flux flux       24 Apr 25 17:41 RESTORE -> dump-20220425_174130.tgz
$ sudo systemctl start flux
$ sudo ls -l /var/lib/flux
total 40896
-rw-r--r-- 1 flux flux     4096 Apr 25 17:42 content.sqlite
-rw-r--r-- 1 flux flux  2307232 Apr 25 17:42 content.sqlite-wal
-rw-r--r-- 1 flux flux    48183 Apr 25 17:41 dump-20220425_174130.tgz
-rw-r--r-- 1 flux flux 39510016 Apr 24 18:09 job-archive.sqlite

It's also possible to checkpoint/restart an instance with:

$ flux shutdown --dump=foo.tar.bz2
$ flux start -o,-Scontent.restore=foo.tar.bz2

although at the moment this is only practical if R hasn't changed.

Marking a WIP as I wanted to get feedback on the approach before writing tests.

I had thought maybe garbage collection could be automated somehow by tracking some metric that could be used as an indicator of the need. However, maybe getting a bit of experience with doing it manually first makes sense.

@garlick garlick force-pushed the content_truncate branch 2 times, most recently from 0e97b7c to bebe13d Compare April 26, 2022 21:31
@garlick
Member Author

garlick commented Apr 26, 2022

I went ahead and added test coverage here and have been testing this on my home test system, so removing the WIP. I'm still open to doing this another way if people have better ideas!

@garlick garlick changed the title from "WIP: flux-shutdown: add --gc garbage collection option" to "flux-shutdown: add --gc garbage collection option" Apr 26, 2022
@grondo
Contributor

grondo commented Apr 27, 2022

Sorry, I haven't had time to take a peek at this. I can't think of any better interface, given that garbage collection must occur as part of a dump/restore. I can't remember, does GC happen just as a natural result of the restore?

As a more general point (and I guess unrelated to this PR), I'm a little worried there will be confusion when to use flux shutdown vs systemd for stopping a Flux system instance. I'm not sure there are many systemd services that are stopped via a different command.

@garlick
Member Author

garlick commented Apr 27, 2022

Yes, a dump/restore walks the KVS metadata starting with the last root hash, so when the dump is restored, all the unreferenced data is left behind. In addition, the archive only contains "files", unlike a file system archive created with tar (for example). So empty job directories are removed also.
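As a rough illustration of why a dump/restore cycle collects garbage, the sketch below models a content store as a scratch directory with one file per blob, where each file lists the blob ids it references. Copying only what is reachable from the newest root leaves the unreferenced blobs behind. Everything here (the `restore_reachable` function, the blob names, the directory layout) is invented for illustration and is not how flux-dump(1) or the KVS are implemented.

```shell
# Toy content store: one file per blob, holding the ids it references.
# All names are hypothetical; this is not the Flux on-disk format.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'dirA fileB\n' > "$src/root2"    # newest root
printf 'fileC\n'      > "$src/dirA"
: > "$src/fileB"
: > "$src/fileC"
printf 'oldfile\n'    > "$src/root1"    # superseded root, no longer referenced
: > "$src/oldfile"                      # garbage: only reachable from root1

# Copy blob $1 and everything it references from $src into $dst.
restore_reachable() {
    [ -e "$dst/$1" ] && return 0        # already copied
    cp "$src/$1" "$dst/$1"
    for ref in $(cat "$dst/$1"); do
        restore_reachable "$ref"
    done
}

restore_reachable root2
echo "source store:";   ls "$src"
echo "restored store:"; ls "$dst"
```

In this toy model `root1` and `oldfile` never reach the restored store, which mirrors the "left behind" behavior described above.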

One can still run systemctl stop flux on the rank 0 node to shut down the instance. And one could trigger a dump/restore as part of that by manually setting the content.dump attribute to auto. However I didn't want to encourage that because then the dump would be subject to the systemd TimeoutStopSec and would risk getting killed before the dump is complete. Hence adding the option to shutdown so it's tied to that way of bringing down the instance. I anticipate that we will add other things to flux-shutdown(1) like options to stop the queue and let running jobs complete, or to shut down in the future. So maybe it will become a natural way to stop flux.

One other weakness I see here is the dump files aren't removed and will pile up after a while. I was vaguely thinking that this could be valuable if we wanted to revert to a previous checkpoint if the db was corrupted or whatever, and that sys admins could manage the dump files with log rotation tools. Is that reasonable?

I guess the other question - is --gc an annoyingly terse option? It's just a shorthand for --dump=auto. Should we just go with that? The purpose of --gc was to provide an option that matched the desired end effect (garbage collecting the KVS). I'm open to better names.

@grondo
Contributor

grondo commented Apr 27, 2022

Ah, thanks for that refresher, that was helpful. Given the above, this approach seems just fine IMO. I like the idea of extending shutdown semantics in the future. Also, if dumpfiles tend to accumulate, maybe logrotate or systemd-tmpfiles could be configured to automatically clean things up? (Edit: I see now you already mentioned this approach in your previous post. It does seem reasonable to me!)

@grondo
Contributor

grondo commented Apr 27, 2022

And FWIW, I don't have a problem with --gc. It is close enough to git gc where I understand what it is meant to do.

@garlick garlick force-pushed the content_truncate branch 4 times, most recently from 0d2f2d7 to 4384fd9 Compare May 1, 2022 22:44
@garlick
Member Author

garlick commented May 1, 2022

Pushed a tmpfiles.d config file and moved "auto" dumps to $statedir/dump since that made the tmpfiles rule easier to write.

Tested on my test system instance by running flux shutdown --gc a few times and reducing the age setting for dump files, then observing that they were purged with systemd-tmpfiles --clean.
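The age-based cleanup can be approximated without systemd for experimentation. The sketch below is not the tmpfiles.d rule from this PR: it is a hypothetical `purge_old_dumps` helper that uses find(1) on modification time, and it assumes GNU find and GNU touch plus a `dump-*.tgz` naming pattern. systemd-tmpfiles applies its own age rules, so treat this only as a way to see the effect.

```shell
# purge_old_dumps DIR DAYS: delete dump archives in DIR older than DAYS days.
# Hypothetical helper; the PR itself installs a tmpfiles.d rule instead.
purge_old_dumps() {
    find "$1" -maxdepth 1 -type f -name 'dump-*.tgz' -mtime +"$2" \
        -exec rm -f {} +
}

# Demo in a scratch directory with one fresh and one backdated file.
dumpdir=$(mktemp -d)
touch "$dumpdir/dump-new.tgz"
touch -d '40 days ago' "$dumpdir/dump-old.tgz"   # GNU touch syntax
purge_old_dumps "$dumpdir" 30
ls "$dumpdir"
```

Only the backdated file should be removed, since `-mtime +30` matches files last modified more than 30 whole days ago.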

@garlick garlick force-pushed the content_truncate branch from 4384fd9 to 2011670 Compare May 2, 2022 15:10
@codecov

codecov bot commented May 2, 2022

Codecov Report

Merging #4303 (4384fd9) into master (d53b662) will increase coverage by 0.01%.
The diff coverage is 85.71%.

❗ Current head 4384fd9 differs from pull request most recent head 2011670. Consider uploading reports for the commit 2011670 to get more accurate results

@@            Coverage Diff             @@
##           master    #4303      +/-   ##
==========================================
+ Coverage   83.62%   83.64%   +0.01%     
==========================================
  Files         389      389              
  Lines       65388    65421      +33     
==========================================
+ Hits        54680    54720      +40     
+ Misses      10708    10701       -7     
Impacted Files Coverage Δ
src/cmd/builtin/shutdown.c 87.27% <80.00%> (-0.73%) ⬇️
src/modules/content-files/content-files.c 78.91% <81.48%> (+2.69%) ⬆️
src/modules/content-sqlite/content-sqlite.c 63.00% <100.00%> (+0.44%) ⬆️
src/modules/job-archive/job-archive.c 62.13% <0.00%> (-0.74%) ⬇️
src/shell/pmi/pmi.c 82.29% <0.00%> (-0.66%) ⬇️
src/common/libpmi/simple_server.c 86.63% <0.00%> (-0.50%) ⬇️
src/cmd/flux-module.c 83.96% <0.00%> (-0.30%) ⬇️
src/cmd/flux-job.c 87.27% <0.00%> (-0.14%) ⬇️
src/broker/overlay.c 86.69% <0.00%> (-0.11%) ⬇️
src/common/libsdprocess/sdprocess.c 69.25% <0.00%> (+0.12%) ⬆️
... and 9 more

@garlick
Member Author

garlick commented May 2, 2022

Repushed with reference to #258

Member

@chu11 chu11 left a comment

overall lgtm, just a few comments / nits I found

@@ -807,7 +807,7 @@ static int process_args (struct content_sqlite *ctx,
             *truncate = true;
         }
         else {
-            flux_log_error (ctx->h, "Unknown module option: '%s'", argv[i]);
+            flux_log (ctx->h, LOG_ERR, "Unknown module option: '%s'", argv[i]);
Member

perhaps we should make a similar stylistic change in content-files? (content-files sets errno = EINVAL, so it is not as bad)

Member

oh and I guess w/ content-s3 too (given follow up commit to this one)

Comment on lines +76 to +79
test_expect_success 'content-files module load fails with unknown option' '
	test_must_fail flux module load content-files notoption
'

Member

nit, should this test be a different commit? It is not really related to "content-files: add truncate module option".

exit_rc=1
fi
fi
flux module remove ${backingmod} || exit_rc=1
Member

should use modrm?

Member Author

modrm tests $RANK before running flux module remove, but this is within a block that is already conditional on $RANK. I thought it kind of weird to trade a straightforward one-liner for a function call to do same when it was necessary to repeat the rank constraint. Does that make sense?
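To make the trade-off concrete, here is a minimal sketch of what a rank-guarded helper of this kind might look like, next to the direct call. The `modrm` body below is a guess at the general shape, not a copy of the helper in the rc scripts, and `flux` is replaced by a stand-in function so the snippet runs without Flux installed.

```shell
# Stand-in for the flux CLI so the sketch runs anywhere (assumption).
flux() { echo "flux $*"; }
RANK=0

# Hypothetical shape of a rank-guarded helper like the one discussed.
modrm() {
    # $1: rank the module runs on, or "all"; $2: module name
    if [ "$1" = "all" ] || [ "$1" -eq "$RANK" ]; then
        flux module remove "$2"
    fi
}

backingmod=content-sqlite
if [ "$RANK" -eq 0 ]; then
    # Inside a block already guarded on rank 0, the helper repeats the
    # rank test, so the direct call is the shorter spelling.
    modrm 0 "$backingmod"
    flux module remove "$backingmod"
fi
```

Both calls inside the guarded block end up running the same command, which is the point being made about the one-liner.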

Member

ahh that makes sense, you have the rank == 0 check above this.

fi
fi
if test -n "${dumpfile}"; then
flux module load ${backingmod} truncate
Member

use modload for consistency?

Member Author

same deal here

rm -f ${dumplink}
fi
else
flux module load ${backingmod}
Member

use modload for consistency?

Member Author

ibid

Comment on lines +36 to +38
if test -n "${dumplink}"; then
    rm -f ${dumplink}
fi
Member

should only remove the link if the restore is successful?

Member Author

Yes but the rc1 shebang is #!/bin/bash -e so the script aborts before the remove if the restore is unsuccessful.
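The effect of the `-e` flag can be seen in isolation with a small experiment. The sketch below runs a child bash with `-e`, uses `false` as a stand-in for a failed restore, and uses a scratch file named RESTORE in place of the real symlink. None of this is taken from rc1; it only demonstrates the shell behavior being relied on.

```shell
# Run a child bash with -e, mirroring a '#!/bin/bash -e' shebang.
# 'false' stands in for a failed restore; RESTORE is a scratch file here.
workdir=$(mktemp -d)
touch "$workdir/RESTORE"
bash -e -c '
    false                 # simulated restore failure aborts the script here
    rm -f "$1/RESTORE"    # never reached
' demo "$workdir" && rc=0 || rc=$?
echo "child exit status: $rc"
ls "$workdir"
```

The child exits nonzero at the failing command and the RESTORE file is still present afterwards, so a later start would try the restore again.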

garlick added 12 commits May 2, 2022 15:43

Problem: rc scripts use the content backing store 'truncate' option
to manage offline garbage collection, but content-sqlite does not
support this option.

Add a truncate module option that unlinks the database file before
the database is opened, thereby emptying it.

Add test.

Problem: if an unknown module option is supplied flux_log_error()
is called without errno set.

Log the error with flux_log (LOG_ERR) instead.

Problem: there is no option to query the number of objects held
by content-files in test.

Override the stats.get built-in RPC handler with one that provides
the object count.  So like content-sqlite,
  flux module stats content-files

returns the count.

Problem: rc scripts use the content backing store 'truncate' option
to manage offline garbage collection, but content-files does not
support this option.

Add a truncate module option that recursively removes the db dir
before the database is opened.

Add test.

Problem: rc scripts use the content backing store 'truncate' option
to manage offline garbage collection, but content-s3 does not
support this option.

Add a truncate module option.  Since emptying an s3 bucket is not
directly supported by libs3, this is a rather involved process.
For now, if this option is supplied, log an error instructing the
user to purge the bucket using s3 console or another mechanism and
return failure.

Add test.

Problem: we need a way to tell rc scripts to restore content
on startup, and dump content on shutdown, for offline KVS garbage
collection of a system instance or user checkpoint/restart.

Add some logic to rc1 and rc3:

rc1:  If the content.restore broker attribute is set to a file path,
then load the content backing store module with the 'truncate' option,
and restore content from the file before loading the KVS.

rc3:  If the content.dump broker attribute is set to a file path,
then dump content to the file after unloading the KVS.

Additionally, if content.restore=auto, then rc1 looks for a symlink
named RESTORE in the broker's current working directory or ${statedir}
if defined.  If the symlink exists, then restore content from the file
it points to and remove the symlink on success.

If content.dump=auto, then rc3 dumps content to an automatically generated
file name containing the date in the current working directory or
${statedir} if defined, and creates the RESTORE symlink pointing to it.
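The "auto" behavior described in this commit message can be sketched roughly as follows. This is not the rc1/rc3 code from the PR: `fake_dump`, `auto_dump`, and `auto_restore` are hypothetical names, a scratch directory stands in for ${statedir}, and the real scripts call flux-dump(1) and flux-restore(1) and load the backing module with the truncate option. The file name pattern follows the dump name shown in the example output earlier in this conversation.

```shell
# Stand-in for 'flux dump' so the sketch runs without Flux (assumption).
fake_dump() { : > "$1"; }

# rc3-like step: write a dated dump and point RESTORE at it.
auto_dump() {
    # $1: directory standing in for ${statedir} (or the cwd if unset)
    dumpfile="dump-$(date +%Y%m%d_%H%M%S).tgz"
    fake_dump "$1/$dumpfile"
    ln -sf "$dumpfile" "$1/RESTORE"
}

# rc1-like step: if RESTORE exists, report its target and remove the link.
auto_restore() {
    [ -L "$1/RESTORE" ] || return 0
    target=$(readlink "$1/RESTORE")
    echo "would truncate the backing store and restore from $1/$target"
    rm -f "$1/RESTORE"
}

statedir=$(mktemp -d)
auto_dump "$statedir"
auto_restore "$statedir"
ls "$statedir"
```

After both steps the dump file remains and the RESTORE link is gone, so a second start would not repeat the restore.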
Problem: content.restore is not set for the system instance, so
automatic restore from a dump for garbage collection purposes
cannot be automated.

Set content.restore=auto, so if the ${statedir}/RESTORE symlink
exists, content will be truncated and then restored from a
previously created archive.

Problem: a system instance that runs flux-dump(1) from rc3
might get killed by systemd TimeoutStopSec.

Have flux-shutdown(1) arrange for the dump.  If the instance is
being shut down by this method, then systemctl stop is not being run,
so TimeoutStopSec does not apply.

Fixes flux-framework#258

Problem: system tests do not set statedir like systemd unit file.

Set statedir to a subdirectory under $workdir.

Problem: there is no test coverage for offline KVS garbage
collection.

Add a sharness script that exercises this functionality.
Augment the shutdown-cmd sharness script to cover new shutdown options.

Problem: dump files created for garbage collection may accumulate
in $statedir of a system instance.

Install a tmpfiles.d config file that removes dumps older than 30 days.

Problem: content-files logs "<option>: Invalid argument" on
an invalid module option, rather than mentioning "module option"
in the error, which would be more helpful.

Fix log message.

garlick added 2 commits May 2, 2022 15:43

Problem: content-s3 logs "<option>" on an invalid module option,
which is a bit vague.

Change error message to be more descriptive.

Problem: the sharness test for content-files does not cover
a bad module option.

Add test.

@garlick garlick force-pushed the content_truncate branch from 2011670 to 2ae2799 Compare May 2, 2022 22:43
@garlick
Member Author

garlick commented May 2, 2022

Just pushed fixes based on @chu11's comments, and also rebased on current master.

Member

@chu11 chu11 left a comment

LGTM!

@garlick
Member Author

garlick commented May 2, 2022

Thanks! I'll set MWP.

@mergify mergify bot merged commit d9c64e7 into flux-framework:master May 2, 2022
@garlick garlick deleted the content_truncate branch May 3, 2022 13:18
3 participants