broker: fix content-cache flush list corruption #4484

chu11 · 2022-08-10T20:07:11Z

Problem: A dirty cache entry has the potential to be added onto the flush list twice. This double addition can lead to list corruption.
The observed side effect was a list that was shortened and no longer accurate with respect to the acct_dirty counter. This could lead to hangs with content flush, missed flushes to the backing store, and segfault/memory corruption in the worst case.

Solution: Check if the cache entry is already on the flush list before adding it.

Fixes #4482

chu11 · 2022-08-10T20:15:40Z

nit for debate: is "corruption" the right word to use here for comments / commit messages / descriptions? Let's say we have a list like:

a -> b -> c -> d -> e -> NULL

Let's say c gets double appended. I think the result is

a -> b -> c -> NULL

b/c the code would set e->next = c, c->prev = e, c->next = NULL, but b/c b's next pointer still points to c, we get a shortened list as a result. We've lost the pointers to d and e.

It's not really "corrupted" in the usual sense of the word, but I couldn't think of a better one. Maybe just "damaged"?
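
To make the shortening concrete, here is a minimal sketch in C. It assumes a simplified NULL-terminated doubly linked list with explicit head and tail pointers, matching the picture above rather than the circular ccan list the broker actually uses. `struct node`, `struct list`, `append()`, `forward_len()` and `double_append_demo()` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical simplified list, not the ccan implementation. */
struct node {
    const char *name;
    struct node *prev;
    struct node *next;
};

struct list {
    struct node *head;
    struct node *tail;
};

/* Naive tail append: never checks whether n is already on the list. */
static void append(struct list *l, struct node *n)
{
    n->prev = l->tail;
    n->next = NULL;
    if (l->tail)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
}

/* Count the nodes reachable by walking forward from the head. */
static int forward_len(const struct list *l)
{
    int count = 0;
    for (const struct node *n = l->head; n != NULL; n = n->next)
        count++;
    return count;
}

/* Build a -> b -> c -> d -> e, append c a second time, and return
 * the forward length afterwards. */
static int double_append_demo(void)
{
    struct node nodes[5] = {
        { "a", NULL, NULL }, { "b", NULL, NULL }, { "c", NULL, NULL },
        { "d", NULL, NULL }, { "e", NULL, NULL },
    };
    struct list l = { NULL, NULL };

    for (int i = 0; i < 5; i++)
        append(&l, &nodes[i]);

    append(&l, &nodes[2]);   /* c is already on the list */

    return forward_len(&l);
}
```

Under these assumptions `double_append_demo()` returns 3 rather than 5: b still points at c, and c's next pointer was reset to NULL, so d and e can no longer be reached from the head.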

chu11 · 2022-08-10T20:50:42Z

hmmm, one builder failed my regression test b/c of:

  expecting success: run_timeout 120 /usr/src/flux-core-0.42.0-46-g410dedf74/t/issues/t4482-flush-list-corruption.sh
  2022-08-10T20:30:35.659903Z broker.err[0]: rc1.0: /bin/bash: /usr/src/flux-core-0.42.0-46-g410dedf74/_build/sub/t/rc/rc1-issue4482: No such file or directory
  2022-08-10T20:30:35.660029Z broker.err[0]: rc1.0: /usr/src/flux-core-0.42.0-46-g410dedf74/_build/sub/t/rc/rc1-issue4482 Exited (rc=127) 0.0s
  2022-08-10T20:30:35.662138Z broker.err[0]: rc3.0: /bin/bash: /usr/src/flux-core-0.42.0-46-g410dedf74/_build/sub/t/rc/rc3-issue4482: No such file or directory
  2022-08-10T20:30:35.662266Z broker.err[0]: rc3.0: /usr/src/flux-core-0.42.0-46-g410dedf74/_build/sub/t/rc/rc3-issue4482 Exited (rc=127) 0.0s

not sure why every other builder works. Lemme try using FLUX_SOURCE_DIR instead of SHARNESS_TEST_DIRECTORY.

garlick · 2022-08-10T20:54:38Z

Well first, excellent job tracking this down, and it seems like the effect is actually pretty insidious.

Here's a thought. In at least one other place in content-cache.c, I see we call

        list_del_from (&cache->lru, &e->list);
        list_add (&cache->lru, &e->list);

would doing that be sufficient rather than creating a new list_ function?

If we need to add a new function, we may want to add it directly to libccan and submit the change upstream. Or if it's really only useful to us, then possibly add it with a "namespace" other than "list_" so it is evident to casual perusers of our code that it's not part of the original class. But if we can do something simple to use the class as designed without adding anything then maybe it's better to do that.

chu11 · 2022-08-10T21:49:43Z

> would doing that be sufficient rather than creating a new list_ function?

Hmmm, I don't think that will work as written b/c list_del_from() requires the entry to be on a list already (there's an assert in the ccan code that checks for this). But list_del() + list_add_tail() should be sufficient.

> If we need to add a new function, we may want to add it directly to libccan and submit the change upstream.

I went down this path just b/c I remember needing the check in the KVS. But doing something similar to what you suggested should be fine.

chu11 · 2022-08-10T23:29:22Z

re-pushed, doing list_del() and list_add_tail() together instead of the previous solution. so the PR is now just one commit :-)
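
For illustration, a minimal sketch of why deleting before appending is safe. It assumes a circular doubly linked list in which an off-list node is linked to itself, so unlinking it is harmless. This is a stand-in written for this example, not the ccan `list_` API, and `struct lnode`, `lnode_init()`, `lnode_del()`, `lnode_add_tail()`, `lnode_move_tail()` and `lnode_count()` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical circular list node; an off-list node points at itself. */
struct lnode {
    struct lnode *prev;
    struct lnode *next;
};

/* Self-link a node (also used to set up an empty list head). */
static void lnode_init(struct lnode *n)
{
    n->prev = n;
    n->next = n;
}

/* Unlink n from whatever list it is on and leave it self-linked.
 * On an already self-linked node this changes nothing. */
static void lnode_del(struct lnode *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    lnode_init(n);
}

/* Insert n just before the head, i.e. at the tail of the list. */
static void lnode_add_tail(struct lnode *head, struct lnode *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Delete-then-append: n ends up on the list exactly once, at the tail,
 * whether or not it was on the list beforehand. */
static void lnode_move_tail(struct lnode *head, struct lnode *n)
{
    lnode_del(n);
    lnode_add_tail(head, n);
}

/* Count the nodes between head->next and head. */
static size_t lnode_count(const struct lnode *head)
{
    size_t count = 0;
    for (const struct lnode *n = head->next; n != head; n = n->next)
        count++;
    return count;
}
```

With this pattern, calling `lnode_move_tail()` on an entry that is already queued simply moves it to the tail, so the length of the list stays the same and its neighbors are relinked to each other.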

garlick

LGTM. Thanks, simple is good.
Just a suggestion for improving the test.

garlick · 2022-08-10T23:47:35Z

t/issues/t4482-flush-list-corruption.sh

+flux kvs put issue4482A.a="abcdefghijk"
+flux kvs put issue4482A.b="lmnopqrstuv"
+flux kvs put issue4482A.c="wxyz0123456"
+flux kvs put issue4482A.d="7890ABCDEFG"
+flux kvs put issue4482A.e="HIJKLMNOPQR"
+flux kvs put issue4482A.f="STUVWXYZ!!!"
+flux kvs put issue4482A.g="<<<<<:>>>>>"
+
+flux kvs dropcache
+
+flux kvs put issue4482B.a="abcdefghijk"
+flux kvs put issue4482B.b="lmnopqrstuv"
+flux kvs put issue4482B.c="wxyz0123456"
+flux kvs put issue4482B.d="7890ABCDEFG"
+flux kvs put issue4482B.e="HIJKLMNOPQR"
+flux kvs put issue4482B.f="STUVWXYZ!!!"


I'm not sure this is doing what it looks like at face value, although it still may work. Those short values will be cached inside the directory entry for "issue4482B" so really what you're doing is creating multiple versions of that directory and the root in the content store.

Suggestion: use flux content store since the problem this pokes at has nothing to do with the kvs.

ahhh yeah, you're right, my description is not correct, but I think the effect is identical: the multiple versions of the directory are the "data", not the junk I'm writing.

Let me try with flux content store

grondo

Not a full review, but I did want to point something out in how the test is organized.

Also, in practice we usually split additional tests into a separate commit (unless code changes break tests) in keeping with the idea that commits should "do one thing". However, I don't feel strongly about that so this is fine with me.

grondo · 2022-08-11T00:21:06Z

t/issues/t4482-flush-list-corruption.sh

+
+chmod +x t4482.sh
+
+flux start -s 1 \


Since the test script is generated in place, it might be easier (and keep all test components together) to generate the custom rc1 and rc3 scripts here as well. I'd also hate to proliferate the one-off rc scripts in t/rc and end up with many rc1-issue* files in there in the future.

Or, you could do away with the rc scripts and just load and unload the necessary modules directly in the test script, as you are doing with content-sqlite.

My vote would be to make the test self-contained as well. It should probably drop the bash -e option and handle errors explicitly then, so that it doesn't bail out leaving modules unloaded for the next test.

Problem: A dirty cache entry has the potential to be added onto the content cache's flush list twice. This double addition can lead to list corruption. The observed side effect was a list that was shortened and no longer accurate with respect to the `acct_dirty` counter. This could lead to hangs with content flush, missed flushes to the backing store, and segfault/memory corruption in the worst case.

Solution: Remove the cache entry from the flush list before adding it. The remove is a no-op if it is not already on a list.

Fixes flux-framework#4482

Problem: No test covers duplicate content cache entries being added to the content cache's flush list. Solution: Add a regression test.

chu11 · 2022-08-11T05:46:01Z

re-pushed, cleaned up the test a ton, it looks far simpler / better now, and split it off into its own commit.

codecov · 2022-08-11T06:31:07Z

Codecov Report

Merging #4484 (8c51d6f) into master (8e53bf5) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4484      +/-   ##
==========================================
- Coverage   83.37%   83.37%   -0.01%     
==========================================
  Files         401      401              
  Lines       67527    67529       +2     
==========================================
- Hits        56303    56301       -2     
- Misses      11224    11228       +4

Impacted Files	Coverage Δ
src/broker/content-cache.c	`85.74% <100.00%> (+0.05%)`	⬆️
src/common/libterminus/terminus.c	`85.82% <0.00%> (-0.25%)`	⬇️
src/shell/output.c	`76.54% <0.00%> (-0.16%)`	⬇️
src/cmd/flux-job.c	`87.29% <0.00%> (-0.14%)`	⬇️

garlick

LGTM!

Problem: It would be nice to get a simple answer of what nodes/ranks are up vs down in the overlay network. Solution: Support a new flux overlay whatsup subcommand that is modeled after the whatsup(1) command. Fixes flux-framework#4484

chu11 force-pushed the issue4482_flush_list_corruption branch 2 times, most recently from 7d9d663 to f904744 Compare August 10, 2022 23:29

chu11 force-pushed the issue4482_flush_list_corruption branch from f904744 to b15711f Compare August 10, 2022 23:30

garlick approved these changes Aug 10, 2022

grondo reviewed Aug 11, 2022

chu11 added 2 commits August 10, 2022 21:42

testsuite: add content cache flush regression test

8c51d6f

chu11 force-pushed the issue4482_flush_list_corruption branch from b15711f to 8c51d6f Compare August 11, 2022 05:45

garlick approved these changes Aug 11, 2022

chu11 added the merge-when-passing label Aug 11, 2022

mergify bot merged commit 56a8c5e into flux-framework:master Aug 11, 2022

chu11 deleted the issue4482_flush_list_corruption branch August 17, 2022 20:29

chu11 mentioned this pull request Mar 29, 2023

flux-overlay: add whatsup subcommand #5044

Closed
