Core dump with flux archive create #6461

Closed · vsoch opened this issue Nov 27, 2024 · 17 comments · Fixed by #6462

vsoch commented Nov 27, 2024

When I try to do a flux archive create with a file that is 3GB or larger, there is a segmentation fault:

# flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt
Segmentation fault (core dumped)
root@flux-sample-0:/chonks# echo $?
139

Here is the run with valgrind:

root@flux-sample-0:/chonks# valgrind flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt
==3732== Memcheck, a memory error detector
==3732== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3732== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==3732== Command: flux archive create --name create-archive-3 --dir /chonks 3gb.txt
==3732== 
==3732== Warning: set address range perms: large range [0x59c87000, 0x10c98d000) (defined)

==3732== Invalid read of size 4
==3732==    at 0x12FE41: SHA1_Transform (sha1.c:130)
==3732==    by 0x13118C: SHA1_Update (sha1.c:216)
==3732==    by 0x12CED3: sha1_hash (blobref.c:76)
==3732==    by 0x12D11C: blobref_hash (blobref.c:184)
==3732==    by 0x12717B: blobvec_append (fileref.c:46)
==3732==    by 0x12717B: blobvec_create (fileref.c:114)
==3732==    by 0x12717B: fileref_create_blobvec (fileref.c:163)
==3732==    by 0x12717B: fileref_create_ex (fileref.c:394)
==3732==    by 0x1200A4: add_archive_file (archive.c:175)
==3732==    by 0x12081C: subcmd_create (archive.c:336)
==3732==    by 0x11F3CB: cmd_archive (archive.c:665)
==3732==    by 0x116533: main (flux.c:235)
==3732==  Address 0x10c98d000 is not stack'd, malloc'd or (recently) free'd
==3732== 
==3732== 
==3732== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3732==  Access not within mapped region at address 0x10C98D000
==3732==    at 0x12FE41: SHA1_Transform (sha1.c:130)
==3732==    by 0x13118C: SHA1_Update (sha1.c:216)
==3732==    by 0x12CED3: sha1_hash (blobref.c:76)
==3732==    by 0x12D11C: blobref_hash (blobref.c:184)
==3732==    by 0x12717B: blobvec_append (fileref.c:46)
==3732==    by 0x12717B: blobvec_create (fileref.c:114)
==3732==    by 0x12717B: fileref_create_blobvec (fileref.c:163)
==3732==    by 0x12717B: fileref_create_ex (fileref.c:394)
==3732==    by 0x1200A4: add_archive_file (archive.c:175)
==3732==    by 0x12081C: subcmd_create (archive.c:336)
==3732==    by 0x11F3CB: cmd_archive (archive.c:665)
==3732==    by 0x116533: main (flux.c:235)
==3732==  If you believe this happened as a result of a stack
==3732==  overflow in your program's main thread (unlikely but
==3732==  possible), you can try to increase the size of the
==3732==  main thread stack using the --main-stacksize= flag.
==3732==  The main thread stack size used in this run was 8388608.
==3732== 
==3732== HEAP SUMMARY:
==3732==     in use at exit: 272,858 bytes in 1,747 blocks
==3732==   total heap usage: 1,899 allocs, 152 frees, 287,035 bytes allocated
==3732== 
==3732== LEAK SUMMARY:
==3732==    definitely lost: 0 bytes in 0 blocks
==3732==    indirectly lost: 0 bytes in 0 blocks
==3732==      possibly lost: 24 bytes in 1 blocks
==3732==    still reachable: 272,834 bytes in 1,746 blocks
==3732==         suppressed: 0 bytes in 0 blocks
==3732== Rerun with --leak-check=full to see details of leaked memory
==3732== 
==3732== For lists of detected and suppressed errors, rerun with: -s
==3732== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

I saw the message above, and added --leak-check=full:

valgrind --leak-check=full flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt
==3741== Memcheck, a memory error detector
==3741== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3741== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==3741== Command: flux archive create --name create-archive-3 --dir /chonks 3gb.txt
==3741== 
==3741== Warning: set address range perms: large range [0x59c87000, 0x10c98d000) (defined)
==3741== Invalid read of size 4
==3741==    at 0x12FE41: SHA1_Transform (sha1.c:130)
==3741==    by 0x13118C: SHA1_Update (sha1.c:216)
==3741==    by 0x12CED3: sha1_hash (blobref.c:76)
==3741==    by 0x12D11C: blobref_hash (blobref.c:184)
==3741==    by 0x12717B: blobvec_append (fileref.c:46)
==3741==    by 0x12717B: blobvec_create (fileref.c:114)
==3741==    by 0x12717B: fileref_create_blobvec (fileref.c:163)
==3741==    by 0x12717B: fileref_create_ex (fileref.c:394)
==3741==    by 0x1200A4: add_archive_file (archive.c:175)
==3741==    by 0x12081C: subcmd_create (archive.c:336)
==3741==    by 0x11F3CB: cmd_archive (archive.c:665)
==3741==    by 0x116533: main (flux.c:235)
==3741==  Address 0x10c98d000 is not stack'd, malloc'd or (recently) free'd
==3741== 
==3741== 
==3741== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3741==  Access not within mapped region at address 0x10C98D000
==3741==    at 0x12FE41: SHA1_Transform (sha1.c:130)
==3741==    by 0x13118C: SHA1_Update (sha1.c:216)
==3741==    by 0x12CED3: sha1_hash (blobref.c:76)
==3741==    by 0x12D11C: blobref_hash (blobref.c:184)
==3741==    by 0x12717B: blobvec_append (fileref.c:46)
==3741==    by 0x12717B: blobvec_create (fileref.c:114)
==3741==    by 0x12717B: fileref_create_blobvec (fileref.c:163)
==3741==    by 0x12717B: fileref_create_ex (fileref.c:394)
==3741==    by 0x1200A4: add_archive_file (archive.c:175)
==3741==    by 0x12081C: subcmd_create (archive.c:336)
==3741==    by 0x11F3CB: cmd_archive (archive.c:665)
==3741==    by 0x116533: main (flux.c:235)
==3741==  If you believe this happened as a result of a stack
==3741==  overflow in your program's main thread (unlikely but
==3741==  possible), you can try to increase the size of the
==3741==  main thread stack using the --main-stacksize= flag.
==3741==  The main thread stack size used in this run was 8388608.
==3741== 
==3741== HEAP SUMMARY:
==3741==     in use at exit: 272,858 bytes in 1,747 blocks
==3741==   total heap usage: 1,899 allocs, 152 frees, 287,035 bytes allocated
==3741== 
==3741== 24 bytes in 1 blocks are possibly lost in loss record 229 of 773
==3741==    at 0x484A899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3741==    by 0x4B08F9E: tsearch (tsearch.c:337)
==3741==    by 0x4A2E06A: __add_to_environ (setenv.c:233)
==3741==    by 0x48563FF: setenv (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3741==    by 0x12E654: environment_apply (environment.c:305)
==3741==    by 0x116393: main (flux.c:222)
==3741== 
==3741== LEAK SUMMARY:
==3741==    definitely lost: 0 bytes in 0 blocks
==3741==    indirectly lost: 0 bytes in 0 blocks
==3741==      possibly lost: 24 bytes in 1 blocks
==3741==    still reachable: 272,834 bytes in 1,746 blocks
==3741==         suppressed: 0 bytes in 0 blocks
==3741== Reachable blocks (those to which a pointer was found) are not shown.
==3741== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3741== 
==3741== For lists of detected and suppressed errors, rerun with: -s
==3741== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

I'm attempting to upload the core dump to Microsoft OneDrive, but it's a 🔥 🗑️ so it's failing every time - I'll keep trying and post a link here if/when it works. I hope (think, maybe?) the above gives enough of a hint as to what might be going on?

Ping @garlick and @grondo !


garlick commented Nov 28, 2024

Thanks! I created some files with dd from /dev/urandom:

$ ls -l /tmp/chonks
total 6291468
-rw-rw-r-- 1 garlick garlick 1073741824 Nov 28 07:29 big1
-rw-rw-r-- 1 garlick garlick 2147483648 Nov 28 07:28 big2
-rw-rw-r-- 1 garlick garlick 3221225472 Nov 28 07:18 big3

and (eek) I was able to reproduce with simply

$ flux archive create FILE

where FILE is big2 or big3 but not big1.

Edit: so don't bother with the core file - I can make them all day long!


garlick commented Nov 28, 2024

(gdb) bt -full
#0  0x000055555557ded1 in SHA1_Transform (state=state@entry=0x7fffffffd5f0, 
    buffer=buffer@entry=0x7ffff541a000 "") at sha1.c:130
        a = <optimized out>
        b = <optimized out>
        c = <optimized out>
        d = <optimized out>
        e = <optimized out>
        block = 0x5555555b0a60 <workspace>
        workspace = '\000' <repeats 63 times>
#1  0x000055555557f21d in SHA1_Update (context=0x7fffffffd5f0, 
    data=0x7fff75200000 "\341\203~\323\062\352\275\bp\206Ȇ#A1\313N>K\207\315lG\353\333i\377\017\\\321:\266\351P\034Л\374#}\361X\240\367\022Y\370\031J\367T|\262\207\227\354bm\353\311ԅ\001Eĺ\230K\001٧PQ#w\020\236oc\357\265|\361\326.\334N\237\027\226\366Z\227\204\066|\226\215\211\"\217G\247\064\220\062:\372S\035\212\304c{\237$j6\024M\326\034\275\005\006\026**\326\"\266H!d\267\064\367_t\022@\342\t\315a\201\025\024s\203\006\004\367\272b\263\255", len=18446744071562067968) at sha1.c:216
        i = 2149687296
        j = 0
#2  0x000055555557ab64 in sha1_hash (data=0x7fff75200000, data_len=-2147483648, 
    hash=0x7fffffffd690, hash_len=<optimized out>) at blobref.c:76
        ctx = {state = {2385607362, 1459538167, 3096738172, 3090102770, 782466460}, 
          count = {0, 4294967293}, 
          buffer = "\341\203~\323\062\352\275\bp\206Ȇ#A1\313N>K\207\315lG\353\333i\377\017\\\321:\266\351P\034Л\374#}\361X\240\367\022Y\370\031J\367T|\262\207\227\354bm\353\311ԅ\001E"}
        __PRETTY_FUNCTION__ = "sha1_hash"
#3  0x000055555557adad in blobref_hash (
    hashtype=hashtype@entry=0x5555556103b0 "sha1", data=data@entry=0x7fff75200000, 
    len=len@entry=-2147483648, blobref=blobref@entry=0x7fffffffd7e0, 
    blobref_len=blobref_len@entry=72) at blobref.c:184
        bh = 0x5555555af160 <blobtab>
        hash = "\240\326\377\377\377\177\000\000\000`\351\v\224bc/\003\000\000\000\000\000\000\000\000`\351\v\224bc/"
#4  0x0000555555574e0c in blobvec_append (hashtype=0x5555556103b0 "sha1", 
    blobsize=18446744071562067968, offset=0, mapbuf=0x7fff75200000, 
    blobvec=0x55555560f9b0) at fileref.c:46
        blobref = "IJYUUU\000\000\220\330\377\377\377\177\000\000\060\004aUUU\000\000\000`\351\v\224bc/\377\377\377\377\000\000\000\000x\376\377\377\377\377\377\377\002\000\000\000\001\000\000\000@,\340\367\377\177\000\000\002\000\000\000\000\000\000"
        o = <optimized out>
        offsetj = 0
        blobsizej = 0
        blobref = <optimized out>
        o = <optimized out>
        offsetj = <optimized out>
        blobsizej = <optimized out>
#5  blobvec_create (chunksize=1048576, hashtype=0x5555556103b0 "sha1", 
    size=2147483648, mapbuf=0x7fff75200000, fd=5) at fileref.c:114
        notdata = <optimized out>
        blobsize = -2147483648
        blobvec = 0x55555560f9b0
        offset = <optimized out>
        error = <optimized out>
        blobvec = <optimized out>
        offset = <optimized out>
        __PRETTY_FUNCTION__ = "blobvec_create"
        error = <optimized out>
        notdata = <optimized out>
        blobsize = <optimized out>
        saved_errno = <optimized out>
#6  fileref_create_blobvec (error=0x7fffffffd8c0, chunksize=1048576, hashtype=0x5555556103b0 "sha1", sb=0x7fffffffd750, mapbuf=0x7fff75200000, fd=5, path=0x7fffffffe08b "tmp/chonks/big2") at fileref.c:163
        blobvec = <optimized out>
        o = <optimized out>
        error = <optimized out>
        blobvec = <optimized out>
        o = <optimized out>
        error = <optimized out>
        saved_errno = <optimized out>
#7  fileref_create_ex (path=path@entry=0x7fffffffe08a "/tmp/chonks/big2", param=param@entry=0x7fffffffd9f8, mapinfop=mapinfop@entry=0x7fffffffd8b0, error=error@entry=0x7fffffffd8c0) at fileref.c:394
        chunksize = 1048576
        relative_path = 0x7fffffffe08b "tmp/chonks/big2"
        o = <optimized out>
        fd = 5
        sb = {st_dev = 66306, st_ino = 20840665, st_nlink = 1, st_mode = 33204, st_uid = 5588, st_gid = 5588, __pad0 = 0, st_rdev = 0, st_size = 2147483648, st_blksize = 4096, st_blocks = 4194312, st_atim = {tv_sec = 1732807877, tv_nsec = 453461565}, st_mtim = {tv_sec = 1732807725, tv_nsec = 757533978}, st_ctim = {tv_sec = 1732807760, tv_nsec = 934444746}, __glibc_reserved = {0, 0, 0}}
        mapinfo = {base = 0x7fff75200000, size = <optimized out>}
        saved_errno = <optimized out>
        error = <optimized out>
#8  0x000055555556d5f5 in add_archive_file (ctx=ctx@entry=0x7fffffffd9d0, path=path@entry=0x7fffffffe08a "/tmp/chonks/big2") at builtin/archive.c:175
        mapinfo = {base = 0x100000002, size = 2}
        error = {text = "\002\000\000\000\000\000\000\000\000`\351\v\224bc/\260\004aUUU\000\000x\376\377\377\377\377\377\377\002\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\002\000\000\000\001\000\000\000\250\334\377\377\377\177\000\000\260\004aUUU\000\000\000`\351\v\224bc/\300\005aUUU\000\000x\376\377\377\377\377\377\377\002\000\000\000\000\000\000\000\060\033aUUU\000\000\320$\000\000\000\000\000\000\b\000\000\000\000\000\000\000\200\254\301\367\377\177\000\000\250\334\377\377\377\177\000\000\260\004aUUU\000\000\371e\252\367\377\177\000"}
        fileref = <optimized out>
#9  0x000055555556dd6d in subcmd_create (p=<optimized out>, ac=<optimized out>, av=<optimized out>) at builtin/archive.c:336
        path = 0x7fffffffe08a "/tmp/chonks/big2"
        sb = {st_dev = 66306, st_ino = 20840665, st_nlink = 1, st_mode = 33204, st_uid = 5588, st_gid = 5588, __pad0 = 0, st_rdev = 0, st_size = 2147483648, st_blksize = 4096, st_blocks = 4194312, st_atim = {tv_sec = 1732807877, tv_nsec = 453461565}, st_mtim = {tv_sec = 1732807725, tv_nsec = 757533978}, st_ctim = {tv_sec = 1732807760, tv_nsec = 934444746}, __glibc_reserved = {0, 0, 0}}
        ctx = {p = 0x5555555c57e0, h = 0x55555560f370, name = 0x555555599614 "main", namespace = 0x555555599324 "primary", verbose = 0, use_mmap = false, param = {hashtype = 0x5555556103b0 "sha1", chunksize = 1048576, small_file_threshold = 1024}, archive = 0x555555610540, txn = 0x5555556119b0, preserve_seq = 0}
        n = <optimized out>
        directory = 0x0
        flags = 5
        s = <optimized out>
        hashtype = 0x5555556103b0 "sha1"
        key = 0x55555560d280 "archive.main"
        f = <optimized out>
#10 0x000055555556c91c in cmd_archive (p=0x5555555c4420, ac=3, av=0x7fffffffdca0) at builtin/archive.c:665
No locals.
#11 0x0000555555562674 in main (argc=4, argv=0x7fffffffdc98) at flux.c:235
        vopt = false
        env = 0x55555560b580
        p = 0x5555555c2ee0
        searchpath = <optimized out>
        s = <optimized out>
        argv0 = <optimized out>
        flags = 1
        optindex = 1
(gdb)  


vsoch commented Nov 28, 2024

That's great! I'm glad it wasn't something about my environment. If you make a bunch of cores, you can use them for pie. 🥧 Happy Thanksgiving @garlick !


garlick commented Nov 28, 2024

You too @vsoch !


garlick commented Nov 28, 2024

Oh wow, this was a lame error on my part! This appears to fix the problem:

diff --git a/src/common/libfilemap/fileref.c b/src/common/libfilemap/fileref.c
index 02756a9c3..dab7a1dc5 100644
--- a/src/common/libfilemap/fileref.c
+++ b/src/common/libfilemap/fileref.c
@@ -98,7 +98,7 @@ static json_t *blobvec_create (int fd,
 #endif
         if (offset < size) {
             off_t notdata;
-            int blobsize;
+            size_t blobsize;
 
 #ifdef SEEK_HOLE
             // N.B. returns size if there are no more holes
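The gdb backtrace earlier in the thread makes the failure mode visible: blobsize ends up as -2147483648 for the 2G file, and when that value is passed along as a size_t length it becomes 18446744071562067968, so SHA1_Update walks far past the end of the mmapped file. Here is a minimal standalone sketch of the same truncation/sign-extension (assuming a typical 64-bit system with 64-bit off_t; this is not the flux code itself):

#include <stdio.h>
#include <sys/types.h>

int main (void)
{
    off_t size = 2147483648;   /* st_size of the 2G test file */
    int blobsize = size;       /* implementation-defined conversion; wraps to INT_MIN on common compilers */
    size_t len = blobsize;     /* sign-extends to a huge unsigned length */

    printf ("off_t size   = %lld\n", (long long)size);
    printf ("int blobsize = %d\n", blobsize);
    printf ("size_t len   = %zu\n", len);
    /* prints -2147483648 and 18446744071562067968, matching the gdb output;
       declaring blobsize as size_t (the fix above) avoids the truncation */
    return 0;
}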


vsoch commented Nov 28, 2024

Nice! Should I hot patch this on a custom branch, or are you planning to do a PR soon? I'm good either way, but please let me know so I can run more experiments over break! I'm testing different topology setups in the Flux Operator and it would be really nice to go above 2GB. I really want to know whether the different kary designs diverge more at larger sizes than they do here (right now it's mostly binomial vs. kary that seems to make a difference, and in some cases kary:1 is just really bad).

[three plots attached]

Don't mind the crappy graphs - I'll make them better when I'm beyond prototype mode.

garlick added a commit to garlick/flux-core that referenced this issue Nov 28, 2024
Problem: 'flux archive create' segfaults when it tries to archive
files with size >2G.

Change a local variable from int to size_t.

Fixes flux-framework#6461

garlick commented Nov 28, 2024

Just posted a PR.

Right - the performance will be sensitive to the tree fanout because each level of the tree will fetch data once from its parent, then provide it once to each child that requests it. Well, that assumes perfect caching, but the LRU cache tries to keep itself below 16MB, so for large amounts of data the cache may thrash a bit. If you want to play with that limit, you could do something like

flux module reload content purge-target-size=104857600 # 100mb

Not sure what effect that would have since it kind of depends on how the timing works out. You can peek at the cache size with

flux module stats content | jq


vsoch commented Nov 28, 2024

Oh nice! That's really helpful - I'll add that to my notes (and experiments). We definitely want to crank that size up.

And we are testing exactly that - how different topologies handle the distribution. We want to use flux as a distribution mechanism, and since I'm new to this generally, I want to experimentally see/verify things. There are a few uses:

  1. As a means to distribute large data across nodes in a Kubernetes cluster (I have a project, flux-distribute, that installs Flux on nodes and could then provide a container storage interface to retrieve data and bind it to pods); the use case would be some huge ML model files that need to be present on all nodes.
  2. As a snapshotter plugin assistant (to do the same). Right now, when a container is pulled in Kubernetes, each node does the pull separately. It's not a huge issue for the registry (they expect it), but it would be a better design to pull once and then distribute across the cluster.
  3. To go with a tool like spindle (although right now just using TMPDIR is probably a good start).

Anyway - thank you!


garlick commented Nov 28, 2024

I should have mentioned: if you do the module reload, use flux exec to do it instance-wide.

mergify bot closed this as completed in #6462 on Nov 28, 2024

vsoch commented Nov 29, 2024

Works great! 🥳 We weren't able to go above 2GB before, and here we are successfully creating and distributing 3GB:

[screenshot attached]

And we go up to 10GB!

[screenshot attached]

The times are really shooting up there - the cleans/deletes aren't that important, more so the create and distribute. I don't have any plots yet - I'm doing a smaller cluster first (6 nodes, up to 10GB). Then I'm thinking the baseline would be a wget download of an archive of the same size, because that's what we are trying to improve upon.

Question for you @garlick - for the stats, what would this show me / how would it be useful? Should I run it at the beginning / end / between operations to get stats, and what do they tell me? Here is a shot of a run at the start (before I've done anything, but after I reload the content module across brokers):

[screenshot attached]


vsoch commented Nov 29, 2024

I'm looking at some of the result data, and the purple line is erroneous:

[plot: distribute-all-nodes-nodes-6]

Specifically, I think I'm hitting some limit or other error with kary:3 (I'm not sure what at the moment - I just finished the runs and would need to manually inspect):

[screenshot attached]

Here is what that topology looks like:

nodes-6-topo-kary:3
0 faa621ae1899: full
├─ 1 faa621ae1899: full
│  ├─ 4 faa621ae1899: full
│  └─ 5 faa621ae1899: full
├─ 2 faa621ae1899: full
└─ 3 faa621ae1899: full

I'll try bringing the cluster up tomorrow and just targeting that size, and running that stats command between the calls to see if anything looks weird. I could also try bringing up the larger cluster just to do spot check runs to see if the issue reproduces (I bet it will)!

Also - I know that binomial == kary:2; I'm mostly doing both to sanity check that flux knows that too - and it seems like something is funny! :) In the plot, the green and aqua should be the same (but they are not), which makes sense because the topologies look different. This could of course be a bug on my part, but here is the tree I get when I ask the broker to make me a binomial topology:

nodes-6-topo-binomial
0 faa621ae1899: full
├─ 1 faa621ae1899: full
├─ 2 faa621ae1899: full
│  └─ 3 faa621ae1899: full
└─ 4 faa621ae1899: full
   └─ 5 faa621ae1899: full

And here is the kary:2 (this is my understanding of what it should look like):

nodes-6-topo-kary:2
0 faa621ae1899: full
├─ 1 faa621ae1899: full
│  ├─ 3 faa621ae1899: full
│  └─ 4 faa621ae1899: full
└─ 2 faa621ae1899: full
   └─ 5 faa621ae1899: full

If I'm reading that right, broker 0 has three children, and I think there should be just two?


grondo commented Nov 29, 2024

Also - I know that binomial == kary:2, I'm mostly doing both to sanity check flux knows that too - seems like something is funny! :)

I'm sure @garlick will answer here shortly with a more thoughtful answer, but I just wanted to quickly point out that binomial != kary:2.

kary:2 is a binary tree, while a binomial tree has a more complex definition (explained better than I can in the binomial heap Wikipedia article).
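To make the difference concrete, here is a small sketch of the two parent functions (these are the standard definitions, consistent with the trees printed above, though not necessarily the exact flux-core implementation):

#include <stdio.h>

/* kary:k - complete k-ary tree: the parent of rank i is (i - 1) / k */
static int kary_parent (int rank, int k)
{
    return rank == 0 ? -1 : (rank - 1) / k;
}

/* binomial - the parent of rank i is i with its lowest set bit cleared */
static int binomial_parent (int rank)
{
    return rank == 0 ? -1 : rank & (rank - 1);
}

int main (void)
{
    printf ("rank  kary:2  binomial\n");
    for (int rank = 0; rank < 6; rank++)
        printf ("%4d  %6d  %8d\n",
                rank, kary_parent (rank, 2), binomial_parent (rank));
    /* kary:2   parents: 1,2 -> 0; 3,4 -> 1; 5 -> 2  (rank 0 has two children)
       binomial parents: 1,2,4 -> 0; 3 -> 2; 5 -> 4  (rank 0 has three children) */
    return 0;
}

So the 6-node binomial tree above legitimately gives rank 0 three children, while kary:2 gives it two.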


vsoch commented Nov 29, 2024

You're right! I need to read more about this - it totally goes against my idea of what a heap is, too. I think I was probably thinking of a binary tree? https://en.wikipedia.org/wiki/M-ary_tree.

What about the bug?

GitHub isn't letting me comment anymore, so I'm reopening.

vsoch reopened this Nov 29, 2024

vsoch commented Nov 29, 2024

Also - I was perusing old issues and found one that seems to encompass both of the findings here - a segfault and a "no such file or directory" error:

#2443

The difference is that it's directed at the kvs, but I suspect flux archive is using the kvs, so maybe there is some overlap there? I think we fixed the segfault, and maybe there is still some issue with keys/indexes. I'm bringing up a new cluster now to see if I can reproduce and get you more data.


garlick commented Nov 29, 2024

Is the bug you are referring to the ENOENT from the kvs commit? Perhaps that should be a new issue?

Sorry, I'm away from a computer atm


vsoch commented Nov 29, 2024

Yes, the "kvs commit: No such file or directory" error. Sure, happy to open a new issue!


vsoch commented Nov 29, 2024

Continued in #6463.
