Slow store queries may block fleet nodes #1017

jm-clius · 2022-06-23T09:07:01Z

With the SQLite-only store it seems as if nodes can become unresponsive when building responses to sparse queries.

This has been observed on the wakuv2.prod fleet for node-01.gc-us-central1-a.wakuv2.prod around 8 UTC on 2022-06-23, though the issue may have been present quite some time before that. New connection attempts to the node time out and consul reports metrics reporting failures.

Successive backtraces on the running node reveals that the node is stuck answering a store query:

(gdb) bt
#0  0x000055af8ad2ce93 in pcache1PinPage (pPage=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:50879
#1  pcache1FetchStage2 (pCache=0x7f6894694ef8, iKey=31504, createFlag=2) at /app/vendor/nim-sqlite3-abi/sqlite3.c:50879
#2  0x000055af8ad82ca9 in sqlite3PcacheFetch (createFlag=3, pgno=31504, pCache=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:57397
#3  getPageNormal (pPager=0x7f6894694718, pgno=31504, ppPage=0x7ffec9598190, flags=2) at /app/vendor/nim-sqlite3-abi/sqlite3.c:57397
#4  0x000055af8ad84637 in sqlite3PagerGet (flags=2, ppPage=0x7ffec9598190, pgno=<optimized out>, pPager=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:69670
#5  accessPayload (pCur=0x7f68941dc0d8, offset=0, amt=52524,
    pBuf=0x7f68924b818a "83031613061356561653536663432363633343030222c22307839303239666466616539613033313335383436333831633763653136353935633335353465313061223a2230783036313130613833623965636366313935383030222c223078376230653"..., eOp=0) at /app/vendor/nim-sqlite3-abi/sqlite3.c:4134
#6  0x000055af8ad8595a in sqlite3BtreePayload (pBuf=<optimized out>, amt=65182, offset=107, pCur=0x7f68941dc0d8) at /app/vendor/nim-sqlite3-abi/sqlite3.c:77264
#7  sqlite3VdbeMemFromBtree (pCur=0x7f68941dc0d8, offset=107, amt=65182, pMem=0x7f68941dd588) at /app/vendor/nim-sqlite3-abi/sqlite3.c:11728
#8  0x000055af8adcd79e in sqlite3VdbeExec (p=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:88420
#9  0x000055af8add0b30 in sqlite3Step (p=0x0) at /app/vendor/nim-sqlite3-abi/sqlite3.c:83941
#10 sqlite3_step (pStmt=pStmt@entry=0x7f68941dd798) at /app/vendor/nim-sqlite3-abi/sqlite3.c:18462
#11 0x000055af8b2c0b61 in exec__mW8h6kAccHfls1l8PiW87A (s_0=s_0@entry=0x7f68941dd798, params=params@entry=0x7ffec95987f0, onData=..., Result=Result@entry=0x7ffec95987e0)
    at /app/waku/v2/node/storage/sqlite.nim:207
#12 0x000055af8b324fa2 in getPage__xsP8EcHIHYTHBPzOo1A8Aw (db=<optimized out>, pred=..., pagingInfo=<optimized out>, Result=<optimized out>) at /app/waku/v2/node/storage/message/waku_message_store.nim:383
#13 0x00000000006ea77c in ?? ()
#14 0x000055af8b2173c8 in findMessages__xInphcAVlliR33P39aBgAwA (w=<optimized out>, query=query@entry=0x7f6893e83088, Result=Result@entry=0x7f6893e83250) at /app/waku/v2/protocol/waku_store/waku_store.nim:333
#15 0x000055af8b21897a in handler__5HU2zQKkk6mkjZiQFIFJ3w_2 (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/waku/v2/protocol/waku_store/waku_store.nim:357
#16 0x000055af8aed5833 in futureContinue__XuNTB7fHwBI8KII0qEQaCw (fut=0x7f68912190c0) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#17 0x000055af8aeef0c2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 () at /app/vendor/nimbus-build-system/vendor/Nim/lib/pure/collections/deques.nim:390
#18 0x000055af8af01c1d in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1068
#19 0x000055af8b3561c1 in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1458
#20 0x000055af8b35f272 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:411
#21 0x000055af8ac96a6d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:418

(gdb) bt
#0  0x000055af8ae1c76d in split__NJo8pxZdXEAIa7wkHls9cOw_2 (t=0x7f68933bdac0) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:71
#1  add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f68933bdac0, key=<optimized out>, upperBound=<optimized out>) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:71
#2  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f68933bcdd8, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#3  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f689293a5e8, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#4  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f689251c770, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#5  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f68929339c8, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#6  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=a@entry=0x7f68946770b8, t=t@entry=0x7f689467a160, key=key@entry=140087070900256, upperBound=upperBound@entry=140087070965606)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#7  0x000055af8ae1c8f3 in rawAlloc__mE4QEVyMvGRVliDWDngZCQ (a=a@entry=0x7f68946770b8, requestedSize=requestedSize@entry=65318) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/alloc.nim:808
#8  0x000055af8ae21f4c in rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ (typ=0x55af8b56f6c0 <NTI__6H5Oh5UUvVCLiakt9aTwtUQ_>, size=65302, gch=0x7f6894677050)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:799
#9  0x000055af8ae221da in newObj (typ=<optimized out>, size=65302) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:440
#10 0x000055af8ae243b0 in newSeq (typ=<optimized out>, len=len@entry=65286) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:448
#11 0x000055af8ae2964d in at___FKhCG9aXnVoTuyjERVXzd2g (
    a=a@entry=0x7f68924b5018 "{\"data\":\"7b2266656573223a7b22307863323133326430356433316339313461383763363631316331303734386165623034623538653866223a223078333432343966222c2230783237393162636131663264653436363165643838613330633939613"..., aLen_0=aLen_0@entry=65286) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system.nim:1544
#12 0x000055af8b323052 in msg__i3o9cXbEdaMzfyufaKvkWWg_3 (s_0=s_0@entry=0x7f68941dd798, ClE_0=ClE_0@entry=0x7f6893e48730) at /app/waku/v2/node/storage/message/waku_message_store.nim:311
#13 0x000055af8b2c0d5b in exec__mW8h6kAccHfls1l8PiW87A (s_0=s_0@entry=0x7f68941dd798, params=params@entry=0x7ffec95987f0, onData=..., Result=Result@entry=0x7ffec95987e0)
    at /app/waku/v2/node/storage/sqlite.nim:210
#14 0x000055af8b324fa2 in getPage__xsP8EcHIHYTHBPzOo1A8Aw (db=<optimized out>, pred=..., pagingInfo=<optimized out>, Result=<optimized out>) at /app/waku/v2/node/storage/message/waku_message_store.nim:383
#15 0x00000000006ea77c in ?? ()
#16 0x000055af8b2173c8 in findMessages__xInphcAVlliR33P39aBgAwA (w=<optimized out>, query=query@entry=0x7f6893e83088, Result=Result@entry=0x7f6893e83250) at /app/waku/v2/protocol/waku_store/waku_store.nim:333
#17 0x000055af8b21897a in handler__5HU2zQKkk6mkjZiQFIFJ3w_2 (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/waku/v2/protocol/waku_store/waku_store.nim:357
#18 0x000055af8aed5833 in futureContinue__XuNTB7fHwBI8KII0qEQaCw (fut=0x7f68912190c0) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#19 0x000055af8aeef0c2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 () at /app/vendor/nimbus-build-system/vendor/Nim/lib/pure/collections/deques.nim:390
#20 0x000055af8af01c1d in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1068
#21 0x000055af8b3561c1 in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1458
#22 0x000055af8b35f272 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:411
#23 0x000055af8ac96a6d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:418

The last query in the nodes logs was:

INF 2022-06-23 08:22:29.377+00:00 received query                             topics="wakustore" tid=1 file=waku_store.nim:353 rpc="(requestId: "48408cdd-a8fc-48c7-a29b-589f97ff0d17", query: (contentFilters: @[(contentTopic: "/toy-chat/2/huilong/proto")], pubsubTopic: "/waku/2/default-waku/proto", pagingInfo: (pageSize: 100, cursor: (digest: 0000000000000000000000000000000000000000000000000000000000000000, receiverTime: 0, senderTime: 0, pubsubTopic: ""), direction: BACKWARD), startTime: 1655965347185000000, endTime: 1655972547185000000), response: (messages: @[], pagingInfo: (pageSize: 0, cursor: (digest: 0000000000000000000000000000000000000000000000000000000000000000, receiverTime: 0, senderTime: 0, pubsubTopic: ""), direction: BACKWARD), error: NONE))"

which is known to be a slow query.

The text was updated successfully, but these errors were encountered:

jm-clius · 2022-06-23T09:10:58Z

cc @kaiserd

kaiserd · 2022-06-23T09:38:12Z

See #982 under efficiency issue.

Sparse filters are relatively slow. This combined with the fact that some operations are blocking (see todo list in #982) may lead to store nodes becoming unresponsive. I ran into this issue with toy chat and a large DB (1.6GB, >800k messages) that only contained a few toy chat messages.
Your observation shows this is a problem in practice. Thank you for compiling this report!
How many messages were in the DB and how large in terms of used storage was the DB?

I will be on it after making more progress on vacp2p/research#104

jm-clius · 2022-06-23T10:03:48Z

DB size is not huge:

sudo du -hsc /docker/nim-waku-v2/data/store.sqlite3
1.5G    /docker/nim-waku-v2/data/store.sqlite3
1.5G    total

sudo sqlite3 /docker/nim-waku-v2/data/store.sqlite3
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> select count(*) from Message;
246320

kaiserd · 2022-06-23T10:06:11Z

Thank you!

jm-clius · 2022-06-23T15:09:59Z

Possible partial fix merged in #1018. Thanks, @kaiserd. Deploying to test fleet.

There is a known issue in `v0.10` where certain store queries can slow down nodes: waku-org/nwaku#1017 A possible (partial?) fix has been merged and deployed to `wakuv2.test`: waku-org/nwaku#1018 But the SQLite store can be disable on `prod` in the meantime. Signed-off-by: Jakub Sokołowski <[email protected]>

jm-clius · 2022-08-26T14:45:27Z

@LNSD not sure how relevant this issue still is. Based on this comment: https://discord.com/channels/864066763682218004/1011953262586507354/1012036459097767946 I think there is still some investigation necessary, but likely unrelated to the issue that was reported (and addressed) here.

LNSD · 2022-09-06T14:52:38Z

Closing this issue as the changes merged as part of #1120 have improved the store query time significantly.

Otherwise nodes try to load all messages into memory at startup. And the issue that originally caused this to be disabled is closed: waku-org/nwaku#1017 waku-org/nwaku#1018 Signed-off-by: Jakub Sokołowski <[email protected]>

kaiserd mentioned this issue Jun 23, 2022

fix(store): limit dbPageSize #1018

Merged

jm-clius mentioned this issue Jun 23, 2022

chore: disable sqlite store on wakuv2.prod status-im/infra-nim-waku#53

Merged

jm-clius added the track:maintenance label Aug 26, 2022

LNSD self-assigned this Aug 31, 2022

jm-clius added this to Waku Sep 2, 2022

jm-clius moved this to In Progress in Waku Sep 2, 2022

jm-clius added this to the Release 0.12 milestone Sep 2, 2022

LNSD closed this as completed Sep 6, 2022

LNSD moved this from In Progress to Done in Waku Sep 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow store queries may block fleet nodes #1017

Slow store queries may block fleet nodes #1017

jm-clius commented Jun 23, 2022

jm-clius commented Jun 23, 2022

kaiserd commented Jun 23, 2022 •

edited

Loading

jm-clius commented Jun 23, 2022

kaiserd commented Jun 23, 2022

jm-clius commented Jun 23, 2022

jm-clius commented Aug 26, 2022

LNSD commented Sep 6, 2022 •

edited

Loading

Slow store queries may block fleet nodes #1017

Slow store queries may block fleet nodes #1017

Comments

jm-clius commented Jun 23, 2022

jm-clius commented Jun 23, 2022

kaiserd commented Jun 23, 2022 • edited Loading

jm-clius commented Jun 23, 2022

kaiserd commented Jun 23, 2022

jm-clius commented Jun 23, 2022

jm-clius commented Aug 26, 2022

LNSD commented Sep 6, 2022 • edited Loading

kaiserd commented Jun 23, 2022 •

edited

Loading

LNSD commented Sep 6, 2022 •

edited

Loading