Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow store queries may block fleet nodes #1017

Closed
jm-clius opened this issue Jun 23, 2022 · 7 comments
Closed

Slow store queries may block fleet nodes #1017

jm-clius opened this issue Jun 23, 2022 · 7 comments
Assignees

Comments

@jm-clius
Copy link
Contributor

With the SQLite-only store it seems as if nodes can become unresponsive when building responses to sparse queries.

This has been observed on the wakuv2.prod fleet for node-01.gc-us-central1-a.wakuv2.prod around 8 UTC on 2022-06-23, though the issue may have been present quite some time before that. New connection attempts to the node time out and consul reports metrics reporting failures.

Successive backtraces on the running node reveals that the node is stuck answering a store query:

(gdb) bt
#0  0x000055af8ad2ce93 in pcache1PinPage (pPage=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:50879
#1  pcache1FetchStage2 (pCache=0x7f6894694ef8, iKey=31504, createFlag=2) at /app/vendor/nim-sqlite3-abi/sqlite3.c:50879
#2  0x000055af8ad82ca9 in sqlite3PcacheFetch (createFlag=3, pgno=31504, pCache=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:57397
#3  getPageNormal (pPager=0x7f6894694718, pgno=31504, ppPage=0x7ffec9598190, flags=2) at /app/vendor/nim-sqlite3-abi/sqlite3.c:57397
#4  0x000055af8ad84637 in sqlite3PagerGet (flags=2, ppPage=0x7ffec9598190, pgno=<optimized out>, pPager=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:69670
#5  accessPayload (pCur=0x7f68941dc0d8, offset=0, amt=52524,
    pBuf=0x7f68924b818a "83031613061356561653536663432363633343030222c22307839303239666466616539613033313335383436333831633763653136353935633335353465313061223a2230783036313130613833623965636366313935383030222c223078376230653"..., eOp=0) at /app/vendor/nim-sqlite3-abi/sqlite3.c:4134
#6  0x000055af8ad8595a in sqlite3BtreePayload (pBuf=<optimized out>, amt=65182, offset=107, pCur=0x7f68941dc0d8) at /app/vendor/nim-sqlite3-abi/sqlite3.c:77264
#7  sqlite3VdbeMemFromBtree (pCur=0x7f68941dc0d8, offset=107, amt=65182, pMem=0x7f68941dd588) at /app/vendor/nim-sqlite3-abi/sqlite3.c:11728
#8  0x000055af8adcd79e in sqlite3VdbeExec (p=<optimized out>) at /app/vendor/nim-sqlite3-abi/sqlite3.c:88420
#9  0x000055af8add0b30 in sqlite3Step (p=0x0) at /app/vendor/nim-sqlite3-abi/sqlite3.c:83941
#10 sqlite3_step (pStmt=pStmt@entry=0x7f68941dd798) at /app/vendor/nim-sqlite3-abi/sqlite3.c:18462
#11 0x000055af8b2c0b61 in exec__mW8h6kAccHfls1l8PiW87A (s_0=s_0@entry=0x7f68941dd798, params=params@entry=0x7ffec95987f0, onData=..., Result=Result@entry=0x7ffec95987e0)
    at /app/waku/v2/node/storage/sqlite.nim:207
#12 0x000055af8b324fa2 in getPage__xsP8EcHIHYTHBPzOo1A8Aw (db=<optimized out>, pred=..., pagingInfo=<optimized out>, Result=<optimized out>) at /app/waku/v2/node/storage/message/waku_message_store.nim:383
#13 0x00000000006ea77c in ?? ()
#14 0x000055af8b2173c8 in findMessages__xInphcAVlliR33P39aBgAwA (w=<optimized out>, query=query@entry=0x7f6893e83088, Result=Result@entry=0x7f6893e83250) at /app/waku/v2/protocol/waku_store/waku_store.nim:333
#15 0x000055af8b21897a in handler__5HU2zQKkk6mkjZiQFIFJ3w_2 (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/waku/v2/protocol/waku_store/waku_store.nim:357
#16 0x000055af8aed5833 in futureContinue__XuNTB7fHwBI8KII0qEQaCw (fut=0x7f68912190c0) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#17 0x000055af8aeef0c2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 () at /app/vendor/nimbus-build-system/vendor/Nim/lib/pure/collections/deques.nim:390
#18 0x000055af8af01c1d in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1068
#19 0x000055af8b3561c1 in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1458
#20 0x000055af8b35f272 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:411
#21 0x000055af8ac96a6d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:418
(gdb) bt
#0  0x000055af8ae1c76d in split__NJo8pxZdXEAIa7wkHls9cOw_2 (t=0x7f68933bdac0) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:71
#1  add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f68933bdac0, key=<optimized out>, upperBound=<optimized out>) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:71
#2  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f68933bcdd8, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#3  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f689293a5e8, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#4  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f689251c770, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#5  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=<optimized out>, t=0x7f68929339c8, key=<optimized out>, upperBound=<optimized out>)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#6  0x000055af8ae1c7d9 in add__3D9aOyz4rDquPZKBlqn0xig (a=a@entry=0x7f68946770b8, t=t@entry=0x7f689467a160, key=key@entry=140087070900256, upperBound=upperBound@entry=140087070965606)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/avltree.nim:67
#7  0x000055af8ae1c8f3 in rawAlloc__mE4QEVyMvGRVliDWDngZCQ (a=a@entry=0x7f68946770b8, requestedSize=requestedSize@entry=65318) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/alloc.nim:808
#8  0x000055af8ae21f4c in rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ (typ=0x55af8b56f6c0 <NTI__6H5Oh5UUvVCLiakt9aTwtUQ_>, size=65302, gch=0x7f6894677050)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:799
#9  0x000055af8ae221da in newObj (typ=<optimized out>, size=65302) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:440
#10 0x000055af8ae243b0 in newSeq (typ=<optimized out>, len=len@entry=65286) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:448
#11 0x000055af8ae2964d in at___FKhCG9aXnVoTuyjERVXzd2g (
    a=a@entry=0x7f68924b5018 "{\"data\":\"7b2266656573223a7b22307863323133326430356433316339313461383763363631316331303734386165623034623538653866223a223078333432343966222c2230783237393162636131663264653436363165643838613330633939613"..., aLen_0=aLen_0@entry=65286) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system.nim:1544
#12 0x000055af8b323052 in msg__i3o9cXbEdaMzfyufaKvkWWg_3 (s_0=s_0@entry=0x7f68941dd798, ClE_0=ClE_0@entry=0x7f6893e48730) at /app/waku/v2/node/storage/message/waku_message_store.nim:311
#13 0x000055af8b2c0d5b in exec__mW8h6kAccHfls1l8PiW87A (s_0=s_0@entry=0x7f68941dd798, params=params@entry=0x7ffec95987f0, onData=..., Result=Result@entry=0x7ffec95987e0)
    at /app/waku/v2/node/storage/sqlite.nim:210
#14 0x000055af8b324fa2 in getPage__xsP8EcHIHYTHBPzOo1A8Aw (db=<optimized out>, pred=..., pagingInfo=<optimized out>, Result=<optimized out>) at /app/waku/v2/node/storage/message/waku_message_store.nim:383
#15 0x00000000006ea77c in ?? ()
#16 0x000055af8b2173c8 in findMessages__xInphcAVlliR33P39aBgAwA (w=<optimized out>, query=query@entry=0x7f6893e83088, Result=Result@entry=0x7f6893e83250) at /app/waku/v2/protocol/waku_store/waku_store.nim:333
#17 0x000055af8b21897a in handler__5HU2zQKkk6mkjZiQFIFJ3w_2 (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/waku/v2/protocol/waku_store/waku_store.nim:357
#18 0x000055af8aed5833 in futureContinue__XuNTB7fHwBI8KII0qEQaCw (fut=0x7f68912190c0) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#19 0x000055af8aeef0c2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 () at /app/vendor/nimbus-build-system/vendor/Nim/lib/pure/collections/deques.nim:390
#20 0x000055af8af01c1d in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1068
#21 0x000055af8b3561c1 in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1458
#22 0x000055af8b35f272 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:411
#23 0x000055af8ac96a6d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:418

The last query in the nodes logs was:

INF 2022-06-23 08:22:29.377+00:00 received query                             topics="wakustore" tid=1 file=waku_store.nim:353 rpc="(requestId: "48408cdd-a8fc-48c7-a29b-589f97ff0d17", query: (contentFilters: @[(contentTopic: "/toy-chat/2/huilong/proto")], pubsubTopic: "/waku/2/default-waku/proto", pagingInfo: (pageSize: 100, cursor: (digest: 0000000000000000000000000000000000000000000000000000000000000000, receiverTime: 0, senderTime: 0, pubsubTopic: ""), direction: BACKWARD), startTime: 1655965347185000000, endTime: 1655972547185000000), response: (messages: @[], pagingInfo: (pageSize: 0, cursor: (digest: 0000000000000000000000000000000000000000000000000000000000000000, receiverTime: 0, senderTime: 0, pubsubTopic: ""), direction: BACKWARD), error: NONE))"

which is known to be a slow query.

@jm-clius
Copy link
Contributor Author

cc @kaiserd

@kaiserd
Copy link
Contributor

kaiserd commented Jun 23, 2022

See #982 under efficiency issue.

Sparse filters are relatively slow. This combined with the fact that some operations are blocking (see todo list in #982) may lead to store nodes becoming unresponsive. I ran into this issue with toy chat and a large DB (1.6GB, >800k messages) that only contained a few toy chat messages.
Your observation shows this is a problem in practice. Thank you for compiling this report!
How many messages were in the DB and how large in terms of used storage was the DB?

I will be on it after making more progress on vacp2p/research#104

@jm-clius
Copy link
Contributor Author

DB size is not huge:

sudo du -hsc /docker/nim-waku-v2/data/store.sqlite3
1.5G    /docker/nim-waku-v2/data/store.sqlite3
1.5G    total
sudo sqlite3 /docker/nim-waku-v2/data/store.sqlite3
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> select count(*) from Message;
246320

@kaiserd
Copy link
Contributor

kaiserd commented Jun 23, 2022

Thank you!

@jm-clius
Copy link
Contributor Author

Possible partial fix merged in #1018. Thanks, @kaiserd. Deploying to test fleet.

jakubgs pushed a commit to status-im/infra-nim-waku that referenced this issue Jun 23, 2022
There is a known issue in `v0.10` where certain store queries
can slow down nodes: waku-org/nwaku#1017

A possible (partial?) fix has been merged and deployed to `wakuv2.test`:
waku-org/nwaku#1018
But the SQLite store can be disable on `prod` in the meantime.

Signed-off-by: Jakub Sokołowski <[email protected]>
@jm-clius
Copy link
Contributor Author

@LNSD not sure how relevant this issue still is. Based on this comment: https://discord.com/channels/864066763682218004/1011953262586507354/1012036459097767946 I think there is still some investigation necessary, but likely unrelated to the issue that was reported (and addressed) here.

@LNSD LNSD self-assigned this Aug 31, 2022
@jm-clius jm-clius added this to Waku Sep 2, 2022
@jm-clius jm-clius moved this to In Progress in Waku Sep 2, 2022
@jm-clius jm-clius added this to the Release 0.12 milestone Sep 2, 2022
@LNSD
Copy link
Contributor

LNSD commented Sep 6, 2022

Closing this issue as the changes merged as part of #1120 have improved the store query time significantly.

@LNSD LNSD closed this as completed Sep 6, 2022
@LNSD LNSD moved this from In Progress to Done in Waku Sep 6, 2022
jakubgs added a commit to status-im/infra-nim-waku that referenced this issue Oct 19, 2022
Otherwise nodes try to load all messages into memory at startup.

And the issue that originally caused this to be disabled is closed:
waku-org/nwaku#1017
waku-org/nwaku#1018

Signed-off-by: Jakub Sokołowski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

3 participants