
Investigate higher thread usage since Polkadot 0.9.36 #12988

Closed · bkchr opened this issue Dec 21, 2022 · 12 comments
Labels: I8-footprint (An enhancement to provide a smaller (system load, memory, network or disk) footprint.)

bkchr (Member) commented Dec 21, 2022

People report that since switching from Polkadot 0.9.33 (the last node release) to 0.9.36 they see much higher thread usage, while CPU usage didn't increase.

I think this may be related to libp2p switching to tokio. If that is confirmed, this issue can be closed. In general, this is just an investigation to find the underlying change responsible for this behavior.
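
For context: a tokio multi-thread runtime keeps a fixed pool of worker threads plus an on-demand pool of blocking threads that mostly sleep when idle, which is one way a switch to tokio can raise the visible thread count without raising CPU usage. Below is a minimal, illustrative sketch of the standard tokio builder knobs involved; it is not the actual Substrate or libp2p configuration, and the numbers are made up.

    use tokio::runtime::Builder;

    fn main() {
        // Illustration only (assumes the `tokio` crate with the "full" feature):
        // a multi-thread runtime owns `worker_threads` executor threads and lazily
        // spawns up to `max_blocking_threads` extra threads for blocking work.
        // Idle threads sleep, so more threads does not imply more CPU usage.
        let runtime = Builder::new_multi_thread()
            .worker_threads(4)
            .max_blocking_threads(16)
            .thread_name("net-worker")
            .enable_all()
            .build()
            .expect("failed to build tokio runtime");

        runtime.block_on(async {
            // Blocking work goes onto the blocking pool, which may spawn a new
            // thread that later just sits there sleeping until it is reused.
            let answer = tokio::task::spawn_blocking(|| 40 + 2).await.unwrap();
            println!("answer: {answer}");
        });
    }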

bkchr added the I8-footprint label on Dec 21, 2022
Nexus2k commented Dec 21, 2022

I'm the reporter of the issue; please find the two graphs attached:
[graph: system thread count]
CPU load itself is roughly the same:
[graph: CPU load]

Thread usage seems to spike by about 1k threads when the polkadot node is in the active set and producing blocks.

dmitry-markin (Contributor) commented Dec 21, 2022

@Nexus2k Thanks for the graphs! Sorry for the dumb question, but what tool do you use to collect the statistics?

Do you have the same thread count graph for v0.9.33 to better understand the difference?

bkchr (Member, Author) commented Dec 21, 2022

@dmitry-markin that should be Grafana.

Look at Tuesday on the graph, when it started to grow. We released on Tuesday, so that should be the point when he upgraded.

Nexus2k commented Dec 21, 2022

@dmitry-markin I use https://checkmk.com/ to monitor the few dozen servers I run for Substrate blockchains.

Here's the graph for the last 8 days; to my knowledge, the only thing that changed was the polkadot release upgrade:
[graph: system thread count over the last 8 days]

The node that produces this graph is more or less permanently active in Kusama and became active in Polkadot on Tuesday around 16:30 CET. The release upgrade happened at 14:30 CET, because the node upgrades automatically when a new Docker image is published.

dmitry-markin (Contributor) commented Dec 21, 2022

> The node that produces this graph is somewhat permanently active in Kusama and became active in Polkadot ...

Do you mean you run two nodes on the same server, and the number of threads is the total number of threads in the system?

I don't have access to any validator nodes, but would expect the number of polkadot threads to be on the order of 50-100 (I just measured 42 during syncing on v0.9.36 and 54 on v0.9.33).

Nexus2k commented Dec 21, 2022

Correct, it's a rather beefy machine and can handle multiple chains in parallel. And yes, it's the overall system thread count, which was rather flat before the upgrade, as you can see. I've rebooted the system for good measure, just to make sure it's not a problem with any other component.

Nexus2k commented Dec 21, 2022

Here's a thread graph of another system that also shows anomalies after the upgrade:
[graph: system thread count]

That one was not active during the upgrade but became active in Kusama at 19:28 CET.

dmitry-markin (Contributor) commented Dec 21, 2022

@Nexus2k Could you get the number of polkadot threads during the anomaly, with something like ps -e -O nlwp | grep polkadot? (nlwp is the number of threads per process.)

Nexus2k commented Dec 21, 2022

# ps -e -O nlwp | grep polkadot
   3442   84 S ?        01:36:12 /usr/bin/polkadot --name=🍁 HIGH/STAKE 🥩 | HEL1-KSM -lsync=warn,afg=warn,babe=warn --chain=kusama --validator --rpc-methods=Unsafe --rpc-external --rpc-cors=all --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<REDACTED>/tcp/30334 --prometheus-external --no-mdns --pruning=1000 --telemetry-url=wss://telemetry.polkadot.io/submit/ 1 --telemetry-url=wss://telemetry-backend.w3f.community/submit/ 1
   3767   85 S ?        01:18:52 /usr/bin/polkadot --name=🍁 HIGH/STAKE 🥩 | HEL1-DOT -lsync=warn,afg=warn,babe=warn --chain=polkadot --validator --rpc-methods=Unsafe --rpc-external --rpc-cors=all --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<REDACTED>/tcp/30333 --prometheus-external --no-mdns --pruning=1000 --telemetry-url=wss://telemetry.polkadot.io/submit/ 1 --telemetry-url=wss://telemetry-backend.w3f.community/submit/ 1
   4326   44 S ?        00:02:11 /usr/bin/polkadot prepare-worker /tmp/pvf-host-prepareK3WuNH4EJm
   4351  362 S ?        00:00:29 /usr/bin/polkadot execute-worker /tmp/pvf-host-executeBFOEmXjdbI
   4352  562 S ?        00:02:26 /usr/bin/polkadot execute-worker /tmp/pvf-host-executetYaDNMqDdJ

dmitry-markin (Contributor) commented Dec 21, 2022

It looks like the high thread count is caused not by the networking code but by the PVF subsystem, which I'm unfamiliar with. @bkchr, do you know who would be best to assign this issue to?

bkchr (Member, Author) commented Dec 21, 2022

Ah, it just aggregates the threads of all running polkadot processes.

There is currently some rework going on around the PVF worker which should solve this. I'm going to close this issue, as it is nothing problematic.

However, thank you @dmitry-markin for looking into it so quickly, and thank you @Nexus2k for your help here :)

Pinging @m-cat as you are the one currently rewriting the PVF worker. As I said above, I don't think we need to take any special action here.

bkchr closed this as completed on Dec 21, 2022
mrcnski commented Dec 21, 2022

Thanks for the ping, @bkchr. I wasn't aware that the thread count was so high, so it's good that we will be closing these threads properly in paritytech/polkadot#6419.

For background, most of these threads are sleeping, which is why the CPU usage doesn't change. We don't currently kill them, because killing threads is unsafe. The PR above instead changes things so that we signal each thread to finish on its own when the PVF job finishes.
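
For readers unfamiliar with the pattern being referenced: killing a thread outright is unsafe because it can leave locks and shared state in an inconsistent condition, so the alternative is to hand the thread a signal that it checks (or waits on), let it exit on its own, and then join it. The sketch below shows that general cooperative-shutdown idea in plain Rust with an atomic flag; it is only an illustration of the concept, not the actual PVF worker code from paritytech/polkadot#6419, which uses its own job and worker plumbing.

    use std::sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    };
    use std::thread;
    use std::time::Duration;

    fn main() {
        // Shared flag that tells the worker to finish instead of being killed.
        let finish = Arc::new(AtomicBool::new(false));
        let finish_for_worker = Arc::clone(&finish);

        let worker = thread::spawn(move || {
            // The worker checks the flag between slices of work.
            while !finish_for_worker.load(Ordering::Relaxed) {
                // ... do one slice of the job here ...
                thread::sleep(Duration::from_millis(10));
            }
            // Destructors and clean-up run normally because the thread
            // returns on its own rather than being terminated externally.
        });

        // Once the job is done, signal the worker and wait for it to exit,
        // so idle threads are reaped instead of accumulating as sleepers.
        finish.store(true, Ordering::Relaxed);
        worker.join().expect("worker thread panicked");
    }

The point is only that the thread is asked to finish and then joined once its job is done, which is why the fix reduces the sleeping-thread count without changing CPU usage.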
