
High IO Wait CPU usage #2729

Closed
hermanbanken opened this issue Mar 14, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@hermanbanken
Contributor

hermanbanken commented Mar 14, 2024

Describe the bug
Slightly related to #66 but different: we see that when a GKE node VM runs one or more DragonFlyDB containers, the CPU stats contain a very high "IO Wait", up to 100%. When all DragonFlyDB containers are stopped, this drops back to 0%. It seems like DragonFly consumes all of the remaining CPU as IO wait.

Our DataDog monitoring for the CPU usage of our nodes has been alerting for a few weeks now. We dismissed it because we saw no high CPU usage (excluding IO Wait) in any of our containers, but it makes the automatic monitoring useless, since we can no longer rely on the CPU usage metric being low. It also seems to be impossible to see IO wait per process in Linux.

We'd like to understand and prevent DragonFlyDB from causing 100% IO Wait. What is DragonFlyDB even using IO for if this is all stored in memory?

To Reproduce
Steps to reproduce the behavior:

  1. Run DragonFlyDB
  2. vmstat 1 reports values above 80 in the wa column.
  3. top reports values above 80 in the wa field.

Expected behavior
Low CPU usage on idle systems.

Screenshots
(screenshot from Mar 14, 2024 showing node CPU usage dominated by IO Wait)

Environment (please complete the following information):

  • OS: cos (v1.27.8-gke.1067004)
  • Kernel: Linux 5.15.133+ #1 SMP Sat Dec 30 13:01:38 UTC 2023 x86_64 GNU/Linux
  • Containerized?: Kubernetes
  • Dragonfly Version: docker.dragonflydb.io/dragonflydb/dragonfly:v1.14.5

Workaround
Adding --force-epoll seems to avoid the issue.

@hermanbanken hermanbanken added the bug Something isn't working label Mar 14, 2024
@romange
Collaborator

romange commented Mar 14, 2024

Duplicate of #2181, #2270, #2287 and #2444.

For the explanation, please see #2270 (comment)
Also, please see axboe/liburing#943, where the io_uring kernel maintainers discuss this issue.

Based on these discussions, I understand that high IOWAIT is not related to high CPU usage: it is part of the CPU idle time, i.e. the time a thread spends blocked on (networking and disk) I/O. So if you need to monitor for high CPU usage, you may consider ignoring the IOWAIT contribution.
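
For example, a minimal sketch (not something Dragonfly ships) that samples the aggregate cpu line of /proc/stat and computes a busy percentage in which both idle and iowait count as non-busy time:

#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>

struct CpuTimes {
  unsigned long long user = 0, nice = 0, system = 0, idle = 0, iowait = 0,
                     irq = 0, softirq = 0, steal = 0;
  unsigned long long Total() const {
    return user + nice + system + idle + iowait + irq + softirq + steal;
  }
  unsigned long long NotBusy() const { return idle + iowait; }  // treat iowait as idle
};

static CpuTimes ReadCpuTimes() {
  // The first line of /proc/stat is the aggregate "cpu" row.
  std::ifstream f("/proc/stat");
  std::string label;
  CpuTimes t;
  f >> label >> t.user >> t.nice >> t.system >> t.idle >> t.iowait >> t.irq >>
      t.softirq >> t.steal;
  return t;
}

int main() {
  CpuTimes a = ReadCpuTimes();
  std::this_thread::sleep_for(std::chrono::seconds(1));
  CpuTimes b = ReadCpuTimes();
  double total = double(b.Total() - a.Total());
  double busy = total - double(b.NotBusy() - a.NotBusy());
  std::printf("busy (excluding idle+iowait): %.1f%%\n", 100.0 * busy / total);
  return 0;
}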

@romange
Collaborator

romange commented Mar 14, 2024

And yes, --force_epoll avoids this issue because the semantics changed only for io_uring: time a thread spends blocked on networking I/O is now also counted as IOWAIT, whereas the epoll API preserves its previous behaviour.
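
If you want to see the accounting difference outside of Dragonfly, here is a minimal standalone sketch (assuming the liburing headers are installed; build with g++ demo.cc -luring -pthread). One thread blocks waiting for an io_uring completion and another blocks in epoll_wait; on affected kernels, vmstat 1 attributes only the io_uring wait to the wa column:

#include <liburing.h>
#include <sys/epoll.h>
#include <cstdio>
#include <thread>

int main() {
  // Thread 1: block forever waiting for a CQE that never arrives. On kernels
  // that account io_uring waits as iowait, this time shows up in the wa column.
  std::thread uring_waiter([] {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) != 0) {
      std::fprintf(stderr, "io_uring_queue_init failed\n");
      return;
    }
    struct io_uring_cqe* cqe = nullptr;
    io_uring_wait_cqe(&ring, &cqe);  // nothing was submitted, so this blocks
  });

  // Thread 2: block forever in epoll_wait with no registered fds. This time is
  // reported as plain idle, not iowait.
  std::thread epoll_waiter([] {
    int epfd = epoll_create1(0);
    struct epoll_event ev;
    epoll_wait(epfd, &ev, 1, -1);
  });

  uring_waiter.join();  // runs until the process is killed
  epoll_waiter.join();
  return 0;
}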

@hermanbanken
Contributor Author

hermanbanken commented Mar 14, 2024

Also related:

I understand that this is "working as intended" and we should not be worried. It is a bit unfortunate that this "newish" way of doing async IO is not well understood / worked out yet, and that monitoring tools assume that high IO Wait is bad.

I'm currently wading through info about this new topic and hoping to find how we can best resolve this. Any suggestions are welcome.

Small nit: the documentation doesn't really explain these tradeoffs, and the CLI docs only list which arguments exist, not how they affect DragonFlyDB. I found out about io_uring and more details via this code:

#if USE_URING
  // Built with io_uring support: use the io_uring-backed proactor pool.
  unique_ptr<util::ProactorPool> pp(fb2::Pool::IOUring(1024));
#else
  // Otherwise fall back to the epoll-backed proactor pool.
  unique_ptr<util::ProactorPool> pp(fb2::Pool::Epoll());
#endif
  pp->Run();

@romange
Collaborator

romange commented Mar 15, 2024

@hermanbanken, where did you search for the documentation?
https://dragonflydb.io/docs?

@hermanbanken
Contributor Author

I did search the documentation site:

@romange
Collaborator

romange commented Mar 15, 2024

And thanks for digging this up - it's a great discussion where rational logic tries to overcome tradition and, of course, loses. Jens' response summarises it all:

For sure, it's a stupid metric. But at the same time, educating people
on this can be like talking to a brick wall, and it'll be years of doing
that before we're making a dent in it. Hence I do think that just
exposing the knob and letting the storage side use it, if they want, is
the path of least resistance. I'm personally not going to do a crusade
on iowait to eliminate it, I don't have the time for that. I'll educate
people when it comes up, like I have been doing, but pulling this to
conclusion would be 10+ years easily.

Once a kernel is released with this commit, Dragonfly will automatically revert to its "normal" behavior on that version :)

@melroy89

melroy89 commented Mar 28, 2024

This is the second biggest issue I've had with Dragonfly. Sorry, I'm out now: MbinOrg/mbin#641

And I do believe high IO wait can have a performance impact via the CPU scheduler. Even if I'm wrong about that, I would still like to have low IOWait, so I know when there are problems with disk IO (which also causes high IOWait in some cases).
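
For what it's worth, one way to watch for disk problems directly, without relying on the host-wide IOWait number, is to track the per-device counters in /proc/diskstats. A minimal sketch (my own, unrelated to Dragonfly) that prints each device's "time spent doing I/O" counter, which can be sampled periodically and turned into a utilisation rate:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::ifstream f("/proc/diskstats");
  std::string line;
  while (std::getline(f, line)) {
    std::istringstream is(line);
    unsigned major = 0, minor = 0;
    std::string dev;
    unsigned long long v[11] = {};
    is >> major >> minor >> dev;
    for (auto& x : v) is >> x;
    // v[9] is field 13 of /proc/diskstats: milliseconds spent doing I/O
    // on this device since boot.
    std::cout << dev << " io_time_ms=" << v[9] << "\n";
  }
  return 0;
}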

@nantiferov

@romange I'm running Dragonfly 1.23.0 on Ubuntu 24.04 with kernel 6.8.0-1016-aws with default options.

According to the aforementioned links, this iowait behaviour should be fixed in Linux kernel 6.4.8. However, I observe the same situation, with all CPU cores "utilised" at 100% iowait. I understand from the discussion in this issue that it's not a problem, but it's strange that it still shows up on a 6.8.0 kernel. I'll try to check whether the Ubuntu maintainers are doing some patching for it, but in the meantime I have a question.

If I enable --force-epoll, will it bring any performance or other disadvantages?

@dragonflydb dragonflydb deleted a comment from melroy89 Oct 3, 2024
@romange
Collaborator

romange commented Oct 3, 2024

epoll is a legacy engine supported by Dragonfly, but it has some gaps:

  1. lack of SSD tiering support
  2. Dragonfly does not run memory defragmentation with epoll (something we need to implement but have not done yet)

@romange
Collaborator

romange commented Oct 3, 2024

I do not think it has been fixed - see axboe/liburing#943 (comment)

@nantiferov

nantiferov commented Oct 3, 2024

Cool, thank you for the details. I'll exclude iowait from our alerts, and then let's see whether liburing releases the fix (according to the last comment you mentioned, they're testing it).
