
High IO Wait CPU usage #2729

Closed
hermanbanken opened this issue Mar 14, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@hermanbanken
Contributor

hermanbanken commented Mar 14, 2024

Describe the bug
Slightly related to #66 but different: we see that when a GKE node VM runs one or more DragonFlyDB containers, the CPU stats contain a very high "IO Wait", up to 100%. When all DragonFlyDB containers are stopped, this drops back to 0%. It seems like DragonFly consumes all of the remaining CPU as IO wait.

Our DataDog monitoring for the CPU usage of our nodes has been alerting for a few weeks now. We dismissed it because we saw no high CPU usage (excluding IO Wait) in any of our containers, but it makes the automatic monitoring useless, since we can no longer rely on the CPU usage metric being low. It also seems to be impossible to see IO wait per process in Linux.

We'd like to understand and prevent DragonFlyDB from causing 100% IO Wait. What is DragonFlyDB even using IO for if this is all stored in memory?

To Reproduce
Steps to reproduce the behavior:

  1. Run DragonFlyDB
  2. vmstat 1 reports values above 80 in the wa column.
  3. top reports values above 80 in the wa field.

Expected behavior
Low CPU usage on idle systems.

Screenshots
(screenshot from Mar 14, 2024 showing node CPU usage dominated by IO Wait)

Environment (please complete the following information):

  • OS: cos (v1.27.8-gke.1067004)
  • Kernel: Linux 5.15.133+ #1 SMP Sat Dec 30 13:01:38 UTC 2023 x86_64 GNU/Linux
  • Containerized?: Kubernetes
  • Dragonfly Version: docker.dragonflydb.io/dragonflydb/dragonfly:v1.14.5

Workaround
Adding --force-epoll seems to avoid the issue.

@hermanbanken hermanbanken added the bug Something isn't working label Mar 14, 2024
@romange
Collaborator

romange commented Mar 14, 2024

Duplicate of #2181, #2270, #2287 and #2444.

For the explanation, please see #2270 (comment)
Also, please see axboe/liburing#943, where the io_uring kernel maintainers discuss this issue.

Based on these discussions, I understand that high IOWAIT is not related to high CPU usage: it is part of the CPU idle time, i.e. the time a thread spends blocked on (networking and disk) I/O. So if you need to monitor for high CPU usage, you may consider ignoring the IOWAIT contribution.
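
For example, a minimal sketch (not something Dragonfly ships) that samples the aggregate cpu line of /proc/stat and computes a busy percentage in which both idle and iowait count as non-busy time:

#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>

struct CpuTimes {
  unsigned long long user = 0, nice = 0, system = 0, idle = 0, iowait = 0,
                     irq = 0, softirq = 0, steal = 0;
  unsigned long long Total() const {
    return user + nice + system + idle + iowait + irq + softirq + steal;
  }
  unsigned long long NotBusy() const { return idle + iowait; }  // treat iowait as idle
};

static CpuTimes ReadCpuTimes() {
  // The first line of /proc/stat is the aggregate "cpu" row.
  std::ifstream f("/proc/stat");
  std::string label;
  CpuTimes t;
  f >> label >> t.user >> t.nice >> t.system >> t.idle >> t.iowait >> t.irq >>
      t.softirq >> t.steal;
  return t;
}

int main() {
  CpuTimes a = ReadCpuTimes();
  std::this_thread::sleep_for(std::chrono::seconds(1));
  CpuTimes b = ReadCpuTimes();
  double total = double(b.Total() - a.Total());
  double busy = total - double(b.NotBusy() - a.NotBusy());
  std::printf("busy (excluding idle+iowait): %.1f%%\n", 100.0 * busy / total);
  return 0;
}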

@romange
Collaborator

romange commented Mar 14, 2024

And yes, --force_epoll avoids this issue because the semantics changed only for io_uring: time a thread spends blocked on networking I/O is now also counted as IOWAIT, whereas the epoll API preserves its previous behaviour.
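
If you want to see the accounting difference outside of Dragonfly, here is a minimal standalone sketch (assuming the liburing headers are installed; build with g++ demo.cc -luring -pthread). One thread blocks waiting for an io_uring completion and another blocks in epoll_wait; on affected kernels, vmstat 1 attributes only the io_uring wait to the wa column:

#include <liburing.h>
#include <sys/epoll.h>
#include <cstdio>
#include <thread>

int main() {
  // Thread 1: block forever waiting for a CQE that never arrives. On kernels
  // that account io_uring waits as iowait, this time shows up in the wa column.
  std::thread uring_waiter([] {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) != 0) {
      std::fprintf(stderr, "io_uring_queue_init failed\n");
      return;
    }
    struct io_uring_cqe* cqe = nullptr;
    io_uring_wait_cqe(&ring, &cqe);  // nothing was submitted, so this blocks
  });

  // Thread 2: block forever in epoll_wait with no registered fds. This time is
  // reported as plain idle, not iowait.
  std::thread epoll_waiter([] {
    int epfd = epoll_create1(0);
    struct epoll_event ev;
    epoll_wait(epfd, &ev, 1, -1);
  });

  uring_waiter.join();  // runs until the process is killed
  epoll_waiter.join();
  return 0;
}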

@hermanbanken
Contributor Author

hermanbanken commented Mar 14, 2024

Also related:

I understand that this is "working as intended" and we should not be worried. It is a bit unfortunate that this "newish" way of doing async IO is not well understood / worked out yet, and that monitoring tools assume that high IO Wait is bad.

I'm currently wading through info about this new topic and hoping to find how we can best resolve this. Any suggestions are welcome.

Small nit: the documentation doesn't really explain these tradeoffs, and the CLI docs only list which arguments exist, not how they affect DragonFlyDB. I found out about io_uring and more details via this code:

#if USE_URING
  // Built with io_uring support: use the io_uring-backed proactor pool.
  unique_ptr<util::ProactorPool> pp(fb2::Pool::IOUring(1024));
#else
  // Otherwise fall back to the epoll-backed proactor pool.
  unique_ptr<util::ProactorPool> pp(fb2::Pool::Epoll());
#endif
  pp->Run();

@romange
Collaborator

romange commented Mar 15, 2024

@hermanbanken, where did you search for the documentation?
https://dragonflydb.io/docs?

@hermanbanken
Contributor Author

I did search the documentation site:

@romange
Collaborator

romange commented Mar 15, 2024

And thanks for digging this up - it's a great discussion where rational logic tries to overcome tradition and, of course, loses. Jens' response summarises it all:

For sure, it's a stupid metric. But at the same time, educating people
on this can be like talking to a brick wall, and it'll be years of doing
that before we're making a dent in it. Hence I do think that just
exposing the knob and letting the storage side use it, if they want, is
the path of least resistance. I'm personally not going to do a crusade
on iowait to eliminate it, I don't have the time for that. I'll educate
people when it comes up, like I have been doing, but pulling this to
conclusion would be 10+ years easily.

Once a kernel is released with this commit, Dragonfly will automatically revert to its "normal" behavior on that version :)

@melroy89

melroy89 commented Mar 28, 2024

This is the second biggest issue I've had with Dragonfly. Sorry, I'm out now: MbinOrg/mbin#641

And I do believe high IO wait can have a performance impact via the CPU scheduler. Even if I'm wrong about that, I would still like to have low IOWait, so I know when there are problems with disk IO (which also causes high IOWait in some cases).
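
For what it's worth, one way to watch for disk problems directly, without relying on the host-wide IOWait number, is to track the per-device counters in /proc/diskstats. A minimal sketch (my own, unrelated to Dragonfly) that prints each device's "time spent doing I/O" counter, which can be sampled periodically and turned into a utilisation rate:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::ifstream f("/proc/diskstats");
  std::string line;
  while (std::getline(f, line)) {
    std::istringstream is(line);
    unsigned major = 0, minor = 0;
    std::string dev;
    unsigned long long v[11] = {};
    is >> major >> minor >> dev;
    for (auto& x : v) is >> x;
    // v[9] is field 13 of /proc/diskstats: milliseconds spent doing I/O
    // on this device since boot.
    std::cout << dev << " io_time_ms=" << v[9] << "\n";
  }
  return 0;
}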

@nantiferov

@romange I'm running Dragonfly 1.23.0 on Ubuntu 24.04 with kernel 6.8.0-1016-aws with default options.

According to the aforementioned links, this iowait behaviour should be fixed in Linux kernel 6.4.8. However, I observe the same situation, with all CPU cores "utilised" at 100% iowait. I understand from the discussion in this issue that it's not a problem, but it's strange that it still shows up on a 6.8.0 kernel. I'll try to check whether the Ubuntu maintainers are doing some patching for it, but in the meantime I have a question.

If I enable --force-epoll, will it bring any performance or other disadvantages?

@dragonflydb dragonflydb deleted a comment from melroy89 Oct 3, 2024
@romange
Collaborator

romange commented Oct 3, 2024

epoll is a legacy engine supported by Dragonfly, but it has some gaps:

  1. lack of SSD tiering support
  2. Dragonfly does not run memory defragmentation with epoll (something we need to implement but have not done yet)

@romange
Collaborator

romange commented Oct 3, 2024

I do not think it has been fixed - see axboe/liburing#943 (comment)

@nantiferov

nantiferov commented Oct 3, 2024

Cool, thank you for the details. I'll exclude iowait from our alerts, and then let's see whether liburing releases the fix (according to the last comment you mentioned, they're testing it).
