weride iouring writer #17

Open · wants to merge 9 commits into master

Conversation

ChiZhang0907

Hi Marcin,

Hope your work goes well.

This is our first version of the io_uring writer. Files related to io_uring are in the /riegeli/iouring directory. This version is a first attempt and there is some room for improvement. I have some questions I would like to discuss with you. Thank you very much!

In this version, I created a separate writer named fd_io_uring_writer to manage the io_uring instance. Each fd_io_uring_writer object has its own separate io_uring instance, and the user chooses manually whether to use io_uring. But the feature I ultimately want is merging io_uring into fd_writer: the system would check the environment automatically, use io_uring if it is available, and otherwise fall back to the old path. The difficulty is that the byte writer is a template parameter of the record writer, so I am not sure how to organize the structure here.

Besides, the io_uring instance has two modes, sync and async. In async mode, the write function returns immediately and reports errors later, so we need to keep the write buffer alive until the result comes back. For now I copy the buffer one more time to avoid changing the structure of your buffer class, but this overhead can be eliminated. Maybe we can write a new buffer class to meet this goal?

Thank you for your patience. If you have any suggestions about the code or the questions above, please feel free to reach out to me.

@QrczakMK
Member

QrczakMK commented Aug 9, 2021

Thank you for your work.

Some questions

While the benefit of an asynchronous FdIoUringWriter is clear (after submitting a write request, the main code can prepare further data while the previous write proceeds in the background), what is the advantage of a synchronous FdIoUringWriter over the plain FdWriter?

How does the caller decide whether set_fd_register() should be used? Maybe this should be decided automatically by FdIoUringWriter. FdIoUringWriter is associated with a fixed file, until it is reassigned or Reset(). If IOSQE_FIXED_FILE is helpful for write(), then perhaps it should be used unconditionally.

How does the caller decide about set_size()? AFAIK in the earliest kernels requests could be dropped if they exceeded the queue size, but in later kernels there is some fallback mechanism for keeping them. If the caller should decide this, the caller should have hints about how this affects the behavior.

What happens if the calling code requests writing data at a higher pace than the fd allows? There should probably be some throttling mechanism. Maybe it should be based on the amount of memory kept for the asynchronous buffers, to avoid allocating an unbounded amount of memory in such a case.

Choosing a Writer at runtime

In the internal Google repository, there is some layering of the source code, which constrains which libraries may depend on which libraries, to prevent core libraries from pulling in unwanted dependencies by accident.

riegeli/bytes belongs to a component which may depend on a certain set of libraries, and liburing is currently not among them. This means that — at least in the internal Google version — FdIoUringWriter cannot be put in riegeli/bytes; it might be put in a subdirectory, or in a separate directory. This also means that currently FdWriter cannot depend on liburing. Eventually liburing can be added to core libraries if it is deemed important enough and our deciders agree to this.

It should not be necessary to merge FdIoUringWriter to FdWriter though. There are several approaches:

  1. Use RecordWriter<std::unique_ptr<Writer>>. We can have a factory function which returns the appropriate std::unique_ptr<Writer>.

  2. Make a class deriving from WrappedWriter<std::unique_ptr<Writer>>. This allows adding convenience functions like filename() and fd(). If this approach is used, there might be a base class deriving from WrappedWriter<std::unique_ptr<Writer>>, possibly implemented in terms of FdWriter<UnownedFd> and FdIoUringWriter<UnownedFd>, plus a template which implements fd ownership and thus overrides certain operations like Done() and FlushImpl(). This approach has not been used yet, but I believe it should work.

  3. Make a class deriving from WrappedWriter<Writer*> which holds absl::variant<FdWriter<UnownedFd>, FdIoUringWriter<UnownedFd>>. This makes it easier to implement e.g. filename() without duplicating storage.

  4. Let FdIoUringWriter use a fallback implementation if io_uring is unavailable in the kernel, keeping FdWriter unchanged.

I am not sure whether these combinations are needed at all. Maybe io_uring is established enough that it can be used unconditionally by applications which care about this aspect of performance; they will just require the appropriate kernel. Is it going to be used in contexts where it is uncertain whether the kernel will support that?

If so, approach (1) is the simplest, and I would recommend it unless it is insufficient for some reason.
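
A minimal sketch of approach (1), assuming the FdIoUringWriter class and header path from this pull request and a hypothetical IsIoUringSupported() probe; the FdWriter constructor arguments are simplified for illustration:

```cpp
#include <memory>
#include <string>

#include "riegeli/bytes/fd_writer.h"
#include "riegeli/bytes/writer.h"
#include "riegeli/iouring/fd_io_uring_writer.h"  // Assumed header from this pull request.
#include "riegeli/records/record_writer.h"

bool IsIoUringSupported();  // Hypothetical runtime probe.

// Factory function returning the appropriate byte writer.
std::unique_ptr<riegeli::Writer> MakeByteWriter(const std::string& filename) {
  if (IsIoUringSupported()) {
    return std::make_unique<riegeli::FdIoUringWriter<>>(filename);
  }
  return std::make_unique<riegeli::FdWriter<>>(filename);
}

int main() {
  riegeli::RecordWriter<std::unique_ptr<riegeli::Writer>> record_writer(
      MakeByteWriter("/tmp/example.riegeli"));
  record_writer.WriteRecord("hello");
  return record_writer.Close() ? 0 : 1;
}
```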

Avoiding copying the buffer

This can be avoided by implementing a custom detachable buffer instead of relying on BufferedWriter.

BufferedWriter is meant for cases where the core functionality is available as a write()-like function which does not take ownership of the buffer. In particular if the caller calls Write() with a sufficiently long string_view, then WriteInternal() will be called directly with that string_view, bypassing buffering. In this case the copy for asynchronous mode is unavoidable.

But if the caller calls Write() with many short string_views, the writes are stored in the internal buffer, which can be moved along with the asynchronous operation without copying. The buffer might be implemented as Buffer like in BufferedWriter. There are several modes of writing:

  • Writing of buffered contents: attach the movable Buffer to the asynchronous operation; a new internal buffer will be allocated later if needed (see the sketch after this list).

  • Writing of string_view: in contrast to BufferedWriter, do not override WriteSlow(string_view); let the default implementation in Writer implement that in terms of Push(), so that it goes through the buffer. It will be written in chunks, to avoid allocating a huge amount of memory for a temporary buffer.

  • Writing of Chain or Cord: they can be moved to the asynchronous operation without copying, hence the asynchronous operation should not work in terms of just Buffer, but also Chain or Cord. One difference is that Buffer is flat, while Chain and Cord consist of multiple fragments. I am not sure whether it is better to utilize writev() for writing multiple fragments together, or to merge smaller fragments into separately allocated buffers. Recently I tried employing writev() in FdWriter, and the results were mixed: it was a win only for intermediate fragment sizes, while for smaller fragment sizes it was worse than the current behavior of copying data into 64KB (by default) arrays and writing them using write(). There is no conclusion yet about how writev() should be employed, if at all, for writing a Chain or Cord, or the current buffer together with a string_view.
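
For the first mode (writing of buffered contents), here is a rough illustration of moving the filled buffer into the pending asynchronous operation instead of copying it; Buffer and AsyncWriteOp are illustrative stand-ins, not existing Riegeli classes:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Buffer {
  std::vector<char> data;  // Stand-in for a flat, movable buffer.
};

struct AsyncWriteOp {
  Buffer buffer;       // Keeps the bytes alive until the completion is reaped.
  size_t length = 0;   // Number of valid bytes at the front of the buffer.
};

class AsyncBufferedWriter {
 public:
  // Hands the current buffer over to an asynchronous write without copying.
  void FlushBuffer() {
    if (written_ == 0) return;
    pending_.push_back(AsyncWriteOp{std::move(buffer_), written_});
    // Submit pending_.back() to io_uring here (omitted).
    buffer_ = Buffer{};  // A fresh buffer is grown lazily by the next Push().
    written_ = 0;
  }

 private:
  Buffer buffer_;                      // Current write buffer.
  size_t written_ = 0;                 // Bytes filled in buffer_.
  std::vector<AsyncWriteOp> pending_;  // Released when completions are reaped.
};
```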

Tests

Riegeli does have tests internally. They are not open sourced (except for the Python ones) because I did not want to spend time adapting between the internal and external versions of gtest. I am sorry about this; this might change.

I trust that if the tests pass internally, and the code builds in open source, then it works in open source. In addition to that, riegeli/records/tools:records_benchmark serves as an end-to-end test.

When FdIoUringWriter is implemented in the internal version, it will undergo various generic Writer tests.

Miscellaneous

FlushImpl() should wait until all data are written and flushed, otherwise it provides no guarantees for the caller.

For technical reasons Riegeli cannot use std::mutex but can use absl::Mutex.

Who does the work

It is clear that a lot of work remains to be done.

There are questions to be answered, in particular those I formulated at the beginning, and whether a runtime switch is needed, and if so, what it should look like.

I think that once we agree on what this should look like, I will implement it myself based on your work. Some of the aspects are tricky, and I do not want you to waste time on working on something that I would rewrite anyway.

It should be even more important to support asynchronous reading.

During the next 2 weeks I am partially on vacation.

@ChiZhang0907
Author

Hi Marcin,

Thank you for your patience. I read your comments carefully and thought about the details you pointed out. Though I have not figured out the whole structure yet, I think it is worth replying and discussing with you.

Some questions

The synchronous FdIoUringWriter also has some benefit. Even in synchronous mode, io_uring has less overhead than normal IO (write/pwrite). Due to the design of io_uring, the system can save some of the overhead of memory copies and system calls. In the write function, the kernel needs to copy the memory to complete the operation: for example, if you want to write 100 MB to a file, the kernel will copy that 100 MB first. In contrast, io_uring can bypass this copy. In my tests in other projects with io_uring, io_uring decreased CPU utilization and system time by about 30% compared with normal IO under the same conditions.

Of course, asynchronous mode has higher throughput than synchronous mode. But in asynchronous mode we need a reaping thread to process the completion queue, which is extra CPU overhead, so there is a tradeoff: if you need higher throughput, choose asynchronous mode; if you want to minimize CPU overhead, synchronous mode is better. But no matter which mode you choose, the speed and overhead of io_uring will be better than normal IO.

As for registering the fd: it is a mechanism in io_uring. io_uring allows us to register some fds in advance, so that the kernel can skip the step of copying the fd reference when doing an IO operation. The effect of this is obvious for high-IOPS workloads. I exposed this option because I wanted to write a more general class for io_uring. But your point is correct: since the instance is associated with a fixed file, we can set it unconditionally.

The caller can choose the size from 1 to 4096; any power of 2 in that range is valid. This number is the size of the submission queue. In some extreme situations, when the submission queue is full, you cannot get a new sq element from io_uring_get_sqe.
Actually, this number is the size of the buffer. The reason I provide an API to change the size is that I want the user to be able to make their own decision. The larger the size, the higher the IOPS the writer can handle, of course. But the kernel resources for io_uring as a whole are limited (though I don't know the exact limit; maybe I could write an email to ask Axboe), so it is possible that a new io_uring cannot be created if you have many io_uring instances with large sizes. (This could happen when you set parallelism.) This situation is very rare; a large number normally works well.

As I mentioned, when the submission queue is full, io_uring_get_sqe returns nullptr. This happens when the caller issues requests much faster than the kernel can handle, so submission cannot keep up. For example, you want to submit 10 sqes, but the return value of io_uring_submit is only 4 because the underlying layer has no more space for the 6 extra elements; those 6 sqes remain in the submission queue. It is necessary for us to design a throttling mechanism. There are two simple options: the first is that we just drop the operation, and the second is that we busy-loop, submitting continuously whenever io_uring_get_sqe returns nullptr, until we get enough space. In the second option, the writer is blocked for a while, which controls the throughput. A sketch of the second option follows below.
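
For reference, a minimal liburing sketch of the second option: when io_uring_get_sqe() returns nullptr because the submission queue is full, flush the queue with io_uring_submit() and reap one completion before retrying. The queue size of 64 and the single registered fd (as discussed above) are illustrative:

```cpp
#include <liburing.h>

bool SetUp(struct io_uring* ring, int fd) {
  if (io_uring_queue_init(/*entries=*/64, ring, /*flags=*/0) < 0) return false;
  // Register the fd up front so the kernel skips the per-request fd lookup.
  return io_uring_register_files(ring, &fd, 1) == 0;
}

// data must stay alive until the corresponding completion is reaped.
bool QueueWrite(struct io_uring* ring, const void* data, unsigned length,
                __u64 offset) {
  struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
  while (sqe == nullptr) {
    // Submission queue is full: push the queued sqes to the kernel and wait
    // for one completion before trying again.
    io_uring_submit(ring);
    struct io_uring_cqe* cqe;
    if (io_uring_wait_cqe(ring, &cqe) < 0) return false;
    io_uring_cqe_seen(ring, cqe);
    sqe = io_uring_get_sqe(ring);
  }
  // With IOSQE_FIXED_FILE, the fd argument is the index into the table
  // registered by io_uring_register_files(); here that index is 0.
  io_uring_prep_write(sqe, /*fd=*/0, data, length, offset);
  sqe->flags |= IOSQE_FIXED_FILE;
  return io_uring_submit(ring) >= 0;
}
```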

Choosing a writer at runtime

Yes, I think io_uring is a very well-developed feature after kernel 5.10. I also think option 1 is simple and acceptable.

Avoiding copying the buffer

I will need more time to read the code and think about this part; I will reply to you later.

Miscellaneous

Thank you. I will fix these two parts. But could you please explain more about the technical reasons? Why is std::mutex inappropriate here?

Besides, what do you mean by “a runtime switch”? If it refers to checking the environment and deciding automatically which IO to use, I think it is a useful and necessary feature.

@QrczakMK
Member

I wonder why the kernel copies the data first in write().

Threading

std::mutex is inappropriate for the Google internal version because multithreading as used in Google requires special synchronization primitives, and even requires annotating blocking calls. Hence this is what Riegeli uses:

  • For a mutex: absl::Mutex, which has the appropriate variants for the Google internal version and for open source. It also serves as a replacement for condition variables with LockWhen() (a small sketch follows this list).
  • For futures — in open source std::future, std::promise etc., but in the Google internal version they have an emulation of the relevant subset of the API using absl::Mutex. During open sourcing the emulation is removed and riegeli::internal::future is replaced with std::future etc.
  • For starting threads — riegeli/base/parallelism.h, which uses std::thread in open source and an appropriate emulation in the Google internal version. The API to use in the rest of Riegeli is ThreadPool::global().Schedule(…). An example is in riegeli/records/record_writer.cc.
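
A small sketch of absl::Mutex used both as a mutex and as a condition-variable replacement via LockWhen(), as described in the first bullet; the class is illustrative, not existing Riegeli code:

```cpp
#include "absl/synchronization/mutex.h"

class PendingWrites {
 public:
  // Called by the reaping thread when the last asynchronous write completes.
  void MarkDone() {
    absl::MutexLock lock(&mutex_);
    done_ = true;
  }

  // Blocks until MarkDone() has been called. No explicit condition variable:
  // LockWhen() acquires the mutex once the predicate becomes true.
  void WaitUntilDone() {
    mutex_.LockWhen(absl::Condition(&done_));
    mutex_.Unlock();
  }

 private:
  absl::Mutex mutex_;
  bool done_ = false;
};
```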

Choosing the Writer at runtime

By a runtime switch in this context I meant the case when the caller does not know whether they need FdWriter or FdIoUringWriter. I can imagine the following cases:

  1. The caller does not care that much about the last bit of performance, and uses FdWriter for portability.
  2. The caller cares about performance, makes sure to have a recent enough Linux, and uses FdIoUringWriter unconditionally.
  3. The caller wants to use FdIoUringWriter when it is available, but to fall back to FdWriter when it is not available.

Cases 1 and 2 are easy. The question is whether case 3 is important.

Actually case 3 is split into two:
3a. The caller wants to use FdIoUringWriter on platforms where it is available, but it would not even compile on other platforms which should nevertheless be supported, so FdWriter should be used as a fallback there.
3b. The caller will build on Linux, so FdIoUringWriter will compile, but the kernel might not support io_uring, and such kernels should nevertheless be supported.

The question is whether case 3a is important, or 3b, or both. Solutions for 3a and 3b are different.

If 3b is important, then perhaps it is best if FdIoUringWriter does autodetection and includes a fallback implementation which works similarly to FdWriter.

If 3a is important, then I am not sure how to build this, because I am not familiar with conditional dependencies in bazel (and also with differences between conditional compilation in the Google internal version and in open source bazel), but it is definitely doable.

Even if cases 3a and/or 3b are sometimes needed, Riegeli does not have to immediately implement them. They might be implemented by the caller, by instantiating std::unique_ptr<Writer> with the appropriate class, after checking somehow whether io_uring is supported. But if these cases are needed often, perhaps Riegeli should support them directly.
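
One hedged way for the caller to check whether the running kernel supports io_uring is to try setting up a small ring: io_uring_queue_init() fails (typically with -ENOSYS) on kernels without io_uring support. The function name and queue size below are illustrative:

```cpp
#include <liburing.h>

// Returns true if the running kernel accepts io_uring setup.
bool IoUringAvailable() {
  struct io_uring ring;
  if (io_uring_queue_init(/*entries=*/4, &ring, /*flags=*/0) < 0) {
    return false;  // Typically -ENOSYS on kernels without io_uring.
  }
  io_uring_queue_exit(&ring);
  return true;
}
```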

@yunhuali

yunhuali commented Aug 20, 2021

Hi @QrczakMK

Regarding "I wonder why the kernel copies the data first in write()":

The kernel will do the copy:
https://elixir.bootlin.com/linux/latest/source/mm/filemap.c#L3668
write -> sys_write -> vfs_write -> new_sync_write -> ext4_file_write_iter -> __generic_file_write_iter -> generic_perform_write

  1. Kernel low-level code (such as DMA) can't access user addresses directly.
  2. Block device cache.
There may also be some other historical reasons. That's one of the reasons "kernel bypass" technologies such as DPDK/SPDK are popular these days.

@ChiZhang0907
Author

Yunhua will give you a specific explanation of the write function. By the way, do you have any opinions about the throttling mechanism in my last comment?

Avoiding copying the buffer

Your suggestion is good. We can implement a new custom buffer which only frees the space after the asynchronous write finishes. The string_view should be copied to the buffer unconditionally, regardless of its size. In this case, the extra copy can be avoided.

As for Cord or Chain, io_uring has an API for writev, which can provide support if needed.

Choosing the Writer at runtime

I think 3b is important. The user should know that io_uring is a Linux feature, so we don't need to consider the situation where the user wants to use it on other platforms. What we should do is guarantee that the system falls back to normal IO if the kernel does not support io_uring.

I think the structure you mentioned in your earlier comment (RecordWriter<std::unique_ptr<Writer>>) is appropriate.

@QrczakMK
Member

Copying by the kernel

If io_uring can avoid the copy, why can’t the same mechanism be applied to write()? If there are technical reasons why write() must do the copy, why don’t they apply to io_uring?

One of the reasons I asked this is that if only asynchronous mode is important, then I wanted to name this class AsyncFdWriter, to concentrate on the functionality rather than the internal mechanism. If the reaping thread is launched using a thread pool (e.g. riegeli/base/parallelism.h), maybe the overhead of using this mode unconditionally is acceptable.

Throttling

I think there should be a limit on the amount of data being buffered for asynchronous writing. But at least in some circumstances a reasonable limit seems to be 2 * buffer_size, i.e. the current buffer plus one similar buffer being processed in the background. If the client is writing data in small chunks, at a constant pace, and it wrote more than the full buffer while the previous buffer was being written in the background, then this means that the client wants to write faster than low-level writes can proceed. Throttling is inevitable, and we might as well do it early, to avoid using lots of memory for the queue.

These circumstances do not necessarily apply all the time. Maybe the client wants to write at a fast pace now, and will rest later. In this case it might be better to allow for a larger queue, which will eventually be flushed.

I am not sure how this should be handled, and what should be the default. The default might depend on buffer_size; this is normally done by letting the parameter be optional<size_t>, with nullopt being interpreted as a computed default based on other parameters.
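
An illustrative sketch of such a throttle, with the limit defaulting to 2 * buffer_size as discussed above; all names are hypothetical and the interaction with the reaping thread is reduced to two calls:

```cpp
#include <cstddef>

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class AsyncWriteThrottle {
 public:
  explicit AsyncWriteThrottle(size_t buffer_size)
      : limit_(2 * buffer_size) {}  // Current buffer plus one in background.

  // Blocks until the data queued for asynchronous writing drops below the
  // limit, then accounts for the new request.
  void AcquireFor(size_t length) {
    mutex_.LockWhen(absl::Condition(this, &AsyncWriteThrottle::BelowLimit));
    in_flight_ += length;
    mutex_.Unlock();
  }

  // Called from the reaping thread when an asynchronous write completes.
  void Release(size_t length) {
    absl::MutexLock lock(&mutex_);
    in_flight_ -= length;
  }

 private:
  bool BelowLimit() const ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) {
    return in_flight_ < limit_;
  }

  const size_t limit_;
  absl::Mutex mutex_;
  size_t in_flight_ ABSL_GUARDED_BY(mutex_) = 0;
};
```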

Choosing the Writer at runtime

For 3b without 3a, it might be sufficient that FdIoUringWriter uses write() / pwrite() as a fallback.

Miscellaneous

I am working on employing writev() / pwritev() in FdWriter. This is surprisingly tricky; it seems that for smaller pieces it is faster to coalesce them into larger arrays than to call writev(). A Chain or Cord can have a mixture of large and small pieces. Another annoyance is that pwritev() is not supported on macOS (but writev() is).

The whole issue with writing is analogous to reading, and here too readv() / preadv() is tricky. They might seem unnecessary in the context of Riegeli, because the Reader decides how it wants the memory to be laid out (and even if its client calls Read() into its own memory, this is done only with one piece at a time because there is no readv()-like API in Reader). However, I recently decided to avoid attaching huge fragments to a Chain or Cord merely because the parameter of Read() was huge; this avoids surprises when a middle-sized substring pins a huge data substructure. Instead, data there is fragmented by reading one buffer size at a time. This means that it now makes sense to employ readv() / preadv() in FdReader when reading into a Chain or Cord.
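
As a point of comparison for the writev() experiments, here is a rough sketch of writing an absl::Cord with pwritev() by exposing each chunk as an iovec; whether this beats coalescing small chunks into one flat buffer is exactly the open question, and error handling plus the IOV_MAX limit are only hinted at:

```cpp
#include <limits.h>
#include <sys/uio.h>

#include <vector>

#include "absl/strings/cord.h"
#include "absl/strings/string_view.h"

// Writes as many chunks of the Cord as fit into one pwritev() call.
ssize_t WriteCordAt(int fd, const absl::Cord& cord, off_t offset) {
  std::vector<struct iovec> iov;
  for (absl::string_view chunk : cord.Chunks()) {
    iov.push_back({const_cast<char*>(chunk.data()), chunk.size()});
    if (iov.size() == static_cast<size_t>(IOV_MAX)) break;  // Loop in real code.
  }
  return pwritev(fd, iov.data(), static_cast<int>(iov.size()), offset);
}
```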

@yunhuali

yunhuali commented Aug 20, 2021

If io_uring can avoid the copy, why can’t the same mechanism be applied to write()? If there are technical reasons why write() must do the copy, why don’t they apply to io_uring?

I think this is a good question. I may not be able to give a perfect answer. I guess the most important factor is compatibility: as you can see, the io_uring API requires different setup/cleanup calls.

Historically, before io_uring, there was another mechanism, LinuxAio, sometimes referred to as "Linux Native AIO".
https://elixir.bootlin.com/linux/latest/source/fs/aio.c
The author of io_uring mentioned that they actually tried to improve LinuxAio at the beginning, "and work progressed fairly far down that path before being abandoned." Please check this link, section 2.0 "Improving the status quo", where there is also some explanation.
https://kernel.dk/io_uring.pdf
