Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async write_all sometimes silently fails to write to file #4296

Closed
igorsol opened this issue Dec 3, 2021 · 15 comments
Closed

Async write_all sometimes silently fails to write to file #4296

igorsol opened this issue Dec 3, 2021 · 15 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs

Comments

@igorsol
Copy link

igorsol commented Dec 3, 2021

Version
tokio v1.14.0
tokio-macros v1.6.0 (proc-macro)

Platform
Linux 4.15.0-163-generic #171~16.04.1-Ubuntu SMP Wed Nov 10 11:19:51 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description
While using write_all to write data to file I noticed that in some rare cases it does not write anything to the file despite the awaiting on corresponding future.

To reproduce the issue I've got the code from the write_all documentation:

use tokio::io::{self, AsyncWriteExt};
use tokio::fs::File;

#[tokio::main]
async fn main() -> io::Result<()> {
    let mut file = File::create("foo.txt").await?;

    file.write_all(b"some bytes").await?;
    //file.flush().await?;
    Ok(())
}

This code should create foo.txt file in the current directory. To test the binary executable I created following bash script which calls test binary in a loop a checks if non-zero length foo.txt was created:

i=0

ctrl_c() {
    echo
    echo "Ctrl-c captured on iteration $i"
    exit 0
}

trap 'ctrl_c' SIGINT

while true; do
    ((++i))
    target/debug/walltest
    if ! [ -s foo.txt ]; then
        echo "failure occurred on iteration $i"
        exit 1
    fi
    rm foo.txt
done

I expected to see this happen:

bash script must run infinitely until Ctrl-C is pressed

Instead, this happened:

the script stops in less than a second reporting that zero-length foo.txt was created. Number of successful iteration varies from less than 10 to few hundreds.

Additional information:

  • If I uncomment the file.flush call then file is always written as expected.
  • Identical code executed under async-std runtime works as expected without file.flush.
@igorsol igorsol added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Dec 3, 2021
@Darksonn Darksonn added the M-fs Module: tokio/fs label Dec 4, 2021
@Darksonn
Copy link
Contributor

Darksonn commented Dec 4, 2021

This is because file operations are offloaded to the spawn_blocking pool, and you need the flush to wait for it to complete. Normally the runtime will always wait for spawn_blocking tasks when shutting down, but it actually only waits if the spawn_blocking task has started before the shutdown happens. So the places where your operation fails is if the runtime shutdown happens before the spawn_blocking task starts.

Note that the above implies that the data can only be lost if the write happens just before runtime shutdown. Otherwise it will complete normally even if the file is dropped before the write completes.

It seems like async-std's approach is the following:

impl Drop for File {
    fn drop(&mut self) {
        // We need to flush the file on drop. Unfortunately, that is not possible to do in a
        // non-blocking fashion, but our only other option here is losing data remaining in the
        // write cache. Good task schedulers should be resilient to occasional blocking hiccups in
        // file destructors so we don't expect this to be a common problem in practice.
        let _ = futures_lite::future::block_on(self.flush());
    }
}

However I would prefer to avoid this solution.

If we want to solve this, I would find it better to have the runtime somehow always wait for the file operation before shutting down.

@krtab
Copy link

krtab commented Dec 4, 2021

Hi!

This is because file operations are offloaded to the spawn_blocking pool, and you need the flush to wait for it to complete.

I don't understand why this guarantee is not/cannot be upheld by awaiting poll_write. Before diving into this issues (and the associated thread on URLO), my mental model was:

  1. In an async context, I'm calling (indirectly or not) poll_write. This return a Future which will be Ready once the bytes have actually been passed on to the OS for writting¹.
  2. Upon the first polling of this Future, a "task" (i'm unsure about the proper terminology here) is started on the blocking thread_pool. This task will use the blocking OS primitives to write the bytes and once this is done wake the future by calling its waker's method.
  3. When polled afterwards, the Future checks if some internal state has been changed to "n bytes have been succefuly written" and if so returns Ready(Ok(n))

But my understanding is that it is rather:

  1. Upon polling, the Future returns Ready(Ok(n)) if n bytes have successfully been written to some internal buffer shared with the blocking task.

Is my first mental impractical to carry out, or is it a deliberate choice? In any case, I think @igorsol had a similar confusion so there may be some way we can think of improving the documentation.


¹ What I mean by "passed on to the OS for writting" is akin to <std::fs::File as std::io::Write>::write has returned.

@Darksonn
Copy link
Contributor

Darksonn commented Dec 4, 2021

Unfortunately, having poll_write wait until the write actually completes is incompatible with the contract for AsyncWrite. If you return Poll::Pending, then the user may supply a different buffer next time they call poll_write. This becomes visible in the cancellation safety section for AsyncWriteExt::write, which says the following:

This method is cancellation safe in the sense that if it is used as the event in a tokio::select! statement and some other branch completes first, then it is guaranteed that no data was written to this AsyncWrite.

If the write call returns Poll::Pending, then some other branch in a select! can complete first but the data would still be written in violation with the above.

The behavior is conceptually similar to writing to a buffered writer.

@igorsol
Copy link
Author

igorsol commented Dec 6, 2021

Just some additional thoughts:

  1. So write_all() reports that future is completed when actual write is not completed yet. If write_all() is a write operation then its future just lies about operation completion. If write_all() is not write operation then what it is? Maybe it should be documented as some scheduling operation?
  2. write_all() returns io::Result<()> but if write operation is not completed then it cannot return result of actual writes. Does it mean that if there is some real IO error it will be lost?
  3. I don't understand how file operations in spawn_blocking pool are executed after File's destructor. Is file handle duplicated or stored behind some reference-counting smart pointer? Sorry I didn't look into tokio sources. The fact that file handle lives longer than File instance is also very confusing.

@Darksonn
Copy link
Contributor

Darksonn commented Dec 6, 2021

The current semantics of write (which write_all inherit for its last call to write) is that if a write has already been started, then it will wait until that write completes, then starts a new write and immediately return. Io errors are returned on the next call to write or flush. And yes, the file descriptor is reference counted between the File object and the background task.

The documentation already contains the following snippet:

A file will not be closed immediately when it goes out of scope if there are any IO operations that have not yet completed. To ensure that a file is closed immediately when it is dropped, you should call flush before dropping it.

Of course, documentation can always be improved.

@malaire
Copy link

malaire commented Dec 6, 2021

... starts a new write and immediately return. Io errors are returned on the next call to write or flush ...

This at the very least should be documented as it's completely unexpected behavior at least for me.

So errors returned by write are NEVER about that write which user attempted to do but about PREVIOUS write. Totally unexpected.

@BraulioVM
Copy link
Contributor

From @Darksonn's comment above

If we want to solve this, I would find it better to have the runtime somehow always wait for the file operation before shutting down.

Is this something you think should be solved? Would it make sense that I look into writing a PR along the lines those lines? Or is it likely that the change wouldn't be accepted?

@Darksonn
Copy link
Contributor

If we can find a good solution, then I am ok with making the change, but it is probably best to discuss the details of the implementation idea with me before you spend a lot of time on implementing it.

@BraulioVM
Copy link
Contributor

Cool, I'll have a look and if I find a way it could be solved then I'll draft a proposal to see what you think

@iitalics
Copy link

I ran into this before, I think the main point of confusion is how std::fs::File immediately flushes on drop so { let f = File::create(); f.write_all(...)?; } just works; then you port this line by line to tokio and it doesn't anymore. yet the write_all method is doing the same behavior in both API's, technically speaking. It's the invisible flush that is different.

I don't really think the API should be changed but perhaps it should be documented loudly that this is something you may need to handle.

@malaire
Copy link

malaire commented Dec 12, 2021

I ran into this before, I think the main point of confusion is how std::fs::File immediately flushes on drop so { let f = File::create(); f.write_all(...)?; } just works; then you port this line by line to tokio and it doesn't anymore. yet the write_all method is doing the same behavior in both API's, technically speaking. It's the invisible flush that is different.

Behaviour is definitely NOT same. Standard write_all is documented to not return until the entire buffer has been successfully written or such an error occurs while last write of tokio write_all doesn't even bother checking for errors, it just starts a new write and immediately return.

@Darksonn
Copy link
Contributor

I agree that the behavior is not the same, but regarding the standard write_all, if you use a BufWriter<File>, then you get a very similar behavior with the std write_all and it also will not always catch errors until you flush, so this is not unique to Tokio's IO traits.

@BraulioVM
Copy link
Contributor

This issue is fixed now!

@msftsecurityteam
Copy link

Hi all, I know this issue is marked as closed and "fixed" but I am writing some Rust that performs many asynchronous HTTP requests spawned via tokio::spawn, and then each of the results are then again passed into a tokio::spawn. I have it set to parse these returned JSON blobs and write out to a .CSV file. However, I am noticing that towards the end of my program in some cases, in other cases not but when there have been a lot of writes I think?, I am getting an error where it's failing to get the handle to the mutex lock. I know this issue was a silent non-write, but this seems related in that it's a non-write but not silent? I hav eattached the backtrace here:

tokio-file-write-bt.txt

@Darksonn
Copy link
Contributor

It sounds like you are dealing with a different issue. Open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs
Projects
None yet
Development

No branches or pull requests

7 participants