Panic when reading file #391

Closed · threeseed opened this issue Jul 26, 2021 · 6 comments · Fixed by #395

Comments


threeseed commented Jul 26, 2021

thread 'unnamed-6' panicked at 'assertion failed: `(left != right)`
  left: `32767`,
 right: `32767`', /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:268:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:92:14
   2: core::panicking::assert_failed::inner
   3: core::panicking::assert_failed
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:143:5
   4: glommio::task::raw::RawTask<F,R,S>::increment_references
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:276:9
   5: glommio::task::raw::RawTask<F,R,S>::clone_waker
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:268:9
   6: glommio::task::raw::RawTask<F,R,S>::schedule
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:364:42
   7: glommio::task::raw::RawTask<F,R,S>::wake
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:230:17
   8: core::task::wake::Waker::wake
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/task/wake.rs:217:18
   9: tokio::sync::task::atomic_waker::AtomicWaker::wake
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/task/atomic_waker.rs:248:13
  10: tokio::sync::mpsc::chan::Chan<T,S>::send
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/chan.rs:292:9
  11: tokio::sync::mpsc::chan::Tx<T,S>::send
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/chan.rs:135:9
  12: tokio::sync::mpsc::bounded::Permit<T>::send
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/bounded.rs:990:9
  13: tokio::sync::mpsc::bounded::Sender<T>::send::{{closure}}
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/bounded.rs:382:17
  14: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/future/mod.rs:80:19
  15: datagrid::http::data::download_handler::{{closure}}::{{closure}}

Code

https://github.com/nadenf/datagrid/blob/main/src/http/data.rs

Steps to Recreate

fallocate -l 1g /tmp/error_file
git clone https://github.com/nadenf/datagrid
cd datagrid
cargo build
RUST_BACKTRACE=1 ./target/debug/datagrid

In another Terminal:

wget http://127.0.0.1:8000/download?path=/tmp/error_file -q --show-progress
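
For orientation, here is a minimal sketch of the handler shape the backtrace suggests: the download handler reads the file in chunks and pushes them through a bounded tokio mpsc channel whose receiver feeds the HTTP response. This is not the actual nadenf/datagrid code (see the link above); the function name, the chunk size, and the use of std::fs instead of glommio's file API are illustrative assumptions.

```rust
use std::io::Read;
use tokio::sync::mpsc;

// Hypothetical handler shape; not the real datagrid code. In datagrid this
// would be spawned on the glommio executor that also serves the hyper
// connection, which is why the backtrace shows a glommio task waker.
async fn stream_file(path: String, tx: mpsc::Sender<Vec<u8>>) -> std::io::Result<()> {
    // std::fs is used only to keep the sketch short; the real handler
    // presumably uses glommio's own file API.
    let mut file = std::fs::File::open(&path)?;
    let mut buf = vec![0u8; 128 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            return Ok(()); // EOF: the whole file has been streamed
        }
        // This send mirrors frames 9-13 of the backtrace: tokio hands the
        // chunk to the channel and wakes the waker registered by the
        // receiving (HTTP response) task.
        if tx.send(buf[..n].to_vec()).await.is_err() {
            return Ok(()); // receiver dropped: the client went away
        }
    }
}
```

The wake in frames 4-9 targets the waker the receiving task registered with the channel; the panic fires while that wake clones the glommio waker, because the task's reference count has already reached 32767.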

glommer commented Aug 2, 2021

Thank you, @nadenf.
I will try to investigate it this week. We recently had a similar issue that was fixed by 26bc054. I am surprised and sad to see this again.


glommer commented Aug 3, 2021

This reproduced locally. I will now take a look.


glommer commented Aug 3, 2021

Ok, quick update: I know what this is now. This seems to be a regression introduced by @HippoBaro's work on the read_many API.

What happens is that the network streams add waiters to a source by calling source.add_waiter, which increases the reference count on the waker. Nothing seems to pop the waiters from the internal vector where they are stored (short of the task itself being destroyed).

I added a print statement to add_waiter to print the current length of the wakers vector, and it keeps growing until it reaches 32k wakers. So it is not surprising that we hit the assert on a task with a 32k reference count (so this is not a bug in task reference counting, which fills me with relief).

There is code to flush this vector that should be called, but I don't see it being called. So there are still missing pieces of this puzzle, but we'll get there!
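
To make the growth described above concrete, here is a hedged toy illustration (a stand-in, not glommio's actual Source type): every poll that is not ready clones the caller's waker into a Vec, each clone bumps the backing task's reference count, and with nothing draining the Vec the count climbs until it hits the 32767 seen in the assertion (numerically i16::MAX).

```rust
use std::task::{Context, Waker};

// Toy stand-in for the source described above; names are illustrative.
struct ToySource {
    waiters: Vec<Waker>,
}

impl ToySource {
    // Analogue of source.add_waiter: called on every not-ready poll.
    fn add_waiter(&mut self, cx: &mut Context<'_>) {
        // Waker::clone increments the owning task's reference count.
        self.waiters.push(cx.waker().clone());
        // The debug print mentioned above would sit here:
        println!("waiters: {}", self.waiters.len());
    }

    // The flushing path that should run on completion: waking (and thereby
    // dropping) each clone releases its reference on the task again.
    fn on_completion(&mut self) {
        for w in self.waiters.drain(..) {
            w.wake();
        }
    }
}
```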

@HippoBaro (Member)

👋 Bug author here! Thanks for looking into this, @glommer. The vector in question is drained when processing events from the ring. My intuition is that this code may somehow be processing sources outside of the ring. If that's the case, I can see how the Vec could grow without bound.

glommer pushed a commit to glommer/glommio that referenced this issue Aug 4, 2021
It used to be the case that we assumed there was a single waiter
per source, but in the read_many work we moved that to many waiters.

The function that manipulates the wakers was kept the same, so whether
or not we had many wakers was entirely up to the user.

However, I just found a use case (issue DataDog#391) where many
waiters are added to a source that only expects a single waiter. That's
a case in which we reuse a source and keep calling the poll
function on a stream even though the stream is never ready.

It's not clear to me (yet) why this is the case. It is certainly
surprising. While I want to get to the bottom of this, it is not a bad
idea to require the user to state their intentions explicitly.

In the future, if we can indeed guarantee that a function with a single
waiter should be empty, we can use this opportunity for a stronger
assert. For now, this reverts the old behavior in the original users
and at least gets rid of this particular regression.

Fixes DataDog#391
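
The commit above opts for making callers state their intent explicitly. As a hedged sketch only (the method names are assumptions and may not match glommio's real Source API), that split could look roughly like this: single-waiter registration replaces whatever waker is already stored, while read_many-style registration keeps appending.

```rust
use std::task::Waker;

// Illustrative stand-in for a source with the explicit split described in
// the commit message; names and layout are assumptions, not the real API.
struct Source {
    waiters: Vec<Waker>,
}

impl Source {
    /// For callers that expect at most one outstanding waiter (e.g. a network
    /// stream that hyper may re-poll while a request is still in flight):
    /// replace the stored waker instead of accumulating clones.
    fn add_waiter_single(&mut self, waker: &Waker) {
        if self.waiters.is_empty() {
            self.waiters.push(waker.clone());
        } else if !self.waiters[0].will_wake(waker) {
            self.waiters[0] = waker.clone();
        }
        debug_assert!(self.waiters.len() <= 1);
    }

    /// For the read_many path, where several futures legitimately wait on the
    /// same source at once.
    fn add_waiter_many(&mut self, waker: &Waker) {
        self.waiters.push(waker.clone());
    }
}
```

With a split like this, the stronger assert the commit mentions becomes possible later: a single-waiter registration could insist that the stored list is empty (or already holds a waker for the same task) when a new waiter arrives.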

glommer commented Aug 4, 2021

Hi @nadenf,

I just opened a PR that should "fix" this issue. I'm pushing it now because I am respectful of your contributions and mindful of your time, and I want to unblock any work of yours that may be blocked on it. So feel free to use it for now.

However, I may withhold merging for a while, for the following reason: it makes no sense to me that the network stream never completes. The behavior I see is that we keep calling the poll_read function over and over, even though the previous request never completed.

I am fine with breaking this assumption, but it is really strange. If the poll function is called, I would expect that to be because the operation completed.

In a nutshell, I'd like to understand a bit better why this happens. Maybe it is related to hyper, which may have something artificially driving the poll function. But how does it make sense that it is never ready?

Stay tuned!


glommer commented Aug 4, 2021

I looked into this a bit more, and hyper indeed has logic that calls poll_read manually on idle connections. Because that logic pertains to idle connections, it makes sense that we would never see the Source complete in this case.

So this is legitimate. We don't want to cancel the existing source (which was my biggest fear), and because this is a stream, this immediately means that we're not interested in the old waiter anymore.

I want to think a bit more about this, to make sure there aren't cases in which we are not supposed to keep both wakers in the stream.

glommer pushed a commit that referenced this issue Aug 12, 2021

Fixes #391

(cherry picked from commit b301484)