Panic when reading file #391

Closed · threeseed opened this issue Jul 26, 2021 · 6 comments · Fixed by #395

Comments


threeseed commented Jul 26, 2021

thread 'unnamed-6' panicked at 'assertion failed: `(left != right)`
  left: `32767`,
 right: `32767`', /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:268:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:92:14
   2: core::panicking::assert_failed::inner
   3: core::panicking::assert_failed
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:143:5
   4: glommio::task::raw::RawTask<F,R,S>::increment_references
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:276:9
   5: glommio::task::raw::RawTask<F,R,S>::clone_waker
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:268:9
   6: glommio::task::raw::RawTask<F,R,S>::schedule
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:364:42
   7: glommio::task::raw::RawTask<F,R,S>::wake
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/glommio-0.5.0/src/task/raw.rs:230:17
   8: core::task::wake::Waker::wake
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/task/wake.rs:217:18
   9: tokio::sync::task::atomic_waker::AtomicWaker::wake
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/task/atomic_waker.rs:248:13
  10: tokio::sync::mpsc::chan::Chan<T,S>::send
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/chan.rs:292:9
  11: tokio::sync::mpsc::chan::Tx<T,S>::send
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/chan.rs:135:9
  12: tokio::sync::mpsc::bounded::Permit<T>::send
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/bounded.rs:990:9
  13: tokio::sync::mpsc::bounded::Sender<T>::send::{{closure}}
             at /home/naden/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.9.0/src/sync/mpsc/bounded.rs:382:17
  14: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/future/mod.rs:80:19
  15: datagrid::http::data::download_handler::{{closure}}::{{closure}}

Code

https://github.com/nadenf/datagrid/blob/main/src/http/data.rs

Steps to Recreate

fallocate -l 1g /tmp/error_file
git clone https://github.com/nadenf/datagrid
cd datagrid
cargo build
RUST_BACKTRACE=1 ./target/debug/datagrid

In another Terminal:

wget http://127.0.0.1:8000/download?path=/tmp/error_file -q --show-progress
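
For orientation, here is a minimal sketch of the handler shape the backtrace suggests: the download handler reads the file in chunks and pushes them through a bounded tokio mpsc channel whose receiver feeds the HTTP response. This is not the actual nadenf/datagrid code (see the link above); the function name, the chunk size, and the use of std::fs instead of glommio's file API are illustrative assumptions.

```rust
use std::io::Read;
use tokio::sync::mpsc;

// Hypothetical handler shape; not the real datagrid code. In datagrid this
// would be spawned on the glommio executor that also serves the hyper
// connection, which is why the backtrace shows a glommio task waker.
async fn stream_file(path: String, tx: mpsc::Sender<Vec<u8>>) -> std::io::Result<()> {
    // std::fs is used only to keep the sketch short; the real handler
    // presumably uses glommio's own file API.
    let mut file = std::fs::File::open(&path)?;
    let mut buf = vec![0u8; 128 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            return Ok(()); // EOF: the whole file has been streamed
        }
        // This send mirrors frames 9-13 of the backtrace: tokio hands the
        // chunk to the channel and wakes the waker registered by the
        // receiving (HTTP response) task.
        if tx.send(buf[..n].to_vec()).await.is_err() {
            return Ok(()); // receiver dropped: the client went away
        }
    }
}
```

The wake in frames 4-9 targets the waker the receiving task registered with the channel; the panic fires while that wake clones the glommio waker, because the task's reference count has already reached 32767.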

glommer commented Aug 2, 2021

Thank you, @nadenf.
I will try to investigate it this week. We recently had a similar issue that was fixed by 26bc054. I am surprised and sad to see this again.


glommer commented Aug 3, 2021

This reproduced locally. I will now take a look.


glommer commented Aug 3, 2021

Ok, quick update: I know what this is now. This seems to be a regression introduced by @HippoBaro's work on the read_many API.

What happens is that the network streams add waiters to a source by calling source.add_waiter, which increases the reference count on the waker. Nothing seems to pop the waiters from the internal vector where they are stored (short of the task itself being destroyed).

I added a print statement to add_waiter to print the current length of the wakers vector, and it keeps growing until it reaches 32k wakers. So it is not surprising that we hit the assert on a task with a 32k reference count (so this is not a bug in task reference counting, which fills me with relief).

There is code to flush this vector that should be called, but I don't see it being called. So there are still missing pieces of this puzzle, but we'll get there!
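
To make the growth described above concrete, here is a hedged toy illustration (a stand-in, not glommio's actual Source type): every poll that is not ready clones the caller's waker into a Vec, each clone bumps the backing task's reference count, and with nothing draining the Vec the count climbs until it hits the 32767 seen in the assertion (numerically i16::MAX).

```rust
use std::task::{Context, Waker};

// Toy stand-in for the source described above; names are illustrative.
struct ToySource {
    waiters: Vec<Waker>,
}

impl ToySource {
    // Analogue of source.add_waiter: called on every not-ready poll.
    fn add_waiter(&mut self, cx: &mut Context<'_>) {
        // Waker::clone increments the owning task's reference count.
        self.waiters.push(cx.waker().clone());
        // The debug print mentioned above would sit here:
        println!("waiters: {}", self.waiters.len());
    }

    // The flushing path that should run on completion: waking (and thereby
    // dropping) each clone releases its reference on the task again.
    fn on_completion(&mut self) {
        for w in self.waiters.drain(..) {
            w.wake();
        }
    }
}
```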

@HippoBaro (Member)

👋 Bug author here! Thanks for looking into this, @glommer. The vector in question is drained when processing events from the ring. My intuition is that this code may somehow be processing sources outside of the ring. If that's the case, I can see how the Vec could grow without bound.

glommer pushed a commit to glommer/glommio that referenced this issue Aug 4, 2021
It used to be the case that we assumed there was a single waiter
per source, but in the read_many work we moved that to many waiters.

The function that manipulates the wakers was kept the same, so whether
or not we had many wakers was entirely up to the user.

However, I just found a use case (issue DataDog#391) where many
waiters are added to a source that only expects a single waiter. That's
a case in which we reuse a source and keep calling the poll
function on a stream even though the stream is never ready.

It's not clear to me (yet) why this is the case. It is certainly
surprising. While I want to get to the bottom of this, it is not a bad
idea to require the user to state their intentions explicitly.

In the future, if we can indeed guarantee that a function with a single
waiter should be empty, we can use this opportunity for a stronger
assert. For now, this reverts the old behavior in the original users
and at least gets rid of this particular regression.

Fixes DataDog#391
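
The commit above opts for making callers state their intent explicitly. As a hedged sketch only (the method names are assumptions and may not match glommio's real Source API), that split could look roughly like this: single-waiter registration replaces whatever waker is already stored, while read_many-style registration keeps appending.

```rust
use std::task::Waker;

// Illustrative stand-in for a source with the explicit split described in
// the commit message; names and layout are assumptions, not the real API.
struct Source {
    waiters: Vec<Waker>,
}

impl Source {
    /// For callers that expect at most one outstanding waiter (e.g. a network
    /// stream that hyper may re-poll while a request is still in flight):
    /// replace the stored waker instead of accumulating clones.
    fn add_waiter_single(&mut self, waker: &Waker) {
        if self.waiters.is_empty() {
            self.waiters.push(waker.clone());
        } else if !self.waiters[0].will_wake(waker) {
            self.waiters[0] = waker.clone();
        }
        debug_assert!(self.waiters.len() <= 1);
    }

    /// For the read_many path, where several futures legitimately wait on the
    /// same source at once.
    fn add_waiter_many(&mut self, waker: &Waker) {
        self.waiters.push(waker.clone());
    }
}
```

With a split like this, the stronger assert the commit mentions becomes possible later: a single-waiter registration could insist that the stored list is empty (or already holds a waker for the same task) when a new waiter arrives.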

glommer commented Aug 4, 2021

Hi @nadenf,

I just opened a PR that should "fix" this issue. I'm pushing it now because I am respectful of your contributions and mindful of your time, and I want to unblock any work of yours that may be blocked on it. So feel free to use it for now.

However, I may withhold merging for a while, for the following reason: it makes no sense to me that the network stream never completes. The behavior I see is that we keep calling the poll_read function over and over, even though the previous request never completed.

I am fine with breaking this assumption, but it is really strange. If the poll function is called, I would expect that to be because the operation completed.

In a nutshell, I'd like to understand a bit better why this happens. Maybe it is related to hyper, which may have something artificially driving the poll function. But how does it make sense that it is never ready?

Stay tuned!


glommer commented Aug 4, 2021

I looked into this a bit more, and hyper indeed has logic that calls poll_read manually on idle connections. Because that logic pertains to idle connections, it makes sense that we would never see the Source complete in this case.

So this is legitimate. We don't want to cancel the existing source (which was my biggest fear), and because this is a stream, this immediately means that we're not interested in the old waiter anymore.

I want to think a bit more about this, to make sure there aren't cases in which we are not supposed to keep both wakers in the stream.

glommer pushed a commit that referenced this issue Aug 12, 2021

Fixes #391

(cherry picked from commit b301484)