
Mysql failure while dropping a future in the middle of reading or writing to the stream. #563

Open
antialize opened this issue Jul 26, 2020 · 25 comments


@antialize
Contributor

antialize commented Jul 26, 2020

I have implemented a web backend using sqlx. When running requests one at a time, everything works fine, but when multiple requests run concurrently it causes a panic within the MySQL part of sqlx.

thread 'tokio-runtime-worker' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/jakobt/.cargo/git/checkouts/sqlx-f05f33ba4f5c3036/eba6f39/sqlx-core/src/mysql/connection/stream.rs:152:12

I currently cannot compile against master, so the bug is against antialize@908279f, which is just the head of master with an additional symbol exposed.

@antialize
Contributor Author

antialize commented Jul 26, 2020

It might be related to me connecting to MariaDB instead of MySQL.
The panic happens in

pub(crate) async fn recv_packet(&mut self) -> Result<Packet<Bytes>, Error> {

at

if payload[0] == 0xff

When I add the following code just before the if

        if payload.is_empty() {
            return Err(
                Error::Protocol("Empty package".to_string())
            );
        }

I get the following error:
2020-07-26 13:16:54,960 INFO [sqlx_core::pool::inner] ping on idle connection returned error: encountered unexpected or invalid data: Empty package

So perhaps ping is just supposed to return 0 bytes, and the right fix is to change the if to

if payload.get(0) == Some(&0xff) {

However, there are many other places after this that assume payload is non-empty, so I assume they would also have to be changed.

@mehcode
Member

mehcode commented Jul 26, 2020

What version of MariaDB are you using? Is it possible for you to put together a small example that showcases the problem?

It's impossible that an empty payload is supposed to be returned. There must be something else afoot.

@antialize
Contributor Author

I am using Server version: 10.4.12-MariaDB-1:10.4.12+maria~bionic-log mariadb.org binary distribution

@antialize
Contributor Author

It is also somewhat odd that it only happens when I run multiple queries at the same time.

@antialize
Contributor Author

I have done a bit of printf tracing: antialize@53d55dd.
I added a unique id number to all MySQL streams. In a trace of the failure, the error happened for streams with ids @4 and @6 at around the same time. I have attached the full log of the run up until the failure, so one can see all the packets sent and received on all the streams.

log.zip

Below is a trace of the interesting bits:

user@server:~$ grep "packet_size: 0" log
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,660 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0
user@server:~$ grep "@4" log | tail -n 10
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 sequence_id: 26, packet_size: 27, payload.len(): 27, header.len(): 0
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql write_packet @4
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection] ping:0 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection] ping:1 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:1 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:4 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0
user@server:~$ grep "@6" log | tail -n 10
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 sequence_id: 26, packet_size: 27, payload.len(): 27, header.len(): 0
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql write_packet @6
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection] ping:0 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection] ping:1 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:1 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:4 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,660 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0

It seems that both streams receive a zero-size packet in wait_until_ready within the ping call. I don't know how to debug this further. Let me know if there is any information I can give that would be useful.

@mehcode
Member

mehcode commented Jul 27, 2020

I tried digging in and reproducing this but I can't. I came up with the test below, which from the description sounds like it should trigger the problem, but everything works fine.

Tried on all supported versions of MariaDB and MySQL.

#[sqlx_macros::test]
async fn it_can_run_many_queries_at_once() -> anyhow::Result<()> {
    let pool = pool::<MySql>().await?; // 10 max connections
    let mut handles = Vec::new();

    for i in 0..100 {
        let pool = pool.clone();
        handles.push(async move {
            let val: i32 = sqlx::query_scalar("SELECT ?, SLEEP(1)").bind(i).fetch_one(&pool).await?;

            assert_eq!(val, i);

            Ok::<_, anyhow::Error>(())
        });
    }

    future::try_join_all(handles).await?;

    Ok(())
}

@antialize
Contributor Author

Thanks for looking at it. I will try to anonymize my data and minimize my failing example.

@antialize
Contributor Author

The issue turned out to be somewhat unrelated to what I thought.
I am using the MySQL pool in a hyper HTTP application. Every now and then a request will be sent by the browser to hyper and then cancelled because the browser no longer needs the information. Hyper's response to this is to drop the Future, and it seems that the MySQL pool gets into a bad state when the fetch_many Future is dropped. This is my guess at what is happening:

  • A large fetch_many is initiated which will return a lot of rows (not just sleep).
  • The fetch_many Future is dropped while we are reading the result. This causes us to stop reading a MySQL packet in the middle of it.
  • ping is called, which will try to read orphaned packets before sending the ping packet.
  • The next packet-header read consumes 4 bytes from somewhere arbitrary in the middle of a previous packet.
  • My tables contain a lot of 0 and NULL entries, so those 4 bytes are often 0x00000000.
  • This is interpreted as a 0-byte packet with sequence number 0.
  • Crash.

I have tried wrapping all the queries in tokio::task::spawn(...).await and that indeed makes the problem go away.
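
For reference, a rough sketch of that workaround (the handler, table, and column names are made up; it assumes a tokio runtime and an existing MySqlPool). The query runs in its own spawned task, so dropping the request future only drops the JoinHandle, not the query future:

use sqlx::MySqlPool;

// Hypothetical handler body: run the query inside its own task so that a
// cancelled request drops only the JoinHandle, not the query future itself.
async fn load_items(pool: &MySqlPool) -> anyhow::Result<Vec<(i64, String)>> {
    let pool = pool.clone();
    let rows = tokio::task::spawn(async move {
        sqlx::query_as::<_, (i64, String)>("SELECT id, name FROM items")
            .fetch_all(&pool)
            .await
    })
    .await??; // first ? is the JoinError, second ? is the sqlx::Error
    Ok(rows)
}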

@antialize
Contributor Author

I have been looking into making the code support dropping the future on the floor in the middle of a query; however, I do not think this can be done without making a lot of copies of data, so for efficiency reasons I think that would be a no-go. It would probably make sense to say that drops are simply not allowed, and if that is the case, to panic on the drop. Something like:

struct DropGuard {}

impl DropGuard {
    fn release(self) {
        std::mem::forget(self);
    }
}

impl Drop for DropGuard {
    fn drop(&mut self) {
        panic!("I should not be dropped")
    }
}

// ...

pub(crate) async fn recv_packet_inner(&mut self) -> Result<Packet<Bytes>, Error> { /* ... */ }

pub(crate) async fn recv_packet(&mut self) -> Result<Packet<Bytes>, Error> {
    let g = DropGuard {};
    let a = self.recv_packet_inner().await;
    g.release();
    a
}

This should not be too costly. If I am not mistaken, the cost is a single match over the state (index) of the generated future at drop time.

@mehcode
Member

mehcode commented Jul 28, 2020

Really appreciate the dive here. This now makes perfect sense to me. Very busy at the moment so I can't dive in, but I will within a day to fix it. Basically we need to make sure the stream is properly exhausted before we run the ping command.

@antialize
Contributor Author

The reading side should be relatively easy to fix as you say: you can just keep a counter of how many bytes to skip within the stream, as sketched below. I think the hard problem is dropping the future in the middle of sending a command to the database; there you cannot just keep a counter of how many bytes you have yet to send, you have to send the right bytes for the command.
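
A rough illustration of that read-side idea (the struct and field names here are made up, not the actual sqlx internals): before reading the next packet header, the stream first drains whatever is left of a packet that was only partially consumed when the future was dropped.

use tokio::io::{AsyncRead, AsyncReadExt};

// Illustration only: a stream that remembers how much of an abandoned packet
// still has to be discarded before the next 4-byte header can be trusted.
struct PacketStream<R> {
    inner: R,              // underlying buffered TCP stream
    bytes_to_skip: usize,  // leftover bytes of a half-read packet
}

impl<R: AsyncRead + Unpin> PacketStream<R> {
    // Discard whatever remains of a packet that was abandoned mid-read,
    // so the next header read starts on a packet boundary.
    async fn drain_skipped(&mut self) -> std::io::Result<()> {
        let mut scratch = [0u8; 4096];
        while self.bytes_to_skip > 0 {
            let want = self.bytes_to_skip.min(scratch.len());
            let got = self.inner.read(&mut scratch[..want]).await?;
            if got == 0 {
                return Err(std::io::ErrorKind::UnexpectedEof.into());
            }
            self.bytes_to_skip -= got;
        }
        Ok(())
    }
}

This only covers reading; as noted above, a partially written command cannot be recovered by skipping bytes.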

@antialize antialize changed the title Mysql failure when running multiple queries at the same time. Mysql failure while dropping a future in the middle of reading or writing to the stream. Jul 29, 2020
@mehcode
Member

mehcode commented Jul 29, 2020

Okay so this is not what I was thinking. We're talking about futures being dropped as a way to cancel the request.

Hmmm.. 🤔

I'm going to sit and think about this for a bit. I don't think SQLx properly handles the fact that Rust allows futures to be dropped at any point.

@jkoudys
Contributor

jkoudys commented Jul 30, 2020

We're hitting this on our actix-web app too. Our very crude workaround is to pass the connection string down to each new actix thread and connect there, so that a failure doesn't bring down the whole system. Obviously that eliminates 95% of why we'd have a pool in the first place, but it's a necessary kludge for now.
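
A minimal sketch of that kludge (the query and table are placeholders): each handler opens its own connection from the connection string instead of borrowing from a shared pool, so a poisoned connection dies with the request.

use sqlx::{Connection, MySqlConnection};

// Hypothetical per-request handler: connect, query, and let the connection
// drop when the request ends (or is cancelled).
async fn handle_request(database_url: &str) -> anyhow::Result<i64> {
    let mut conn = MySqlConnection::connect(database_url).await?;
    let count: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM users")
        .fetch_one(&mut conn)
        .await?;
    Ok(count)
}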

@mehcode
Member

mehcode commented Jul 31, 2020

We now have a reasonable way to reproduce this with #593

@jkoudys
Contributor

jkoudys commented Sep 15, 2020

We've found that it's possible to hit the panic when

let payload: Bytes = self.stream.read(packet_size).await?;
returns a Bytes of size 0, so indexing [0] is out of bounds.

The only way I can see that happening is if the packet_size comes back 0, which we'd see if this returns a 0:

let packet_size = header.get_uint_le(3) as usize;

because the read buffer will always be set to the size it's asked to read:

buf.resize(offset + cnt, 0);

Should I open a PR that returns a new error if the packet size is 0? Or is a 0 packet size okay, and we should simply skip the if payload[0] == 0xff { check when it is, to avoid the panic? Or does a 0 packet size indicate that the header wasn't read correctly / the stream wasn't queried properly?

@jkoudys
Contributor

jkoudys commented Dec 9, 2020

Has anything gone in that might help with this? It's still causing headaches for us -- even a minute of outage on our live system today.

@antialize
Contributor Author

I think the best course of action would be to drop the connection from the pool, if a query/query result is dropped before it is done. This is much easier than somehow trying to cancel the query and recover the connection, though it is much less performant. I can try to create a PR to implement this, if the maintainers would be willing to merge something with that approach.
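
Conceptually, a small sketch of that idea (names like DirtyOnDrop are illustrative, not existing sqlx types): a guard is armed before the query starts and disarmed only when the result has been fully read, so a mid-query drop marks the connection for closing rather than reuse.

// Illustration of the "just drop the connection" approach: a guard armed
// before the query starts and disarmed only once the result stream is
// fully consumed.
struct DirtyOnDrop<'a> {
    poisoned: &'a mut bool, // flag the pool would check before reusing the connection
    armed: bool,
}

impl<'a> DirtyOnDrop<'a> {
    fn new(poisoned: &'a mut bool) -> Self {
        Self { poisoned, armed: true }
    }

    fn disarm(mut self) {
        self.armed = false; // query finished cleanly; connection stays reusable
    }
}

impl Drop for DirtyOnDrop<'_> {
    fn drop(&mut self) {
        if self.armed {
            // The future was dropped mid-query: mark the connection so the
            // pool closes it instead of handing it out again.
            *self.poisoned = true;
        }
    }
}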

@jkoudys
Contributor

jkoudys commented Dec 10, 2020

@antialize it's perhaps less performant in the individual case, but the simpler recovery means all the callers will have an easier time recovering too. Today's outage was from 100% CPU, where the many sqlx panics had it thrashing as the whole connection pool sat there picking up connections and immediately erroring out. It doesn't make sense to me to live with site-crashingly poor performance because there's a theoretically better implementation that won't land for months, when an easier (and at least for my use case better) approach can go in immediately.

@nemosupremo
Copy link

I'm hitting this as well, although I'm not using fetch_many; my stack is hyper, warp, sqlx 0.4.1, and MySQL 5.6. In my case I'm just using fetch, but it may take me some time (1-2 s) to actually exhaust the stream, and within that time the request could be dropped.
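
For context, a sketch of the pattern being described (the handler, table, and column are made up): the stream from fetch is consumed row by row, and if the surrounding request future is dropped between polls, the connection goes back to the pool with unread packets still on the wire.

use futures::TryStreamExt;
use sqlx::{MySqlPool, Row};

// Hypothetical handler: if hyper/warp drops this future while the loop is
// still running, the MySQL connection is returned to the pool mid-result-set.
async fn slow_handler(pool: &MySqlPool) -> anyhow::Result<u64> {
    let mut rows = sqlx::query("SELECT id FROM big_table").fetch(pool);
    let mut total = 0u64;
    while let Some(row) = rows.try_next().await? {
        let _id: i64 = row.try_get("id")?;
        total += 1; // pretend each row takes a while to process
    }
    Ok(total)
}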

@mehcode
Member

mehcode commented Dec 19, 2020

I think the best course of action would be to drop the connection from the pool, if a query/query result is dropped before it is done. This is much easier than somehow trying to cancel the query and recover the connection, though it is much less performant. I can try to create a PR to implement this, if the maintainers would be willing to merge something with that approach.

@antialize We'd very much appreciate a PR that fixed this in the short term in the way you're describing.

@antialize
Contributor Author

I have added a pull request, #915. This does not fix the issue by dropping the connection as I first proposed; instead it makes the packet read resumable, which should be much more efficient as we do not have to open a new connection and re-authenticate.

@kraigher

First of all, thank you for the great work.
Unfortunately the problem described here is causing production outages for us. I saw that a PR was created, but it has not been merged yet. Do you think a fix or workaround will be released soon, or do we need to wait for 0.6? Any tip or workaround I can apply would be helpful. Imho sacrificing performance to fix an unsound implementation would be the right tradeoff in this case.

@mehcode
Member

mehcode commented Feb 1, 2021

As I was writing this response, I realized that the workaround described here was never implemented. For some reason I thought it had been written and merged, but reading back up I see that wasn't done (nothing on you @antialize, I should have followed up on this). I think the reason I thought it was merged was #918.

I'll see if we can get this investigated in the next couple days.


@kraigher Do you have test_before_acquire turned on? That should help, as it would drop a bad connection before the pool hands it out.
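
For reference, enabling that option looks roughly like this when building the pool (the URL and connection limit are placeholders):

use sqlx::mysql::MySqlPoolOptions;

// Sketch: ping each connection before handing it out, so a connection left
// in a bad state is discarded instead of being reused.
async fn build_pool() -> Result<sqlx::MySqlPool, sqlx::Error> {
    MySqlPoolOptions::new()
        .max_connections(10)
        .test_before_acquire(true)
        .connect("mysql://user:pass@localhost/db")
        .await
}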


In the next branch I've been deep at work fixing this problem, MySQL-first. A working alpha of 0.6 for MySQL should be available in the next couple of weeks. The issue is that fixing this for real is a massive core refactor. This is not a simple issue, and I worry about other large async libraries; it's pretty easy to fall into a hole like this in Rust as it stands.


Lastly, I haven't merged the other PR, @antialize, because it only addresses a slice of the problem. It's my fault for not communicating this better, but a simple guide is: look at every async method; if it has more than two .awaits or accesses anything &mut _ before the .await, it's likely broken. Obviously this means most of the internal code is broken in this way.

@kraigher

kraigher commented Feb 1, 2021

I do not have test_before_acquire turned on; I will try it.
Another question: would wrapping request handling in a tokio::spawn avoid the sqlx future-drop problem by running every request in its own task?

@mehcode
Member

mehcode commented Feb 4, 2021

Can anyone who has experienced this problem and can reproduce it try #1025 and see how it works?
