
Mysql failure while dropping a future in the middle of reading or writing to the stream. #563

Open
antialize opened this issue Jul 26, 2020 · 25 comments


@antialize
Contributor

antialize commented Jul 26, 2020

I have implemented a web backend using sqlx. When running requests one at a time, everything works fine, but when multiple requests run concurrently it causes a panic within the MySQL part of sqlx.

thread 'tokio-runtime-worker' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/jakobt/.cargo/git/checkouts/sqlx-f05f33ba4f5c3036/eba6f39/sqlx-core/src/mysql/connection/stream.rs:152:12

I currently cannot compile against master, so the bug is against antialize@908279f, which is just the head of master with an additional symbol exposed.

@antialize
Contributor Author

antialize commented Jul 26, 2020

It might be related to me connecting to MariaDB instead of MySQL.
The panic happens in

pub(crate) async fn recv_packet(&mut self) -> Result<Packet<Bytes>, Error> {

at

if payload[0] == 0xff

When I add the following code just before the if

        if payload.is_empty() {
            return Err(
                Error::Protocol("Empty package".to_string())
            );
        }

I get the following error:
2020-07-26 13:16:54,960 INFO [sqlx_core::pool::inner] ping on idle connection returned error: encountered unexpected or invalid data: Empty package

So perhaps ping is just supposed to return 0 bytes, and the right fix is to change the if to

if payload.get(0) == Some(&0xff) {

However, there are many other places after this that assume payload is non-empty, so I assume they would also have to be changed.

@mehcode
Member

mehcode commented Jul 26, 2020

What version of MariaDB are you using? Is it possible for you to put together a small example that showcases the problem?

It's impossible that an empty payload is supposed to be returned. There must be something else afoot.

@antialize
Contributor Author

I am using Server version: 10.4.12-MariaDB-1:10.4.12+maria~bionic-log mariadb.org binary distribution

@antialize
Contributor Author

It is also somewhat odd that it only happens when I run multiple queries at the same time.

@antialize
Contributor Author

I have done a bit of printf tracing: antialize@53d55dd.
I added a unique id number to all MySQL streams. In a trace of the failure, the error happened for streams with ids @4 and @6 at around the same time. I have attached the full log of the run up until the failure, so one can see all the packets sent and received on all the streams.

log.zip

Below is a trace of the interesting bits:

user@server:~$ grep "packet_size: 0" log
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,660 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0
user@server:~$ grep "@4" log | tail -n 10
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 sequence_id: 26, packet_size: 27, payload.len(): 27, header.len(): 0
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql write_packet @4
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,337 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection] ping:0 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection] ping:1 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:1 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:4 @4
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 
Jul 26 19:07:29 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:29,076 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @4 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0
user@server:~$ grep "@6" log | tail -n 10
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 sequence_id: 26, packet_size: 27, payload.len(): 27, header.len(): 0
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql write_packet @6
Jul 26 19:07:27 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:27,935 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection] ping:0 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection] ping:1 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:1 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection::stream] wait_until_ready:4 @6
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,488 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 
Jul 26 19:07:28 ws5 a3c66e9325e0[30076]: 2020-07-26 19:07:28,660 INFO  [sqlx_core::mysql::connection::stream] mysql recv_packet:1 @6 sequence_id: 0, packet_size: 0, payload.len(): 0, header.len(): 0

It seems that both streams receive a zero-size packet in wait_until_ready within the ping call. I don't know how to debug this further. Let me know if there is any information I can give that would be useful.

@mehcode
Member

mehcode commented Jul 27, 2020

I tried digging in and reproducing this but I can't. I came up with the test below, which from the description sounds like it should trigger the problem, but everything works fine.

Tried on all supported versions of MariaDB and MySQL.

#[sqlx_macros::test]
async fn it_can_run_many_queries_at_once() -> anyhow::Result<()> {
    let pool = pool::<MySql>().await?; // 10 max connections
    let mut handles = Vec::new();

    for i in 0..100 {
        let pool = pool.clone();
        handles.push(async move {
            let val: i32 = sqlx::query_scalar("SELECT ?, SLEEP(1)").bind(i).fetch_one(&pool).await?;

            assert_eq!(val, i);

            Ok::<_, anyhow::Error>(())
        });
    }

    future::try_join_all(handles).await?;

    Ok(())
}

@antialize
Contributor Author

Thanks for looking at it. I will try to anonymize my data and minimize my failing example.

@antialize
Contributor Author

The issue turned out to be somewhat unrelated to what I thought.
I am using the MySQL pool in a hyper HTTP application. Every now and then a request will be sent by the browser to hyper and then cancelled because the browser no longer needs the information. Hyper's response to this is to drop the Future, and it seems that the MySQL pool gets into a bad state when the fetch_many Future is dropped. This is my guess at what is happening:

  • A large fetch_many is initiated which will return a lot of rows (not just sleep).
  • The fetch_many Future is dropped while we are reading the result. This causes us to stop reading a MySQL packet in the middle of it.
  • ping is called, which will try to read orphaned packets before sending the ping packet.
  • The next packet-header read consumes 4 bytes from somewhere arbitrary in the middle of a previous packet.
  • My tables contain a lot of 0 and NULL entries, so those 4 bytes are often 0x00000000.
  • This is interpreted as a 0-byte packet with sequence number 0.
  • Crash.

I have tried wrapping all the queries in tokio::task::spawn(...).await and that indeed makes the problem go away.
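
For reference, a rough sketch of that workaround (the handler, table, and column names are made up; it assumes a tokio runtime and an existing MySqlPool). The query runs in its own spawned task, so dropping the request future only drops the JoinHandle, not the query future:

use sqlx::MySqlPool;

// Hypothetical handler body: run the query inside its own task so that a
// cancelled request drops only the JoinHandle, not the query future itself.
async fn load_items(pool: &MySqlPool) -> anyhow::Result<Vec<(i64, String)>> {
    let pool = pool.clone();
    let rows = tokio::task::spawn(async move {
        sqlx::query_as::<_, (i64, String)>("SELECT id, name FROM items")
            .fetch_all(&pool)
            .await
    })
    .await??; // first ? is the JoinError, second ? is the sqlx::Error
    Ok(rows)
}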

@antialize
Contributor Author

I have been looking into making the code support dropping the future on the floor in the middle of a query; however, I do not think this can be done without making a lot of copies of data, so for efficiency reasons I think that would be a no-go. It would probably make sense to say that drops are simply not allowed, and if that is the case, to panic on the drop. Something like:

struct DropGuard {}

impl DropGuard {
    fn release(self) {
        std::mem::forget(self);
    }
}

impl Drop for DropGuard {
    fn drop(&mut self) {
        panic!("I should not be dropped")
    }
}

// ...

pub(crate) async fn recv_packet_inner(&mut self) -> Result<Packet<Bytes>, Error> { /* ... */ }

pub(crate) async fn recv_packet(&mut self) -> Result<Packet<Bytes>, Error> {
    let g = DropGuard {};
    let a = self.recv_packet_inner().await;
    g.release();
    a
}

This should not be too costly. If I am not mistaken, the cost is a single match over the state (index) of the generated future at drop time.

@mehcode
Member

mehcode commented Jul 28, 2020

Really appreciate the dive here. This now makes perfect sense to me. Very busy at the moment so I can't dive in, but I will within a day to fix it. Basically we need to make sure the stream is properly exhausted before we run the ping command.

@antialize
Contributor Author

The reading side should be relatively easy to fix as you say: you can just keep a counter of how many bytes to skip within the stream, as sketched below. I think the hard problem is dropping the future in the middle of sending a command to the database; there you cannot just keep a counter of how many bytes you have yet to send, you have to send the right bytes for the command.
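
A rough illustration of that read-side idea (the struct and field names here are made up, not the actual sqlx internals): before reading the next packet header, the stream first drains whatever is left of a packet that was only partially consumed when the future was dropped.

use tokio::io::{AsyncRead, AsyncReadExt};

// Illustration only: a stream that remembers how much of an abandoned packet
// still has to be discarded before the next 4-byte header can be trusted.
struct PacketStream<R> {
    inner: R,              // underlying buffered TCP stream
    bytes_to_skip: usize,  // leftover bytes of a half-read packet
}

impl<R: AsyncRead + Unpin> PacketStream<R> {
    // Discard whatever remains of a packet that was abandoned mid-read,
    // so the next header read starts on a packet boundary.
    async fn drain_skipped(&mut self) -> std::io::Result<()> {
        let mut scratch = [0u8; 4096];
        while self.bytes_to_skip > 0 {
            let want = self.bytes_to_skip.min(scratch.len());
            let got = self.inner.read(&mut scratch[..want]).await?;
            if got == 0 {
                return Err(std::io::ErrorKind::UnexpectedEof.into());
            }
            self.bytes_to_skip -= got;
        }
        Ok(())
    }
}

This only covers reading; as noted above, a partially written command cannot be recovered by skipping bytes.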

@antialize antialize changed the title Mysql failure when running multiple queries at the same time. Mysql failure while dropping a future in the middle of reading or writing to the stream. Jul 29, 2020
@mehcode
Member

mehcode commented Jul 29, 2020

Okay so this is not what I was thinking. We're talking about futures being dropped as a way to cancel the request.

Hmmm.. 🤔

I'm going to sit and think about this for a bit. I don't think SQLx properly handles the fact that Rust allows futures to be dropped at any point.

@jkoudys
Contributor

jkoudys commented Jul 30, 2020

We're hitting this on our actix-web app too. Our very crude workaround is to pass the connection string down to each new actix thread and connect there, so that a failure doesn't bring down the whole system. Obviously that eliminates 95% of why we'd have a pool in the first place, but it's a necessary kludge for now.
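
A minimal sketch of that kludge (the query and table are placeholders): each handler opens its own connection from the connection string instead of borrowing from a shared pool, so a poisoned connection dies with the request.

use sqlx::{Connection, MySqlConnection};

// Hypothetical per-request handler: connect, query, and let the connection
// drop when the request ends (or is cancelled).
async fn handle_request(database_url: &str) -> anyhow::Result<i64> {
    let mut conn = MySqlConnection::connect(database_url).await?;
    let count: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM users")
        .fetch_one(&mut conn)
        .await?;
    Ok(count)
}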

@mehcode
Member

mehcode commented Jul 31, 2020

We now have a reasonable way to reproduce this with #593

@jkoudys
Contributor

jkoudys commented Sep 15, 2020

We've found that it's possible to hit the panic when

let payload: Bytes = self.stream.read(packet_size).await?;
returns a Bytes of size 0, so indexing [0] is out of bounds.

The only way I can see that happening is if the packet_size comes back 0, which we'd see if this returns a 0:

let packet_size = header.get_uint_le(3) as usize;

because the read buffer will always be set to the size it's asked to read:

buf.resize(offset + cnt, 0);

Should I open a PR that returns a new error if the packet size is 0? Or is a 0 packet size okay, and we should simply skip the if payload[0] == 0xff { check when it is, to avoid the panic? Or does a 0 packet size indicate that the header wasn't read correctly / the stream wasn't queried properly?

@jkoudys
Contributor

jkoudys commented Dec 9, 2020

Has anything gone in that might help with this? It's still causing headaches for us -- even a minute of outage on our live system today.

@antialize
Contributor Author

I think the best course of action would be to drop the connection from the pool, if a query/query result is dropped before it is done. This is much easier than somehow trying to cancel the query and recover the connection, though it is much less performant. I can try to create a PR to implement this, if the maintainers would be willing to merge something with that approach.
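
Conceptually, a small sketch of that idea (names like DirtyOnDrop are illustrative, not existing sqlx types): a guard is armed before the query starts and disarmed only when the result has been fully read, so a mid-query drop marks the connection for closing rather than reuse.

// Illustration of the "just drop the connection" approach: a guard armed
// before the query starts and disarmed only once the result stream is
// fully consumed.
struct DirtyOnDrop<'a> {
    poisoned: &'a mut bool, // flag the pool would check before reusing the connection
    armed: bool,
}

impl<'a> DirtyOnDrop<'a> {
    fn new(poisoned: &'a mut bool) -> Self {
        Self { poisoned, armed: true }
    }

    fn disarm(mut self) {
        self.armed = false; // query finished cleanly; connection stays reusable
    }
}

impl Drop for DirtyOnDrop<'_> {
    fn drop(&mut self) {
        if self.armed {
            // The future was dropped mid-query: mark the connection so the
            // pool closes it instead of handing it out again.
            *self.poisoned = true;
        }
    }
}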

@jkoudys
Contributor

jkoudys commented Dec 10, 2020

@antialize it's perhaps less performant in the individual case, but the simpler recovery means all the callers will have an easier time recovering too. Today's outage was from 100% CPU, where the many sqlx panics had it thrashing as the whole connection pool sat there picking up connections and immediately erroring out. It doesn't make sense to me to live with site-crashingly poor performance because there's a theoretically better implementation that won't land for months, when an easier (and at least for my use case better) approach can go in immediately.

@nemosupremo
Copy link

I'm hitting this as well, although I'm not using fetch_many; my stack is hyper, warp, sqlx 0.4.1, and MySQL 5.6. In my case I'm just using fetch, but it may take me some time (1-2 s) to actually exhaust the stream, and within that time the request could be dropped.
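
For context, a sketch of the pattern being described (the handler, table, and column are made up): the stream from fetch is consumed row by row, and if the surrounding request future is dropped between polls, the connection goes back to the pool with unread packets still on the wire.

use futures::TryStreamExt;
use sqlx::{MySqlPool, Row};

// Hypothetical handler: if hyper/warp drops this future while the loop is
// still running, the MySQL connection is returned to the pool mid-result-set.
async fn slow_handler(pool: &MySqlPool) -> anyhow::Result<u64> {
    let mut rows = sqlx::query("SELECT id FROM big_table").fetch(pool);
    let mut total = 0u64;
    while let Some(row) = rows.try_next().await? {
        let _id: i64 = row.try_get("id")?;
        total += 1; // pretend each row takes a while to process
    }
    Ok(total)
}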

@mehcode
Member

mehcode commented Dec 19, 2020

I think the best course of action would be to drop the connection from the pool, if a query/query result is dropped before it is done. This is much easier than somehow trying to cancel the query and recover the connection, though it is much less performant. I can try to create a PR to implement this, if the maintainers would be willing to merge something with that approach.

@antialize We'd very much appreciate a PR that fixed this in the short term in the way you're describing.

@antialize
Contributor Author

I have added a pull request, #915. This does not fix the issue by dropping the connection as I first proposed; instead it makes the packet read resumable, which should be much more efficient as we do not have to open a new connection and re-authenticate.

@kraigher

First of all, thank you for the great work.
Unfortunately the problem described here is causing production outages for us. I saw that a PR was created, but it has not been merged yet. Do you think a fix or workaround will be released soon, or do we need to wait for 0.6? Any tip or workaround I can apply would be helpful. Imho sacrificing performance to fix an unsound implementation would be the right tradeoff in this case.

@mehcode
Member

mehcode commented Feb 1, 2021

As I was writing this response, I realized that the workaround described here was never implemented. For some reason I thought it had been written and merged, but reading back up I see that wasn't done (nothing on you @antialize, I should have followed up on this). I think the reason I thought it was merged was #918.

I'll see if we can get this investigated in the next couple days.


@kraigher Do you have test_before_acquire turned on? That should help, as it would drop a bad connection before the pool hands it out.
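
For reference, enabling that option looks roughly like this when building the pool (the URL and connection limit are placeholders):

use sqlx::mysql::MySqlPoolOptions;

// Sketch: ping each connection before handing it out, so a connection left
// in a bad state is discarded instead of being reused.
async fn build_pool() -> Result<sqlx::MySqlPool, sqlx::Error> {
    MySqlPoolOptions::new()
        .max_connections(10)
        .test_before_acquire(true)
        .connect("mysql://user:pass@localhost/db")
        .await
}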


In the next branch I've been deep at work fixing this problem, MySQL-first. A working alpha of 0.6 for MySQL should be available in the next couple of weeks. The issue is that fixing this for real is a massive core refactor. This is not a simple issue, and I worry about other large async libraries; it's pretty easy to fall into a hole like this in Rust as it stands.


Lastly, I haven't merged the other PR, @antialize, because it only addresses a slice of the problem. It's my fault for not communicating this better, but a simple guide is: look at every async method; if it has more than two .awaits or accesses anything &mut _ before the .await, it's likely broken. Obviously this means most of the internal code is broken in this way.

@kraigher

kraigher commented Feb 1, 2021

I do not have test_before_acquire turned on; I will try it.
Another question: would wrapping request handling in a tokio::spawn avoid the sqlx future-drop problem by running every request in its own task?

@mehcode
Member

mehcode commented Feb 4, 2021

Can anyone who has experienced this problem and can reproduce it try #1025 and see how it works?
