Connect onSuccess/onFailure can be called more than once #972

Closed
svanharmelen opened this issue Oct 9, 2020 · 17 comments

@svanharmelen
Contributor

svanharmelen commented Oct 9, 2020

I have a consistent crash in my program which started occurring when I switched from using QoS=1 to QoS=0. I'm using the async client by means of the paho.mqtt.rust crate (using the master branch). While I understand that the Rust crate is a project of its own, I traced the problem back to the underlying paho.mqtt.c package.

I tried it with the version bundled by the Rust crate (1.3.2), as well as with the latest version (1.3.6) built locally by myself. Both give the exact same traces.

My Rust is pretty decent, but my C is not, so it's a challenge for me to debug this further myself; hence I'm hoping this could be checked by someone with more C skills.

I have no clear way to reproduce the issue standalone, but as you can see in the logs (a trace log using the environment variable as described in the readme, and a GDB backtrace) it actually happens right after the first publish:

Trace log
GDB log

Please let me know if there is anything I can/should test that might help understand the problem better. Thanks!

@keysight-daryl
Contributor

Yeah, memory management in C is really hard to get right. It would be a big improvement if the project moved on from strict C only and started leveraging smart pointers and lambda callbacks, IMHO. Almost no one on any platform, even embedded, is limited to old-school C anymore.

@svanharmelen
Contributor Author

Well, I don't have an opinion on the choice to use C, I just lack the C skills to debug and fix this myself 😏

But on topic... I just tried some more things and noticed that I can also reproduce the same error when using QoS=1 together with clean_session=false. Not as consistently as when using QoS=0 with clean_session=true (then it was ~100% of the time), but I do get the error with these settings as well every once in a while (maybe ~30% of the time).

In all cases using the async client, and in all cases right after sending the first message (see the traces I posted earlier). Hope this helps, and again please let me know how I can assist in moving this forward. Thanks!

/CC @fpagliughi this seems related to the C code, but since I call it from async/await Rust code it's probably good to let you know I have this issue.

@icraggs
Contributor

icraggs commented Oct 12, 2020

I think there is a fix for this in 1.3.5. Use 1.3.6 if you can, to try it out.

@svanharmelen
Contributor Author

Hi @icraggs! I will try 1.3.6 again just to be sure, but I tested both 1.3.2 and master already... Yet it cannot hurt to test it again, so will give it another try tomorrow to be 100% sure.

@icraggs
Contributor

icraggs commented Oct 12, 2020

Oh yes, you did say you tried the latest. I was looking at the trace, which says 1.3.2. It is still true that there is a fix for this issue in 1.3.5 - maybe it's a different case, but on the basis of the gdb trace you provided, the fix is aimed at exactly this case (it fixed it for me).

If you can provide a gdb backtrace of 1.3.6, maybe it will show something different.

In addition, if you can provide a Rust program which shows the issue, I'm prepared to attempt to run it.

@svanharmelen
Contributor Author

Sounds good! I will start by making a trace using 1.3.6 just so we're sure (I'm sure) I'm testing the right thing. If it persists, I'll try to write up a standalone test case in Rust 👍🏻 Thanks so far and will update this issue tomorrow.

@fpagliughi
Contributor

If there's a simple test case in Rust, I can try a translation into C if that helps any.

@svanharmelen
Contributor Author

svanharmelen commented Oct 13, 2020

Hmm... So I doubt whether I actually tested 1.3.6 correctly before, but at least I'm sure I did now. See the following traces I made while testing:

trace-1_3_6.log
gdb-1_3_6.log

I did indeed still get an error, but when taking a closer look (using gdb) it seems to be a different error than before. This time it actually seems to be related to the Rust bindings, most likely from a call made using unsafe Rust.

But to be honest I'm not 100% sure where this issue belongs now, as I also see lines like:

20201013 153416.742 Error from MQTTAsync_cycle() - removing socket 9
20201013 153416.742 (3737003776)  (1)> MQTTAsync_disconnect1:3862

@fpagliughi maybe you can judge if this issue needs to remain here or if it should be moved to the Rust repo?

EDIT: Used clean_session=true with QoS=0 for this test

@svanharmelen
Contributor Author

svanharmelen commented Oct 13, 2020

This one "seems" related: eclipse-paho/paho.mqtt.rust#46

I used a custom compiled version of the MQTT 1.3.6 lib and used master (well actually my SSL branch) from paho.mqtt.rust compiled using --no-default-features --features "ssl"

@fpagliughi
Contributor

It could be that the underlying problem in the C library was fixed in the last release, but now there's a mismatch in the Rust/C bindings for the 1.3.6 release.

This is a "problem" in the Rust lib, but it is a totally expected one, and documented in the Rust readme.

Any change to any mapped C struct is considered a breaking change by Rust. This is how the C lib extends the API without breaking it (by extending versioned structs), but the Rust lib can't handle this automatically. Each Rust release is tied to a specific C release.
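
To illustrate why (a rough sketch with made-up fields, not the actual option structs): the bindings declare a #[repr(C)] mirror of each C struct, so the layout has to match the C headers for that exact release.

```rust
use std::os::raw::{c_char, c_int};

// Illustrative only: not the real field set of MQTTAsync_connectOptions.
#[repr(C)]
struct ConnectOptions {
    struct_id: [c_char; 4],
    struct_version: c_int,
    keep_alive_interval: c_int,
    clean_session: c_int,
    // If the C 1.3.6 headers append a new field here, this mirror no longer
    // matches the size and offsets the C library expects, so passing it
    // across FFI reads/writes the wrong memory until the bindings are
    // regenerated against the new headers.
}

fn main() {
    // The size compiled into the Rust binary must equal sizeof() in the C lib.
    println!("Rust-side size: {}", std::mem::size_of::<ConnectOptions>());
}
```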

There is a Rust release (v0.8) due out in a few weeks that will support C v1.3.6. I will start updating and testing shortly, so keep a watch for those commits that cover support for 1.3.6.

@svanharmelen
Contributor Author

There is a Rust release (v0.8) due out in a few weeks

Yes, we talked about that earlier. Would be very nice if eclipse-paho/paho.mqtt.rust#92 could be part of v0.8 😉

that will support C v1.3.6

That is very good to hear, since that clearly fixes some issues 👍🏻

I will start updating and testing shortly, so keep a watch for those commits that cover support for 1.3.6

I most certainly will, as this is currently kind of blocking us... Thanks for the info and feedback so far, and let me know if I can help with anything or should test anything! Thx...

@svanharmelen
Contributor Author

svanharmelen commented Oct 18, 2020

After updating the Rust bindings (eclipse-paho/paho.mqtt.rust#100) and testing my code using all the latest and greatest, the original error for which I opened this issue is no longer there 👍🏻

But unfortunately I still get the same error that was logged in the second gdb trace gdb-1_3_6.log, so essentially this:

20201013 153416.743 Calling connect failure for client test-app
[DEBUG paho_mqtt::token] Token failure! 0x555557270930, 0x7fffdebe1730
[DEBUG paho_mqtt::token] Token failure message: "socket error"
[DEBUG paho_mqtt::token] Token completed with code: -1

Thread 13 "test-app-worker received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdebe2700 (LWP 16284)]
__GI___pthread_mutex_lock (mutex=0x455030641304331) at ../nptl/pthread_mutex_lock.c:67
67        unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x455030641304331) at ../nptl/pthread_mutex_lock.c:67
#1  0x000055555600d45d in std::sys::unix::mutex::Mutex::lock (self=0x455030641304331)
    at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/mutex.rs:63
#2  0x00005555560076ba in std::sys_common::mutex::Mutex::raw_lock (self=0x455030641304331)
    at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/mutex.rs:42
#3  0x0000555556005366 in std::sync::mutex::Mutex<T>::lock (self=0x555557270930) at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/mutex.rs:269
#4  0x0000555555fe5321 in paho_mqtt::token::TokenInner::on_complete (self=0x555557270930, msgid=0, rc=-1, err_msg=..., rsp=0x0) at /home/sander/paho.mqtt.rust/src/token.rs:324
#5  0x0000555555fe3931 in paho_mqtt::token::TokenInner::on_failure (context=0x555557270930, rsp=0x7fffdebe1730) at /home/sander/paho.mqtt.rust/src/token.rs:233
#6  0x00007ffff7b25f29 in nextOrClose (m=m@entry=0x5555573d58a8, rc=-1, message=message@entry=0x7ffff7b378b2 "socket error") at src/MQTTAsync.c:2275
#7  0x00007ffff7b268c8 in MQTTAsync_receiveThread (n=<optimized out>) at src/MQTTAsync.c:2739
#8  0x00007ffff6c2f40b in start_thread (arg=0x7fffdebe2700) at pthread_create.c:465
#9  0x00007ffff6753e7f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Now I've spent some time to better understand the code (both the FFI Rust bindings and the C code) and eventually found the issue.

It's a long story (at least for me it was) about raw pointers in Rust and how Arc<T>.into_raw() uses std::mem::forget() to make sure the value behind the pointer isn't "dropped". This construct is used for passing the context to the C code when making a server request, which is then passed back to Rust as part of the response (callback). When Rust receives the context back again, it uses Arc::from_raw() to reconstruct the Arc from the raw pointer. So far so good, but at this point the reconstructed Arc will be dropped whenever it goes out of scope, which then releases the memory. This means that using the same raw pointer after this point should be considered a bug (undefined behavior) and will cause a segmentation fault if the underlying memory is reused for another allocation.
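
To make that concrete, here is a minimal sketch of the pattern (hypothetical names, not the actual crate code):

```rust
use std::ffi::c_void;
use std::sync::Arc;

struct Token {
    msg: String,
}

// Hand ownership of the Arc to C: into_raw() "forgets" the Arc, so nothing
// on the Rust side will drop it until a callback reclaims the pointer.
fn into_context(token: Arc<Token>) -> *mut c_void {
    Arc::into_raw(token) as *mut c_void
}

// A callback reclaims ownership with Arc::from_raw(). When this Arc goes
// out of scope, the Token is dropped and its memory is released.
unsafe extern "C" fn on_done(context: *mut c_void) {
    let token = Arc::from_raw(context as *const Token);
    println!("token completed: {}", token.msg);
} // <- Token dropped and freed here

fn main() {
    let ctx = into_context(Arc::new(Token { msg: "connect".into() }));
    unsafe { on_done(ctx) }; // fine: reclaims the one outstanding reference
    // unsafe { on_done(ctx) }; // bug: the pointer was already consumed above,
    //                          // so a second from_raw() is a double free / UB
}
```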

Now I realize that you probably already understand all these details, yet I thought I'd write it out for clarity.

So in our test setup I run a couple of things at the same time, and one of our backends is IoT-Core from AWS. IoT-Core will not allow the same client to connect twice, so whenever that happens the new client is connected and at the same time the already connected client is forcefully disconnected. Since I'm running multiple tests and sometimes forgot to kill an older client that was still running, this actually happened a lot, and it turned out it was a perfect test case to trigger the issue I saw in the logs.

So here is a snippet of the log to show what happens:

[DEBUG paho_mqtt::token] Token success! 0x7fc88fd0be50, 0x70000da6ef18
[DEBUG paho_mqtt::token] Token completed with code: 0
[DEBUG paho_mqtt::token] Expecting server response for: Connect
[DEBUG paho_mqtt::token] Got response: ServerResponse { rsp: Connect(ConnectResponse { server_uri: "xxxxxxxxxxxxx.iot.eu-west-1.amazonaws.com:8883", mqtt_version: 4, session_present: false }), props: Properties { cprops: MQTTProperties { count: 0, max_count: 0, length: 0, array: 0x0 } }, reason_code: Success }
<SNIP>
[DEBUG paho_mqtt::token] Token failure! 0x7fc88fd0be50, 0x70000da6ebf0
[DEBUG paho_mqtt::token] Token failure message: "socket error"
[DEBUG paho_mqtt::token] Token completed with code: -1
[DEBUG paho_mqtt::token] Expecting server response for: Connect
[DEBUG paho_mqtt::token] Got response: ServerResponse { rsp: Connect(ConnectResponse { server_uri: "\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0} \u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0} \u{4}\u{0}.iot.eu-west-1.amazonaws.com:8883", mqtt_version: 4, session_present: false }), props: Properties { cprops: MQTTProperties { count: 0, max_count: 0, length: 0, array: 0x0 } }, reason_code: Success }
iot-gateway(37209,0x70000d057000) malloc: Incorrect checksum for freed object 0x7fc88fd0be40: probably modified after being freed.
Corrupt value: 0x1fffffffffffffff
iot-gateway(37209,0x70000d057000) malloc: *** set a breakpoint in malloc_error_break to debug
Signal: SIGABRT (signal SIGABRT)

Looking closely, you can see that the returned context (raw pointer) is used twice, as both the success and failure callbacks send 0x7fc88fd0be50 as the pointer address. Additionally, you can see that the server_uri in the second call no longer contains the correct URI, but now contains random bits, as the memory was released after the first call.

So now the question is how to solve this. Looking at the C code I found these lines (https://github.com/eclipse/paho.mqtt.c/blob/master/src/MQTTAsync.c#L2795 and https://github.com/eclipse/paho.mqtt.c/blob/master/src/MQTTAsync.c#L2810) and wondered if the solution should be to also set the onFailure callbacks to NULL here? That would prevent reusing the same context in all situations, which AFAIK is what is expected/intended. If that is not the case, and it is actually correct that the context is reused when an already succeeded connection fails, then we need to update the Rust code so it will handle this properly.

Curious to learn to what extent my assessment is correct, and if this would be a proper solution. And of course if not, then super interested to learn what the right solution should be. Thanks!

@fpagliughi
Contributor

I need to look more closely into this, but I am certainly not expecting the same token to get both a success and a failure callback! I am assuming one or the other; after all, how could the same operation both succeed and fail? What would cause this?

And, since this is the C repo, I should explain a bit...
Rust has a thread-safe, reference-counted smart pointer called an Arc, akin to a C++ shared_ptr<>. I create one to hold a Token object to track an operation, and "release" the Arc when I pass it as a context pointer to the C lib for, say, a publish message call. I then re-wrap the Arc on the success/failure call to restore ownership of the raw pointer and release the memory. If the same pointer is sent back to me in two callbacks, I will try to release it twice.

So, from that printout, it does look like the same context pointer, 0x7fc88fd0be50, is being sent back to the app in two separate callbacks.
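
To sketch that shape (simplified, with hypothetical names rather than the exact library code):

```rust
use std::ffi::c_void;
use std::sync::Arc;

struct TokenInner {
    // completion state, result code, waker, etc.
}

impl TokenInner {
    fn on_complete(&self, rc: i32) {
        println!("token completed with code: {}", rc);
    }
}

// Both callbacks receive the same context pointer that was produced by
// Arc::into_raw() when the operation was started. Each one re-wraps it,
// which takes ownership back and frees the Token when the Arc drops.
unsafe extern "C" fn on_success(context: *mut c_void) {
    let token = Arc::from_raw(context as *const TokenInner);
    token.on_complete(0);
}

unsafe extern "C" fn on_failure(context: *mut c_void) {
    let token = Arc::from_raw(context as *const TokenInner);
    token.on_complete(-1);
}

fn main() {
    let ctx = Arc::into_raw(Arc::new(TokenInner {})) as *mut c_void;
    unsafe { on_success(ctx) };   // CONNACK arrives: token released here
    // unsafe { on_failure(ctx) }; // later socket error on the same token:
    //                             // second from_raw() on freed memory -> UB
}
```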

@svanharmelen
Contributor Author

but certainly I am not expecting the same token to get both a success and a failure callback

Yeah, that was my idea as well, yet there doesn't seem to be a way to set a new token on an existing connection (once the initial success callback is made) which could then be used if a later connection issue triggers a call to the failure callback.

So the biggest question for me is whether (from the C code's point of view) a token should never be reused, or whether a token (context) can indeed be reused and the caller should be able to handle that. @icraggs do you maybe know the answer to that question?

Knowing that (and understanding the intended logic) tells me what to look at next and where to try to solve this problem. It is still blocking us from moving forward, so I'm very willing to help find a solution. But I need a little more info in order to continue my search and focus on the correct Rust or C code...

@icraggs
Contributor

icraggs commented Oct 19, 2020

It seems right that we should NULL out both success and failure pointers when either is called for a connect.

@svanharmelen
Contributor Author

Oops, you already added a fix for this in the develop branch... Missed that 😏 I'll close my PR again and will do some testing with the develop branch.

I was actually already using the develop branch for the connection callback fix which is also in there, so I'm wondering if you already have any idea or plan for making a v1.3.7 release?

Thanks!

@icraggs icraggs added this to the 1.3.7 milestone Oct 21, 2020
@icraggs icraggs changed the title Double free error when using the async client with QoS=0 Connect onSuccess/onFailure can be called more than once Oct 21, 2020
@icraggs icraggs added the bug label Oct 21, 2020
@svanharmelen
Contributor Author

I tested the fix that is currently in the develop branch and I can confirm that it solves the problem. Before I was able to trigger the segmentation fault really consistently, but using the updated code I wasn't able to trigger it anymore. Thanks again for the help!
