Connect onSuccess/onFailure can be called more than once #972

Closed
svanharmelen opened this issue Oct 9, 2020 · 17 comments

@svanharmelen
Contributor

svanharmelen commented Oct 9, 2020

I have a consistent crash in my program which started occurring when I switched from using QoS=1 to QoS=0. I'm using the async client by means of the paho.mqtt.rust crate (using the master branch). While I understand that the Rust crate is a project of its own, I traced the problem back to the underlying paho.mqtt.c package.

I tried it with the version bundled by the Rust crate (1.3.2), as well as with the latest version (1.3.6) built locally by myself. Both give the exact same traces.

My Rust is pretty decent, but my C is not, so it's a challenge for me to debug this further myself; hence I'm hoping this could be checked by someone with more C skills.

I have no clear way to reproduce the issue standalone, but as you can see in the logs (a trace log using the environment variable as described in the readme, and a GDB backtrace) it actually happens right after the first publish:

Trace log
GDB log

Please let me know if there is anything I can/should test that might help understand the problem better. Thanks!

@keysight-daryl
Contributor

Yeah, memory management in C is really hard to get right. It would be a big improvement if the project moved on from strict C only and started leveraging smart pointers and lambda callbacks, IMHO. Almost no one on any platform, even embedded, is limited to old-school C anymore.

@svanharmelen
Contributor Author

Well, I don't have an opinion on the choice to use C, I just lack the C skills to debug and fix this myself 😏

But on topic... I just tried some more things and noticed that I can also reproduce the same error when using QoS=1 together with clean_session=false. Not as consistently as when using QoS=0 with clean_session=true (then it was ~100% of the time), but I do get the error with these settings as well every once in a while (maybe ~30% of the time).

In all cases using the async client, and in all cases right after sending the first message (see the traces I posted earlier). Hope this helps, and again please let me know how I can assist in moving this forward. Thanks!

/CC @fpagliughi this seems related to the C code, but since I call it from async/await Rust code it's probably good to let you know I have this issue.

@icraggs
Contributor

icraggs commented Oct 12, 2020

I think there is a fix for this in 1.3.5. Use 1.3.6 if you can, to try it out.

@svanharmelen
Contributor Author

Hi @icraggs! I will try 1.3.6 again just to be sure, but I tested both 1.3.2 and master already... Yet it cannot hurt to test it again, so will give it another try tomorrow to be 100% sure.

@icraggs
Contributor

icraggs commented Oct 12, 2020

Oh yes, you did say you tried the latest. I was looking at the trace, which says 1.3.2. It is still true that there is a fix for this issue in 1.3.5 - maybe it's a different case, but on the basis of the gdb trace you provided, the fix is aimed at exactly this case (it fixed it for me).

If you can provide a gdb backtrace of 1.3.6, maybe it will show something different.

In addition, if you can provide a Rust program which shows the issue, I'm prepared to attempt to run it.

@svanharmelen
Contributor Author

Sounds good! I will start by making a trace using 1.3.6 just so we're sure (I'm sure) I'm testing the right thing. If it persists, I'll try to write up a standalone test case in Rust 👍🏻 Thanks so far and will update this issue tomorrow.

@fpagliughi
Contributor

If there's a simple test case in Rust, I can try a translation into C if that helps any.

@svanharmelen
Contributor Author

svanharmelen commented Oct 13, 2020

Hmm... So I doubt whether I actually tested 1.3.6 correctly before, but at least I'm sure I did now. See the following traces I made while testing:

trace-1_3_6.log
gdb-1_3_6.log

I did indeed still get an error, but when taking a closer look (using gdb) it seems to be a different error than before. This time it actually seems to be related to the Rust bindings, most likely from a call made using unsafe Rust.

But to be honest I'm not 100% sure where this issue belongs now, as I also see lines like:

20201013 153416.742 Error from MQTTAsync_cycle() - removing socket 9
20201013 153416.742 (3737003776)  (1)> MQTTAsync_disconnect1:3862

@fpagliughi maybe you can judge if this issue needs to remain here or if it should be moved to the Rust repo?

EDIT: Used clean_session=true with QoS=0 for this test

@svanharmelen
Contributor Author

svanharmelen commented Oct 13, 2020

This one "seems" related: eclipse-paho/paho.mqtt.rust#46

I used a custom compiled version of the MQTT 1.3.6 lib and used master (well actually my SSL branch) from paho.mqtt.rust compiled using --no-default-features --features "ssl"

@fpagliughi
Contributor

It could be that the underlying problem in the C library was fixed in the last release, but now there's a mismatch in the Rust/C bindings for the 1.3.6 release.

This is a "problem" in the Rust lib, but it is a totally expected one, and documented in the Rust readme.

Any change to any mapped C struct is considered a breaking change by Rust. This is how the C lib extends the API without breaking it (by extending versioned structs), but the Rust lib can't handle this automatically. Each Rust release is tied to a specific C release.
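
To illustrate why (a rough sketch with made-up fields, not the actual option structs): the bindings declare a #[repr(C)] mirror of each C struct, so the layout has to match the C headers for that exact release.

```rust
use std::os::raw::{c_char, c_int};

// Illustrative only: not the real field set of MQTTAsync_connectOptions.
#[repr(C)]
struct ConnectOptions {
    struct_id: [c_char; 4],
    struct_version: c_int,
    keep_alive_interval: c_int,
    clean_session: c_int,
    // If the C 1.3.6 headers append a new field here, this mirror no longer
    // matches the size and offsets the C library expects, so passing it
    // across FFI reads/writes the wrong memory until the bindings are
    // regenerated against the new headers.
}

fn main() {
    // The size compiled into the Rust binary must equal sizeof() in the C lib.
    println!("Rust-side size: {}", std::mem::size_of::<ConnectOptions>());
}
```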

There is a Rust release (v0.8) due out in a few weeks that will support C v1.3.6. I will start updating and testing shortly, so keep a watch for those commits that cover support for 1.3.6.

@svanharmelen
Contributor Author

There is a Rust release (v0.8) due out in a few weeks

Yes, we talked about that earlier. Would be very nice if eclipse-paho/paho.mqtt.rust#92 could be part of v0.8 😉

that will support C v1.3.6

That is very good to hear, since that clearly fixes some issues 👍🏻

I will start updating and testing shortly, so keep a watch for those commits that cover support for 1.3.6

I most certainly will, as this is currently kind of blocking us... Thanks for the info and feedback so far, and let me know if I can help with anything or should test anything! Thx...

@svanharmelen
Contributor Author

svanharmelen commented Oct 18, 2020

After updating the Rust bindings (eclipse-paho/paho.mqtt.rust#100) and testing my code using all the latest and greatest, the original error for which I opened this issue is no longer there 👍🏻

But unfortunately I still get the same error that was logged in the second gdb trace gdb-1_3_6.log, so essentially this:

20201013 153416.743 Calling connect failure for client test-app
[DEBUG paho_mqtt::token] Token failure! 0x555557270930, 0x7fffdebe1730
[DEBUG paho_mqtt::token] Token failure message: "socket error"
[DEBUG paho_mqtt::token] Token completed with code: -1

Thread 13 "test-app-worker received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdebe2700 (LWP 16284)]
__GI___pthread_mutex_lock (mutex=0x455030641304331) at ../nptl/pthread_mutex_lock.c:67
67        unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x455030641304331) at ../nptl/pthread_mutex_lock.c:67
#1  0x000055555600d45d in std::sys::unix::mutex::Mutex::lock (self=0x455030641304331)
    at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/mutex.rs:63
#2  0x00005555560076ba in std::sys_common::mutex::Mutex::raw_lock (self=0x455030641304331)
    at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/mutex.rs:42
#3  0x0000555556005366 in std::sync::mutex::Mutex<T>::lock (self=0x555557270930) at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/mutex.rs:269
#4  0x0000555555fe5321 in paho_mqtt::token::TokenInner::on_complete (self=0x555557270930, msgid=0, rc=-1, err_msg=..., rsp=0x0) at /home/sander/paho.mqtt.rust/src/token.rs:324
#5  0x0000555555fe3931 in paho_mqtt::token::TokenInner::on_failure (context=0x555557270930, rsp=0x7fffdebe1730) at /home/sander/paho.mqtt.rust/src/token.rs:233
#6  0x00007ffff7b25f29 in nextOrClose (m=m@entry=0x5555573d58a8, rc=-1, message=message@entry=0x7ffff7b378b2 "socket error") at src/MQTTAsync.c:2275
#7  0x00007ffff7b268c8 in MQTTAsync_receiveThread (n=<optimized out>) at src/MQTTAsync.c:2739
#8  0x00007ffff6c2f40b in start_thread (arg=0x7fffdebe2700) at pthread_create.c:465
#9  0x00007ffff6753e7f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Now I've spent some time to better understand the code (both the FFI Rust bindings and the C code) and eventually found the issue.

It's a long story (at least for me it was) about raw pointers in Rust and how Arc<T>.into_raw() uses std::mem::forget() to make sure the value behind the pointer isn't "dropped". This construct is used for passing the context to the C code when making a server request, which is then passed back to Rust as part of the response (callback). When Rust receives the context back again, it uses Arc::from_raw() to reconstruct the Arc from the raw pointer. So far so good, but at this point the reconstructed Arc will be dropped whenever it goes out of scope, which then releases the memory. This means that using the same raw pointer after this point should be considered a bug (undefined behavior) and will cause a segmentation fault if the underlying memory is reused for another allocation.
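
To make that concrete, here is a minimal sketch of the pattern (hypothetical names, not the actual crate code):

```rust
use std::ffi::c_void;
use std::sync::Arc;

struct Token {
    msg: String,
}

// Hand ownership of the Arc to C: into_raw() "forgets" the Arc, so nothing
// on the Rust side will drop it until a callback reclaims the pointer.
fn into_context(token: Arc<Token>) -> *mut c_void {
    Arc::into_raw(token) as *mut c_void
}

// A callback reclaims ownership with Arc::from_raw(). When this Arc goes
// out of scope, the Token is dropped and its memory is released.
unsafe extern "C" fn on_done(context: *mut c_void) {
    let token = Arc::from_raw(context as *const Token);
    println!("token completed: {}", token.msg);
} // <- Token dropped and freed here

fn main() {
    let ctx = into_context(Arc::new(Token { msg: "connect".into() }));
    unsafe { on_done(ctx) }; // fine: reclaims the one outstanding reference
    // unsafe { on_done(ctx) }; // bug: the pointer was already consumed above,
    //                          // so a second from_raw() is a double free / UB
}
```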

Now I realize that you probably already understand all these details, yet I thought I'd write it out for clarity.

So in our test setup I run a couple of things at the same time, and one of our backends is IoT-Core from AWS. IoT-Core will not allow the same client to connect twice, so whenever that happens the new client is connected and at the same time the already connected client is forcefully disconnected. Since I'm running multiple tests and sometimes forgot to kill an older client that was still running, this actually happened a lot, and it turned out it was a perfect test case to trigger the issue I saw in the logs.

So here is a snippet of the log to show what happens:

[DEBUG paho_mqtt::token] Token success! 0x7fc88fd0be50, 0x70000da6ef18
[DEBUG paho_mqtt::token] Token completed with code: 0
[DEBUG paho_mqtt::token] Expecting server response for: Connect
[DEBUG paho_mqtt::token] Got response: ServerResponse { rsp: Connect(ConnectResponse { server_uri: "xxxxxxxxxxxxx.iot.eu-west-1.amazonaws.com:8883", mqtt_version: 4, session_present: false }), props: Properties { cprops: MQTTProperties { count: 0, max_count: 0, length: 0, array: 0x0 } }, reason_code: Success }
<SNIP>
[DEBUG paho_mqtt::token] Token failure! 0x7fc88fd0be50, 0x70000da6ebf0
[DEBUG paho_mqtt::token] Token failure message: "socket error"
[DEBUG paho_mqtt::token] Token completed with code: -1
[DEBUG paho_mqtt::token] Expecting server response for: Connect
[DEBUG paho_mqtt::token] Got response: ServerResponse { rsp: Connect(ConnectResponse { server_uri: "\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0} \u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0} \u{4}\u{0}.iot.eu-west-1.amazonaws.com:8883", mqtt_version: 4, session_present: false }), props: Properties { cprops: MQTTProperties { count: 0, max_count: 0, length: 0, array: 0x0 } }, reason_code: Success }
iot-gateway(37209,0x70000d057000) malloc: Incorrect checksum for freed object 0x7fc88fd0be40: probably modified after being freed.
Corrupt value: 0x1fffffffffffffff
iot-gateway(37209,0x70000d057000) malloc: *** set a breakpoint in malloc_error_break to debug
Signal: SIGABRT (signal SIGABRT)

Looking closely, you can see that the returned context (raw pointer) is used twice, as both the success and failure callbacks send 0x7fc88fd0be50 as the pointer address. Additionally, you can see that the server_uri in the second call no longer contains the correct URI, but now contains random bits, as the memory was released after the first call.

So now the question is how to solve this. Looking at the C code I found these lines (https://github.com/eclipse/paho.mqtt.c/blob/master/src/MQTTAsync.c#L2795 and https://github.com/eclipse/paho.mqtt.c/blob/master/src/MQTTAsync.c#L2810) and wondered if the solution should be to also set the onFailure callbacks to NULL here? That would prevent reusing the same context in all situations, which AFAIK is what is expected/intended. If that is not the case, and it is actually correct that the context is reused when an already succeeded connection fails, then we need to update the Rust code so it will handle this properly.

Curious to learn to what extent my assessment is correct, and if this would be a proper solution. And of course if not, then super interested to learn what the right solution should be. Thanks!

@fpagliughi
Contributor

I need to look more closely into this, but I am certainly not expecting the same token to get both a success and a failure callback! I am assuming one or the other; after all, how could the same operation both succeed and fail? What would cause this?

And, since this is the C repo, I should explain a bit...
Rust has a thread-safe, reference-counted smart pointer called an Arc, akin to a C++ shared_ptr<>. I create one to hold a Token object to track an operation, and "release" the Arc when I pass it as a context pointer to the C lib for, say, a publish message call. I then re-wrap the Arc on the success/failure call to restore ownership of the raw pointer and release the memory. If the same pointer is sent back to me in two callbacks, I will try to release it twice.

So, from that printout, it does look like the same context pointer, 0x7fc88fd0be50, is being sent back to the app in two separate callbacks.
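
To sketch that shape (simplified, with hypothetical names rather than the exact library code):

```rust
use std::ffi::c_void;
use std::sync::Arc;

struct TokenInner {
    // completion state, result code, waker, etc.
}

impl TokenInner {
    fn on_complete(&self, rc: i32) {
        println!("token completed with code: {}", rc);
    }
}

// Both callbacks receive the same context pointer that was produced by
// Arc::into_raw() when the operation was started. Each one re-wraps it,
// which takes ownership back and frees the Token when the Arc drops.
unsafe extern "C" fn on_success(context: *mut c_void) {
    let token = Arc::from_raw(context as *const TokenInner);
    token.on_complete(0);
}

unsafe extern "C" fn on_failure(context: *mut c_void) {
    let token = Arc::from_raw(context as *const TokenInner);
    token.on_complete(-1);
}

fn main() {
    let ctx = Arc::into_raw(Arc::new(TokenInner {})) as *mut c_void;
    unsafe { on_success(ctx) };   // CONNACK arrives: token released here
    // unsafe { on_failure(ctx) }; // later socket error on the same token:
    //                             // second from_raw() on freed memory -> UB
}
```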

@svanharmelen
Contributor Author

but certainly I am not expecting the same token to get both a success and a failure callback

Yeah, that was my idea as well, yet there doesn't seem to be a way to set a new token on an existing connection (once the initial success callback is made) which could then be used if a later connection issue triggers a call to the failure callback.

So the biggest question for me is whether (from the C code's point of view) a token should never be reused, or whether a token (context) can indeed be reused and the caller should be able to handle that. @icraggs do you maybe know the answer to that question?

Knowing that (and understanding the intended logic) tells me what to look at next and where to try to solve this problem. It is still blocking us from moving forward, so I'm very willing to help find a solution. But I need a little more info in order to continue my search and focus on the correct Rust or C code...

@icraggs
Contributor

icraggs commented Oct 19, 2020

It seems right that we should NULL out both success and failure pointers when either is called for a connect.

@svanharmelen
Contributor Author

Oops, you already added a fix for this in the develop branch... Missed that 😏 I'll close my PR again and will do some testing with the develop branch.

I was actually already using the develop branch for the connection callback fix which is also in there, so I'm wondering if you already have any idea or plan for making a v1.3.7 release?

Thanks!

@icraggs icraggs added this to the 1.3.7 milestone Oct 21, 2020
@icraggs icraggs changed the title Double free error when using the async client with QoS=0 Connect onSuccess/onFailure can be called more than once Oct 21, 2020
@icraggs icraggs added the bug label Oct 21, 2020
@svanharmelen
Contributor Author

I tested the fix that is currently in the develop branch and I can confirm that it solves the problem. Before I was able to trigger the segmentation fault really consistently, but using the updated code I wasn't able to trigger it anymore. Thanks again for the help!
