Connect onSuccess/onFailure can be called more than once #972
Comments
yeah memory mgmt in C is really hard to get right. would be a super improvement if the project moved on from strict C only and started leveraging smart pointers and lambda callbacks imho. almost no one on any platform, even embedded, is limited to old-school C anymore.
Well I don't have an opinion towards the choice to use C, I just lack the C skills to debug and fix this myself 😏 But on topic... I just tried some more things and noticed that I can also reproduce the same error with other settings. In all cases using the async client and in all cases right after sending the first message (see the traces I posted earlier). Hope this helps and again please let me know how I can assist in moving this forward. Thanks! /CC @fpagliughi this seems related to the C code, but since I call it from async/await Rust code it's probably good to let you know I have this issue.
I think there is a fix for this in 1.3.5. Use 1.3.6 if you can, to try it out.
Hi @icraggs! I will try 1.3.6 again just to be sure, but I tested both 1.3.2 and master already... Yet it cannot hurt to test it again, so I will give it another try tomorrow to be 100% sure.
Oh yes, you did say you tried the latest. I was looking at the trace, which says 1.3.2. It is still true that there is a fix for this issue in 1.3.5 - maybe it's a different case, but on the basis of the gdb trace you provided, the fix is aimed at exactly this case (it fixed it for me). If you can provide a gdb backtrace of 1.3.6, maybe it will show something different. In addition, if you can provide a Rust program which shows the issue, I'm prepared to attempt to run it.
Sounds good! I will start by making a trace using 1.3.6.
If there's a simple test case in Rust, I can try a translation into C if that helps any.
Hmm... So I doubt if I actually tested 1.3.6 before. I did indeed still get an error, but when having a closer look (using gdb) it seems to be a different error than before. This time it actually seems to be related to the Rust bindings, most likely from a call made using unsafe Rust. But to be honest I'm not 100% sure where this issue belongs now, as I also see lines like:
@fpagliughi maybe you can judge if this issue needs to remain here or if it should be moved to the Rust repo? EDIT: Used
This one "seems" related: eclipse-paho/paho.mqtt.rust#46. I used a custom-compiled version of the MQTT 1.3.6 lib and used master (well, actually my SSL branch) of the Rust crate.
It could be that the underlying problem in the C library was fixed in the last release, but now there's a mismatch in the Rust/C bindings for the 1.3.6 release. This is a "problem" in the Rust lib, but it is a totally expected one, and documented in the Rust readme. Any change in any mapped C struct is considered a breaking change by Rust. This is how the C lib extends the API without breaking it (extending versioned structs), but the Rust lib can't handle this automatically. Each Rust release is tagged to a specific C release. There is a Rust release (v0.8) due out in a few weeks that will support C v1.3.6. I will start updating and testing shortly, so keep a watch for those commits that cover support for 1.3.6.
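For readers not familiar with the C headers, the versioned-struct pattern being referred to looks roughly like this (an abbreviated, illustrative sketch, not the full MQTTAsync_connectOptions definition):

```c
#include "MQTTAsync.h"

/* Abbreviated sketch of how the C library versions its public structs.
 * New fields are only appended and struct_version is bumped, so existing
 * C callers keep working, but a binding that mirrors the struct layout
 * byte-for-byte (like the Rust crate) has to be regenerated per C release. */
typedef struct
{
	char struct_id[4];      /* eyecatcher identifying the struct type */
	int struct_version;     /* bumped whenever new fields are appended */
	int keepAliveInterval;
	int cleansession;
	MQTTAsync_onSuccess* onSuccess;   /* connect success callback */
	MQTTAsync_onFailure* onFailure;   /* connect failure callback */
	void* context;                    /* passed back to both callbacks */
	/* ... newer versions append more fields below this point ... */
} example_connect_options;
```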
Yes, we talked about that earlier. Would be very nice if eclipse-paho/paho.mqtt.rust#92 could be part of v0.8 😉
That is very good to hear, since it clearly seems to fix some issues 👍🏻
I most certainly will, as this currently kind of blocks us... Thanks for the info and feedback so far and let me know if I can help with anything or should test anything! Thx...
After updating the Rust bindings (eclipse-paho/paho.mqtt.rust#100) and testing my code using all the latest and greatest, the original error for which I opened this issue is no longer there 👍🏻 But unfortunately I still get the same error that was logged in the second gdb trace gdb-1_3_6.log, so essentially this:

20201013 153416.743 Calling connect failure for client test-app
[DEBUG paho_mqtt::token] Token failure! 0x555557270930, 0x7fffdebe1730
[DEBUG paho_mqtt::token] Token failure message: "socket error"
[DEBUG paho_mqtt::token] Token completed with code: -1
Thread 13 "test-app-worker" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdebe2700 (LWP 16284)]
__GI___pthread_mutex_lock (mutex=0x455030641304331) at ../nptl/pthread_mutex_lock.c:67
67 unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
(gdb) bt
#0 __GI___pthread_mutex_lock (mutex=0x455030641304331) at ../nptl/pthread_mutex_lock.c:67
#1 0x000055555600d45d in std::sys::unix::mutex::Mutex::lock (self=0x455030641304331)
at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/mutex.rs:63
#2 0x00005555560076ba in std::sys_common::mutex::Mutex::raw_lock (self=0x455030641304331)
at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/mutex.rs:42
#3 0x0000555556005366 in std::sync::mutex::Mutex<T>::lock (self=0x555557270930) at /home/sander/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/mutex.rs:269
#4 0x0000555555fe5321 in paho_mqtt::token::TokenInner::on_complete (self=0x555557270930, msgid=0, rc=-1, err_msg=..., rsp=0x0) at /home/sander/paho.mqtt.rust/src/token.rs:324
#5 0x0000555555fe3931 in paho_mqtt::token::TokenInner::on_failure (context=0x555557270930, rsp=0x7fffdebe1730) at /home/sander/paho.mqtt.rust/src/token.rs:233
#6 0x00007ffff7b25f29 in nextOrClose (m=m@entry=0x5555573d58a8, rc=-1, message=message@entry=0x7ffff7b378b2 "socket error") at src/MQTTAsync.c:2275
#7 0x00007ffff7b268c8 in MQTTAsync_receiveThread (n=<optimized out>) at src/MQTTAsync.c:2739
#8 0x00007ffff6c2f40b in start_thread (arg=0x7fffdebe2700) at pthread_create.c:465
#9 0x00007ffff6753e7f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Now I've spent some time to better understand the code (both the FFI Rust bindings and the C code) and eventually found the issue. It's a long story (at least for me it was) about raw pointers in Rust and how they are handed to the C library as the callback context. Now I understand that you probably already understand all these details, yet I thought to just write it out for clarity.

So in our test setup I run a couple of things at the same time, and one of our backends is IoT-Core from AWS. IoT-Core will not allow the same client to connect twice, so whenever that happens the new client is connected and at the same time the already connected client will be forcefully disconnected. Since I'm running multiple tests and sometimes forgot to kill an older running client, this actually happened a lot, and it turned out it was a perfect test case to trigger the issue I saw in the logs. So here is a snippet of the log to show what happens:
Looking closely you see that the returned context (raw pointer) is used twice, as both the success and failure callbacks send back the same pointer. So now the question is how to solve this. Looking at the C code I found these lines (https://github.com/eclipse/paho.mqtt.c/blob/master/src/MQTTAsync.c#L2795 and https://github.com/eclipse/paho.mqtt.c/blob/master/src/MQTTAsync.c#L2810) and wondered if the solution should be to also set the remaining callback pointer to NULL there. Curious to learn to what extent my assessment is correct, and if this would be a proper solution. And of course if not, then I'm super interested to learn what the right solution should be. Thanks!
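To make the failure mode concrete in C terms, here is an illustrative application-side sketch (not the actual Rust binding code; the struct and function names are hypothetical): if the application treats the connect context as something that is handed to exactly one callback and freed there, a second delivery of the same pointer is a use-after-free, which matches the pthread_mutex_lock crash in the backtrace above.

```c
#include <stdlib.h>
#include <pthread.h>
#include "MQTTAsync.h"

/* Hypothetical per-connect context, analogous to the Rust TokenInner:
 * it owns a mutex and is freed by whichever callback consumes it. */
struct conn_ctx {
	pthread_mutex_t lock;
	int completed;
};

static void conn_ctx_consume(struct conn_ctx* ctx, int rc)
{
	pthread_mutex_lock(&ctx->lock);   /* crashes here if ctx was already freed */
	ctx->completed = 1;
	(void)rc;
	pthread_mutex_unlock(&ctx->lock);
	pthread_mutex_destroy(&ctx->lock);
	free(ctx);                        /* ownership ends with the first delivery */
}

static void on_connect_success(void* context, MQTTAsync_successData* response)
{
	(void)response;
	conn_ctx_consume((struct conn_ctx*)context, 0);
}

static void on_connect_failure(void* context, MQTTAsync_failureData* response)
{
	/* If the library later also reports "socket error" for the SAME connect,
	 * this runs against memory that on_connect_success already freed. */
	conn_ctx_consume((struct conn_ctx*)context, response ? response->code : -1);
}
```

The Rust bindings do the equivalent via reference counting (the callback reclaims the Arc that was turned into the raw context pointer), which would explain why the backtrace shows the crash inside Mutex::lock in TokenInner::on_complete.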
I need to look closer into this, but certainly I am not expecting the same token to get both a success and a failure callback! I am assuming one or the other; after all, how could the same operation both succeed and fail? What would cause this? And, since this is the C repo, I should explain a bit... So, from that printout, it does look like the same context pointer, 0x7fc88fd0be50, is being sent back to the app in two separate callbacks.
Yeah that was my idea as well, yet there doesn't seem to be a way to set a new token on an existing connection (once an initial connect has been made). So the biggest question for me is whether (from the C code point of view) a token should never be reused, or whether a token (context) can indeed be reused and the caller should be able to handle that. @icraggs do you maybe know the answer to that question? Knowing that (and knowing/understanding the intended logic) tells me what to look at next and where to try and solve this problem. It is still blocking us from moving forward, so I'm very willing to help find a solution. But I need a little more info in order to continue my search and focus on the correct Rust or C code...
It seems right that we should NULL out both success and failure pointers when either is called for a connect.
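A condensed sketch of that idea (simplified; the internal MQTTAsyncs/m->connect naming follows MQTTAsync.c, but this is not the exact patch that landed on the develop branch):

```c
/* Deliver the connect failure callback at most once, then clear BOTH
 * callback pointers so no later code path (timeout, socket error, ...)
 * can fire the other one against the same, already consumed, context. */
static void deliver_connect_failure(MQTTAsyncs* m, int rc, const char* message)
{
	if (m->connect.onFailure)
	{
		MQTTAsync_failureData data;

		data.token = 0;
		data.code = rc;
		data.message = message;
		(*(m->connect.onFailure))(m->connect.context, &data);
	}
	m->connect.onSuccess = NULL;   /* the actual point of the fix */
	m->connect.onFailure = NULL;
}
```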
Oops, you already added a fix for this in the develop branch... Missed that 😏 I'll close my PR again and will do some testing with the develop branch. I was actually already using the develop branch for the connection callback fix which is also in there, so I'm wondering if you already have any idea or plan for making a v1.3.7 release? Thanks!
I tested the fix that is currently in the develop branch and I can confirm that it solves the problem. Before, I was able to trigger the segmentation fault really consistently, but using the updated code I wasn't able to trigger it anymore. Thanks again for the help!
I have a consistent crash in my program which started occurring when I switched from using QoS=1 to QoS=0. I'm using the async client by means of the paho.mqtt.rust crate (using the master branch). While I understand that the Rust crate is a project of its own, I traced the problem back to the underlying paho.mqtt.c package.

I tried it with the version bundled by the Rust crate (1.3.2), as well as with the latest version (1.3.6) built locally by myself. Both give the exact same traces.
My Rust is pretty decent, but my C is not. So it's a challenge for me to further debug this myself, hence I'm hoping this could be checked by someone with some more C skills.
I have no clear way to reproduce the issue standalone, but as you can see in the logs (a trace log using the environment variable as described in the readme, and a GDB backtrace) it actually happens right after the first publish:
Trace log
GDB log
Please let me know if there is anything I can/should test that might help understand the problem better. Thanks!
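Since a C translation of the test case was offered above, here is roughly what the failing scenario could look like against the plain C async API (a sketch only: the broker URI, client ID, and topic are placeholders, and error handling is omitted):

```c
#include <stdio.h>
#include <unistd.h>
#include "MQTTAsync.h"

static void on_send(void* context, MQTTAsync_successData* response)
{
	(void)context; (void)response;
	printf("publish (QoS 0) confirmed\n");
}

static void on_send_failure(void* context, MQTTAsync_failureData* response)
{
	(void)context;
	printf("publish failed, rc=%d\n", response ? response->code : -1);
}

static void on_connect(void* context, MQTTAsync_successData* response)
{
	MQTTAsync client = (MQTTAsync)context;
	MQTTAsync_responseOptions opts = MQTTAsync_responseOptions_initializer;
	MQTTAsync_message msg = MQTTAsync_message_initializer;
	(void)response;

	opts.onSuccess = on_send;
	opts.onFailure = on_send_failure;
	opts.context = client;

	msg.payload = "hello";
	msg.payloadlen = 5;
	msg.qos = 0;          /* QoS 0, as in the report */
	msg.retained = 0;

	/* the crash was reported right after the first publish like this one */
	MQTTAsync_sendMessage(client, "test/topic", &msg, &opts);
}

static void on_connect_failure(void* context, MQTTAsync_failureData* response)
{
	(void)context;
	printf("connect failed, rc=%d\n", response ? response->code : -1);
}

int main(void)
{
	MQTTAsync client;
	MQTTAsync_connectOptions conn_opts = MQTTAsync_connectOptions_initializer;

	MQTTAsync_create(&client, "tcp://localhost:1883", "test-app",
			MQTTCLIENT_PERSISTENCE_NONE, NULL);

	conn_opts.keepAliveInterval = 20;
	conn_opts.cleansession = 1;
	conn_opts.onSuccess = on_connect;
	conn_opts.onFailure = on_connect_failure;
	conn_opts.context = client;

	MQTTAsync_connect(client, &conn_opts);
	sleep(10);                       /* crude wait for the async callbacks */
	MQTTAsync_destroy(&client);
	return 0;
}
```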