
c8y mapper not starting up reliably #3279

Closed · reey opened this issue Dec 4, 2024 · 7 comments

Labels: bug (Something isn't working) · Assignee: Bravo555 · Milestone: 1.4.0

reey (Contributor) commented Dec 4, 2024

Describe the bug
Whenever I restart my machine or restart the c8y mapper service (sudo systemctl restart tedge-mapper-c8y.service), it does not reliably start up again. The MQTT messages on the te/# topics are not processed, so no messages are actually published to the c8y/# topics.

I've kept the mapper in this non-functioning state for ~8 hours to verify whether it comes back online at some point, but it does not.
About 1 out of 5 startup attempts of the service succeeds for me. I had the feeling that it started up a bit more reliably with debug logging enabled for the mapper. Maybe some race condition? I've also got 82 child devices connected right now. I'm not sure what the mapper is doing on startup, but maybe the number of devices is triggering a race condition?

To Reproduce

  • Restart the c8y mapper using sudo systemctl restart tedge-mapper-c8y.service
  • Verify whether MQTT messages are still sent to c8y/# (one way to check this is sketched below)
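One way to check this from the shell is `mosquitto_sub -t 'c8y/#' -v`. Below is a minimal equivalent in Rust using the rumqttc client crate, as an illustrative sketch only (not part of thin-edge; it assumes the default local mosquitto listener on 127.0.0.1:1883):

```rust
// Minimal local subscriber to check whether the mapper publishes
// anything towards the cloud bridge (c8y/#) after a restart.
use rumqttc::{Client, Event, MqttOptions, Packet, QoS};
use std::time::Duration;

fn main() {
    let mut opts = MqttOptions::new("c8y-probe", "127.0.0.1", 1883);
    opts.set_keep_alive(Duration::from_secs(5));

    let (client, mut connection) = Client::new(opts, 10);
    client.subscribe("c8y/#", QoS::AtLeastOnce).unwrap();

    // Print every message published on c8y/#; if the mapper is healthy,
    // mapped messages should show up here shortly after the restart.
    for event in connection.iter() {
        if let Ok(Event::Incoming(Packet::Publish(p))) = event {
            println!("[{}] {}", p.topic, String::from_utf8_lossy(&p.payload));
        }
    }
}
```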

Expected behavior
Restarting the tedge-mapper-c8y service should not cause any issues, and the mapping functionality should resume afterwards.

Environment (please complete the following information):

| Property | Value |
|---|---|
| OS [incl. version] | Debian GNU/Linux 12 (bookworm) |
| Hardware [incl. revision] | Raspberry Pi 4 Model B Rev 1.2 |
| System-Architecture | Linux raspberrypi 6.6.62+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux |
| thin-edge.io version | tedge 1.3.1 |
| mosquitto version | 2.0.11 |

Additional context
I've got about 82 child devices connected to my thin-edge instance.

I've attached an encrypted debug log file: thin-edge-c8y-mapper.age.log

@reey reey added the bug Something isn't working label Dec 4, 2024
@Bravo555 Bravo555 self-assigned this Dec 4, 2024
Bravo555 (Contributor) commented Dec 4, 2024

From looking at the log, it looks like mosquitto can't establish a bridge between the local broker and the cloud:

  • there are MQTT pub and MQTT recv logs for te/... topics, but for c8y/... topics there are only MQTT pub lines. So we're successfully publishing on local c8y/... topics, but the bridge to c8y can't be established; that's probably why the messages we publish aren't being sent to the cloud and why we're not getting any messages from the cloud either.

Please also share /var/log/mosquitto/mosquitto.log and /etc/tedge/tedge.toml

reey (Contributor, Author) commented Dec 4, 2024

there are only MQTT pub lines, so we're successfully publishing on local c8y/... topics, but the bridge can't be established with c8y

But then I should be able to see these published messages on the local broker when subscribing to c8y/#, right? But I do not.

I've attached the requested files. The mapper log file below is from the same timeframe in which I grabbed the mosquitto log.
mosquitto.age.log
thin-edge-c8y-mapper.age.log
tedge.age.toml.txt

reey (Contributor, Author) commented Dec 4, 2024

So I'm actually able to see traffic on the c8y/# topics during the startup of the service, and also responses from the cloud, if I subscribe in parallel to the startup of the service instead of after it. For example, the whole child device creation:

[c8y/s/us] 101,tedge-zigbee-0x2c1165fffe7d8d4d,0x2c1165fffe7d8d4d,zigbee2mqtt
[c8y/s/e] 41,101,Device with serial 'tedge-zigbee-0x2c1165fffe7d8d4d' already existing

and also the JWT is correctly retrieved beforehand:

[c8y/s/uat] 
[c8y/s/dat] 71,<redacted>

But nothing further afterwards.

Bravo555 (Contributor) commented Dec 5, 2024

But nothing further afterwards.

After replicating your child devices, I am able to reproduce the issue.
Also, after SIGINT, the following actors fail to shut down:

2024-12-05T01:55:28.443635893Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:225: Timeout waiting for all actors to shutdown    
2024-12-05T01:55:28.443651833Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:227: Failed to shutdown: CumulocityMapper-7    
2024-12-05T01:55:28.443655881Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:227: Failed to shutdown: AvailabilityActor-12    
2024-12-05T01:55:28.443658937Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:227: Failed to shutdown: Timer-6    

So it is indeed most likely a bug with thin-edge, and not the bridge. Trying to find the cause now.

Bravo555 (Contributor) commented Dec 9, 2024

A cause was identified: due to an accidental deadlock, the outgoing MQTT message queue wasn't getting drained.
Working on a fix now.

Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 9, 2024
Because the C8yMapperActor synchronously first receives a message from
the input channel and then produces output by sending to the output
channel, if one of these channels blocks, the other one will also not
be processed.

In this case, there was a loop: C8yMapperActor sent `MqttMessage`s to
AvailabilityActor, and AvailabilityActor sent `PublishMessage`s to
C8yMapperActor for it to relay them to MqttActor.

When the channel for `PublishMessage`s was full while AvailabilityActor
tried sending another message, it would block, so until this message was
processed, AvailabilityActor wasn't processing new input. However, if
C8yMapperActor, before consuming this output, had another message to
relay to AvailabilityActor, and that inbound channel was also full, this
would result in a deadlock where neither AvailabilityActor's input nor
output could be sent through.

Signed-off-by: Marcel Guzik <[email protected]>
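The cycle described in this commit message can be reproduced in miniature with two tasks and two bounded channels. The following is an illustrative sketch using tokio's mpsc channels, not the actual thin-edge actor code; all names are invented, and the capacities are deliberately tiny so the deadlock is easy to hit:

```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let (mapper_tx, mut mapper_rx) = mpsc::channel::<String>(1); // "C8yMapperActor" inbox
    let (avail_tx, mut avail_rx) = mpsc::channel::<String>(1);   // "AvailabilityActor" inbox

    // "C8yMapperActor": receive one message, then forward it.
    tokio::spawn(async move {
        while let Some(msg) = mapper_rx.recv().await {
            // Blocks when the availability inbox is full...
            avail_tx.send(format!("MqttMessage({msg})")).await.unwrap();
        }
    });

    // "AvailabilityActor": receive one message, then reply to the mapper.
    let reply_tx = mapper_tx.clone();
    tokio::spawn(async move {
        while let Some(msg) = avail_rx.recv().await {
            // ...and this blocks when the mapper inbox is full. With both
            // sends blocked, neither loop ever reaches recv() again.
            reply_tx.send(format!("PublishMessage({msg})")).await.unwrap();
        }
    });

    // Flood the mapper inbox from outside, like many child devices
    // registering at startup.
    tokio::spawn(async move {
        for i in 0..100 {
            let _ = mapper_tx.send(format!("te message {i}")).await;
        }
    });

    sleep(Duration::from_secs(1)).await;
    // By now each actor is almost certainly parked in send(), waiting
    // on the other: the message flow has wedged, just like the mapper.
    println!("actors are wedged in send(): deadlock");
}
```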
@reubenmiller reubenmiller added this to the 1.4.0 milestone Dec 9, 2024
Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 10, 2024
This test is meant to verify the fix to thin-edge#3279.

Another test like this will probably have to be added to cover the
original behaviour that motivated connecting AvailabilityActor to the
C8yMapperActor instead of to MqttActor directly, i.e. AvailabilityActor
sending a SmartREST 117 message before C8yMapperActor sends the 101
message that registers the child device (which is invalid and should
not happen).

Signed-off-by: Marcel Guzik <[email protected]>
Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 10, 2024
Because the C8yMapperActor synchronously first receives a message from
the input channel, and then produces output by sending to the output
channel, if one of these channels block, the other one will also not be
processed.

In this case, there was a loop: C8yMapperActor sent `MqttMessage`s to
AvailabilityActor, and AvailabilityActor send `PublishMessage`s to
C8yMapperActor for it to relay it to MqttActor.

When a sender for `PublishMessage`s was full when AvailabilityActor
tried sending another message, it would block, so until this message was
processed, AvailabilityActor wasn't processing now input. However, if
before this output was sent by C8yMapperActor there was another message
to relay to AvailabilityActor, and the inbound channel was also full,
this would result in a deadlock where neither AvailabilityActor input or
output could be sent through.

Signed-off-by: Marcel Guzik <[email protected]>
Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 12, 2024
Solve the deadlock that can arise when C8yMapperActor can't complete
sending an MqttMessage to some other actor because that other actor
can't complete the send of a `PublishMessage` to C8yMapperActor, by
creating a fast path to drain the `PublishMessage` buffer ASAP.

Previously, when `PublishMessage`s were part of the main message box,
we could not prioritize them because we had to process messages in
order; separating out the receiver for `PublishMessage`s means that no
other messages can block `PublishMessage`s, and allows us to use
select! to concurrently process them when processing other events
blocks.

Signed-off-by: Marcel Guzik <[email protected]>
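The shape of that fast path can be sketched as follows. This is a simplified illustration using tokio, not the actual thin-edge code; all names and signatures are invented for the example. The point is that while the mapper's own outgoing send is blocked, select! keeps draining the dedicated `PublishMessage` receiver, so the peer actor can always complete its send:

```rust
use std::pin::pin;
use tokio::sync::mpsc;

async fn mapper_loop(
    mut main_rx: mpsc::Receiver<String>,    // regular inbox, processed in order
    mut publish_rx: mpsc::Receiver<String>, // fast path: PublishMessages to relay
    avail_tx: mpsc::Sender<String>,         // output towards "AvailabilityActor"
    mqtt_tx: mpsc::Sender<String>,          // output towards "MqttActor"
) {
    while let Some(msg) = main_rx.recv().await {
        // Forward the mapped message, but don't let a full peer inbox
        // stop us from relaying PublishMessages in the meantime.
        let mut forward = pin!(avail_tx.send(format!("MqttMessage({msg})")));
        loop {
            tokio::select! {
                res = &mut forward => {
                    let _ = res;
                    break; // the (possibly blocked) send completed
                }
                Some(publish) = publish_rx.recv() => {
                    // Fast path: relay immediately, breaking the
                    // send/recv cycle that caused the deadlock.
                    let _ = mqtt_tx.send(publish).await;
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (main_tx, main_rx) = mpsc::channel(1);
    let (publish_tx, publish_rx) = mpsc::channel(1);
    let (avail_tx, _avail_rx) = mpsc::channel(1); // nobody drains this inbox
    let (mqtt_tx, mut mqtt_rx) = mpsc::channel(8);
    tokio::spawn(mapper_loop(main_rx, publish_rx, avail_tx, mqtt_tx));

    // The second forward blocks forever: the availability inbox
    // (capacity 1) is never drained...
    main_tx.send("a".into()).await.unwrap();
    main_tx.send("b".into()).await.unwrap();
    // ...yet a PublishMessage still gets relayed thanks to the fast path.
    publish_tx.send("PublishMessage(117)".into()).await.unwrap();
    println!("relayed: {}", mqtt_rx.recv().await.unwrap());
}
```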
github-merge-queue bot pushed a commit that referenced this issue Dec 13, 2024
fix #3279: Create a fast path for sending `PublishMessage`s
reubenmiller (Contributor) commented

@reey We've merged a fix for this. We'll include it in a new official release next week; however, it might be worth verifying for yourself by installing thin-edge.io from our main release channel. You can easily do this by executing the following one-liner:

wget -O - thin-edge.io/install.sh | sh -s -- --channel main

Or if you are a curl user:

curl -fsSL https://thin-edge.io/install.sh | sh -s -- --channel main

reey (Contributor, Author) commented Dec 14, 2024

I gave it a try and it seems to be working fine 👍
Thanks @Bravo555 and @reubenmiller
