
c8y mapper not starting up reliably #3279

Closed · reey opened this issue Dec 4, 2024 · 7 comments

Labels: bug (Something isn't working) · Assignee: Bravo555 · Milestone: 1.4.0

reey (Contributor) commented Dec 4, 2024

Describe the bug
Whenever I restart my machine or restart the c8y mapper service (sudo systemctl restart tedge-mapper-c8y.service), it does not reliably start up again. The MQTT messages on the te/# topics are not processed, so no messages are actually published to the c8y/# topics.

I've kept the mapper in this non-functioning state for ~8 hours to verify whether it comes back online at some point, but it does not.
About 1 out of 5 startup attempts of the service succeeds for me. I had the feeling that it started up a bit more reliably with debug logging enabled for the mapper. Maybe some race condition? I've also got 82 child devices connected right now. I'm not sure what the mapper is doing on startup, but maybe the number of devices is triggering a race condition?

To Reproduce

  • Restart the c8y mapper using sudo systemctl restart tedge-mapper-c8y.service
  • Verify whether MQTT messages are still sent to c8y/# (one way to check this is sketched below)
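One way to check this from the shell is `mosquitto_sub -t 'c8y/#' -v`. Below is a minimal equivalent in Rust using the rumqttc client crate, as an illustrative sketch only (not part of thin-edge; it assumes the default local mosquitto listener on 127.0.0.1:1883):

```rust
// Minimal local subscriber to check whether the mapper publishes
// anything towards the cloud bridge (c8y/#) after a restart.
use rumqttc::{Client, Event, MqttOptions, Packet, QoS};
use std::time::Duration;

fn main() {
    let mut opts = MqttOptions::new("c8y-probe", "127.0.0.1", 1883);
    opts.set_keep_alive(Duration::from_secs(5));

    let (client, mut connection) = Client::new(opts, 10);
    client.subscribe("c8y/#", QoS::AtLeastOnce).unwrap();

    // Print every message published on c8y/#; if the mapper is healthy,
    // mapped messages should show up here shortly after the restart.
    for event in connection.iter() {
        if let Ok(Event::Incoming(Packet::Publish(p))) = event {
            println!("[{}] {}", p.topic, String::from_utf8_lossy(&p.payload));
        }
    }
}
```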

Expected behavior
Restarting the tedge-mapper-c8y service should not cause any issues, and the mapping functionality should resume afterwards.

Environment (please complete the following information):

| Property | Value |
|---|---|
| OS [incl. version] | Debian GNU/Linux 12 (bookworm) |
| Hardware [incl. revision] | Raspberry Pi 4 Model B Rev 1.2 |
| System-Architecture | Linux raspberrypi 6.6.62+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux |
| thin-edge.io version | tedge 1.3.1 |
| mosquitto version | 2.0.11 |

Additional context
I've got about 82 child devices connected to my thin-edge instance.

I've attached an encrypted debug log file: thin-edge-c8y-mapper.age.log

@reey reey added the bug Something isn't working label Dec 4, 2024
@Bravo555 Bravo555 self-assigned this Dec 4, 2024
Bravo555 (Contributor) commented Dec 4, 2024

From looking at the log, it looks like mosquitto can't establish a bridge between the local broker and the cloud:

  • there are MQTT pub and MQTT recv logs for te/... topics, but for c8y/... topics there are only MQTT pub lines. So we're successfully publishing on local c8y/... topics, but the bridge to c8y can't be established; that's probably why the messages we publish aren't being sent to the cloud and why we're not getting any messages from the cloud either.

Please also share /var/log/mosquitto/mosquitto.log and /etc/tedge/tedge.toml

reey (Contributor, Author) commented Dec 4, 2024

there are only MQTT pub lines, so we're successfully publishing on local c8y/... topics, but the bridge can't be established with c8y

But then I should be able to see these published messages on the local broker when subscribing to c8y/#, right? But I do not.

I've attached the requested files. The mapper log file below is from the same timeframe in which I grabbed the mosquitto log.
mosquitto.age.log
thin-edge-c8y-mapper.age.log
tedge.age.toml.txt

reey (Contributor, Author) commented Dec 4, 2024

So I'm actually able to see traffic on the c8y/# topics during the startup of the service, and also responses from the cloud, if I subscribe in parallel to the startup of the service instead of after it. For example, the whole child device creation:

[c8y/s/us] 101,tedge-zigbee-0x2c1165fffe7d8d4d,0x2c1165fffe7d8d4d,zigbee2mqtt
[c8y/s/e] 41,101,Device with serial 'tedge-zigbee-0x2c1165fffe7d8d4d' already existing

and also the JWT is correctly retrieved beforehand:

[c8y/s/uat] 
[c8y/s/dat] 71,<redacted>

But nothing further afterwards.

Bravo555 (Contributor) commented Dec 5, 2024

But nothing further afterwards.

After replicating your child devices, I am able to reproduce the issue.
Also, after SIGINT, the following actors fail to shut down:

2024-12-05T01:55:28.443635893Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:225: Timeout waiting for all actors to shutdown    
2024-12-05T01:55:28.443651833Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:227: Failed to shutdown: CumulocityMapper-7    
2024-12-05T01:55:28.443655881Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:227: Failed to shutdown: AvailabilityActor-12    
2024-12-05T01:55:28.443658937Z ERROR Runtime: crates/core/tedge_actors/src/runtime.rs:227: Failed to shutdown: Timer-6    

So it is indeed most likely a bug with thin-edge, and not the bridge. Trying to find the cause now.

Bravo555 (Contributor) commented Dec 9, 2024

A cause was identified: due to an accidental deadlock, the outgoing MQTT message queue wasn't getting drained.
Working on a fix now.

Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 9, 2024
Because the C8yMapperActor synchronously first receives a message from
the input channel and then produces output by sending to the output
channel, if one of these channels blocks, the other one will also not
be processed.

In this case, there was a loop: C8yMapperActor sent `MqttMessage`s to
AvailabilityActor, and AvailabilityActor sent `PublishMessage`s to
C8yMapperActor for it to relay them to MqttActor.

When the channel for `PublishMessage`s was full while AvailabilityActor
tried sending another message, it would block, so until this message was
processed, AvailabilityActor wasn't processing new input. However, if
C8yMapperActor, before consuming this output, had another message to
relay to AvailabilityActor, and that inbound channel was also full, this
would result in a deadlock where neither AvailabilityActor's input nor
output could be sent through.

Signed-off-by: Marcel Guzik <[email protected]>
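The cycle described in this commit message can be reproduced in miniature with two tasks and two bounded channels. The following is an illustrative sketch using tokio's mpsc channels, not the actual thin-edge actor code; all names are invented, and the capacities are deliberately tiny so the deadlock is easy to hit:

```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let (mapper_tx, mut mapper_rx) = mpsc::channel::<String>(1); // "C8yMapperActor" inbox
    let (avail_tx, mut avail_rx) = mpsc::channel::<String>(1);   // "AvailabilityActor" inbox

    // "C8yMapperActor": receive one message, then forward it.
    tokio::spawn(async move {
        while let Some(msg) = mapper_rx.recv().await {
            // Blocks when the availability inbox is full...
            avail_tx.send(format!("MqttMessage({msg})")).await.unwrap();
        }
    });

    // "AvailabilityActor": receive one message, then reply to the mapper.
    let reply_tx = mapper_tx.clone();
    tokio::spawn(async move {
        while let Some(msg) = avail_rx.recv().await {
            // ...and this blocks when the mapper inbox is full. With both
            // sends blocked, neither loop ever reaches recv() again.
            reply_tx.send(format!("PublishMessage({msg})")).await.unwrap();
        }
    });

    // Flood the mapper inbox from outside, like many child devices
    // registering at startup.
    tokio::spawn(async move {
        for i in 0..100 {
            let _ = mapper_tx.send(format!("te message {i}")).await;
        }
    });

    sleep(Duration::from_secs(1)).await;
    // By now each actor is almost certainly parked in send(), waiting
    // on the other: the message flow has wedged, just like the mapper.
    println!("actors are wedged in send(): deadlock");
}
```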
@reubenmiller reubenmiller added this to the 1.4.0 milestone Dec 9, 2024
Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 10, 2024
This test is meant to verify the fix to thin-edge#3279.

Another test like this will probably have to be added to cover the
original behaviour that motivated connecting AvailabilityActor to the
C8yMapperActor instead of to MqttActor directly, i.e. AvailabilityActor
sending a SmartREST 117 message before C8yMapperActor sends the 101
message that registers the child device (which is invalid and should
not happen).

Signed-off-by: Marcel Guzik <[email protected]>
Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 10, 2024
Because the C8yMapperActor synchronously first receives a message from
the input channel, and then produces output by sending to the output
channel, if one of these channels block, the other one will also not be
processed.

In this case, there was a loop: C8yMapperActor sent `MqttMessage`s to
AvailabilityActor, and AvailabilityActor send `PublishMessage`s to
C8yMapperActor for it to relay it to MqttActor.

When a sender for `PublishMessage`s was full when AvailabilityActor
tried sending another message, it would block, so until this message was
processed, AvailabilityActor wasn't processing now input. However, if
before this output was sent by C8yMapperActor there was another message
to relay to AvailabilityActor, and the inbound channel was also full,
this would result in a deadlock where neither AvailabilityActor input or
output could be sent through.

Signed-off-by: Marcel Guzik <[email protected]>
Bravo555 added a commit to Bravo555/thin-edge.io that referenced this issue Dec 12, 2024
Solve the deadlock that can arise when C8yMapperActor can't complete
sending an MqttMessage to some other actor because that other actor
can't complete the send of a `PublishMessage` to C8yMapperActor, by
creating a fast path to drain the `PublishMessage` buffer ASAP.

Previously, when `PublishMessage`s were part of the main message box,
we could not prioritize them because we had to process messages in
order; separating out the receiver for `PublishMessage`s means that no
other messages can block `PublishMessage`s, and allows us to use
select! to concurrently process them when processing other events
blocks.

Signed-off-by: Marcel Guzik <[email protected]>
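The shape of that fast path can be sketched as follows. This is a simplified illustration using tokio, not the actual thin-edge code; all names and signatures are invented for the example. The point is that while the mapper's own outgoing send is blocked, select! keeps draining the dedicated `PublishMessage` receiver, so the peer actor can always complete its send:

```rust
use std::pin::pin;
use tokio::sync::mpsc;

async fn mapper_loop(
    mut main_rx: mpsc::Receiver<String>,    // regular inbox, processed in order
    mut publish_rx: mpsc::Receiver<String>, // fast path: PublishMessages to relay
    avail_tx: mpsc::Sender<String>,         // output towards "AvailabilityActor"
    mqtt_tx: mpsc::Sender<String>,          // output towards "MqttActor"
) {
    while let Some(msg) = main_rx.recv().await {
        // Forward the mapped message, but don't let a full peer inbox
        // stop us from relaying PublishMessages in the meantime.
        let mut forward = pin!(avail_tx.send(format!("MqttMessage({msg})")));
        loop {
            tokio::select! {
                res = &mut forward => {
                    let _ = res;
                    break; // the (possibly blocked) send completed
                }
                Some(publish) = publish_rx.recv() => {
                    // Fast path: relay immediately, breaking the
                    // send/recv cycle that caused the deadlock.
                    let _ = mqtt_tx.send(publish).await;
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (main_tx, main_rx) = mpsc::channel(1);
    let (publish_tx, publish_rx) = mpsc::channel(1);
    let (avail_tx, _avail_rx) = mpsc::channel(1); // nobody drains this inbox
    let (mqtt_tx, mut mqtt_rx) = mpsc::channel(8);
    tokio::spawn(mapper_loop(main_rx, publish_rx, avail_tx, mqtt_tx));

    // The second forward blocks forever: the availability inbox
    // (capacity 1) is never drained...
    main_tx.send("a".into()).await.unwrap();
    main_tx.send("b".into()).await.unwrap();
    // ...yet a PublishMessage still gets relayed thanks to the fast path.
    publish_tx.send("PublishMessage(117)".into()).await.unwrap();
    println!("relayed: {}", mqtt_rx.recv().await.unwrap());
}
```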
github-merge-queue bot pushed a commit that referenced this issue Dec 13, 2024
fix #3279: Create a fast path for sending `PublishMessage`s
reubenmiller (Contributor) commented

@reey We've merged a fix for this. We'll include it in a new official release next week; however, it might be worth verifying for yourself by installing thin-edge.io from our main release channel. You can easily do this by executing the following one-liner:

wget -O - thin-edge.io/install.sh | sh -s -- --channel main

Or if you are a curl user:

curl -fsSL https://thin-edge.io/install.sh | sh -s -- --channel main

reey (Contributor, Author) commented Dec 14, 2024

I gave it a try and it seems to be working fine 👍
Thanks @Bravo555 and @reubenmiller
