Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in interrupt examples #657

Closed
matthijskooijman opened this issue Jan 6, 2023 · 2 comments
Closed

Race condition in interrupt examples #657

matthijskooijman opened this issue Jan 6, 2023 · 2 comments
Labels
bug Something isn't working resolved Issue was resolved (e.g. bug fixed, or feature implemented)

Comments

@matthijskooijman
Copy link
Contributor

Describe the bug
The interrupt examples use an enableInterrupt variable to suppress interrupt handling while the result of a previous interrupt is handled and a new request (TX/RX/CAD/etc) is being configured. Typically, the flow is something like:

if (IrqFlag) { // Check flag set by ISR
  enableInterrupt = false;
  irqFlag = false;  
  ... // Handle event
  state = radio.startChannelScan();
  ... // print state
  enableInterrupt = true;
}

Here, a race condition occurs when the IRQ is triggered after the call to e.g. startChannelScan() and before enableInterrupt is set again. Usually this does not happen, but since timings like these can vary in different circumstances, there is a window for problems. And if this problem occurs, the IRQ will not be handled and the code will lock up, indefinitely waiting for the IRQ that already happenedd.

On the STM32WL with CAD I've ran into this problem in practice. I've seen that the IRQ happens after around 20ms or so (so that's a lot quicker than a TX and, in most cases an RX interrupt) and (for some reason I have not investigated), the "print state" step is usually fast but consistently takes around 50ms on the 18th (or so) try.

To confirm this is wat happened, I applied the following patch:

--- a/examples/STM32WLx/STM32WLx_Channel_Activity_Detection_Interrupt/STM32WLx_Channel_Activity_Detection_Interrupt.ino
+++ b/examples/STM32WLx/STM32WLx_Channel_Activity_Detection_Interrupt/STM32WLx_Channel_Activity_Detection_Interrupt.ino
@@ -47,6 +47,9 @@ void setup() {
     Serial.print(F("failed, code "));
     Serial.println(state);
   }
+  pinMode(2, OUTPUT);
+  pinMode(3, OUTPUT);
+  pinMode(4, OUTPUT);
 }
 
 // flag to indicate that a packet was detected or CAD timed out
@@ -60,6 +63,7 @@ volatile bool enableInterrupt = true;
 // IMPORTANT: this function MUST be 'void' type
 //            and MUST NOT have any arguments!
 void setFlag(void) {
+  digitalWrite(3, !digitalRead(3));
   // check if the interrupt is enabled
   if(!enableInterrupt) {
     return;
@@ -67,6 +71,8 @@ void setFlag(void) {
 
   // something happened, set the flag
   scanFlag = true;
+
+  digitalWrite(4, !digitalRead(4));
 }
 
 void loop() {
@@ -99,6 +105,7 @@ void loop() {
 
     // start scanning the channel again
     Serial.print(F("[STM32WLx] Starting scan for LoRa preamble ... "));
+    digitalWrite(2, HIGH);
     state = radio.startChannelScan();
     if (state == RADIOLIB_ERR_NONE) {
       Serial.println(F("success!"));
@@ -106,6 +113,7 @@ void loop() {
       Serial.print(F("failed, code "));
       Serial.println(state);
     }
+    digitalWrite(2, LOW);
 
     // enable interrupt service routine
     enableInterrupt = true;

Which produced the following trace on my logic analyzer:

image

In other words: slowness in serial printing (which might happen for various reasons, especially if USB Serial is involved) causes the code to lock up.

To Reproduce
This can be easily reproduced by taking e.g. the SX126x_Channel_Activity_Detection_Interrupt example and adding a delay to fake slow serial printing:

--- a/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
+++ b/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
@@ -121,6 +121,7 @@ void loop() {
       Serial.print(F("failed, code "));
       Serial.println(state);
     }
+    delay(20);
 
     // enable interrupt service routine
     enableInterrupt = true;

Alternatively, slow down serial printing by repeating "success" a few times (this depends on the board and serial driver, I've tested this on AVR HardwareSerial (Arduino Uno), it might not work on native USB serial ports or serial ports with larger buffers).

--- a/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
+++ b/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
@@ -116,7 +116,7 @@ void loop() {
     Serial.print(F("[SX1262] Starting scan for LoRa preamble ... "));
     state = radio.startChannelScan();
     if (state == RADIOLIB_ERR_NONE) {
-      Serial.println(F("success!"));
+      Serial.println(F("successsuccesssuccesssuccesssuccess!"));
     } else {
       Serial.print(F("failed, code "));
       Serial.println(state);

With either of these patches, the sketch can do CAD once and then locks up:

[2023-01-06 18:38:11] [SX1262] Initializing ... success!
[2023-01-06 18:38:11] [SX1262] Starting scan for LoRa preamble ... success!
[2023-01-06 18:38:11] [SX1262] Channel is free!
[2023-01-06 18:38:11] [SX1262] Starting scan for LoRa preamble ... successsuccesssuccesssuccesssuccess!

To fix
To fix, the enableInterrupt = true line should be moved up. At least before the serial printing, but that still leaves a small race condition (e.g. when some other IRQ happens at the right moment that takes 20ms). To really fix this, enableInterrupt must be before the call to startChannelScan(), but I'm not sure if this would introduce other problems.

In general, I wonder if enableInterrupt is really needed at all, though. If the hardware guarantees that only one IRQ is triggered for every operation requested, then I think there would be no need to suppress interrupts? But I guess it was introduced for a reason....

@jgromes
Copy link
Owner

jgromes commented Jan 7, 2023

In general, I wonder if enableInterrupt is really needed at all, though.

You're corect, it is not needed. I think the original idea was in Receive_Interrupt, to make sure reception of a second packet doesn't start before the previous one was processed. But that doesn't really make sense, since to receive a packet, the device would first have to set back to Rx mode e.g. by calling startReceive(), which obviously cannot happen until the first packet is processed. I guess I then cargo-culted this flag variable into every other interrupt example.

I tested all the examples for SX127x without the enable flag, and they are working correctly. I was also able to force the race condition by including a delay, and fix it by removeing the enable flag.

Thanks for reporting this, I will remove the enable flag from all the examples.

@jgromes jgromes added the bug Something isn't working label Jan 7, 2023
@jgromes
Copy link
Owner

jgromes commented Jan 7, 2023

All removed now, thanks for reporting.

@jgromes jgromes closed this as completed Jan 7, 2023
@jgromes jgromes added the resolved Issue was resolved (e.g. bug fixed, or feature implemented) label Jan 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working resolved Issue was resolved (e.g. bug fixed, or feature implemented)
Projects
None yet
Development

No branches or pull requests

2 participants