Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NCS36510: Fix the sporadic semaphore timing issue #3779

Merged
merged 4 commits into from
Feb 21, 2017

Conversation

geky
Copy link
Contributor

@geky geky commented Feb 15, 2017

Whew, this was a long bug. Sorry for the delay.

Here's the history of the bug: #3404, it's been causing small problems on the CI, but takes a while to pop up, proved difficult to nail down.

This patch has three parts:

  • c0951c9 Fix the NCS36510 timer issue
  • 1f9e239 Add tests to better check the low-level integration between mbed-events and hardware timers
  • 0949164 Make mbed-events more tolerant of these sort of timing errors

As for exactly what the timing bug was:

The NCS36510 is limited to 16bit timers. Construction of larger intervals is performed in software by counting the number of 16bit intervals that pass.

Either this counting takes a bit of time, or there is a math error somewhere (maybe a long critical section?), because there is a roughly ~1us delay between when the interrupt occurs and the ticker
progresses onto the next 16bit interval. This is normally a completely reasonable error, except that the error accumulates. After a while, the equeue tests find themselves with tens of milliseconds of error.
To make matters worse, this error is random because of other interrupts occuring in the system, making the exact issue quite a bit difficult to track down.

This fix drops the software counter in favor of just recalculating the next delay interval from the target time and value of the running timer. The running timer used to calculate the current tick is left to
overflow in hardware and doesn't have this drift.

should resolve #3404
cc @maclobdell, @bridadan, @pradeep-gr, @radhika-raghavendran

Copy link
Contributor

@bridadan bridadan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @geky!

Tests the timer/semaphores at a lower level than the event queue,
which removes a layer of concerns from issues in the rtos timing.
The NCS36510 is limited to 16bit timers. Construction of larger
intervals is performed in software by counting the number of 16bit
intervals that pass.

Either this counting takes a bit of time, or there is a math error
somewhere (maybe a long critical section?), because there is a
roughly ~1us delay between when the interrupt occurs and the ticker
progresses onto the next 16bit interval. This is normally a completely
reasonable error, except that the error accumulates. After a while,
the equeue tests find themselves with tens of milliseconds of error.
To make matters worse, this error is random because of other interrupts
occuring in the system, making the exact issue quite a bit difficult
to track down.

This fix drops the software counter in favor of just recalculating
the next delay interval from the target time and value of the running
timer. The running timer used to calculate the current tick is left to
overflow in hardware and doesn't have this drift.
An odd bug (c0951c9) in the NCS36510 ticker code caused the timer/ticker
classes to become desynchronized. Updated the handling to not assume the
timers are perfectly in synch. This will increase the event's tolerance of
less robust platforms.
@bridadan
Copy link
Contributor

/morph test-nightly

@mbed-bot
Copy link

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1608

Test failed!

@geky
Copy link
Contributor Author

geky commented Feb 16, 2017

Dammit, the test is working. Currently fails on NRF52:

:96::FAIL: Values Not Within Delta 2000 Expected 233886 Was 236000

@geky
Copy link
Contributor Author

geky commented Feb 16, 2017

/morph test

@mbed-bot
Copy link

Result: SUCCESS

Your command has finished executing! Here's what you wrote!

/morph test

Output

mbed Build Number: 1622

All builds and test passed!

@bridadan
Copy link
Contributor

A non-nightly won't cover the event tests, restarting.

/morph test-nightly

@geky
Copy link
Contributor Author

geky commented Feb 17, 2017

Ah thanks 👍

@mbed-bot
Copy link

Result: SUCCESS

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1624

All builds and test passed!

@sg-
Copy link
Contributor

sg- commented Feb 17, 2017

@geky @bridadan

:96::FAIL: Values Not Within Delta 2000 Expected 233886 Was 236000

What is this failuer? Test machine under heavy load?

@sg- sg- removed the needs: review label Feb 17, 2017
@geky
Copy link
Contributor Author

geky commented Feb 17, 2017

It was a legitimate 2.1 millisecond error. These tests are for the layer between the event queue's porting layer and hardware, so it indicates that either the NRF52's hardware timer or ticker was off by 2.1 milliseconds.

I couldn't replicate this issue locally, the test is structured to be deterministic and repeatable, so my guess is that differences on the host side timing and USB communication caused some hardware events on the device side to line up and cause a 2 millisecond error.

I'm not sure what tolerance we should be expecting, but given that this is one of the least powerful platforms, it seemed reasonable to increase the tolerance to 5 milliseconds.

@geky
Copy link
Contributor Author

geky commented Feb 17, 2017

@pan-, @bridadan pointed out the NRF52 has a softdevice with other interrupts that occur outside of the microcontroller?

Given that there is no bluetooth operations, do you know if it's reasonable for a ~2ms interrupt/critical section to occur?

@@ -0,0 +1,122 @@
#include "mbed_events.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license missing here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, we should definitely add licenses to our tests (more files are without as I noticed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, not sure how I ended up with a lot of licenseless test files.

@sg- sg- merged commit 09a0ff3 into ARMmbed:master Feb 21, 2017
aisair pushed a commit to aisair/mbed that referenced this pull request Apr 30, 2024
Ports for Upcoming Targets


Fixes and Changes

3590: [NUC472/M453] Export IAR project and other bugfixes ARMmbed/mbed-os#3590
3740: STM32L4 HAL update for RTC Wake Up Timer ARMmbed/mbed-os#3740
3739: STM32F7 : remove multiple HSE_VALUE define value ARMmbed/mbed-os#3739
3759: STM32: spi_frequency table index fix ARMmbed/mbed-os#3759
3779: NCS36510: Fix the sporadic semaphore timing issue ARMmbed/mbed-os#3779
3806: NXP KL43Z/KL27Z: fix spi format bits check ARMmbed/mbed-os#3806
3814: NCS36510: I2C idle delay of 1us ARMmbed/mbed-os#3814
3803: Bug fix of initial value of interrupt edge in "gpio_irq_init" again ARMmbed/mbed-os#3803
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sporadic semaphore timing issue on NCS36510
5 participants