Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: add tests for netdev flooding race-condition #11256

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

miri64
Copy link
Member

@miri64 miri64 commented Mar 24, 2019

Contribution description

While working on #11068, I noticed a race condition within the state machine (see p. 51 in the datasheet) of the at86rf2xx device driver:

in_rx_aack svg
.
This PR introduces two accompanying applications that reproduce this race condition.

  • A netdev_flood_flooder that sends IEEE 802.15.4 frames periodically every 5ms
  • A netdev_flood_replier that receives those frames and tries to reply to them with different content after a 2ms delay

If they would succeed the netdev_flood_replier application would just receive the frames sent by netdev_flood_flooder, however due to the discovered race condition it may happen that it reads the data it just sent.

Testing procedure

In general: check the READMEs, they should describe it very well. But here is the rundown.

Compile and flash tests/netdev_flood_flooder first

make -C tests/netdev_flood_flooder flash

Check the output with

make -C tests/netdev_flood_flooder term

Then compile and flash tests/netdev_flood_replier too

make -C tests/netdev_flood_replier flash

Use the make test target to check the output. If the following message is not shown it is successful (which it shouldn't be in the current master).

Unexpected payload. This test failed!

Issues/PRs references

Issue made clear with these tests was found and is making problems for #11068.

@stale
Copy link

stale bot commented Sep 25, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Sep 25, 2019
@miri64 miri64 removed the State: stale State: The issue / PR has no activity for >185 days label Sep 25, 2019
@jia200x
Copy link
Member

jia200x commented Oct 15, 2019

note that all radios (and probably network devices) are affected by this issue, because the driver ISR and the send messages are handled by the same thread.

EDIT: All radios that share the framebuffer for sending/receiving

@bergzand
Copy link
Member

note that all radios (and probably network devices) are affected by this issue, because the driver ISR and the send messages are handled by the same thread.

From how I understand the issue here (and please correct me if I'm wrong) this is caused by shared frame buffer of the at86rf2xx radio. It has a single 128B buffer used for both the TX PDU and the RX PDU

The mrf24j40, cc2420 and the nrf52840 radios all have separate transmit and receive buffers (I didn't check or know about the remaining few), so for those radios it is not possible to overwrite the receive buffer with the PDU from the _send() call.

@jia200x
Copy link
Member

jia200x commented Oct 16, 2019

From how I understand the issue here (and please correct me if I'm wrong) this is caused by shared frame buffer of the at86rf2xx radio. It has a single 128B buffer used for both the TX PDU and the RX PDU

If the radio has a different TX and RX framebuffer, then yes... this problem doesn't happen.

@miri64
Copy link
Member Author

miri64 commented Dec 6, 2019

Side-note: while selective fragment recovery (#12303) works very well for mitigating this bug (still waiting for the high-load results, but when taking it slow I get 100% success-rate even when forwarding) it causes the forwarder to create VRB entries to itself: When a fragment is forwarded, it can happen that the forwarder due to this bug receives it itself, determining that it is not the destination of this fragment, and creating a VRB for it with its own address as the source address. The only bad thing is that when ever there is an ACK to be forwarded to the same tag that erroneous read packet is send to, it is forwarded on the medium to the forwarder itself (so causing energy problems), but other that again: some of the problems of this bug are mitigated. For my experiments I will just merge #11264 into the branch my experiments will be based on and use gnrc_netif_pktq, so this won't be an issue.

@fjmolinas
Copy link
Contributor

@jia200x could you take a look at this one? It seems to be with the problem stated in your issue #12910 so it could be useful to have it in and check on which platforms it is present.

@jia200x
Copy link
Member

jia200x commented Jan 6, 2020

@fjmolinas sure!

@fjmolinas fjmolinas requested a review from jia200x January 6, 2020 12:48
@benpicco
Copy link
Contributor

benpicco commented Feb 5, 2020

So when I flash those on two samr21-xpros with #12728 I see on one board

2020-02-05 12:01:06,131 # main(): This is RIOT! (Version: 2020.01-devel-695-gfb7d9-pr/at86rf2xx/fix_recv_before_send)
2020-02-05 12:01:06,140 # Starting to flooding packets now; start an instance of
2020-02-05 12:01:06,144 # tests/netdev_flood_flooder (on a device with matching link-layer ;-))
2020-02-05 12:01:06,145 # to see results.

and on the other

2020-02-05 12:01:04,328 # main(): This is RIOT! (Version: 2020.01-devel-695-gfb7d9-pr/at86rf2xx/fix_recv_before_send)
2020-02-05 12:01:04,335 # Starting to flood now; start an instance of tests/netdev_flood_replier
2020-02-05 12:01:04,344 # (on a device with matching link-layer ;-)) to see results.

So I guess it's working? Might also mean the boards are stuck 😉
Some periodic feedback would be nice.

edit: Now I got

2020-02-05 12:03:02,759 # Unexpected payload. This test failed!
2020-02-05 12:03:02,765 # 00000000  7B  3B  3A  02  85  00  74  D1  00  00  00  00  01  02  D2  AE
2020-02-05 12:03:02,775 # 00000010  0C  1B  20  54  05  8C  00  00  00  00  00  00
2020-02-05 12:04:02,760 # Unexpected payload. This test failed!
2020-02-05 12:04:02,766 # 00000000  7B  3B  3A  02  85  00  74  D1  00  00  00  00  01  02  D2  AE
2020-02-05 12:04:02,771 # 00000010  0C  1B  20  54  05  8C  00  00  00  00  00  00

But I guess that's just another node sending periodic announcements.

Comment on lines +30 to +39
#if defined(MODULE_AT86RF2XX)
#define NETDEV_ADDR_LEN (2U)
#define NETDEV_FLOOD_HDR_SEQ_OFFSET (2U)
/* IEEE 802.15.4 header */
#define NETDEV_FLOOD_HDR { 0x71, 0x98, /* FCF */ \
0xa1, /* Sequence number 161 */ \
0x23, 0x00, /* PAN ID 0x23 */ \
0x0e, 0x50, /* 0x500e (start of NETDEV_FLOOD_TARGET) */ \
0x0e, 0x12, /* 0x120e (start of NETDEV_FLOOD_SOURCE) */ }
#else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this at86rf2xx specific?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this test is trying to point out a problem in the at86rf2xx implementation. For other drivers other data might be needed. See also https://github.com/RIOT-OS/RIOT/pull/11256/files#diff-f29476e871b447f52f062cd754786b75R53-R54

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the 802.15.4 header should be the same for all 802.15.4 devices - what configuration would that be?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 802.15.4 header, yes. I don't remember if this data was specific to the error case though.

@OlegHahm
Copy link
Member

Is this still relevant?

@miri64
Copy link
Member Author

miri64 commented Sep 26, 2023

I guess, if all 802.15.4 devices are ported to the new ieee802154_hal, it can be closed. But AFAIK, they are not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: drivers Area: Device drivers Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants