Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers/enc28j60: fix of #9784 #9806

Merged
merged 1 commit into from
Jan 25, 2019
Merged

Conversation

gschorcht
Copy link
Contributor

Fixes the problem described in #9784. There are no locks anymore. Stress test can be done by flooding the enc28j60 node with IMCP echo request packets with maximum size of 65.507 byte, e.g.:

sudo ping6 -f fe80::b48a:eeff:fe23:2323 -I eth0 -s 65507

Thanks to @gebart, I could fix the problem by dropping packets. The fix contains following changes:

  1. Packets are dropped in the nd_recv function when buf = NULL and max_len != 0.
  2. According to the EC28J60 data sheet, section 12.0, the host controller should clear the global enable bit for the interrupt pin before servicing the interrupt. This is done now in the on_int function. The global enable interrupt bit is enabled again in the nd_isr function once all pending interrupts were served.
  3. The pending packet interrupt flag (PKTIF) is cleared now explicitly in the nd_isr function.
  4. Additional debugging outputs left in nd_recv and nd_isr functions.

@maribu
Copy link
Member

maribu commented Aug 21, 2018

@gschorcht: Thanks for your work. It does solve the problem with the infinite loop. Sadly, a different bug seems to be also present. I guess when the on-device RX buffer is overflown, the content needs to be still actively dropped. Below is the log. I didn't have the time to confirm that theory with the datasheet, though. Do you want to investigate that as well? Otherwise, I hope that I will have some time tomorrow or Thursday.

https://gist.github.com/maribu/423b08ab675c553a766b617678ec5ea3

@maribu
Copy link
Member

maribu commented Aug 21, 2018

Is there a way to add a scrollbar to a long pastes? Dumping huge logs like above makes the thread quite unreadable :-(

@aabadie
Copy link
Contributor

aabadie commented Aug 21, 2018

Is there a way to add a scrollbar to a long pastes?

You can put the log in a gist and give the link instead.

@gschorcht
Copy link
Contributor Author

@maribu Unfortunatly, I can't reproduce the problem.

Once the packets are arriving faster than they can be processed, function nd_isr remains in the while loop

        do {
            DEBUG("[enc28j60] isr: packet received\n");
            netdev->event_callback(netdev, NETDEV_EVENT_RX_COMPLETE);
        } while (cmd_rcr(dev, REG_B1_EPKTCNT, 1) > 0);

and no other interrupt is handled anymore as long as there are packets in the receive buffer. Normally, the enc28j60 generates an interrupt (EIR_RXERIF) when its receive buffer overflows and arriving packets are dropped. However, since we are in the do ... while loop all the time, this interrupt is not handled. As soon as the traffic stops and RIOT has processed all the packets, the EIR_RXERIF interrupt is handled exactly once. In my test environment, I can see only this one EIR_RXERIF interrupt. After that everything is working as expected.

@maribu
Copy link
Member

maribu commented Aug 21, 2018

The code related to the remaining problem is here:

if (eir & EIR_RXERIF) {
DEBUG("[enc28j60] isr: incoming packet dropped - RX buffer full\n");
cmd_bfc(dev, REG_EIR, -1, EIR_RXERIF);
}

I think in section 12.1.2 on page 68 in the datasheet the required information is written. RXERIF may occur when:

  1. The RX buffer is overflown, or
  2. The packet counter is overflown

I can confirm the overflow of the packet counter is not the problem by applying:

diff --git a/drivers/enc28j60/enc28j60.c b/drivers/enc28j60/enc28j60.c
index f8b8dec76..dbe9387e5 100644
--- a/drivers/enc28j60/enc28j60.c
+++ b/drivers/enc28j60/enc28j60.c
@@ -31,7 +31,7 @@
 #include "enc28j60.h"
 #include "enc28j60_regs.h"
 
-#define ENABLE_DEBUG    (0)
+#define ENABLE_DEBUG    (1)
 #include "debug.h"
 
 /**
@@ -485,7 +485,9 @@ static void nd_isr(netdev_t *netdev)
             cmd_bfc(dev, REG_EIR, -1, EIR_PKTIF);
         }
         if (eir & EIR_RXERIF) {
-            DEBUG("[enc28j60] isr: incoming packet dropped - RX buffer full\n");
+            DEBUG("[enc28j60] isr: incoming packet dropped - "
+                  "RX buffer / EPKTCNT overflow (EPKTCNT=%i)\n",
+                  (int)cmd_rcr(dev, REG_B1_EPKTCNT, 1));
             cmd_bfc(dev, REG_EIR, -1, EIR_RXERIF);
         }
         if (eir & EIR_TXIF) {

the console looks like this:

2018-08-21 12:34:58,997 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow (EPKTCNT=0)
2018-08-21 12:34:59,293 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow (EPKTCNT=0)
2018-08-21 12:34:59,640 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow (EPKTCNT=0)
2018-08-21 12:34:59,979 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow (EPKTCNT=0)

@maribu
Copy link
Member

maribu commented Aug 21, 2018

And another issue: Now according to GDB the system froze in:

static inline void _wait_for_end(spi_t bus)
{
/* make sure the transfer is completed before continuing, see reference
* manual(s) -> section 'Disabling the SPI' */
while (!(dev(bus)->SR & SPI_SR_TXE)) {}
while (dev(bus)->SR & SPI_SR_BSY) {}
}

(The first loop is spinning infinitely.)

Maybe on chip debugging breaks SPI communication.

@gschorcht
Copy link
Contributor Author

2018-08-21 12:34:58,997 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow (EPKTCNT=0)
2018-08-21 12:34:59,293 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow (EPKTCNT=0)
2018-08-21 12:34:59,640 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow (EPKTCNT=0)
2018-08-21 12:34:59,979 - INFO # [enc28j60] isr: incoming packet dropped - RX buffer / EPKTCNT overflow

Could you check whether the nd_isr is left everytime or if it is iterating endless in the while (eir != 0) { ... } loop by placing a debug message outside the while loop? At least the EIR_RXERIF everytime.

@gschorcht
Copy link
Contributor Author

And another issue: Now according to GDB the system froze in:

Seems to be a platform specific SPI problem.

@maribu
Copy link
Member

maribu commented Aug 21, 2018

@gschorcht: The loop is left, so the RXERIF bit is successfully unset.

diff --git a/drivers/enc28j60/enc28j60.c b/drivers/enc28j60/enc28j60.c
index f8b8dec76..eba287f9f 100644
--- a/drivers/enc28j60/enc28j60.c
+++ b/drivers/enc28j60/enc28j60.c
@@ -31,7 +31,7 @@
 #include "enc28j60.h"
 #include "enc28j60_regs.h"
 
-#define ENABLE_DEBUG    (0)
+#define ENABLE_DEBUG    (1)
 #include "debug.h"
 
 /**
@@ -458,6 +458,19 @@ static int nd_init(netdev_t *netdev)
     return 0;
 }
 
+static void drop_rx_buffer(enc28j60_t *dev)
+{
+    uint16_t rx_rd_ptr, next;
+    uint8_t head[6];
+    rx_rd_ptr = cmd_r_addr(dev, ADDR_RX_READ);
+    cmd_w_addr(dev, ADDR_READ_PTR, ERXRDPT_TO_NEXT(rx_rd_ptr));
+    /* read packet header */
+    cmd_rbm(dev, head, 6);
+    /* TODO: care for endianess */
+    next = (uint16_t)((head[1] << 8) | head[0]);
+    cmd_w_addr(dev, ADDR_RX_READ, NEXT_TO_ERXRDPT(next));
+}
+
 static void nd_isr(netdev_t *netdev)
 {
     enc28j60_t *dev = (enc28j60_t *)netdev;
@@ -486,6 +499,9 @@ static void nd_isr(netdev_t *netdev)
         }
         if (eir & EIR_RXERIF) {
             DEBUG("[enc28j60] isr: incoming packet dropped - RX buffer full\n");
+            /* Drop partly received packet */
+            drop_rx_buffer(dev);
+            /* Clear RXERIF bit */
             cmd_bfc(dev, REG_EIR, -1, EIR_RXERIF);
         }
         if (eir & EIR_TXIF) {

The time until the problem occurs seems to be increased by the patch above. But it doesn't solve it :-(

@maribu
Copy link
Member

maribu commented Aug 21, 2018

Seems to be a platform specific SPI problem.

Yes, it is unrelated and seems to only occur while interrupting the MCU during on-chip debugging.

@gschorcht
Copy link
Contributor Author

I guess you have also to decrement the packet counter in your drop function. But I'm not sure.

cmd_bfs(dev, REG_ECON2, -1, ECON2_PKTDEC);

@maribu
Copy link
Member

maribu commented Aug 21, 2018

Yes, that was the issue. I just had the same idea and so far the driver is working fine. Let me wait another 5 minutes to be sure :-)

@maribu
Copy link
Member

maribu commented Aug 21, 2018

diff --git a/drivers/enc28j60/enc28j60.c b/drivers/enc28j60/enc28j60.c
index f8b8dec76..aa405c2f7 100644
--- a/drivers/enc28j60/enc28j60.c
+++ b/drivers/enc28j60/enc28j60.c
@@ -314,10 +314,37 @@ static int nd_send(netdev_t *netdev, const iolist_t *iolist)
 #define NEXT_TO_ERXRDPT(n) ((n == BUF_RX_START || n - 1 > BUF_RX_END) ? BUF_RX_END : n - 1)
 #define ERXRDPT_TO_NEXT(e) ((e >= BUF_RX_END) ? BUF_RX_START : e + 1)
 
+static void get_rx_header(enc28j60_t *dev, uint16_t *size, uint16_t *next)
+{
+    uint16_t rx_rd_ptr;
+    uint8_t head[6];
+
+    /* set read pointer to RX read address */
+    rx_rd_ptr = cmd_r_addr(dev, ADDR_RX_READ);
+    cmd_w_addr(dev, ADDR_READ_PTR, ERXRDPT_TO_NEXT(rx_rd_ptr));
+    /* read packet header */
+    cmd_rbm(dev, head, 6);
+    /* TODO: care for endianess */
+    if (next != NULL) {
+        *next = (uint16_t)((head[1] << 8) | head[0]);
+    }
+
+    if (size != NULL) {
+        *size = (uint16_t)((head[3] << 8) | head[2]) - 4;  /* discard CRC */
+    }
+}
+
+static void drop_rx_buffer(enc28j60_t *dev)
+{
+    uint16_t next;
+    get_rx_header(dev, NULL, &next);
+    cmd_w_addr(dev, ADDR_RX_READ, NEXT_TO_ERXRDPT(next));
+    cmd_bfs(dev, REG_ECON2, -1, ECON2_PKTDEC);
+}
+
 static int nd_recv(netdev_t *netdev, void *buf, size_t max_len, void *info)
 {
     enc28j60_t *dev = (enc28j60_t *)netdev;
-    uint8_t head[6];
     uint16_t size;
     uint16_t next;
 
@@ -325,14 +352,7 @@ static int nd_recv(netdev_t *netdev, void *buf, size_t max_len, void *info)
     mutex_lock(&dev->devlock);
 
     /* set read pointer to RX read address */
-    uint16_t rx_rd_ptr = cmd_r_addr(dev, ADDR_RX_READ);
-    cmd_w_addr(dev, ADDR_READ_PTR, ERXRDPT_TO_NEXT(rx_rd_ptr));
-    /* read packet header */
-    cmd_rbm(dev, head, 6);
-    /* TODO: care for endianess */
-    next = (uint16_t)((head[1] << 8) | head[0]);
-    size = (uint16_t)((head[3] << 8) | head[2]) - 4;  /* discard CRC */
-
+    get_rx_header(dev, &size, &next);
     DEBUG("[enc28j60] recv: size=%i next=%i buf=%p len=%d\n",
           (int)size, (int)next, buf, max_len);
 
@@ -486,6 +506,9 @@ static void nd_isr(netdev_t *netdev)
         }
         if (eir & EIR_RXERIF) {
             DEBUG("[enc28j60] isr: incoming packet dropped - RX buffer full\n");
+            /* Drop partly received packet */
+            drop_rx_buffer(dev);
+            /* Clear RXERIF bit */
             cmd_bfc(dev, REG_EIR, -1, EIR_RXERIF);
         }
         if (eir & EIR_TXIF) {

It worked for 5 minutes :-) . The above patch is an updated version in which the header parsing is moved to a separate function in order to de-duplicate that code. Do you care to add that fix (or similar) to your PR? Or should I open a separate PR?

@gschorcht
Copy link
Contributor Author

I can add it to my PR but you would have to test it again.

@maribu
Copy link
Member

maribu commented Aug 21, 2018

Crap. With DEBUG_ENABLE set to 0 for me ping6 stop's to work (times out). :-( With it enabled I get 99 pongs for 100 pings send.

@maribu
Copy link
Member

maribu commented Aug 21, 2018

I can add it to my PR but you would have to test it again.

Sure :-) Who thought that a couple of wrongly configured switches sending Ethernet frames to random nodes could provide such an excellent testing environment :-D

@maribu
Copy link
Member

maribu commented Aug 21, 2018

Wait. It happened again. The fix I proposed obviously doesn't solve the issue. I think I will have a look into it some other day...

@gschorcht
Copy link
Contributor Author

Crap. With DEBUG_ENABLE set to 0 for me ping6 stop's to work (times out). :-(
With it enabled I get 99 pongs for 100 pings send.

I was just committing your changes, but I have not pushed them yet. That is, it still does not work for you?

@gschorcht
Copy link
Contributor Author

I think I will have a look into it some other day...
Ok

@maribu
Copy link
Member

maribu commented Aug 21, 2018

I was just committing your changes, but I have not pushed them yet. That is, it still does not work for you?

Yes, the RX buffer overflow is still not handled correctly with my suggested fix. Additionally, there seem to be some unrelated timing issue, as ping6 stops working for me when the debug output is disabled.

However, I can confirm that your PR fixes the missing implementation of dropping packets in netdev_driver::recv and the driver works a lot more reliable because of that.

@miri64 miri64 added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: drivers Area: Device drivers labels Oct 17, 2018
@gschorcht
Copy link
Contributor Author

@maribu @haukepetersen
How do we continue with this PR? At least, it fixes the problem with dropping packets when the receive buffer is full and increase the stability a lot.

@maribu
Copy link
Member

maribu commented Oct 30, 2018

I suggest to merge it :-)

My reasoning:
The API change I suggested will either come not at all, or not any time soon. Thus, at least fixing the bugs against the old API and fixing the DoS security hole is a good thing. If and when the API is changed, this could be done just as easily on top of fixed drivers. Also, the additional DEBUG() macros will make future debugging easier :-)

@PeterKietzmann
Copy link
Member

Without in-depth review or testing I'm also in favor of merging this. @smlng do you still have the device on your desk to run a short test?

@gschorcht
Copy link
Contributor Author

@maribu The PR still doesn't contain your last changes. Is that OK or should we add them?

@maribu
Copy link
Member

maribu commented Nov 5, 2018

I personally would be in favor of merging it as it is. I still had sporadic dead-locks with my last changes. It seems to occur much less frequently with the changes, but this could be just bad luck. So I would like to investigate this further (hopefully some time soon) and be sure what causes the remaining deadlocks.

The missing drop feature added in this PR does reliably solve one cause of a deadlock, so I'm personally in favor of merging the PR as it is :-)

Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing this issue. I believe clearing the PKTIF flag in the EIR register in line 485 is not required according to the datasheet (see comment above) - but I'm not sure about that.

Maybe you could add a comment, if it is actually needed. I will test this PR again this afternoon.

@@ -470,6 +482,7 @@ static void nd_isr(netdev_t *netdev)
DEBUG("[enc28j60] isr: packet received\n");
netdev->event_callback(netdev, NETDEV_EVENT_RX_COMPLETE);
} while (cmd_rcr(dev, REG_B1_EPKTCNT, 1) > 0);
cmd_bfc(dev, REG_EIR, -1, EIR_PKTIF);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After decrementing, if EPKTCNT is ‘0’, the EIR.PKTIF flag will automatically be cleared.

Citing section 7.2.4 on page 45 in the datasheet. I assume this line could be dropped.

Copy link
Contributor Author

@gschorcht gschorcht Dec 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably yes, but it shouldn't cause errors if it is there. Since I don't have time to test it, I would leave it as it is for the moment. If you could test, we could decide to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maribu Have you tried it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did so now. Works fine without manually clearing EIR_PKTIF.

@maribu maribu added Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines labels Nov 19, 2018
@aabadie aabadie added this to the Release 2019.01 milestone Dec 3, 2018
@aabadie
Copy link
Contributor

aabadie commented Dec 20, 2018

@maribu @gschorcht, what is the status here ? Do you think it can be merged or is it waiting for other updates ?

@jia200x
Copy link
Member

jia200x commented Jan 9, 2019

ping @maribu @gschorcht

@maribu
Copy link
Member

maribu commented Jan 10, 2019

@jia200x, @gschorcht : Sorry, I again did not got to test it today. Tomorrow is hopefully less busy and I will finally be able to confirm whether cmd_bfc(dev, REG_EIR, -1, EIR_PKTIF) is indeed required or as the datasheet suggests not.

@gschorcht
Copy link
Contributor Author

gschorcht commented Jan 24, 2019

@maribu I removed this line but I had not the time to test it.

@maribu
Copy link
Member

maribu commented Jan 24, 2019

Mind to squash? I will verify that it still works tomorrow in a free minute and hopefully merge it then :-)

@gschorcht
Copy link
Contributor Author

Rebase ans squashed.

@maribu maribu added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines labels Jan 25, 2019
Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested on the Nucleo-F303RE and with the PR applied, the network stack works reliable even when frames larger than the buffer come in :-)

@gschorcht
Copy link
Contributor Author

Thanks for testing.

@maribu maribu merged commit 370f33a into RIOT-OS:master Jan 25, 2019
@maribu
Copy link
Member

maribu commented Jan 25, 2019

No problem, thanks for identifying and fixing the issue :-)

@gschorcht gschorcht deleted the enc82j60_fix_9784 branch February 7, 2019 14:09
@chrysn chrysn mentioned this pull request Feb 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: drivers Area: Device drivers CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants