-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
drivers/enc28j60: fix of #9784 #9806
Conversation
bcf4b4f
to
aa00087
Compare
@gschorcht: Thanks for your work. It does solve the problem with the infinite loop. Sadly, a different bug seems to be also present. I guess when the on-device RX buffer is overflown, the content needs to be still actively dropped. Below is the log. I didn't have the time to confirm that theory with the datasheet, though. Do you want to investigate that as well? Otherwise, I hope that I will have some time tomorrow or Thursday. https://gist.github.com/maribu/423b08ab675c553a766b617678ec5ea3 |
Is there a way to add a scrollbar to a long pastes? Dumping huge logs like above makes the thread quite unreadable :-( |
You can put the log in a gist and give the link instead. |
@maribu Unfortunatly, I can't reproduce the problem. Once the packets are arriving faster than they can be processed, function
and no other interrupt is handled anymore as long as there are packets in the receive buffer. Normally, the enc28j60 generates an interrupt ( |
The code related to the remaining problem is here: RIOT/drivers/enc28j60/enc28j60.c Lines 474 to 477 in ead6c1f
I think in section 12.1.2 on page 68 in the datasheet the required information is written. RXERIF may occur when:
I can confirm the overflow of the packet counter is not the problem by applying: diff --git a/drivers/enc28j60/enc28j60.c b/drivers/enc28j60/enc28j60.c
index f8b8dec76..dbe9387e5 100644
--- a/drivers/enc28j60/enc28j60.c
+++ b/drivers/enc28j60/enc28j60.c
@@ -31,7 +31,7 @@
#include "enc28j60.h"
#include "enc28j60_regs.h"
-#define ENABLE_DEBUG (0)
+#define ENABLE_DEBUG (1)
#include "debug.h"
/**
@@ -485,7 +485,9 @@ static void nd_isr(netdev_t *netdev)
cmd_bfc(dev, REG_EIR, -1, EIR_PKTIF);
}
if (eir & EIR_RXERIF) {
- DEBUG("[enc28j60] isr: incoming packet dropped - RX buffer full\n");
+ DEBUG("[enc28j60] isr: incoming packet dropped - "
+ "RX buffer / EPKTCNT overflow (EPKTCNT=%i)\n",
+ (int)cmd_rcr(dev, REG_B1_EPKTCNT, 1));
cmd_bfc(dev, REG_EIR, -1, EIR_RXERIF);
}
if (eir & EIR_TXIF) { the console looks like this:
|
And another issue: Now according to GDB the system froze in: RIOT/cpu/stm32_common/periph/spi.c Lines 155 to 161 in 69b2bae
(The first loop is spinning infinitely.) Maybe on chip debugging breaks SPI communication. |
Could you check whether the |
Seems to be a platform specific SPI problem. |
@gschorcht: The loop is left, so the RXERIF bit is successfully unset. diff --git a/drivers/enc28j60/enc28j60.c b/drivers/enc28j60/enc28j60.c
index f8b8dec76..eba287f9f 100644
--- a/drivers/enc28j60/enc28j60.c
+++ b/drivers/enc28j60/enc28j60.c
@@ -31,7 +31,7 @@
#include "enc28j60.h"
#include "enc28j60_regs.h"
-#define ENABLE_DEBUG (0)
+#define ENABLE_DEBUG (1)
#include "debug.h"
/**
@@ -458,6 +458,19 @@ static int nd_init(netdev_t *netdev)
return 0;
}
+static void drop_rx_buffer(enc28j60_t *dev)
+{
+ uint16_t rx_rd_ptr, next;
+ uint8_t head[6];
+ rx_rd_ptr = cmd_r_addr(dev, ADDR_RX_READ);
+ cmd_w_addr(dev, ADDR_READ_PTR, ERXRDPT_TO_NEXT(rx_rd_ptr));
+ /* read packet header */
+ cmd_rbm(dev, head, 6);
+ /* TODO: care for endianess */
+ next = (uint16_t)((head[1] << 8) | head[0]);
+ cmd_w_addr(dev, ADDR_RX_READ, NEXT_TO_ERXRDPT(next));
+}
+
static void nd_isr(netdev_t *netdev)
{
enc28j60_t *dev = (enc28j60_t *)netdev;
@@ -486,6 +499,9 @@ static void nd_isr(netdev_t *netdev)
}
if (eir & EIR_RXERIF) {
DEBUG("[enc28j60] isr: incoming packet dropped - RX buffer full\n");
+ /* Drop partly received packet */
+ drop_rx_buffer(dev);
+ /* Clear RXERIF bit */
cmd_bfc(dev, REG_EIR, -1, EIR_RXERIF);
}
if (eir & EIR_TXIF) { The time until the problem occurs seems to be increased by the patch above. But it doesn't solve it :-( |
Yes, it is unrelated and seems to only occur while interrupting the MCU during on-chip debugging. |
I guess you have also to decrement the packet counter in your drop function. But I'm not sure.
|
Yes, that was the issue. I just had the same idea and so far the driver is working fine. Let me wait another 5 minutes to be sure :-) |
diff --git a/drivers/enc28j60/enc28j60.c b/drivers/enc28j60/enc28j60.c
index f8b8dec76..aa405c2f7 100644
--- a/drivers/enc28j60/enc28j60.c
+++ b/drivers/enc28j60/enc28j60.c
@@ -314,10 +314,37 @@ static int nd_send(netdev_t *netdev, const iolist_t *iolist)
#define NEXT_TO_ERXRDPT(n) ((n == BUF_RX_START || n - 1 > BUF_RX_END) ? BUF_RX_END : n - 1)
#define ERXRDPT_TO_NEXT(e) ((e >= BUF_RX_END) ? BUF_RX_START : e + 1)
+static void get_rx_header(enc28j60_t *dev, uint16_t *size, uint16_t *next)
+{
+ uint16_t rx_rd_ptr;
+ uint8_t head[6];
+
+ /* set read pointer to RX read address */
+ rx_rd_ptr = cmd_r_addr(dev, ADDR_RX_READ);
+ cmd_w_addr(dev, ADDR_READ_PTR, ERXRDPT_TO_NEXT(rx_rd_ptr));
+ /* read packet header */
+ cmd_rbm(dev, head, 6);
+ /* TODO: care for endianess */
+ if (next != NULL) {
+ *next = (uint16_t)((head[1] << 8) | head[0]);
+ }
+
+ if (size != NULL) {
+ *size = (uint16_t)((head[3] << 8) | head[2]) - 4; /* discard CRC */
+ }
+}
+
+static void drop_rx_buffer(enc28j60_t *dev)
+{
+ uint16_t next;
+ get_rx_header(dev, NULL, &next);
+ cmd_w_addr(dev, ADDR_RX_READ, NEXT_TO_ERXRDPT(next));
+ cmd_bfs(dev, REG_ECON2, -1, ECON2_PKTDEC);
+}
+
static int nd_recv(netdev_t *netdev, void *buf, size_t max_len, void *info)
{
enc28j60_t *dev = (enc28j60_t *)netdev;
- uint8_t head[6];
uint16_t size;
uint16_t next;
@@ -325,14 +352,7 @@ static int nd_recv(netdev_t *netdev, void *buf, size_t max_len, void *info)
mutex_lock(&dev->devlock);
/* set read pointer to RX read address */
- uint16_t rx_rd_ptr = cmd_r_addr(dev, ADDR_RX_READ);
- cmd_w_addr(dev, ADDR_READ_PTR, ERXRDPT_TO_NEXT(rx_rd_ptr));
- /* read packet header */
- cmd_rbm(dev, head, 6);
- /* TODO: care for endianess */
- next = (uint16_t)((head[1] << 8) | head[0]);
- size = (uint16_t)((head[3] << 8) | head[2]) - 4; /* discard CRC */
-
+ get_rx_header(dev, &size, &next);
DEBUG("[enc28j60] recv: size=%i next=%i buf=%p len=%d\n",
(int)size, (int)next, buf, max_len);
@@ -486,6 +506,9 @@ static void nd_isr(netdev_t *netdev)
}
if (eir & EIR_RXERIF) {
DEBUG("[enc28j60] isr: incoming packet dropped - RX buffer full\n");
+ /* Drop partly received packet */
+ drop_rx_buffer(dev);
+ /* Clear RXERIF bit */
cmd_bfc(dev, REG_EIR, -1, EIR_RXERIF);
}
if (eir & EIR_TXIF) { It worked for 5 minutes :-) . The above patch is an updated version in which the header parsing is moved to a separate function in order to de-duplicate that code. Do you care to add that fix (or similar) to your PR? Or should I open a separate PR? |
I can add it to my PR but you would have to test it again. |
Crap. With |
Sure :-) Who thought that a couple of wrongly configured switches sending Ethernet frames to random nodes could provide such an excellent testing environment :-D |
Wait. It happened again. The fix I proposed obviously doesn't solve the issue. I think I will have a look into it some other day... |
I was just committing your changes, but I have not pushed them yet. That is, it still does not work for you? |
|
Yes, the RX buffer overflow is still not handled correctly with my suggested fix. Additionally, there seem to be some unrelated timing issue, as However, I can confirm that your PR fixes the missing implementation of dropping packets in |
@maribu @haukepetersen |
I suggest to merge it :-) My reasoning: |
Without in-depth review or testing I'm also in favor of merging this. @smlng do you still have the device on your desk to run a short test? |
@maribu The PR still doesn't contain your last changes. Is that OK or should we add them? |
I personally would be in favor of merging it as it is. I still had sporadic dead-locks with my last changes. It seems to occur much less frequently with the changes, but this could be just bad luck. So I would like to investigate this further (hopefully some time soon) and be sure what causes the remaining deadlocks. The missing drop feature added in this PR does reliably solve one cause of a deadlock, so I'm personally in favor of merging the PR as it is :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing this issue. I believe clearing the PKTIF flag in the EIR register in line 485 is not required according to the datasheet (see comment above) - but I'm not sure about that.
Maybe you could add a comment, if it is actually needed. I will test this PR again this afternoon.
drivers/enc28j60/enc28j60.c
Outdated
@@ -470,6 +482,7 @@ static void nd_isr(netdev_t *netdev) | |||
DEBUG("[enc28j60] isr: packet received\n"); | |||
netdev->event_callback(netdev, NETDEV_EVENT_RX_COMPLETE); | |||
} while (cmd_rcr(dev, REG_B1_EPKTCNT, 1) > 0); | |||
cmd_bfc(dev, REG_EIR, -1, EIR_PKTIF); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After decrementing, if EPKTCNT is ‘0’, the EIR.PKTIF flag will automatically be cleared.
Citing section 7.2.4 on page 45 in the datasheet. I assume this line could be dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably yes, but it shouldn't cause errors if it is there. Since I don't have time to test it, I would leave it as it is for the moment. If you could test, we could decide to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maribu Have you tried it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did so now. Works fine without manually clearing EIR_PKTIF.
@maribu @gschorcht, what is the status here ? Do you think it can be merged or is it waiting for other updates ? |
ping @maribu @gschorcht |
@jia200x, @gschorcht : Sorry, I again did not got to test it today. Tomorrow is hopefully less busy and I will finally be able to confirm whether |
@maribu I removed this line but I had not the time to test it. |
Mind to squash? I will verify that it still works tomorrow in a free minute and hopefully merge it then :-) |
Rebase ans squashed. |
73f94a1
to
5cb2dd3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tested on the Nucleo-F303RE and with the PR applied, the network stack works reliable even when frames larger than the buffer come in :-)
Thanks for testing. |
No problem, thanks for identifying and fixing the issue :-) |
Fixes the problem described in #9784. There are no locks anymore. Stress test can be done by flooding the enc28j60 node with IMCP echo request packets with maximum size of 65.507 byte, e.g.:
Thanks to @gebart, I could fix the problem by dropping packets. The fix contains following changes:
nd_recv
function whenbuf = NULL
andmax_len != 0
.on_int
function. The global enable interrupt bit is enabled again in thend_isr
function once all pending interrupts were served.PKTIF
) is cleared now explicitly in thend_isr
function.nd_recv
andnd_isr
functions.