Multiple high-speed devices on same SPI host end by failing in get_acquiring_dev() during spi_device_polling_transmit() (IDFGH-6528) #8179
More precisely, in spi_device_polling_start(), after spi_bus_lock_acquire_start() is done, when we return we "think" we have the bus, but in fact a call to get_acquiring_dev() shows that we don't, and a crash (assert) happens in spi_device_polling_end(). How can we return successfully from spi_bus_lock_acquire_start() and yet not hold the bus? NB: there is no "manual" bus acquisition made, and I can be sure that there is no re-entrance in my code calling spi_device_polling_transmit() with the same handle. There is only one call to it, so it's easy to log that.
A few more checks: when putting a mutex around each handle id in spi_device_polling_transmit(), the problem still happens unchanged, proving that this is not a self-reentrance issue. More interesting: when putting a global mutex around the whole spi_device_polling_transmit(), the problem goes away. I know that changes the timing as well, but it seems to indicate that something is wrong with the atomic nature of acquiring the bus. [edit]: I've been running that workaround successfully for a few hours now, whereas before it never lasted more than a few minutes.
Nobody with a similar issue? I really don't get it. It seems to be a simple question: either the doc is wrong or there is a bug. If the acquiring mechanism does not handle concurrent accesses requested by multiple tasks using different devices on the same bus, then the documentation should clearly say so.
This clearly means that the SPI-Ethernet drivers are unusable unless they have exclusive access to a SPI host, or you patch the spi_master driver (which I did).
As discussed in the other thread: this sounds like a bug in the driver. What ESP-IDF version are you using? Could you provide a project that displays the issue, if possible whittled down to the minimum required to reproduce this?
I'm using 4.3.2 - I'll work on a minimal example
I have a working example based simply on the iperf example plus a timer that does a spi_device_polling_transmit() as often as possible. It fails "very well", but it requires another config parameter and I'm not sure which one it is, as I took my entire sdkconfig from my project, so I'll find out exactly which one causes the issue.
@Spritetm here's a cut-down dump after quickly changing to a device-wide mutex in my app, compiled with the latest release/v4.1
The breakage is certain; the exact time at which it happens is not. Sometimes it is within a couple of minutes, sometimes right after the second device is added and starts performing transactions (as befits a race condition).
@philippe44 Just an idea, could you try setting
The following example using iperf always fails for me, sometimes after 10 seconds, sometimes after a couple of minutes. No special configuration. It uses IDF 4.3.2 and a W5500 ethernet module. It is the raw iperf example with just the following file changed, to force frequent asynchronous "fake" transactions using cs=12. SPI bus parameters are host=1, clk=18, miso=19, mosi=23.
When setting BUS_LOCK_DEBUG, here is what I have
Which is the same assert, except that it is caught in polling_start() instead of polling_end(), but that matches what I detected earlier: within the start() function you "think" you have the lock but you don't; it is not something that happens later in end().
I can confirm: I am able to reproduce the issue using the modified iperf example provided by @philippe44 on IDF v4.3.2. However, only when iperf is run in server mode on the ESP.
This is an uneducated suggestion, so take it for what it's worth. In spi_bus_lock.c, we have
Which is called by spi_bus_lock_acquire_start(). My understanding is that dev_handle->parent is a unique entity per bus that holds the arbitration. So at line #1 we get that entity, and then at line #2 we set and get the mask atomically (AFAIU). Now, if there is any lapse in atomicity, then status becomes dangerous, and any context switch where the same acquire_core is called on the same parent will create the mess we see, i.e. the 2nd caller has acquired the bus, but we overwrite that value 2 lines later (or vice versa), and when verifying that we are the acquirer, we realize we are not.
I'm probably beyond my pay grade here, but I don't even understand that
Does it actually set something? I made a few logs to confirm that we have re-entrance here (after the "if (status...)"), which I don't think we should have. Does "lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK);" change anything? The dev_handle->mask has not been set to anything (I think, according to the chain of calls), so we don't set anything in the resource to be locked (the dev_handle->parent), so "status" is 0 and we successfully pass the if(). It happens as soon as, unlucky us, one acquire_core is suspended by another acquire_core. This seems to be confirmed by the logs.
My previous post is probably all wrong, but I'll leave it up anyway. I've made a few more traces and slowed down the timer request, and it seems that (again, take this with a big grain of salt) the following is happening:
An interesting finding: I've been having inconsistent results when adding some code in acquire_core(), even a very simple piece of code, and adding a few do-nothing-but-not-compiler-optimized-out lines removed the crash. So I started to wonder how time-sensitive, compiler-dependent, or why not cache-related the issue is. Now, in my iperf example, after adding that nonsense "volatile" qualifier in front of status, I was not able to crash it after 30 mins (I stopped there), where before it would crash within 10 s to 1 min.
I've not looked at the generated code, but that is really weird.
A better view on that previous post: adding "volatile" forces the compiler (in any optimization mode) to add "memw" around the accesses to the variable. That is enough for the load induced by the modified iperf example, but not "enough" for my real-life application. Now, the following works
It really seems to be linked either to a real-time issue or, as I'd rather think, to something in the pipeline/cache where the "lock" variable is not updated when scheduling happens and another call to acquire_core is made. I really don't know anymore; this requires analysis capabilities I don't have.
I'm also having the same issue with v4.4 and a combination of a W5500 and ILI9341 on the same host.
The only stable option I've found was to add a mutex per SPI host. It works without problems. I've done it in a way that minimizes impact: the lock is created only if more than one client adds a device on the same host. Otherwise, no mutex.
I have also been encountering the same issue with v4.4 when using the SDSPI driver and an ILI9341 LCD screen on the same bus, using different spi devices from different threads. @philippe44 Are you willing to share a PR or diff with your mutex locking changes? I've been trying to implement such without success so far. (I rather suspect I've been insufficiently aggressive about what needs protecting.) Your volatile for loop suggestion is reducing the frequency of failure for me though, thanks for the tip. EDIT: Found the issue with my mutex protection. It's not as nice as the one you described, but it does the trick. I will try to refine it and throw up a PR for anyone else who just wants the driver to work like it says on the tin.
@dvosully can you please share your "not as nice" solution? ;-)
@swbet Sure. https://github.com/dvosully/esp-idf/tree/bugfix/spi_locking

It's a little rough and ready, and it will have a performance hit, as I'm using a recursive mutex which is pretty heavyweight. But I'm using LVGL on a 240x320 graphic display and doing plenty of file writes, and it's not slow enough to cause me great concern. Probably I could wrap much more selective portions of the driver with a binary mutex and still fix the issue, but it's working for me now. It may have some issues - I've seen 3 new ESP_LOGV() assert failures over 2 days of running. Maybe there's a priority inheritance issue or something?

To use it, supply the SPICOMMON_BUSFLAG_MUTEX_LOCK option to the bus config.flags in addition to any other flags you are using. If you are using the spi init functions from spi_bus.h in the esp-iot-solution (likely if you are using a graphic display) then you would need to edit spi_bus.c to use that flag upon init. Otherwise use it wherever you initialise your SPI bus configuration.

It's based off the 4.4 release, not master, so I can't automatically merge with master yet, thus I don't have a PR to make it nice and easy to link and grab. I'll try and do that at some stage, but for now this'll do. I've also only applied it to spi_master; I haven't thought at all about the spi_slave driver - likely an spi_slave is normally only operating a single device from a single thread.

Ultimately, the best solution would be to fix the spi lock, which isn't sufficiently atomic (maybe it's atomic on a single core, but not when threads are running on different cores?).
@dvosully Wow! Thank you so much! I use a SSD1327 OLED display as a second device on the same host as the W5500, and I also use esp-idf 4.4. The mutex does the trick for me.
@philippe44 Hi, would you help give this a try? The code change is in the attached file, together with a potential issue flow.
Sure I will
Right... it took me a while to figure out, as I first wanted to reproduce the failing case for sure (I've not been working on it for a few months). So when I tried, it only failed after 5-10 hours. But then I realized that I had patched spi_master.c to add a lock, and had also changed spi_bus_lock.c to add a short dummy loop to change the timing (just trial and error when I was trying to figure out the cause), and I forgot to remove that. So, anyway, once I removed that loop, it failed after a few minutes. I'm glad to report that with your PR, it has not failed after a couple of hours, where, again, with the real 4.3.2 untouched files, it rarely worked for more than 5 mins. [edit]: one thing I cannot tell is whether this solves the root cause or just masks it by changing timing (this is what I tried before: there was a proper solution with a mutex that worked 100%, and a hack with a dummy for loop with a volatile in acquire_core() after lock_status_fetch_set() that worked for 5-10 hours).
@dvosully with many months' delay, here is what I used for the mutex in spi_master
@philippe44 thank you very much for your support! I'm just not sure I fully understand whether the fix proposed by @Icarus113 resolved the issue, or whether you still observed it after a couple of hours. Regarding Espressif's fix, it should be a proper fix. It ensures that a certain part of the lock's code is executed without any preemption, and so ensures that all variables are properly written in an atomic way (i.e. it takes care of the race conditions). It is a kind of mutex, but not as heavyweight as an actual FreeRTOS mutex.
Sorry for not being clear: the fix from @Icarus113 fixes the issue, or at least after a few hours it has still not happened.
@philippe44 @kostaond Thanks for the feedback~
Thanks @philippe44 and @Icarus113 |
It is still running w/o interruption for about 2 days now
Thanks for the feedback. This fix does address a concurrency issue, although I'm not 100% sure there aren't any other concurrency issues yet; they are very hard to expose.
@Icarus113 @philippe44 Is it within the realm of possibility for #6624 to be caused by this?
@KaeLL The backtrace you include in your fault report is exactly the same as I was seeing when getting this fault, I would suggest well within the realm of possibility 🙂
define: lock_bits = (lock->status & LOCK_MASK) >> LOCK_SHIFT; this `lock_bits` is bits 29-20 of lock->status.

1. spi_hdl_1: acquire_end_core(): uint32_t status = lock_status_clear(lock, dev_handle->mask & LOCK_MASK); Because this is the first device, `spi_hdl_1`, after this lock_bits == 0b0 and status == 0.
2. spi_hdl_2: acquire_core(): uint32_t status = lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK); Here status is 0b0, but lock_bits == 0b10, because this is `spi_hdl_2`.
3. spi_hdl_2: `acquire_core` returns true because status == 0, so `spi_bus_lock_acquire_start(spi_hdl_2)` won't block.
4. spi_hdl_2: spi_device_polling_end(spi_hdl_2).
5. spi_hdl_1: acquire_end_core(): status is 0, so it clears lock->acquiring_dev.
6. spi_hdl_2: spi_device_polling_end(): assert(handle == get_acquiring_dev(host)); fails.

Closes #8179
@Icarus113 ETA on release/v4.1?
I'm using 4.3.2 and on SPI2 I have mosi=23, miso=19, clk=18. The esp32 is running at 240 MHz, with SPIRAM at 80 MHz on SPI3.
The SH1106 and MCP23S17 work together with no problem, and the W5500 works standalone with no problem. They are all handled by different tasks, but each task accesses only its own device.
Now, when there is heavy traffic on ethernet and the display, it crashes, usually after a few seconds. The problem is that the crash happens in spi_device_polling_end(), precisely because get_acquiring_dev() returns NULL; in other words, no device has acquired the bus. It does not make sense, because the spi_device_polling_start() is right there before it. I'm just using polling; I'm not acquiring the bus manually, nothing like that. I've not looked in a ton of detail into the W5500 driver, but it seems to do polling as well. The device that always fails is the display. The handle context seems correct; there is no (obvious) memory corruption.
I looked into issue #7380 hoping that it would help, but it's obviously a different problem. I'm a bit clueless at this point as to why esp-idf would think no device has acquired the bus when spi_device_polling_start() was called just before with the same device. Any pointer would be very appreciated.