
Multiple high-speed devices on same SPI host end by failing in get_acquiring_dev() during spi_device_polling_transmit() (IDFGH-6528) #8179

Closed
philippe44 opened this issue Jan 1, 2022 · 35 comments
Labels
Resolution: Done · Status: Done

Comments

@philippe44

philippe44 commented Jan 1, 2022

I'm using IDF 4.3.2 and have the following devices on SPI2 (mosi=23, miso=19, clk=18). The ESP32 runs at 240 MHz, with SPIRAM at 80 MHz on SPI3:

  • a W5500 @ 10/16/20 MHz, using cs=5 (int=25)
  • a SH1106 display @ 16/20 MHz, using cs=32 and dc=33
  • a MCP23S17 expander @ 8/10 MHz, using cs=4

The SH1106 and MCP23S17 work together with no problem, and the W5500 works fine standalone. They are all handled by different tasks, but each task accesses only its own device.

Now, when there is heavy traffic on both ethernet and the display, it crashes, usually after a few seconds. The crash happens in spi_device_polling_end(), precisely because get_acquiring_dev() returns NULL; in other words, no device has acquired the bus. That makes no sense, because spi_device_polling_start() runs just before it. I'm only using polling transactions; I'm not acquiring the bus manually, nothing like that. I've not looked at the W5500 driver in great detail, but it seems to use polling as well. The device that fails is always the display. The handle context looks correct, and there is no (obvious) memory corruption.
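For reference, each task does nothing more exotic than a plain polling transaction on its own handle, along these lines (a trimmed-down sketch; the payload and delay are made up, only the call pattern matches my code):

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "driver/spi_master.h"

// Trimmed-down sketch of the display task; the command byte is a placeholder
static void display_task(void *arg)
{
    spi_device_handle_t display = (spi_device_handle_t)arg;
    uint8_t cmd = 0xA5;                      // made-up SH1106 byte

    while (true) {
        spi_transaction_t t = {
            .length = 8,                     // transaction length in bits
            .tx_buffer = &cmd,
        };
        // Internally this is polling_start() + polling_end(); the assert
        // fires in polling_end() when get_acquiring_dev() returns NULL
        ESP_ERROR_CHECK(spi_device_polling_transmit(display, &t));
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}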

I looked into issue #7380 hoping it would help, but that's obviously a different problem. I'm a bit clueless at this point as to why esp-idf would think no device has acquired the bus when spi_device_polling_start() was called just before with the same device. Any pointer would be much appreciated.

@espressif-bot espressif-bot added the Status: Opened label Jan 1, 2022
@github-actions github-actions bot changed the title Multiple high-speed devices on same SPI host end by failing in get_acquiring_dev() during spi_device_polling_transmit() Multiple high-speed devices on same SPI host end by failing in get_acquiring_dev() during spi_device_polling_transmit() (IDFGH-6528) Jan 1, 2022
@philippe44
Author

philippe44 commented Jan 3, 2022

More precisely: in spi_device_polling_start(), after spi_bus_lock_acquire_start() completes, we return "thinking" we have the bus, but a call to get_acquiring_dev() shows that we don't, and the crash (assert) happens in spi_device_polling_end().

How can we return successfully from spi_bus_lock_acquire_start() and yet not hold the bus?

NB: there is no "manual" bus acquisition anywhere, and I can be sure there is no re-entrance in my code calling spi_device_polling_transmit() with the same handle. There is only one call site, so it's easy to log.
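To be explicit, "manual" acquisition would mean wrapping transactions like this, which I never do anywhere:

// What I do NOT do: explicit, exclusive bus acquisition
spi_device_acquire_bus(handle, portMAX_DELAY);
spi_device_polling_transmit(handle, &t);
spi_device_release_bus(handle);

// What I actually call, everywhere, is only the one-shot form,
// which acquires and releases the bus lock internally
spi_device_polling_transmit(handle, &t);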

@philippe44
Author

philippe44 commented Jan 3, 2022

A few more checks: when putting a mutex around each handle in spi_device_polling_transmit(), the problem still happens unchanged, proving that this is not a self-re-entrance issue.

More interesting: when putting a global mutex around the whole of spi_device_polling_transmit(), the problem goes away. I know that changes the timing as well, but it seems to indicate that something is wrong with the atomicity of acquiring the bus.

[edit]: I've been running that workaround successfully for a few hours now, where before it never lasted more than a few minutes.
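For the record, the workaround is just a global gate around the call; a minimal sketch (the wrapper name is mine, not an IDF function):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "driver/spi_master.h"

static SemaphoreHandle_t s_spi_mutex;  // created once with xSemaphoreCreateMutex()

// Hypothetical wrapper: serialize ALL polling transactions on the host
esp_err_t spi_polling_transmit_locked(spi_device_handle_t handle,
                                      spi_transaction_t *trans)
{
    xSemaphoreTake(s_spi_mutex, portMAX_DELAY);
    esp_err_t err = spi_device_polling_transmit(handle, trans);
    xSemaphoreGive(s_spi_mutex);
    return err;
}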

@philippe44
Author

philippe44 commented Jan 4, 2022

Nobody with a similar issue? I really don't get it; it seems like a simple question. Either the doc is wrong or there is a bug. If the acquiring mechanism does not handle concurrent access requests from multiple tasks using different devices on the same bus, then it should clearly say:

The SPI driver does not allow any concurrent access requests to the same SPI bus/host, even when using different devices. This must be handled by the user application.
As a consequence, please note that SPI-Ethernet drivers require exclusive use of the SPI host they are connected to.

This effectively means the SPI-Ethernet drivers are unusable unless they have exclusive access to an SPI host, or you patch spi_master (which I did).

@Spritetm
Member

Spritetm commented Jan 6, 2022

As discussed in the other thread: this sounds like a bug in the driver. What ESP-IDF version are you using? Could you provide a project that displays the issue, if possible whittled down to the minimum required to reproduce this?

@philippe44
Author

I'm using 4.3.2 - I'll work on a minimal example

@philippe44
Author

I have a working example based simply on the iperf example plus a timer that does a spi_device_polling_transmit() as often as possible. It fails "very well", but it requires another config parameter and I'm not sure which one it is, as I took my entire sdkconfig from my project; I'll find out exactly which one causes the issue.

@KaeLL
Contributor

KaeLL commented Jan 6, 2022

@Spritetm here's a cut-down dump from my app compiled with the latest release/v4.1, taken before quickly changing to a device-wide mutex

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x40085605  PS      : 0x00060f30  A0      : 0x8008586c  A1      : 0x3ffbbb60  
A2      : 0x3ffb18d4  A3      : 0x00000001  A4      : 0x00000001  A5      : 0x00000000  
A6      : 0x00060120  A7      : 0x00000000  A8      : 0x00000001  A9      : 0x3ffb1958  
A10     : 0x3ffbbbec  A11     : 0x3ffbbbf0  A12     : 0x00000010  A13     : 0x00000010  
A14     : 0x00000000  A15     : 0x00000020  SAR     : 0x00000020  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000049  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  

0x40085605: device_release_bus_internal at /esp-idf/components/driver/spi_master.c:601 (discriminator 1)

0x40085602: device_release_bus_internal at /esp-idf/components/driver/spi_master.c:601

0x40085869: spi_device_polling_end at /esp-idf/components/driver/spi_master.c:1077

0x40085891: spi_device_polling_transmit at /esp-idf/components/driver/spi_master.c:1089

The breakage is certain; the exact time at which it happens is not. Sometimes it's within a couple of minutes, sometimes right after the second device is added and starts performing transactions (as befits a race condition).

@igrr
Member

igrr commented Jan 6, 2022

@philippe44 Just an idea: could you try setting #define BUS_LOCK_DEBUG 1 in spi_bus_lock.c and see if any assert triggers earlier?

@philippe44
Author

philippe44 commented Jan 6, 2022

The following example using iperf always fails for me, whether after 10 seconds or a couple of minutes. No special configuration. Uses IDF 4.3.2 and a W5500 ethernet module. It is the raw iperf example with just the following file changed, to force frequent asynchronous "fake" transactions using cs=12.

SPI bus parameters are host=1, clk=18, miso=19, mosi=23.
The W5500 uses cs=5, rst=22, intr=25.

ethernet_example_main.zip

@philippe44
Author

When setting BUS_LOCK_DEBUG, here is what I get:

        Interval Bandwidth
   0-   3 sec       7.66 Mbits/sec
   3-   6 sec       7.75 Mbits/sec
   6-   9 sec       7.73 Mbits/sec
   9-  12 sec       7.54 Mbits/sec
  12-  15 sec       7.75 Mbits/sec

assert failed: spi_bus_lock_acquire_start spi_bus_lock.c:688 (lock->acquiring_dev == dev_handle)

Backtrace:0x40081817:0x3ffbb490 0x400884c5:0x3ffbb4b0 0x4008e91d:0x3ffbb4d0 0x400857f8:0x3ffbb5f0 0x40085502:0x3ffbb620 0x400856b1:0x3ffbb650 0x400e47eb:0x3ffbb680 0x400e50fc:0x3ffbb6e0 0x400e3f22:0x3ffbb720 0x40114aef:0x3ffbb750 0x400fa3f5:0x3ffbb770 0x400f9e81:0x3ffbb790 0x400f3f22:0x3ffbb7b0 0x400f420e:0x3ffbb7d0 0x400f5155:0x3ffbb800 0x400f51f9:0x3ffbb840 0x400f1c59:0x3ffbb870 0x400f1d4d:0x3ffbb8b0 0x400f1e81:0x3ffbb8d0 0x400fcbb3:0x3ffbb910 0x400e9411:0x3ffbb930 0x400e94b8:0x3ffbb950 0x4008bbcd:0x3ffbb980
0x40081817: panic_abort at D:/Developments/espressif-4.3.2/components/esp_system/panic.c:393
0x400884c5: esp_system_abort at D:/Developments/espressif-4.3.2/components/esp_system/system_api.c:112
0x4008e91d: __assert_func at D:/Developments/espressif-4.3.2/components/newlib/assert.c:85
0x400857f8: spi_bus_lock_acquire_start at D:/Developments/espressif-4.3.2/components/driver/spi_bus_lock.c:688 (discriminator 1)
0x40085502: spi_device_polling_start at D:/Developments/espressif-4.3.2/components/driver/spi_master.c:942
0x400856b1: spi_device_polling_transmit at D:/Developments/espressif-4.3.2/components/driver/spi_master.c:1022 (discriminator 15)
0x400e47eb: w5500_write at D:/Developments/espressif-4.3.2/components/esp_eth/src/esp_eth_mac_w5500.c:80
0x400e50fc: emac_w5500_transmit at D:/Developments/espressif-4.3.2/components/esp_eth/src/esp_eth_mac_w5500.c:542 (discriminator 2)
0x400e3f22: esp_eth_transmit at D:/Developments/espressif-4.3.2/components/esp_eth/src/esp_eth.c:330 (discriminator 2)
0x40114aef: esp_netif_transmit at D:/Developments/espressif-4.3.2/components/esp_netif/lwip/esp_netif_lwip.c:875
0x400fa3f5: ethernet_low_level_output at D:/Developments/espressif-4.3.2/components/lwip/port/esp32/netif/ethernetif.c:118
0x400f9e81: ethernet_output at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/netif/ethernet.c:312 (discriminator 2)
0x400f3f22: etharp_output_to_arp_index at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/core/ipv4/etharp.c:785
0x400f420e: etharp_output at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/core/ipv4/etharp.c:884
0x400f5155: ip4_output_if_opt_src at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/core/ipv4/ip4.c:1089
0x400f51f9: ip4_output_if_src at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/core/ipv4/ip4.c:907
0x400f1c59: udp_sendto_if_src at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/core/udp.c:906 (discriminator 4)
0x400f1d4d: udp_sendto_if at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/core/udp.c:705
0x400f1e81: udp_sendto at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/core/udp.c:612
0x400fcbb3: lwip_netconn_do_send at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/api/api_msg.c:1640
0x400e9411: tcpip_thread_handle_msg at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/api/tcpip.c:168
0x400e94b8: tcpip_thread at D:/Developments/espressif-4.3.2/components/lwip/lwip/src/api/tcpip.c:154
0x4008bbcd: vPortTaskWrapper at D:/Developments/espressif-4.3.2/components/freertos/port/xtensa/port.c:168

This is the same assert, except that it is caught in polling_start() instead of polling_end(), which is what I detected earlier: within the start() function, you "think" you have the lock but you don't; it is not something that happens later in the end().

@kostaond
Collaborator

kostaond commented Jan 7, 2022

I can confirm: I am able to reproduce the issue using the modified iperf example provided by @philippe44 on IDF v4.3.2, but only when iperf is run in Server mode on the ESP.

@philippe44
Author

This is an uneducated suggestion, so take it for what it's worth. In spi_bus_lock.c, we have:

SPI_MASTER_ISR_ATTR static inline bool acquire_core(spi_bus_lock_dev_t *dev_handle)
{
    spi_bus_lock_t* lock = dev_handle->parent;
    uint32_t status = lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK);

    // Check all bits except WEAK_BG
    if ((status & (BG_MASK | LOCK_MASK)) == 0) {

which is called by spi_bus_lock_acquire_start(). My understanding is that dev_handle->parent is a unique entity per bus that holds the arbitration. So the first line gets that entity, and the second line atomically (AFAIU) fetches the status and sets the mask. Now, if that atomicity is ever broken, status becomes dangerous: any context switch in which acquire_core() is called again on the same parent creates the mess we see, i.e. the second caller has acquired the bus, but we overwrite that value two lines later (or vice versa), and when verifying that we are the acquirer, we realize we are not.
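If I read it right, lock_status_fetch_set() is meant to behave like an atomic fetch-or, roughly like this sketch (assuming status is a plain uint32_t here; the real implementation may differ):

// Conceptual sketch only: atomically OR the device's LOCK bits into the
// shared status word and return the value it held BEFORE the update
static inline uint32_t lock_status_fetch_set_sketch(spi_bus_lock_t *lock,
                                                    uint32_t pattern)
{
    return __atomic_fetch_or(&lock->status, pattern, __ATOMIC_SEQ_CST);
}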

@philippe44
Author

philippe44 commented Jan 7, 2022

This is probably above my pay grade, but I don't even understand this:

//Set the LOCK bit. Handle related stuff and return true if we become the acquiring processor.
SPI_MASTER_ISR_ATTR static inline bool acquire_core(spi_bus_lock_dev_t *dev_handle)
{
    spi_bus_lock_t* lock = dev_handle->parent;
    uint32_t status = lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK);
   
    // Check all bits except WEAK_BG
    if ((status & (BG_MASK | LOCK_MASK)) == 0) {

Does it actually set something? I made a few logs confirming that we get re-entrance here (after the if (status...)), which I don't think we should have. Does lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK) change anything? The dev_handle->mask has not been set to anything (I think, according to the chain of calls), so we don't set anything in the resource to be locked (the dev_handle->parent); "status" is 0 and we successfully pass the if(). It happens as soon as, unlucky us, one acquire_core() is suspended by another acquire_core(). This seems to be confirmed by the logs.

@philippe44
Author

My previous post is probably all wrong, but I'll leave it up anyway. I've made a few more traces and slowed down the timer requests, and it seems that (again, take this with a big grain of salt) the following is happening:

  • the w5500 has the bus and does a transaction
  • the timer kicks in but gets rejected in acquire_core()
  • before the timer request exits acquire_core(), the w5500 finishes its transaction and immediately fires up another one
  • the w5500 transaction seems to be processed, and it immediately fires up yet another one
  • the timer transaction ends, but at this point the driver seems to have gone wild, totally confused about who owns the bus

@philippe44
Author

philippe44 commented Jan 8, 2022

An interesting finding: I've been getting inconsistent results when adding code to acquire_core(), even very simple pieces of code, and adding a few do-nothing-but-not-optimized-out lines removed the crash. So I started to wonder how time-sensitive, compiler-dependent, or even cache-related the issue is. In my iperf example, after adding a nonsensical "volatile" qualifier in front of status, I was not able to crash it in 30 minutes (I stopped after that), where before it would crash within 10 seconds to 1 minute:

//Set the LOCK bit. Handle related stuff and return true if we become the acquiring processor.
SPI_MASTER_ISR_ATTR static inline bool acquire_core(spi_bus_lock_dev_t *dev_handle)
{
    spi_bus_lock_t* lock = dev_handle->parent;
    volatile uint32_t status = lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK);

    // Check all bits except WEAK_BG
    if ((status & (BG_MASK | LOCK_MASK)) == 0) {
        //succeed at once
        lock->acquiring_dev = dev_handle;
        BUS_LOCK_DEBUG_EXECUTE_CHECK(!lock->acq_dev_bg_active);
        if (status & WEAK_BG_FLAG) {
            //Mainly to disable the cache (Weak_BG), that is not able to disable itself
            bg_disable(lock);
        }
        return true;
    }
    return false;
}

I've not looked at the generated code, but that is really weird

@philippe44
Copy link
Author

philippe44 commented Jan 8, 2022

A better view on that previous post: adding "volatile" forces the compiler (at any optimization level) to emit "memw" around the accesses to the variable. That is enough for the load induced by the modified iperf example, but not "enough" for my real-life application. The following, however, works:

//Set the LOCK bit. Handle related stuff and return true if we become the acquiring processor.
SPI_MASTER_ISR_ATTR static inline bool acquire_core(spi_bus_lock_dev_t *dev_handle)
{
    spi_bus_lock_t* lock = dev_handle->parent;
    uint32_t status = lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK);
	
    for (volatile int i = 0; i < 10; i++);
	
    // Check all bits except WEAK_BG
    if ((status & (BG_MASK | LOCK_MASK)) == 0) {
        //succeed at once
        lock->acquiring_dev = dev_handle;
        BUS_LOCK_DEBUG_EXECUTE_CHECK(!lock->acq_dev_bg_active);
        if (status & WEAK_BG_FLAG) {
            //Mainly to disable the cache (Weak_BG), that is not able to disable itself
            bg_disable(lock);
        }
        return true;
    }
    return false;
}

It really seems to be linked either to a real-time issue or, as I'd rather think, to something in the pipeline/cache where the "lock" variable is not updated when scheduling happens and another call to acquire_core() is made. I really don't know anymore; this requires analysis capabilities I don't have.
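If the barrier is really what matters, one could make it explicit instead of abusing volatile; an untested sketch of what that would look like on Xtensa:

// Explicit full barrier: "memw" orders memory accesses in the Xtensa
// pipeline, and the "memory" clobber stops the compiler from reordering;
// the same effect the volatile qualifier produces as a side effect
static inline void spi_lock_barrier(void)
{
    __asm__ __volatile__ ("memw" ::: "memory");
}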

@mkfrey

mkfrey commented Mar 7, 2022

I'm also having the same issue with v4.4 and a combination of a W5500 and ILI9341 on the same host.

@philippe44
Author

The only stable option I've found was to add a mutex per SPI host. It works without problems. I've done it in a way that minimizes impact: the lock is created only if more than one client adds a device on the same host; otherwise, there is no mutex.
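In outline it looks like this (a sketch with my own names, not actual driver code):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

#define HOST_COUNT 3   // SPI1..SPI3 on the ESP32

static SemaphoreHandle_t s_host_mutex[HOST_COUNT];
static int s_host_dev_count[HOST_COUNT];

// Called whenever a device is added: the mutex only comes into
// existence once a second device shares the same host
static void on_device_added(int host)
{
    if (++s_host_dev_count[host] == 2 && s_host_mutex[host] == NULL) {
        s_host_mutex[host] = xSemaphoreCreateMutex();
    }
}

// Wrapped around every transaction; a no-op for single-device hosts
static void host_lock(int host)
{
    if (s_host_mutex[host]) xSemaphoreTake(s_host_mutex[host], portMAX_DELAY);
}

static void host_unlock(int host)
{
    if (s_host_mutex[host]) xSemaphoreGive(s_host_mutex[host]);
}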

@dvosully
Contributor

dvosully commented Mar 19, 2022

I have also been encountering the same issue with v4.4 when using the SDSPI driver and an ILI9341 LCD screen on the same bus, using different spi devices from different threads.

@philippe44 Are you willing to share a PR or diff with your mutex locking changes? I've been trying to implement such without success so far. (I rather suspect I've been insufficiently aggressive about what needs protecting).

Your volatile for loop suggestion is reducing the frequency of failure for me though, thanks for the tip.

EDIT: Found the issue with my mutex protection. It's not as nice as the one you described, but it does the trick. I'll try to refine it and throw up a PR for anyone else who just wants the driver to work like it says on the tin.

@swbew

swbew commented Mar 21, 2022

@dvosully can you please share your "not as nice" solution? ;-)

@dvosully
Contributor

@swbew Sure.

https://github.com/dvosully/esp-idf/tree/bugfix/spi_locking

It's a little rough and ready, and it will have a performance hit, since I'm using a recursive mutex, which is pretty heavyweight. But I'm using LVGL on a 240x320 graphic display and doing plenty of file writes, and it's not slow enough to cause me great concern. I could probably wrap much more selective portions of the driver with a binary mutex and still fix the issue, but it's working for me now.

It may have some issues - I've seen 3 new ESP_LOGV() assert failures over 2 days of running. Maybe there's a priority inheritance issue or something?

To use it, supply the SPICOMMON_BUSFLAG_MUTEX_LOCK option in the bus config.flags, in addition to any other flags you are using. If you are using the SPI init functions from spi_bus.h in esp-iot-solution (likely if you are using a graphic display), you will need to edit spi_bus.c to use that flag on init. Otherwise, use it wherever you initialise your SPI bus configuration.
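For example (pin numbers taken from earlier in this thread; SPICOMMON_BUSFLAG_MUTEX_LOCK only exists on the branch above):

spi_bus_config_t buscfg = {
    .mosi_io_num = 23,
    .miso_io_num = 19,
    .sclk_io_num = 18,
    .quadwp_io_num = -1,
    .quadhd_io_num = -1,
    // the new flag, in addition to whatever flags you already use
    .flags = SPICOMMON_BUSFLAG_MASTER | SPICOMMON_BUSFLAG_MUTEX_LOCK,
};
ESP_ERROR_CHECK(spi_bus_initialize(SPI2_HOST, &buscfg, SPI_DMA_CH_AUTO));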

It's based off the 4.4 release, not master, so I can't automatically merge with master yet, and thus I don't have a PR to make it nice and easy to link and grab. I'll try to do that at some stage, but for now this'll do.

I've also only applied it to spi_master; I haven't thought at all about the spi_slave driver. Likely an SPI slave normally only operates a single device from a single thread.

Ultimately, the best solution would be to fix the SPI lock, which isn't sufficiently atomic (maybe it's atomic on a single core, but not when threads run on different cores?).

@swbew

swbew commented Mar 22, 2022

@dvosully Wow! Thank you so much! I use an SSD1327 OLED display as a second device on the same host as the W5500, and I also use esp-idf 4.4. The mutex does the trick for me.

@espressif-bot espressif-bot added Status: Selected for Development and removed Status: Opened labels Aug 15, 2022
@espressif-bot espressif-bot added Status: In Progress and removed Status: Selected for Development labels Sep 2, 2022
@Icarus113
Collaborator

@philippe44 Hi, would you mind giving this a try? The code change is in the attached file, together with a potential issue flow.
It's hard to reproduce locally, so I'm not quite sure if this is the situation.

0001-SPI_BUS_LOCK-fix-a-concurrency-issue.txt

@philippe44
Author

Sure I will

@philippe44
Author

philippe44 commented Sep 5, 2022

Right... it took me a while, because I first wanted to reliably reproduce the failing case (I had not worked on this for a few months). When I tried, it only failed after 5-10 hours. Then I realized that I had patched spi_master.c to add a lock, but I had also changed spi_bus_lock.c to add a short dummy loop to alter the timing (just trial and error when I was trying to figure out the cause), and I had forgotten to remove it. Once that loop was removed, it failed after a few minutes. I'm glad to report that with your fix it has not failed after a couple of hours, where, again, with the untouched 4.3.2 files it rarely worked for more than 5 minutes.

[edit]: one thing I cannot tell is whether this solves the root cause or just masks it by changing the timing (this is what I tried before: there was a proper solution with a mutex that worked 100%, and a hack with a dummy for-loop with a volatile in acquire_core() after lock_status_fetch_set() that worked for 5-10 hours).

@philippe44
Author

@dvosully with many months' delay, here is what I used for the mutex in spi_master:
spi_master.c.zip

@kostaond
Collaborator

kostaond commented Sep 6, 2022

@philippe44 thank you very much for your support! I'm just not sure I fully understand whether the fix proposed by @Icarus113 resolved the issue, or whether you still observed it after a couple of hours. Regarding Espressif's fix, it should be a proper fix: it ensures that a certain part of the lock's code is executed without any preemption, so that all variables are written atomically (i.e. it takes care of the race conditions). It is a kind of mutex, but not as heavyweight as an actual FreeRTOS mutex.
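Conceptually it is something in the spirit of this sketch (illustrative only, not the actual patch), where the decisive read-modify-write and the acquiring_dev bookkeeping happen as one indivisible step:

// Sketch of the idea: a spinlock-based critical section makes the status
// update and the bookkeeping atomic with respect to both cores, cheaper
// than a FreeRTOS mutex (no blocking, no scheduler involvement)
static portMUX_TYPE s_lock_mux = portMUX_INITIALIZER_UNLOCKED;

static bool acquire_core_guarded(spi_bus_lock_t *lock, spi_bus_lock_dev_t *dev)
{
    bool acquired = false;
    portENTER_CRITICAL(&s_lock_mux);   // no preemption, other core excluded
    uint32_t status = lock_status_fetch_set(lock, dev->mask & LOCK_MASK);
    if ((status & (BG_MASK | LOCK_MASK)) == 0) {
        lock->acquiring_dev = dev;
        acquired = true;
    }
    portEXIT_CRITICAL(&s_lock_mux);
    return acquired;
}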

@philippe44
Author

philippe44 commented Sep 6, 2022

Sorry for not being clear: the fix from @Icarus113 fixes the issue, or at least after a few hours it still has not happened, whereas without the fix it happens after a few seconds or minutes at most.

@Icarus113
Collaborator

@philippe44 @kostaond Thanks for the feedback~

@espressif-bot espressif-bot added Status: Reviewing and removed Status: In Progress labels Sep 6, 2022
@dvosully
Contributor

dvosully commented Sep 6, 2022

Thanks @philippe44 and @Icarus113
Am keen to test but am away from my hardware for a few weeks. Sounds like it's working though! Will check once I'm back.

@philippe44
Author

It has been running w/o interruption for about 2 days now

@Icarus113
Collaborator

Thanks for the feedback.

This fix does fix a concurrency issue. I'm not 100% sure there aren't other concurrency issues yet; they are very hard to expose.
I'm going to merge the fix, and it'll be synced to GitHub later.

@espressif-bot espressif-bot added Resolution: Done and Status: Done, and removed Status: Reviewing and Resolution: NA labels Sep 9, 2022
@KaeLL
Contributor

KaeLL commented Sep 12, 2022

@Icarus113 @philippe44 Is it within the realm of possibility for #6624 to be caused by this?

@dvosully
Contributor

@KaeLL The backtrace you include in your fault report is exactly the same as I was seeing when getting this fault; I would suggest it's well within the realm of possibility 🙂

espressif-bot pushed a commit that referenced this issue Sep 23, 2022
define: lock_bits = (lock->status & LOCK_MASK) >> LOCK_SHIFT. This `lock_bits` is bits 29-20 of lock->status.

1. spi_hdl_1:
   acquire_end_core():
   uint32_t status = lock_status_clear(lock, dev_handle->mask & LOCK_MASK);
   Because this is the first device, `spi_hdl_1`, after this lock_bits == 0b0 and status == 0.

2. spi_hdl_2:
   acquire_core():
   uint32_t status = lock_status_fetch_set(lock, dev_handle->mask & LOCK_MASK);
   Here status is 0b0, but lock_bits == 0b10, because this is `spi_hdl_2`.

3. spi_hdl_2:
   `acquire_core` returns true because status == 0, so `spi_bus_lock_acquire_start(spi_hdl_2)` won't block.

4. spi_hdl_2:
   spi_device_polling_end(spi_hdl_2).

5. spi_hdl_1:
   acquire_end_core():
   status is 0, so it clears lock->acquiring_dev.

6. spi_hdl_2:
   spi_device_polling_end():
   assert(handle == get_acquiring_dev(host)); fails.

Closes #8179
espressif-bot pushed a commit that referenced this issue Sep 23, 2022
@KaeLL
Contributor

KaeLL commented Sep 26, 2022

@Icarus113 ETA on release/v4.1?

antmak pushed a commit to antmak/dev-idf that referenced this issue Sep 28, 2022
espressif-bot pushed a commit that referenced this issue Oct 15, 2022