esp32c3: trouble initiating DMA-driven SPI transfers #489
Prior to this commit, LLVM would have to de-optimize a little bit from its theoretical best, lest we fall off the edge of whatever timing boundary we're running into. For more information, see: esp-rs#489
Wow, that is really weird! The fact that a single "nop" makes the difference between working and non-working is quite scary.

I also had a brief look at esp-idf to see if they do some magical things in this direction but I haven't seen anything (which doesn't mean it is not there). The only thing I could imagine is that writing to the DMA registers somehow affects the SPI interrupt flags, and without the "nop" the clearing of the interrupts collides with that change ... but also, that sounds strange.

Regarding checking the descriptor-error: probably it should be checked "during" the transfer, since I guess the check is done while reading the descriptors during the transfer.

Regarding disassembling code: You probably already know it, but anyways: passing the option
😮 I did not know that, that's so useful! Thank you for sharing!
ohhh, that's a good thought: I'll see if I can spy anything over that way too.
Yeah, that's kind of what I was thinking when I decided to slot in the nop there? Some sort of asynchronous communication between the two peripherals that needed to happen? But I was more than a little surprised that a single nop covers it: I guess it means that the write to the SPI happens ~at the 8th instruction instead of the 7th following the write to the DMA, and 8 is a multiple of 4? (We run the SPI clock at 40 MHz, so that factor of 4 at least feels possibly relevant?)

Can you think of a way to track changes that the DMA would be making to the SPI registers "behind the scenes"? Would the debug assist dealio help us with that?

Also, something I forgot to mention: we don't have an easy way to test the DMA rx behaviors, but I'm really curious whether they've got the same problem or not, since in
Oh, good point: I went ahead and split that out into #491 for us to track separately.
I think unfortunately debug-assist would only help with the CPU or actual DMA transfers accessing RAM. I don't yet have a good idea how to debug this.
Well, also lacking better ideas, I've applied brute force: I reviewed esp-rs vs esp-idf
Definitions
Discussion

Initially, I thought that clearing the transaction done interrupt so much earlier than we do might be having an effect, but I tested both that and the possibility of other timing dependencies by (re)moving everything from between enabling the outlink and setting SPI_USR, and by adding a bunch of delays "buffering" that pair of operations, which seemed to have no effect. So, next, I tried playing with the amount of delay by varying the number of nops; I also inserted a

What I came up with was this:
(unsatisfying) conclusion

So, uh, what do? It seems to me that the esp-idf works almost entirely by accident[^3], due to how a loosely related chip (the ESP32) happens to be configured. The nop-driven approach has the advantage of explicitly modeling the timing dependency, I suppose; would y'all be happy with a PR that shuffles the SPI order of operations and adds 36 cycles of nops (specifically on the

I was pawing through the different hardware-set interrupt bits to see if there's a way we could identify and wait for whatever operation takes ~225 ns; the only thing I've found so far is

The only other bad idea I had was to try and figure out what a DMA segmented transfer looks like; presumably, if we bake the SPI configuration into a CONF buffer ahead of the data buffers, the hardware will handle it appropriately? I'm very suspicious we'll still have the same problem, though: from what I can see in the TRM, initiating a segmented transfer looks almost identical to a single transfer with respect to the OUTLINK & SPI_USR bits.

The only other thing a close reading of the TRM revealed was that in between turning on the OUTLINK and setting SPI_USR they've documented a step to reset the SPI's afifos (though esp-idf does it a fair bit earlier, and it didn't make a difference when I tried it), and a step to "wait for SPI [secondary] to get ready for transfer," which I don't believe is specifically meant to wait for the SPI machinery within the chip itself (and it lacks sufficient clarity to do so, anyway).

I haven't tried to minimize the reproduction case yet, because it's very easy to tell when things go wrong in our (complex) setup, but I suspect it's possible to demonstrate with the loopback, if the SPI is set up in receive mode ahead of time before issuing a

If you do get a hankerin' to try this out yourself, feel free to poke at our fork; we're working off the
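To make the "36 cycles of nops" option above concrete, here's a minimal sketch (a hypothetical helper, not HAL code; 36 cycles at the 160 MHz CPU clock is the ~225 ns figure mentioned above):

```rust
use core::arch::asm;

// Padding between enabling the outlink and setting SPI_USR. The asm nop is
// opaque to LLVM, so the delay can't be optimized away; the loop's branch
// adds a few cycles on top of the 36, which only helps here.
#[inline(always)]
fn outlink_settle_delay() {
    for _ in 0..36 {
        unsafe { asm!("nop") };
    }
}
```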
Thanks for the thoroughly done analysis. Probably wasn't much fun!

I also had a brief look at the TRM (again) and also wasn't able to see anything we can reliably wait for - but that would obviously be a saner solution than the

Maybe @MabezDev and @jessebraham should also have a look at this (if they didn't already) - any ideas from your side?
I appreciate you, thanks for taking the time to read it! I do have fun with this kind of stuff, but it's definitely more fun when the conclusion is a nice neat "here's a good idea" rather than "here's a bunch of nops" 🙃

Part of what's got me stumped is that my instinct is to go down a layer at this point; in other contexts, I've found that e.g. reading a syscall implementation in the kernel can reveal the difference between "I'm holding it wrong" and "this ought to be improved," and may even suggest leverage points. I'm not sure what that'd look like in this domain, though; finding a big pile of VHDL and running it through a simulator a whole bunch? Reaching out to the espressif technical support folks? I suppose I'm looking for a way to bound the uncertainty around "what chips will this affect?"

I do want to say: I think this is actually a great problem to have. It really validates y'all's approach and speaks to the excellence of the tools you've built that I'm able to push up against the limits of the hardware like this, and to have done so relatively easily! I mean, I added a handful of linker directives, and now we're talking about introducing targeted de-optimizations because the esp-rs HAL is too good!
I believe this is now resolved via #1856, thanks for your investigation @sethp. In the end we just added a microsecond delay before starting the DMA to allow the FIFO to contain at least one item. It's not perfect, but I believe this is a limitation of the hw and something we need to work around. I hope this solution ends up being viable for your use-case.
Ah, the classic "fixing a timing issue with a sleep call." I'm not really set up to look at this too closely, but IIRC a microsecond delay is probably a bit more than necessary, though it should still work fine. Thanks for the follow-up!
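For the record, the shape of that workaround is roughly as follows (a sketch under assumptions: the esp32c3's 160 MHz CPU clock and hypothetical helper names; see #1856 for the actual change):

```rust
use core::arch::asm;

// Hypothetical busy-wait assuming a 160 MHz CPU clock; the nop is opaque to
// LLVM, so the loop survives optimization. Each iteration costs a few cycles
// beyond the nop, so this overshoots, which is fine for a lower-bound delay.
fn delay_us(us: u32) {
    for _ in 0..us.saturating_mul(160) {
        unsafe { asm!("nop") };
    }
}

// Usage shape: arm the DMA outlink, give the SPI FIFO ~1 µs to hold at
// least one item, then set SPI_USR:
// tx.prepare_transfer(buf);
// delay_us(1);
// spi.start_transfer();
```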
Apologies for the vague title and the long issue! I'm grateful for all the help you've already given me in getting to this point with your excellent work on the HAL. I very much appreciate your efforts, and my hope for this long letter is that it's received in that spirit of appreciation.
We've recently discovered something strange: our SPI DMA transfers would produce some meaningful data for a few bytes, and then more or less stop long before the configured datalen. We'd get maybe a half dozen to a dozen somewhat distinct values at the beginning before getting "stuck" transmitting the same byte over and over again. Initially, we suspected something in our code was incompatible with #444, because we happened to pull those changes in at about the same time, but so far that has seemed more coincidental than anything else.
Indeed, this has been a particularly interesting one to track down, because it seems unrelated to (logical) code changes at all. We only need to transmit data, so we ran into this problem using `dma_write`[^1]. However, changing to using `dma_transfer` with an empty receive buffer seems to avoid the problem entirely. From what I can tell, `dma_transfer` has an identical logical flow to `dma_write`, except for some additions that more or less no-op right out with an empty receiver (foreshadowing!). In fact, no amount of perturbing the source (e.g. `git bisect`, both on our project's code and the HAL's) revealed a difference in the behavior tied to any particular logic change. No matter the configuration of the peripheral, it seemed, `dma_write` would work inconsistently but `dma_transfer` would always behave as we expected.

A big clue came when I discovered that we had a "load bearing" `#[inline(always)]`: removing that annotation would produce the expected DMA transfer with `dma_write`, and including it would "break" the output. I believe the main difference there was that forcing inlining acted as a de-optimization. The function in question, `transmit_chunk`, is called from two places, so inlining it raised the "cost" of space-wise optimizations and thus de-inlined much of its body, especially `dma_write` and its transitive callees. This, along with the behavior difference when doing more "non-functional" work with `dma_transfer`, is what first suggested a timing problem to me.

Some further context that's probably helpful at this point: we've got a hard-real-time bound on getting our DMA transfer out the door (we're emitting pixel data for a video frame), so we've very aggressively optimized the code path through the HAL that initiates transfers for consistency & speed. Mostly that's been ensuring everything on the hot path is memory-resident so we aren't coupled through the cache: touching the flash, even just once or twice, seems to introduce enough variability that we blow our targets.
We've achieved that mainly by a combination of `#[inline(always)]` and `#[link_section = ".rwtext"]`[^2]. The latter is a spelling of `#[ram]` that still allows LLVM to decide to inline the function (which will be important a little later). We've also turned on LTO and are always running in release mode (though with debug symbols and `-C force-frame-pointers`).
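As an illustration of that combination (a sketch with stand-in names and bodies, not our actual code):

```rust
// Placing the function in `.rwtext` keeps it RAM-resident, like esp-hal's
// #[ram], but still leaves LLVM free to inline it (see above).
#[link_section = ".rwtext"]
fn start_dma_write(len: usize) -> usize {
    // stand-in body: the real thing programs the GDMA + SPI registers
    len
}

// Hot-path callers are force-inlined so nothing on the path bounces
// through flash-resident code (and, via the cache, variable timing).
#[inline(always)]
fn transmit_chunk(len: usize) -> usize {
    start_dma_write(len)
}
```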
The upshot is that we've gotten the whole "start dma write" down to about 2000 cycles[^3] with almost no variability, either within a given execution or across whole reset/power cycles. It's even remarkably consistent across reflashes, when code "away" from the hot path is changed.

Ok, so that's the setup; now for the punchline: the most reliable way I've found to produce the desired operation of the DMA-driven SPI peripheral is to add a single `nop` after `tx.prepare_transfer(..)`:
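A minimal sketch of the shape, with hypothetical stand-ins for the esp-hal handles we actually hold:

```rust
use core::arch::asm;

// Stand-in traits: only the call shape matters for this sketch.
trait TxChannel {
    fn prepare_transfer(&mut self, buf: &[u8]); // programs the GDMA outlink
}
trait Spi {
    fn start_transfer(&mut self); // sets SPI_USR to kick off the transaction
}

fn transmit_chunk(tx: &mut impl TxChannel, spi: &mut impl Spi, buf: &[u8]) {
    tx.prepare_transfer(buf);

    // The load-bearing nop: one idle cycle between the last GDMA register
    // write and the first SPI2 register write makes the transfer come out
    // correctly.
    unsafe { asm!("nop") };

    spi.start_transfer();
}
```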
Suffice to say it took a lot of reading disassembly and failed attempts to figure that out, but it does seem, so far, resilient in the face of code changes in a way that none of my other attempts have been. The problem seems to present itself only when LLVM maximally inlines all the various traits and register access blocks &c., which happens only some of the time. The main variable, I believe, is the set of functions we marked "either inline or place in RAM, your choice." The outcome is very stable for a given pile of code: it seems to happen 100% of the time or 0% of the time for a given binary. However, I find it very hard to predict what changes to that code will cause LLVM to optimize differently, so none of my other attempts have been robust in terms of getting LLVM to optimize enough that we meet our performance targets, but not so much as to break the output.
The positioning of the `nop` is meant to act as a "buffer" between the DMA and SPI peripherals during CPU programming. When it does optimize "fully," LLVM produces disassembly in which the `nop` slots right in between the writes to addresses at the 0x6003_f000 (DMA) base and the subsequent interactions with the hardware registers based at 0x6002_4000 (SPI2). I started with a much larger number of nops (128), which also functioned as I hoped, and was more than a little surprised to find that even a single nop makes the difference.

But now I'm not sure where to go next. Probably adding a single `nop` isn't the right answer, but it does show how sensitive to change the issue is: it's very easy to "fix" it by accident. From what I can tell, the machine code faithfully implements the procedures described in the TRM for the esp32c3, but maybe there's a missing step? Some bit that needs to be cleared? There's no mention (nor mechanism) that I can find for waiting on the DMA's internal programming like there is with the update bit for SPI; is that not a concern in this case? It does seem strange that we immediately check the DMA's interrupt register for descriptor errors (is that guaranteed to be set within a single cycle of turning on the outlink?), but that can't be our problem: the descriptors are valid, they're just not being used.

I've also got little to no intuition about how broad this problem is, either across projects or across esp32* platforms. It's plausible that cranking the optimization settings for one of the bog-standard examples would show the same behavior (I don't think we removed anything that would prohibit LLVM from inlining, just added mandates).
Thanks again!
Footnotes
[^1]: We haven't yet switched over to the new SpiDataMode + `write` functionality from Half-duplex SPI #444, but we're looking forward to it! We did run an experiment early on that suggested it suffers from the same problem, though I haven't gone back and checked.

[^2]: As an aside, we'd be glad to take any suggestions y'all have for how to make this process a little more robust/less intrusive. So far the most reliable way I've found to do it is manually tracing through the disassembly for memory addresses that start with `0x42...` for functions (or `0x3c...` for data) and then manually annotating the things I discover. This works, but isn't exactly error-free or particularly reproducible (though I am eternally grateful to whichever wonderful person taught riscv objdump to recognize lui/addi pairs and put the resultant address in a comment). What's more, I have no idea how to implement that without forking whichever dependency I find myself in to add the annotations, which naturally comes with its own set of costs too.

[^3]: 1783 as of the time of this writing, as measured by a static cycle counter and `mpccr` (which adds about ~16 cycles of measurement overhead); a sketch of the readout is below.
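The `mpccr` readout in footnote 3 can be sketched like so (assuming the ESP32-C3's custom performance-counter CSRs, with the counter register at 0x7e2; the CSR number and the setup via its companion registers should be double-checked against the TRM):

```rust
use core::arch::asm;

// Read the ESP32-C3 performance counter CSR (assumed to be mpccr at 0x7e2);
// it counts whichever event the companion CSRs have selected, e.g. cycles.
fn read_mpccr() -> u32 {
    let v: u32;
    unsafe { asm!("csrr {0}, 0x7e2", out(reg) v) };
    v
}

// Usage shape: bracket the hot path and subtract the ~16-cycle overhead.
// let start = read_mpccr();
// transmit_chunk(...);
// let elapsed = read_mpccr().wrapping_sub(start).wrapping_sub(16);
```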