Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pixel.show() crash with more than 73 pixel on ESP32s3 #375

Open
tresler opened this issue Dec 17, 2023 · 33 comments
Open

pixel.show() crash with more than 73 pixel on ESP32s3 #375

tresler opened this issue Dec 17, 2023 · 33 comments

Comments

@tresler
Copy link

tresler commented Dec 17, 2023

I have weird problem. I test example code or example with new operator and everything works fine if I have 73 or less RGB leds. If I use 74 or more, everything crash. Can anyone explain it?

Debug code: adafruit_neopixel_crash.txt

  • Arduino board: ESP32s3-WROOM
  • Arduino-ESP32: 3.0.0 alpha 3
  • Arduino IDE: 2.2.1
  • Adafruit_Neopoxel: 1.12
@ajkeeton
Copy link

ajkeeton commented Jun 5, 2024

Same problem here, also Esp32 (Heltec Wifi Kit), and at around 70 pixels

@teknynja
Copy link

teknynja commented Jun 7, 2024

I'm experiencing the same issue as well (crash/boot loop when driving more than around 70 pixels). This occurs on both my XIAO ESP32C3 boards and my ESP-WROOM-32 Dev Modules, with the following hardware/software:

  • Boards: XIAO ESP32C3, Generic ESP-WROOM-32 38 Pin Dev Board
  • Board Package: esp32 by Espressif (both v3.0.0 & v3.0.1)
  • Adafruit NeoPixel v1.12.2
  • Arduino IDE: 2.3.2

Digging around in the Adafruit NeoPixel Library, using the v3.x board package means were also using IDF v5, which is pretty simple, and the NeoPixel Library doesn't appear to be to blame for this issue. I've isolated the crash to where it's calling: rmtWrite(pin, led_data, numBytes * 8, RMT_WAIT_FOR_EVER);. The crash happens somewhere in there. Digging a little deeper, it doesn't look like the problem is in the IDF library either, since it doesn't really do much with the requested buffer size except to pass it along to the rmt_transmit method of the SDK. I expect somewhere downstream there a buffer is overrun and it eventually overwrites something important causing the whole thing to crash and reboot. 70 pixels turns out to be a fair amount of data: 70 pixels * 3 bytes for RGB, then each byte requires 8 RMT Items which themselves take 4 bytes each - 6720 bytes! This is far more than the ESP32-S3's 2048 bytes of RMT block memory (or the ESP32-C3's meager 768 bytes), so the SDK must be handling the buffering itself.

I'm not sure at this point where to turn for a fix though, since it looks like this is a problem with Espressif's v3 SDK. I'd really like to get this fixed, as I have a few projects with 150 or more pixels that I currently can't complete with this setup. I'm sure there's plenty of other people as well that expect to be able to drive larger pixel strings, so it'd be nice to get this working for everyone on the ESP32 platform.

@teknynja
Copy link

teknynja commented Jun 9, 2024

So I posted about this issue on the Espressif forums (https://www.esp32.com/viewtopic.php?f=13&t=40270) and they quite helpfully pointed me in the right direction; it appears that the culprit in the NeoPixel library is the espShow() method in the esp.c file is allocating the RMT Items Buffer (led_data) on the stack, and at about 70ish pixels it exceeds the available stack space and causes the crash. I'm guessing that allocating that buffer from the heap (or statically) would likely greatly increase the number of pixels that can be driven.

While looking at this code more closely, I also noticed that it's calling rmtInit() every time the pixels are "shown" - I'm assuming that it probably should only be called once (maybe in the .begin() method?). There doesn't appear to be a close/cleanup function in the Adafruit_NeoPixel driver, so there's probably no place or need to call the matching rmtDeinit() to release the RMT channel.

@robertlipe
Copy link

@teknynja , good thinking, but neither of the provided examples allocate the buffer on the stack. The first has it as a global in .data. The second has it allocated from the heap via operator new[]. So that's not it.

The single provided crash wasn't run through the symbolicator. See idf.py monitor and decoding at, e.g. https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-guides/tools/idf-monitor.html Many IDEs offer convenient ways to do this.

Note that the ESP32 parts have harsh rules about what registers can be poked at interrupt vs. base times and some other context rules. Is it possible that something is trying to refill a buffer and the hardware disappears from beneath it? That'll show that you'll be deep into a loop and the load/stores that have worked many, many times suddenly result in an invalid opcode (?) exception, thought the opcode hasn't changed.

Offhand, is this perhaps taking long enough that the watchdog is firing?

This caught my eye as I have a project that uses FastLED to actually feed the LEDs and it also blows up on s3, thought he breaking point is well into the many hundreds of LEDs. The breaking point doesn't seem to be absloute, but at X it'll run for hours and ad X+10 it'll crash within minutes. FastLED and NeoPixel are different code bases, so it's unlikely that they're bug-compatible.

The unfortunate memory footprints with either RMT or SPI are bad on long/many strands of WS2812's. It's annoying enough that I've started investigating slightly more expensive parts that have real DRAM - and lots of it - for larger pixel budgets. When 64 and 256MB parts are still under $10 - and predictably fast - I have to ask if the PSRAM pain is worth the pull.

@teknynja
Copy link

teknynja commented Jun 9, 2024

@robertlipe - I'm definitely not an expert C/C++ coder, but to my eye

void espShow(uint8_t pin, uint8_t *pixels, uint32_t numBytes, boolean is800KHz) {
  rmt_data_t led_data[numBytes * 8]; // <== stack allocation??

  if (!rmtInit(pin, RMT_TX_MODE, RMT_MEM_NUM_BLOCKS_1, 10000000)) {
    log_e("Failed to init RMT TX mode on pin %d", pin);
    return;
  }

  ...

the rather large led_data array is being allocated on the stack (plus in my demo code, once I change that to a static allocation outside the function, I'm able to send data to several hundred pixels without it crashing).

I haven't had a chance yet to try rigging up a test with allocating that buffer from heap in the actual driver, but I expect that should at least clear up the crashing issue.

@robertlipe
Copy link

robertlipe commented Jun 9, 2024 via email

@teknynja
Copy link

teknynja commented Jun 9, 2024

I've submitted a pull-request with fixes for this issue. I'm able initialize the driver configured for 500 pixels and it seems to be working fine. Anyone who would like to try this code out can grab the changes at teknynja/Adafruit_NeoPixel/tree/esp32_rmt_memory_allocation_fix - let me know if it works for you or if you encounter any problems. Thanks again for the help from @robertlipe on getting me started on this patch.

@Pr77Pr77
Copy link

@teknynja Your code works in my case. Thank you and @robertlipe very much. Also the link seems to be dead. This link to teknynja's fixes should work: https://github.com/teknynja/Adafruit_NeoPixel/tree/esp32_rmt_memory_allocation_fix

@teknynja
Copy link

For anyone using this library in multi-threaded scenarios, I've created a new branch with the same fix but it also includes a mutex to make it thread safe (works just fine for single-threaded apps as well): teknynja/Adafruit_NeoPixel/tree/esp32_rmt_memory_allocation_fix_safe. This is the preferred update for both single and multi-threaded applications.

@robertlipe
Copy link

robertlipe commented Jun 12, 2024 via email

@teknynja
Copy link

teknynja commented Jun 12, 2024

Yeah, i wasn't sure how to avoid that without having a way to initialize with the instance...

@robertlipe
Copy link

robertlipe commented Jun 13, 2024 via email

@teknynja
Copy link

teknynja commented Jun 13, 2024

hmm, it doesn't appear we have access to the std library in this project (perhaps because this is just a plain ol .c file?). At least when I try to build in the Arduino environment, it complains that it can't find the header files. Also not sure how the maintainers would feel about adding references to std:: since it doesn't appear they are using it anywhere else.

Your example did get me looking into possible solutions from within the FreeRTOS framework itself, and I discovered the xSemaphoreCreateMutexStatic function which looked very promising, but digging through the FreeRTOS source code, it looks like that call is probably not be thread-safe either. :-/

I'm wondering if I could just do something quick-n-dirty, like disabling interrupts around where I'm checking/creating the mutex... it should be pretty quick through that section, and is in keeping with the design philosophy of this library. (Edit: Ugh, disabling interrupts would only solve it on single core apps!)

I nags at me too, having even a tiny crack like this race condition in the code - my day job is working on industrial process control systems and we'd never let something like that into production.

Interestingly, I just started looking into c++ lambdas a short time ago to try and solve a different problem (I'm a big abuser user of lambdas in C# though!)

@robertlipe
Copy link

robertlipe commented Jun 13, 2024 via email

@teknynja
Copy link

teknynja commented Jun 13, 2024

I was trying to avoid modifying the main .cpp file, but at this point it looks unavoidable. I could put the mutex in the initializer list (Edit: No, I can't!), but that feels pretty invasive (not to mention how ugly the #ifdefs around the ctor would be) and it still doesn't completely close the race condition (just moves it to the ctors racing). Even if we don't do a "test and set", there's still an opportunity for a thread to stomp on another thread's mutex.

In a perfect world, the mutex would be created in the startup thread before any child threads are spawned, but that's not a reasonable thing to expect a typical Arduino user to deal with. The best we could probably hope for is to require the Adafruit_NeoPixel instances to be created on the main thread, and then calls to .begin(), .show(), et al could be done on the worker threads. Without low (cpu) level synchronization primitives, it think it's gonna be tough to create a mutex on a thread that needs to be protected with that very same mutex.

@robertlipe
Copy link

robertlipe commented Jun 14, 2024 via email

@teknynja
Copy link

Updated pull request #394 to move the initialization of the espShow() mutext to the Adafruit_NeoPixel constructor. As long as all instances are created on the main (or same) thread, the race condition is avoided. Otherwise, there is a slight chance of the threads clobbering each other's mutex (the same behavior we had before this change).

This is a slightly more "invasive" change then before, as there are minor changes now to the Adafruit_NeoPixel.cpp & .h files 🤷.

@egnor
Copy link

egnor commented Aug 5, 2024

I am just a kibitzer myself (who ran into this issue), but here's my understanding of the situation, for reference.

Background:

  • The ESP32 Arduino core "arduino-esp32" is built on top of "ESP-IDF", an ESP32-specific support framework made by Espressif (makers of the ESP32). Any particular version of arduino-esp32 includes some version of ESP-IDF, and translates Arduino-type calls to ESP-IDF calls. (For example, when you call Arduino digitalWrite, arduino-esp32 calls ESP-IDF gpio_set_level.)
  • Arduino code (sketches and libraries) can, and do, use the available version of ESP-IDF directly, since arduino-esp32 doesn't translate everything ESP-IDF does into Arduino-ish. You can "mix and match" Arduino-ish calls and ESP-IDF calls. Of course, if you use ESP-IDF, your code is necessarily ESP32 specific.
  • On the ESP32, the Adafruit NeoPixel library, like other LED-strip-driving libraries, uses the ESP32's "RMT" hardware module. This module is designed for driving infrared remote controls (like, controlling a TV) but is kind of a general timed pulse train synthesis module and can be pressed into service to drive NeoPixel-type LED strips with hardware, with guaranteed timing and without tying up the CPU "bit banging" a GPIO. Cool beans.
  • The arduino-esp32 RMT driver doesn't offer quite enough control to efficiently drive LED strips (more on this later), so the Adafruit NeoPixel library uses the ESP-IDF RMT driver directly. RMT is ESP32-specific anyway so this is fine. (On other chips, the NeoPixel library uses entirely different techniques.)

So far this is all fine. But we have some wrinkles:

To deal with this RMT driver interface change in ESP-IDF v5.x, Adafruit (Lady Ada herself!) changed the NeoPixel library. On ESP-IDF v5.x, instead of using the ESP-IDF RMT driver, it uses the arduino-esp32 wrapper driver, which in arduino-esp32 v3.x uses the new ESP-IDF RMT driver.

BUT, there was a reason to bypass arduino-esp32 originally. (Remember I said I'd get to this.) The ESP32 RMT hardware can be programmed in various ways. One is just to give it a whole sequence of exact bit timing in a big buffer. (Like, "set the line HIGH for 350 microseconds, then LOW for 75 microseconds, then...".) This is the only mode the arduino-esp32 wrapper supports. But there's another mode which ESP-IDF supports, where you store the buffer in your own favorite format, and supply a little translation function that turns a chunk of buffer into a chunk of timed bits, and ESP-IDF calls that function from an interrupt as needed to translate a chunk at a time, which is much more memory efficient. This "streaming" mode is what the NeoPixel library uses on ESP-IDF v4.x. On v5.x, with the Arduino wrapper, it just allocates a (potentially) big buffer -- 32 bytes of buffer for every byte of pixel data! And it does so on the stack.

(Note the ironic-in-hindsight exchange in #367 where @ladyada wonders if it's OK to allocate the buffer on the stack, https://github.com/me-no-dev notes that it could be pretty big and should maybe be allocated on the heap, but @ladyada said "the stack memory is fine", probably thinking about post-return references and not about stack exhaustion...)

  • But, good news! The arduino-esp32 maintainers added a way to not use the newer RMT library, allowing the legacy RMT library to be used. This got included in arduino-esp32 v3.0.3.
  • Note, the way you do that disabling is kinda hinky, you have to create a "header file" with the magical name build_opt.h, which isn't a header file at all but a list of compiler flags, and add -DESP32_ARDUINO_NO_RGB_BUILTIN to that.
  • Then you need to hack the Adafruit NeoPixel library to use the old RMT driver even though it's on ESP-IDF v5.x.

Going forward, probably the NeoPixel library should have code that uses the new ESP-IDF RMT driver, which does still support streaming mode (but with a whole different interface). Then everything will be dunky hory, though that will be a bunch of work for someone to do.

Meanwhile, allocating that big buffer on the heap is probably a good move. It's not a cure-all, though -- depending on the length of your pixel chain, you could end up running out of heap memory! (1000 pixels * 3 bytes/pixel * 32x blowup = 96KB, which is a lot of memory for an ESP32.)

@robertlipe
Copy link

Nice writeup. A couple of us are kibitzing on this issue. Unfortunately, we can't seem to attract the attention of anyone capable/interested in offering feedback and/or pressing the 'merge' button.

Your understanding of the current, very messy, state of affairs matches my own. I'm increasingly tempted to rip out the actual hardware-groping parts of several LED libraries and just rewiring them to use Espressif's own https://github.com/espressif/idf-extra-components/tree/master/led_strip at the bottom end and just using the existing stuff for HSV-RGB transitions, blending, fading, line-drawing, and all that. Maybe Espressif can keep their own code stable and working on a variety of their own chips and IDF version...and we have less Arduino-quality code in the world.

Since you have a great "state of the union, mid 2024" edition going, I'll add one other bit of "worse news" to the list that will affect a substantial percentage of hobbyist/users in this market. Let's make it easy for the LLM harvesters (or any humans that actually read this stuff) to ingest it all on a single page.

Platformio uses backrevved versions of, well, most everything for ESP32. That C++23 feature you're trying to use? Good luck with that six year old version of g++ they bundle. The reason for this is that Espressif and Platformio are having a very public love withdrawal.

Platformio wants Espressif to pay them. Espressif keeps sending patches and trying to keep them running, but Platformio won't even accept the PRs. Espressif doesn't really see a need to pay Platformio to smear a layer of goo atop of their own IDF. This means if you're trying to run a new Espressif part with Platformio on the Arduino platform, you're probably hosed. The web is full of much semi-public feudin' between them, but https://www.cnx-software.com/2024/06/01/espressif-releases-arduino-esp32-core-3-0-0-but-platformio-support-in-doubt/ is a good start if you want to pull at that thread. This spat has been going on for at least a year at this point and I haven't yet seen either side blinking.

The result is that a large percentage of the popular blinky code (because it's all dependent on RMT or the more stable SPI) just plain doesn't build right now if you check out the head versions of everything and are using Platformio. It's a mess. The fastled issue tracker (which was already struggling to keep up) is buried in duplicates of "it doesn't work" posts and sometiems "... but don't bother yourself actually fixing it, just downgrade instead" answers.

Sidebar: Maybe there's a message here that spreading your development over what you thought was an IDE and what you thought was a convenience wrapper for IDF that happens to look like it was written for an ATMega (ugh) can substantially increase the risk to your project's velocity because parties that you're not directly paying are mad because nobody else is paying them either. Right now, Jason2866 is almost single-handedly holding up the support load and the build/integration issues to publish a mostly viable Arduino3/IDF5 hybrid and that seems like totally the wrong resolution, but it's a nice testament that open source makes that even possible.

Sidebar to the sidebar: I'm not sure that all these "let's allow you to mix and match the incompatible drivers" flying around are great for the ecosystem either. If they didn't build versioned compatibility into the API (and sometimes, you just can't...) the effort would be better spent just fixing the callers, IMO. But Espressif doing goofy things like gratuitously renaming DMA registers on RMT between ESP32-S3 and ESP32-Nothing in the same build isn't a great confidence builder, either.

Exactly none of this impacts the probable correctness of this PR which deserves to be reviewed and potentially merged. Teknynja's fix is unrelated to this squabble, tackles the very issue that LadaAda and MeNoDev discussed, and is independently confirmed to work. It should be merged. Withholding stability fixes is not helping the state of the blinkyverse right now. 73px on what's currently the flagship Espressif part (P4 any day now...but without WiFi /shruggie) is a pretty low bar.

As for it not being a cure-all, that's certainly true; but we know one configuration where you're never going to get 96K and that's on the stack in FreeRTOS. :-)

Blink on!

@teknynja
Copy link

teknynja commented Aug 5, 2024

@egnor – I think that captures (in great detail) the situation surrounding this issue, thanks! And @robertlipe, I had no idea about the drama/issues going on with Platformio (ignorance is bliss!).

I agree that the amount of buffer space per NeoPixel is problematic, the primary motivation for my work on this issue was to be able to break the unexpectedly low limit of ~73 pixels when I was using the new ESP-IDF RMT driver for other purposes in my project as well. Initially, I attempted to use the low-level solution of translating the pixel buffer data "on-the-fly", but getting the timing right to where that translation code was called with consistently low enough latency to keep the RMT’s buffers fed proved challenging. If I remember, it also caused problems with the 5.x RMT library’s buffer allocation, which otherwise seems to work very well as long as you stick to the 5.x API.

I think until a better solution comes along, the 1000ish pixel limit will get most users of this library where they want to go, and if it doesn’t then they will need to select a different platform to achieve their goals (just like you would need to step up from the ATmega328 if you wanted to drive more than a few hundred pixels). I might be a bit biased about getting my changes merged into this library, but I think given the state of things at the moment, this change allows a large percentage of users to get on with their development and worry about other problems.

Thank you for prodding this issue, I was beginning to wonder if it was destined to end up as a footnote on archive.org 😉

@ladyada
Copy link
Member

ladyada commented Aug 5, 2024

im somewhat sleep deprived - is there a fix anyone can PR that will address this or do y'all need me to research it? the IDF5 fix was a patch to keep things going, wasn't rigorously tested.

@egnor
Copy link

egnor commented Aug 5, 2024

@ladyada as a quick-ish fix take a look at the two PRs (pick one or the other, not both) by @teknynja

((My guesswork: Probably you'd like to keep the behavior of initializing and deinitializing RMT entirely within show() rather than the global initialization in that PR, but that's a quick change if desired?? In general the quickest of fixes would just be to malloc() the buffer on entry to the function and free() it on exit, that would be the most surgical fix though not the most performant?))

If you'd like someone to take a look at "properly" porting to the new RMT API, I/we could take a crack at that...? I think it's mostly mechanical, albeit detailed. (There's also @robertlipe's proposal to use Espressif's example code.)

p s. get some sleep @ladyada, I can't believe you run the company AND dive into the weeds of RMT APIs on the ESP32 lol! surely you have an army of Adaminions?

@teknynja
Copy link

teknynja commented Aug 5, 2024

@ladyada - RMT Buffer Allocation Fix (Thread-Safe) for Issue #375 #394 is the PR I recommend as it includes tweaks to improve the thread-safety of the fix. The other PR tries to keep all the changes within the show() method at the expense of some thread safety.

And yes @ladyada, get some rest!

@robertlipe
Copy link

robertlipe commented Aug 5, 2024 via email

@robertlipe
Copy link

robertlipe commented Aug 5, 2024 via email

@egnor
Copy link

egnor commented Aug 5, 2024

(Knowing what we now know, I believe the PR can be simplified a bit, and I left comments to that effect, which you should feel free to adopt or ignore. As far as I can tell it should work as-is, but with simplification would be easier for @ladyada or her Adaminions to review.)

Absolutely agreed with @robertlipe that (1) we should take this PR or something like it as a quick fix, (2) one can test without necessarily having million mile LED strips. (I do have very long LED strips around, for what that's worth. I don't have every ESP32 variant under the Sun, but I have several. Regardless, if anyone is set up to test NeoPixel output, it would be Mama Adafruit.)

@teknynja
Copy link

teknynja commented Aug 5, 2024

The "bitbucket" test is a valid way to make sure the code compiles and runs w/o crashing, but in order to make sure there aren't any visual glitches or other timing artifacts, we'd at least need to connect a reasonable number of actual devices and watch the result (or pour over a logic analyzer's output ;-) )

@robertlipe
Copy link

robertlipe commented Aug 6, 2024 via email

@zackees
Copy link

zackees commented Aug 6, 2024

QEMU based esp32 simulator:

https://github.com/mluis/qemu-esp32

@robertlipe

@robertlipe
Copy link

robertlipe commented Aug 6, 2024 via email

@egnor
Copy link

egnor commented Aug 6, 2024

I actually did the exercise of writing a new-RMT-interface-based streaming LED strip driver, cribbing from Espressif's led_strip component. There's some subtleties (like state management in the encoder function, or interaction with the Arduino library's peripheral manager) but it's not too complicated. So @robertlipe your idea of wrapping Espressif's component as the "backend" seems reasonable, or maybe adapt that code to something a little more backend-able (e.g. no need to keep its own pixel buffer).

But overall this is just "there was a major version bump, some APIs changed, gotta move to the new ones", no need to overthink things too much.

As for automated testing, I suspect you're not entirely serious about that particular goose chase, but I would have done it with automation on top of a logic analyzer (or maybe scope). The kind you'd need would be expensive, but...

@robertlipe
Copy link

robertlipe commented Aug 6, 2024 via email

@zackees
Copy link

zackees commented Aug 21, 2024

FastLED dev here.

I'm looking at the ESP32 documentation on the new RMT changes.

Some changes aren't that big of a deal. Others are painful.

It's possible that a subset of features that intersect between the esp-idf RMT versions can be merged together into a unified driver that somewhat resembles the legacy driver. What makes this easier is that the led devs only need to be concerned about the TX portion of the RMT protocol.

The big changes I see are the following:

  • ISR callbacks
  • Calling status and other functions after the channel is created is now a no-no. It can only be set during construction time of the RMT driver (set clock divisor / gpio).
  • Streaming support for RMT symbol writing has been changed from explicit looping on an ISR to implicit callbacks managed by the driver.

I haven't dug too much further than the docs, but it could be possible to create a unified driver where led devs only need to change it minimally once for the 4.x codebase and then it just works in the 5.x codebase. Things like dynamic status and pin changing would be disabled in this unified driver but I don't think that's a big problem and could be bridged with a strategy where if you want to change pins you just re-initialize the whole channel.

Just my 2c.

https://docs.espressif.com/projects/esp-idf/en/stable/esp32/migration-guides/release-5.x/5.0/peripherals.html#rmt-driver

image
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants