Speed improvements for discussion #4138

DedeHai · 2024-09-12T19:37:50Z

Added a lot of improvements to core functions making them slightly faster and saving some flash.
I verified that all functions behave the same as before. I measured speed by watching FPS; it is hard to say how much faster it actually is, but I see some improvement, maybe 1%.
The inline of getMappedPixelIndex() is just an idea, not sure that is correct nor do I know if it improves anything.
Did not test on ESP8266.

no real difference in FPS but code is faster. also 160bytes smaller, meaning it is actually faster

tested and working, also tested video

its not faster but cleaner (and uses less flash)

-calculations for virtual strips are done on each call, which is unnecessary. moved them into the if statement.

uses less flash so it should be faster (did not notice any FPS difference though) also cleaned code in ColorFromPaletteWLED (it is not faster, same amount of code)

…alette inlining getMappedPixelIndex gets rid of function entry instructions (hopefully) so it should be faster. also added the 'multi color math' trick to color_add function (it will not make much difference but code shrinks by a few bytes)
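For readers unfamiliar with the 'multi color math' trick mentioned in the commit notes above: the idea is to process two 8-bit channels per 32-bit operation. Below is a minimal sketch, shown for scaling rather than for color_add. The names `scaleColor` and `scaleColorReference` are made up for illustration, a packed 0xWWRRGGBB layout is assumed, and this is not the code from the PR.

```cpp
#include <cstdint>

// Scale all four 8-bit channels of a packed 0xWWRRGGBB color by (amount+1)/256.
// Two channels are processed per multiplication instead of one.
inline uint32_t scaleColor(uint32_t c, uint8_t amount) {
  const uint32_t scale = uint32_t(amount) + 1;  // 1..256, so amount 255 keeps the color unchanged
  // R and B sit in bits 16-23 and 0-7; each product stays inside its own 16-bit lane
  const uint32_t rb = (((c & 0x00FF00FFu) * scale) >> 8) & 0x00FF00FFu;
  // W and G are shifted down into the same lanes; the masked product lands in bits 24-31 and 8-15
  const uint32_t wg = (((c >> 8) & 0x00FF00FFu) * scale) & 0xFF00FF00u;
  return rb | wg;
}

// Straightforward one-channel-at-a-time version, for comparison.
inline uint32_t scaleColorReference(uint32_t c, uint8_t amount) {
  const uint32_t scale = uint32_t(amount) + 1;
  uint32_t out = 0;
  for (int shift = 0; shift < 32; shift += 8) {
    const uint32_t channel = (c >> shift) & 0xFFu;
    out |= ((channel * scale) >> 8) << shift;
  }
  return out;
}
```

Both functions should return the same value for any input; the first one needs two multiplications instead of four.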

blazoncek

Excellent!
There are some formatting issues (spaces) but that's minor.

wled00/FX_2Dfcn.cpp

wled00/FX_fcn.cpp

blazoncek · 2024-09-13T07:25:44Z

wled00/FX_fcn.cpp

@@ -1816,7 +1820,7 @@ bool WS2812FX::deserializeMap(uint8_t n) {
  return (customMappingSize > 0);
 }

-uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const {
+__attribute__ ((always_inline)) inline uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const {


Non obligatory: I would prefer __attribute__ at the end but [[...]] in front.

I honestly have no idea what these precompiler instructions mean in detail; I copied this from what they use in FastLED...
the idea behind this is to get rid of the 'function entry' instructions that are added when a function is called. When I added the inline flash size increased by a few bytes, telling me that it is actually inlined. Since this short function is only called from two places and is called A LOT this may be faster. I have no way to check (would need a proper debugger that shows assembly instructions being executed line by line).

As I've recently learned these are compiler attributes.

🤔 not sure if always_inline plays well with IRAM_ATTR .... the first tells the compiler to always inline the function, the latter says "put the function into IRAM" which means that a real function is needed.

that is what I was wondering too, this is just a suggestion, i.e. to inline this for speed but how to tell the compiler to inline it to the functions that are in ram... not sure how it will do it.

To my understanding:

inline is a hint/suggestion to the compiler. So it might get inlined, or not.

__attribute__((always_inline)) is a directive. So the compiler must inline this function, no matter if it's efficient or not.

If you want to optimize function calls, it's sometimes useful to add __attribute__((pure)) or __attribute__((const)) to the function declaration. But only do this after double-checking that the code is actually "pure" (no side-effects) or "const" (solely depends on arguments). I did this in the MoonModules fork, but honestly it does not give you more than 1 or 2 fps even if you apply it to lots of functions.

See MoonModules@7f9da30
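To make the attribute discussion above concrete, here is a small sketch of how these GCC/Clang attributes are typically written. Both functions (`scale8Sketch`, `mapIndexSketch`) are hypothetical stand-ins, not WLED or MoonModules code, and whether the attributes help in a given build is something to measure rather than assume.

```cpp
#include <cstdint>

// The result depends only on the arguments and there are no side effects,
// so the compiler may merge repeated calls with identical arguments.
__attribute__((const)) uint8_t scale8Sketch(uint8_t value, uint8_t scale) {
  return uint8_t((unsigned(value) * (unsigned(scale) + 1)) >> 8);
}

// 'inline' alone is only a hint; always_inline asks the compiler to inline
// every call. GCC reports an error if it cannot inline such a function.
__attribute__((always_inline)) inline uint16_t mapIndexSketch(const uint16_t* map, uint16_t mapSize, uint16_t index) {
  return (index < mapSize) ? map[index] : index;  // identity outside the map
}
```

Note that `mapIndexSketch` reads memory through a pointer, so it is deliberately not marked `const`.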

wled00/colors.cpp

wled00/fcn_declare.h

wled00/colors.cpp

DedeHai · 2024-09-13T16:46:01Z

one more thing I was thinking about:
the way I do color handling in the particle system is very efficient by creating a buffer of the segment size, rendering to that and then passing it to WLED. it would be even more efficient if I would drop rendering 'mirrored' pixels (may implement that at one point). So my idea was this: create a buffer (RGB, I don't think white is needed for FX right?) that is the size of the largest segment (and stays in RAM so no fragmentation unless segments are changed a lot, not sure how buffers work currently). Render one segment to that buffer (clear to zero whatever the segment needs), the buffer is accessed as 'in order' so no mapping (except mirror and transpose) and once it is rendered (including blurring and stuff) transfer the buffer to the 'strip' using mapping. Then clear the buffer, do the next segment and so on. This would also allow to use 'add pixel color' when a buffer is transferred, making overlays play more nicely, independent of FX supporting it. If a segment FX calls 'get pixel color' it would still have to be fetched from the strip buffer though. but blurring would be a lot faster and overlapping segments could be 'transparent'.
This is only a half-baked idea as I do not yet fully grasp how the whole 'color chain' works from segment to strip (and back) but the whole 'mapping' checks could be cut down significantly if it is not done for each 'setPixelColor' (at least I think so)

willmmiles · 2024-09-13T17:19:32Z

the buffer is accessed as 'in order' so no mapping (except mirror and transpose) and once it is rendered (including blurring and stuff)

I had the same idea! I expect that using flat temporary render buffer of the virtual() sizes, then applying all mapping operations as a post-processing step, would yield a substantial average-case performance improvement. That said, if you try to do mirror and transpose during the render, I think you'll give up most of the benefits as every write still ends up being a pile of variable lookups and conditionals instead of just straight local pointer arithmetic -- I think it'd be better to perform all mapping operations after the render. We'd also get some advantages with code layout and IRAM caching, as each FX function wouldn't need to make so many external function calls.

The big tradeoff is RAM usage. The catch is that the temporary buffer needs contiguous RAM. If it's held allocated, we lose the ability to share that memory (in highly constrained systems) with other tasks like the web server... but if it's not, we risk not being able to get a big enough buffer when it's needed for a render. There may also be systems where there just isn't enough RAM to hold two copies of every pixel state at once.

That said, I'd still love to see a PR that implements a temporary render buffer and pulls the mapping operation out to a separate pass after the FX function call. I think the performance and code size improvements are likely to be compelling, even if it reduces the maximum LED count that can be supported on memory constrained hardware.
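A rough sketch of the "render flat, map afterwards" idea, to illustrate what such a separate pass could look like. Everything here is hypothetical: `SegmentGeometrySketch` and `transferToStrip` do not exist in WLED, and only reverse plus an optional LED map are handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-segment geometry; the real Segment class has many more options.
struct SegmentGeometrySketch {
  size_t start;    // first physical index covered by the segment
  size_t length;   // number of virtual pixels
  bool   reverse;  // draw the segment back to front
};

// Post-processing pass: copy a flat, in-order render buffer to the strip,
// applying reverse and an optional LED map exactly once per pixel.
inline void transferToStrip(const std::vector<uint32_t>& render,
                            const SegmentGeometrySketch& seg,
                            const std::vector<uint16_t>& ledMap,  // empty = no mapping
                            std::vector<uint32_t>& strip) {
  for (size_t i = 0; i < seg.length && i < render.size(); i++) {
    size_t physical = seg.start + (seg.reverse ? seg.length - 1 - i : i);
    if (physical < ledMap.size()) physical = ledMap[physical];
    if (physical < strip.size()) strip[physical] = render[i];
  }
}
```

The FX function would then only ever write `render[i]` with plain index arithmetic, and all conditionals live in this one loop.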

DedeHai · 2024-09-13T22:36:42Z

How are buffers currently? What is the global buffer vs no global buffer?

blazoncek · 2024-09-14T06:03:14Z

Let's start at the bottom and make our way up.

NeoPixelBus (in 2.7.9) uses 3 buffers: 2 render buffers and 1 transmit buffer. Render buffers are used in "double buffer" fashion. The transmit buffer may be small, but render buffers are of size LED count * pixel size, where pixel size can be 24 bits up to 64 bits. (2.8.0 or later may change that as @Makuna is working on optimisations.) Different hardware (RMT, I2S, ...) may have transmit buffers of different size.

WLED creates its own global buffer of size LED count * 32bit (if so selected in settings) for its strip. This is due to the fact that WLED uses NeoPixelBusLg which does luminance adjustments after each SetPixelColor(). This means that GetPixelColor() will not return original color that SetPixelColor() wrote if brightness is less than 255. If you disable global buffer, WLED tries to reconstruct original color but that is not always possible without loss.

Segment does not create any buffers of its own (though in the past we had setUpLeds() that did that). It will utilize strip's buffering when getPixelColor() is called.

This (lossy reconstruction) is why mirroring, reversing and transposing have to be done in the setPixelColor() instead of separately when FX is done writing (BTW each pixel is written only once in FX if the FX function is written correctly).

Newer versions of NPB are going to do brightness scaling only during transmission of data to LEDs so GetPixelColor() will return originally set value. Until that happens we need to deal with it ourselves.
On top of that NPB uses the same bit depth for transmit and render buffers but a future version is scheduled to have transmit buffer of the same bit depth as required by LEDs while render buffers may be of bit depth required by user (in our case 32bits).

And then, there is a CCT support.

NOTE: Note the difference between `GetPixelColor()` and `getPixelColor()`. One is from NPB, the other from WLED.
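The lossy reconstruction described above can be illustrated with plain integer math. This is only a sketch: the scaling formula is an assumption chosen for simplicity and is not taken from NeoPixelBusLg or WLED, and both function names are made up.

```cpp
#include <cstdint>

// Apply brightness to one 8-bit channel, as a driver might do on write.
inline uint8_t applyBrightness(uint8_t value, uint8_t bri) {
  return uint8_t((unsigned(value) * (unsigned(bri) + 1)) >> 8);
}

// Best-effort inverse for the case where only the scaled value was stored.
inline uint8_t restoreBrightness(uint8_t scaled, uint8_t bri) {
  const unsigned restored = (unsigned(scaled) << 8) / (unsigned(bri) + 1);
  return uint8_t(restored > 255 ? 255 : restored);
}
```

With this formula, at brightness 31 the inputs 200 and 201 both scale to 25, so reading back can only return one of them. At brightness 255 nothing is lost.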

DedeHai · 2024-09-14T11:20:12Z

Thank you for detailing this, much clearer now.

I was just looking at 'soap' FX as an example that uses setPixelColorXY() as well as getPixelColorXY() but the following points apply to most FX. The reason I see this is quite "slow":

* colors are handled in CRGB: each getPixelColorXY() is converted from 32bit color, manipulated, and converted back in setPixelColorXY() -> native 32bit handling would improve that, i.e. ditch fastled and implement the used functions in CRGBW32. That requires a lot of code refactoring though. The speed improvement is probably not going to be huge, as conversion is fast (but in principle mostly unnecessary).
* the global buffer is in 'strip space mapping', so each call to set or get is mapped to and from there, including mirror, reverse, grouping etc., plus brightness adjustment in setPixelColorXY(), plus the LED map. This is what mostly makes it slow.

A solution could be to replace the global buffer with individual segment buffers that are not mapped nor brightness adjusted but do that in one go when it is transferred to NeoPixelBus buffer (saving a lot of if statements and back and forth mapping).
The main draw-back would be a lot more memory is used if segments are overlapping. This could be mitigated by setting a memory limit (like it is done with SEGMENT.data) and if that limit is exceeded, fall back to not using the buffers (this is how I do it in the ParticleSystem, btw: if I disable local buffers, PS Fire for example drops from ~85FPS to ~55FPS).

Would that be a good way to try and solve it or is there something fundamentally flawed with this approach?

blazoncek · 2024-09-14T11:46:21Z

* colors are handled in CRGB, each `getPixelColorXY()` is converted from 32bit color, manipulated, converted back in `setPixelColorXY()` -> native 32bit handling would improve that i.e. ditch fastled and implement the used functions in CRGBW32, requires a lot of code refactoring though. Speed improvement is probably not going to be huge as conversion is fast (but in principle mostly unnecessary).

I am not going to rewrite all of the effects. 😄

* the global buffer is in 'strip space mapping' so each call to set or get is mapped to and from there, including mirror, reverse, grouping etc. plus brightness adjustment in `setPixelColorXY()` plus the LED map. This is what mostly makes it slow.

strip only does mapping via ledmaps. Swapping index if necessary. No real slowness there, at least there was none until someone wanted to exclude ledmaps while realtime data was received.
BusDigital::getPixelColor() uses global buffer instead of querying NeoPixelBus.

A solution could be to replace the global buffer with individual segment buffers that are not mapped nor brightness adjusted but do that in one go when it is transferred to NeoPixelBus buffer (saving a lot of if statements and back and forth mapping). The main draw-back would be a lot more memory is used if segments are overlapping. This could be mitigated by setting a memory limit (like it is done with SEGMENT.data) and if that limit is exceeded, fall back to not using the buffers (this is how I do it in the ParticleSystem, btw: if I disable local buffers, PS Fire for example drops from ~85FPS to ~55FPS).

This was used in first iteration of 0.14 while we used setUpLeds() (MM still does). It presented a whole lot of problems which were mitigated using global buffer (which presented a new set of issues, though).

DedeHai · 2024-09-14T11:55:34Z

I am not going to rewrite all of the effects. 😄

No one is asking you to. I am imagining cloning most of the CRGB struct/methods so that CRGBW32 could be used in exactly the same way. Done right, it could be a 1:1 replacement; it would need an update to all FX, but that should not be too complex (mostly a direct find and replace, maybe even a #define to override it for a start).

strip only does mapping via ledmaps. Swapping index if necessary. No real slowness there, at least there was none until someone wanted to exclude ledmaps while realtime data was received.

true, lookup is fast.

BusDigital::getPixelColor() uses global buffer instead of querying NeoPixelBus.

yes, the main difference being the color correction, right? If I disable global buffer, colors get worse but speed stays (almost) the same.

This was used in first iteration of 0.14 while we used setUpLeds() (MM still does). It presented a whole lot of problems which were mitigated using global buffer

Do you mind to quickly elaborate what problems these were?

- there already is a method to calculate the table on the fly, there is no need to store it in flash, it can just be calculated at bootup (or cfg change)

blazoncek · 2024-09-14T12:48:27Z

Do you mind to quickly elaborate what problems these were?

Overlapping segments.

- gamma correction only where needed
- paletteIndex should be uint8_t (it is only used as that)

note: integrating the new `ColorFromPaletteWLED()` into this would require a whole lot of code rewrite and would result in more color conversions from 32bit to CRGB. It would be really useful only if CRGB is replaced with native 32bit colors.

blazoncek · 2024-09-14T13:00:01Z

wled00/FX_fcn.cpp


-  unsigned paletteIndex = i;
+  uint8_t paletteIndex = i;


it is only used as an uint8_t down the line and not involved in any calculation.
edit:
correction: I have index as an unsigned in the ColorFromPaletteWLED() and do a byte() cast there anyway to prevent overflows. so it may be better to revert back to unsigned for consistency.

yes. ATM. what about in the future?
if it is unsigned, compiler will trim it down when passing to a function.

ok reverted to unsigned locally. saves 6 bytes of code. I thought compiler would do better.

XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here, int or unsigned is almost always less code (and slightly faster).

XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here, int or unsigned is almost always less code (and slightly faster).

in general: absolutely yes. But if a function is performed purely in 8bit (like most of the fastled stuff) sometimes the compiler optimizes in a way that makes it actually faster (smaller code) when using uint8_t, don't ask me why exactly, I have been playing with types to improve code quite a bit and sometimes its just not really clear what is happening.
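As a small illustration of the type discussion above: the two loops below do the same additions, but the 8-bit version has to be truncated to 0..255 after every step, which on a target without byte arithmetic usually means extra instructions. This is a sketch with made-up function names; actual code size depends on the compiler and optimization level, as noted above.

```cpp
#include <cstdint>

// 8-bit accumulator: the value must be truncated to 0..255 after every addition.
inline uint8_t sumWrapped(const uint8_t* data, unsigned count) {
  uint8_t sum = 0;
  for (unsigned i = 0; i < count; i++) sum = uint8_t(sum + data[i]);
  return sum;
}

// Word-sized accumulator: plain register arithmetic, truncated only if the caller asks for it.
inline unsigned sumWide(const uint8_t* data, unsigned count) {
  unsigned sum = 0;
  for (unsigned i = 0; i < count; i++) sum += data[i];
  return sum;
}
```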

DedeHai · 2024-09-14T13:20:02Z

Do you mind to quickly elaborate what problems these were?

Overlapping segments.

thought so ;) but glad it is 'only that'
overlapping segments is a problem but mostly due to additional RAM usage, it solves other problems overlapping segments have in a global buffer, namely overlapping being FX dependent.
When transferring the segments, color adding could be used. it is somewhat slower though but if the target buffer would be 32bit it would only affect overlapping segments (as adding to black is just one additional if(c!=0) )
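A minimal sketch of the additive transfer described above, under one possible reading of the `if(c!=0)` remark: where the target pixel is still black, the segment pixel is simply copied, and the full saturating add only runs where something was already drawn. `colorAddSketch` and `blendSegment` are hypothetical names and a packed 0xWWRRGGBB layout is assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Saturating per-channel add of two packed 0xWWRRGGBB colors.
inline uint32_t colorAddSketch(uint32_t a, uint32_t b) {
  uint32_t out = 0;
  for (int shift = 0; shift < 32; shift += 8) {
    uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
    if (sum > 255) sum = 255;
    out |= sum << shift;
  }
  return out;
}

// Add a segment buffer onto the target buffer.
inline void blendSegment(const uint32_t* segment, uint32_t* target, size_t count) {
  for (size_t i = 0; i < count; i++) {
    if (target[i] != 0) target[i] = colorAddSketch(target[i], segment[i]);
    else                target[i] = segment[i];  // adding to black is a plain copy
  }
}
```

Non-overlapping segments always hit the cheap branch, so the extra cost is limited to pixels that really overlap.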

blazoncek · 2024-09-14T16:13:07Z

but glad it is 'only that'

It is not. wait for "blending styles"

DedeHai · 2024-09-15T08:08:38Z

future improvements to buffers aside: should i squash this draft and make a PR? any changes required before I do that?

blazoncek · 2024-09-15T12:55:48Z

Please, and do address points discussed.

DedeHai · 2024-09-16T07:44:37Z

@blazoncek one more general question:
currently, all color calculations are done with 32bit colors, even though FX only use 24bit (CRGB). For color manipulation, CRGB would be faster (scaling, adding, blurring etc.). In what scenarios is the white channel used and how (and at what point) is its value calculated?
Is it all done in 32bit to make it future-proof or why was this approach chosen over 'calculate white channel as a last step'?
I am working on a proof of concept with a CRGB equivalent in 32bit (CRGBW) which would ease 32bit color handling. At least I think it makes handling easier: less overloaded functions needed, the compiler will take care of conversions.
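To show what "the compiler will take care of conversions" could mean in practice, here is a sketch of such a type. It is an assumption-laden illustration, not the proof of concept mentioned above: the name `CRGBWSketch`, the 0xWWRRGGBB packing and the member layout are all made up.

```cpp
#include <cstdint>

// Hypothetical 32-bit color type that converts to and from packed 0xWWRRGGBB.
struct CRGBWSketch {
  uint8_t r = 0, g = 0, b = 0, w = 0;

  constexpr CRGBWSketch() = default;
  constexpr CRGBWSketch(uint8_t red, uint8_t green, uint8_t blue, uint8_t white = 0)
    : r(red), g(green), b(blue), w(white) {}
  // Implicit conversion from a packed color, so existing uint32_t values can be passed in.
  constexpr CRGBWSketch(uint32_t packed)
    : r(uint8_t(packed >> 16)), g(uint8_t(packed >> 8)), b(uint8_t(packed)), w(uint8_t(packed >> 24)) {}
  // Implicit conversion back, so the struct can be handed to functions taking uint32_t.
  constexpr operator uint32_t() const {
    return (uint32_t(w) << 24) | (uint32_t(r) << 16) | (uint32_t(g) << 8) | uint32_t(b);
  }
};
```

With the two implicit conversions, a function declared as taking `uint32_t` accepts a `CRGBWSketch` and vice versa, which is what would reduce the number of overloads.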

blazoncek · 2024-09-16T08:15:56Z

One thing to keep in mind regarding existing effects is that most were ported from FastLED type effects. Some taken directly from FastLED, some from places like soulmate.com.

FastLED operated on its CRGB (or CHSV) and so most effects disregarded the W channel. I guess this was the reason to introduce "Automatic White Calculation" for strips like SK6812 where white channel could be used to reduce power draw.

Any new effect should, if possible, take into account white channel as well IMO though most will really not benefit at all. It all comes down to the "artist" how he/she envisions an effect on RGB and RGBW strips.

I do not know if stripping W from calculations is wise or not without prior notice to users.

DedeHai · 2024-09-16T11:15:38Z

I do not know if stripping W from calculations is wise or not without prior notice to users.

Agreed. These are just ideas.
I don't want to get too deep into this topic, but RGB already contains white, hence my question how and where it is extracted to be a separate channel. To me it makes little sense to treat it separately in effects: yes, there may be some artistic use on RGBW strips for that, but it could also be done in RGB. Or how do RGB strips treat the separated white if I set it non-zero in an effect, expecting it to turn out white?
Also: with the CRGBW struct (or is it a class?) it may be possible to do both with minimal overhead. If a strip is RGBW, the buffer gets allocated as CRGBW; if there is no separate white channel: RGB. Not sure this is doable in an easy way, but that would get the best of both worlds.

blazoncek · 2024-09-16T12:05:13Z

I would avoid CRGB. Why? Because WLED uses NeoPixelBus and not FastLED (well, palettes are exception). NeoPixelBus has its own classes comparable to CRGB but also extended to include CCT (or WW & CW) information which does not exist in FastLED.
So, to avoid confusion work with uint32_t if possible for now.

Unfortunately the world is not as simple as that. CRGB is very common and a lot of people know what it represents. To top that some people want to control white channel independently of effects.

DedeHai · 2024-09-28T20:08:46Z

wled00/FX_2Dfcn.cpp

@@ -461,37 +461,37 @@ void Segment::box_blur(unsigned radius, bool smear) {

 void Segment::moveX(int8_t delta, bool wrap) {


any reason delta cannot be an int?

DedeHai · 2024-09-28T20:09:19Z

wled00/FX_2Dfcn.cpp

    }
-    for (int x = 0; x < cols; x++) setPixelColorXY(x, y, newPxCol[x]);
+    for (int x = 0; x < vW; x++) setPixelColorXY(x, y, newPxCol[x]);
  }
 }

 void Segment::moveY(int8_t delta, bool wrap) {


DedeHai · 2024-09-29T09:46:32Z

Latest changes from @blazoncek have a huge positive impact on FPS however the addition of bool unScaled has no measurable benefit and adds to code size. Here are my test results:
Vanilla 0_15:
52FPS
RAM: [== ] 15.3% (used 50020 bytes from 327680 bytes)
Flash: [========= ] 89.7% (used 1410392 bytes from 1572864 bytes)

My changes including 202901b:
54FPS
RAM: [== ] 15.3% (used 50020 bytes from 327680 bytes)
Flash: [========= ] 89.5% (used 1407852 bytes from 1572864 bytes) -> saves 2540bytes

Latest addition from blazoncek 9114867:
65FPS (!)
RAM: [== ] 15.3% (used 50036 bytes from 327680 bytes)
Flash: [========= ] 89.6% (used 1409002 bytes from 1572864 bytes) -> adds 1150bytes

I have a commit ready removing the bool unScaled again if there are no objections (also including some fixes and improvements)

softhack007 · 2024-09-29T11:23:10Z

I have a commit ready removing the bool unScaled again if there are no objections (also including some fixes and improvements)

@DedeHai the "unscaled" optimization only affects Segment::drawCircle and Segment::drawLine with anti-aliasing enabled. Did you test with an effect that uses these functions?

DedeHai · 2024-09-29T11:26:16Z

Yes, I tested it using "Ripple" and "Blobs" FX (4 overlapping segments to intensify the calculation) and saw no FPS change in either case.

softhack007 · 2024-09-29T11:34:44Z

Yes, I tested it using "Ripple" and "Blobs" FX (4 overlapping segments to intensify the calculation) and saw no FPS change in either case.

Strange ... actually "unscaled=false" removes color_fade(c, currentBri()) from the inner loop, this should have a noticeable effect. Can you test again with a bigger matrix setup? I would expect that some speedup gets visible.

Also keep in mind that our "fps" calc is quite inaccurate, like +- 2 fps of "uncertainty" and almost meaningless above 80fps.

The "Loops/sec: " debug output (WLED_DEBUG) is actually more reliable, because it averages over 30 seconds.

DedeHai · 2024-09-29T11:51:54Z

I tested various versions, with scaling and without scaling (completely commented out) and saw no real impact. Maybe my optimized scaling function is now so fast that it takes a lot of pixels to actually make a difference (I am sure it will on 1000+).
I just discussed with blazoncek to commit my changes. A different approach would be to use a static variable instead of passing a parameter (which is the actual problem eating up flash on every call throughout the code).

- Added pre-calculation for segment brightness: stored in _segBri. The impact on FPS is not huge but measurable (~1-2FPS in my test conditions)
- Removed `bool unScaled` from `setPixelColor()` function again (it has no/minimal impact on speed but huge impact on flash usage: +850 bytes)
- Removed negative checking in `setPixelColorXY()` and replaced it with a local typecast to unsigned, saves a few instructions (tested and working)
- Changed int8_t to int in `moveX()` and `moveY()`
- Removed a few functions from IRAM as they are now not called for every pixel but only once per segment update
- Removed a `virtualWidth()` call from `ripple_base()`
- Bugfix in `mode_colortwinkle()`
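The "typecast to unsigned" item in the commit notes above refers to a common bounds-check idiom. A minimal sketch of it, with hypothetical helper names and assuming width and height are never negative:

```cpp
#include <cstdint>

// Conventional check: two comparisons per axis.
inline bool inBoundsSigned(int x, int y, int width, int height) {
  return x >= 0 && x < width && y >= 0 && y < height;
}

// Cast to unsigned: a negative coordinate wraps to a huge value and fails the
// single upper-bound comparison (width and height must be non-negative).
inline bool inBoundsUnsigned(int x, int y, int width, int height) {
  return unsigned(x) < unsigned(width) && unsigned(y) < unsigned(height);
}
```

Both versions accept exactly the same coordinates; the second one needs half the comparisons.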

softhack007 · 2024-09-29T12:44:09Z

A different approach would be to use a static variable instead of passing a parameter (which is the actual problem eating up flash on every call throughout the code).

just saw the commit, looks ok to me, too 👍 - it achieves the same thing, just making "_colorScaled" a private variable (aka side effect) instead of using an explicit function parameter.

softhack007 · 2024-09-29T12:49:43Z

wled00/FX_fcn.cpp

@@ -437,7 +440,21 @@ uint32_t IRAM_ATTR_YN Segment::currentColor(uint8_t slot) const {
 #endif
 }

-void Segment::setCurrentPalette() {
+// pre-calculate drawing parameters for faster access


@blazoncek I have the feeling you took an idea that I implemented previously in the MM fork - pre-calculate drawing parameters for faster access. It is indeed a big speedup (my test in MM were up to 15% faster).

I don't want to go pricky about this, but it would be nice if you state where you took inspiration from. Something like "based on an idea from @softhack007 in the MM fork". Please.

@blazoncek thanks

I don't want to go pricky about this, but it would be nice if you state where you took inspiration from.

None of this would be needed if you'd provide similar optimisations upstream or not adopt GPL in MM. So I had to come up with something similar. You have the liberty (due to MIT) to pick anything from upstream while I need to beg people or figure out something.
Just my 2¢.

GPLv3 is not the problem. The problem is MIT licensing in the AC repo, which has become a real no-go for me (and a few others) after some very frustrating experiences. I am still the author/owner of my own code, so I may "provide" it to any other repo licensed with MIT, but you're not allowed to "take" without explicit permission. I know this is not optimal and we must find a better approach soon. In general, I'm willing to provide optimizations back - making "something similar" is not a good way, but a waste of resources.

@Aircoookie You have proposed a while ago to harmonize the licensing between both repo's, and remove this "rift" in the community, by using EUPL-1.2. I'm not sure if there was further discussion on discord.

The MM team is willing to go this way. If you also agree to change your repo to EUPL-1.2, we (MM team @troyhacks @ewoudwijma @netmindz @lost-hope @Brandon502) will start collecting the necessary permissions from contributors since our change to GPLv3, and move to the same license. Please give me a sign :-)

The problem is MIT licensing in the AC repo, which has become a real no-go for me (and a few others) after some very frustrating experiences.

NOTE: Emphasis mine.

Can you explain? Can others mentioned explain as well? I've added about 48k lines of code to WLED (not counting 0_15 branch) but have yet to encounter "frustrating experience". What does "frustrating experience" even mean?
Is that a stolen or "borrowed" code? It inevitably happens with GPL as well.

I am not against license change, I just want to know/understand true reasons for it. Is MIT too permissive? Everyone knew/knows WLED uses MIT so there should be no "frustrating experiences" as a consequence of that IMO. I would really love to hear everyone's perspective, not for WLED project but my personal insights.

I can understand you wanting to know more, but I'm not sure that a public discussion as a side topic on a PR is the appropriate place.

Can we arrange a call/meeting with relevant parties?

swap if statements in color_fade

blazoncek · 2024-09-30T17:26:00Z

IMO this is ready for broader testing.
@netmindz suggested to wait with merging until 0.15.0 is released, so IMO it may be good to concentrate on releasing 0.15.0 ASAP and then release 0.15.1 in a "short" period of time.

blazoncek · 2024-10-02T13:26:55Z

I have some additional modifications that are not strictly speed related but do decrease binary size for another kB or two.
Do I push them into this PR?

DedeHai · 2024-10-02T14:14:57Z

as long as we keep the commit history, this PR could be a collection of code improvements.
Initially we said to make this a squashed commit, but with all the changes that could become a nightmare to debug as by now most core (rendering) functions were touched up.

- removing WS2812FX::setMode()
- removing WS2812FX::setColor()
- removing floating point in transition
- color handling modification in set.cpp
- replaced uint8_t with unsigned in function parameters
- inlined WS2812FX::isUpdating()
- (MAY BE BREAKING) alexa & smartnest update

- changes to `setPixelColorXY` give an extra FPS, some checks and the loops are only done when needed, additional function call is still faster (force inlining it gives negligible speed boost but eats more flash)
- commented out the unused `boxBlur` function
- code size improvements (also faster) in `moveX()` and `moveY()` by only copying what's required and avoiding code duplications
- consolidated the `blur()` functions by enabling asymmetrical blur2D() to replace `blurRow` and `blurCol`
- compiler warning fixes (explicit unsigned casts)

blazoncek

Just a few thoughts.

blazoncek · 2024-10-04T06:30:57Z

wled00/FX_2Dfcn.cpp

-        else           strip.setPixelColorXY(start + width() - xX - 1, startY + yY, col);
+  unsigned groupLen = groupLength();
+
+  if(groupLen > 1)


This reduces the processing if group length is 1 by removing:

four assignments (yY && xX, W & H)

hidden two subtractions and a comparison (W & H)

two multiplications

two comparisons (j & g, uint8_t!)

two comparisons (yY>=H && xX>=W)

two increments (xX & yY)

For each setPixelColorXY() call.

What may seem irrelevant with low number of pixels actually produces a noticeable slowdown on large pixel count.
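The shape of that fast path can be sketched as follows. This is a simplified illustration, not the code under review: `MatrixSketch` and `setGroupedPixel` are hypothetical, and mirroring, transposing and spacing are left out.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical row-major pixel matrix with a bounds-checked setter.
struct MatrixSketch {
  unsigned width, height;
  std::vector<uint32_t> pixels;
  void set(unsigned x, unsigned y, uint32_t c) {
    if (x < width && y < height) pixels[y * width + x] = c;
  }
};

// Write one virtual pixel that covers a groupLen x groupLen block of physical pixels.
inline void setGroupedPixel(MatrixSketch& m, unsigned x, unsigned y, unsigned groupLen, uint32_t c) {
  if (groupLen > 1) {
    // grouped: scale the coordinates and fill the whole block
    const unsigned x0 = x * groupLen, y0 = y * groupLen;
    for (unsigned j = 0; j < groupLen; j++)
      for (unsigned g = 0; g < groupLen; g++)
        m.set(x0 + g, y0 + j, c);
  } else {
    // common case: no multiplications and no loop bookkeeping
    m.set(x, y, c);
  }
}
```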

blazoncek · 2024-10-04T06:33:16Z

wled00/FX_2Dfcn.cpp

  }
 }

+/*


I know it is unused ATM but it produces quite different blur and I put this function in intentionally.
I would not like it removed.

blazoncek · 2024-10-04T06:40:18Z

wled00/FX_2Dfcn.cpp

  uint32_t newPxCol[vW];
+  int newDelta;


I do not see the point in modifying move() functions, especially with convoluted math that is hard to follow.
How much speed does it gain?

blazoncek · 2024-10-06T20:21:59Z

wled00/FX_2Dfcn.cpp

@@ -676,7 +652,10 @@ void Segment::drawCharacter(unsigned char chr, int16_t x, int16_t y, uint8_t w,
      case 60: bits = pgm_read_byte_near(&console_font_5x12[(chr * h) + i]); break; // 5x12 font
      default: return;
    }
-    col = ColorFromPalette(grad, (i+1)*255/h, 255, NOBLEND);
+    uint32_t col = ColorFromPaletteWLED(grad, (i+1)*255/h, 255, NOBLEND);


This is a "shadow" of the col from a few lines above.

blazoncek · 2024-10-09T20:08:05Z

FYI I am now running these improvements on my test set up (8266,32,S2,C3) for about a week and see no pitfalls.

Code is smaller and faster.

DedeHai added 7 commits September 11, 2024 21:41

some improvements to consider

c3f472f

no real difference in FPS but code is faster. also 160bytes smaller, meaning it is actually faster

more improvements to color_scale() now even faster.

9341768

tested and working, also tested video

improvement in color_add

feac45f

its not faster but cleaner (and uses less flash)

Improvements in get/set PixelColor()

992d11b

-calculations for virtual strips are done on each call, which is unnecessary. moved them into the if statement.

improved Segment::setPixelColorXY a tiny bit

b07658b

uses less flash so it should be faster (did not notice any FPS difference though) also cleaned code in ColorFromPaletteWLED (it is not faster, same amount of code)

removed old code

ec938f2

blazoncek reviewed Sep 13, 2024


fixes and consistency

d45b4ad

minor tweak (break instead of continue in setPixelColorXY)

2afff05

DedeHai added 2 commits September 14, 2024 14:10

memory improvement: dropped static gamma table

6a37f25

- there already is a method to calculate the table on the fly, there is no need to store it in flash, it can just be calculated at bootup (or cfg change)

remove test printout

0e5bd4e

blazoncek reviewed Sep 14, 2024


softhack007 added this to the 0.15.1 candidate milestone Sep 28, 2024

DedeHai commented Sep 28, 2024


DedeHai and others added 3 commits September 29, 2024 13:55

Private global _colorScaled

336da25

Merge branch '0_15' into 0_15__speed_improvements

8e78fb4

softhack007 reviewed Sep 29, 2024


blazoncek added 2 commits September 29, 2024 15:19

Update comment

0ae7329

Replace uint16_t with unsigned for segment data

ee380c5

swap if statements in color_fade

blazoncek marked this pull request as ready for review September 30, 2024 17:26

blazoncek mentioned this pull request Sep 30, 2024

Squashed commit of FXparticleSystem #3823

Draft

blazoncek and others added 3 commits October 2, 2024 20:14

removed todo.

ca06214

blazoncek reviewed Oct 4, 2024


Minor tweaks and whitespace

eb5ad23

blazoncek reviewed Oct 6, 2024


blazoncek added 2 commits October 7, 2024 16:50

Indentation and shadowed variable.

be64930

Fix for realtime drawing on main segment

210191b

softhack007 added the optimization re-working an existing feature to be faster, or use less memory label Oct 11, 2024

