Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed improvements for discussion #4138

Open
wants to merge 38 commits into
base: 0_15
Choose a base branch
from

Conversation

DedeHai
Copy link
Collaborator

@DedeHai DedeHai commented Sep 12, 2024

Added a lot of improvements to core functions making them slightly faster and saving some flash.
I tested all functions to work the same as before. I tested speed by looking at FPS and it's hard to say how much faster it actually is but I see some improvement, maybe 1%.
The inline of getMappedPixelIndex() is just an idea, not sure that is correct nor do I know if it improves anything.
Did not test on ESP8266.

no real difference in FPS but code is faster.
also 160bytes smaller, meaning it is actually faster
tested and working, also tested video
its not faster but cleaner (and uses less flash)
-calculations for virtual strips are done on each call, which is unnecessary. moved them into the if statement.
uses less flash so it should be faster (did not notice any FPS difference though)
also cleaned code in ColorFromPaletteWLED (it is not faster, same amount of code)
…alette

inlining getMappedPixelIndex gets rid of function entry instructions (hopefully) so it should be faster.
also added the 'multi color math' trick to color_add function (it will not make much difference but code shrinks by a few bytes)
Copy link
Collaborator

@blazoncek blazoncek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!
There are some formatting issues (spaces) but that's minor.

wled00/FX_2Dfcn.cpp Outdated Show resolved Hide resolved
wled00/FX_fcn.cpp Outdated Show resolved Hide resolved
@@ -1816,7 +1820,7 @@ bool WS2812FX::deserializeMap(uint8_t n) {
return (customMappingSize > 0);
}

uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const {
__attribute__ ((always_inline)) inline uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non obligatory: I would prefer __attribute__ at the end but [[...]] in front.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly have no Idea what the precompiler instructions mean in detail, I copied this from what they use in fastled...
the idea behind this is to get rid of the 'function entry' instructions that are added when a function is called. When I added the inline flash size increased by a few bytes, telling me that it is actually inlined. Since this short function is only called from two places and is called A LOT this may be faster. I have no way to check (would need a proper debugger that shows assembly instructions being executed line by line).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I've recently learned these are compiler attributes.

Copy link
Collaborator

@softhack007 softhack007 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 not sure if always_inline plays well with IRAM_ATTR .... the first tells the compiler to always inline the function, the latter says "put the function into IRAM" which means that a real function is needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is what I was wondering too, this is just a suggestion, i.e. to inline this for speed but how to tell the compiler to inline it to the functions that are in ram... not sure how it will do it.

Copy link
Collaborator

@softhack007 softhack007 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding:

  • inline is a hint/suggestion to the compiler. So it might get inlined, or not.
  • __attribute__((always_inline)) is a directive. So the compiler must inline this function, no matter if its efficient or not.

If you want to optimize function calls, its sometime useful to add __attribute__((pure)) or __attribute__((const)) to the function declaration. But only do this after double-checking that the code is actually "pure" (no side-effects) or "const" (solely depends on arguments). I did this in the MoonModules fork, but honestly it does not give you more than 1 or 2 fps even if you apply it to lots of functions.

See MoonModules@7f9da30

wled00/colors.cpp Outdated Show resolved Hide resolved
wled00/colors.cpp Outdated Show resolved Hide resolved
wled00/colors.cpp Outdated Show resolved Hide resolved
wled00/colors.cpp Show resolved Hide resolved
wled00/fcn_declare.h Outdated Show resolved Hide resolved
wled00/colors.cpp Show resolved Hide resolved
@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 13, 2024

one more thing I was thinking about:
the way I do color handling in the particle system is very efficient by creating a buffer of the segment size, rendering to that and then passing it to WLED. it would be even more efficient if I would drop rendering 'mirrored' pixels (may implement that at one point). So my idea was this: create a buffer (RGB, I don't think white is needed for FX right?) that is the size of the largest segment (and stays in RAM so no fragmentation unless segments are changed a lot, not sure how buffers work currently). Render one segment to that buffer (clear to zero whatever the segment needs), the buffer is accessed as 'in order' so no mapping (except mirror and transpose) and once it is rendered (including blurring and stuff) transfer the buffer to the 'strip' using mapping. Then clear the buffer, do the next segment and so on. This would also allow to use 'add pixel color' when a buffer is transferred, making overlays play more nicely, independent of FX supporting it. If a segment FX calls 'get pixel color' it would still have to be fetched from the strip buffer though. but blurring would be a lot faster and overlapping segments could be 'transparent'.
This is only a half-baked idea as I do not yet fully grasp how the whole 'color chain' works from segment to strip (and back) but the whole 'mapping' checks could be cut down significantly if it is not done for each 'setPixelColor' (at least I think so)

@willmmiles
Copy link
Collaborator

the buffer is accessed as 'in order' so no mapping (except mirror and transpose) and once it is rendered (including blurring and stuff)

I had the same idea! I expect that using flat temporary render buffer of the virtual() sizes, then applying all mapping operations as a post-processing step, would yield a substantial average-case performance improvement. That said, if you try to do mirror and transpose during the render, I think you'll give up most of the benefits as every write still ends up being a pile of variable lookups and conditionals instead of just straight local pointer arithmetic -- I think it'd be better to perform all mapping operations after the render. We'd also get some advantages with code layout and IRAM caching, as each FX function wouldn't need to make so many external function calls.

The big tradeoff is RAM usage. The catch is that the temporary buffer needs contiguous RAM. If it's held allocated, we lose the ability to share that memory (in highly constrained systems) with other tasks like the web server... but if it's not, we risk not being able to get a big enough buffer when it's needed for a render. There may also be systems where there just isn't enough RAM to hold two copies of every pixel state at once.

That said, I'd still love to see a PR that implements a temporary render buffer and pulls the mapping operation out to a separate pass after the FX function call. I think the performance and code size improvements are likely to be compelling, even if it reduces the maximum LED count that can be supported on memory constrained hardware.

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 13, 2024

How are buffers currently? What is the global buffer vs no global buffer?

@blazoncek
Copy link
Collaborator

Let's start at the bottom and make our way up.

NeoPixelBus (in 2.7.9) uses 3 buffer: 2 render buffers and 1 transmit buffer. Render buffers are used in "double buffer" fashion. Transmit buffer may be small, but render buffers are of size LED count * pixel size where pixel size can be 24bits up to 64bits. (2.8.0 or later may change that as @Makuna is working on optimisations). Different hardware (RMT, I2S, ...) may have transmit buffers of different size.

WLED creates its own global buffer of size LED count * 32bit (if so selected in settings) for its strip. This is due to the fact that WLED uses NeoPixelBusLg which does luminance adjustments after each SetPixelColor(). This means that GetPixelColor() will not return original color that SetPixelColor() wrote if brightness is less than 255. If you disable global buffer, WLED tries to reconstruct original color but that is not always possible without loss.

Segment does not create any buffers of its own (though in the past we had setUpLeds() that did that). It will utilize strip's buffering when getPixelColor() is called.

This (lossy reconstruction) is why mirroring, reversing and transposing have to be done in the setPixelColor() instead of separately when FX is done writing (BTW each pixel is written only once in FX if the FX function is written correctly).

Newer versions of NPB are going to do brightness scaling only during transmission of data to LEDs so GetPixelColor() will return originally set value. Until that happens we need to deal with it ourselves.
On top of that NPB uses the same bit depth for transmit and render buffers but a future version is scheduled to have transmit buffer of the same bit depth as required by LEDs while render buffers may be of bit depth required by user (in our case 32bits).

And then, there is a CCT support.

NOTE: Note the difference between 'GetPixelColor()andgetPixelColor()`. One is from NBP the other from WLED.

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 14, 2024

Thank you for detailing this, much clearer now.

I was just looking at 'soap' FX as an example that uses setPixelColorXY() as well as getPixelColorXY() but the following points apply to most FX. The reason I see this is quite "slow":

  • colors are handled in CRGB, each getPixelColorXY() is converted from 32bit color, manipulated, converted back in setPixelColorX()Y -> native 32bit handling would improve that i.e. ditch fastled and implement the used functions in CRGBW32, requires a lot of code refactoring though. Speed improvement is probably not going to be huge as conversion is fast (but in principle mostly unnessecary).
  • the global buffer is in 'strip space mapping' so each call to set or get is mapped to and from there, including mirror, reverse, grouping etc. plus brightness adustment in setPixelColorXY() plus the LED map. This is what mostly makes it slow.

A solution could be to replace the global buffer with individual segment buffers that are not mapped nor brightness adjusted but do that in one go when it is transferred to NeoPixelBus buffer (saving a lot of if statements and back and forth mapping).
The main draw-back would be a lot more memory is used if segments are overlapping. This could be mitigated by setting a memory limit (like it is done with SEGMENT.data) and if that limit is exceeded, fall back to not using the buffers (this is how I do it in the ParticleSystem, btw: if I disable local buffers, PS Fire for example drops from ~85FPS to ~55FPS).

Would that be a good way to try and solve it or is there something fundamentally flawed with this approach?

@blazoncek
Copy link
Collaborator

* colors are handled in CRGB, each `getPixelColorXY()` is converted from 32bit color, manipulated, converted back in `setPixelColorX()Y`   -> native 32bit handling would improve that i.e. ditch fastled and implement the used functions in CRGBW32, requires a lot of code refactoring though. Speed improvement is probably not going to be huge as conversion is fast (but in principle mostly unnessecary).

I am not going to rewrite all of the effects. 😄

* the global buffer is in 'strip space mapping' so each call to set or get is mapped to and from there, including mirror, reverse, grouping etc. plus brightness adustment in `setPixelColorXY()` plus the LED map. This is what mostly makes it slow.

strip only does mapping via ledmaps. Swapping index if necessary. No real slowness there, at least there was none until someone wanted to exclude ledmaps while realtime data was received.
BusDigital::getPixelColor() uses global buffer instead of querying NeoPixelBus.

A solution could be to replace the global buffer with individual segment buffers that are not mapped nor brightness adjusted but do that in one go when it is transferred to NeoPixelBus buffer (saving a lot of if statements and back and forth mapping). The main draw-back would be a lot more memory is used if segments are overlapping. This could be mitigated by setting a memory limit (like it is done with SEGMENT.data) and if that limit is exceeded, fall back to not using the buffers (this is how I do it in the ParticleSystem, btw: if I disable local buffers, PS Fire for example drops from ~85FPS to ~55FPS).

This was used in first iteration of 0.14 while we used setUpLeds() (MM still does). It presented a whole lot of problems which were mitigated using global buffer (which presented a new set of issues, though).

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 14, 2024

I am not going to rewrite all of the effects. 😄
no one is asking you to. I am imagining to clone most of the CRGB struct/methods so it CRGBW32 could be used in the exact same way, if done right, it could be a 1:1 replacement, needing update on all FX but it should not be too complex (mostly a direct find and replace, maybe even a #define to override it for starts).
strip only does mapping via ledmaps. Swapping index if necessary. No real slowness there, at least there was none until someone wanted to exclude ledmaps while realtime data was received.

true, lookup is fast.

BusDigital::getPixelColor() uses global buffer instead of querying NeoPixelBus.

yes, the main difference being the color correction, right? If I disable global buffer, colors get worse but speed stays (almost) the same.

This was used in first iteration of 0.14 while we used setUpLeds() (MM still does). It presented a whole lot of problems which were mitigated using global buffer

Do you mind to quickly elaborate what problems these were?

- there already is a method to calculate the table on the fly, there is no need to store it in flash, it can just be calculated at bootup (or cfg change)
@blazoncek
Copy link
Collaborator

Do you mind to quickly elaborate what problems these were?

Overlapping segments.

- gamma correction only where needed
- paletteIndex should be uint8_t (it is only used as that)
note: integrating the new `ColorFromPaletteWLED()` into this would require  a whole lot of code rewrite and would result in more color conversions from 32bit to CRGB. It would be really useful only if CRGB is replaced with native 32bit colors.

unsigned paletteIndex = i;
uint8_t paletteIndex = i;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Collaborator Author

@DedeHai DedeHai Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is only used as an uint8_t down the line and not involved in any calculation.
edit:
correction: I have index as an unsigned in the ColorFromPaletteWLED() and do a byte() cast there anyway to prevent overflows. so it may be better to revert back to unsigned for consistency.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. ATM. what about in the future?
if it is unsigned, compiler will trim it down when passing to a function.

Copy link
Collaborator Author

@DedeHai DedeHai Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok reverted to unsigned locally. saves 6 bytes of code. I thought compiler would do better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here, int or unsigned is almost always less code (and slightly faster).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here, int or unsigned is almost always less code (and slightly faster).

in general: absolutely yes. But if a function is performed purely in 8bit (like most of the fastled stuff) sometimes the compiler optimizes in a way that makes it actually faster (smaller code) when using uint8_t, don't ask me why exactly, I have been playing with types to improve code quite a bit and sometimes its just not really clear what is happening.

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 14, 2024

Do you mind to quickly elaborate what problems these were?

Overlapping segments.

though so ;) but glad it is 'only that'
overlapping segments is a problem but mostly due to additional RAM usage, it solves other problems overlapping segments have in a global buffer, namely overlapping being FX dependent.
When transferring the segments, color adding could be used. it is somewhat slower though but if the target buffer would be 32bit it would only affect overlapping segments (as adding to black is just one additional if(c!=0) )

@blazoncek
Copy link
Collaborator

but glad it is 'only that'

It is not. wait for "blending styles"

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 15, 2024

future improvements to buffers aside: should i squash this draft and make a PR? any changes required before I do that?

@blazoncek
Copy link
Collaborator

Please, and do address points discussed.

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 16, 2024

@blazoncek one more general question:
currently, all color calculations are done with 32bit colors, even though FX only use 24bit (CRGB). For color manipulation, CRGB would be faster (scaling, adding, blurring etc.). In what scenarios is the white channel used and how (and at what point) is its value calculated?
Is it all done in 32bit to make it future-proof or why was this approach chosen over 'calculate white channel as a last step'?
I am working on a proof of concept with a CRGB equivalent in 32bit (CRGBW) which would ease 32bit color handling. At least I think it makes handling easier: less overloaded functions needed, the compiler will take care of conversions.

@blazoncek
Copy link
Collaborator

One thing to keep in mind regarding existing effects is that most were ported from FastLED type effects. Some taken directly from FastLED, some from places like soulmate.com.

FastLED operated on its CRGB (or CHSV) and so most effects disregarded the W channel. I guess this was the reason to introduce "Automatic White Calculation" for strips like SK6812 where white channel could be used to reduce power draw.

Any new effect should, if possible, take into account white channel as well IMO though most will really not benefit at all. It all comes down to the "artist" how he/she envisions an effect on RGB and RGBW strips.

I do not know if stripping W from calculations is wise or not without prior notice to users.

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 16, 2024

I do not know if stripping W from calculations is wise or not without prior notice to users.

Agreed. These are just ideas.
I don't want to get too deep in this topic but RGB already contains white, hence my question how and where it is extracted to be a separate channel. To me it makes little sense to treat is seperately in effects: yes there may be some artistic use on RGBW strips for that but it could also be done in RGB. Or how do RGB strips treat the seperated white if I would set it non zero in an effect, expecting it to turn out white?
Also: with the CRGBW struct (or is it a class?) it may be possible to do both with minimal overhead. If a strip is RGBW, the buffer gets allocated as CRGBW, if no seperate white channel: RGB. Not sure this is doable in a easy way but that would get the best of both worlds.

@blazoncek
Copy link
Collaborator

I would avoid CRGB. Why? Because WLED uses NeoPixelBus and not FastLED (well, palettes are exception). NeoPixelBus has its own classes comparable to CRGB but also extended to include CCT (or WW & CW) information which does not exist in FastLED.
So, to avoid confusion work with uint32_t if possible for now.

Unfortunately the world is not as simple as that. CRGB is very common and a lot of people know what it represents. To top that some people want to control white channel independently of effects.

@softhack007 softhack007 added this to the 0.15.1 candidate milestone Sep 28, 2024
@@ -461,37 +461,37 @@ void Segment::box_blur(unsigned radius, bool smear) {

void Segment::moveX(int8_t delta, bool wrap) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason delta cann not be an int?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could.

}
for (int x = 0; x < cols; x++) setPixelColorXY(x, y, newPxCol[x]);
for (int x = 0; x < vW; x++) setPixelColorXY(x, y, newPxCol[x]);
}
}

void Segment::moveY(int8_t delta, bool wrap) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 29, 2024

Latest changes from @blazoncek have a huge positive impact on FPS however the addition of bool unScaled has no measurable benefit and adds to code size. Here are my test results:
Vanilla 0_15:
52FPS
RAM: [== ] 15.3% (used 50020 bytes from 327680 bytes)
Flash: [========= ] 89.7% (used 1410392 bytes from 1572864 bytes)

My changes including 202901b:
54FPS
RAM: [== ] 15.3% (used 50020 bytes from 327680 bytes)
Flash: [========= ] 89.5% (used 1407852 bytes from 1572864 bytes) -> saves 2540bytes

Latest addition from blazoncek 9114867:
65FPS (!)
RAM: [== ] 15.3% (used 50036 bytes from 327680 bytes)
Flash: [========= ] 89.6% (used 1409002 bytes from 1572864 bytes) -> adds 1150bytes

I have a commit ready removing the bool unScaled again if there are no objections (also including some fixes and improvements)

@softhack007
Copy link
Collaborator

I have a commit ready removing the bool unScaled again if there are no objections (also including some fixes and improvements)

@DedeHai the "unscaled" optimization only affects Segment::drawCircle and Segment::drawLine with anti-aliased enabled. Did you test with an effect that uses these functions?

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 29, 2024

Yes, I tested it using "Ripple" and "Blobs" FX (4 overlapping segments to intensify the calculation) and saw no FPS change in either case.

@softhack007
Copy link
Collaborator

softhack007 commented Sep 29, 2024

Yes, I tested it using "Ripple" and "Blobs" FX (4 overlapping segments to intensify the calculation) and saw no FPS change in either case.

Strange ... actually "unscaled=false" removes color_fade(c, currentBri()) from the inner loop, this should have a noticeable effect. Can you test again with a bigger matrix setup? I would expect that some speedup gets visible.

Also keep in mind that our "fps" calc is quite inaccurate, like +- 2 fps of "uncertainty" and almost meaningless above 80fps.

The "Loops/sec: " debug output (WLED_DEBUG) is actually more reliable, because it averages over 30 seconds.

@DedeHai
Copy link
Collaborator Author

DedeHai commented Sep 29, 2024

I tested various versions, with scaling and without scaling (completely commented out) and saw no real impact. Maybe my optimized scaling function is now so fast that it takes a lot of pixels to actually make a difference (I am sure it will on 1000+).
I just discussed with blazoncek to commit my changes. A different approach would be to use a static variable instead of passing a parameter (which is the actual problem eating up flash on every call throughout the code).

DedeHai and others added 3 commits September 29, 2024 13:55
- Added pre-calculation for segment brightness: stored in _segBri. The impact on FPS is not huge but measurable (~1-2FPS in my test conditions)
- Removed `bool unScaled` from `setPixelColor()` function again (it has no/minimal impact on speed but huge impact on flash usage: +850 bytes)
- Removed negative checking in `setPixelColorXY()` and replaced it with a local typecast to unsigned, saves a few instructions (tested and working)
- Changed int8_t to int in `moveX()` and `moveY()`
- Removed a few functions from IRAM as they are now not called for every pixel but only once per segment update
- Removed a `virtualWidth()` call from `ripple_base()`
- Bugfix in `mode_colortwinkle()`
@softhack007
Copy link
Collaborator

A different approach would be to use a static variable instead of passing a parameter (which is the actual problem eating up flash on every call throughout the code).

just saw the commit, looks ok for me, too 👍- it achieves the same thing, just making "_colorScaled" a private variable (aka sideeffect) instead of using an explicit function parameter.

@@ -437,7 +440,21 @@ uint32_t IRAM_ATTR_YN Segment::currentColor(uint8_t slot) const {
#endif
}

void Segment::setCurrentPalette() {
// pre-calculate drawing parameters for faster access
Copy link
Collaborator

@softhack007 softhack007 Sep 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blazoncek I have the feeling you took an idea that I implemented previously in the MM fork - pre-calculate drawing parameters for faster access. It is indeed a big speedup (my test in MM were up to 15% faster).

I don't want to go pricky about this, but it would be nice if you state where you took inspiration from. Something like "based on an idea from @softhack007 in the MM fork". Please.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blazoncek thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to go pricky about this, but it would be nice if you state where you took inspiration from.

None of this would be needed if you'd provide similar optimisations upstream or not adopt GPL in MM. So I had to come up with something similar. You have the liberty (due to MIT) to pick anything from upstream while I need to beg people or figure out something.
Just my 2¢.

Copy link
Collaborator

@softhack007 softhack007 Sep 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPLv3 is not the problem. The problem is MIT licensing in the AC repo, which has become a real no-go for me (and a few others) after some very frustrating experiences. I am still the author/owner of my own code, so I may "provide" it to any other repo licensed with MIT, but you're not allowed to "take" without explicit permission. I know this is not optimal and we must find a better approach soon. In general, I'm willing to provide optimizations back - making "something similar" is not a good way, but a waste of resources.


@Aircoookie You have proposed a while ago to harmonize the licensing between both repo's, and remove this "rift" in the community, by using EUPL-1.2. I'm not sure if there was further discussion on discord.

The MM team is willing to go this way. If you also agree to change your repo to EUPL-1.2, we (MM team @troyhacks @ewoudwijma @netmindz @lost-hope @Brandon502) will start collecting the necessary permissions from contributors since our change to GPLv3, and move to the same license. Please give me a sign :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is MIT licensing in the AC repo, which has become a real no-go for me (and a few others) after some very frustrating experiences.

NOTE: Emphasis mine.

Can you explain? Can others mentioned explain as well? I've added about 48k lines of code to WLED (not counting 0_15 branch) but have yet to encounter "frustrating experience". What does "frustrating experience" even mean?
Is that a stolen or "borrowed" code? It inevitably happens with GPL as well.

I am not against license change, I just want to know/understand true reasons for it. Is MIT too permissive? Everyone knew/knows WLED uses MIT so there should be no "frustrating experiences" as a consequence of that IMO. I would really love to hear everyone's perspective, not for WLED project but my personal insights.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand you wanting to know more, but I'm not sure that a public discussion as a side topic on a PR is the appropriate place.

Can we arrange a call/meeting with relevant parties?

@blazoncek
Copy link
Collaborator

IMO this is ready for broader testing.
@netmindz suggested to wait with merging until 0.15.0 is released so IMO it may be good to concentrate on releasing 0.15.0 ASAP ant then release 0.15.1 in "short" period of time.

@blazoncek blazoncek marked this pull request as ready for review September 30, 2024 17:26
@blazoncek
Copy link
Collaborator

I have some additional modifications that are not strictly speed related but do decrease binary size for another kB or two.
Do I push them into this PR?

@DedeHai
Copy link
Collaborator Author

DedeHai commented Oct 2, 2024

as long as we keep the commit history, this PR could be a collection of code improvements.
Initially we said to make this a squashed commit, but with all the changes that could become a nightmare to debug as by now most core (rendering) functions were touched up.

blazoncek and others added 3 commits October 2, 2024 20:14
- removing WS2812FX::setMode()
- removing WS2812FX::setColor()
- removing floating point in transition
- color handling modification in set.cpp
- replaced uint8_t with unsigned in function parameters
- inlined WS2812FX::isUpdating()
- (MAY BE BREAKING) alexa & smartnest update
- changes to `setPixelColorXY` give an extra FPS, some checks and the loops are only done when needed, additional function call is still faster (force inlining it gives negligible speed boost but eats more flash)
- commented out the unused `boxBlur` function
- code size improvemnts (also faster) in `moveX()` and `moveY()` by only copying whats required and avoiding code duplications
- consolidated the `blur()` functions by enabling asymmetrical blur2D() to replace `blurRow` and `blurCol`
- compiler warning fixes (explicit unsigned casts)
Copy link
Collaborator

@blazoncek blazoncek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few thoughts.

else strip.setPixelColorXY(start + width() - xX - 1, startY + yY, col);
unsigned groupLen = groupLength();

if(groupLen > 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reduces the processing if group length is 1 by removing:

  • four assignments (yY && xX, W & H)
    • hidden two subtractions and a comparison (W & H)
  • two multiplications
  • two comparisons (j & g, uint8_t!)
  • two comparisons (yY>=H && xX>=W)
  • two increments (xX & yY)

For each setPixelColorXY() call.

What may seem irrelevant with low number of pixels actually produces a noticeable slowdown on large pixel count.

}
}

/*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is unused ATM but it produces quite different blur and I put this function in intentionally.
I would not like it removed.

uint32_t newPxCol[vW];
int newDelta;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see the point in modifying move() functions, especially with convoluted math that is hard to follow.
How much speed does it gain?

@@ -676,7 +652,10 @@ void Segment::drawCharacter(unsigned char chr, int16_t x, int16_t y, uint8_t w,
case 60: bits = pgm_read_byte_near(&console_font_5x12[(chr * h) + i]); break; // 5x12 font
default: return;
}
col = ColorFromPalette(grad, (i+1)*255/h, 255, NOBLEND);
uint32_t col = ColorFromPaletteWLED(grad, (i+1)*255/h, 255, NOBLEND);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "shadow" of the col from a few linea above.

@blazoncek
Copy link
Collaborator

FYI I am now running these improvements on my test set up (8266,32,S2,C3) for about a week and see no pitfalls.

Code is smaller and faster.

@softhack007 softhack007 added the optimization re-working an existing feature to be faster, or use less memory label Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion in progress This pull request is currently being reviewed and, if necessary, modified optimization re-working an existing feature to be faster, or use less memory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants