Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core (LV::Video) Fix alpha blending of 32-bit videos (#230) #244

Merged
merged 4 commits into from
Dec 26, 2024

Conversation

kaixiong
Copy link
Member

@kaixiong kaixiong commented Feb 2, 2023

This is a rewrite of the buggy 32-bit LV::Video alpha blending code to deal with arithmetic underflows/overflows and the use of an uninitialized register in the MMX implementation (#230).

Take note that GCC/Clang x86-64 (recent only?) produces SSE instructions instead of MMX. GCC 12.2 uses the XMM registers while Clang sticks to MM but throwing in the use of pshuflw.

Perhaps it's time to move on and use SSE2 (introduced in 2000 to Pentium 4s) to work with 2 pixels at once. Or maybe even 4 pixels at once. This will require larger memory alignments and complicate the code a bit more to work with non-divisible row widths.

Here is a link to the plain C and SIMD code in Godbolt.

@kaixiong kaixiong self-assigned this Feb 2, 2023
@kaixiong kaixiong added this to the 0.5.0_alpha1 milestone Feb 2, 2023
@kaixiong kaixiong marked this pull request as ready for review February 5, 2023 02:26
@kaixiong kaixiong requested a review from hartwork February 5, 2023 02:42
Copy link
Member

@hartwork hartwork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong I think despite the CI being green I found two bugs in the new version of blit_overlay_alphasrc. Happy to learn I overlooked or misunderstood something, let's see. Interesting stuff!

Comment on lines 81 to 87
uint16_t const c0 = static_cast<uint16_t> (src_pixel[0]) * src_alpha + static_cast<uint16_t> (dst_pixel[0]) * (255 - src_alpha);
uint16_t const c1 = static_cast<uint16_t> (src_pixel[1]) * src_alpha + static_cast<uint16_t> (dst_pixel[1]) * (255 - src_alpha);
uint16_t const c2 = static_cast<uint16_t> (src_pixel[2]) * src_alpha + static_cast<uint16_t> (dst_pixel[2]) * (255 - src_alpha);

destbuf[0] = (alpha * (srcbuf[0] - destbuf[0]) >> 8) + destbuf[0];
destbuf[1] = (alpha * (srcbuf[1] - destbuf[1]) >> 8) + destbuf[1];
destbuf[2] = (alpha * (srcbuf[2] - destbuf[2]) >> 8) + destbuf[2];
dst_pixel[0] = c0 >> 8;
dst_pixel[1] = c1 >> 8;
dst_pixel[2] = c2 >> 8;
Copy link
Member

@hartwork hartwork Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kaixiong ,

the TL;DR version would be that I think that:

  • The >> 8 in dst_pixel[0] = c0 >> 8; and siblings is too much
  • The cast to 16 bits needs to happen elsewhere, i.e. not cast(a * b + cast(c * d)) but cast(a * b) + cast(c * d). EDIT: mis-read the code, nevermind

Let me go into more detail on former:

My impression is that this tries to implement approach "A over B" at https://en.wikipedia.org/wiki/Alpha_compositing#Description , formalized there as:

A over B

a_0 := a_a + a_b * (1.0 - a_a)
C_0 := (C_a * a_a + C_b * a_b * (1.0 - a_a)) / a_0

Whereas for us:

A := src
B := dst

With a_a and a_b ranging from 0 to 255 rather than 0.0 to 1.0 I end up with these formulas for us, note the output range annotation.

a_0 := a_a/255 + a_b/255 * (1.0 - a_a/255))  # range 0..1
    == a_a/255 + a_b/255 * (255/255 - a_a/255))
    == (a_a + a_b * (255/255 - a_a/255)))/ 255
    == (a_a + a_b * (255 - a_a) / 255) / 255  # range 0..1
    => (a_a + a_b * (255 - a_a) / 255)        # range 0..255

C_0 := (C_a * a_a/255 + C_b * a_b/255 * (1.0 - a_a/255)) / (a_0/255)  # range 0..1
    == (C_a * a_a/255 + C_b * a_b/255 * (1.0 - a_a/255)) / a_0 * 255
    == (C_a * a_a/255 + C_b * a_b/255 * (255/255 - a_a/255)) / a_0 * 255
    == (C_a * a_a/255 + C_b * a_b/255 * ((255 - a_a) / 255)) / a_0 * 255  # range 0..1
    => (C_a * a_a     + C_b * a_b     * ((255 - a_a) / 255)) / a_0 * 255  # range 0..255

Now the current code in the pull request seems to assume that the dst image has an alpha channel with full 255 opacity a_b for all pixels. I'd like to understand why and would like to suggest addition of a comment to the code, but I'll take it for granted below for a moment.

So Inserting a_b := 255 I get …

a_0 := (a_a + a_b * (255 - a_a) / 255)
    == (a_a + 255 * (255 - a_a) / 255)
    == (a_a + (255 - a_a))
    == 255

C_0 := (C_a * a_a + C_b * a_b * ((255 - a_a) / 255)) / a_0 * 255
    == (C_a * a_a + C_b * 255 * ((255 - a_a) / 255)) / 255 * 255
    == (C_a * a_a + C_b *       ((255 - a_a)      ))
    == (src_pixel[0] * src_alpha + dst_pixel[0] * ((255 - src_alpha)))

Due to a_0 == 255 then dst_pixel[3] does not need to be written, confirmed.
However, C_0 is already in range 0..255 so the additional division in dst_pixel[0] = c0 >> 8; seems to be to much and should yield in an "almost black" picture, if I am not mistaken.

I'm happy to learn what I missed or to take this to a call, e.g. if my side here was hard to understand.

Best, Sebastian

Copy link
Member Author

@kaixiong kaixiong Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hartwork, just to be clear this is only meant to be a fix of the original code. The operation here doesn't actually implement 'over' (or 'atop'), it's a simple linear interpolation between the source and target colours, using the source alpha as parameter. The target alpha is unchanged.

At some point, I want to have alpha compositing with the complete set of operators but that'll have to come later. There are additional API design considerations in there due to the necessity of premultiplied alpha, which the Wikipedia article mentions.

Regarding the actual calculation itself, the right-shift by 8 bits is necessary. Alpha is a percentage i.e. value in [0.0, 1.0] that's been mapped to [0, 255]. So what we're calculating here is really:

c = c1 * alpha/255 + c2 * (1 - alpha/255)

Multiplying throughout by 255 to work with only integers, we have:

c * 255 = c1 * alpha + c2 * (255 - alpha)

To get back c, we divide by 255 in the final step. Here >> 8 was originally chosen for performance reasons (as the test mentions), so I kept it as it is.

Since c1, c2 and alpha each have a maximum value of 255 and their products are involved, we need to work in at least uint16_t.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong thanks for elaborating, let me digest that more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong make sense to me now! 🙏

@kaixiong
Copy link
Member Author

kaixiong commented Feb 8, 2024

@hartwork, any chance you could look at this again?

@hartwork
Copy link
Member

@hartwork, any chance you could look at this again?

@kaixiong I hope to find time to, in the coming days

Copy link
Member

@hartwork hartwork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hartwork, any chance you could look at this again?

@kaixiong I understood commit Core (LV::Video): Account for underflow/overflow in the alpha blending of 32-bit videos. now — looks good! 👍

Some of the other parts — in particular MMX — I'll have to just trust you and the test suite with.

3 of the 6 commits could be worth squashing into their respective fix target if you like (I consider that optional and just an idea):

  • Core (LV::Video): Use the correct source alpha.
  • Core (Tests): Fix wrong argument order in calls to Video::get_pixel_ptr().
  • Core (Tests): Fix building of tests.

A rebase onto latest master may also be nice (but optional), I don't expect any new conflicts.

What do you think?

Comment on lines 81 to 87
uint16_t const c0 = static_cast<uint16_t> (src_pixel[0]) * src_alpha + static_cast<uint16_t> (dst_pixel[0]) * (255 - src_alpha);
uint16_t const c1 = static_cast<uint16_t> (src_pixel[1]) * src_alpha + static_cast<uint16_t> (dst_pixel[1]) * (255 - src_alpha);
uint16_t const c2 = static_cast<uint16_t> (src_pixel[2]) * src_alpha + static_cast<uint16_t> (dst_pixel[2]) * (255 - src_alpha);

destbuf[0] = (alpha * (srcbuf[0] - destbuf[0]) >> 8) + destbuf[0];
destbuf[1] = (alpha * (srcbuf[1] - destbuf[1]) >> 8) + destbuf[1];
destbuf[2] = (alpha * (srcbuf[2] - destbuf[2]) >> 8) + destbuf[2];
dst_pixel[0] = c0 >> 8;
dst_pixel[1] = c1 >> 8;
dst_pixel[2] = c2 >> 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong make sense to me now! 🙏

@kaixiong kaixiong force-pushed the alpha-blend-fixes branch 2 times, most recently from 3f21b65 to 41c799d Compare December 25, 2024 09:03
@kaixiong
Copy link
Member Author

@hartwork

Thank you! I have squashed the commits down to 3.

However.. I have also reverted the formulas in blit_overlay_alphasrc():

  1. There is no risk of underflows or overflows (claimed in my commit message) due to C/C++ integral promotion rules. Arithmetic involving uint8_ts and uint16_ts are first implicitly promoted to int, always. The added static_casts obscure this fact.

  2. Worse, the casts achieve nothing. They don't even work as optimization or verification hints. MSVC/Clang/GCC loads and zero-extends the 8-bit colour and alpha values into 32-bit registers before carrying out any computations.

  3. I used a Python script to compare the original and updated formulas for every combination of pixel values and alphas. The unfortunate fact is, the updated code has a worse mean error of -1 (versus 0 in the original), with only a marginal improvement in spread. Adding a correction factor of 255 into the numerator works but I can't explain why it's the optimal choice yet 😄. In any case, this makes the formulas a bit more complicated than what I had intended.

So to keep things simple for everyone, I'm sticking to the original formulas and calling the changes a clean-up instead of a bug-fix. What do you think?

Copy link
Member

@hartwork hartwork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong good point about integer promotion! If we stick to the original formula, let's add a comment with the intended math, as it takes too long to understand without something like it. What do you think?

libvisual/libvisual/private/lv_video_blit.cpp Show resolved Hide resolved
…_alphasrc() per Sebastian (hartwork)'s suggestion.
Copy link
Member

@hartwork hartwork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaixiong thank's for adding the comment. Approving to the extent that I can comprehend, e.g. not the parts with MMX. Let's get this merged 👍

@kaixiong kaixiong merged commit c3e5d7b into master Dec 26, 2024
6 checks passed
@kaixiong kaixiong deleted the alpha-blend-fixes branch December 26, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants