Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized SSE2 alpha blit algorithm #2766

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

itzpr3d4t0r
Copy link
Member

@itzpr3d4t0r itzpr3d4t0r commented Mar 24, 2024

Remake of #2680. Sort of a continuation of #2601. This should solve some issues with CI and clear the mess that was the previous PR.

The old strategy of processing 2 pixels at once for even widths or 1 pixel at a time for odd widths has been replaced. Now, it processes 4 pixels at once for (width // 4) iterations, and 1, 2, or 3 pixels as needed.

This change speeds up the algorithm by approximately 55% for even blits and 170% for odd blits.

@itzpr3d4t0r itzpr3d4t0r added Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface labels Mar 24, 2024
@itzpr3d4t0r itzpr3d4t0r requested a review from a team as a code owner March 24, 2024 16:18
@Starbuck5
Copy link
Member

Hey, this is a great avenue for a PR, the SSE alpha blitters are some of the oldest SIMD code we have, and we've learned new strategies since they were written.

In the old algorithm, since it was going one pixel at a time in the odd path, it had a fast path where it would just copy over the pixel data if certain conditions were met. How does the performance now stack up against that fast path?

Also, tell me more about the changes. Did you do any algorithmic changes or is it just changes to increase the parallelism?

@itzpr3d4t0r
Copy link
Member Author

I can give a partial answer because I'm in a hurry but thanks for the comment, i didn't test that. Using the same test i showed in the previous PR with a surface set to 0 alpha or 255 alpha i get the folowing result:
image
Basically the new strategy is faster (60%) when the surface has odd width and a bit slower (25-30%) in the even case. This may suggest further changes are needed but since these are edge cases (i mean when does someone blit something to a 0 alpha destination or a full 255 alpha surface onto another using alpha?) I don't think this is too bad.

@itzpr3d4t0r itzpr3d4t0r reopened this Mar 31, 2024
@itzpr3d4t0r
Copy link
Member Author

Random test fails aside (not caused by this PR) I've managed to improve performance by 15-20% by removing some unnecessary instructions and now the situation is much better compared to the old strategy in the dst_alpha=0 case as well as src=255:
image

@itzpr3d4t0r itzpr3d4t0r closed this Apr 1, 2024
@itzpr3d4t0r itzpr3d4t0r reopened this Apr 1, 2024
@MyreMylar MyreMylar closed this May 11, 2024
@MyreMylar MyreMylar reopened this May 11, 2024
@itzpr3d4t0r
Copy link
Member Author

Anyway about this:

Also, tell me more about the changes. Did you do any algorithmic changes or is it just changes to increase the parallelism?

The changes are mainly focusing on improving parallelism as I'm following the sdl formula exactly. I tried to save a few unpack here and there but it's really not much different from before.

@MyreMylar
Copy link
Member

MyreMylar commented May 25, 2024

Test program:

from sys import stdout
from pstats import Stats
from cProfile import Profile


import pygame

pygame.init()


def speed_test_blits():
    # chosen because they contain edge cases.
    nums = [0, 1, 65, 126, 127, 199, 254, 255]
    results = {}
    for iterations in range(0, 10000):
        for dst_r, dst_b, dst_a in zip(nums, reversed(nums), reversed(nums)):
            for src_r, src_b, src_a in zip(nums, reversed(nums), nums):
                src_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
                src_surf.fill((src_r, 255, src_b, src_a))
                dest_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
                dest_surf.fill((dst_r, 255, dst_b, dst_a))

                dest_surf.blit(src_surf, (0, 0))
                key = ((dst_r, dst_b, dst_a), (src_r, src_b, src_a))
                results[key] = dest_surf.get_at((65, 33))


if __name__ == "__main__":
    profiler = Profile()
    profiler.runcall(speed_test_blits)
    stats = Stats(profiler, stream=stdout)
    stats.strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats()

Results on my Raspberry Pi:


Current main branch:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.606    2.606   19.145   19.145 blitter_speed_test.py:11(speed_test_blits)
   640000   13.319    0.000   13.319    0.000 {method 'blit' of 'pygame.surface.Surface' objects}
  1280000    2.997    0.000    2.997    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
   640000    0.223    0.000    0.223    0.000 {method 'get_at' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

With PR:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.642    2.642   13.446   13.446 blitter_speed_test.py:11(speed_test_blits)
   640000    7.507    0.000    7.507    0.000 {method 'blit' of 'pygame.surface.Surface' objects}
  1280000    3.056    0.000    3.056    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
   640000    0.241    0.000    0.241    0.000 {method 'get_at' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Looks like a solid improvement here when the SSE2 code is converted to NEON code.

Copy link
Member

@MyreMylar MyreMylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, LGTM 👍

I see a solid performance improvement heading down this code path and it passes all the unit tests. The SIMD code is a lot simpler now and more in tune with what we've been doing in recent blitters rather than these older versions. Much has been learned since!

Good job! 👍 🎈 🍾

@Starbuck5
Copy link
Member

Thanks for the ping on discord, this had slipped off my radar.

Every time I write or review one of our blitter functions I write a fuzzer test to make sure the output is exactly the same before and after.

import random
import hashlib

import pygame

pygame.init()
random.seed(36)

def populate_surf(surf):
    for y in range(surf.get_height()):
        for x in range(surf.get_width()):
            surf.set_at(
                (x, y),
                (
                    random.randint(0, 255),
                    random.randint(0, 255),
                    random.randint(0, 255),
                    random.randint(0, 255),
                ),
            )

sizes = [(random.randint(1, 50), random.randint(1, 50)) for _ in range(30)]
offsets = [(random.randint(1, 20), random.randint(1, 20)) for _ in range(30)]

expected_hashes = [
    "d35958573201ec5788755bc9e3e471bb5415203144671643ad9bfb908f7df646",
    "4f5102ac56cf56462c1fc66d395d9c44777a3b756b82a28b9beb8342f0d0287c",
    "f59a078f67688d66b65a542ed8faa7eacefc91a27db2a5dc85ebc736324100a5",
    "94bc01747d7a182bfc8adf39b43b00ec46c4733a348181529c64cdfbea615c39",
    "e14cffeaeef21879ed6ec6c29171e9645ecd89dd95449914994f1a7ec9b9ccfd",
    "a6d5f758b65088347f896043a92c084d000c1d9bb66319d8dda4f1f2f5f7757f",
    "9d3940ae03ed0f985625c351736ac5ee07dec7e170f76370fdbd8b9257541a1f",
    "83985622f2f39aebe2efa7b8765e567e832e79d7b42d842afb8cc064596ed5eb",
    "3eb6988b34dd22689bb5a0b898b789dc47fd34eb4fe113831564bec02d9f4eb6",
    "66b3c4d65ecb300f12f603a778841c1fa87506ca9923bf72f51f5773566446f3",
    "50c8c736cc2bcf6197ecb5929b044bfdbde393f86e5e0373538c87e4e71ba645",
    "04d04558ad1efe6432c3940293ebe614c68935e94a7eed1577d2cd1246a7e043",
    "240c1db632a0f3c7d0804aca4d2a0937bf428a25bc9517686d0b313035283d0a",
    "a70d794fde1ccbec2addbccd87566b3dc6d89d22969156184e68cfca4b26639b",
    "1f7c02681cbb4ca7c4d6e21bc05b28ad3011984d77e0928960dfc4454573d97f",
    "c421f931cb6cbbeb13aa194967e2575d4f94a9dcfd9d3db197fc8cf7de1cd3f6",
    "6942e0a43afba737ff20ad0418a6a6700a166566f347707dbd80c28c9383e720",
    "93a4be6d2aa71396fe6c8fcb78fad62c8d4e72a22df76c3d1dd9273123dbb15d",
    "867551157ab0c352a5b730664087322e5757c4c50c01ce5778c55e520852c325",
    "15b6c9a4051dd8f2499cd8244c4f0d3effab225ce6eba254e2e6f83da3156a82",
    "fc280abba3b0c05cf8f3a3d67c8bf946a9509f15d00538c41ce9191bb2849dce",
    "22c14fb6f991056287b2aceab9e9a3630bbacc00529564b91a85183cb2559c3c",
    "31b70c1c6da2a55036d0139da7ee1d0171bc87ca80f5e8d3f8b60ef3d714fb16",
    "e2e87c6cc6b07b64e8d9367c28c19d6cd4b712db2fd0514ac941def98470bcf7",
    "a67aa0f76b96ac26c5bfa2f558431b5244a84fe40843d17a85d1bfdb80991a6f",
    "488d7f0cc883e498575f261f52df7cc142cb0f0244d7722f28018c6e87542b31",
    "c6ad14ba201acaba673381093fc7f04764cae6ab1b4c724cd4336dd25341ed7c",
    "47573d029dd21db29351074501033a5c029f47e34d625d39e1be166d5c4ed537",
    "1c3e54146c2a0d4918f64047525dc79fd45c2c3db74d9dea5dd4a9c862e32a87",
    "1940432ae8813a746d4201c14ccc6456d93fb65680c3a69861967579ab14a03e",
]
hashes = []

for size, offset in zip(sizes, offsets):
    source = pygame.Surface(size, pygame.SRCALPHA)
    populate_surf(source)

    dest = pygame.Surface(
        (random.randint(1, 50), (random.randint(1, 50))), pygame.SRCALPHA
    )
    populate_surf(dest)

    dest.blit(source, offset)

    sha256 = hashlib.sha256()
    sha256.update(pygame.image.tobytes(dest, "RGBA"))
    digest = sha256.hexdigest()

    assert digest == expected_hashes.pop(0)

    hashes.append(digest)
    print(digest)

print(hashes)
print("Success")

This script passes successfully on both SSE2 and AVX2 implementations before this PR, but raises an exception in the SSE2 implementation in this PR. I have not tracked down exactly what is different, but this shows that the new implementation is not producing the same results as the old one. If I adjust the code to print instead of assert I can see that 22/30 cases are the same, it's just 8/30 doing something different.

dstp = (Uint32 *)dstp128;

for (i = 0; i < pxl_excess; i++) {
if (!(*dstp & amask)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line definitely needs a comment.

With that in mind I noticed the old implementation had a lot more comments that linked lines of intrinsics to operations in the formula. It would be nice to preserve something like that if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on this, but your program seems to pass on my machine, i got this output:

d35958573201ec5788755bc9e3e471bb5415203144671643ad9bfb908f7df646
4f5102ac56cf56462c1fc66d395d9c44777a3b756b82a28b9beb8342f0d0287c
f59a078f67688d66b65a542ed8faa7eacefc91a27db2a5dc85ebc736324100a5
94bc01747d7a182bfc8adf39b43b00ec46c4733a348181529c64cdfbea615c39
e14cffeaeef21879ed6ec6c29171e9645ecd89dd95449914994f1a7ec9b9ccfd
a6d5f758b65088347f896043a92c084d000c1d9bb66319d8dda4f1f2f5f7757f
9d3940ae03ed0f985625c351736ac5ee07dec7e170f76370fdbd8b9257541a1f
83985622f2f39aebe2efa7b8765e567e832e79d7b42d842afb8cc064596ed5eb
3eb6988b34dd22689bb5a0b898b789dc47fd34eb4fe113831564bec02d9f4eb6
66b3c4d65ecb300f12f603a778841c1fa87506ca9923bf72f51f5773566446f3
50c8c736cc2bcf6197ecb5929b044bfdbde393f86e5e0373538c87e4e71ba645
04d04558ad1efe6432c3940293ebe614c68935e94a7eed1577d2cd1246a7e043
240c1db632a0f3c7d0804aca4d2a0937bf428a25bc9517686d0b313035283d0a
a70d794fde1ccbec2addbccd87566b3dc6d89d22969156184e68cfca4b26639b
1f7c02681cbb4ca7c4d6e21bc05b28ad3011984d77e0928960dfc4454573d97f
c421f931cb6cbbeb13aa194967e2575d4f94a9dcfd9d3db197fc8cf7de1cd3f6
6942e0a43afba737ff20ad0418a6a6700a166566f347707dbd80c28c9383e720
93a4be6d2aa71396fe6c8fcb78fad62c8d4e72a22df76c3d1dd9273123dbb15d
867551157ab0c352a5b730664087322e5757c4c50c01ce5778c55e520852c325
15b6c9a4051dd8f2499cd8244c4f0d3effab225ce6eba254e2e6f83da3156a82
fc280abba3b0c05cf8f3a3d67c8bf946a9509f15d00538c41ce9191bb2849dce
22c14fb6f991056287b2aceab9e9a3630bbacc00529564b91a85183cb2559c3c
31b70c1c6da2a55036d0139da7ee1d0171bc87ca80f5e8d3f8b60ef3d714fb16
e2e87c6cc6b07b64e8d9367c28c19d6cd4b712db2fd0514ac941def98470bcf7
a67aa0f76b96ac26c5bfa2f558431b5244a84fe40843d17a85d1bfdb80991a6f
488d7f0cc883e498575f261f52df7cc142cb0f0244d7722f28018c6e87542b31
c6ad14ba201acaba673381093fc7f04764cae6ab1b4c724cd4336dd25341ed7c
47573d029dd21db29351074501033a5c029f47e34d625d39e1be166d5c4ed537
1c3e54146c2a0d4918f64047525dc79fd45c2c3db74d9dea5dd4a9c862e32a87
1940432ae8813a746d4201c14ccc6456d93fb65680c3a69861967579ab14a03e
['d35958573201ec5788755bc9e3e471bb5415203144671643ad9bfb908f7df646', '4f5102ac56cf56462c1fc66d395d9c44777a3b756b82a28b9beb8342f0d0287c', 'f59a078f67688d66b65a542ed8faa7eacefc91a27db2a5dc85ebc736324100a5', '94bc01747d7a182bfc8adf39b43b00ec46c4733a348181529c64cdfb
ea615c39', 'e14cffeaeef21879ed6ec6c29171e9645ecd89dd95449914994f1a7ec9b9ccfd', 'a6d5f758b65088347f896043a92c084d000c1d9bb66319d8dda4f1f2f5f7757f', '9d3940ae03ed0f985625c351736ac5ee07dec7e170f76370fdbd8b9257541a1f', '83985622f2f39aebe2efa7b8765e567e832e79d7b42d84
2afb8cc064596ed5eb', '3eb6988b34dd22689bb5a0b898b789dc47fd34eb4fe113831564bec02d9f4eb6', '66b3c4d65ecb300f12f603a778841c1fa87506ca9923bf72f51f5773566446f3', '50c8c736cc2bcf6197ecb5929b044bfdbde393f86e5e0373538c87e4e71ba645', '04d04558ad1efe6432c3940293ebe614c689
35e94a7eed1577d2cd1246a7e043', '240c1db632a0f3c7d0804aca4d2a0937bf428a25bc9517686d0b313035283d0a', 'a70d794fde1ccbec2addbccd87566b3dc6d89d22969156184e68cfca4b26639b', '1f7c02681cbb4ca7c4d6e21bc05b28ad3011984d77e0928960dfc4454573d97f', 'c421f931cb6cbbeb13aa194967
e2575d4f94a9dcfd9d3db197fc8cf7de1cd3f6', '6942e0a43afba737ff20ad0418a6a6700a166566f347707dbd80c28c9383e720', '93a4be6d2aa71396fe6c8fcb78fad62c8d4e72a22df76c3d1dd9273123dbb15d', '867551157ab0c352a5b730664087322e5757c4c50c01ce5778c55e520852c325', '15b6c9a4051dd8f2
499cd8244c4f0d3effab225ce6eba254e2e6f83da3156a82', 'fc280abba3b0c05cf8f3a3d67c8bf946a9509f15d00538c41ce9191bb2849dce', '22c14fb6f991056287b2aceab9e9a3630bbacc00529564b91a85183cb2559c3c', '31b70c1c6da2a55036d0139da7ee1d0171bc87ca80f5e8d3f8b60ef3d714fb16', 'e2e87c
6cc6b07b64e8d9367c28c19d6cd4b712db2fd0514ac941def98470bcf7', 'a67aa0f76b96ac26c5bfa2f558431b5244a84fe40843d17a85d1bfdb80991a6f', '488d7f0cc883e498575f261f52df7cc142cb0f0244d7722f28018c6e87542b31', 'c6ad14ba201acaba673381093fc7f04764cae6ab1b4c724cd4336dd25341ed7c', '47573d029dd21db29351074501033a5c029f47e34d625d39e1be166d5c4ed537', '1c3e54146c2a0d4918f64047525dc79fd45c2c3db74d9dea5dd4a9c862e32a87', '1940432ae8813a746d4201c14ccc6456d93fb65680c3a69861967579ab14a03e']
Success

Copy link
Member Author

@itzpr3d4t0r itzpr3d4t0r Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and the pugame version:
pygame-ce 2.5.0.dev1 (SDL 2.30.1, Python 3.12.0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you’re running the SSE2 codepath?

I also was testing on the branch rebased against develop, if it makes a difference.

@itzpr3d4t0r itzpr3d4t0r force-pushed the sse_alphablit_optimization branch from e71d400 to 7f5ac33 Compare December 13, 2024 13:07
@itzpr3d4t0r itzpr3d4t0r requested a review from a team as a code owner December 13, 2024 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants