Add SIMD versions of the invert transform #2534
slightly faster
I've been doing quick visual and performance testing with this test program:

import pygame
import timeit


def invert_walt(walt_image):
    return pygame.transform.invert(walt_image)


pygame.init()
display_surf = pygame.display.set_mode((800, 350))

walt = pygame.image.load("images/walt.png").convert_alpha()
walt_opaque = walt.convert()
walt_inverse = invert_walt(walt)

# Visual check: show the original and the inverted image side by side.
running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    display_surf.fill((255, 150, 255))
    display_surf.blit(walt, (0, 0))
    display_surf.blit(walt_inverse, (400, 0))
    pygame.display.flip()

# Performance check: runs once the window has been closed.
time_taken = timeit.timeit(
    "invert_walt(walt)",
    setup="import pygame\n\n"
    "pygame.init()\n\n"
    "display_surf = pygame.display.set_mode((800, 350))\n\n"
    "walt = pygame.image.load('images/walt.png').convert_alpha()\n\n"
    "def invert_walt(walt_image): "
    " return pygame.transform.invert(walt_image)",
    number=1000,
)
print("time taken to do 1000 walt inverts:", time_taken)

It is not compulsory to test with Walter, but it is traditional.
For better CPI and throughput
# Conflicts: # src_c/simd_transform.h # src_c/simd_transform_sse2.c
Co-authored-by: Alberto <[email protected]>
# Conflicts: # src_c/simd_transform.h # src_c/simd_transform_avx2.c # src_c/simd_transform_sse2.c
LGTM
Also, simplify mask code.
Tested this out locally; it gives pixel-perfect results for all my test cases, both before and after this PR.
import random
import hashlib

import pygame

random.seed(36)

expected_hashes = [
    "82c38da441d3325c0ae6fa3fffa74b8b760a26d346a9fc7441c4433efc1210c7",
    "3b05aaa0469f40bfef54207d34f48a868c3a6fa665a76f8a6a8e7f34feb47935",
    "0192eeaf93728e40ab4e11bb5c9d85302ff927b48832fb95fbaceb69e9108813",
    "e8e41e3b0c08e81f29062520b0abee1f3104fc8a90966fbd29a2b46acd097791",
    "340cf63be38ad92cb7a4d836f82d2dcedc550c14ec68c783f203fd127184f6cc",
    "35f3d8ed7268d3f3fb54ebed760e08c85b09ae98960107e3c26f559ba4cba80f",
    "3f4dc52b63a567fd77b431c67be01a8cd0a2e5c5d18961a3f8d7d26faaf46976",
    "266377ffd231505911509d03a75f6ee757375faf5e965da3aa020322209a4486",
    "6d3ec34adca4c1337e000ec339901aef45039009c3355328eca28a5ba5df55e1",
    "edc66664ce5448c742e5848cce2dc2cbaed0383a16a46093551c40c5df37e384",
    "0fecf8c40a3ba2c67c382b2900e29cea152ecd1f74d212464d0f8b883131c57e",
    "f4ad7e15493f8c3862a5aa74ba093c4ab2d6b715939c286c52d61c636d03c1e1",
]

surf_height = 5
offset = (3, 7)


def populate_surf(surf):
    # Fill every pixel with a seeded random RGBA value.
    for y in range(surf.get_height()):
        for x in range(surf.get_width()):
            surf.set_at(
                (x, y),
                (
                    random.randint(0, 255),
                    random.randint(0, 255),
                    random.randint(0, 255),
                    random.randint(0, 255),
                ),
            )


hashes = []
for pixel_width in range(4, 16):
    dest = pygame.Surface((pixel_width, surf_height), pygame.SRCALPHA)
    populate_surf(dest)
    print(dest, dest.get_pitch())
    dest = dest.premul_alpha()
    dest = pygame.transform.invert(dest)

    sha256 = hashlib.sha256()
    sha256.update(pygame.image.tobytes(dest, "RGBA"))
    digest = sha256.hexdigest()

    assert digest == expected_hashes.pop(0)
    hashes.append(digest)
    # print(digest)

print(hashes)
Related to #2476.
This adds SSE2/Neon and AVX2 versions of the invert transform.
In quick performance testing, the SSE2 version is about 12 times faster at Walt inversions, and the AVX2 version is about 13 times faster. I'd expect AVX2 to pull further ahead at larger image sizes. That said, the invert transform consists of only one operation and SSE2 can already process four pixels at a time, so there isn't as much to gain as in some other operations.
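
To illustrate the "one operation, four pixels at a time" point, here is a rough pure-Python sketch, not the C code from this PR. It assumes a hypothetical 0xAARRGGBB pixel layout and assumes that inverting means replacing each colour channel with 255 minus its value while leaving alpha untouched, which is the same as XOR-ing the pixel with a mask covering the colour bits. The function names and the 128-bit integer standing in for an SSE2 register are made up for illustration.

```python
# Assumed layout (hypothetical): each pixel is a 32-bit integer 0xAARRGGBB.
ALPHA_MASK = 0xFF000000
RGB_INVERT_MASK = ~ALPHA_MASK & 0xFFFFFFFF  # 0x00FFFFFF: flip colour bits only


def invert_scalar(pixels):
    """Reference version: invert one channel of one pixel at a time."""
    out = []
    for p in pixels:
        a = (p >> 24) & 0xFF
        r = 255 - ((p >> 16) & 0xFF)
        g = 255 - ((p >> 8) & 0xFF)
        b = 255 - (p & 0xFF)
        out.append((a << 24) | (r << 16) | (g << 8) | b)
    return out


def invert_four_at_a_time(pixels):
    """Emulate a 128-bit register: pack 4 pixels, XOR once, unpack."""
    # Repeat the 32-bit mask across all four 32-bit lanes.
    wide_mask = 0
    for lane in range(4):
        wide_mask |= RGB_INVERT_MASK << (32 * lane)

    out = []
    full = len(pixels) - len(pixels) % 4  # pixels that fill whole registers
    for i in range(0, full, 4):
        reg = 0
        for lane in range(4):
            reg |= pixels[i + lane] << (32 * lane)
        reg ^= wide_mask  # the single operation of the transform
        for lane in range(4):
            out.append((reg >> (32 * lane)) & 0xFFFFFFFF)

    # Leftover pixels that do not fill a register are handled one by one.
    for p in pixels[full:]:
        out.append(p ^ RGB_INVERT_MASK)
    return out


sample = [0xFF000000, 0x80FFFFFF, 0x00123456, 0x7FABCDEF,
          0x01020304, 0xFFFEFDFC, 0x10203040]
assert invert_four_at_a_time(sample) == invert_scalar(sample)
```

The sample deliberately has seven pixels, so both the four-wide path and the leftover path are exercised, in the same spirit as testing surface widths from 4 to 15 above.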