Improved Memory Operations #174
Merged
Closes #172 Co-authored-by: Takuya Hashimoto <[email protected]>
On the Leipzig1M dataset, LibC vs SZ:
- ~ 128b lines, aligned: 2.3 vs 2.6 GB/s
- ~ 128b lines, unaligned: 2.34 vs 2.53 GB/s
- ~ 5b tokens, aligned: 0.1 vs 0.1 GB/s
- ~ 5b tokens, unaligned: 0.1 vs 0.1 GB/s
- ~ 124 MB, aligned: 19.6 vs 20.3 GB/s
- ~ 124 MB, unaligned: 19.6 vs 20.3 GB/s
Previously SZ would build too many targets for each debugging session.
This commit accelerates `sz_fill_avx2` and `sz_copy_avx2` by avoiding unaligned writes. It also adds `sz_equal_avx2` to help validate large files with matching checksums faster, and a placeholder for `sz_order_avx2`, discouraging further optimizations. A C++ API with a matching argument order was added to mimic `std::memcpy`, `std::memset`, and `std::memmove`. The matching `test_memory_utilities` tests were extended.
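To illustrate what "avoiding unaligned writes" can look like, here is a minimal portable sketch of a fill routine that issues wide stores only at aligned addresses. It is not the `sz_fill_avx2` code from this PR: the name `fill_aligned_sketch` is hypothetical, and it uses 8-byte words instead of 32-byte AVX2 registers so that it compiles without any SIMD flags. The structure (byte-wise head, aligned wide body, byte-wise tail) is the part that would carry over.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of StringZilla: fills `length` bytes at
 * `target` with `value`, performing wide stores only at 8-byte boundaries. */
void fill_aligned_sketch(unsigned char *target, size_t length, unsigned char value) {
    uint64_t wide;
    memset(&wide, value, sizeof(wide)); /* broadcast the byte into a 64-bit pattern */

    /* Head: single-byte stores until `target` reaches an 8-byte boundary. */
    while (length && ((uintptr_t)target & 7u)) {
        *target++ = value;
        --length;
    }
    /* Body: every store below lands on an aligned address.
     * A fixed-size `memcpy` is typically compiled into one 8-byte store. */
    while (length >= 8) {
        memcpy(target, &wide, 8);
        target += 8;
        length -= 8;
    }
    /* Tail: the remaining 0 to 7 bytes. */
    while (length) {
        *target++ = value;
        --length;
    }
}
```

An AVX2 variant would presumably follow the same three phases with 32-byte alignment and aligned vector stores in the body.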
In AVX-512, similar to GLibC we should use the register space to load more data simultaneously and avoid loops and data-dependency between iterations.
The new `sz_look_up_transform` API implements a 256-byte lookup table transform, in both serial code and AVX-512, that can significantly accelerate text and image processing. The AVX-512 implementation reaches 18 GB/s on an Intel Sapphire Rapids CPU, while the serial code stays around 3 GB/s for large files.
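For readers unfamiliar with the technique, a serial 256-byte lookup table transform can be sketched as follows. The function names and the argument order here are assumptions made for illustration and may not match the actual `sz_look_up_transform` signature.

```c
#include <stddef.h>

/* Hypothetical serial sketch: maps every input byte through a 256-entry table. */
void look_up_transform_sketch(const unsigned char *source, size_t length,
                              const unsigned char lut[256], unsigned char *target) {
    for (size_t i = 0; i != length; ++i) target[i] = lut[source[i]];
}

/* Example table: identity for all bytes except ASCII 'a'..'z', which map to 'A'..'Z'. */
void make_uppercase_lut(unsigned char lut[256]) {
    for (int c = 0; c != 256; ++c)
        lut[c] = (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : (unsigned char)c;
}
```

The same loop covers case folding, character masking, or per-channel image adjustments; only the table contents change.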
ashvardanian force-pushed the main-dev branch from 6dacbb2 to 165986f (October 12, 2024 18:28)
ashvardanian force-pushed the main-dev branch from 45dd093 to 1baa3a9 (October 12, 2024 21:00)
ashvardanian force-pushed the main-dev branch 3 times, most recently from 3d20005 to fb06b66 (October 12, 2024 21:27)
This update brings many performance optimizations ahead of the next wave of breaking major releases, which will add new functionality and support a wider range of CPUs. Time to get excited 🥳
Faster `memcpy` and `memset`
On Intel Sapphire Rapids:
On AWS Graviton 4 we still have room for improvement.
A potential improvement can come from non-temporal stores on large payloads.
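As a rough illustration of the non-temporal idea, the sketch below copies a buffer with SSE2 streaming stores, which write around the cache instead of filling it with data that will not be reused soon. This is not code from the PR: `copy_streaming_sketch` is a hypothetical name, SSE2 is an x86 facility (an Arm target such as Graviton would need its own non-temporal instructions), and the function simply falls back to `memcpy` when SSE2 is unavailable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch, not part of StringZilla: copies `length` bytes using
 * cache-bypassing stores for the aligned body of the destination. */
void copy_streaming_sketch(unsigned char *target, const unsigned char *source, size_t length) {
#if defined(__SSE2__)
    /* Head: `_mm_stream_si128` requires a 16-byte-aligned destination,
     * so copy the bytes before the first boundary normally. */
    size_t head = (size_t)(-(uintptr_t)target & 15u);
    if (head > length) head = length;
    memcpy(target, source, head);
    target += head, source += head, length -= head;

    /* Body: unaligned loads from the source, non-temporal stores to the target. */
    for (; length >= 16; target += 16, source += 16, length -= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)source);
        _mm_stream_si128((__m128i *)target, chunk);
    }
    /* Streaming stores are weakly ordered; the fence makes them visible
     * before any later stores. */
    _mm_sfence();

    /* Tail: the remaining 0 to 15 bytes. */
    memcpy(target, source, length);
#else
    memcpy(target, source, length); /* portable fallback without streaming stores */
#endif
}
```

Whether this helps depends on the payload size relative to the cache; for small buffers regular stores are usually the better choice, so a real implementation would likely switch paths above some size threshold.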
256-byte Look-Up Table Transform
On Intel Sapphire Rapids:
On AWS Graviton 4: