Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up CRC32 calculation on LoongArch #86

Closed
wants to merge 1 commit into from

Conversation

xry111
Copy link
Contributor

@xry111 xry111 commented Feb 25, 2024

Pull request checklist

Please check if your PR fulfills the following requirements:

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been reviewed and added / updated if needed (for bug fixes / features)
  • Build was run locally and without warnings or errors
  • All previous and new tests pass

Pull request type

Please check the type of change your PR introduces:

  • Feature

What is the current behavior?

On LoongArch the generic table-based CRC32 implementation is used.

Related Issue URL: None

What is the new behavior?

The crc.w.{b/h/w/d}.w instructions in LoongArch can calculate the CRC32 result for 1/2/4/8 bytes in a single operation, making the use of LoongArch CRC32 instructions much faster compared to the general CRC32 algorithm.

Optimized CRC32 will be enabled if the kernel declares the LoongArch CRC32 instructions supported via AT_HWCAP.

Does this introduce a breaking change?

  • No

Other information

@xry111 xry111 force-pushed the xry111/loongarch-crc branch from d2da853 to a5f69f9 Compare February 25, 2024 12:27
@xry111
Copy link
Contributor Author

xry111 commented Feb 25, 2024

Wait a minute... There are some warnings I'd not spotted.

@xry111 xry111 marked this pull request as draft February 25, 2024 13:20
@xry111 xry111 force-pushed the xry111/loongarch-crc branch from a5f69f9 to ced63ba Compare February 25, 2024 13:25
@xry111
Copy link
Contributor Author

xry111 commented Feb 25, 2024

Wait a minute... There are some warnings I'd not spotted.

Fixed. And added cmake support.

@xry111 xry111 marked this pull request as ready for review February 25, 2024 13:26
@xry111 xry111 force-pushed the xry111/loongarch-crc branch from ced63ba to 7cb19f9 Compare February 26, 2024 07:47
@xry111
Copy link
Contributor Author

xry111 commented Feb 26, 2024

Removed tabs from CMakeLists.txt.

CMakeLists.txt Outdated Show resolved Hide resolved
@xry111 xry111 force-pushed the xry111/loongarch-crc branch from 7cb19f9 to 4023849 Compare February 26, 2024 09:52
@JiaT75
Copy link
Contributor

JiaT75 commented Feb 29, 2024

Hello! Thanks for the PR. Overall it looks like you did a great job with this.

Can you provide benchmarks to show the speed increase from this? Specifically, can you show one version with the alignment adjustment in crc32_arch_optimized() and one without? I just want to be sure the alignment adjustment code is worth it for LoongArch. It would be great to vary the input buffer size to see how the speed improvements scale.

Also, how necessary are the runtime detection checks? Are there LoongArch chips that do not have the CRC32 instruction?

configure.ac Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
configure.ac Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
@xry111 xry111 force-pushed the xry111/loongarch-crc branch 3 times, most recently from 9282d42 to dab983c Compare February 29, 2024 17:23
@xry111
Copy link
Contributor Author

xry111 commented Feb 29, 2024

Hello! Thanks for the PR. Overall it looks like you did a great job with this.

Can you provide benchmarks to show the speed increase from this? Specifically, can you show one version with the alignment adjustment in crc32_arch_optimized() and one without? I just want to be sure the alignment adjustment code is worth it for LoongArch. It would be great to vary the input buffer size to see how the speed improvements scale.

I'll do it tomorrow.

Also, how necessary are the runtime detection checks? Are there LoongArch chips that do not have the CRC32 instruction?

The specification says 64-bit LoongArch chips shall implement CRC32 instructions, but 32-bit LoongArch chips may lack them (though no 32-bit LoongArch chips have been launched as at now).

@xry111 xry111 force-pushed the xry111/loongarch-crc branch from dab983c to f2c2510 Compare February 29, 2024 17:29
@JiaT75
Copy link
Contributor

JiaT75 commented Feb 29, 2024

I'll do it tomorrow.

Thanks!

The specification says 64-bit LoongArch chips shall implement CRC32 instructions, but 32-bit LoongArch chips may lack them (though no 32-bit LoongArch chips have been launched as at now).

Ok that is great to know. I had not found any references to 32-bit LoongArch chips, so that makes sense. Is it likely that 32-bit chips will be made? Otherwise it will simplify things to just design the code for 64-bit LoongArch and not bother with the runtime checks at all. Future 32-bit LoongArch may need extra compiler flags or a function __attribute__() so the code wouldn't be able to work as-is anyway.

@xry111 xry111 force-pushed the xry111/loongarch-crc branch from f2c2510 to e079b80 Compare March 1, 2024 07:37
@xry111
Copy link
Contributor Author

xry111 commented Mar 1, 2024

I'll do it tomorrow.

Thanks!

The specification says 64-bit LoongArch chips shall implement CRC32 instructions, but 32-bit LoongArch chips may lack them (though no 32-bit LoongArch chips have been launched as at now).

Ok that is great to know. I had not found any references to 32-bit LoongArch chips, so that makes sense. Is it likely that 32-bit chips will be made? Otherwise it will simplify things to just design the code for 64-bit LoongArch and not bother with the runtime checks at all. Future 32-bit LoongArch may need extra compiler flags or a function __attribute__() so the code wouldn't be able to work as-is anyway.

It's likely to be made but we are so unsure about some details of it (and whether we need some GCC flags for attributes for it). So I've modified the code to 64-bit-only and removed runtime detection for now.

@xry111
Copy link
Contributor Author

xry111 commented Mar 1, 2024

Can you provide benchmarks to show the speed increase from this?

10M buffer, repeat 100 times: 0.7116s to 0.1015s
1M buffer, repeat 1000 times: 0.7114s to 0.1002s
100K buffer, repeat 10000 times: 0.7009s to 0.1001s
10K buffer, repeat 100000 times: 0.7009s to 0.1002s
1K buffer, repeat 1000000 times: 0.7016s to 0.1010s
100B buffer, repeat 10000000 times: 0.8410s to 0.1081s
10B buffer, repeat 100000000 times: 1.2315s to 0.2002s

Specifically, can you show one version with the alignment adjustment in crc32_arch_optimized() and one without? I just want to be sure the alignment adjustment code is worth it for LoongArch.

Some low-end 64-bit LoongArch CPUs (2K1000 for example) do not support unaligned access, on these CPUs unaligned access will trap and be emulated by the kernel (very slow). So we have to adjust the alignment anyway... I don't have a 2K1000 board for testing though, on my board (3A6000) the alignment adjustment only produces ~1% improvement.

@JiaT75
Copy link
Contributor

JiaT75 commented Mar 2, 2024

Thanks for the benchmarking numbers, those easily justify including this feature :)

Some low-end 64-bit LoongArch CPUs (2K1000 for example) do not support unaligned access, on these CPUs unaligned access will trap and be emulated by the kernel (very slow). So we have to adjust the alignment anyway... I don't have a 2K1000 board for testing though, on my board (3A6000) the alignment adjustment only produces ~1% improvement.

If there are LoongArch CPUs that do not support unaligned access, that is plenty reason to have the code to align the buffer. Thanks for the info!

@emansom
Copy link

emansom commented Mar 29, 2024

Making small digression here, nice organization. You attacked person just because of their place of birth?

No discrimination here, and that org of me you're likely referring to is a wink to 1984 thought police happening in China.

I'm stating that people packaging xz should be very wary of any code touched or given feedback on by JiaT75 given todays security disclosure, no personal gripes to any.

@DanielRuf
Copy link

I presume "speed up" in actuality means something different here? "Speeding up" remote code execution on "special" xz archives?

Currently I'm unsure how this PR is connected to the backdoor, if at all. Is this really related or just unrelated? Because in my opinion the architecture is not relevant (since the requirements for the backdoor are different and unrelated to the CPU architecture).

I'm stating that people packaging xz should be very wary of any code touched or given feedback on by JiaT75 given todays security disclosure, no personal gripes to any.

Makes sense. A very interesting case which seems to have been planned over a long timespan. At least that's the assumption from the mentioned commits from https://www.openwall.com/lists/oss-security/2024/03/29/4

@Trimester6
Copy link

Currently I'm unsure how this PR is connected to the backdoor, if at all. Is this really related or just unrelated? Because in my opinion the architecture is not relevant (since the requirements for the backdoor are different and unrelated to the CPU architecture).

Unrelated

@Larhzu Larhzu closed this Apr 5, 2024
@Larhzu Larhzu reopened this Apr 12, 2024
@Larhzu Larhzu closed this Apr 12, 2024
@Larhzu
Copy link
Member

Larhzu commented Apr 12, 2024

GitHub had closed all pull requests and marked them as if I had closed them on 2024-04-05. I didn't close any PRs on that day, GitHub just misattributed it to me.

Now that I try to re-open these, GitHub immediately closes them again, once again claiming that I did that.

@Larhzu Larhzu reopened this Apr 12, 2024
@Larhzu Larhzu closed this Apr 12, 2024
@Larhzu
Copy link
Member

Larhzu commented Apr 12, 2024

On IRC it was suggested that this likely happens because the forks associated with the PRs are disabled. So this PR would require that @xry111 re-enables or re-creates the fork.

@xry111
Copy link
Contributor Author

xry111 commented Apr 12, 2024

On IRC it was suggested that this likely happens because the forks associated with the PRs are disabled. So this PR would require that @xry111 re-enables or re-creates the fork.

I'll recreate it after 5.8.0 release considering you are focusing on more important issues now.

@xry111
Copy link
Contributor Author

xry111 commented Apr 13, 2024

BTW strangely my previous fork is now marked as "forked from xxxxxx/xz" (xxxxxx is a different user instead of tukaani-project) and it's still suspended. I'm contacting with GH support to see if they can just delete the fork and then I can fork again.

@gnubufferoverflows
Copy link

@xry111 This is strange that GitHub chose to suspend your fork, because during the time that GitHub had suspended everything, there were many other xz forks that were not suspended. So, perhaps they thought you were a Jia sockpuppet due to being Chinese.

@thesamesam
Copy link
Member

thesamesam commented Apr 14, 2024

Not sure that's right. At the very least, the fork count went down to 0 as did star count on the main repo. Plus the repo that his now shows as a fork of is also disabled.

We also can't reopen any other PRs, so...

@xry111
Copy link
Contributor Author

xry111 commented Apr 14, 2024

Not sure that's right. At the very least, the fork count went down to 0 as did star count on the main repo. Plus the repo that his now shows as a fork of is also disabled.

We also can't reopen any other PRs, so...

I found https://github.com/ryandesign/xz is also still suspended, and it seems ryandesign (I'd not use a @ so I won't disturb him) worked it around by creating another fork named "xz-1": https://github.com/ryandesign/xz-1

While I can do the same I don't really like it. Let's wait several days for GH support response...

@gnubufferoverflows
Copy link

I think what happened then is: the forks got suspended, but there were a few mirrors of the xz repo, which didn't get suspended.

@xry111
Copy link
Contributor Author

xry111 commented May 26, 2024

In the meantime I'm having an eye on https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652612.html. If it's landed it'd be better to use the generic built-in instead of target-specific intrinsic.

// Align the input buffer because this was shown to be
// significantly faster than unaligned accesses.
const size_t align_amount = my_min(size,
(8 - (uintptr_t)buf) & (8 - 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(0 - (uintptr_t)buf) does the same thing and should map to a negation instruction. Possibly a compiler will do the same from the current code too but using a zero makes it explicit. A similar change was done in 2337f70. Using 0U instead of 0 isn't important here as uintptr_t will promote the math to unsigned anyway so it's just what one prefers.

Some would write just (-(uintptr_t)buf) but that can cause warnings due to sign conversion, thus I prefer writing it as a subtraction from zero.

@Larhzu
Copy link
Member

Larhzu commented Jun 3, 2024

In the meantime I'm having an eye on https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652612.html. If it's landed it'd be better to use the generic built-in instead of target-specific intrinsic.

That makes sense but the target-specific intrinsic works with current toolchains and not everyone can upgrade them as quickly as they might upgrade xz. The target-specific code is simple enough so finishing this PR in the near future makes sense to me.

@Larhzu Larhzu self-assigned this Jun 3, 2024
Larhzu pushed a commit that referenced this pull request Jun 27, 2024
The crc.w.{b/h/w/d}.w instructions in LoongArch can calculate the CRC32
result for 1/2/4/8 bytes in a single operation. Using these is much
faster compared to the generic method.

Optimized CRC32 is enabled unconditionally on 64-bit LoongArch because
the LoongArch specification says that CRC32 instructions shall be
implemented for 64-bit processors. Optimized CRC32 isn't enabled for
32-bit LoongArch processors because not enough information is available
about them.

Co-authored-by: Lasse Collin <[email protected]>

Closes: #86
@Larhzu
Copy link
Member

Larhzu commented Jun 27, 2024

@xry111: Could you check if the branch crc32_loongarch is fine? I tried to be careful but I cannot actually test it. Thanks!

@xry111
Copy link
Contributor Author

xry111 commented Jun 28, 2024

@xry111: Could you check if the branch crc32_loongarch is fine? I tried to be careful but I cannot actually test it. Thanks!

There are some test failures and I'm investigating.

@xry111
Copy link
Contributor Author

xry111 commented Jun 28, 2024

@xry111: Could you check if the branch crc32_loongarch is fine? I tried to be careful but I cannot actually test it. Thanks!

There are some test failures and I'm investigating.

diff --git a/src/liblzma/check/crc32_loongarch.h b/src/liblzma/check/crc32_loongarch.h
index f154ebc8..a25ed2ec 100644
--- a/src/liblzma/check/crc32_loongarch.h
+++ b/src/liblzma/check/crc32_loongarch.h
@@ -22,7 +22,7 @@ crc32_arch_optimized(const uint8_t *buf, size_t size, uint32_t crc_unsigned)
 	int32_t crc = (int32_t)~crc_unsigned;
 
 	if (size >= 8) {
-		size -= (uintptr_t)buf & 7;
+		size -= 8 - (uintptr_t)buf & 7;
 
 		if ((uintptr_t)buf & 1)
 			crc = __crc_w_b_w((int8_t)*buf++, crc);

@xry111
Copy link
Contributor Author

xry111 commented Jun 28, 2024

@xry111: Could you check if the branch crc32_loongarch is fine? I tried to be careful but I cannot actually test it. Thanks!

There are some test failures and I'm investigating.

diff --git a/src/liblzma/check/crc32_loongarch.h b/src/liblzma/check/crc32_loongarch.h
index f154ebc8..a25ed2ec 100644
--- a/src/liblzma/check/crc32_loongarch.h
+++ b/src/liblzma/check/crc32_loongarch.h
@@ -22,7 +22,7 @@ crc32_arch_optimized(const uint8_t *buf, size_t size, uint32_t crc_unsigned)
 	int32_t crc = (int32_t)~crc_unsigned;
 
 	if (size >= 8) {
-		size -= (uintptr_t)buf & 7;
+		size -= 8 - (uintptr_t)buf & 7;
 
 		if ((uintptr_t)buf & 1)
 			crc = __crc_w_b_w((int8_t)*buf++, crc);

Actually (8 - (uintptr_t)buf) & 7 to silence -Wparentheses (cmake enables it but autotools does not).

Larhzu pushed a commit that referenced this pull request Jun 28, 2024
The crc.w.{b/h/w/d}.w instructions in LoongArch can calculate the CRC32
result for 1/2/4/8 bytes in a single operation. Using these is much
faster compared to the generic method.

Optimized CRC32 is enabled unconditionally on 64-bit LoongArch because
the LoongArch specification says that CRC32 instructions shall be
implemented for 64-bit processors. Optimized CRC32 isn't enabled for
32-bit LoongArch processors because not enough information is available
about them.

Co-authored-by: Lasse Collin <[email protected]>

Closes: #86
@Larhzu
Copy link
Member

Larhzu commented Jun 28, 2024

Thanks! It was a stupid mistake, the aligning was wrong too. Your original code had these done correctly.

Now it really should be correct. I think the ARM64 version could use this kind of code too. It's not ideal that the aligning is done one byte at a time in the ARM64 code.

@Larhzu
Copy link
Member

Larhzu commented Jun 28, 2024

I did the change to crc32_arm64.h too in the crc32_arm64 branch. It passes CI on ARM64 macOS.

Larhzu pushed a commit that referenced this pull request Jul 1, 2024
The crc.w.{b/h/w/d}.w instructions in LoongArch can calculate the CRC32
result for 1/2/4/8 bytes in a single operation. Using these is much
faster compared to the generic method.

Optimized CRC32 is enabled unconditionally on 64-bit LoongArch because
the LoongArch specification says that CRC32 instructions shall be
implemented for 64-bit processors. Optimized CRC32 isn't enabled for
32-bit LoongArch processors because not enough information is available
about them.

Co-authored-by: Lasse Collin <[email protected]>

Closes: #86
@Larhzu
Copy link
Member

Larhzu commented Jul 1, 2024

I merged it. Hopefully I sorted your name correctly by family name in THANKS. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants