Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src_float_to_short_array: Use division instead of shift #85

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

erikd
Copy link
Member

@erikd erikd commented Aug 7, 2019

@janstary I wonder if this fixes your issue on PowerPC.

@Flamefire
Copy link
Contributor

No it won't fix the issue and it is slower due to the integer division (and harder to understand then the solution proposed in #84 (comment))

The PowerPC problem is a broken lrint which produces wrong results for large values as show in #65 (comment) As your change works on the already wrong value, it can't get correct.

An example: lrint(INT_MAX) = lrint(2147483647.000000) = -1
Even using smaller values runs into the same: lrint(2^30 - 2) = lrint(1073741822.000000) = -2

@erikd
Copy link
Member Author

erikd commented Aug 9, 2019

slower due to the integer division

On modern CPUs with deep caches, branches are usually considered more expensive than division. If you can prove to me that my solution using division is slower than your solution, I will take your solution.

The PowerPC problem is a broken lrint which

If PowerPC has a broken lrint function then I will accept a PowerPC #if solution, but i will not take a performance hit on this code for x86_64 just because PowerPC's lrint is busted.

Meanwhile, I am still interested if this fixes @janstary's issue.

@Flamefire
Copy link
Contributor

Meanwhile, I am still interested if this fixes @janstary's issue.

Haven't we already proved that the output of lrint on his machine is wrong for values definitely used here? How could a change working on that wrong value fix anything?

On modern CPUs with deep caches, branches are usually considered more expensive than division.

Speculative execution mitigates some of that. AND:

If you can prove to me that my solution using division is slower than your solution, I will take your solution.

Prove is simple: LONG on most 64bit systems is 64bit: https://godbolt.org/z/HMv8iu
This will disable the "clipping optimization" (which just heuristically checks implementation defined behavior) Hence my code has the exact same branches and float-multiplications but 1 less shift/division and hence is faster.

@janstary
Copy link
Contributor

janstary commented Aug 9, 2019

@janstary I wonder if this fixes your issue on PowerPC.

tests/float_short_test

        float_to_short_test ............................. 

        Line 64 : out [1] == 0

@janstary
Copy link
Contributor

janstary commented Aug 9, 2019

I also tested on an arm machine (current OpenBSD/armv7
on a BeagleBone Black). Without this patch, it fails with

        float_to_short_test ............................. 
        Line 64 : out [1] == -1

With the patch, it fails as

        float_to_short_test ............................. 
        Line 64 : out [1] == 0

@erikd
Copy link
Member Author

erikd commented Aug 9, 2019

Prove is simple:

That's not proof, that's opinion. Proof requires a proper benchmark.

This will disable the "clipping optimization" (which just heuristically checks implementation defined behavior) Hence my code has the exact same branches

You do realize that for CPUs where it works (eg at least x86 and x86_64) the "clipping optimization" results in zero branches, don't you?

@Flamefire
Copy link
Contributor

That's not proof, that's opinion. Proof requires a proper benchmark.

No, you are missing my point. For CPU_CLIPS_POSITIVE ==0 && CPU_CLIPS_NEGATIVE == 0 (what I called "disabled clipping optimization") your code and my code are exactly the same (except the used constants) and both have (at least) 2 branches. BUT my code has 1 operation less (either shift or division) , and hence it is faster. Do you agree until here?

Now examine what the clipping optimization is and when it is enabled:

  1. It assumes that for x>= LONG_MAX: lrint(x) == LONG_MAX and x<= LONG_MIN: lrint(x) == LONG_MIN which is implementation defined. I know if 1 implementation (can't find it again right now) which does this by if-checks on the float
  2. It assumes that long is 32 bit
  3. Given 1 & 2 it holds that lrint(x * 8.0 * 0x10000000) == INT32_MAX for x>=1.0 (and similar for negative numbers)
  4. the behavior of 3 is checked at configure

My claim: On most architectures the clipping optimization is disabled.
If this is true then on most architectures my code is faster.

Quick check: Do you have a "common" 64 bit Linux on a x86_64 architecture? Then please run configure and report the values for CPU_CLIPS_POSITIVE, CPU_CLIPS_NEGATIVE & SIZEOF_LONG. For me those are 0, 0, 8

Proof for "most architectures":

  • The optimization does not work for sizeof(long) == 8. 1 Example: for x==1 the expression lrint(x * 8.0 * 0x10000000) yields INT32_MAX+1. This shifted by 16 results in INT32_MIN. Proof: https://godbolt.org/z/zS1K-q
  • "Most" 64bit architectures use the LP64 model where long is 64 bit (as opposed to LLP64 where it is 32 bit): https://en.wikipedia.org/wiki/64-bit_computing

Many 64-bit platforms today use an LP64 model (including Solaris, AIX, HP-UX, Linux, macOS, BSD, and IBM z/OS). Microsoft Windows uses an LLP64 model.

  • Due to configure being used I assume Linux as the main target system, not Windows.

Summary:

  1. Linux is the target system
  2. Linux uses LP64 model with sizeof(long)==64
  3. sizeof(long)==64 disables clipping optimization
  4. With disabled clipping optimization your code execute 1 more instruction (hence is slower) and is less readable

@erikd erikd force-pushed the master branch 2 times, most recently from 1b3d5b1 to c10c819 Compare August 25, 2019 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants