Add pure Julia ldexp function #19491

Merged: 4 commits, Dec 8, 2016
46 changes: 43 additions & 3 deletions base/math.jl
@@ -349,8 +349,49 @@ julia> ldexp(5., 2)
20.0
```
"""
ldexp(x::Float64,e::Integer) = ccall((:scalbn,libm), Float64, (Float64,Int32), x, Int32(e))
ldexp(x::Float32,e::Integer) = ccall((:scalbnf,libm), Float32, (Float32,Int32), x, Int32(e))
function ldexp{T<:AbstractFloat}(x::T, e::Integer)
Contributor: this isn't really valid for every single AbstractFloat subtype, is it?

Contributor: Probably not: realistically it's probably only valid for Float16, Float32, and Float64.

simonbyrne (Contributor, Feb 2, 2017): I guess it assumes:

  1. Is a bitstype.
  2. Is base-2.
  3. Uses the same layout as the IEEE format (sign | biased exponent | significand).
  4. Uses gradual underflow for subnormals.

So it could be valid for, e.g., an x87 Float80 or an IEEE Float128, but not BigFloat, IEEE decimal (as in DecFP.jl), DoubleDoubles, IBM floats, or formats which do funny things like two's complement exponents.
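Assumption 3 can be made concrete with a small sketch (illustrative only; the implementation in the diff uses the mask helpers in Base.Math rather than hard-coded Float64 constants):

```julia
# Bit layout assumed for Float64: bit 63 = sign, bits 62..52 = biased
# exponent (bias 1023), bits 51..0 = significand fraction.
x  = -1.5
xu = reinterpret(UInt64, x)
sign_bit = xu >> 63
biased_e = (xu >> 52) & 0x7ff
frac     = xu & 0x000fffffffffffff
@assert sign_bit == 1               # negative
@assert biased_e == 1023            # unbiased exponent is 0 (1023 - 1023)
@assert frac == 0x0008000000000000  # fraction 0.5, so |x| = (1 + 0.5) * 2^0
```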

    xu = reinterpret(Unsigned, x)
    xs = xu & ~sign_mask(T)
    xs >= exponent_mask(T) && return x # NaN or Inf
    k = Int(xs >> significand_bits(T))
    if k == 0 # x is subnormal
        xs == 0 && return x # +-0
        m = leading_zeros(xs) - exponent_bits(T)
        ys = xs << unsigned(m)
        xu = ys | (xu & sign_mask(T))
        k = 1 - m
        # underflow; otherwise we may have integer underflow in the following n + k
        e < -50000 && return flipsign(T(0.0), x)
    end
    # For cases where e is an Integer type wider than Int, make sure we
    # properly overflow/underflow; this is optimized away otherwise.
    if e > typemax(Int)
        return flipsign(T(Inf), x)
    elseif e < typemin(Int)
        return flipsign(T(0.0), x)
    end
    n = e % Int
    k += n
Contributor: What happens if this overflows (e.g. if n = typemax(Int))?

Contributor (Author): k will overflow and be negative, and line 378 catches that case.
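That wraparound can be sketched in isolation (a minimal illustration, not part of the diff):

```julia
# With a small positive k and n = typemax(Int), k + n wraps around to a
# large negative value; the subnormal branch's `k <= -significand_bits(T)`
# underflow check then returns a signed zero (or Inf for huge positive e).
k = 1
n = typemax(Int)
@assert k + n == typemin(Int)  # Int addition wraps on overflow in Julia
@assert k + n < 0
```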

simonbyrne (Contributor, Dec 3, 2016): We could handle Int128 and BigInt by moving line 353 here, and

if q > 50000
    return flipsign(T(Inf), x)
elseif q < -50000
    return flipsign(T(0), x)
end
n = q % Int

Contributor (Author): This would add another branch; is it worth it? You still have to check that k is not overflowing or underflowing in addition to this, which is currently handled cleverly. Based on benchmarking the various branches, this adds approximately a 1 ns penalty.

Currently we do a lot worse and cast to Int32.

I don't necessarily have a problem with this, but I question whether it is the best way to do it. The quest for the most generic function is satisfying, but it typically entails performance penalties.

Internally, for my code that needs this function, I don't use this version since it's a bit too slow in practice; instead I use the following:
https://github.com/JuliaMath/Amal.jl/blob/master/src/ldexp.jl
It doesn't pass some important cases, but that's not a problem since those cases are handled externally.

Thoughts?

Contributor: In that case, another option would be

if q > typemax(Int)
    return flipsign(T(Inf), x)
elseif q < typemin(Int)
    return flipsign(T(0), x)
end
n = q % Int

which should be able to optimise away the check if q::Int.
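As a standalone sketch (the helper name `clamp_exponent` is hypothetical; the PR inlines this check instead):

```julia
# Saturate an arbitrary Integer exponent into Int range before the `% Int`
# truncation. For q::Int both comparisons are statically false, so the
# compiler can remove the branch entirely.
function clamp_exponent(q::Integer)
    if q > typemax(Int)
        return typemax(Int)
    elseif q < typemin(Int)
        return typemin(Int)
    end
    return q % Int
end

@assert clamp_exponent(Int128(2)^100)  == typemax(Int)
@assert clamp_exponent(-Int128(2)^100) == typemin(Int)
@assert clamp_exponent(42) === 42
```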

Contributor (Author): very nice!

    # overflow, if k is larger than the maximum possible exponent
    if k >= Int(exponent_mask(T) >> significand_bits(T))
        return flipsign(T(Inf), x)
    end
    if k > 0 # normal case
        xu = (xu & ~exponent_mask(T)) | (k % typeof(xu) << significand_bits(T))
        return reinterpret(T, xu)
    else # subnormal case
        if k <= -significand_bits(T) # underflow
            # overflow, for the case of integer overflow in n + k
            e > 50000 && return flipsign(T(Inf), x)
            return flipsign(T(0.0), x)
        end
        k += significand_bits(T)
        z = T(2.0)^-significand_bits(T)
        xu = (xu & ~exponent_mask(T)) | (k % typeof(xu) << significand_bits(T))
        return z*reinterpret(T, xu)
    end
end
ldexp(x::Float16, q::Integer) = Float16(ldexp(Float32(x), q))
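A few usage examples of the semantics above, ldexp(x, e) == x * 2^e with saturation to signed zero and Inf (Float64 values shown; these follow from the IEEE double layout):

```julia
@assert ldexp(0.75, 4)    === 12.0      # 0.75 * 2^4
@assert ldexp(-0.75, 4)   === -12.0     # sign is carried through
@assert ldexp(1.0, -1074) === 5.0e-324  # smallest positive subnormal Float64
@assert ldexp(1.0, -1075) === 0.0       # underflow to zero
@assert ldexp(1.0, 1024)  === Inf       # overflow to Inf
```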

"""
exponent(x) -> Int
@@ -604,7 +645,6 @@ for func in (:atan2,:hypot)
end
end

ldexp(a::Float16, b::Integer) = Float16(ldexp(Float32(a), b))
cbrt(a::Float16) = Float16(cbrt(Float32(a)))

# More special functions
35 changes: 33 additions & 2 deletions test/math.jl
@@ -70,12 +70,43 @@ end
x,y = frexp(convert(T,NaN))
@test isnan(x)
@test y == 0

# more ldexp tests
@test ldexp(T(0.8), 4) === T(12.8)
@test ldexp(T(-0.854375), 5) === T(-27.34)
@test ldexp(T(1.0), typemax(Int)) === T(Inf)
@test ldexp(T(1.0), typemin(Int)) === T(0.0)
@test ldexp(prevfloat(realmin(T)), typemax(Int)) === T(Inf)
@test ldexp(prevfloat(realmin(T)), typemin(Int)) === T(0.0)

@test ldexp(T(0.8), Int128(4)) === T(12.8)
@test ldexp(T(-0.854375), Int128(5)) === T(-27.34)
@test ldexp(T(1.0), typemax(Int128)) === T(Inf)
@test ldexp(T(1.0), typemin(Int128)) === T(0.0)
@test ldexp(prevfloat(realmin(T)), typemax(Int128)) === T(Inf)
@test ldexp(prevfloat(realmin(T)), typemin(Int128)) === T(0.0)

@test ldexp(T(0.8), BigInt(4)) === T(12.8)
@test ldexp(T(-0.854375), BigInt(5)) === T(-27.34)
@test ldexp(T(1.0), BigInt(typemax(Int128))) === T(Inf)
@test ldexp(T(1.0), BigInt(typemin(Int128))) === T(0.0)
@test ldexp(prevfloat(realmin(T)), BigInt(typemax(Int128))) === T(Inf)
@test ldexp(prevfloat(realmin(T)), BigInt(typemin(Int128))) === T(0.0)

# Test also against BigFloat reference. Needs to be exactly rounded.
@test ldexp(realmin(T), -1) == T(ldexp(big(realmin(T)), -1))
@test ldexp(realmin(T), -2) == T(ldexp(big(realmin(T)), -2))
@test ldexp(realmin(T)/2, 0) == T(ldexp(big(realmin(T)/2), 0))
@test ldexp(realmin(T)/3, 0) == T(ldexp(big(realmin(T)/3), 0))
@test ldexp(realmin(T)/3, -1) == T(ldexp(big(realmin(T)/3), -1))
@test ldexp(realmin(T)/3, 11) == T(ldexp(big(realmin(T)/3), 11))
@test ldexp(realmin(T)/11, -10) == T(ldexp(big(realmin(T)/11), -10))
@test ldexp(-realmin(T)/11, -10) == T(ldexp(big(-realmin(T)/11), -10))
end
end
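The BigFloat cross-check pattern used above works for any value where exact rounding matters: compute the reference in extended precision, round back to T, and compare. A minimal sketch (the values are illustrative, not from the test suite):

```julia
T = Float32
x = T(0.1)
# Power-of-two scaling in BigFloat is exact, so converting back to T
# gives the correctly rounded reference result.
@assert ldexp(x, 3)  == T(ldexp(big(x), 3))
@assert ldexp(x, -5) == T(ldexp(big(x), -5))
```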

# We compare to BigFloat instead of hard-coding
-# values, assuming that BigFloat has an independent and independently
-# tested implementation.
+# values, assuming that BigFloat has an independently tested implementation.
@testset "basic math functions" begin
@testset "$T" for T in (Float32, Float64)
x = T(1//3)