Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sequence generation linspace vs linrange #9637

Closed
bkamins opened this issue Jan 6, 2015 · 32 comments
Closed

Sequence generation linspace vs linrange #9637

bkamins opened this issue Jan 6, 2015 · 32 comments
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Jan 6, 2015

I think that Julia has excellent handling of sequences of floats (see the my blog for comparison with R and Python).

The only small issue is the following code:

println(linspace(0.1, 1.1, 6) .== [linrange(0.1, 1.1, 6)])

produces:

Bool[true,true,true,true,false,true]

So they are not equal and I would expect an identical result (the root cause is the difference of the results between colon notation and linspace).

It should be decided if this should be guaranteed.

@simonster
Copy link
Member

Some previous discussion in #7420, particularly @StefanKarpinski and @JeffBezanson's comments at the end.

@bkamins
Copy link
Member Author

bkamins commented Jan 6, 2015

Just realized that Julia allows for testing with rational numbers. Here is what I got:

ref = convert(Vector{Float64},
              [linrange(convert(Rational{BigInt}, 0.1),
                        convert(Rational{BigInt}, 1.1), 6)])
print(convert(Vector{Float64}, [linrange(1//10, 11//10, 6)]) .== [0.1:0.2:1.1])
print(ref - [0.1:0.2:1.1])
print(ref - linspace(0.1, 1.1, 6))

gives

Bool[true,true,true,true,true,true]
[0.0,5.551115123125783e-17,0.0,1.1102230246251565e-16,0.0,0.0]
[0.0,0.0,0.0,0.0,-1.1102230246251565e-16,0.0]

This allows us to see what happens under exact calculations (and second and third example take into account that actually 0.2 and 1.1 are only approximated in Float64)

@bkamins
Copy link
Member Author

bkamins commented Jan 6, 2015

Here is the proposal how linspace could be changed (I omit constraining argument types in the example):

function linspace2(start, stop, n)
    by = (stop - start) / (n - 1)
    r = start:by:stop
    return stop == last(r) ? [r] : linspace(start, stop, n)
end

Now following the linspace definition agreed in #2575: "Construct a vector of n linearly-spaced elements from start to stop." I compared linspace and linspace2:

srand(1)

function diffrange(x)
    dmin, dmax = extrema(diff(x))
    dmax - dmin
end

scale1 = 100.0
scale2 = 1000.0
scale3 = 10000
store = Array(Float64, 0)
for i = 1:10000
    a = scale1*(rand() - 0.5)
    b = scale2*(rand() - 0.5)
    n = 3 + int64(rand() * scale3)
    l1 = linspace(a, b, n)
    l2 = linspace2(a, b, n)
    @assert l1[end] == l2[end]
    res = diffrange(l2) - diffrange(l1)
    push!(store, res)
end
println(mean(store))
println(maximum(store))
println(minimum(store))
println(sum(store .< 0))
println(sum(store .== 0))
println(sum(store .> 0))

which produces:

-4.466507164124778e-14
4.263256414560601e-14
-1.7053025658242404e-13
7448
2498
54

which simply says that actually colon is able to produce more evenly spaced points than linspace. The result is the same if we change definition of diffrange to compute variance of differences or maximum absolute difference and seems robust to changing scale1, scale2 and scale3.

So linspace2 seems superior to linspace in terms of the objective (evenly spaced points) while additionally it has the benefit of guaranteeing that colon and linspace produce the same result when it is reasonable to expect so.

Additionally a question (I did not find it discussed):

Would it not be useful to make linspace produce an iterator not a vector? Often we do not need a vector and you can always convert an iterator to a vector if needed.

@StefanKarpinski
Copy link
Member

The linspace function produces the straightforward linear intermediate points between its start and stop values. Since that's a pretty obvious, well-defined behavior – although, as it turns out, not always what you want – I'm hesitant to change it to anything else. The fact that our colon is better at producing what you want is due to "lifting" ranges by integer multiples that match the start, step and stop.

@JeffBezanson
Copy link
Member

Yes, my understanding is that the purpose of linspace is to give the values generated by the formula start*(n-i)/n + stop*i/n. I don't know how often people want that, but if not often I'd just as soon remove the function.

@bkamins thanks for the nice writeup on your blog :)

@bkamins
Copy link
Member Author

bkamins commented Jan 6, 2015

I am aware about "lifitng" and that is why I have proposed the change. But I agree that it is a minor one.

However, my understanding is that most often people need to generate sequences from start to stop that have n points as equally spaced as possible and do not think much how they are produced.

In fact after the discussion I see that the iterator thing is more important than identity between colon and linspace. It is often the case that you would want to:

for x = linspace(a, b, n)
    # do something with x in a loop
end

and in such cases you actually need only an iterator, and it is often easier to specify required size of the collection than step size.

I assume that linspace was borrowed from Python where it produces a vector because there vectorization is the only efficient way to perform computations. As you know best :) in Julia it is often better to devectorize for memory and speed reasons.

Just a simple example:

const n = 10^8+1
function test1()
    s = 0.0
    for x in 0.0:(1/(n-1)):1.0
        s += x
    end
end

function test2()
    s = 0.0
    for x in linspace(0.0, 1.0, n)
        s += x
    end
end

@time test1()
@time test1()

@time test2()
@time test2()

on my computer gives

elapsed time: 0.031019712 seconds (131112 bytes allocated)
elapsed time: 0.028756213 seconds (160 bytes allocated)
elapsed time: 1.055671265 seconds (800569544 bytes allocated)
elapsed time: 1.086587485 seconds (800000144 bytes allocated)

@nalimilan
Copy link
Member

Maybe at least add a warning to the linspace docs stating that you'll generally prefer to use ranges?

@StefanKarpinski
Copy link
Member

I took a crack at making linspace as good as our ranges. I don't actually think this will ever break anyone's expectations. It lifts the construction the same kind of way when possible, and otherwise falls back on the current "straightforward" construction.

@StefanKarpinski
Copy link
Member

With this change, we would have

linspace(0.1, 1.1, 6) == [0.1:0.2:1.1]
linspace(0.1, 1.1, 6) != [linrange(0.1, 1.1, 6)]

So now linrange is the odd one out, giving

[0.1,0.30000000000000004,0.5,0.7000000000000001,0.9,1.1]

instead of [0.1,0.3,0.5,0.7,0.9,1.1]. Progress in any case.

@bkamins
Copy link
Member Author

bkamins commented Jan 7, 2015

Well this is a possible approach to modify linrange:

function linrange2(a::Real, b::Real, len::Integer)
    if  len >= 2
        by = (b - a) / (len - 1)
        r = a:by:b
        return b == last(r) && r.len == len ? r : range(a, by, len)
    end
    len == 1 && a == b ? range(a, zero((b-a)/(len-1)), 1) :
                         error("invalid range length")
end

as it takes the advantage of "lifting" in standard colon notation (and we know the endpoint).
We have to check b == last(r) && r.len == len as sometimes it is impossible to choose any step value with given 'a' and 'b' to make colon notation generate sequence that has exactly 'len' elements and ends with 'b', e.g. 0.0:1.0:nextfloat(100.0) last is too small and 0.0:nextfloat(1.0):nextfloat(100.0) last is also too small and the sequence is too short.

Probably it is enough to check b == last(r) as it should imply r.len == len but it is safer to check and it is cheap.

@ivarne
Copy link
Member

ivarne commented Jan 7, 2015

Looks like an improvement to me.

It's hard to find good test cases, but to suggest a good metric for the "quality" of linspace, I'd want var(diff(linspace(a,b,n))) to be as small as possible. This actually makes it smaller in one test case:

julia> var(diff(linspace2(0.1,1.1,11)))
2.139922160430262e-33

julia> var(diff(linspace(0.1,1.1,11)))
4.879022525780997e-33

@bkamins
Copy link
Member Author

bkamins commented Jan 7, 2015

I have checked var on my random test and it is similar to difference between extrema in my proposed code.

@StefanKarpinski
Copy link
Member

Minimizing var(diff(linspace(a,b,n))) is a nice side effect but I suspect you can contrive cases where it does not have that effect.

@StefanKarpinski
Copy link
Member

Note that I added tests for this that generate and check a lot of cases, all of which pass. I didn't add a list of specific hard cases since my experience with the range work was that the generated tests are strictly harder to pass than the hand-crafted hard cases. Before merging this, I should probably test those too though.

@StefanKarpinski
Copy link
Member

Ok, added the "handcrafted" test cases too. They all pass, which is unsurprising.

@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2015

Just an example of a handcrafted test case (I used unmodified definition of linspace for testing). I am not sure if it should be fixed but I give it for the reference:

m1 = nextfloat(-Inf)
m2 = prevfloat(Inf)    # m1 == -m2
linspace(m1, m2, 3)    # OK
[linrange(m1, m2, 3)]  # wrong
[m1:m2:m2]             # error

@simonbyrne
Copy link
Contributor

Some of these issues did arise when I wrote linrange, e.g. see this comment (I ended up going with option 1).

Furthermore, in this case linrange is technically "more correct", in that it gives the correctly rounded interpolated values:

julia> for i in linspace(big(0.1),big(1.1),6)
       println(float64(i))
       end
0.1
0.30000000000000004
0.5
0.7000000000000001
0.9
1.1

julia> for i in linspace(0.1,1.1,6)
       println(i)
       end
0.1
0.30000000000000004
0.5
0.7000000000000001
0.9000000000000001
1.1

julia> for i in linrange(0.1,1.1,6)
       println(i)
       end
0.1
0.30000000000000004
0.5
0.7000000000000001
0.9
1.1

@StefanKarpinski
Copy link
Member

@bkamins, man, that's a diabolical example, which we should, of course, still handle correctly.

@StefanKarpinski
Copy link
Member

@simonbyrne, the new proposed linspace gives this result:

julia> for i in linspace(0.1,1.1,6)
       println(i)
       end
0.1
0.3
0.5
0.7
0.9
1.1

@simonbyrne
Copy link
Contributor

Sorry, I should have clarified what I meant: since the Float64 0.1 is actually 0.1000000000000000055511151231257827021181583404541015625, and 1.1 is actually 1.100000000000000088817841970012523233890533447265625, then the point one-fifth of the way between 0.1 and 1.1 is actually 0.30000000000000002220446049250313080847263336181640625, for which the nearest Float64 is 0.30000000000000004, not 0.3.

So for this example, the current linrange behaviour is correct for all values, the old linspace is incorrect for 1 value (0.9), and the proposed linspace is incorrect for 2 values (0.3 and 0.7).

@StefanKarpinski
Copy link
Member

Yes, that's true, but our current range behavior – which everyone seems to like – is also "wrong" by that standard.

@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2015

@StefanKarpinski, you asked for a handcrafted example so I gave one :). It seems that replacing (a-b)/n by a/n-b/n in range and colon should give exactly the same results in normal cases and solve the problem.

@simonbyrne, This is exactly how I understood it. That is why in my second post in this issue I used Rational{BigInt}. Let me give a simple example showing a more serious problem introduced by "lifting" in colon (in fact I assume that for this reason R introduces safety buffer of relative 1e-10; I do not like the R solution though):

println([0.1:0.1:0.3])
# [0.1,0.2,0.3]
println(map(float64,[big(0.1):big(0.1):big(0.3)]))
# [0.1,0.2]

@StefanKarpinski
Copy link
Member

@bkamins, it's a tradeoff. I don't think you can have intuitive float range behavior and not have cases where it also depends on the precision of float type you're working with. Personally, I think that example is fairly acceptable. The real issue is actually that big(0.1) is a problem and should possibly be disallowed. If you use BigFloat("0.1") instead, it works as expected:

julia> println(map(float64,[BigFloat("0.1"):BigFloat("0.1"):BigFloat("0.3")]))
[0.1,0.2,0.3]

@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2015

@StefanKarpinski, Yes BigFloat("0.1") works as expected. But in general even if we disallow big(0.1) it is simply possible that the value of 0.1 is input not as a literal but as some variable. But this tradeoff is also in my opinion acceptable (I just wanted to give another example) as normally you have such numbers as 0.1 etc. provided as literals and expect the behavior Julia currently delivers.

And as for replacement of (a-b)/n by a/n-b/n I was wrong as for n>2 the first is more exact. So we there also have a tradeoff between good behavior in normal situation and ability to handle special cases (and I prefer better precision for normal cases).

@StefanKarpinski
Copy link
Member

And as for replacement of (a-b)/n by a/n-b/n I was wrong as for n>2 the first is more exact.

That was my suspicion, although I hadn't gotten a chance to check it. We may be able to work around this somehow and get correct behavior in both cases.

@StefanKarpinski
Copy link
Member

I think the way to go here is to create a LinSpace type that is the generator form and have linspace return an object of that type. Then we can delete linrange since producing a LinSpace object for this kind of thing is strictly better than producing a FloatRange object.

@bkamins
Copy link
Member Author

bkamins commented Feb 5, 2015

How do you see that LinSpace type should be specified? How would it differ from FloatRange?

In general, as I mentioned earlier, I agree that returning generators should be preferable to returning arrays, as you can always run collect on a generator if needed.

@StefanKarpinski
Copy link
Member

It would work by implementing the non-eager version of this:

https://github.com/JuliaLang/julia/pull/9666/files#diff-8cc03187983013adb308460f8365e1d0R241

When I've got some time, I'll implement it in that PR. It different in the non-exact case:

https://github.com/JuliaLang/julia/pull/9666/files#diff-8cc03187983013adb308460f8365e1d0R245

@StefanKarpinski
Copy link
Member

I've updated my pull request and I could use some opinions. I actually think there may be a way to generally avoid overflow, which I can implement a bit later (basically scaling down and then back up by powers of 2 when values are too large).

@bkamins
Copy link
Member Author

bkamins commented Apr 3, 2015

I have written the following simple test procedure

srand(1)
simsize = 10000
maxrange = 10
ref = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
counts = zeros(Int, maxrange)
for d = 1:maxrange, i = 1:simsize
    x = round(rand(), d)
    if linspace(x, round(x+1, d), 11) != round(x + ref, d)
        counts[d] += 1
    end
end
print(counts / simsize * 100)

The old linspace produces the following output:

[95.14,98.42,98.4,98.3,98.47,98.32,98.56,98.5,98.55000000000001,98.49]

And with the new linspace we get:

[0.0,0.0,0.0,0.0,0.0,0.0,12.68,82.59,95.73,98.22]

And I like the improvement 👍.

@StefanKarpinski
Copy link
Member

I'm not entirely sure what that code is measuring. Is this some statistical count of accuracy?

@bkamins
Copy link
Member Author

bkamins commented Apr 6, 2015

Assume we wanted to generate an 11 element sequence starting from some value x and ending with the value 1+x.
Assume that x is a hand picked value from the interval [0,1[ entered as a literal by the user and that this literal has d digits of precision. In line x = round(rand(), d) we pick it randomly. So x is now exactly what we would obtain if user would input a literal. Similarly round(x+1,d) is exactly what user would input as a literal for a stop of the sequence.

User can generate this 11 element sequence in two ways:

  1. using linspace(x, round(x+1, d), 11)
  2. by manually inputing all 11 literals, which is expressed by round(x + ref, d)

What I do is a comparison what is the probability that those two sequences are different when we change d. The printed value is this probability for d ranging from 1 to 10.

You can see that when d<7 new linspace is always exactly identical to the result obtained by manual input of 11 literals. The old linspace is almost never exact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants