# `ak.sum` on flat 1-D numpy array: different results between `axis=None` and `axis=-1` (#1241)
I think this is probably expected behaviour. Consider a naive sum:

```python
def sum_t(x):
    tot = np.array(0, dtype=x.dtype)
    for y in x:
        tot += y
    return tot
```

Evaluating this for your array gives me:

```python
>>> sum_t(x) - np.sum(x)
0.0001373291
```
Now, with a pairwise sum:

```python
def pairwise_t(x):
    if len(x) < 128:
        return sum_t(x)
    m = math.floor(len(x) / 2)
    return pairwise_t(x[:m]) + pairwise_t(x[m:])
```

Evaluating this for your array gives me:

```python
>>> pairwise_t(x) - np.sum(x)
1.5258789e-05
```

which is a factor of ten closer. I suspect the rest of the difference might be due to the fact that NumPy does more here with vectorization, which effectively shortens each sequential accumulation.
---

I see, so this is a speed vs. accuracy trade-off. I guess from a science perspective it is more important to understand the uncertainty of a given method: accuracy is key (sometimes).
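To make that trade-off concrete, here is a quick experiment (my own illustration, not from the thread): `np.cumsum` emulates a purely sequential left-to-right float32 accumulation, `np.sum` uses NumPy's pairwise summation, and `math.fsum` provides a correctly rounded float64 reference.

```python
import math
import numpy as np

# One million float32 copies of 0.1; note that 0.1 is not exactly
# representable, so the true sum of the *stored* values is ~100000.0015.
x = np.full(1_000_000, 0.1, dtype=np.float32)

naive = float(np.cumsum(x)[-1])          # sequential float32 accumulation
pairwise = float(np.sum(x))              # NumPy's pairwise summation
exact = math.fsum(x.astype(np.float64))  # correctly rounded reference

print(abs(naive - exact), abs(pairwise - exact))
```

On a typical machine the sequential error is noticeably larger than the pairwise one, even though both perform the same number of additions.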
---

It's not a bug. I'm not 100% sure whether the NumPy discussion about pairwise summation applies here.

Fundamentally, they would never be exactly the same, because this is explicitly adding the numbers in a different order, and floating point addition is not associative. However, there's more going on, because your values are all small (typically less than 1).

The Awkward 2.0 implementation of this uses only NumPy, but I just got notified (while writing this comment) that it's about 2× slower than the v1 implementation, so I'll be looking into that. We might revert to using the v1 approach.
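The non-associativity is easy to see even with three doubles (a standard illustration, not specific to Awkward):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left == right)  # False: the grouping changed the rounding
```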
---

Which is now fixed: #1245. So: no performance issues, but the floating point accuracy is unchanged by the replacement.
---

@jpivarski yes, IIRC that reduction is just over the separate parts of the array. In this case, Awkward shouldn't perform a reduction, because the array is flat. Evidence for this being a loss-of-significance problem (to me) is that when one uses a less naive sum, the difference approaches the order of the float precision rather than 1e-4. If we used a more careful summation, we could get closer still.
---

What do you mean by that? Is there something that we could be doing differently that would handle the precision better?

As an updated pointer, this is now the state of the art (in v2, after #1245).
---

The loss of significance that I'm referring to above is in the kernel, where we are currently just using a naive accumulation:

```c
for (int64_t i = 0; i < lenparents; i++) {
  toptr[parents[i]] += (OUT)fromptr[i];
}
```

Obviously this is slightly more complicated than a single sum, as we perform multiple sums here. There are methods like Kahan summation whose theoretical worst case is relatively independent of the number of values to be added, but this particular method is noticeably slower than the naive (simple) sum. NumPy uses a middle-of-the-road solution that performs pairwise summation: dividing the array into chunks, with the idea being that the error is bounded by the error of summing naively within one chunk. I think that we could reasonably update our sum kernels to do this without much loss of performance.
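For reference, compensated (Kahan) summation fits in a few lines. This is a generic Python sketch of the technique, not the actual kernel code; it uses NumPy float32 scalars to mimic a single-precision accumulator.

```python
import numpy as np

def kahan_sum(values):
    """Compensated (Kahan) summation in float32.

    Carries a correction term `c` holding the low-order bits lost in
    each addition, so the worst-case error stays roughly independent
    of the number of addends.
    """
    total = np.float32(0.0)
    c = np.float32(0.0)
    for v in values:
        y = np.float32(v) - c
        t = total + y
        c = (t - total) - y   # low-order bits lost when computing t
        total = t
    return total

x = np.full(100_000, 0.1, dtype=np.float32)
print(kahan_sum(x))  # much closer to 10000 than a naive float32 loop
```

The cost is roughly four floating point operations per element instead of one, which is why a chunked pairwise scheme is often the more practical middle ground.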
---

Hi @agoose77 and @jpivarski,

Thank you for the illuminating discussion. I think I understand the discrepancy now.

In the meantime, @jpivarski, you mentioned v2; I've seen references to it in the code. Is there an easy way to switch between v1 and v2?
---

@agoose77 Oh, I see! I was just considering the fact that non-associativity of floating point addition implies that different algorithms will give different results, but you were talking about using associativity to get closer to the correct result.

I've seen something similar to Kahan summation for computing variances and standard deviations, where numerical stability is a bigger issue. (Histogrammar uses it.) The reductions that we do in Awkward Array's cpu-kernels might be able to benefit from that. The fact that NumPy doesn't do a naive, left-to-right sum would explain why its result differs from ours.

@kreczko, the central place for talking about v2, so far, is #1151. You can switch between v1 and v2 with

```python
import awkward._v2 as ak
```

but it's not all there yet. All of the functions for creating arrays exist (so that we can run the tests), and some of the other functions exist. It's enough for https://github.com/continuumIO/dask-awkward/ to be able to develop, so far.
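The variance trick mentioned above is usually some variant of Welford's single-pass update. Here is the textbook form as a sketch (Histogrammar's exact formulation may differ):

```python
def welford(values):
    """Single-pass mean and population variance (Welford's algorithm).

    Avoids the catastrophic cancellation of the naive
    sum(x**2)/n - mean**2 formula.
    """
    n = 0
    mean = 0.0
    m2 = 0.0   # sum of squared deviations from the running mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / n

print(welford([1.0, 2.0, 3.0, 4.0]))  # (2.5, 1.25)
```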
---

**Version of Awkward Array**

1.5.1

**Description and code to reproduce**

I might be misunderstanding something, but for a non-jagged, 1-D array I would expect `ak.sum` to return the same result independent of the axis. However, only `axis=None` agrees. The example code produces a small but relevant difference (the example uses event weights).

Input file: eventweights.zip

Note: I added the input file, which includes 1486 entries that are typically < 1. I tried to reproduce the issue with smaller numpy arrays, but there they all agree. We've seen issues like these in the past with TH2F vs. TH2D, but I am not sure what the culprit is here. Is this expected, or a bug?