
50× performance regression in 0.5.0 when indexing > 4-dimensional arrays #18774

Closed
samuelpowell opened this issue Oct 3, 2016 · 13 comments
Labels
arrays · help wanted · performance · potential benchmark · regression

@samuelpowell
Member

I have some private code that is experiencing a severe performance regression:

0.4.6: 0.072002 seconds (324.63 k allocations: 41.946 MB, 46.46% gc time),
0.5.0: 3.908818 seconds (18.74 M allocations: 512.229 MB, 1.76% gc time).

Profiling the code on 0.5.0 indicates that almost all the time is spent in to_indexes following an assignment operation on a 5-dimensional array:

L[flmidx(l,m),tidx,elidx[1],elidx[2],elidx[3]] += val

If I rewrite the subscript as a linear index:

L[sub2ind(size(L),flmidx(l,m),tidx,elidx[1],elidx[2],elidx[3])] += val

the regression is avoided:

0.5.0: 0.075638 seconds (698.01 k allocations: 54.585 MB, 13.93% gc time).

Notes:

  • The assignment operation is part of a (very hot) nested loop.
  • The problem does not occur when indexing arrays with fewer than 5 dimensions.
  • The enclosing function is annotated with @inline and the assignment with @inbounds.
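For reference, a minimal sketch of the workaround described above, using hypothetical array sizes and index values (the real flmidx/tidx/elidx helpers are not shown in the thread). The original workaround used sub2ind, which was Julia ≤0.6 API; the sketch shows the equivalent on current Julia via LinearIndices:

```julia
# Hypothetical 5-D array and indices standing in for the real code.
L = zeros(4, 3, 2, 2, 2)
val = 1.0
i1, i2, i3, i4, i5 = 1, 2, 1, 2, 1

# Slow on 0.5.0: the 5-subscript form went through a non-inlined to_indexes.
L[i1, i2, i3, i4, i5] += val

# Workaround: index linearly instead. On 0.5 this was
#   L[sub2ind(size(L), i1, i2, i3, i4, i5)] += val
# and on current Julia the equivalent is:
L[LinearIndices(L)[i1, i2, i3, i4, i5]] += val
```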

This looks very much like #11819, though the example in said issue does not demonstrate a regression.

I have tried repeatedly to create a suitable MWE, but the regression disappears when I change almost anything about the setting of the assignment. If someone from the core Julia team would like to have a look, please let me know.

@ViralBShah ViralBShah added performance Must go faster regression Regression in behavior compared to a previous version labels Oct 3, 2016
@ViralBShah ViralBShah added this to the 0.5.x milestone Oct 3, 2016
@kshyatt kshyatt added the arrays label Oct 3, 2016
@StefanKarpinski
Member

Thanks for the report. If you're willing to share code privately, let us know. cc: @mbauman.

@mbauman
Member

mbauman commented Oct 11, 2016

As a first pass, you could try adding an inline annotation to to_indexes. You might be able to get away with redefining it at the REPL if you do it first thing in a new session:

import Base: to_index, to_indexes
@inline to_indexes(i1, I...) = (to_index(i1), to_indexes(I...)...)

@StefanKarpinski
Member

@mbauman: is that suggestion meant as a work-around or a way of debugging the perf issue?

@mbauman
Member

mbauman commented Oct 11, 2016

It's a stab at remote debugging. If it resolves the issue, it's a very easy fix.

@samuelpowell
Member Author

@mbauman, forcing the inlining indeed resolves the regression, bravo!

I don't know what I am doing that runs contrary to the inlining heuristic (though the indexing is by a mixture of inlined functions, scalars, and array elements). If you would like to look at the code privately, let me know.

@aaowens

aaowens commented Oct 14, 2016

I also experienced this issue in code that is much slower on 0.5 than 0.4. I made an example:

A3 = rand(10, 10, 10);
function test3(A, nx1, nx2, nx3)
  for i = 1:10_000_000
    A[nx1, nx2, nx3]
  end
end

A5 = rand(10, 10, 10, 10, 10);
function test5(A, nx1, nx2, nx3, nx4, nx5)
  for i = 1:10_000_000
    A[nx1, nx2, nx3, nx4, nx5]
  end
end

A6 = rand(10, 10, 10, 10, 10, 10);
function test6(A, nx1, nx2, nx3, nx4, nx5, nx6)
  for i = 1:10_000_000
    A[nx1, nx2, nx3, nx4, nx5, nx6]
  end
end
function test6_fast(A, nx1, nx2, nx3, nx4, nx5, nx6)
  Asize = size(A)
  for i = 1:10_000_000
    A[sub2ind(Asize, nx1, nx2, nx3, nx4, nx5, nx6 )]
  end
end
@time test3(A3, 1, 1, 1)
@time test5(A5, 1, 1, 1, 1, 1)
@time test6(A6, 1, 1, 1, 1, 1, 1)
@time test6_fast(A6, 1, 1, 1, 1, 1, 1)

0.4:

julia> @time test6(A6, 1, 1, 1, 1, 1, 1)
  0.012555 seconds (4 allocations: 160 bytes)

0.5:

julia> @time test6(A6, 1, 1, 1, 1, 1, 1)
 18.955561 seconds (90.00 M allocations: 2.086 GB, 1.52% gc time)

For me the cutoff is from 5- to 6-dimensional arrays. I tried running the code @mbauman posted, but it didn't fix the issue. Using a linear index fixes the issue.
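A quick way to check whether the index conversion is being inlined (a generic inspection sketch, not posted in the thread; the helper's name, to_indexes, is specific to the 0.5-era Base) is to dump the generated code for the slow arity:

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

A6 = rand(10, 10, 10, 10, 10, 10)

# Print the LLVM IR for the 6-subscript getindex method. If the index
# conversion helper shows up as an explicit call, it was not inlined
# into getindex and each access pays a function-call (and tuple) cost.
@code_llvm getindex(A6, 1, 1, 1, 1, 1, 1)
```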

@KristofferC KristofferC added the potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 25, 2016
@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 27, 2016
timholy added a commit that referenced this issue Oct 27, 2016
Force inlining on getindex and to_indexes (fixes #18774)
@JaredCrean2
Contributor

I'm not sure this is completely fixed. I have a 6-dimensional stencil computation test case here that shows a slowdown compared to Julia 0.4. I built 0.6 commit 4ec2db0* (2016-10-29 05:01 UTC), so it should have the most recent fix.

0.4 timing:

testing outer_func2
  0.023961 seconds
  0.025217 seconds

0.5 timing:

testing outer_func2
 85.783995 seconds (496.04 M allocations: 11.510 GB, 0.98% gc time)
# I didn't wait for the second one

0.6 timing:

testing outer_func2
  3.052716 seconds (17.72 M allocations: 432.587 MB, 3.72% gc time)
  5.415282 seconds (17.72 M allocations: 432.510 MB, 0.83% gc time)

The profiler is running the second time, which is why the second run is slower.

The profile results:

0.4:

 Count File                       Function                                 Line
    32 ./boot.jl                  include                                   261
    33 ./client.jl                _start                                    378
    33 ./client.jl                process_options                           284
    32 ./loading.jl               include_from_node1                        304
     1 ./stream.jl                buffer_or_write                           969
     1 ./stream.jl                stream_wait                                60
     1 ./stream.jl                uv_write                                  945
     1 ./stream.jl                write                                     993
     1 ./task.jl                  wait                                      369
     1 ...test_inline_testcase.jl outer_func2                                26
     2 ...test_inline_testcase.jl outer_func2                                38
     1 ...test_inline_testcase.jl outer_func2                                44
     3 ...test_inline_testcase.jl outer_func2                                62
     1 printf.jl                  print_fixed                               191
     1 printf.jl                  time_print                                162
     1 simdloop.jl                outer_func2                                67
     3 simdloop.jl                outer_func2                                75
    20 simdloop.jl                outer_func2                                84
    31 util.jl                    runtest                                   155
     1 util.jl                    runtest                                   158

0.6:

 Count File                        Line Function
  6920 ./array.jl                   475 setindex!(::Array{Float64,6}, ::Flo...
  7161 ./client.jl                  335 _start()
  7161 ./client.jl                  271 process_options(::Base.JLOptions)
     1 ./event.jl                   153 wait()
     1 ./io.jl                      263 unsafe_write(::Base.TTY, ::Base.Ref...
     1 ./io.jl                      302 write(::Base.TTY, ::Char)
  7161 ./loading.jl                 532 include_from_node1(::String)
  9906 ./operators.jl               868 to_indexes(::Int64, ::Int64, ::Vara...
     1 ./printf.jl                  187 print_fixed(::Base.TTY, ::Int64, ::...
  7161 ./profile.jl                  22 macro expansion
    22 ./simdloop.jl                 71 macro expansion
  7124 ./simdloop.jl                 72 macro expansion
     3 ./simdloop.jl                 71 outer_func2(::Array{Float64,6}, ::A...
     1 ./stream.jl                   42 stream_wait(::Task)
     1 ./stream.jl                  774 unsafe_write(::Base.TTY, ::Ptr{UInt...
     1 ./stream.jl                  764 uv_write(::Base.TTY, ::Ptr{UInt8}, ...
  7161 ./sysimg.jl                   14 include(::String)
  7160 ./util.jl                    194 macro expansion
     1 ./util.jl                    197 macro expansion
     1 ./util.jl                    156 time_print(::UInt64, ::Int64, ::Int...
     4 ...test_inline_testcase.jl    26 macro expansion
    78 ...test_inline_testcase.jl    32 macro expansion
  7042 ...test_inline_testcase.jl    62 macro expansion
     4 ...test_inline_testcase.jl    15 outer_func2(::Array{Float64,6}, ::A...
  7146 ...test_inline_testcase.jl    16 outer_func2(::Array{Float64,6}, ::A...
  7161 ...test_inline_testcase.jl    88 runtest()

It's better than 0.5, but to_indexes is still taking up most of the time.

@JaredCrean2
Contributor

cc @timholy

@timholy
Member

timholy commented Nov 3, 2016

Would be great if you'd contribute that to BaseBenchmarks.

@JaredCrean2
Contributor

Sure, I'll put together a PR this weekend.

@JaredCrean2
Contributor

Finally got the benchmark together and merged.

@tkelman backport?

@tkelman
Contributor

tkelman commented Nov 20, 2016

#19097 is already labeled

@JaredCrean2
Contributor

Ah, I missed that. Thanks.
