
50× performance regression in 0.5.0 when indexing > 4-dimensional arrays #18774

Closed
samuelpowell opened this issue Oct 3, 2016 · 13 comments
Labels
arrays · help wanted · performance · potential benchmark · regression

@samuelpowell
Member

I have some private code that is experiencing a severe performance regression:

0.4.6: 0.072002 seconds (324.63 k allocations: 41.946 MB, 46.46% gc time),
0.5.0: 3.908818 seconds (18.74 M allocations: 512.229 MB, 1.76% gc time).

Profiling the code on 0.5.0 indicates that almost all the time is spent in to_indexes following an assignment operation on a 5-dimensional array:

L[flmidx(l,m),tidx,elidx[1],elidx[2],elidx[3]] += val

If I rewrite the subscript as a linear index:

L[sub2ind(size(L),flmidx(l,m),tidx,elidx[1],elidx[2],elidx[3])] += val

the regression is avoided:

0.5.0: 0.075638 seconds (698.01 k allocations: 54.585 MB, 13.93% gc time).

Notes:

  • The assignment operation is part of a (very hot) nested loop.
  • The problem does not occur when indexing arrays with fewer than 5 dimensions.
  • The enclosing function is annotated with @inline and the assignment with @inbounds.
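For reference, a minimal sketch of the workaround described above, using hypothetical array sizes and index values (the real flmidx/tidx/elidx helpers are not shown in the thread). The original workaround used sub2ind, which was Julia ≤0.6 API; the sketch shows the equivalent on current Julia via LinearIndices:

```julia
# Hypothetical 5-D array and indices standing in for the real code.
L = zeros(4, 3, 2, 2, 2)
val = 1.0
i1, i2, i3, i4, i5 = 1, 2, 1, 2, 1

# Slow on 0.5.0: the 5-subscript form went through a non-inlined to_indexes.
L[i1, i2, i3, i4, i5] += val

# Workaround: index linearly instead. On 0.5 this was
#   L[sub2ind(size(L), i1, i2, i3, i4, i5)] += val
# and on current Julia the equivalent is:
L[LinearIndices(L)[i1, i2, i3, i4, i5]] += val
```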

This looks very much like #11819, though the example in said issue does not demonstrate a regression.

I have tried repeatedly to create a suitable MWE, but the regression disappears when I change almost anything about the setting of the assignment. If someone from the core Julia team would like to have a look, please let me know.

@ViralBShah ViralBShah added performance Must go faster regression Regression in behavior compared to a previous version labels Oct 3, 2016
@ViralBShah ViralBShah added this to the 0.5.x milestone Oct 3, 2016
@kshyatt kshyatt added the arrays label Oct 3, 2016
@StefanKarpinski
Member

Thanks for the report. If you're willing to share code privately, let us know. cc: @mbauman.

@mbauman
Member

mbauman commented Oct 11, 2016

As a first pass, you could try adding an inline annotation to to_indexes. You might be able to get away with redefining it at the REPL if you do it first thing in a new session:

import Base: to_index, to_indexes
@inline to_indexes(i1, I...) = (to_index(i1), to_indexes(I...)...)

@StefanKarpinski
Member

@mbauman: is that suggestion meant as a work-around or a way of debugging the perf issue?

@mbauman
Member

mbauman commented Oct 11, 2016

It's a stab at remote debugging. If it resolves the issue, it's a very easy fix.

@samuelpowell
Member Author

@mbauman, forcing the inlining indeed resolves the regression, bravo!

I don't know what I am doing that runs contrary to the inlining heuristic (though the indexing is by a mixture of inlined functions, scalars, and array elements). If you would like to look at the code privately, let me know.

@aaowens

aaowens commented Oct 14, 2016

I also experienced this issue in code that is much slower on 0.5 than 0.4. I made an example:

A3 = rand(10, 10, 10);
function test3(A, nx1, nx2, nx3)
  for i = 1:10_000_000
    A[nx1, nx2, nx3]
  end
end

A5 = rand(10, 10, 10, 10, 10);
function test5(A, nx1, nx2, nx3, nx4, nx5)
  for i = 1:10_000_000
    A[nx1, nx2, nx3, nx4, nx5]
  end
end

A6 = rand(10, 10, 10, 10, 10, 10);
function test6(A, nx1, nx2, nx3, nx4, nx5, nx6)
  for i = 1:10_000_000
    A[nx1, nx2, nx3, nx4, nx5, nx6]
  end
end
function test6_fast(A, nx1, nx2, nx3, nx4, nx5, nx6)
  Asize = size(A)
  for i = 1:10_000_000
    A[sub2ind(Asize, nx1, nx2, nx3, nx4, nx5, nx6 )]
  end
end
@time test3(A3, 1, 1, 1)
@time test5(A5, 1, 1, 1, 1, 1)
@time test6(A6, 1, 1, 1, 1, 1, 1)
@time test6_fast(A6, 1, 1, 1, 1, 1, 1)

0.4:

julia> @time test6(A6, 1, 1, 1, 1, 1, 1)
  0.012555 seconds (4 allocations: 160 bytes)

0.5:

julia> @time test6(A6, 1, 1, 1, 1, 1, 1)
 18.955561 seconds (90.00 M allocations: 2.086 GB, 1.52% gc time)

For me the cutoff is from 5- to 6-dimensional arrays. I tried running the code @mbauman posted, but it didn't fix the issue. Using a linear index fixes the issue.
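A quick way to check whether the index conversion is being inlined (a generic inspection sketch, not posted in the thread; the helper's name, to_indexes, is specific to the 0.5-era Base) is to dump the generated code for the slow arity:

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

A6 = rand(10, 10, 10, 10, 10, 10)

# Print the LLVM IR for the 6-subscript getindex method. If the index
# conversion helper shows up as an explicit call, it was not inlined
# into getindex and each access pays a function-call (and tuple) cost.
@code_llvm getindex(A6, 1, 1, 1, 1, 1, 1)
```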

@KristofferC KristofferC added the potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 25, 2016
@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 27, 2016
timholy added a commit that referenced this issue Oct 27, 2016
Force inlining on getindex and to_indexes (fixes #18774)
@JaredCrean2
Contributor

I'm not sure this is completely fixed. I have a 6-dimensional stencil computation test case here that shows a slowdown compared to Julia 0.4. I built 0.6 commit 4ec2db0* (2016-10-29 05:01 UTC), so it should have the most recent fix.

0.4 timing:

testing outer_func2
  0.023961 seconds
  0.025217 seconds

0.5 timing:

testing outer_func2
 85.783995 seconds (496.04 M allocations: 11.510 GB, 0.98% gc time)
# I didn't wait for the second one

0.6 timing:

testing outer_func2
  3.052716 seconds (17.72 M allocations: 432.587 MB, 3.72% gc time)
  5.415282 seconds (17.72 M allocations: 432.510 MB, 0.83% gc time)

The profiler is running the second time, which is why the second run is slower.

The profile results:

0.4:

 Count File                       Function                                 Line
    32 ./boot.jl                  include                                   261
    33 ./client.jl                _start                                    378
    33 ./client.jl                process_options                           284
    32 ./loading.jl               include_from_node1                        304
     1 ./stream.jl                buffer_or_write                           969
     1 ./stream.jl                stream_wait                                60
     1 ./stream.jl                uv_write                                  945
     1 ./stream.jl                write                                     993
     1 ./task.jl                  wait                                      369
     1 ...test_inline_testcase.jl outer_func2                                26
     2 ...test_inline_testcase.jl outer_func2                                38
     1 ...test_inline_testcase.jl outer_func2                                44
     3 ...test_inline_testcase.jl outer_func2                                62
     1 printf.jl                  print_fixed                               191
     1 printf.jl                  time_print                                162
     1 simdloop.jl                outer_func2                                67
     3 simdloop.jl                outer_func2                                75
    20 simdloop.jl                outer_func2                                84
    31 util.jl                    runtest                                   155
     1 util.jl                    runtest                                   158

0.6:

 Count File                        Line Function
  6920 ./array.jl                   475 setindex!(::Array{Float64,6}, ::Flo...
  7161 ./client.jl                  335 _start()
  7161 ./client.jl                  271 process_options(::Base.JLOptions)
     1 ./event.jl                   153 wait()
     1 ./io.jl                      263 unsafe_write(::Base.TTY, ::Base.Ref...
     1 ./io.jl                      302 write(::Base.TTY, ::Char)
  7161 ./loading.jl                 532 include_from_node1(::String)
  9906 ./operators.jl               868 to_indexes(::Int64, ::Int64, ::Vara...
     1 ./printf.jl                  187 print_fixed(::Base.TTY, ::Int64, ::...
  7161 ./profile.jl                  22 macro expansion
    22 ./simdloop.jl                 71 macro expansion
  7124 ./simdloop.jl                 72 macro expansion
     3 ./simdloop.jl                 71 outer_func2(::Array{Float64,6}, ::A...
     1 ./stream.jl                   42 stream_wait(::Task)
     1 ./stream.jl                  774 unsafe_write(::Base.TTY, ::Ptr{UInt...
     1 ./stream.jl                  764 uv_write(::Base.TTY, ::Ptr{UInt8}, ...
  7161 ./sysimg.jl                   14 include(::String)
  7160 ./util.jl                    194 macro expansion
     1 ./util.jl                    197 macro expansion
     1 ./util.jl                    156 time_print(::UInt64, ::Int64, ::Int...
     4 ...test_inline_testcase.jl    26 macro expansion
    78 ...test_inline_testcase.jl    32 macro expansion
  7042 ...test_inline_testcase.jl    62 macro expansion
     4 ...test_inline_testcase.jl    15 outer_func2(::Array{Float64,6}, ::A...
  7146 ...test_inline_testcase.jl    16 outer_func2(::Array{Float64,6}, ::A...
  7161 ...test_inline_testcase.jl    88 runtest()

It's better than 0.5, but to_indexes is still taking up most of the time.

@JaredCrean2
Contributor

cc @timholy

@timholy
Member

timholy commented Nov 3, 2016

Would be great if you'd contribute that to BaseBenchmarks.

@JaredCrean2
Contributor

Sure, I'll put together a PR this weekend.

@JaredCrean2
Contributor

Finally got the benchmark together and merged.

@tkelman backport?

@tkelman
Contributor

tkelman commented Nov 20, 2016

#19097 is already labeled

@JaredCrean2
Contributor

Ah, I missed that. Thanks.
