
Speeding up copy() #121

Closed
ViralBShah opened this issue Jul 16, 2011 · 10 comments · Fixed by JuliaLang/Compat.jl#130
Labels: performance (Must go faster)

@ViralBShah
Member

Create a deep_copy() so that it can be explicitly used where necessary.

Also, copy() and copy_to() need to be optimized to use the fastest implementations. Here are some tests (Mac, Intel Core 2 Duo) that suggest:

  1. memcpy for large copy and copy_to
  2. Native julia for small copy
  3. BLAS for small copy_to

Case 2 could be omitted, since it is close enough to case 3, to keep things simple. These tests also need to be carried out on other architectures.
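The size-cutoff dispatch implied by these results can be sketched in C. This is a hypothetical illustration, not the actual Julia implementation: the function name `copy_doubles` and the cutoff value of 200 elements (taken from the discussion below) are assumptions that would need tuning per architecture.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: below the cutoff, a plain loop avoids the
 * overhead of a library call; above it, memcpy's tuned bulk-copy
 * path wins. The cutoff is an assumption to be benchmarked. */
#define COPY_CUTOFF 200

static void copy_doubles(double *dst, const double *src, size_t n) {
    if (n < COPY_CUTOFF) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    } else {
        memcpy(dst, src, n * sizeof(double));
    }
}
```

The same shape works for a `copy_to`-style primitive: dispatch once on length, then run whichever copy kernel the measurements favor on the target machine.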

##### Test 1 #####

julia> a = ones(100)
[1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]

## Julia implementation (FASTEST)
julia> @time for i=1:1e5; jcopy(a); end;
elapsed time: 0.38377809524536133 sec

## This is DCOPY from BLAS, an assembly-language implementation in OpenBLAS
julia> @time for i=1:1e5; bcopy(a); end;
elapsed time: 0.41508984565734863 sec

## This one dispatches to memcpy
julia> @time for i=1:1e5; copy(a); end;
elapsed time: 0.61258411407470703 sec

##### Test 2 #####

I now implemented copy_to() for all cases to remove allocation/GC costs:

julia> a = ones(100)
[1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]

julia> b = ones(100)
[1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]

## Julia implementation
julia> @time for i=1:1e5; jcopy_to(b,a); end;
elapsed time: 0.04522800445556641 sec

## BLAS (FASTEST)
julia> @time for i=1:1e5; bcopy_to(b,a); end;
elapsed time: 0.01788496971130371 sec

## memcpy
julia> @time for i=1:1e5; copy_to(b,a); end;
elapsed time: 0.27470088005065918 sec

##### Test 3 #####

And now, a larger size:

julia> a = ones(1000000)
[1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]

julia> b = ones(1000000)
[1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]

## Julia implementation
julia> @time for i=1:100; jcopy_to(b,a); end;
elapsed time: 0.5620429515838623 sec

## BLAS
julia> @time for i=1:100; bcopy_to(b,a); end;
elapsed time: 0.5299229621887207 sec

## memcpy (FASTEST)
julia> @time for i=1:100; copy_to(b,a); end;
elapsed time: 0.35404396057128906 sec
@ghost ghost assigned ViralBShah Jul 16, 2011
@ViralBShah
Member Author

Filed DCOPY performance issue for openblas: OpenMathLib/OpenBLAS#45

@ViralBShah
Member Author

I see similar behavior on Opteron + Linux as well.

Small size:

julia> @time for i=1:1e5; copy_to(b, a); end;
elapsed time: 0.25034022331237793 sec

julia> @time for i=1:1e5; bcopy_to(b, a); end;
elapsed time: 0.01954412460327148 sec

julia> @time for i=1:1e5; jcopy_to(b, a); end;
elapsed time: 0.04978704452514648 sec

Large size:

julia> @time for i=1:1e3; copy_to(b, a); end;
elapsed time: 3.59907007217407227 sec

julia> @time for i=1:1e3; jcopy_to(b, a); end;
elapsed time: 8.26361894607543945 sec

julia> @time for i=1:1e3; bcopy_to(b, a); end;
elapsed time: 5.8034358024597168 sec

@ViralBShah
Member Author

memcpy in FreeBSD. The first link is generic; the second is for amd64.

http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/string/bcopy.c?rev=1.7.14.1;content-type=text%2Fplain
http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/amd64/string/bcopy.S?rev=1.4;content-type=text%2Fplain

It is quite likely that Apple uses an optimized memcpy supplied by Intel. Here's the Apple libc source:

http://www.opensource.apple.com/source/Libc/Libc-594.9.5/string/

@ViralBShah
Member Author

Some notes on memcpy at Intel - a bit dated:

http://software.intel.com/en-us/articles/memcpy-performance/

@JeffBezanson
Member

I'm seeing less difference - no factors of 10 between memcpy and blas. But I also see blas faster up to 200 or 300 elements. Can we just pick a cutoff of 200 and close it?

@ViralBShah
Member Author

Working on closing this.

-viral


@ViralBShah
Member Author

Now, this is much faster.

julia> a = ones(100)
[1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0]

julia> @time for i=1:1e5; copy(a); end;
elapsed time: 0.28148198127746582 sec

@ViralBShah
Member Author

Ok, I have now completed the implementation in what I believe is a systematic way, and all of the speed gains of the BLAS copy have disappeared. I suspect I have introduced some Julia overhead that can be avoided. For testing, I am checking in mcopy_to() and bcopy_to(), which can be removed later.

julia> n = 100; a = ones(n); b = Array(Float64,n);

julia> @time for i=1:1000000; mcopy_to(pointer(b), pointer(a), n); end
elapsed time: 0.32441210746765137 sec

julia> @time for i=1:1000000; bcopy_to(pointer(b), pointer(a), n); end
elapsed time: 0.43614697456359863 sec
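For reference, the two pointer-level strategies being compared here can be sketched in C. These are illustrative stand-ins for the `mcopy_to()`/`bcopy_to()` helpers mentioned above, not their actual implementations: `mcopy_like` is a thin memcpy wrapper, and `dcopy_like` mimics the unit-stride case of BLAS DCOPY (the real DCOPY also handles arbitrary increments and is hand-tuned assembly in OpenBLAS).

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for mcopy_to: defer entirely to memcpy. */
static void mcopy_like(double *dst, const double *src, size_t n) {
    memcpy(dst, src, n * sizeof(double));
}

/* Illustrative stand-in for bcopy_to: a strided element loop in the
 * style of BLAS DCOPY(n, x, incx, y, incy). */
static void dcopy_like(size_t n, const double *x, ptrdiff_t incx,
                       double *y, ptrdiff_t incy) {
    for (size_t i = 0; i < n; i++)
        y[i * incy] = x[i * incx];
}
```

Any timing difference between the two then comes down to call overhead versus how aggressively the bulk-copy routine is vectorized, which is exactly what the numbers above probe.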

-viral


@ViralBShah
Member Author

The other explanation could be that OS X 10.7 (Lion) ships a much-improved memcpy for small copies.

-viral

