Implement the OrdinaryDiffEq interface for Kets #16
Conversation
These changes are self-contained and let you define `ODEProblem`s directly with `Ket`s. Here is a quick example:

```julia
t₀, t₁ = (0.0, pi)
ℋ = SpinBasis(1//2)
σx = sigmax(ℋ)
↓ = spindown(ℋ)
iσx = im * σx

schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob = ODEProblem(schrod!, ↓, (t₀, t₁))
@benchmark sol = solve(prob,DP5(),save_everystep=false) # Allocates ~600KiB

schrod(ψ,p,t) = iσx*ψ
prob = ODEProblem(schrod, ↓, (t₀, t₁))
@benchmark sol = solve(prob,DP5(),save_everystep=false) # Allocates ~170KiB, which is less than schrod! !?!?

@benchmark timeevolution.schroedinger([t₀,t₁], ↓, σx) # Allocates ~12KiB
```

Help with this would be appreciated :) I am fairly sure it is not due to the current broadcasting implementation (which does have some overhead, but it is a constant overhead, not something that scales with size). Any idea how to find what allocates so much?
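(Editorial aside: one generic way to answer the "what allocates so much" question is Julia's allocation profiler, available since Julia 1.8. This sketch is not from the thread; `prob` is the `ODEProblem` defined above.)

```julia
using Profile

solve(prob, DP5(), save_everystep=false)  # warm up so compilation is not recorded

# Record every allocation during one solve call
Profile.Allocs.@profile sample_rate=1 solve(prob, DP5(), save_everystep=false)

# List the largest recorded allocations with their types
results = Profile.Allocs.fetch()
for a in first(sort(results.allocs, by=x -> x.size, rev=true), 10)
    println(a.size, " bytes of ", a.type)
end
```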
Thanks for digging into this! I think it's the broadcasting that ruins the performance here, though. Replacing the in-place broadcasting method here (QuantumOpticsBase.jl/src/states.jl, line 242 at 201b630) with

```julia
@inline function Base.copyto!(dest::Ket{B}, bc::Broadcast.Broadcasted{Style,Axes,F,Args}) where {B<:Basis,Style<:KetStyle{B},Axes,F,Args}
    axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
    bc′ = Base.Broadcast.preprocess(dest, bc)
    dest′ = dest.data
    @inbounds @simd for I in eachindex(bc′)
        dest′[I] = bc′[I]
    end
    return dest
end

Base.getindex(st::StateVector, idx) = getindex(st.data, idx)
```

which is essentially the example Chris Rackauckas posted in your discourse thread, I then get

```julia
julia> @benchmark sol = solve($prob,$alg,save_everystep=false)
BenchmarkTools.Trial:
  memory estimate:  9.92 KiB
  allocs estimate:  133
  --------------
  minimum time:     14.402 μs (0.00% GC)
  median time:      16.446 μs (0.00% GC)
  mean time:        17.666 μs (3.53% GC)
  maximum time:     6.282 ms (99.16% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

for the in-place version of your code. This is much better, but still not as good as `timeevolution.schroedinger`. FYI, you're also benchmarking in global scope and using some non-constant globals, which you should avoid to get proper results. Either put all the benchmarking inside a function or make the globals constant (in this case you need …
Actually, the amount of additional allocations of the in-place version is constant, in that the "allocs estimate" is the same regardless of the problem size. The memory size increases, though, and is always higher than with `timeevolution.schroedinger`. For the in-place version I get

```julia
BenchmarkTools.Trial:
  memory estimate:  15.67 KiB
  allocs estimate:  145
  --------------
  minimum time:     101.143 μs (0.00% GC)
  median time:      109.095 μs (0.00% GC)
  mean time:        111.976 μs (0.94% GC)
  maximum time:     5.413 ms (97.68% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

and for `timeevolution.schroedinger`

```julia
BenchmarkTools.Trial:
  memory estimate:  13.62 KiB
  allocs estimate:  73
  --------------
  minimum time:     155.181 μs (0.00% GC)
  median time:      168.481 μs (0.00% GC)
  mean time:        170.973 μs (0.30% GC)
  maximum time:     5.366 ms (96.15% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
Another chunk of performance is lost in the out-of-place broadcasting method. Using the following:

```julia
@inline function Base.copy(bc::Broadcast.Broadcasted{Style,Axes,F,Args}) where {B<:Basis,Style<:KetStyle{B},Axes,F,Args<:Tuple}
    bcf = Broadcast.flatten(bc)
    b = find_basis(bcf)
    T = find_dType(bcf)
    data = zeros(T, length(b))
    @inbounds @simd for I in eachindex(bcf)
        data[I] = bcf[I]
    end
    return Ket{B}(b, data)
end

for f ∈ [:find_basis,:find_dType]
    @eval ($f)(bc::Broadcast.Broadcasted) = ($f)(bc.args)
    @eval ($f)(args::Tuple) = ($f)(($f)(args[1]), Base.tail(args))
    @eval ($f)(x) = x
    @eval ($f)(::Any, rest) = ($f)(rest)
end
find_basis(a::StateVector, rest) = a.basis
find_dType(a::StateVector, rest) = eltype(a)
```

puts me really close in speed and allocations for your original example. The allocations also seem to scale a bit better. For my spin-20//2 example, I get …
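(For readers: written out, the `@eval` loop above generates these four methods for `find_basis`, and the analogous four for `find_dType`.)

```julia
# Recurse into the lazy broadcast tree: unwrap nested Broadcasteds into
# their argument tuples and scan left to right. The StateVector method
# defined above "claims" the result; any other argument falls through to
# the rest of the tuple.
find_basis(bc::Broadcast.Broadcasted) = find_basis(bc.args)
find_basis(args::Tuple) = find_basis(find_basis(args[1]), Base.tail(args))
find_basis(x) = x
find_basis(::Any, rest) = find_basis(rest)
```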
I am a bit lost as to what is happening here. Could you explain what …
I updated the pull request with the changes you explained and also added some extra tests. In some cases, it is curiously faster than `timeevolution.schroedinger`:

```julia
ℋ = SpinBasis(20//1)
↓ = spindown(ℋ)
t₀, t₁ = (0.0, pi)
const σx = sigmax(ℋ)
const iσx = im * σx
schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob! = ODEProblem(schrod!, ↓, (t₀, t₁))

julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial:
  memory estimate:  22.67 KiB
  allocs estimate:  178
  --------------
  minimum time:     374.463 μs (0.00% GC)
  median time:      397.327 μs (0.00% GC)
  mean time:        406.738 μs (0.37% GC)
  maximum time:     4.386 ms (89.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark timeevolution.schroedinger([$t₀,$t₁], $↓, $σx)
BenchmarkTools.Trial:
  memory estimate:  23.34 KiB
  allocs estimate:  161
  --------------
  minimum time:     748.106 μs (0.00% GC)
  median time:      774.601 μs (0.00% GC)
  mean time:        786.933 μs (0.14% GC)
  maximum time:     4.459 ms (80.46% GC)
  --------------
  samples:          6350
  evals/sample:     1
```
Codecov Report

```
@@            Coverage Diff             @@
##           master      #16      +/-  ##
=========================================
+ Coverage   93.61%   93.76%   +0.14%
=========================================
  Files          24       24
  Lines        2475     2486      +11
=========================================
+ Hits         2317     2331      +14
+ Misses        158      155       -3
```

Continue to review full report at Codecov.
I also updated the issue description with a short TODO. Does the current list sound reasonable to you? Next weekend I might have some time to make the listed changes, but feel free to jump in in case you are impatient ;)
The following now works:

```julia
using QuantumOptics
using DifferentialEquations

ℋ = SpinBasis(1//2)
σx = sigmax(ℋ)
↓ = s = spindown(ℋ)
schrod(ψ,p,t) = im * σx * ψ

t₀, t₁ = (0.0, pi)
Δt = 0.1

prob = ODEProblem(schrod, ↓, (t₀, t₁))
sol = solve(prob,Tsit5())
```

It works for Bras as well. It works for in-place operations, and in some situations it is faster than the standard `timeevolution.schroedinger`.

```julia
ℋ = SpinBasis(20//1)
↓ = spindown(ℋ)
t₀, t₁ = (0.0, pi)
const σx = sigmax(ℋ)
const iσx = im * σx
schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob! = ODEProblem(schrod!, ↓, (t₀, t₁))

julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial:
  memory estimate:  22.67 KiB
  allocs estimate:  178
  --------------
  minimum time:     374.463 μs (0.00% GC)
  median time:      397.327 μs (0.00% GC)
  mean time:        406.738 μs (0.37% GC)
  maximum time:     4.386 ms (89.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark timeevolution.schroedinger([$t₀,$t₁], $↓, $σx)
BenchmarkTools.Trial:
  memory estimate:  23.34 KiB
  allocs estimate:  161
  --------------
  minimum time:     748.106 μs (0.00% GC)
  median time:      774.601 μs (0.00% GC)
  mean time:        786.933 μs (0.14% GC)
  maximum time:     4.459 ms (80.46% GC)
  --------------
  samples:          6350
  evals/sample:     1
```
I got them from Chris Rackauckas' example, but had to look them up myself. If I understand correctly, the …
I'm not 100% sure what you're asking here, but I'll try to answer anyway. The …
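(Editorial aside, since the reply above is truncated: a minimal illustration of what `Broadcast.flatten` does, using plain vectors.)

```julia
using Base.Broadcast: broadcasted, flatten

x, y = rand(4), rand(4)

# Dot syntax builds a nested lazy tree: 2.0 .* x .+ y becomes
# Broadcasted(+, (Broadcasted(*, (2.0, x)), y))
bc = broadcasted(+, broadcasted(*, 2.0, x), y)

# flatten collapses the nesting into one Broadcasted whose args tuple is
# flat — (2.0, x, y) — which is exactly what find_basis/find_dType walk.
bcf = flatten(bc)
bcf.args  # (2.0, x, y)
```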
Well, we're actually comparing two different things.

```julia
const ix = iσx.data
schrod_data!(dψ,ψ,p,t) = mul!(dψ, ix, ψ)
u0 = (↓).data
prob_data! = ODEProblem(schrod_data!, u0, (t₀, t₁))
alg = DP5()
@benchmark solve($prob_data!, $alg, save_everystep=false)
```

Matching the performance of this (with a constant overhead) is the goal.
Those sound good! You could also split the broadcasting for … Finally, just for reference, we can in principle do the same for the …
The following works now:

```julia
ℋ = SpinBasis(20//1)
const σx = sigmax(ℋ)
const iσx = im * σx
const σ₋ = sigmam(ℋ)
const σ₊ = σ₋'
const mhalfσ₊σ₋ = -σ₊*σ₋/2

↓ = spindown(ℋ)
ρ = dm(↓)

lind(ρ,p,t) = - iσx * ρ + ρ * iσx + σ₋*ρ*σ₊ + mhalfσ₊σ₋ * ρ + ρ * mhalfσ₊σ₋

t₀, t₁ = (0.0, pi)
Δt = 0.1

prob = ODEProblem(lind, ρ, (t₀, t₁))
sol = solve(prob,Tsit5())
```

Works in-place as well. It is slightly slower than `timeevolution.master`:

```julia
function makelind!()
    tmp = zero(ρ) # this is the global rho
    function lind!(dρ,ρ,p,t) # TODO this can be much better with a good Tullio kernel
        mul!(tmp, ρ, σ₊)
        mul!(dρ, σ₋, tmp)   # dρ = σ₋ ρ σ₊, using tmp, which holds ρσ₊
        mul!(dρ, ρ, mhalfσ₊σ₋, true, true)
        mul!(dρ, mhalfσ₊σ₋, ρ, true, true)
        mul!(dρ, iσx, ρ, -ComplexF64(1), ComplexF64(1))
        mul!(dρ, ρ, iσx, true, true)
        return dρ
    end
end
lind! = makelind!()
prob! = ODEProblem(lind!, ρ, (t₀, t₁))

julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial:
  memory estimate:  408.94 KiB
  allocs estimate:  213
  --------------
  minimum time:     126.334 ms (0.00% GC)
  median time:      127.359 ms (0.00% GC)
  mean time:        127.876 ms (0.00% GC)
  maximum time:     138.660 ms (0.00% GC)
  --------------
  samples:          40
  evals/sample:     1

julia> @benchmark timeevolution.master([$t₀,$t₁], $ρ, $σx, [$σ₋])
BenchmarkTools.Trial:
  memory estimate:  497.91 KiB
  allocs estimate:  210
  --------------
  minimum time:     97.902 ms (0.00% GC)
  median time:      98.469 ms (0.00% GC)
  mean time:        98.655 ms (0.00% GC)
  maximum time:     104.850 ms (0.00% GC)
  --------------
  samples:          51
  evals/sample:     1
```
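(A quick consistency check, editorial, that the in-place and out-of-place right-hand sides above agree:)

```julia
# Both formulations should produce the same derivative for the same state.
dρ = zero(ρ)
lind!(dρ, ρ, nothing, 0.0)
@assert dρ.data ≈ lind(ρ, nothing, 0.0).data
```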
@david-pl, I looked through the rest of the TODOs, but it seems most of that work is to be done in QuantumOptics, not in QOBase. Does it make sense to merge this PR now, and then I will make an issue listing the rest of the TODOs and start working on it in a QuantumOptics pull request? An exception to this is figuring out what is necessary in order to work with sparse state vectors, but that might be out of scope.
@Krastanov Before moving on to the other time evolution functions, we should make sure the ones that already work are performant. This needs to be done in QOBase. Currently, there are allocations and slowdowns happening when using …
I'd say this is definitely out of scope for now.

This has been stale for a long time. Is there any update on this?
@david-pl @Krastanov Additional allocations come from the fact that … If we make that change and now run the example, we get …

This leads to a pretty significant performance improvement.

It's also now pretty comparable to using …
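(The actual diff is elided above; since the fix concerned `deepcopy`, it presumably went through `RecursiveArrayTools.recursivecopy`, which OrdinaryDiffEq uses to copy states and which falls back to `deepcopy` for unfamiliar types. A hypothetical sketch of that kind of change:)

```julia
using RecursiveArrayTools
import RecursiveArrayTools: recursivecopy

# Hypothetical: copy only the numeric data instead of deep-copying the
# whole object, basis included, on every state copy the solver makes.
recursivecopy(ψ::Ket) = Ket(ψ.basis, copy(ψ.data))
```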
@apkille thanks for taking a look at this! The allocations from deepcopy are a great find. However, I don't think we're quite there yet. The first benchmark you mention is again comparing … Now for the second one: this is the benchmark we need to look at, since it really measures the impact of using QO.jl types.

Well, if I read the timings correctly, the time it takes almost doubles. The additional time cost may be okay if it's a constant offset, but I previously saw that it scales with system size, meaning you'd be significantly slowing down time evolution computations here. Could you maybe check the behaviour with increasing system size again?
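(A sketch of such a scaling check, editorial; the sizes are arbitrary and the operator is passed through `p` to avoid non-constant globals.)

```julia
using QuantumOptics, OrdinaryDiffEq, BenchmarkTools, LinearAlgebra

# Time one solve per system size to see how the overhead scales.
for s in (1//2, 10//2, 20//2, 40//2)
    b = SpinBasis(s)
    iH = im * sigmax(b)
    f! = (dψ, ψ, p, t) -> mul!(dψ, p, ψ)   # p carries the operator
    prob = ODEProblem(f!, spindown(b), (0.0, pi), iH)
    t = @belapsed solve($prob, DP5(), save_everystep=false)
    println("spin ", s, ": ", round(t * 1e6, digits=1), " μs")
end
```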
@david-pl If we increase the problem size with … we get … and for pure Julia types we get …

So the allocations are the same, but there is still a time difference, although the time it takes does not almost double, as was the case for … Anyway, I was mainly trying to find the big culprit of the additional allocations in the above comment about …
@apkille brilliant, thanks for digging into this. I'd love to merge this so we have native DiffEq support ;)
@david-pl Here's another decent improvement on allocations and memory, which comes from also defining a …

Using the example from above for …

So the allocations are now pretty much the same, and the memory is much better for QO types, although the mean time is greater by ~0.3 ms. There's probably one or two allocations in …
Me too. I'm happy to continue this work for other modules in QO.jl as well. I think a lot of users would find it quite useful, and I geek out on all of the SciML features :)
Yeah, there's probably some limitation on what we can do with broadcasting while keeping track of the bases of the state. Anyway, I thought a bit more about this and maybe your version is already more efficient when used in …
I agree. Comparing QO types against pure Julia types is certainly the right approach, but I would be surprised if we ever get zero performance difference between the two, simply because of the additional (yet necessary) steps for working with QO types. That said, I'm all for making things efficient.
I'll make a new PR that looks into this.
You're probably right, but I still think it should be possible in principle to get the same performance here. I mean, a …
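(For context on that truncated thought: a `Ket` is essentially a basis tag around a plain vector. A simplified sketch, not the actual QuantumOpticsBase definition:)

```julia
# All numerics happen on the data field; the basis is only a consistency
# tag, which is why zero overhead should be achievable in principle.
struct SimpleKet{B<:Basis, T<:AbstractVector}
    basis::B
    data::T
end
```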
Awesome, thanks!
closed in favor of #172 |
TODO

- `StateVector`s and `Schroedinger` equations to work natively with `OrdinaryDiffEq`
Actually converting the current solvers to use this simpler "no-recast" interface is out of scope for this pull request.
Old message:
The following now works. It works for Bras as well. It works for in-place operations; however, there are spurious allocations due to inefficient broadcasting that ruin the performance.