
Bring CUDA support to Tracking.jl #33

Merged
merged 105 commits on Nov 17, 2021

Conversation

coezmaden (Contributor):

This brings the long-awaited GPU functionality to Tracking.jl. At the moment, only CUDA.jl is supported.

Notes / Features:

  • Compatibility with JuliaGNSS/GNSSSignals.jl#19
  • Opt-in GPU signal processing
  • Monolithic CUDA.jl kernel combining carrier and code replication, downconversion, and correlation for arbitrary num_samples, num_ants, num_correlators. (arbitrary as long as the hardware allows)
  • Conditional unit testing (only if CUDA.functional())
  • Updated README.md with a GPU example
  • Updated README.md for correct usage after Speculative: Move PRN to TrackingState #31
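The monolithic kernel mentioned above can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual kernel: the function name, signature, and split real/imaginary accumulator arrays are assumptions. Each thread handles one sample, generates its carrier replica in registers, downconverts, and atomically accumulates the code-multiplied result per antenna and correlator.

```julia
using CUDA

# Hypothetical sketch of a monolithic downconvert-and-correlate kernel.
# accum_re/accum_im: num_ants × num_correlators Float32 accumulators
# signal_re/signal_im: num_samples × num_ants baseband signal components
# codes: num_samples × num_correlators replica code samples
function downconvert_and_correlate_kernel!(
    accum_re, accum_im,
    signal_re, signal_im,
    codes,
    carrier_frequency, sampling_frequency, carrier_phase,
    num_samples,
)
    sample_idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if sample_idx <= num_samples
        # Carrier replica for this sample, kept in registers
        c_im, c_re = CUDA.sincos(2f0 * π *
            ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))
        for ant in axes(signal_re, 2), corr in axes(codes, 2)
            # Downconvert (multiply by the conjugate carrier), then correlate
            d_re = signal_re[sample_idx, ant] * c_re + signal_im[sample_idx, ant] * c_im
            d_im = signal_im[sample_idx, ant] * c_re - signal_re[sample_idx, ant] * c_im
            CUDA.@atomic accum_re[ant, corr] += d_re * codes[sample_idx, corr]
            CUDA.@atomic accum_im[ant, corr] += d_im * codes[sample_idx, corr]
        end
    end
    return nothing
end
```

Such a kernel would be launched over the sample dimension, e.g. `@cuda threads=256 blocks=cld(num_samples, 256) downconvert_and_correlate_kernel!(…)`, so arbitrary `num_samples`, `num_ants`, and `num_correlators` are covered as long as the hardware allows.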

This is a first fully functioning version. Optimizations regarding the performance of the CUDA kernel will follow. This version will be considered a baseline.
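The conditional unit testing works by gating GPU test sets on a functional CUDA setup. A minimal sketch (the PR's actual test layout may differ):

```julia
using CUDA, Test

# Run GPU test sets only when a working CUDA device is available;
# on CPU-only CI machines the suite still passes.
if CUDA.functional()
    @testset "GPU tracking" begin
        # GPU-specific tests go here
    end
else
    @info "CUDA not functional, skipping GPU tests."
end
```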

Can Ozmaden and others added 30 commits April 12, 2020 23:33
Update forked master due to hotfix
Incorporate the hotfix from the master repo
Compat LoopVectorization
Compat LoopVectorization

codecov-commenter commented Nov 14, 2021

Codecov Report

Merging #33 (724dcf6) into master (5c32278) will decrease coverage by 5.81%.
The diff coverage is 84.00%.


@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage   90.59%   84.77%   -5.82%     
==========================================
  Files          16       17       +1     
  Lines         404      473      +69     
==========================================
+ Hits          366      401      +35     
- Misses         38       72      +34     
Impacted Files Coverage Δ
src/Tracking.jl 100.00% <ø> (ø)
src/carrier_replica.jl 100.00% <ø> (ø)
src/downconvert_and_correlate.jl 33.33% <ø> (ø)
src/tracking_loop.jl 97.50% <82.35%> (-2.50%) ⬇️
src/tracking_state.jl 95.34% <87.50%> (-1.80%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


coezmaden commented Nov 14, 2021

A comment on the last commit, 90358a9: it now checks inside the track function whether the signal and the codes are of the same type in the GPU processing case. Here are two examples where an error is thrown:

julia> signal_cuarray
2500-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:

julia> results_gpu = track(signal_cuarray, state_gpu, sampling_frequency)
ERROR: ArgumentError: Signal is not a StructArray, initialize the signal properly and try again.

julia> fake_struct = StructArray(signal_cpu)
2500-element StructArray(::Vector{Float32}, ::Vector{Float32}) with eltype ComplexF32:

julia> results_gpu = track(fake_struct, state_gpu, sampling_frequency)
ERROR: ArgumentError: Signal and GNSS codes are not of the same type. Please check if CPU or GPU is used.

It comes with a slight performance penalty in the GPU case, but I think it's manageable.
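The guard that produces the two errors above could look roughly like this. This is an illustrative sketch, not the commit's actual code; the function name is hypothetical:

```julia
using CUDA, StructArrays

# Hypothetical sketch: when the GNSS codes live on the GPU, the signal must
# be a StructArray backed by CuArrays before tracking proceeds.
function check_signal_for_gpu(codes::CuMatrix, signal)
    # First error case: a plain CuArray signal is rejected
    signal isa StructArray || throw(ArgumentError(
        "Signal is not a StructArray, initialize the signal properly and try again."))
    # Second error case: a CPU-backed StructArray paired with GPU codes
    signal.re isa CuArray || throw(ArgumentError(
        "Signal and GNSS codes are not of the same type. Please check if CPU or GPU is used."))
    return signal
end
```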

#BEFORE THE COMMIT
julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  81.162 μs … 26.223 ms  ┊ GC (min … max): 0.00% … 41.79%
 Time  (median):     84.806 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   98.184 μs ± 439.458 μs  ┊ GC (mean ± σ):  3.16% ±  0.71%

   █▄                                                           
  ▅██▅▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂ ▂
  81.2 μs         Histogram: frequency by time          177 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 418.

#AFTER THE COMMIT
julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   83.498 μs … 26.311 ms  ┊ GC (min … max): 0.00% … 40.27%
 Time  (median):      89.555 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   111.975 μs ± 519.990 μs  ┊ GC (mean ± σ):  3.66% ±  0.80%

  █▇▄▂▁▁▁▁                                                      ▂
  ███████████▇▇▇▆▄▆▆▅▆▆▇▅▄▄▅▄▄▅▄▃▃▅▄▃▃▃▁▄▁▄▃▃▄▁▁▃▃▁▃▄▁▃▅▄▄▅▄▃▄▄ █
  83.5 μs       Histogram: log(frequency) by time        460 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 418.

No regression for the CPU:

#BEFORE THE COMMIT
julia> @benchmark track($signal_cpu, $state_cpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.174 μs …  2.401 ms  ┊ GC (min … max): 0.00% … 99.11%
 Time  (median):     8.553 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.361 μs ± 32.195 μs  ┊ GC (mean ± σ):  4.84% ±  1.40%

   ▂█▅                                                        
  ▃████▅▅▄▃▃▄▄▄▄▅▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▁▂▁▁▁▂▂▂▂ ▃
  8.17 μs        Histogram: frequency by time        13.1 μs <

 Memory estimate: 5.12 KiB, allocs estimate: 94.

#AFTER THE COMMIT
julia> @benchmark track($signal_cpu, $state_cpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  7.917 μs …  2.161 ms  ┊ GC (min … max): 0.00% … 99.45%
 Time  (median):     8.399 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.155 μs ± 30.340 μs  ┊ GC (mean ± σ):  4.67% ±  1.40%

    ██▆▁                                                      
  ▂▇████▅▄▄▅▅▅▆▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
  7.92 μs        Histogram: frequency by time        13.6 μs <

 Memory estimate: 5.12 KiB, allocs estimate: 94.

@zsoerenm (Member) left a comment:

I have some minor change requests. Otherwise it looks good!

Comment on lines 27 to 47
Floating point CPU StructArray carrier generation
"""
function gen_carrier_replica!(
carrier_replica::StructArray{Complex{T},1,NamedTuple{(:re, :im),Tuple{Array{T,1},Array{T,1}}},Int64},
carrier_frequency,
sampling_frequency,
start_phase,
carrier_amplitude_power::Val{N},
start_sample,
num_samples
) where {
T <: AbstractFloat,
N
}
sample_range = start_sample:num_samples + start_sample - 1
@views @. carrier_replica.re[sample_range] = 2pi * (sample_range) * carrier_frequency / sampling_frequency + start_phase
@. carrier_replica.im[sample_range] = sin(carrier_replica.re[sample_range])
@. carrier_replica.re[sample_range] = cos(carrier_replica.re[sample_range])
return carrier_replica
end

Member:

I guess this is a leftover from previous tests. I think it can be removed.

@@ -124,7 +124,7 @@ Get prompt correlator
function get_prompt(correlator::AbstractCorrelator, correlator_sample_shifts)
correlator.accumulators[get_prompt_index(correlator_sample_shifts)]
end

CUDA.dot
Member:

What's that?

It is determined based on the signal, which is given in the track
function.
"""
struct CarrierReplicaGPU{
Member:

The CarrierReplicaGPU is never used, is it? I guess we can get rid of this structure.

It is determined based on the signal, which is given in the track
function.
"""
struct DownconvertedSignalGPU{
Member:

The DownconvertedSignalGPU is never used, is it? I guess we can get rid of this structure.

DS <: DownconvertedSignalCPU, # Union{DownconvertedSignalCPU, CuArray{ComplexF32}}
CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}}
COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}}
DS <: Union{DownconvertedSignalCPU, DownconvertedSignalGPU},
Member:

I think you can leave DownconvertedSignalCPU even for the GPU case and initialize that as an empty array.

CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}}
COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}}
DS <: Union{DownconvertedSignalCPU, DownconvertedSignalGPU},
CAR <: Union{CarrierReplicaCPU, CarrierReplicaGPU},
Member:

I think you can leave CarrierReplicaCPU even for the GPU case and initialize that as an empty array.

src/tracking_state.jl (resolved)
coezmaden (Contributor, Author):

@zsoerenm Thanks for the feedback. ccb75f0 deals with the first two comments; the others are handled by 2a24f7d. The latter even improves the performance:

Before 2a24f7d

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   82.073 μs … 22.217 ms  ┊ GC (min … max): 0.00% … 38.81%
 Time  (median):      86.151 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   103.747 μs ± 487.141 μs  ┊ GC (mean ± σ):  3.95% ±  0.84%

  ▆█▇▄▄▅▄▃▂▁                                                    ▂
  ██████████▇▆▆▇▇█▇▆▇▇▇▆▇▇▆▆▅▅▅▅▅▅▅▅▅▆▄▄▄▅▃▅▅▅▃▁▁▁▄▄▁▁▃▄▁▁▄▃▄▅▄ █
  82.1 μs       Histogram: log(frequency) by time        211 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 418.

After 2a24f7d

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  64.100 μs … 21.582 ms  ┊ GC (min … max): 0.00% … 38.88%
 Time  (median):     66.980 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.030 μs ± 475.433 μs  ┊ GC (mean ± σ):  4.97% ±  0.86%

   ██                                                           
  ▄██▇▃▃▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂ ▂
  64.1 μs         Histogram: frequency by time          134 μs <

 Memory estimate: 29.33 KiB, allocs estimate: 406.

if sample_idx <= num_samples && antenna_idx <= num_ants && corr_idx <= num_corrs
# generate carrier
carrier_im[sample_idx], carrier_re[sample_idx] = CUDA.sincos(2π * ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))
Member:

For what purpose do you save the sincos result in a vector? Is it for performance improvements in the future? If this is the case, let's dismiss that here to have a clean baseline.
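One way to follow this suggestion (illustrative only): compute the carrier sample in thread-local registers and consume it immediately in the same thread, rather than round-tripping it through the `carrier_re`/`carrier_im` vectors in global memory.

```julia
# Illustrative: keep the carrier sample in registers instead of
# writing it to carrier_im/carrier_re vectors first.
c_im, c_re = CUDA.sincos(2π *
    ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))
# ...then use c_re / c_im directly in the downconversion for this sample.
```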

@@ -51,6 +50,7 @@ function DownconvertedSignalCPU(num_ants::NumAnts{N}) where N
)
end

@inline gen_blank_carrier_gpu(system::S, num_samples) where {CO <: CuMatrix,S <: AbstractGNSS{CO}} = StructArray{ComplexF32}((CuArray{Float32}(undef, num_samples), CuArray{Float32}(undef, num_samples)))
Member:

I think this can be dismissed, if you do not save the sincos result in a vector.

Member:

This isn't used anymore, is it? So I think you can get rid of this, too.

CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}}
COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}}
DS <: Union{DownconvertedSignalCPU, Nothing},
CAR <: Union{CarrierReplicaCPU, StructVector{ComplexF32, NamedTuple{(:re, :im), Tuple{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Int64}},
Member:

Same as above.

coezmaden (Contributor, Author):

@zsoerenm I have implemented your proposal. Now that I have thought about it, I agree it makes more sense to just set everything to nothing. No significant regression in the execution time.

Before eab881c:

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  65.749 μs … 24.988 ms  ┊ GC (min … max): 0.00% … 46.31%
 Time  (median):     69.804 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.254 μs ± 509.974 μs  ┊ GC (mean ± σ):  4.92% ±  0.90%

  █▆▅▃▂▁▁▁▁                                                    ▁
  ██████████▇▆▆▄▄▆▆▅▅▄▄▅▄▆▄▅▄▃▃▅▃▃▄▅▅▅▃▄▄▅▄▅▆▅▃▅▁▅▅▅▄▄▃▃▁▁▅▄▄▄ █
  65.7 μs       Histogram: log(frequency) by time       381 μs <

 Memory estimate: 29.33 KiB, allocs estimate: 406.

After eab881c:

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  65.857 μs … 25.658 ms  ┊ GC (min … max): 0.00% … 33.48%
 Time  (median):     69.429 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.554 μs ± 426.402 μs  ┊ GC (mean ± σ):  3.05% ±  0.63%

  █▆▄▃▁▁▁▁▁▁                                                   ▁
  ███████████▆▆▆▇▅▆▅▅▄▅▅▄▅▅▆▄▄▅▄▄▁▁▄▄▄▅▃▁▄▄▃▃▃▅▅▄▄▅▃▄▅▄▄▄▄▄▄▄▃ █
  65.9 μs       Histogram: log(frequency) by time       366 μs <

 Memory estimate: 26.23 KiB, allocs estimate: 388.

@zsoerenm (Member) left a comment:

Alright, looks good.
Just two minor code smells: one is duplicate code, and the other is a function that isn't used anymore.
BTW: I'm surprised that writing to a CuArray in the kernel does not generate an overhead.

Comment on lines 361 to 363
function choose(replica::Nothing, signal::AbstractArray)
nothing
end
Member:

Those three lines are exactly the same as lines 370-372, so I think we can get rid of this.

@@ -51,6 +50,7 @@ function DownconvertedSignalCPU(num_ants::NumAnts{N}) where N
)
end

@inline gen_blank_carrier_gpu(system::S, num_samples) where {CO <: CuMatrix,S <: AbstractGNSS{CO}} = StructArray{ComplexF32}((CuArray{Float32}(undef, num_samples), CuArray{Float32}(undef, num_samples)))
Member:

This isn't used anymore, is it? So I think you can get rid of this, too.

coezmaden (Contributor, Author):

I have removed the mentioned code smells.

BTW: I'm surprised that writing to a CuArray in the kernel does not generate an overhead.

Yeah, strange. I wonder why that is; I have run the benchmark in multiple independent sessions just to be sure, and the results stay the same.

@zsoerenm zsoerenm merged commit fc50604 into JuliaGNSS:master Nov 17, 2021