This repository has been archived by the owner on Sep 25, 2023. It is now read-only.
Hi, I'm using cuSignal in my real-time processing application, and the upfirdn kernel is bottlenecking it. To me, the upfirdn kernel does not look highly optimized, as it makes no use of shared memory (e.g. for the FIR taps) or Tensor Cores. Do you think the upfirdn kernels will be improved in the future?
In any case, I like the cuSignal library and appreciate your work!
Regards
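For context: cusignal.upfirdn follows the scipy.signal.upfirdn interface (upsample by `up`, apply the FIR taps `h`, downsample by `down` in a single pass). A minimal sketch of such a call, with placeholder sizes and a placeholder filter design rather than anything from the application discussed here:

```python
import cupy as cp
import cusignal
from scipy.signal import firwin

# Placeholder parameters for illustration only -- not values from this issue
down = 64                                             # decimation factor
taps = cp.asarray(firwin(8 * down + 1, 1.0 / down),   # low-pass FIR for the
                  dtype=cp.complex64)                 # decimated Nyquist rate
x = cp.random.randn(2**22).astype(cp.complex64)       # input signal

# Upsample by 1, filter with `taps`, downsample by `down` in one pass
y = cusignal.upfirdn(taps, x, up=1, down=down)
```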
Hi @awthomp, thanks for the fast response. I've attached a screenshot of an Nsight Systems profile. Here, I'm writing int8 data into my processing pipeline, performing a cuBLAS complex64 multiplication with an oscillator, downsampling the signal, and applying further filtering.
The size and the number of rows of the 2D upfirdn input vary with the configuration options; in the profile I used a complex64 matrix of size 2x33554432.
In some configurations (e.g. a 1D matrix) with an up-to-down ratio of 64, the pipeline reaches an input data rate of up to 3.5 GB/s (int8). I need to achieve at least 4 GB/s; the best case would be an input data rate of 6 GB/s with a matrix of 4 rows.
If you're interested, here is a Python class that implements the pipeline; the actual processing starts at line 314.
Edit 1: The benchmark was run on an RTX 3090.
Edit 2: The upfirdn function always applies zero padding at the start and end of a batch. In my streaming application I add my own padding anyway to get time-coherent outputs, so the built-in padding isn't efficient (at least for my app).
Edit 3: Updated the link to the repo. For reference, the repo is a plug-in for our radio astronomy backends and supports an observation mode called Very Long Baseline Interferometry (VLBI). The main application is digital down conversion.
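For readers following along, here is a rough sketch of what a digital down conversion stage of this shape looks like in CuPy/cuSignal terms. Everything below (array sizes, mixer frequency, tap count) is a placeholder chosen for illustration, not taken from the actual plug-in:

```python
import cupy as cp
import cusignal
from scipy.signal import firwin

# Placeholder parameters -- not taken from the plug-in discussed above
n_rows, n_samp = 2, 33_554_432     # matrix shape mentioned in the profile
decimation = 64                    # up-to-down ratio mentioned above
f_lo = 0.123                       # normalized mixer frequency (made up)

# Incoming int8 samples, promoted to complex64 for the mixing stage
raw = cp.random.randint(-128, 128, (n_rows, n_samp), dtype=cp.int8)
signal = raw.astype(cp.complex64)

# Mix down with a complex oscillator (element-wise complex64 multiply)
t = cp.arange(n_samp, dtype=cp.float32)
mixed = signal * cp.exp(-2j * cp.pi * f_lo * t).astype(cp.complex64)

# Low-pass FIR filter + decimate in one step via upfirdn
taps = cp.asarray(firwin(8 * decimation + 1, 1.0 / decimation),
                  dtype=cp.complex64)
baseband = cusignal.upfirdn(taps, mixed, up=1, down=decimation, axis=-1)
```

The upfirdn call at the end corresponds to the kernel reported as the bottleneck in this issue.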
Thanks, @NiclasEsser1! This is exactly the detail I was looking for. We'll take a closer look over the next week or so and respond with any questions/comments we have.
As an aside, I'm getting a dead link for the Python class that implements the pipeline. We can, of course, generate synthetic data for perf benchmarking, but working from the same baseline is helpful too!
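For anyone who wants to reproduce the numbers without the original class, a minimal synthetic benchmark might look like the sketch below. The matrix shape matches the one reported above; the FIR design and tap count are placeholders. Note that the GB/s figure here is in terms of the complex64 buffer fed to upfirdn, so it is not directly comparable to the int8 input rate quoted earlier.

```python
import cupy as cp
import cusignal
from scipy.signal import firwin

# Synthetic stand-in for the reported data shape (complex64, 2 x 33554432);
# real-valued noise cast to complex64 is enough for a throughput test.
x = cp.random.randn(2, 33_554_432).astype(cp.complex64)
taps = cp.asarray(firwin(513, 1.0 / 64), dtype=cp.complex64)  # placeholder FIR

start, stop = cp.cuda.Event(), cp.cuda.Event()
cusignal.upfirdn(taps, x, up=1, down=64, axis=-1)  # warm-up (kernel compile/caches)

start.record()
y = cusignal.upfirdn(taps, x, up=1, down=64, axis=-1)
stop.record()
stop.synchronize()

elapsed_ms = cp.cuda.get_elapsed_time(start, stop)
print(f"{elapsed_ms:.2f} ms -> "
      f"{x.nbytes / 1e9 / (elapsed_ms / 1e3):.2f} GB/s (complex64 input)")
```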