This repository has been archived by the owner on Sep 25, 2023. It is now read-only.

[QST] Performance improvement for upfirdn #582

Open
NiclasEsser1 opened this issue Jun 20, 2023 · 3 comments

Comments

@NiclasEsser1

Hi, I'm using cuSignal in my real-time processing application. The upfirdn kernel is bottlenecking my application. To me, it does not look highly optimized, as it makes no use of shared memory (e.g. for the FIR taps) or Tensor Cores. Do you think the upfirdn kernels will be improved in the future?
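For reference, the computation upfirdn performs (upsample, FIR filter, downsample) can be sketched in plain NumPy; this is a minimal CPU reference following scipy.signal.upfirdn's conventions, which I'm assuming cuSignal mirrors:

```python
import numpy as np

def upfirdn_ref(h, x, up=1, down=1):
    """Reference upsample-filter-downsample (scipy.signal.upfirdn conventions)."""
    h = np.asarray(h)
    x = np.asarray(x)
    # Upsample: insert up-1 zeros between consecutive input samples.
    xu = np.zeros((len(x) - 1) * up + 1, dtype=np.result_type(x, h))
    xu[::up] = x
    # FIR filter: full linear convolution with the taps.
    y = np.convolve(xu, h)
    # Downsample: keep every down-th output sample.
    return y[::down]

# Every output sample touches all len(h) taps, which is why caching the
# taps (e.g. in GPU shared memory) could amortize their loads per block.
print(upfirdn_ref([1, 1], [1, 2, 3], up=2))  # -> [1 1 2 2 3 3]
```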

In any case, I like the cuSignal library and appreciate your work!

Greetings

@NiclasEsser1 NiclasEsser1 added ? - Needs Triage Need team to review and classify question Further information is requested labels Jun 20, 2023
@awthomp
Member

awthomp commented Jun 20, 2023

Hi @NiclasEsser1 -- Thanks for the kind words on cuSignal, and I'm pleased you're having a positive experience with the library.

Could you provide your NSight Systems profile showing current upfirdn performance for your respective data sizes/types? What performance is needed?

@mnicely wrote the original upfirdn kernel and may be able to provide some suggestions for performance improvements.

@awthomp awthomp self-assigned this Jun 20, 2023
@awthomp awthomp removed the ? - Needs Triage Need team to review and classify label Jun 20, 2023
@NiclasEsser1
Author

NiclasEsser1 commented Jun 20, 2023

Hi @awthomp, thanks for the fast response. I've attached a screenshot of an Nsight Systems profile. Here, I'm writing int8 data to my processing pipeline, performing a cuBLAS complex64 multiplication with an oscillator signal (mixing), downsampling the signal, and applying further filtering.

The size and height of the 2D upfirdn input vary depending on the configuration options. However, in the profile I used a complex64 matrix of size 2x33554432.

In some configurations (e.g. a 1D matrix) with an up-to-down ratio of 64, the pipeline reaches an input data rate of up to 3.5 GB/s (with int8). I need to achieve at least 4 GB/s. The best case would be an input data rate of 6 GB/s with a matrix of 4 "rows".
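To make the budget concrete, here is my own back-of-envelope check; it assumes one int8 byte per element of the 2x33554432 matrix quoted above, which may not match the exact block layout:

```python
# Per-block latency budget implied by the target input rates (assumed
# mapping: one int8 sample per matrix element, so 2 * 33554432 bytes).
rows, cols = 2, 33554432
in_bytes = rows * cols          # int8 input block size in bytes
target = 4e9                    # required input rate, bytes/s
achieved = 3.5e9                # currently measured rate, bytes/s

budget_ms = in_bytes / target * 1e3
current_ms = in_bytes / achieved * 1e3
print(f"pipeline must finish each block in {budget_ms:.1f} ms "
      f"(currently ~{current_ms:.1f} ms)")
```

So under these assumptions the whole pipeline has roughly 16.8 ms per block, and is currently taking about 19.2 ms.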

[Screenshot: Nsight Systems profile]

If you're interested, here is a python class that implements the pipeline - the actual processing starts in line 314

Edit 1: Benchmark running on RTX 3090

Edit 2: The upfirdn function always applies zero padding at the start and end of a batch. In my streaming application, I'm using additional padding anyway to get time-coherent outputs, so this isn't efficient (at least for my app).
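To illustrate the padding point: with a length-M filter, the first and last M-1 pre-decimation outputs of each batch are edge transients from the implicit zero padding, and a streaming caller that already overlaps its batches ends up computing and discarding them. A small NumPy demo (scipy-style full-convolution conventions assumed):

```python
import numpy as np

h = np.ones(4)                 # example 4-tap boxcar FIR
x = np.ones(8)                 # one "batch" of samples

# Full convolution == zero padding at both edges of the batch:
y = np.convolve(x, h)          # length len(x) + len(h) - 1 = 11
print(y[:4])                   # leading edge ramps up: [1. 2. 3. 4.]
print(y[-3:])                  # trailing edge ramps down: [3. 2. 1.]

# Only the middle len(x) - len(h) + 1 samples are free of edge effects;
# a streaming pipeline that overlaps batches by len(h) - 1 samples
# (overlap-save style) could skip computing the padded edges entirely.
valid = np.convolve(x, h, mode="valid")
print(valid)                   # [4. 4. 4. 4. 4.]
```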

Edit 3: Updated the link to the repo. For reference, the repo is a plug-in for our radio astronomy backends and supports an observation mode called Very Long Baseline Interferometry (VLBI). The main application is digital down conversion.

@awthomp
Member

awthomp commented Jun 20, 2023

Thanks, @NiclasEsser1! This is exactly the detail I was looking for. We'll take a closer look over the next week or so and respond with any questions/comments we have.

As an aside, I'm getting a dead link for the python class that implements your pipeline. We can, of course, generate synthetic data for perf benchmarking, but working from the same baseline is helpful too!
