This repository has been archived by the owner on Sep 25, 2023. It is now read-only.

[QST] Performance improvement for upfirdn #582

Open
NiclasEsser1 opened this issue Jun 20, 2023 · 3 comments

Comments

@NiclasEsser1

Hi, I'm using cuSignal in my real-time processing application. The upfirdn kernel is bottlenecking my application. To me, it does not look highly optimized, as it makes no use of shared memory (e.g. for the FIR taps) or Tensor Cores. Do you think the upfirdn kernels will be improved in the future?
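For reference, the computation upfirdn performs (upsample, FIR filter, downsample) can be sketched in plain NumPy; this is a minimal CPU reference following scipy.signal.upfirdn's conventions, which I'm assuming cuSignal mirrors:

```python
import numpy as np

def upfirdn_ref(h, x, up=1, down=1):
    """Reference upsample-filter-downsample (scipy.signal.upfirdn conventions)."""
    h = np.asarray(h)
    x = np.asarray(x)
    # Upsample: insert up-1 zeros between consecutive input samples.
    xu = np.zeros((len(x) - 1) * up + 1, dtype=np.result_type(x, h))
    xu[::up] = x
    # FIR filter: full linear convolution with the taps.
    y = np.convolve(xu, h)
    # Downsample: keep every down-th output sample.
    return y[::down]

# Every output sample touches all len(h) taps, which is why caching the
# taps (e.g. in GPU shared memory) could amortize their loads per block.
print(upfirdn_ref([1, 1], [1, 2, 3], up=2))  # -> [1 1 2 2 3 3]
```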

In any case, I like the cuSignal library and appreciate your work!

Greetings

@NiclasEsser1 NiclasEsser1 added ? - Needs Triage Need team to review and classify question Further information is requested labels Jun 20, 2023
@awthomp
Member

awthomp commented Jun 20, 2023

Hi @NiclasEsser1 -- Thanks for the kind words on cuSignal, and I'm pleased you're having a positive experience with the library.

Could you provide your NSight Systems profile showing current upfirdn performance for your respective data sizes/types? What performance is needed?

@mnicely wrote the original upfirdn kernel and may be able to provide some suggestions for performance improvements.

@awthomp awthomp self-assigned this Jun 20, 2023
@awthomp awthomp removed the ? - Needs Triage Need team to review and classify label Jun 20, 2023
@NiclasEsser1
Author

NiclasEsser1 commented Jun 20, 2023

Hi @awthomp, thanks for the fast response. I've attached a screenshot of an Nsight Systems profile. Here, I'm writing int8 data to my processing pipeline, performing a cuBLAS complex64 multiplication with an oscillator signal (mixing), downsampling the signal, and applying further filtering.

The size and height of the 2D upfirdn input vary depending on the configuration options. However, in the profile I used a complex64 matrix of size 2x33554432.

In some configurations (e.g. a 1D matrix) with an up-to-down ratio of 64, the pipeline reaches an input data rate of up to 3.5 GB/s (with int8). I need to achieve at least 4 GB/s. The best case would be an input data rate of 6 GB/s with a matrix of 4 "rows".
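To make the budget concrete, here is my own back-of-envelope check; it assumes one int8 byte per element of the 2x33554432 matrix quoted above, which may not match the exact block layout:

```python
# Per-block latency budget implied by the target input rates (assumed
# mapping: one int8 sample per matrix element, so 2 * 33554432 bytes).
rows, cols = 2, 33554432
in_bytes = rows * cols          # int8 input block size in bytes
target = 4e9                    # required input rate, bytes/s
achieved = 3.5e9                # currently measured rate, bytes/s

budget_ms = in_bytes / target * 1e3
current_ms = in_bytes / achieved * 1e3
print(f"pipeline must finish each block in {budget_ms:.1f} ms "
      f"(currently ~{current_ms:.1f} ms)")
```

So under these assumptions the whole pipeline has roughly 16.8 ms per block, and is currently taking about 19.2 ms.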

[Screenshot: Nsight Systems profile]

If you're interested, here is a python class that implements the pipeline - the actual processing starts in line 314

Edit 1: Benchmark running on RTX 3090

Edit 2: The upfirdn function always applies zero padding at the start and end of a batch. In my streaming application, I'm using additional padding anyway to get time-coherent outputs, so this isn't efficient (at least for my app).
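To illustrate the padding point: with a length-M filter, the first and last M-1 pre-decimation outputs of each batch are edge transients from the implicit zero padding, and a streaming caller that already overlaps its batches ends up computing and discarding them. A small NumPy demo (scipy-style full-convolution conventions assumed):

```python
import numpy as np

h = np.ones(4)                 # example 4-tap boxcar FIR
x = np.ones(8)                 # one "batch" of samples

# Full convolution == zero padding at both edges of the batch:
y = np.convolve(x, h)          # length len(x) + len(h) - 1 = 11
print(y[:4])                   # leading edge ramps up: [1. 2. 3. 4.]
print(y[-3:])                  # trailing edge ramps down: [3. 2. 1.]

# Only the middle len(x) - len(h) + 1 samples are free of edge effects;
# a streaming pipeline that overlaps batches by len(h) - 1 samples
# (overlap-save style) could skip computing the padded edges entirely.
valid = np.convolve(x, h, mode="valid")
print(valid)                   # [4. 4. 4. 4. 4.]
```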

Edit 3: Updated the link to the repo. For reference, the repo is a plug-in for our radio astronomy backends and supports an observation mode called Very Long Baseline Interferometry (VLBI). The main application is digital down conversion.

@awthomp
Member

awthomp commented Jun 20, 2023

Thanks, @NiclasEsser1! This is exactly the detail I was looking for. We'll take a closer look over the next week or so and respond with any questions/comments we have.

As an aside, I'm getting a dead link for the python class that implements your pipeline. We can, of course, generate synthetic data for perf benchmarking, but working from the same baseline is helpful too!
