Cuda synchronize alternative for profiling #304

Open
aimilefth opened this issue Jul 13, 2022 · 8 comments

@aimilefth

Greetings,

I am currently using TF-TRT and I want to measure the performance of my models (latency, throughput).

The TensorRT C++ API provides CUDA synchronization via the CUDA events API: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-events

On top of that, PyTorch offers torch.cuda.synchronize():
https://pytorch.org/docs/stable/generated/torch.cuda.synchronize.html
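
For reference, the usual timing pattern in PyTorch looks roughly like this (a minimal sketch with a placeholder model, just to illustrate what I mean):

import time
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model for illustration
x = torch.randn(64, 1024, device="cuda")

torch.cuda.synchronize()   # make sure any pending GPU work is done
start = time.perf_counter()
out = model(x)
torch.cuda.synchronize()   # wait for the GPU before stopping the timer
elapsed = time.perf_counter() - start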

However, in the TF-TRT docs I can't find anything similar, which in my opinion is essential in order to correctly measure performance metrics.

Have I missed anything or are there plans to integrate such functionality?

Thank you

@ncomly-nvidia

Hi @aimilefth, you are correct on all counts. This is critical for measuring performance in TensorFlow; however, the APIs do not currently exist in TF (not just TF-TRT). We are in the process of adding such APIs. @DEKHTIARJonathan can add more.

You can also check out the benchmarking scripts for how TF-TRT overcomes this currently.

@slai-natanijel commented Jan 12, 2023

@ncomly-nvidia I was looking at the TensorRT ResNet50 benchmarking example here. The throughput seems exceptionally high, almost 250,000 IPS on the T4, whereas MLPerf reports 39,000 IPS for the A100, which is a better GPU.

Is the use of time.perf_counter() correct here, i.e. just putting it around the inference function?
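
To be concrete, I mean something roughly like this (a minimal sketch; infer and batch are placeholder names for the benchmark's inference function and input):

import time

start = time.perf_counter()
pred = infer(batch)  # does this reliably capture the GPU work, or can it return early?
elapsed = time.perf_counter() - start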

@DEKHTIARJonathan (Collaborator)

@slai-natanijel what is the input size for MLPerf? TF-TRT uses MNIST (a very small input size) for demo purposes.
We chose MNIST because it's easy to download and use; it's clearly not comparable to the performance you would get with an input size 10x larger (so 100x more pixels).

@slai-natanijel commented Jan 13, 2023

@DEKHTIARJonathan Ah yes, you are right - MLPerf uses 224x224x3 images. However, when I tested on an A100 with this image size, I got around 700,000 IPS (expected ~30,000 IPS) when I wrapped time.perf_counter() around an inference call.

So how do your benchmarking scripts overcome the synchronisation issue currently?

@DEKHTIARJonathan (Collaborator)

@slai-natanijel let me guess... Did you call '.numpy()' on the result or otherwise resynchronize the GPU after the computation, before the final perf_counter() call?

Don't forget that TF is eagerly executed, which means there is no guarantee the computation is actually finished when 'result = model(data)' returns.

@slai-natanijel commented Jan 16, 2023

@DEKHTIARJonathan
I tried the following:

start = time.perf_counter()
pred = func(x)['predictions'].numpy()  # .numpy() blocks until the GPU result is available on the host
end = time.perf_counter()

where func(x) is an inference call to TensorRT. I get more reasonable IPS numbers with the above code, although I can't estimate how much overhead is added by the dictionary access and .numpy().

@DEKHTIARJonathan (Collaborator) commented Jan 16, 2023

@slai-natanijel actually that's a very good point ;) And it's a lot... Even worse, the overhead varies quite a bit due to the nature of memcpyDtoH...

But you're in luck my friend :)

We actually are adding a feature in TensorFlow right now to address this issue: tensorflow/community#434

Now in the meantime, you can use a little bit of TensorFlow dark magic to minimize that overhead:

import tensorflow as tf

def force_gpu_resync(func):
    p = tf.constant(0.)  # small tensor captured once, used only to force a GPU sync

    def wrapper(*args, **kwargs):
        rslt = func(*args, **kwargs)
        (p + 1.).numpy()  # tiny GPU op + device-to-host copy: blocks until the GPU is done
        return rslt

    return wrapper

model = ...  # a TF function, Eager Function, TF-TRT converted model, etc.
model = force_gpu_resync(model)

It will add very minor overhead; until the RFC above is merged, it's the best you can do.
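
For illustration, a timing loop using this wrapper could look like the following (a minimal sketch: the tf.function here is just a stand-in for a real TF-TRT converted model, and the warm-up/iteration counts are arbitrary):

import time
import tensorflow as tf

@tf.function
def model(x):  # placeholder for a TF function, Eager function, or TF-TRT converted model
    return tf.nn.relu(tf.matmul(x, x))

model = force_gpu_resync(model)
x = tf.random.uniform((1024, 1024))

for _ in range(10):  # warm-up: keeps tracing / engine-build cost out of the measurement
    model(x)

iters = 100
start = time.perf_counter()
for _ in range(iters):
    model(x)
elapsed = time.perf_counter() - start
print(f"{iters / elapsed:.1f} inferences/sec")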

@slai-natanijel may I ask which company you work for? That way we can follow up with you.

@slai-natanijel

Great - I'll be watching the sync API!
I tried your code snippet - I think it works fine, although there is no noticeable difference in performance compared to the numpy method. I guess if the output tensor is large, then we'd see a bigger difference.
My email is [email protected]
