
How do I run inference with multiple models without maxing out my GPU VRAM? #55

Open
HSHallucinations opened this issue Nov 28, 2023 · 5 comments


HSHallucinations commented Nov 28, 2023

I'm trying to tag a dataset using more than one WD14 model, so I wrote a simple script that iterates over all the files in a directory for every model in a list, like this:

from imgutils.tagging import get_wd14_tags

model_list = ['SwinV2', 'ConvNextV2', 'MOAT', 'ViT']

for m in model_list:
    # directory is a pathlib.Path to the dataset root; thresh is the general tag threshold
    for child in directory.glob('**/*'):
        ratings, features, chars = get_wd14_tags(child, model_name=m, general_threshold=thresh)

My problem is that inference time increases a lot after every pass through model_list. On the first pass it takes ~0.15 seconds to extract the tags from each image, no matter how many images, but by the time I'm on the 4th pass it takes 15 seconds. However, if I run the script 4 times with a single model in the list, every model takes the same 0.15 seconds.

Running it with Task Manager open, I noticed that every time a new model is loaded, the dedicated VRAM used by Python increases by ~1.5 GB, so by the time I reach the third pass my poor 970 doesn't have any free memory left. I guess it starts spilling into system RAM, and that's why it slows down.

Is there a way to free the VRAM before loading a new model? I tried looking in the ONNX documentation but it's way above my level of understanding.

I'm running it on Windows 10 / onnxruntime-gpu / CUDA 11.8.

@narugo1992
Contributor

by the time I'm on the 4th pass it takes 15 seconds

Does this mean 15 seconds per image, or something else?

@narugo1992
Contributor

Running it with Task Manager open, I noticed that every time a new model is loaded, the dedicated VRAM used by Python increases by ~1.5 GB, so by the time I reach the third pass my poor 970 doesn't have any free memory left. I guess it starts spilling into system RAM, and that's why it slows down.

I remember the GeForce GTX 790 having at least 12 GB of VRAM, so this doesn't seem to be related to VRAM.

@HSHallucinations
Author

Yes, it's 15 seconds per image once I max out the VRAM. Unfortunately the GTX 970 has only 3.5 GB of usable VRAM; it's almost 10 years old at this point. Maybe you're thinking of some newer AMD card with a similar name.

@narugo1992
Contributor

Actually, we can release VRAM by clearing the cache. The source code is available here: https://github.com/deepghs/imgutils/blob/main/imgutils/tagging/wd14.py#L69

Here's how you can use it:

from imgutils.tagging.wd14 import _get_wd14_model

_get_wd14_model.cache_clear()

Once the cache is cleared, the previously loaded model will be released.
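
For example, applied to the loop from the original post (just a sketch; directory and thresh are assumed to be defined as in that script), you could clear the cache after finishing each model:

from imgutils.tagging import get_wd14_tags
from imgutils.tagging.wd14 import _get_wd14_model

model_list = ['SwinV2', 'ConvNextV2', 'MOAT', 'ViT']

for m in model_list:
    for child in directory.glob('**/*'):
        ratings, features, chars = get_wd14_tags(child, model_name=m, general_threshold=thresh)
    # release the cached ONNX session before the next model is loaded,
    # so only one model occupies VRAM at a time
    _get_wd14_model.cache_clear()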

However, this method is currently just a workaround. A more suitable approach would be for us to provide a complete VRAM management layer in the future. This part has already been added to the todo list.

@HSHallucinations
Author

Works perfectly for what I need to do. Thanks for the help, and also for writing this library; I spent months trying every commercial auto-tagging tool, but they were all too generic, while this does exactly what I wanted.
