Adding parallelism using ProcessPoolExecutor and concurrent.futures #314

Open
wants to merge 2 commits into master
Conversation

pvbhanuteja

adding parallelism using ProcessPoolExecutor and concurrent.futures

@snakers4
Owner

snakers4 commented Apr 2, 2023

Hi,

Many thanks for your contribution to the examples!

determine_label_name

This function seems too specific for a certain task.
Could you maybe trim the example so that it would be generally applicable?

def process_wav_file(wav_file: str):

Looks like you initialize a VAD each time a new file gets processed.
Maybe there is a way to pass a custom init function to the ProcessPoolExecutor class, so that the VAD gets initialized only once per process? (Also, please do not forget to reset the state of the model each time, or to make sure this is done inside the imported function.)

For long files this overhead will be negligible, but I believe we should do our best to provide the best example we can, because sometimes, e.g. in a multi-threaded application, it may lead to painful memory leaks.

You see, if you are using PyTorch (not ONNX), it actually creates a reference to the underlying object. In this particular case it hardly matters, but in general I guess the proper way is to initialize a separate model instance per process.
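A minimal sketch of that pattern (assuming the standard snakers4/silero-vad torch.hub entry point and its utils tuple; worker count, file glob, and sampling rate are placeholders): each worker process loads its own model in the pool initializer, so nothing has to be pickled or shared from the parent.

```python
from concurrent.futures import ProcessPoolExecutor
from glob import glob

import torch

NUM_PROCESS = 4
SAMPLING_RATE = 16000

# Populated once per worker process by the initializer below.
vad_model = None
vad_utils = None


def init_worker():
    """Runs once in every child process: load a fresh VAD instead of sharing the parent's."""
    global vad_model, vad_utils
    vad_model, vad_utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')


def process_wav_file(wav_file: str):
    get_speech_timestamps, _, read_audio, _, _ = vad_utils
    vad_model.reset_states()  # reset per file, as requested in the review
    wav = read_audio(wav_file, sampling_rate=SAMPLING_RATE)
    return wav_file, get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLING_RATE)


if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=NUM_PROCESS, initializer=init_worker) as ex:
        for name, speech_timestamps in ex.map(process_wav_file, glob('*.wav')):
            print(name, speech_timestamps)
```

Loading inside the initializer keeps the model out of the pickled task arguments, and reset_states() keeps results independent between files.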

@pvbhanuteja
Author

Hello @snakers4

Spent a few hours working to pickle the model, and found a small workaround; please take a look at the new commit. Cleaned up the code a bit as well.

@snakers4
Owner

snakers4 commented Apr 4, 2023

Hi,

def determine_label_name(wav_file):

Looks like this is not used.

And found a small workaround; please take a look at the new commit.

with ProcessPoolExecutor(max_workers=NUM_PROCESS, initializer=initialize_vad, initargs=(model, utils)) as ex:

def initialize_vad(model, utils):
    global vad_model, vad_utils
    vad_model = model
    vad_utils = utils

Looks like you nevertheless initialize the model in the main process and then pass it as a reference to the ProcessPoolExecutor worker processes.

Why not just initialize a separate instance of the model in each child process, i.e. in initialize_vad, each time a new process is spun up?

Spent a few hours working to pickle the model.

You do not really need to pass any parameters to the init. You can write some plain logs to make sure that several instances of the model were created in parallel.
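A hypothetical sketch of such a check: logging the process id and the model object's id from inside a parameterless initializer shows whether each worker built its own instance.

```python
import logging
import os

import torch

logging.basicConfig(level=logging.INFO, format='%(message)s')

vad_model = None
vad_utils = None


def initialize_vad():
    """Per-process initializer: no arguments needed, each worker loads its own model."""
    global vad_model, vad_utils
    vad_model, vad_utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
    # One line per worker process; different pids with different object ids
    # confirm that several model instances were created in parallel.
    logging.info('worker pid=%s loaded model id=%s', os.getpid(), id(vad_model))
```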

@snakers4 snakers4 changed the title adding parallelism using ProcessPoolExecutor and concurrent.futures Adding parallelism using ProcessPoolExecutor and concurrent.futures Apr 5, 2023
@sobomax

This comment was marked as off-topic.

@snakers4
Owner

Here is my batched vs unbatched example; the second variant executes in about 0.5s on A770, which is 200x real-time with batch size=50 and is not even breaking a sweat.

For a real-time system (the VAD is intended for this use case), a latency of about 500ms is not very good when processing a single 30ms chunk takes about 1ms.

Also batching works fine when you process a lot of files locally, but it becomes problematic when you can receive chunks asynchronously. Of course you can keep the state for each batch element and pass it back and forth for the next chunk in a messaging-like system, but we decided not to maintain public examples of such complexity.

@sobomax

sobomax commented Apr 18, 2024

Here is my batched vs unbatched example; the second variant executes in about 0.5s on A770, which is 200x real-time with batch size=50 and is not even breaking a sweat.

For a real-time system (the VAD is intended for this use case), a latency of about 500ms is not very good when processing a single 30ms chunk takes about 1ms.

Also batching works fine when you process a lot of files locally, but it becomes problematic when you can receive chunks asynchronously. Of course you can keep the state for each batch element and pass it back and forth for the next chunk in a messaging-like system, but we decided not to maintain public examples of such complexity.

Well, I still don't see why provide an example with batch size=1. If someone has 200 files, those can be processed by a single executor with batch size 200, not by 200 processes in parallel.

@sobomax

sobomax commented Apr 18, 2024

For a real-time system (the VAD is intended for this use case), a latency of about 500ms is not very good when processing a single 30ms chunk takes about 1ms.

Oh, that's the whole processing time for 50 wavs of ~10s each. Chunk-wise it was obviously 0.5s/nchunks ~ 5ms.

For a real-time system (the VAD is intended for this use case), a latency of about 500ms is not very good when processing a single 30ms chunk takes about 1ms.

Also batching works fine when you process a lot of files locally, but it becomes problematic when you can receive chunks asynchronously. Of course you can keep the state for each batch element and pass it back and forth for the next chunk in a messaging-like system, but we decided not to maintain public examples of such complexity.

Well, it's not terribly complicated. I just implemented it (save/restore) for our own use in a couple of hours.

Arguably, to make the best use of your GPU resources (i.e. execution units, RAM, scheduling, etc.), you need to batch up local tasks, not multiplex them. With proper batching I can multiplex a small number of functionally different models (i.e. TTS, STT, VAD, etc.) for, say, 50 channels on a single GPU today in real time, but I cannot multiplex 150 copies of the same 3 models. Well, I can, but on 10x the number of GPUs. ;-)

So giving only examples of batch size 1 promotes the most inefficient way to use the model, IMHO.
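As a rough, hypothetical sketch of the batched alternative being argued for here (it assumes the loaded silero-vad JIT model accepts a (batch, num_samples) tensor of equal-length chunks, and it leaves per-stream state handling aside):

```python
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')

SR = 16000
CHUNK = 512          # assumed chunk length in samples at 16 kHz
N_CHANNELS = 50      # e.g. 50 live channels multiplexed onto one GPU

# Stand-in audio: one pending chunk per channel, stacked into a single batch.
batch = torch.stack([torch.randn(CHUNK) for _ in range(N_CHANNELS)])

with torch.no_grad():
    speech_probs = model(batch, SR)   # one forward pass instead of 50 separate calls

print(speech_probs.shape)             # one speech probability per channel
```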

@sobomax

sobomax commented Apr 18, 2024

@snakers4 P.S. Sorry for hiding that comment; it's a bit late here, and I wanted to give it a fresh read and re-post it in the morning, but you beat me to it. 😂

@pablodz

pablodz commented Jul 28, 2024

Do not use Python for concurrency; the GIL will generate CPU spikes.
