Hello, I’m attempting to use the Silero VAD model on a Hugging Face dataset via the map function with multiple workers for parallel processing. Here’s the relevant code snippet:
However, I’m encountering an issue that generates the following error message:
I'm unsure how to correctly pass the VAD model to the datasets.map function and make it run with multiprocessing. Any assistance you can provide would be greatly appreciated. Thank you for your time. Best regards.
-
The correct low-level way to run VAD with multiple Python processes is as follows:
I am not sure how it works internally, but it looks like it pickles the objects it passes to sub-processes. You have 4 options:
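To illustrate the pickling point: multiprocessing serializes every argument it sends to a worker with pickle, and model objects that wrap native handles typically cannot be pickled. The sketch below uses a hypothetical `FakeVAD` stand-in (not the real Silero VAD model); the lambda attribute mimics an unpicklable native handle.

```python
import pickle


class FakeVAD:
    """Hypothetical stand-in for a VAD model. The lambda mimics an
    unpicklable native handle held by a real PyTorch/ONNX model object."""
    def __init__(self):
        self.session = lambda audio: audio  # local lambdas cannot be pickled


def is_picklable(obj):
    """multiprocessing ships arguments to workers via pickle, so this
    check mirrors what happens when a model is passed to datasets.map."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


print(is_picklable([1.0, 2.0, 3.0]))  # plain data: True
print(is_picklable(FakeVAD()))        # model-like object: False
```

Plain data (audio arrays, paths, timestamps) crosses the process boundary fine; the model itself should not.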
Have a separate init function that is invoked at the start of EACH process. This function should load the VAD model and, if necessary, set the number of CPU threads for PyTorch; the same applies to ONNX. Please note that both PyTorch and ONNX models are not plain Python objects, but merely pointers to underlying native objects;
Python has many APIs for multiprocessing (Process, Pool, ProcessPoolExecutor, etc.). Many of them accept a custom init-function parameter. The key here is NOT to reuse the same pointer, but to create a truly separate model instance in each process;
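The per-process init pattern above can be sketched with `ProcessPoolExecutor`'s `initializer` parameter. A dict stands in for the model here; in real code `init_worker` would load Silero VAD (e.g. via `torch.hub.load`) and call `torch.set_num_threads()` — those names are assumptions about your setup, not part of this runnable sketch.

```python
import os
from concurrent.futures import ProcessPoolExecutor

_model = None  # process-local global; each worker builds its own instance


def init_worker():
    """Invoked once at the start of EACH worker process. In real code this
    is where you would load the VAD model and set PyTorch/ONNX thread
    counts; a dict stands in for the model here."""
    global _model
    _model = {"name": "vad-stub", "pid": os.getpid()}


def process_chunk(chunk):
    """Runs inside a worker and uses its process-local model, so no model
    object is ever pickled and shipped from the parent process."""
    return _model["name"], chunk * 2


def run_parallel(chunks, workers=2):
    # initializer runs init_worker in every spawned/forked worker process
    with ProcessPoolExecutor(max_workers=workers, initializer=init_worker) as ex:
        return list(ex.map(process_chunk, chunks))


if __name__ == "__main__":
    print(run_parallel([1, 2, 3]))
```

Because each worker constructs its own model in `init_worker`, the parent never needs to serialize a model instance, which avoids the pickling error entirely.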