
GPU memory issues when composing some of the waveform augmentations #132

Open
luisfvc opened this issue Apr 5, 2022 · 12 comments


luisfvc commented Apr 5, 2022

Hi, I have been experiencing some memory problems when using some of the transforms on the GPU. When I apply the low-pass or high-pass filtering, the memory usage of my GPU increases with each training iteration. And since I updated from v0.9.0 to the latest release, the same happens with the impulse response transform. This does not happen when I compose other transforms, like polarity inversion, gain, noise or pitch shift.

Any ideas on why this is happening? I went through the package source code but couldn't spot any bug.
Thanks & regards

Collaborator

iver56 commented Apr 5, 2022

Hi. That's curious!
I haven't noticed this issue myself, and I use LPF and HPF in some of my own training scripts.
I don't have any idea why this is happening at the moment. If you can create a minimal script that reproduces the problem, that would be helpful 👍

The impulse response transform had almost no changes between 0.9.0 and 0.10.1 🤔
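
Not from the thread, but for reference, a minimal script along these lines could be a starting point; the transform choice, batch shape and sample rate below are assumptions, and this simple loop may or may not reproduce the growth:

# Hypothetical minimal check: apply a GPU transform in a loop and log allocated
# CUDA memory to see whether it keeps growing across iterations.
import torch
from torch_audiomentations import LowPassFilter

SAMPLE_RATE = 16000  # assumed sample rate
transform = LowPassFilter(min_cutoff_freq=2000.0, max_cutoff_freq=4000.0, p=1.0)

audio = torch.randn(8, 1, SAMPLE_RATE, device="cuda")  # (batch, channels, samples)
for step in range(200):
    _ = transform(audio, sample_rate=SAMPLE_RATE)
    if step % 20 == 0:
        print(step, torch.cuda.memory_allocated() // 2**20, "MiB allocated")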

Collaborator

iver56 commented Apr 5, 2022

Do you initialize your transforms once and then reuse them many times, or do you initialize them every time you need to run them?
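
To illustrate the distinction (a sketch only; the transform and its parameters are made up):

# Pattern A: initialize the transform once and reuse it every iteration.
# Pattern B: construct a fresh transform object on every iteration.
import torch
from torch_audiomentations import LowPassFilter

SAMPLE_RATE = 16000
batches = [torch.randn(4, 1, SAMPLE_RATE, device="cuda") for _ in range(10)]  # stand-in data

lpf = LowPassFilter(min_cutoff_freq=2000.0, max_cutoff_freq=4000.0, p=0.5)
for batch in batches:  # Pattern A
    batch = lpf(batch, sample_rate=SAMPLE_RATE)

for batch in batches:  # Pattern B
    batch = LowPassFilter(min_cutoff_freq=2000.0, max_cutoff_freq=4000.0, p=0.5)(
        batch, sample_rate=SAMPLE_RATE
    )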

@migperfer

Hi! I'm running into a similar problem, but only when training on multiple GPUs. I use PyTorch Lightning. It'll take some time, but I will try to create a script to reproduce the problem. Are you also using multiple GPUs, @luisfvc?

@nukelash

Hi, I've noticed the same problem with HPF and LPF. I'm only training on a single GPU, but found that it only occurs if I'm using multiprocessing in my dataloader (i.e., num_workers > 0). Could it be related to pytorch/pytorch#13246 (comment)? That's what I thought I was debugging until I realized these filters were the real culprit.

Collaborator

iver56 commented Sep 21, 2022

Thanks, that comment helps us get closer to reproducing the bug.

@RoyJames

> Hi, I've noticed the same problem with HPF and LPF. I'm only training on a single GPU, but found that it only occurs if I'm using multiprocessing in my dataloader (i.e., num_workers > 0). Could it be related to pytorch/pytorch#13246 (comment)? That's what I thought I was debugging until I realized these filters were the real culprit.

I have the exact same experience. I had to set num_workers=0 when using torch-audiomentations. Curious if a better solution has been found?

Collaborator

iver56 commented Sep 29, 2022

Thanks RoyJames :) Just so I understand your way of using torch-audiomentations, I'd like to know:

Did you run the transforms on CPU (in each data loader worker)? And did you train the ML model on GPU?

I have added a "Known issues" section to the readme now, by the way: https://github.com/asteroid-team/torch-audiomentations#known-issues

Collaborator

iver56 commented Sep 29, 2022

I should write this article soon, to make it easier to decide whether torch-audiomentations is a good fit and how to use it. Also, it would be swell if someone/we could reproduce and fix this memory leak 😅 I don't have a lot of spare time to do it right now, but I'd love to help.


RoyJames commented Sep 29, 2022


I think (hope) I did those augmentations on the GPU since the incoming data is already on CUDA. I wrapped torch-audiomentations functions in a preprocessor class that was used as the collate function of my dataloader. While I can't provide a complete code snippet, it is something like:

# Imports and SAMPLE_RATE are assumed here (they were not in the original snippet);
# 16000 is a placeholder sample rate.
import typing as T
from collections import namedtuple
from pathlib import Path

import numpy as np
import torch
from torch_audiomentations import AddBackgroundNoise, Compose, Gain, LowPassFilter

SAMPLE_RATE = 16000


class MyPreprocessor:
    def __init__(self, noise_set: Path, device: str = "cuda"):
        self._device = device  # target device for the collated batches
        self._augmentor = Compose(
            transforms=[
                Gain(
                    min_gain_in_db=-15.0,
                    max_gain_in_db=5.0,
                    p=0.5,
                    p_mode="per_example",
                ),
                LowPassFilter(
                    min_cutoff_freq=4000.0,
                    max_cutoff_freq=8000.0,
                    p=0.5,
                    p_mode="per_example",
                ),
                AddBackgroundNoise(
                    background_paths=noise_set,
                    min_snr_in_db=0.0,
                    max_snr_in_db=30.0,
                    p=0.5,
                    p_mode="per_example",
                ),
            ]
        )

    def __call__(self, batch: T.List[T.Tuple[np.ndarray, np.ndarray]]):
        # Collate the (clean, noisy) pairs into batched tensors.
        AudioPair = namedtuple('AudioPair', ['clean', 'noisy'])
        batch_pairs = [AudioPair(pair[0], pair[1]) for pair in batch]
        batch_pairs = torch.utils.data.dataloader.default_collate(batch_pairs)
        y = batch_pairs.clean.unsqueeze(1).to(self._device)

        # Augment only the noisy branch, on the target (CUDA) device.
        x = batch_pairs.noisy.unsqueeze(1).to(self._device)
        x = self._augmentor(x, sample_rate=SAMPLE_RATE)
        return x, y

Then my dataloader looks like:

        self.train_loader = torch.utils.data.DataLoader(
            self.train_set,
            sampler=train_sampler,
            collate_fn=MyPreprocessor(noise_set=noise_set, device="cuda"),
            batch_size=BATCH_SIZE,
            drop_last=True,
            num_workers=num_workers,
            shuffle=train_shuffle,
            worker_init_fn=seed_worker,
        )

and I had to set num_workers=0 when training on more than one GPU. But please correct me if this is not the expected way. I'm running code on remote GPUs and don't really know a good way to debug memory issues (I wish to contribute; any suggestions on where to look?). Currently, this single-worker scheme works OK for me since my GPU utilization stays high.

Edit: I forgot to mention that I use the above with torch.nn.parallel.DistributedDataParallel if that's relevant.
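
One lightweight way to watch GPU memory on a headless machine (an illustrative sketch, not something settled on in this thread) is to log allocator statistics around the augmentation call:

import torch

def log_cuda_memory(tag: str) -> None:
    # Print how much CUDA memory is currently allocated and reserved, in MiB.
    allocated_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={allocated_mib:.1f} MiB, reserved={reserved_mib:.1f} MiB")

# Inside the training loop, e.g. around the augmentation step:
#   log_cuda_memory("before augment")
#   x = self._augmentor(x, sample_rate=SAMPLE_RATE)
#   log_cuda_memory("after augment")
# torch.cuda.memory_summary() gives a more detailed breakdown if needed.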

RoyJames commented Oct 5, 2022

I was able to use num_workers>0 as long as I didn't use torch-audiomentations in GPU mode as part of the collate function (or in any operations that get forked during CPU multiprocessing). This way I essentially define the GPU preprocessor function as part of my trainer (rather than the dataloader) and call it first in the forward() function, after each mini-batch has been collated and uploaded to the GPU. I guess the lesson for me here is to not invoke GPU processing as part of the CPU multiprocessing routine while those GPUs are already busy with forward & backward computation for the current batch of data. I think it's more of a PyTorch/Python issue (or just not good practice at all) rather than an issue with this package.
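
A rough sketch of that layout (names and the exact transforms are placeholders): the DataLoader workers stay CPU-only, and the augmentation runs inside the model after the collated batch is already on the GPU.

import torch
from torch_audiomentations import Compose, Gain, LowPassFilter

SAMPLE_RATE = 16000  # assumed

class ModelWithGpuAugmentation(torch.nn.Module):
    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        self.backbone = backbone
        self.augment = Compose(
            transforms=[
                Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5, p_mode="per_example"),
                LowPassFilter(min_cutoff_freq=4000.0, max_cutoff_freq=8000.0, p=0.5, p_mode="per_example"),
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has already been moved to the GPU by the trainer, so the augmentation
        # runs in the main process rather than in a forked DataLoader worker.
        if self.training:
            x = self.augment(x, sample_rate=SAMPLE_RATE)
        return self.backbone(x)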

Maybe this is obvious to some experienced folks, but I feel we could mention this caveat for users who aren't aware of it?

Collaborator

iver56 commented Oct 6, 2022

> This way I essentially define the GPU preprocessor function as part of my trainer (rather than the dataloader), and call it first in the forward() function after each mini-batch has been collated and uploaded to GPU.

Yes, this is the way I use torch-audiomentations on GPU too 👍 It would indeed be nice to have this documented well. I'm currently focusing on the documentation website for audiomentations, but I eventually want to make one for torch-audiomentations too, using the knowledge I gained from making the audiomentations documentation.


Bloos commented Jun 14, 2023

I implemented it in the same way and applied it in the training loop, but I'm still experiencing the memory leak.

I've got a GPU server with multiple GPUs and I am using PyTorch Lightning with DDP, using only one GPU per process.
The exception happened in the band-pass filter, somewhere in the julius code, in cuFFT. Sadly, I cannot copy the stack trace because the server is in an offline environment.
