
MVDR module can't work in distributed training #1991

Closed
nateanl opened this issue Nov 6, 2021 · 5 comments

@nateanl
Member

nateanl commented Nov 6, 2021

🐛 Describe the bug

NCCL currently only supports floating point and integer dtype. (See pytorch/pytorch#45760)
The MVDR module forces the use of the cdouble dtype to improve numerical robustness; however, this causes a runtime error when integrating it into a neural network:

NCCL Backend doesn't support torch.complexdouble data type

To solve it, a temporary workaround is to relax the dtype constraint by allowing MVDR to use the cfloat dtype.
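A minimal sketch of what the relaxation amounts to: letting the internal computation run in complex64 (cfloat) instead of complex128 (cdouble). The `specgram` tensor below is a hypothetical stand-in for a multi-channel STFT, not torchaudio's actual MVDR API.

```python
import torch

# Hypothetical multi-channel spectrogram in the dtype MVDR currently forces.
specgram = torch.randn(2, 201, 100, dtype=torch.cdouble)

# The proposed workaround: downcast to complex64 (cfloat), the dtype the
# thread suggests allowing, before the tensor reaches a distributed backend.
specgram_cf = specgram.to(torch.cfloat)

assert specgram_cf.dtype == torch.cfloat
assert specgram_cf.shape == specgram.shape
```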

Versions

The issue is general to the current PyTorch and torchaudio version.

@mthrok
Collaborator

mthrok commented Nov 7, 2021

This makes me wonder ... Should we keep using pseudo complex type? (#1337)

Or can view_as_real / view_as_complex work around it?
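For context, the workaround being suggested is the standard real-view roundtrip: `view_as_real` reinterprets a complex tensor as a real one with a trailing (real, imag) dimension of size 2, which NCCL can reduce, and `view_as_complex` undoes it. Both are views, so no data is copied.

```python
import torch

x = torch.randn(3, 4, dtype=torch.cfloat)

r = torch.view_as_real(x)        # shape (3, 4, 2), dtype float32 — NCCL-friendly
y = torch.view_as_complex(r)     # back to shape (3, 4), complex64

assert r.shape == (3, 4, 2) and r.dtype == torch.float32
assert torch.equal(x, y)
```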

@nateanl
Member Author

nateanl commented Nov 7, 2021

I tested view_as_complex; it has the same functionality. But I made it work by converting the waveform to torch.float after MVDR beamforming.
So the current MVDR indeed works in distributed training; the output just needs to be converted to floating-point precision. I'm wondering how to check the dtype of the gradient in MVDR to make sure it's double instead of float?

@nateanl
Member Author

nateanl commented Nov 7, 2021

The PyTorch that works is built from source. Let me try a stable version to see how it behaves.

@nateanl
Member Author

nateanl commented Nov 8, 2021

I think this issue is solved in the latest PyTorch. Using torch.cdouble inside MVDR works fine in distributed training. I only need to convert the output to torch.float after InverseSpectrogram.

After discussion with @fmassa, I think adding torch.cfloat support is also a good option, if performance is not degraded. I will verify it by using torch.cfloat in the tutorial.
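A hedged sketch of the conversion described above: an inverse STFT of a cdouble spectrogram comes back as a float64 waveform, which is then cast to float32 before any NCCL collective sees it. The STFT parameters and tensor shapes here are illustrative, not torchaudio's InverseSpectrogram internals.

```python
import torch

window = torch.hann_window(400, dtype=torch.double)
x = torch.randn(16000, dtype=torch.double)

# Forward STFT of a double waveform yields a complex128 spectrogram.
spec = torch.stft(x, n_fft=400, window=window, return_complex=True)
assert spec.dtype == torch.cdouble

# Inverse STFT yields a float64 waveform; cast it to float32 for NCCL/DDP.
wav = torch.istft(spec, n_fft=400, window=window)
wav = wav.to(torch.float)
assert wav.dtype == torch.float32
```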

@nateanl
Member Author

nateanl commented Nov 10, 2021

Closing since MVDR works in the latest PyTorch in distributed training.

@nateanl nateanl closed this as completed Nov 10, 2021