
MVDR module can't work in distributed training #1991

Closed
nateanl opened this issue Nov 6, 2021 · 5 comments

@nateanl
Member

nateanl commented Nov 6, 2021

🐛 Describe the bug

NCCL currently only supports floating point and integer dtype. (See pytorch/pytorch#45760)
The MVDR module forces the use of the cdouble dtype to improve numerical robustness; however, this causes a runtime error when integrating it into a neural network:

NCCL Backend doesn't support torch.complexdouble data type

To solve it, a temporary workaround is to relax the dtype constraint by allowing MVDR to use the cfloat dtype.
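A minimal sketch of what the relaxation amounts to: letting the internal computation run in complex64 (cfloat) instead of complex128 (cdouble). The `specgram` tensor below is a hypothetical stand-in for a multi-channel STFT, not torchaudio's actual MVDR API.

```python
import torch

# Hypothetical multi-channel spectrogram in the dtype MVDR currently forces.
specgram = torch.randn(2, 201, 100, dtype=torch.cdouble)

# The proposed workaround: downcast to complex64 (cfloat), the dtype the
# thread suggests allowing, before the tensor reaches a distributed backend.
specgram_cf = specgram.to(torch.cfloat)

assert specgram_cf.dtype == torch.cfloat
assert specgram_cf.shape == specgram.shape
```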

Versions

The issue is general to the current PyTorch and torchaudio version.

@mthrok
Collaborator

mthrok commented Nov 7, 2021

This makes me wonder ... Should we keep using pseudo complex type? (#1337)

Or can view_as_real / view_as_complex work around it?
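For context, the workaround being suggested is the standard real-view roundtrip: `view_as_real` reinterprets a complex tensor as a real one with a trailing (real, imag) dimension of size 2, which NCCL can reduce, and `view_as_complex` undoes it. Both are views, so no data is copied.

```python
import torch

x = torch.randn(3, 4, dtype=torch.cfloat)

r = torch.view_as_real(x)        # shape (3, 4, 2), dtype float32 — NCCL-friendly
y = torch.view_as_complex(r)     # back to shape (3, 4), complex64

assert r.shape == (3, 4, 2) and r.dtype == torch.float32
assert torch.equal(x, y)
```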

@nateanl
Member Author

nateanl commented Nov 7, 2021

I tested view_as_complex; it has the same functionality. But I made it work by converting the waveform to torch.float after MVDR beamforming.
So the current MVDR indeed works in distributed training; the output just needs to be converted to floating-point precision. I'm wondering how to check the dtype of the gradient in MVDR to make sure it's double instead of float?

@nateanl
Member Author

nateanl commented Nov 7, 2021

The PyTorch that works is built from source. Let me try a stable version to see how it behaves.

@nateanl
Member Author

nateanl commented Nov 8, 2021

I think this issue is solved in the latest PyTorch. Using torch.cdouble inside MVDR works fine in distributed training. I only need to convert the output to torch.float after InverseSpectrogram.

After discussion with @fmassa, I think adding torch.cfloat support is also a good option, if performance is not degraded. I will verify it by using torch.cfloat in the tutorial.
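A hedged sketch of the conversion described above: an inverse STFT of a cdouble spectrogram comes back as a float64 waveform, which is then cast to float32 before any NCCL collective sees it. The STFT parameters and tensor shapes here are illustrative, not torchaudio's InverseSpectrogram internals.

```python
import torch

window = torch.hann_window(400, dtype=torch.double)
x = torch.randn(16000, dtype=torch.double)

# Forward STFT of a double waveform yields a complex128 spectrogram.
spec = torch.stft(x, n_fft=400, window=window, return_complex=True)
assert spec.dtype == torch.cdouble

# Inverse STFT yields a float64 waveform; cast it to float32 for NCCL/DDP.
wav = torch.istft(spec, n_fft=400, window=window)
wav = wav.to(torch.float)
assert wav.dtype == torch.float32
```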

@nateanl
Member Author

nateanl commented Nov 10, 2021

Closing since MVDR works in the latest PyTorch in distributed training.

@nateanl nateanl closed this as completed Nov 10, 2021