
Export NeMo FastConformer Hybrid Transducer Large Streaming to ONNX #844

Merged 4 commits on May 8, 2024

Conversation

@csukuangfj (Collaborator) commented May 8, 2024

Following #843

This PR handles the transducer part.


CC @tempops @sangeet2020

Also CC @titu1994

NeMo fuses the decoder + joiner into a single model decoder_joint.

The disadvantage of this fusion is that it increases the computation overhead during decoding.

(I don't see any benefit to the fusion.)

This PR instead exports the decoder and joiner separately.
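To illustrate the overhead argument, here is a toy Python sketch (the call-counting logic and the emit-on-every-4th-frame rule are illustrative assumptions, not NeMo's actual implementation). With separate models, greedy transducer search can cache the decoder (prediction network) output and rerun the decoder only when a non-blank token is emitted; a fused decoder_joint reruns the decoder on every joiner invocation.

```python
# Toy model of greedy transducer search, counting decoder invocations.
decoder_calls = 0

def run_decoder(token):
    """Stand-in for the prediction network; we only count how often it runs."""
    global decoder_calls
    decoder_calls += 1
    return ("dec", token)

def run_joiner(t, dec_out):
    """Toy joiner: emits a non-blank token on every 4th frame."""
    return "tok" if t % 4 == 0 else "blank"

def greedy_search(num_frames, fused):
    global decoder_calls
    decoder_calls = 0
    last_token = "<sos>"
    dec_out = run_decoder(last_token)  # initial decoder pass
    for t in range(num_frames):
        if fused:
            # fused decoder_joint: the decoder runs inside every joiner call
            dec_out = run_decoder(last_token)
        sym = run_joiner(t, dec_out)
        if sym != "blank":
            last_token = sym
            if not fused:
                # separate models: recompute the decoder only on emission
                dec_out = run_decoder(last_token)
    return decoder_calls

fused_calls = greedy_search(16, fused=True)      # decoder runs every frame
separate_calls = greedy_search(16, fused=False)  # decoder runs only on emissions
```

In this toy run the fused variant invokes the decoder once per frame, while the separate variant invokes it only for the handful of emitted tokens, which is the saving the separate export targets.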

@sangeet2020 (Contributor)

Hi,
Thanks for the PR.
I tried adding the metadata and then exporting the model using your script. However, the exported models don't look as expected, specifically the encoder part.

```
$ python show-onnx-transudcer.py
=========encoder==========
NodeArg(name='audio_signal', type='tensor(float)', shape=['audio_signal_dynamic_axes_1', 80, 'audio_signal_dynamic_axes_2'])
NodeArg(name='length', type='tensor(int64)', shape=['length_dynamic_axes_1'])
-----
NodeArg(name='outputs', type='tensor(float)', shape=['outputs_dynamic_axes_1', 512, 'outputs_dynamic_axes_2'])
NodeArg(name='encoded_lengths', type='tensor(int64)', shape=['encoded_lengths_dynamic_axes_1'])
=========decoder==========
NodeArg(name='targets', type='tensor(int32)', shape=['targets_dynamic_axes_1', 'targets_dynamic_axes_2'])
NodeArg(name='target_length', type='tensor(int32)', shape=['target_length_dynamic_axes_1'])
NodeArg(name='states.1', type='tensor(float)', shape=[1, 'states.1_dim_1', 640])
NodeArg(name='onnx::LSTM_3', type='tensor(float)', shape=[1, 1, 640])
-----
NodeArg(name='outputs', type='tensor(float)', shape=['outputs_dynamic_axes_1', 640, 'outputs_dynamic_axes_2'])
NodeArg(name='prednet_lengths', type='tensor(int32)', shape=['prednet_lengths_dynamic_axes_1'])
NodeArg(name='states', type='tensor(float)', shape=[1, 'states_dynamic_axes_1', 640])
NodeArg(name='74', type='tensor(float)', shape=[1, 'LSTM74_dim_1', 640])
=========joiner==========
NodeArg(name='encoder_outputs', type='tensor(float)', shape=['encoder_outputs_dynamic_axes_1', 512, 'encoder_outputs_dynamic_axes_2'])
NodeArg(name='decoder_outputs', type='tensor(float)', shape=['decoder_outputs_dynamic_axes_1', 640, 'decoder_outputs_dynamic_axes_2'])
-----
NodeArg(name='outputs', type='tensor(float)', shape=['outputs_dynamic_axes_1', 'outputs_dynamic_axes_2', 'outputs_dynamic_axes_3', 1025])
```

I am missing this metadata:

```python
        "cache_last_channel_dim1": cache_last_channel_dim1,
        "cache_last_channel_dim2": cache_last_channel_dim2,
        "cache_last_channel_dim3": cache_last_channel_dim3,
        "cache_last_time_dim1": cache_last_time_dim1,
        "cache_last_time_dim2": cache_last_time_dim2,
        "cache_last_time_dim3": cache_last_time_dim3,
```

What could be the reason? Also, why do we need this information in the metadata? Is it necessary and important for the sherpa-onnx decoder to support?

Thank you

@csukuangfj (Collaborator, Author)

I just added streaming CTC support for the NeMo hybrid FastConformer transducer+CTC model.
Please see #857

I think you can use it as a reference to add the streaming transducer support.


> I am missing this metadata

You can find their usage at:

```cpp
std::array<int64_t, 4> cache_last_channel_shape{1, cache_last_channel_dim1_,
                                                cache_last_channel_dim2_,
                                                cache_last_channel_dim3_};

std::array<int64_t, 4> cache_last_time_shape{
    1, cache_last_time_dim1_, cache_last_time_dim2_, cache_last_time_dim3_};
```
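To make the purpose of those entries concrete, here is a minimal Python sketch of what the metadata is needed for: it supplies the shapes of the zero-initialized cache states that the streaming encoder expects on the first chunk. The metadata values below are made-up toy numbers, not the real dimensions of the exported model.

```python
import math

def init_cache_states(meta):
    """Build zero-initialized cache tensors from model metadata,
    mirroring the C++ shapes {1, dim1, dim2, dim3} shown above."""
    ch_shape = (1,
                meta["cache_last_channel_dim1"],
                meta["cache_last_channel_dim2"],
                meta["cache_last_channel_dim3"])
    t_shape = (1,
               meta["cache_last_time_dim1"],
               meta["cache_last_time_dim2"],
               meta["cache_last_time_dim3"])
    zeros = lambda shape: [0.0] * math.prod(shape)  # flat zero buffer
    return {"cache_last_channel": (ch_shape, zeros(ch_shape)),
            "cache_last_time": (t_shape, zeros(t_shape))}

# Toy metadata values for illustration only; the real values come from the
# exported model's metadata.
meta = {
    "cache_last_channel_dim1": 17,
    "cache_last_channel_dim2": 70,
    "cache_last_channel_dim3": 512,
    "cache_last_time_dim1": 17,
    "cache_last_time_dim2": 512,
    "cache_last_time_dim3": 8,
}
caches = init_cache_states(meta)
```

Without the metadata, the runtime has no way to know how large these initial cache tensors must be, which is why the keys are required.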


> I tried adding the metadata and then exporting the model using your script.

Make sure you have followed

```python
asr_model.set_export_config({"decoder_type": "rnnt", "cache_support": True})
```

You have to use `"cache_support": True` in order to export a streaming model.
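As a toy illustration of why the cache_* metadata was missing, consider the sketch below. `ToyASRModel` and its behavior are a simplified stand-in, not NeMo's real implementation: the point is only that a cache-supported (streaming) export carries the cache dimensions, while an offline export does not.

```python
# Simplified stand-in for NeMo's export flow (an assumption for illustration):
# cache_* metadata appears only when "cache_support" is enabled.

class ToyASRModel:
    def __init__(self):
        self.export_config = {}

    def set_export_config(self, cfg):
        # mirrors the asr_model.set_export_config(...) call shown above
        self.export_config.update(cfg)

    def export_metadata(self):
        meta = {"decoder_type": self.export_config.get("decoder_type", "rnnt")}
        if self.export_config.get("cache_support"):
            # streaming export: record cache dimensions so the runtime
            # (e.g. sherpa-onnx) can allocate the initial cache states
            meta.update({
                "cache_last_channel_dim1": 17,  # toy values, not real dims
                "cache_last_channel_dim2": 70,
                "cache_last_channel_dim3": 512,
                "cache_last_time_dim1": 17,
                "cache_last_time_dim2": 512,
                "cache_last_time_dim3": 8,
            })
        return meta

streaming = ToyASRModel()
streaming.set_export_config({"decoder_type": "rnnt", "cache_support": True})
streaming_meta = streaming.export_metadata()

offline = ToyASRModel()
offline.set_export_config({"decoder_type": "rnnt", "cache_support": False})
offline_meta = offline.export_metadata()
```

With `"cache_support": False` the exported metadata has no cache_* keys at all, which matches the symptom reported above.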

@csukuangfj (Collaborator, Author)

[Screenshot 2024-05-10 at 16:47:56: the scripts for exporting streaming and non-streaming models]

In case you have any confusion, please see the above screenshot for the scripts that export streaming and non-streaming models.

@sangeet2020 (Contributor)

That is a very detailed explanation. Thank you so much, @csukuangfj.

For offline decoding I had set `"cache_support": False`; that was the problem. Setting it to `True` solved it.
Thanks again!

@FawazCL commented Sep 17, 2024

Hello @sangeet2020 @csukuangfj,
I wanted to know whether the FastConformer model scripts are available for microphone inference? I only see online inference from a file.

Thanks!

@csukuangfj (Collaborator, Author)

> I wanted to know whether the FastConformer model scripts are available for microphone inference? I only see online inference from a file.

Yes, you can.

Please follow how we use streaming transducers in sherpa-onnx.

All you need to do is use the model filenames for the fast conformer transducers.
