Support pre-trained CTC models from NeMo #332

Merged
merged 7 commits into k2-fsa:master from support-nemo-ctc
Mar 10, 2023

Conversation


@csukuangfj csukuangfj commented Mar 9, 2023

Fixes #303

Fixes #238

TODOs

  • Add CI tests
  • Update doc
  • Convert more pre-trained models from NeMo

Usage example

We have converted Citrinet-512 from NeMo:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_citrinet_512

The model is saved at
https://huggingface.co/csukuangfj/sherpa-nemo-ctc-en-citrinet-512

In the following, we describe how to use sherpa to decode sound files with pre-trained CTC models from NeMo.

Build sherpa

git clone https://github.com/k2-fsa/sherpa
# Note: you need to check out the commit from this pull request; we omit that step here.
cd sherpa
mkdir build
cd build
cmake ..
make -j 

Download the pre-trained model

cd /path/to/sherpa

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-nemo-ctc-en-citrinet-512
cd sherpa-nemo-ctc-en-citrinet-512
git lfs pull --include "*.pt"

Use the pre-trained model

cd /path/to/sherpa

./build/bin/sherpa-offline \
  --nn-model=./sherpa-nemo-ctc-en-citrinet-512/model.pt \
  --tokens=./sherpa-nemo-ctc-en-citrinet-512/tokens.txt \
  --use-gpu=false \
  --modified=false \
  --nemo-normalize=per_feature \
  ./sherpa-nemo-ctc-en-citrinet-512/test_wavs/0.wav

You should see the following output:

[I] /root/fangjun/open-source/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-03-09 22:27:55.828 ./build/bin/sherpa-offline --nn-model=./sherpa-nemo-ctc-en-citrinet-512/model.pt --tokens=./sherpa-nemo-ctc-en-citrinet-512/tokens.txt --use-gpu=false --modified=false --nemo-normalize=per_feature ./sherpa-nemo-ctc-en-citrinet-512/test_wavs/0.wav

[I] /root/fangjun/open-source/sherpa/sherpa/cpp_api/bin/offline-recognizer.cc:125:int main(int, char**) 2023-03-09 22:27:55.844 OfflineRecognizerConfig(ctc_decoder_config=OfflineCtcDecoderConfig(modified=False, hlg="", lm_scale=1, search_beam=20, output_beam=8, min_active_states=30, max_active_states=10000), feat_config=FeatureConfig(fbank_opts=FbankOptions(frame_opts=FrameExtractionOptions(samp_freq=16000, frame_shift_ms=10, frame_length_ms=25, dither=0, preemph_coeff=0.97, remove_dc_offset=True, window_type="povey", round_to_power_of_two=True, blackman_coeff=0.42, snip_edges=True, max_feature_vectors=-1), mel_opts=MelBanksOptions(num_bins=80, low_freq=20, high_freq=0, vtln_low=100, vtln_high=-500, debug_mel=False, htk_mode=False), use_energy=False, energy_floor=0, raw_energy=True, htk_compat=False, use_log_fbank=True, use_power=True, device="cpu"), normalize_samples=True, nemo_normalize="per_feature"), nn_model="./sherpa-nemo-ctc-en-citrinet-512/model.pt", tokens="./sherpa-nemo-ctc-en-citrinet-512/tokens.txt", use_gpu=False, decoding_method="greedy_search", num_active_paths=4)
[I] /root/fangjun/open-source/sherpa/sherpa/cpp_api/offline-recognizer-ctc-impl.h:172:void sherpa::OfflineRecognizerCtcImpl::WarmUp() 2023-03-09 22:27:57.216 WarmUp begins
[I] /root/fangjun/open-source/sherpa/sherpa/cpp_api/offline-recognizer-ctc-impl.h:185:void sherpa::OfflineRecognizerCtcImpl::WarmUp() 2023-03-09 22:27:57.623 WarmUp ended
[W BinaryOps.cpp:601] Warning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (function operator())

filename: ./sherpa-nemo-ctc-en-citrinet-512/test_wavs/0.wav
text:  after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
token IDs:  after  early  night f a ll  the  y e ll ow  la mp s  would  light  up  here  and  there  the  s qu al id  qu ar ter  of  the  b ro th el s
timestamps (after subsampling): 0.4 0.8 1.2 1.44 1.52 1.6 1.76 1.92 2 2.08 2.16 2.32 2.48 2.56 2.72 2.88 3.2 3.36 3.6 3.76 4.16 4.32 4.4 4.48 4.72 4.96 5.04 5.2 5.36 5.44 5.6 5.68 5.76 5.92 6.08

@csukuangfj

@titu1994 You may find this pull-request interesting and helpful.

@csukuangfj

Caution:
In NeMo, the last token is the blank token. However, in sherpa, we always use ID 0 for the blank token.

Therefore, while creating tokens.txt, we set ID 0 to blank and increase the IDs of all other tokens by one. During the neural network computation, we shift the last column of the log_prob tensor to the first column.

See the code below:

return logit.roll(1 /*shift right with 1 column*/, 2 /*dim*/);
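
For illustration, here is a minimal Python sketch of the same shift; the tensor values are made up:

import torch

# Hypothetical log-probs over 4 tokens for a single frame, shape (N, T, C),
# with NeMo's blank in the last column.
log_prob = torch.tensor([[[0.1, 0.2, 0.3, 0.4]]])

# Shift right by one along the class dimension (dim=2) with wraparound, so the
# last column (blank) becomes column 0 -- the same operation as logit.roll(1, 2).
shifted = log_prob.roll(shifts=1, dims=2)
print(shifted)  # tensor([[[0.4000, 0.1000, 0.2000, 0.3000]]])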

@titu1994

titu1994 commented Mar 9, 2023

This is fantastic! Thank you very much for this integration, and let me know how I can help (I'm adding docs as discussed in the other thread).

We could potentially add a link to your example in our decoding section docs if it supports CTC models with both char and subword tokenizers.

@csukuangfj csukuangfj added the cpp label Mar 10, 2023
@csukuangfj

Will update the doc in a separate PR.

@csukuangfj csukuangfj added cpp and removed cpp labels Mar 10, 2023
@csukuangfj csukuangfj merged commit 32da448 into k2-fsa:master Mar 10, 2023
@csukuangfj csukuangfj deleted the support-nemo-ctc branch March 10, 2023 07:06
@csukuangfj

This is fantastic! Thank you very much for this integration, and let me know how I can help (I'm adding docs as discussed in the other thread).

We could potentially add a link to your example in our decoding section docs if it supports CTC models with both char and subword tokenizers.

@titu1994

There are several issues with the torchscript models from NeMo.

  1. It is not clear what the signature of the forward() method of the exported model is.

It turns out the comment at
https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/models/asr_model.py#L165
is not correct.

    def forward_for_export(self, input, length=None, cache_last_channel=None, cache_last_time=None):
        """
        This forward is used when we need to export the model to ONNX format.
        Inputs cache_last_channel and cache_last_time are needed to be passed for exporting streaming models.
        When they are passed, it just passes the inputs through the encoder part and currently the ONNX conversion does not fully work for this case.
        Args:
            input: Tensor that represents a batch of raw audio signals,
                of shape [B, T]. T here represents timesteps.
            length: Vector of length B, that contains the individual lengths of the audio sequences.

The comment says the shape of input is (B, T). But actually, the shape is (B, C, T).

It took me a really long time to figure that out.

  2. The exported model takes two tensors as inputs, features and features_length, but it
    returns only a single output, log_probs. Is it possible to also return log_probs_length?

@titu1994

titu1994 commented Mar 10, 2023

  1. Noted about the docstring. To be clear, that's a mixin class; it should not be asserting anything about the input/output shapes in the first place. I'll revert that part of the docstring.

NeMo has neural types in each of its models; that's what you should use to determine shapes. You can do this by calling model.input_types and model.output_types, which both return a dictionary of neural types and usually also note the order of the tensor axes for each argument, along with the argument name if you want to pass arguments as key:value pairs.

  2. Our CTC decoders are a simple 1D conv with no stride, so the output length is the same as the encoder length; since it does not change, we do not return the lengths. However, the encoder should return the encoded lengths, so you can probably use that? Or is it that, due to the fusion of the encoder and decoder, the sequence length is not returned at all? A quick sketch of the neural-type inspection follows.
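
For example, a minimal sketch of that inspection (the model name here is just an example):

import nemo.collections.asr as nemo_asr

# Load a pre-trained model and inspect its neural types to learn the
# expected tensor layouts before exporting or calling forward().
m = nemo_asr.models.EncDecCTCModel.from_pretrained('stt_en_citrinet_512')
print(m.input_types)   # dictionary of neural types for input_signal, length, ...
print(m.output_types)  # dictionary of neural types for the output log-probs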

@csukuangfj

To give you an example, the following code

import torch
import torchaudio
import nemo.collections.asr as nemo_asr

citrinet_zh = nemo_asr.models.EncDecCTCModel.from_pretrained('stt_zh_citrinet_512')
citrinet_zh.export("model.pt")

samples, sample_rate = torchaudio.load("./BAC009S0764W0121.wav")
print('samples', samples.shape, sample_rate)

features, features_len = citrinet_zh.preprocessor(input_signal=samples, length=torch.tensor([samples.shape[1]]))

print('features', features.shape, features_len)

model = torch.jit.load("model.pt")

log_probs = model(features, features_len)
print(log_probs.shape)

has the following output

[NeMo I 2023-03-10 09:16:37 exportable:86] Successfully exported EncDecCTCModel to model.pt
samples torch.Size([1, 67263]) 16000
features torch.Size([1, 80, 432]) tensor([421])
torch.Size([1, 54, 5207])

You can see that the model returns only a single output.

The model has merged the encoder and decoder into a single module.

It would be nice if the model could also return the length of log_probs, or, if possible, simply not merge the encoder and decoder so that we can invoke them separately, just like transcribe does.

@titu1994

Hmm, I see. I will ask our team members if it's possible to change this, as the output shape requirement needs to be optional (Riva usually does not want it).

Should be doable; RNNT already supports this, but we need to check how to implement it without damaging pre-existing models and exposed paths.

@csukuangfj

as the output shape requirement needs to be optional (Riva usually does not want it).

Does Riva support batch CTC decoding?

We need the length information for batch CTC decoding in sherpa.
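
To sketch why the lengths matter (all shapes and values here are hypothetical): utterances in a batch are padded to the longest one, and without per-utterance lengths the decoder cannot tell real frames from padding.

import torch

# Two utterances padded to 10 frames each, over a vocabulary of 5 tokens.
log_probs = torch.randn(2, 10, 5)         # (batch, padded_frames, vocab)
log_probs_length = torch.tensor([10, 6])  # true number of frames per utterance

for probs, n in zip(log_probs, log_probs_length.tolist()):
    valid = probs[:n]              # keep only the real frames
    tokens = valid.argmax(dim=-1)  # greedy CTC path over the valid frames
    print(tokens.shape)            # torch.Size([10]), then torch.Size([6])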

@csukuangfj

Hmm, I see. I will ask our team members if it's possible to change this

Thanks!

@titu1994

titu1994 commented Mar 10, 2023

Btw, to note: once a model calls .export(), it is considered a corrupted model. I would suggest not trusting the output of such a model; instead, delete it, load the jit model, and restore another copy of the NeMo model if you need the preprocessor.

Another thing is, if you have Torchaudio installed, you can export the preprocessor too - https://github.com/NVIDIA/NeMo/pull/5512

Dunno why but I forgot to add it to the docs

@titu1994

Does Riva support batch CTC decoding?

Yep.

We need the length information for batch CTC decoding in sherpa.

Their preprocessor internally keeps track of it for CTC, so it somehow works, but I'm not sure of the internals.

@csukuangfj

Does Riva also use the torchscript model? If so, how does Riva know the length of log_probs?

@csukuangfj

FYI:

The documentation of pre-trained models from NeMo is available at
https://k2-fsa.github.io/sherpa/cpp/pretrained_models/offline_ctc/index.html

The following is a screenshot:

[Screenshot 2023-03-10 at 17 43 28]

We can convert more models if needed.

@titu1994

Riva supports both ONNX and TS output; as to how they support it without explicit export, no idea. It's easy enough to estimate the seq length by dividing by the model stride (the length from the preprocessor // 4 for Conformer or // 8 for Citrinet), which should give you a nearly correct seq length.
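
A quick sketch of that estimate (the stride values are those given above, and the numbers come from the Citrinet example earlier in this thread):

def estimate_num_frames(feature_len: int, subsampling_factor: int) -> int:
    # Nearly correct estimate of the log-probs length; may be off by one
    # depending on padding inside the model.
    return feature_len // subsampling_factor

# 432 feature frames with Citrinet's stride of 8 gives 54, matching the
# log_probs shape torch.Size([1, 54, 5207]) printed above.
print(estimate_num_frames(432, 8))  # 54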

@titu1994

That's great! Can you try one of the Conformer CTC models? Those are the current state-of-the-art models in NeMo, trained on much more data than the Citrinet.

@csukuangfj

It's easy enough to estimate the seq length by dividing by the model stride (the length from the preprocessor // 4 for Conformer or // 8 for Citrinet)

The C++ code is fairly generic and all it takes is a .pt file.

Is it possible to read the subsampling factor from the torchscript model? If not, could you add some attributes to the model before exporting so that we can read them in the C++ code?

In icefall, we add attributes to the model, such as the vocab size and subsampling factor, so that we can read them in C++ within sherpa.


Those are the current state-of-the-art models in NeMo, trained on much more data than the Citrinet.

Thanks. I will try the Conformer model.

@uni-manjunath-ke

Hi @csukuangfj, everything that I have tried so far uses Conformer CTC models only. FYI. Thanks

@csukuangfj

@titu1994
I am trying the Conformer CTC models. I just realized that they are also of type nemo_asr.models.EncDecCTCModelBPE, so there is no need to change the C++ code.


I am getting the following error while exporting a Conformer CTC model to torchscript.

The code:

import nemo.collections.asr as nemo_asr

m = nemo_asr.models.EncDecCTCModelBPE.from_pretrained('stt_en_conformer_ctc_small')
m.export("model.pt")

The error for the above code:

[NeMo I 2023-03-10 10:59:23 export_utils:398] Swapped 96 modules
[NeMo W 2023-03-10 10:59:24 nemo_logging:349] /usr/local/lib/python3.9/dist-packages/nemo/collections/asr/modules/conformer_encoder.py:397: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if seq_length > self.max_audio_length:
    
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-102-6bd572e31b75> in <module>
----> 1 m.export("model.pt")

/usr/local/lib/python3.9/dist-packages/nemo/core/classes/exportable.py in export(self, output, input_example, verbose, do_constant_folding, onnx_opset_version, check_trace, dynamic_axes, check_tolerance, export_modules_as_functions, keep_initializers_as_inputs)
     67             model = self.get_export_subnet(subnet_name)
     68             out_name = augment_filename(output, subnet_name)
---> 69             out, descr, out_example = model._export(
     70                 out_name,
     71                 input_example=input_example,

/usr/local/lib/python3.9/dist-packages/nemo/core/classes/exportable.py in _export(self, output, input_example, verbose, do_constant_folding, onnx_opset_version, check_trace, dynamic_axes, check_tolerance, export_modules_as_functions, keep_initializers_as_inputs)
    165                         logging.info(f"JIT code:\n{jitted_model.code}")
    166                     jitted_model.save(output)
--> 167                     jitted_model = torch.jit.load(output)
    168 
    169                     if check_trace:

/usr/local/lib/python3.9/dist-packages/torch/jit/_serialization.py in load(f, map_location, _extra_files)
    160     cu = torch._C.CompilationUnit()
    161     if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 162         cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
    163     else:
    164         cpp_module = torch._C.import_ir_module_from_buffer(

RuntimeError: required keyword attribute 'value' is undefined

Do you have any suggestions about how to fix it?


I am using the following code to install NeMo in a Google Colab notebook:

## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install text-unidecode

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

## Install TorchAudio
!pip install torchaudio>=0.10.0 -f https://download.pytorch.org/whl/torch_stable.html

## Grab the config we'll use in this example
!mkdir configs

@csukuangfj

I found a solution from pytorch/pytorch#81085 (comment)
by disabling

                    # jitted_model = torch.jit.optimize_for_inference(torch.jit.freeze(jitted_model))

in exportable.py

@csukuangfj

@titu1994

I have updated the documentation to include Conformer CTC models from NeMo.

I also added a section describing how to export CTC models from NeMo to sherpa.

Please see

[Screenshot 2023-03-10 at 21 47 47]


By the way, could you add Conformer CTC models for more languages, e.g., Chinese?

@titu1994

Is it possible to read the subsampling factor from the torchscript model? If not, could you add some attributes to the model before exporting so that we can read them in the C++ code?

Sure, we can look into this. Is there an example or documentation of how to add attributes to a torchscript export?

I have notified the team about the issue with torchscript export of Conformer, thanks for finding it! We have export tests for Conformer ONNX since that's usually more efficient, but we want to support both ONNX and TS in NeMo for compatibility.

@titu1994

By the way, could you add Conformer CTC models for more languages, e.g., Chinese?

Yes, absolutely! We have a ton of languages with Conformer support. Here is a non-exhaustive list of all models on NGC for the languages we currently support - https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/scores.html

Huggingface just has the most popular models, but we can add more to HF if there is a request for it.

@csukuangfj

csukuangfj commented Mar 10, 2023

Sure, we can look into this. Is there an example or documentation of how to add attributes to a torchscript export?

If a model has scalar attributes, then after exporting to torchscript, those attributes are kept in the resulting exported model.

For instance, in icefall, the decoder model has the following attributes:
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/decoder.py#L66-L67

        self.context_size = context_size
        self.vocab_size = vocab_size

And we can access the attributes of the exported decoder model in sherpa using the following code:

context_size_ = decoder_.attr("context_size").toInt();
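
For completeness, a minimal end-to-end sketch of the same idea in Python; the class and attribute names here are illustrative:

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, context_size: int, vocab_size: int):
        super().__init__()
        # Scalar attributes like these survive torch.jit.script() and can be
        # read back from the loaded model (e.g., via Module::attr in C++).
        self.context_size = context_size
        self.vocab_size = vocab_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

m = torch.jit.script(Decoder(context_size=2, vocab_size=500))
m.save("decoder.pt")

loaded = torch.jit.load("decoder.pt")
print(loaded.context_size, loaded.vocab_size)  # 2 500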

@csukuangfj

We have a ton of languages with Conformer support. Here is a non-exhaustive list of all models on NGC for the languages we currently support - https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/scores.html

Unfortunately, the list does not include Conformer CTC models for Chinese.

[Screenshot 2023-03-11 at 05 25 29]

@csukuangfj

I have notified the team about the issue with torchscript export of Conformer, thanks for finding it! We have export tests for Conformer ONNX since that's usually more efficient, but we want to support both ONNX and TS in NeMo for compatibility.

Thanks!

@titu1994

Oh, it seems I was mistaken; we have a Conformer transducer large trained on Mandarin, but not CTC. We found Citrinet to do better in CER, so we didn't release the checkpoint. I suppose we could look into it in the future.

@csukuangfj

Oh, it seems I was mistaken; we have a Conformer transducer large trained on Mandarin, but not CTC. We found Citrinet to do better in CER, so we didn't release the checkpoint. I suppose we could look into it in the future.

I see. Thanks!
