Support pre-trained CTC models from NeMo #332

Merged
merged 7 commits into k2-fsa:master from support-nemo-ctc
Mar 10, 2023

Conversation


@csukuangfj csukuangfj commented Mar 9, 2023

Fixes #303

Fixes #238

TODOs

  • Add CI tests
  • Update doc
  • Convert more pre-trained models from NeMo

Usage example

We have converted Citrinet-512 from NeMo:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_citrinet_512

The model is saved at
https://huggingface.co/csukuangfj/sherpa-nemo-ctc-en-citrinet-512

In the following, we describe how to use sherpa to decode sound files with pre-trained CTC models from NeMo.

Build sherpa

git clone https://github.com/k2-fsa/sherpa
# Note: you need to check out the commit from this pull request; we omit that step here.
cd sherpa
mkdir build
cd build
cmake ..
make -j 

Download the pre-trained model

cd /path/to/sherpa

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-nemo-ctc-en-citrinet-512
cd sherpa-nemo-ctc-en-citrinet-512
git lfs pull --include "*.pt"

Use the pre-trained model

cd /path/to/sherpa

./build/bin/sherpa-offline \
  --nn-model=./sherpa-nemo-ctc-en-citrinet-512/model.pt \
  --tokens=./sherpa-nemo-ctc-en-citrinet-512/tokens.txt \
  --use-gpu=false \
  --modified=false \
  --nemo-normalize=per_feature \
  ./sherpa-nemo-ctc-en-citrinet-512/test_wavs/0.wav

You should see the following output:

[I] /root/fangjun/open-source/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-03-09 22:27:55.828 ./build/bin/sherpa-offline --nn-model=./sherpa-nemo-ctc-en-citrinet-512/model.pt --tokens=./sherpa-nemo-ctc-en-citrinet-512/tokens.txt --use-gpu=false --modified=false --nemo-normalize=per_feature ./sherpa-nemo-ctc-en-citrinet-512/test_wavs/0.wav

[I] /root/fangjun/open-source/sherpa/sherpa/cpp_api/bin/offline-recognizer.cc:125:int main(int, char**) 2023-03-09 22:27:55.844 OfflineRecognizerConfig(ctc_decoder_config=OfflineCtcDecoderConfig(modified=False, hlg="", lm_scale=1, search_beam=20, output_beam=8, min_active_states=30, max_active_states=10000), feat_config=FeatureConfig(fbank_opts=FbankOptions(frame_opts=FrameExtractionOptions(samp_freq=16000, frame_shift_ms=10, frame_length_ms=25, dither=0, preemph_coeff=0.97, remove_dc_offset=True, window_type="povey", round_to_power_of_two=True, blackman_coeff=0.42, snip_edges=True, max_feature_vectors=-1), mel_opts=MelBanksOptions(num_bins=80, low_freq=20, high_freq=0, vtln_low=100, vtln_high=-500, debug_mel=False, htk_mode=False), use_energy=False, energy_floor=0, raw_energy=True, htk_compat=False, use_log_fbank=True, use_power=True, device="cpu"), normalize_samples=True, nemo_normalize="per_feature"), nn_model="./sherpa-nemo-ctc-en-citrinet-512/model.pt", tokens="./sherpa-nemo-ctc-en-citrinet-512/tokens.txt", use_gpu=False, decoding_method="greedy_search", num_active_paths=4)
[I] /root/fangjun/open-source/sherpa/sherpa/cpp_api/offline-recognizer-ctc-impl.h:172:void sherpa::OfflineRecognizerCtcImpl::WarmUp() 2023-03-09 22:27:57.216 WarmUp begins
[I] /root/fangjun/open-source/sherpa/sherpa/cpp_api/offline-recognizer-ctc-impl.h:185:void sherpa::OfflineRecognizerCtcImpl::WarmUp() 2023-03-09 22:27:57.623 WarmUp ended
[W BinaryOps.cpp:601] Warning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (function operator())

filename: ./sherpa-nemo-ctc-en-citrinet-512/test_wavs/0.wav
text:  after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
token IDs:  after  early  night f a ll  the  y e ll ow  la mp s  would  light  up  here  and  there  the  s qu al id  qu ar ter  of  the  b ro th el s
timestamps (after subsampling): 0.4 0.8 1.2 1.44 1.52 1.6 1.76 1.92 2 2.08 2.16 2.32 2.48 2.56 2.72 2.88 3.2 3.36 3.6 3.76 4.16 4.32 4.4 4.48 4.72 4.96 5.04 5.2 5.36 5.44 5.6 5.68 5.76 5.92 6.08

@csukuangfj

@titu1994 You may find this pull-request interesting and helpful.

@csukuangfj

Caution:
In NeMo, the last token is the blank token. However, in sherpa, we always use ID 0 for the blank token.

Therefore, while creating tokens.txt, we set ID 0 to blank and increase the IDs of all other tokens by one. During the neural network computation, we shift the last column of the log_prob tensor to the first column.

See the code below:

return logit.roll(1 /*shift right with 1 column*/, 2 /*dim*/);
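
For illustration, here is a minimal Python sketch of the same shift; the tensor values are made up:

import torch

# Hypothetical log-probs over 4 tokens for a single frame, shape (N, T, C),
# with NeMo's blank in the last column.
log_prob = torch.tensor([[[0.1, 0.2, 0.3, 0.4]]])

# Shift right by one along the class dimension (dim=2) with wraparound, so the
# last column (blank) becomes column 0 -- the same operation as logit.roll(1, 2).
shifted = log_prob.roll(shifts=1, dims=2)
print(shifted)  # tensor([[[0.4000, 0.1000, 0.2000, 0.3000]]])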

@titu1994

titu1994 commented Mar 9, 2023

This is fantastic! Thank you very much for this integration, and let me know how I can help (I'm adding docs as discussed in the other thread).

We could potentially add a link to your example in our decoding section docs if it supports CTC models with both char and subword tokenizers.

@csukuangfj csukuangfj added the cpp label Mar 10, 2023
@csukuangfj

Will update the doc in a separate PR.

@csukuangfj csukuangfj added cpp and removed cpp labels Mar 10, 2023
@csukuangfj csukuangfj merged commit 32da448 into k2-fsa:master Mar 10, 2023
@csukuangfj csukuangfj deleted the support-nemo-ctc branch March 10, 2023 07:06
@csukuangfj

This is fantastic! Thank you very much for this integration, and let me know how I can help (I'm adding docs as discussed in the other thread).

We could potentially add a link to your example in our decoding section docs if it supports CTC models with both char and subword tokenizers.

@titu1994

There are several issues with the torchscript models from NeMo.

  1. It is not clear what the signature of the forward() method of the exported model is.

It turns out the comment at
https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/models/asr_model.py#L165
is not correct.

    def forward_for_export(self, input, length=None, cache_last_channel=None, cache_last_time=None):
        """
        This forward is used when we need to export the model to ONNX format.
        Inputs cache_last_channel and cache_last_time are needed to be passed for exporting streaming models.
        When they are passed, it just passes the inputs through the encoder part and currently the ONNX conversion does not fully work for this case.
        Args:
            input: Tensor that represents a batch of raw audio signals,
                of shape [B, T]. T here represents timesteps.
            length: Vector of length B, that contains the individual lengths of the audio sequences.

The comment says the shape of input is (B, T). But actually, the shape is (B, C, T).

It took me a really long time to figure that out.

  2. The exported model takes two tensors as inputs, features and features_length, but it
    returns only a single output, log_probs. Is it possible to also return log_probs_length?

@titu1994

titu1994 commented Mar 10, 2023

  1. Noted about the docstring. To be clear, that's a mixin class; it should not be asserting anything about the input/output shapes in the first place. I'll revert that part of the docstring.

NeMo has neural types in each of its models; that's what you should use to determine shapes. You can do this by calling model.input_types and model.output_types, which both return a dictionary of neural types and usually also note the order of the tensor axes for each argument, along with the argument name if you want to pass arguments as key:value pairs.

  2. Our CTC decoders are a simple 1D conv with no stride, so the output length is the same as the encoder length; since it does not change, we do not return the lengths. However, the encoder should return the encoded lengths, so you can probably use that? Or is it that, due to the fusion of the encoder and decoder, the sequence length is not returned at all? A quick sketch of the neural-type inspection follows.
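
For example, a minimal sketch of that inspection (the model name here is just an example):

import nemo.collections.asr as nemo_asr

# Load a pre-trained model and inspect its neural types to learn the
# expected tensor layouts before exporting or calling forward().
m = nemo_asr.models.EncDecCTCModel.from_pretrained('stt_en_citrinet_512')
print(m.input_types)   # dictionary of neural types for input_signal, length, ...
print(m.output_types)  # dictionary of neural types for the output log-probs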

@csukuangfj

To give you an example, the following code

import torch
import torchaudio
import nemo.collections.asr as nemo_asr

citrinet_zh = nemo_asr.models.EncDecCTCModel.from_pretrained('stt_zh_citrinet_512')
citrinet_zh.export("model.pt")

samples, sample_rate = torchaudio.load("./BAC009S0764W0121.wav")
print('samples', samples.shape, sample_rate)

features, features_len = citrinet_zh.preprocessor(input_signal=samples, length=torch.tensor([samples.shape[1]]))

print('features', features.shape, features_len)

model = torch.jit.load("model.pt")

log_probs = model(features, features_len)
print(log_probs.shape)

has the following output

[NeMo I 2023-03-10 09:16:37 exportable:86] Successfully exported EncDecCTCModel to model.pt
samples torch.Size([1, 67263]) 16000
features torch.Size([1, 80, 432]) tensor([421])
torch.Size([1, 54, 5207])

You can see that the model returns only a single output.

The model has merged the encoder and decoder into a single module.

It would be nice if the model could also return the length of log_probs, or, if possible, simply not merge the encoder and decoder so that we can invoke them separately, just like transcribe does.

@titu1994

Hmm, I see. I will ask our team members if it's possible to change this, as the output shape requirement needs to be optional (Riva usually does not want it).

Should be doable; RNNT already supports this, but we need to check how to implement it without damaging pre-existing models and exposed paths.

@csukuangfj

as the output shape requirement needs to be optional (Riva usually does not want it).

Does Riva support batch CTC decoding?

We need the length information for batch CTC decoding in sherpa.
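
To sketch why the lengths matter (all shapes and values here are hypothetical): utterances in a batch are padded to the longest one, and without per-utterance lengths the decoder cannot tell real frames from padding.

import torch

# Two utterances padded to 10 frames each, over a vocabulary of 5 tokens.
log_probs = torch.randn(2, 10, 5)         # (batch, padded_frames, vocab)
log_probs_length = torch.tensor([10, 6])  # true number of frames per utterance

for probs, n in zip(log_probs, log_probs_length.tolist()):
    valid = probs[:n]              # keep only the real frames
    tokens = valid.argmax(dim=-1)  # greedy CTC path over the valid frames
    print(tokens.shape)            # torch.Size([10]), then torch.Size([6])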

@csukuangfj

Hmm, I see. I will ask our team members if it's possible to change this

Thanks!

@titu1994

titu1994 commented Mar 10, 2023

Btw, to note: once a model calls .export(), it is considered a corrupted model. I would suggest not trusting the output of such a model; instead, delete it, load the jit model, and restore another copy of the NeMo model if you need the preprocessor.

Another thing is, if you have Torchaudio installed, you can export the preprocessor too - https://github.com/NVIDIA/NeMo/pull/5512

Dunno why but I forgot to add it to the docs

@titu1994

Does Riva support batch CTC decoding?

Yep.

We need the length information for batch CTC decoding in sherpa.

Their preprocessor internally keeps track of it for CTC, so it somehow works, but I'm not sure of the internals.

@csukuangfj

Does Riva also use the torchscript model? If so, how does Riva know the length of log_probs?

@csukuangfj

FYI:

The documentation of pre-trained models from NeMo is available at
https://k2-fsa.github.io/sherpa/cpp/pretrained_models/offline_ctc/index.html

The following is a screenshot:

[Screenshot 2023-03-10 at 17 43 28]

We can convert more models if needed.

@titu1994

Riva supports both ONNX and TS output; as to how they support it without explicit export, no idea. It's easy enough to estimate the seq length by dividing by the model stride (the length from the preprocessor // 4 for Conformer or // 8 for Citrinet), which should give you a nearly correct seq length.
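
A quick sketch of that estimate (the stride values are those given above, and the numbers come from the Citrinet example earlier in this thread):

def estimate_num_frames(feature_len: int, subsampling_factor: int) -> int:
    # Nearly correct estimate of the log-probs length; may be off by one
    # depending on padding inside the model.
    return feature_len // subsampling_factor

# 432 feature frames with Citrinet's stride of 8 gives 54, matching the
# log_probs shape torch.Size([1, 54, 5207]) printed above.
print(estimate_num_frames(432, 8))  # 54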

@titu1994

That's great! Can you try one of the Conformer CTC models? Those are the current state-of-the-art models in NeMo, trained on much more data than the Citrinet.

@csukuangfj

It's easy enough to estimate the seq length by dividing by the model stride (the length from the preprocessor // 4 for Conformer or // 8 for Citrinet)

The C++ code is fairly generic and all it takes is a .pt file.

Is it possible to read the subsampling factor from the torchscript model? If not, could you add some attributes to the model before exporting so that we can read them in the C++ code?

In icefall, we add attributes to the model, such as the vocab size and subsampling factor, so that we can read them in C++ within sherpa.


Those are the current state-of-the-art models in NeMo, trained on much more data than the Citrinet.

Thanks. I will try the Conformer model.

@uni-manjunath-ke

Hi @csukuangfj, everything that I have tried so far uses Conformer CTC models only. FYI. Thanks

@csukuangfj

@titu1994
I am trying the Conformer CTC models. I just realized that they are also of type nemo_asr.models.EncDecCTCModelBPE, so there is no need to change the C++ code.


I am getting the following error while exporting a Conformer CTC model to torchscript.

The code:

import nemo.collections.asr as nemo_asr

m = nemo_asr.models.EncDecCTCModelBPE.from_pretrained('stt_en_conformer_ctc_small')
m.export("model.pt")

The error for the above code:

[NeMo I 2023-03-10 10:59:23 export_utils:398] Swapped 96 modules
[NeMo W 2023-03-10 10:59:24 nemo_logging:349] /usr/local/lib/python3.9/dist-packages/nemo/collections/asr/modules/conformer_encoder.py:397: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if seq_length > self.max_audio_length:
    
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-102-6bd572e31b75> in <module>
----> 1 m.export("model.pt")

/usr/local/lib/python3.9/dist-packages/nemo/core/classes/exportable.py in export(self, output, input_example, verbose, do_constant_folding, onnx_opset_version, check_trace, dynamic_axes, check_tolerance, export_modules_as_functions, keep_initializers_as_inputs)
     67             model = self.get_export_subnet(subnet_name)
     68             out_name = augment_filename(output, subnet_name)
---> 69             out, descr, out_example = model._export(
     70                 out_name,
     71                 input_example=input_example,

/usr/local/lib/python3.9/dist-packages/nemo/core/classes/exportable.py in _export(self, output, input_example, verbose, do_constant_folding, onnx_opset_version, check_trace, dynamic_axes, check_tolerance, export_modules_as_functions, keep_initializers_as_inputs)
    165                         logging.info(f"JIT code:\n{jitted_model.code}")
    166                     jitted_model.save(output)
--> 167                     jitted_model = torch.jit.load(output)
    168 
    169                     if check_trace:

/usr/local/lib/python3.9/dist-packages/torch/jit/_serialization.py in load(f, map_location, _extra_files)
    160     cu = torch._C.CompilationUnit()
    161     if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 162         cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
    163     else:
    164         cpp_module = torch._C.import_ir_module_from_buffer(

RuntimeError: required keyword attribute 'value' is undefined

Do you have any suggestions about how to fix it?


I am using the following code to install NeMo in a Google Colab notebook:

## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install text-unidecode

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

## Install TorchAudio
!pip install torchaudio>=0.10.0 -f https://download.pytorch.org/whl/torch_stable.html

## Grab the config we'll use in this example
!mkdir configs

@csukuangfj

I found a solution from pytorch/pytorch#81085 (comment)
by disabling

                    # jitted_model = torch.jit.optimize_for_inference(torch.jit.freeze(jitted_model))

in exportable.py

@csukuangfj

@titu1994

I have updated the documentation to include Conformer CTC models from NeMo.

I also added a section describing how to export CTC models from NeMo to sherpa.

Please see

[Screenshot 2023-03-10 at 21 47 47]


By the way, could you add Conformer CTC models for more languages, e.g., Chinese?

@titu1994

Is it possible to read the subsampling factor from the torchscript model? If not, could you add some attributes to the model before exporting so that we can read them in the C++ code?

Sure, we can look into this. Is there an example or documentation of how to add attributes to a torchscript export?

I have notified the team about the issue with torchscript export of Conformer, thanks for finding it! We have export tests for Conformer ONNX since that's usually more efficient, but we want to support both ONNX and TS in NeMo for compatibility.

@titu1994

By the way, could you add Conformer CTC models for more languages, e.g., Chinese?

Yes, absolutely! We have a ton of languages with Conformer support. Here is a non-exhaustive list of all models on NGC for the languages we currently support - https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/scores.html

Huggingface just has the most popular models, but we can add more to HF if there is a request for it.

@csukuangfj

csukuangfj commented Mar 10, 2023

Sure, we can look into this. Is there an example or documentation of how to add attributes to a torchscript export?

If a model has scalar attributes, then after exporting to torchscript, those attributes are kept in the resulting exported model.

For instance, in icefall, the decoder model has the following attributes:
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/decoder.py#L66-L67

        self.context_size = context_size
        self.vocab_size = vocab_size

And we can access the attributes of the exported decoder model in sherpa using the following code:

context_size_ = decoder_.attr("context_size").toInt();
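
For completeness, a minimal end-to-end sketch of the same idea in Python; the class and attribute names here are illustrative:

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, context_size: int, vocab_size: int):
        super().__init__()
        # Scalar attributes like these survive torch.jit.script() and can be
        # read back from the loaded model (e.g., via Module::attr in C++).
        self.context_size = context_size
        self.vocab_size = vocab_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

m = torch.jit.script(Decoder(context_size=2, vocab_size=500))
m.save("decoder.pt")

loaded = torch.jit.load("decoder.pt")
print(loaded.context_size, loaded.vocab_size)  # 2 500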

@csukuangfj

We have a ton of languages with Conformer support. Here is a non-exhaustive list of all models on NGC for the languages we currently support - https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/scores.html

Unfortunately, the list does not include Conformer CTC models for Chinese.

[Screenshot 2023-03-11 at 05 25 29]

@csukuangfj

I have notified the team about the issue with torchscript export of Conformer, thanks for finding it! We have export tests for Conformer ONNX since that's usually more efficient, but we want to support both ONNX and TS in NeMo for compatibility.

Thanks!

@titu1994

Oh, it seems I was mistaken; we have a Conformer transducer large trained on Mandarin, but not CTC. We found Citrinet to do better in CER, so we didn't release the checkpoint. I suppose we could look into it in the future.

@csukuangfj

Oh, it seems I was mistaken; we have a Conformer transducer large trained on Mandarin, but not CTC. We found Citrinet to do better in CER, so we didn't release the checkpoint. I suppose we could look into it in the future.

I see. Thanks!
