The differences between funasr-1.x.x and funasr-0.x.x versions #1319

LauraGPT (Maintainer) · 2024-01-30T03:10:46Z

FunASR

To run without errors, the versions of modelscope, funasr and the model params should match as follows:

funasr>=1.0.0, modelscope>=1.11.1 (recommended):

We recommend using AutoModel:

from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
                  # spk_model="cam++", spk_model_revision="v2.0.2",
                  )
res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
                     batch_size_s=300,
                     hotword='魔搭')
print(res)
# res: [{'key': 'wav_name', 'text': 'transcripts text'}]

More examples can be found in the docs.

If you still want to use the modelscope pipeline:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', model_revision="v2.0.4",
    vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch', vad_model_revision="v2.0.4",
    punc_model='iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch', punc_model_revision="v2.0.4",
    # spk_model="iic/speech_campplus_sv_zh-cn_16k-common",
    # spk_model_revision="v2.0.2",
)

res = inference_pipeline(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(res)
# res: [{'key': 'wav_name', 'text': 'transcripts text'}]

funasr==0.8.8, modelscope==1.10.0 (legacy, not recommended):

The old version is no longer maintained:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    punc_model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch', 
)

res = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(res)
# res: {'key': 'wav_name', 'text': 'transcripts text'}
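
The version pairings listed above can be checked before any model is loaded. The following is only a minimal sketch: the helper names `parse_version`, `classify_setup` and `installed_setup` are hypothetical, and the simple numeric parsing is an assumption that ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.0.3' into (1, 0, 3); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad to three components so that '1.0' compares equal to '1.0.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def classify_setup(funasr_version, modelscope_version):
    """Return 'recommended', 'legacy' or 'unlisted' for the pairings in this post."""
    f = parse_version(funasr_version)
    m = parse_version(modelscope_version)
    if f >= (1, 0, 0) and m >= (1, 11, 1):
        return "recommended"
    if f == (0, 8, 8) and m == (1, 10, 0):
        return "legacy"
    return "unlisted"


def installed_setup():
    """Classify the packages installed in the current environment, if any."""
    try:
        return classify_setup(metadata.version("funasr"),
                              metadata.version("modelscope"))
    except metadata.PackageNotFoundError:
        return "not installed"


print(classify_setup("1.0.3", "1.11.1"))
print(classify_setup("0.8.8", "1.10.0"))
```

Calling `installed_setup()` in your own environment tells you which of the two code paths above applies; an "unlisted" result means the combination is not one of the pairings described here.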

The main differences in usage:

3.1. In the latest version (funasr>=1.0.3 and modelscope>=1.11.1), you can obtain the model params in two ways:
- a. Automatic download by funasr (default):
  
  (Note: this works in both the latest and the old versions. In the latest version (funasr>=1.0.3) you should pass model_revision. In the old version (funasr-0.8.8) you must not pass it, otherwise the code fails with errors.)
  
  When you run the code above, funasr checks whether model is a local path or a model name.
  If model is a local path, the download is skipped.
  If model is a model name from the model zoo, the model params are downloaded automatically from the zoo.
- b. Manual git clone (only in the latest version):
  
  Note: only use git clone with the latest version (funasr>=1.0.3). With funasr-0.8.3 it fails with errors.
  
  You can download the model params with git clone, for example:
```
git clone https://www.modelscope.cn/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git
```
  Then set model to the local path of the cloned directory.
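
The local-path-or-model-name check described in 3.1 can be pictured roughly as below. This is an illustrative sketch only, not funasr's actual code: `resolve_model_source` is a hypothetical helper, and treating "an existing directory" as the test for a local path is an assumption.

```python
import os
import tempfile


def resolve_model_source(model):
    """Return ('local', path) if `model` is an existing directory, else ('hub', name)."""
    expanded = os.path.expanduser(model)
    if os.path.isdir(expanded):
        # A cloned or previously downloaded model directory: nothing to download.
        return ("local", os.path.abspath(expanded))
    # Otherwise treat the string as a model name to fetch from the model zoo.
    return ("hub", model)


# A temporary directory stands in for a git-cloned model folder.
with tempfile.TemporaryDirectory() as cloned_dir:
    local_result = resolve_model_source(cloned_dir)

hub_result = resolve_model_source("paraformer-zh")
print(local_result[0], hub_result)
```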
3.2. In the latest version (funasr>=1.0.3 and modelscope>=1.11.1), the input argument is named input:
```
res = model.generate(input='audio.wav')
```
or
```
res = inference_pipeline(input='audio.wav')
```
But in the old version, the input argument is named audio_in:
```
res = inference_pipeline(audio_in='audio.wav')
```
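
If the same script has to run against both versions, the renaming of the input argument described in 3.2 can be hidden behind a small helper. This is a sketch under stated assumptions: `input_kwarg_name` and `build_call_kwargs` are hypothetical names, and switching on the major version number (1.x versus 0.x) is a simplification of the version ranges given in this post.

```python
def input_kwarg_name(funasr_version):
    """Pick the keyword used for the audio argument for a given funasr version string."""
    major = int(funasr_version.split(".")[0])
    # Assumption: every 1.x release uses `input`, every 0.x release uses `audio_in`.
    return "input" if major >= 1 else "audio_in"


def build_call_kwargs(funasr_version, audio, **extra):
    """Build the keyword arguments for a pipeline call, e.g. {'input': 'audio.wav'}."""
    kwargs = {input_kwarg_name(funasr_version): audio}
    kwargs.update(extra)
    return kwargs


print(build_call_kwargs("1.0.3", "audio.wav"))
print(build_call_kwargs("0.8.8", "audio.wav"))
```

The resulting dict can then be unpacked into the call, for example `inference_pipeline(**build_call_kwargs(version, 'audio.wav'))`, where `version` is the installed funasr version string.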
3.3. In the latest version (funasr>=1.0.3 and modelscope>=1.11.1), the output is a list:
```
print(res)
# res: [{'key': 'wav_name', 'text': 'transcripts text'}]
```
But in the old version, the output is a dict:
```
print(res)
# res: {'key': 'wav_name', 'text': 'transcripts text'}
```
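
Downstream code that should not care which version produced the result can normalise both output shapes from 3.3 into a list first. A minimal sketch, with `as_result_list` and `texts` as hypothetical helper names and the sample values taken from the placeholders above:

```python
def as_result_list(res):
    """Wrap the old dict-style result in a list; pass the new list-style result through."""
    if isinstance(res, dict):
        return [res]
    return list(res)


def texts(res):
    """Collect the 'text' field of every result entry."""
    return [item["text"] for item in as_result_list(res)]


new_style = [{"key": "wav_name", "text": "transcripts text"}]
old_style = {"key": "wav_name", "text": "transcripts text"}
print(texts(new_style) == texts(old_style))
```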

3.4. In the latest version (funasr>=1.0.3 and modelscope>=1.11.1), batching is controlled as follows:

If you run inference without vad_model, batch_size refers to the number of audio files:

(Note: this is supported in both the latest and the old versions.)

model = AutoModel(model="paraformer-zh", model_revision="v2.0.4")

res = model.generate(input='audio.wav', batch_size=64)

or

inference_pipeline = pipeline(task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')

res = inference_pipeline(input='wav.scp', batch_size=64)

If you run inference with vad_model, batch_size_s refers to the total duration of audio in seconds (s):

model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
                  )

res = model.generate(input='audio.wav', batch_size_s=300)

or

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', model_revision="v2.0.4",
    vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch', vad_model_revision="v2.0.4",
    punc_model='iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch', punc_model_revision="v2.0.4",
)

res = inference_pipeline(input='audio.wav', batch_size_s=300)

But in the old version, the parameter is batch_size_token:

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', 
    vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch', 
    punc_model='iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch', 
)

res = inference_pipeline(audio_in='audio.wav', batch_size_token=5000)
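
To make the difference between batch_size (a count of files) and batch_size_s (a budget of seconds) more concrete, here is a toy illustration of grouping segments by total duration. It is not funasr's internal batching logic, only a sketch of the idea; `group_by_duration` and the example durations are hypothetical.

```python
def group_by_duration(durations_s, batch_size_s):
    """Greedily group segment indices so each batch holds at most batch_size_s seconds.

    A single segment longer than the budget still gets a batch of its own.
    """
    batches, current, total = [], [], 0.0
    for index, duration in enumerate(durations_s):
        # Start a new batch when adding this segment would exceed the budget.
        if current and total + duration > batch_size_s:
            batches.append(current)
            current, total = [], 0.0
        current.append(index)
        total += duration
    if current:
        batches.append(current)
    return batches


segments = [120.0, 100.0, 90.0, 200.0, 30.0]  # segment durations in seconds
batches = group_by_duration(segments, batch_size_s=300)
print(batches)
```

With a count-based batch_size the number of items per batch is fixed regardless of their length, whereas a duration budget lets many short segments share a batch while long ones are processed in smaller groups.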

iamanigeeit · 2024-07-16T01:06:07Z


For me, the previous Mandarin-only model (aishell2-vocab5212) was better for non-standard Mandarin. It is also 4-5x faster during inference.

Example: transcripts generated for an utterance from the AISHELL-3 test set (SSB06930005.txt; please rename the file to .wav to listen)

I forced all models to produce Chinese by setting decoder_out = -∞ for non-Hanzi tokens.

seaco_paraformer_large: 江苏苏咻咻西安乃濑十四涮明西之一软跟户阳折赫执画换品还版嗯画好与售 (nonsense)
aishell2-vocab5212: 江苏休闲奶奶玩去捐传一为养婆幻配老婆和儿走
Ground Truth: 江苏修鞋奶奶婉拒捐款一人养活患病老伴和儿子
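
For readers who want to reproduce the restriction to Chinese output mentioned above (decoder_out = -∞ for non-Hanzi tokens), the idea can be sketched on plain Python lists. This is not the code used for the results above: `is_hanzi_token`, `mask_non_hanzi`, the toy vocabulary and the scores are all hypothetical, and the Hanzi test only covers the basic CJK Unified Ideographs range U+4E00 to U+9FFF.

```python
import math


def is_hanzi_token(token):
    """True if the token is non-empty and made only of basic CJK ideographs."""
    return len(token) > 0 and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def mask_non_hanzi(scores, vocab):
    """Replace the score of every non-Hanzi token with -inf so it can never be chosen."""
    return [score if is_hanzi_token(token) else -math.inf
            for score, token in zip(scores, vocab)]


vocab = ["<blank>", "hello", "江", "苏", "@@ing", "奶奶"]
decoder_out = [2.5, 3.0, 1.0, 0.5, 4.0, 0.2]  # toy scores for one decoding step

masked = mask_non_hanzi(decoder_out, vocab)
best = vocab[max(range(len(masked)), key=masked.__getitem__)]
print(best)
```

In a real decoder the same mask would be applied to the score tensor at every step, and special tokens that decoding depends on (for example an end-of-sentence token) would probably need to stay unmasked.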

However, the old pipeline method does not work anymore. Therefore, I have mapped the configs over:
speech_paraformer_asr_nat-zh-cn-16k-aishell2-vocab5212-pytorch.zip

Just unzip and copy it into the ~/.cache/modelscope/hub/iic/speech_paraformer_asr_nat-zh-cn-16k-aishell2-vocab5212-pytorch/ folder after downloading. Then AutoModel should work.
