[Efficient Conformer] Support ONNX GPU export, add librispeech results, and fix V2 streaming decode issue #1701
Conversation
…e changes. Completed the causal and non-causal convolution model tests for the EfficientConformer, as well as JIT runtime tests. Modified yaml files for Aishell-1
…, and fixed a bug in V2 streaming decode.
wenet/efficient_conformer/encoder.py
for i, layer in enumerate(self.encoders):
    factor = self.calculate_downsampling_factor(i)
    # NOTE(xcsong): Before layer.forward
    #   shape(att_cache[i:i + 1]) is (1, head, cache_t1, d_k * 2),
    #   shape(cnn_cache[i]) is (b=1, hidden-dim, cache_t2)
    #   shape(new_att_cache) = [batch, head, time2, outdim // head * 2]
    att_cache_trunc = 0
    if xs.size(1) + att_cache.size(2) / factor > pos_emb.size(1):
        # The time step is not divisible by the downsampling multiple
        # We propose to double the chunk_size.
        att_cache_trunc = xs.size(1) + \
            att_cache.size(2) // factor - pos_emb.size(1) + 1
    xs, _, new_att_cache, new_cnn_cache = layer(
        xs, att_mask, pos_emb,
        mask_pad=mask_pad,
        att_cache=att_cache[i:i + 1, :, ::factor, :][:, :, att_cache_trunc:, :],
        cnn_cache=cnn_cache[i, :, :, :]
        if cnn_cache.size(0) > 0 else cnn_cache
    )
Q1: Should the condition be
    xs.size(1) + att_cache.size(2) / factor > pos_emb.size(1)
or
    (xs.size(1) + att_cache.size(2)) / factor > pos_emb.size(1)?
The results of these two expressions are not equal.

Q2: What do you mean by "double the chunk_size"? I think [:, :, att_cache_trunc:, :] simply drops any unnecessary attention cache at the beginning, so where is the "double"?
Q1: It is xs.size(1) + att_cache.size(2) / factor > pos_emb.size(1): only the cache length needs to be divided by factor, because xs was already downsampled in the previous block.

Q2: For train_u2++_efficonformer_v2.yaml, the downsample rate is 1/2 (conv2d2) plus 1/4 (efficonformer block), so enlarging the decoding chunk size from 18 to 36 reduces the downsampling loss. The description "double" is indeed inaccurate here; let me adjust it.
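To make the precedence point in Q1 concrete, here is a small self-contained sketch; all sizes are hypothetical and chosen only so the truncation branch fires:

```python
import torch

# Hypothetical sizes for illustration only.
chunk = 16      # xs.size(1): chunk length, already downsampled by earlier blocks
cache_t1 = 64   # att_cache.size(2): cached frames at the original resolution
pos_len = 25    # pos_emb.size(1): positional embeddings available
factor = 4      # downsampling factor of this layer

att_cache = torch.zeros(1, 8, cache_t1, 128)  # (1, head, cache_t1, d_k * 2)

att_cache_trunc = 0
# Only the cache length is divided by factor; xs is already downsampled.
if chunk + cache_t1 / factor > pos_len:  # 16 + 16 > 25
    att_cache_trunc = chunk + cache_t1 // factor - pos_len + 1  # = 8

strided = att_cache[0:1, :, ::factor, :]      # subsample the cache: 64 -> 16 frames
trimmed = strided[:, :, att_cache_trunc:, :]  # drop the 8 oldest frames
print(trimmed.shape)  # torch.Size([1, 8, 8, 128]); 16 + 8 frames now fit in pos_len
```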
examples/librispeech/s0/README.md
## Efficient Conformer V1 Result

* Feature info: using fbank feature, cmvn, speed perturb, dither
* Training info: train_u2++_efficonformer_v1.yaml, 8 gpu
* Decoding info: ctc_weight 0.5, reverse_weight 0.3, average_num 20

test clean

| decoding mode          | full | 18   | 16   |
|------------------------|------|------|------|
| attention decoder      | 3.65 | 3.88 | 3.87 |
| ctc_greedy_search      | 3.46 | 3.79 | 3.77 |
| ctc prefix beam search | 3.44 | 3.75 | 3.74 |
| attention rescoring    | 3.17 | 3.44 | 3.41 |

test other

| decoding mode          | full | 18    | 16    |
|------------------------|------|-------|-------|
| attention decoder      | 8.51 | 9.24  | 9.25  |
| ctc_greedy_search      | 8.94 | 10.04 | 10.06 |
| ctc prefix beam search | 8.91 | 10.00 | 10.01 |
| attention rescoring    | 8.21 | 9.25  | 9.25  |
Thx, the results are much better than the standard Conformer! I would suggest adding the model parameter count to the README for a clearer comparison.
OK, no problem
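For reference, a minimal sketch of how such a parameter count is typically computed in PyTorch (the Linear module here is just a stand-in for the trained encoder):

```python
import torch

model = torch.nn.Linear(256, 256)  # stand-in for the real model
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.2f}M parameters")
```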
if self.global_chunk_size > 0:
    # for ONNX decode simulation, padding xs to chunk_size
    real_len = xs.size(1)
    pad_len = self.chunk_feature_map - real_len
    xs = F.pad(xs, (0, 0, 0, pad_len), value=0.0)
    chunk_masks = F.pad(chunk_masks, (0, pad_len), value=0.0)
Out of curiosity, will padding only be applied to the last chunk, given that previous chunks always have a valid chunk size?
Yes, padding is only applied to the last chunk. Also, this part currently takes effect only if you manually specify use_onnx=True, to simulate the CER after exporting to ONNX.
Many thx!
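As an aside, a minimal sketch of the behavior discussed above, with made-up sizes (chunk_feature_map and the 260-frame input are hypothetical): only the final, partial chunk gets a nonzero pad_len.

```python
import torch
import torch.nn.functional as F

chunk_feature_map = 67           # hypothetical fixed per-chunk feature length
feats = torch.randn(1, 260, 80)  # (batch, time, feat_dim); 260 is not a multiple of 67

for start in range(0, feats.size(1), chunk_feature_map):
    xs = feats[:, start:start + chunk_feature_map, :]
    real_len = xs.size(1)
    pad_len = chunk_feature_map - real_len  # 0 for full chunks, > 0 for the last one
    if pad_len > 0:
        xs = F.pad(xs, (0, 0, 0, pad_len), value=0.0)
    print(start, real_len, pad_len, xs.shape)
# Only the final chunk (real_len = 260 - 3 * 67 = 59) is padded, by 8 frames.
```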
* Support ONNX GPU export.
* Add librispeech results and conf.
* Add streaming conf in aishell: train_u2++_efficonformer_v1_stream.yaml with causal: true; the CER increases from 9.30% to 9.33% on our dataset.
* Fix bug of V2 streaming decode.
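For context on the export flow itself, here is a minimal, hypothetical sketch of the generic torch.onnx.export plus onnxruntime-gpu round trip; the Toy module and file name are illustrative, and the actual wenet export script (its cache inputs and options) differs:

```python
import torch
import onnxruntime as ort

class Toy(torch.nn.Module):
    def forward(self, xs):
        return xs * 2

model = Toy().eval()
dummy = torch.randn(1, 67, 80)
torch.onnx.export(
    model, (dummy,), "toy.onnx",
    input_names=["xs"], output_names=["ys"],
    dynamic_axes={"xs": {1: "time"}, "ys": {1: "time"}},
    opset_version=13,
)

# Run on GPU via onnxruntime-gpu; falls back to CPU if CUDA is unavailable.
sess = ort.InferenceSession(
    "toy.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
out = sess.run(["ys"], {"xs": dummy.numpy()})[0]
print(out.shape)
```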