
[Efficient Conformer] Support ONNX GPU export, add librispeech results, and fix V2 streaming decode issue #1701

Merged: 11 commits into wenet-e2e:main on Feb 22, 2023

Conversation

@zwglory (Contributor) commented on Feb 21, 2023:

Support ONNX GPU export.

  • After exporting to ONNX, the CER degradation is small: using train_u2++_efficonformer_v1_stream.yaml with causal: true, the CER increases from 9.30% to 9.33% on our dataset.
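For context, a minimal sketch of an ONNX GPU export-and-decode round trip. The dummy module, I/O names, and tensor shapes are illustrative assumptions, not the actual wenet export script:

```python
# Hedged sketch of ONNX GPU export/decode; the module, I/O names, and
# shapes below are illustrative assumptions only.
import torch
import onnxruntime as ort


class DummyStreamingEncoder(torch.nn.Module):
    # Stand-in with the same I/O arity as a streaming encoder chunk forward.
    def forward(self, chunk, att_cache, cnn_cache):
        # Trivial computation so the exported graph is non-empty.
        return chunk * 1.0, att_cache * 1.0, cnn_cache * 1.0


encoder = DummyStreamingEncoder()
chunk = torch.randn(1, 67, 80)           # (batch, frames, feat_dim)
att_cache = torch.zeros(12, 8, 80, 128)  # (layers, head, cache_t1, d_k * 2)
cnn_cache = torch.zeros(12, 1, 256, 14)  # (layers, batch, hidden, cache_t2)

torch.onnx.export(
    encoder, (chunk, att_cache, cnn_cache), "encoder.onnx",
    input_names=["chunk", "att_cache", "cnn_cache"],
    output_names=["output", "new_att_cache", "new_cnn_cache"],
    opset_version=13,
)

# Run the exported model on GPU via onnxruntime's CUDA provider.
sess = ort.InferenceSession("encoder.onnx",
                            providers=["CUDAExecutionProvider"])
out, new_att, new_cnn = sess.run(None, {
    "chunk": chunk.numpy(),
    "att_cache": att_cache.numpy(),
    "cnn_cache": cnn_cache.numpy(),
})
```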

Add LibriSpeech results and configs:

  • train_u2++_efficonformer_v1.yaml
  • train_u2++_efficonformer_v2.yaml

Add a streaming config for aishell:

  • train_u2++_efficonformer_v1_stream.yaml

Fix a bug in V2 streaming decode:

Comment on lines 419 to 437
```diff
 for i, layer in enumerate(self.encoders):
     factor = self.calculate_downsampling_factor(i)
     # NOTE(xcsong): Before layer.forward
     #   shape(att_cache[i:i + 1]) is (1, head, cache_t1, d_k * 2),
     #   shape(cnn_cache[i]) is (b=1, hidden-dim, cache_t2),
     #   shape(new_att_cache) is (batch, head, time2, outdim // head * 2)
     att_cache_trunc = 0
     if xs.size(1) + att_cache.size(2) / factor > pos_emb.size(1):
         # The time step is not divisible by the downsampling multiple.
         # We propose to double the chunk_size.
         att_cache_trunc = xs.size(1) + \
             att_cache.size(2) // factor - pos_emb.size(1) + 1
     xs, _, new_att_cache, new_cnn_cache = layer(
         xs, att_mask, pos_emb,
         mask_pad=mask_pad,
-        att_cache=att_cache[i:i + 1, :, ::factor, :],
+        att_cache=att_cache[i:i + 1, :, ::factor, :][:, :, att_cache_trunc:, :],
         cnn_cache=cnn_cache[i, :, :, :]
         if cnn_cache.size(0) > 0 else cnn_cache
     )
```
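To see concretely what the added truncation does, here is a toy illustration with invented sizes (not the real model dimensions):

```python
# Toy illustration of the cache indexing in the fix above; sizes invented.
import torch

att_cache = torch.randn(12, 4, 16, 128)  # (layers, head, cache_t1, d_k * 2)
i, factor, att_cache_trunc = 3, 2, 3

# Stride the cached time axis by this layer's downsampling factor ...
layer_cache = att_cache[i:i + 1, :, ::factor, :]
print(layer_cache.shape)  # torch.Size([1, 4, 8, 128])

# ... then drop the leading frames that no longer fit into pos_emb.
layer_cache = layer_cache[:, :, att_cache_trunc:, :]
print(layer_cache.shape)  # torch.Size([1, 4, 5, 128])
```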
@xingchensong (Member) commented on Feb 22, 2023:

Q1: Which did you intend,

`xs.size(1) + att_cache.size(2) / factor > pos_emb.size(1)`

or

`(xs.size(1) + att_cache.size(2)) / factor > pos_emb.size(1)`?

The results of these two expressions are not equal:

(screenshot: the two expressions evaluate to different results)
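A standalone check with invented sizes confirms the two groupings can disagree, since `/` binds tighter than `+`:

```python
# Invented sizes, purely to show the two groupings differ.
xs_len, cache_len, factor, pos_len = 16, 32, 2, 30

print(xs_len + cache_len / factor > pos_len)    # 16 + 16.0 = 32.0 > 30 -> True
print((xs_len + cache_len) / factor > pos_len)  # 48 / 2 = 24.0 > 30 -> False
```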

Q2: What do you mean by "double the chunk_size"? I think `[:, :, att_cache_trunc:, :]` simply drops any unnecessary attention cache at the beginning, so where is the doubling?

@zwglory (Contributor, Author) replied:

Q1: It is `xs.size(1) + att_cache.size(2) / factor > pos_emb.size(1)`, because `xs` was already downsampled in the previous block, so only the cached length still needs to be divided by `factor`.

Q2: For train_u2++_efficonformer_v2.yaml, the downsampling rate is 1/2 (conv2d2) plus a further 1/4 (the EfficientConformer block), so enlarging the chunk from 18 to 36 reduces the downsampling loss.

The word "double" is indeed inaccurate here; let me adjust the comment.
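As a back-of-the-envelope check (only the 1/4 block downsampling factor comes from the reply above; everything else is illustrative), 18 frames do not divide evenly by 4 while 36 do:

```python
# Illustrative arithmetic for the chunk-size choice; only the 1/4 block
# downsampling factor is taken from the discussion above.
block_downsample = 4
for chunk in (18, 36):
    print(chunk, chunk / block_downsample, chunk % block_downsample == 0)
# 18 4.5 False  -> frames lost at the chunk boundary
# 36 9.0 True   -> divides evenly, no downsampling loss
```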

Comment on lines 224 to 246
## Efficient Conformer V1 Result

* Feature info: using fbank feature, cmvn, speed perturb, dither
* Training info: train_u2++_efficonformer_v1.yaml, 8 gpu
* Decoding info: ctc_weight 0.5, reverse_weight 0.3, average_num 20

test clean

| decoding mode / chunk size | full | 18   | 16   |
|----------------------------|------|------|------|
| attention decoder          | 3.65 | 3.88 | 3.87 |
| ctc_greedy_search          | 3.46 | 3.79 | 3.77 |
| ctc prefix beam search     | 3.44 | 3.75 | 3.74 |
| attention rescoring        | 3.17 | 3.44 | 3.41 |

test other

| decoding mode / chunk size | full | 18    | 16    |
|----------------------------|------|-------|-------|
| attention decoder          | 8.51 | 9.24  | 9.25  |
| ctc_greedy_search          | 8.94 | 10.04 | 10.06 |
| ctc prefix beam search     | 8.91 | 10.00 | 10.01 |
| attention rescoring        | 8.21 | 9.25  | 9.25  |
@xingchensong (Member) commented:

Thanks, the results are much better than the standard Conformer! I would suggest adding the number of model parameters to the README for a clearer comparison.

@zwglory (Contributor, Author) replied:

OK, no problem

Comment on lines +374 to +380
```python
if self.global_chunk_size > 0:
    # for ONNX decode simulation, padding xs to chunk_size
    real_len = xs.size(1)
    pad_len = self.chunk_feature_map - real_len
    xs = F.pad(xs, (0, 0, 0, pad_len), value=0.0)
    chunk_masks = F.pad(chunk_masks, (0, pad_len), value=0.0)
```

@xingchensong (Member) commented:

Out of curiosity, will padding only be applied to the last chunk, given that previous chunks always have a valid chunk size?

@zwglory (Contributor, Author) replied:

Yes, padding is only applied to the last chunk. Also, this code path currently takes effect only when use_onnx=True is specified manually, in order to simulate the CER after exporting to ONNX.
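For readers unfamiliar with `F.pad`'s tuple convention, a tiny demo with made-up sizes (the chunk length 67 is an assumption, not taken from the config):

```python
# Tiny demo of the right-padding above; all sizes are made up.
import torch
import torch.nn.functional as F

xs = torch.randn(1, 5, 80)    # last chunk: only 5 frames of 80-dim features
chunk_feature_map = 67        # assumed fixed per-chunk frame count for ONNX
pad_len = chunk_feature_map - xs.size(1)

# In F.pad's tuple, (0, 0) pads the feature dim and (0, pad_len)
# right-pads the time dim, mirroring the snippet above.
xs = F.pad(xs, (0, 0, 0, pad_len), value=0.0)
print(xs.shape)               # torch.Size([1, 67, 80])
```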

@xingchensong xingchensong merged commit 9a7d947 into wenet-e2e:main Feb 22, 2023
@xingchensong (Member) commented:

Many thx!
