WIP: Try to use multiple datasets with pruned transducer loss #245
Conversation
If removing the extra nn.Linear(), the … [edit] I mean the extra nn.Linear() in the decoder. The extra nn.Linear() in the joiner is there to reduce the number of parameters (if the vocab size is large); it can be removed.
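For readers following along, here is a minimal sketch of where such an extra projection sits in a stateless decoder. The class, parameter names, and structure are illustrative guesses, not the exact layers from this PR:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Stateless-decoder sketch: embedding plus limited left context."""

    def __init__(self, vocab_size, embedding_dim, output_dim, context_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Limited left context over the previous `context_size` symbols.
        self.conv = nn.Conv1d(
            embedding_dim, embedding_dim, kernel_size=context_size, bias=False
        )
        # The "extra" nn.Linear() discussed above; removing it ties the
        # decoder output dimension to embedding_dim.
        self.output_linear = nn.Linear(embedding_dim, output_dim)

    def forward(self, y):
        # y: (N, U) token ids.
        emb = self.embedding(y).permute(0, 2, 1)            # (N, embedding_dim, U)
        emb = F.pad(emb, (self.conv.kernel_size[0] - 1, 0))  # causal left padding
        out = self.conv(emb).permute(0, 2, 1)                # (N, U, embedding_dim)
        return self.output_linear(F.relu(out))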
Yes, for the librispeech recipe, we are using vocab size 500, so the …
I don't know the underlying reason. Maybe we should document it in k2.
Guys, …
... if we're using the pruned-loss training, it might be worthwhile trying with encoder-output-dim = 1024.
Can it be fixed on the k2 side so that we can use a larger encoder_out_dim without adding extra …
Guys, I'm not so enthusiastic about avoiding the extra linear layer if it requires that embedding_dim >= vocab_size.
In that case, more than half of the encoder outputs are not used in

px_am = torch.gather(
    am.unsqueeze(1).expand(B, S, T, C),
    dim=3,
    # symbols are used as indices into dim C, so only channels [0, vocab_size)
    # of the encoder output are ever read here.
    index=symbols.reshape(B, S, 1, 1).expand(B, S, T, 1),
).squeeze(-1)  # [B][S][T]

You can see that only the left half of …
@csukuangfj I think it is a mistake to conflate or identify the encoder_output_dim with the vocabulary size; I think there should be a projection from one to the other. But actually, in my opinion, it might make more sense to conceptualize the encoder_output_dim as the "hidden dim" of the joiner, i.e. where the nonlinearity (tanh) takes place. That is: we'd change it so the network would have output of dim == attention-dim (i.e. no linear projection at the output), and we could project that in different ways in the Transducer model:
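As a hedged illustration of that idea (not code from this PR), the joiner's input dimension can be treated as the hidden dim where tanh is applied, with the Transducer model owning the projections into it; all names and dims below are illustrative:

import torch
import torch.nn as nn

class Joiner(nn.Module):
    """Joiner whose input dim is the hidden dim where tanh is applied."""

    def __init__(self, joiner_dim, vocab_size):
        super().__init__()
        self.output_linear = nn.Linear(joiner_dim, vocab_size)

    def forward(self, encoder_out, decoder_out):
        # Both inputs are expected to be projected to joiner_dim already.
        return self.output_linear(torch.tanh(encoder_out + decoder_out))

# In the Transducer model, the encoder output keeps dim == attention_dim
# (no linear projection at the encoder output) and is projected as needed:
attention_dim, decoder_embedding_dim, joiner_dim, vocab_size = 512, 512, 1024, 500
encoder_proj = nn.Linear(attention_dim, joiner_dim)
decoder_proj = nn.Linear(decoder_embedding_dim, joiner_dim)
joiner = Joiner(joiner_dim, vocab_size)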
Thanks! I see. Will make a change.
mask = make_pad_mask(lengths)
x = self.encoder(x, src_key_padding_mask=mask)  # (T, N, C)

x = x.permute(1, 0, 2)  # (T, N, C) -> (N, T, C)
The last nn.Linear() from the transformer model is removed.
if self.normalize_before:
    x = self.after_norm(x)

x = x.permute(1, 0, 2)  # (T, N, C) -> (N, T, C)
The last nn.Linear() from the conformer model is removed.
src = residual + self.ff_scale * self.dropout(self.feed_forward(src))
if not self.normalize_before:
    src = self.norm_ff(src)
The last nn.LayerNorm of the conformer encoder layer is also removed.
Otherwise, when normalize_before is True,
(1) the output of the LayerNorm of the i-th encoder layer is fed into the input of the LayerNorm of the (i+1)-th encoder layer;
(2) the output of the LayerNorm of the last encoder layer is fed into the input of the LayerNorm in the conformer model.
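A tiny sketch (illustrative stand-ins, not the actual classes) of why that layer-final LayerNorm is redundant under normalize_before=True: whatever it produces is normalized again by the next layer's input LayerNorm, or by the model-level after_norm for the last layer.

import torch
import torch.nn as nn

d_model = 8
layer_final_norm = nn.LayerNorm(d_model)  # stands in for the norm removed here
next_input_norm = nn.LayerNorm(d_model)   # e.g. the (i+1)-th layer's norm, or after_norm

x = torch.randn(2, 3, d_model)
y = next_input_norm(layer_final_norm(x))  # two LayerNorms applied back to back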
"subsampling_factor": 4, | ||
"attention_dim": 512, | ||
"decoder_embedding_dim": 512, | ||
"joiner_dim": 1024, # input dim of the joiner |
Joiner dim is set to 1024.
boundary[:, 2] = y_lens
boundary[:, 3] = x_lens

simple_decoder_out = simple_decoder_linear(decoder_out)
Two nn.Linear() layers are used to transform the encoder output and decoder output for computing the simple loss.
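As a rough, self-contained sketch of how those two projections could feed k2's smoothed loss (assuming the k2.rnnt_loss_smoothed API with return_grad; all tensors, shapes, and dims below are made up for illustration, not taken from this PR):

import torch
import torch.nn as nn
import k2

# Illustrative shapes: B utterances, T frames, S symbols.
B, T, S = 2, 20, 5
attention_dim, decoder_dim, vocab_size, blank_id = 512, 512, 500, 0

encoder_out = torch.randn(B, T, attention_dim)    # stand-in for the encoder output
decoder_out = torch.randn(B, S + 1, decoder_dim)  # stand-in for the decoder output
symbols = torch.randint(1, vocab_size, (B, S))    # padded label sequences

boundary = torch.zeros(B, 4, dtype=torch.int64)
boundary[:, 2] = S  # y_lens
boundary[:, 3] = T  # x_lens

# The two nn.Linear() layers that project to vocab_size for the simple loss.
simple_encoder_linear = nn.Linear(attention_dim, vocab_size)
simple_decoder_linear = nn.Linear(decoder_dim, vocab_size)

simple_encoder_out = simple_encoder_linear(encoder_out)  # (B, T, vocab_size)
simple_decoder_out = simple_decoder_linear(decoder_out)  # (B, S + 1, vocab_size)

simple_loss, (px_grad, py_grad) = k2.rnnt_loss_smoothed(
    lm=simple_decoder_out,
    am=simple_encoder_out,
    symbols=symbols,
    termination_symbol=blank_id,
    boundary=boundary,
    return_grad=True,
)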
    am=simple_encoder_out, lm=simple_decoder_out, ranges=ranges
)

am_pruned = encoder_linear(am_pruned)
Two nn.Linear() layers are used to transform the pruned outputs to the dimension of joiner_dim.
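Continuing the sketch above, a hedged illustration of the pruned path, assuming the k2.get_rnnt_prune_ranges / k2.do_rnnt_pruning / k2.rnnt_loss_pruned API. The variables reuse the hypothetical ones from the previous sketch; the exact tensors and layer dimensions fed to the pruning may differ from this WIP's real code.

joiner_dim, prune_range = 1024, 5

# Pruning bounds are derived from the gradients of the simple loss.
ranges = k2.get_rnnt_prune_ranges(
    px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=prune_range
)
am_pruned, lm_pruned = k2.do_rnnt_pruning(
    am=simple_encoder_out, lm=simple_decoder_out, ranges=ranges
)

# The two nn.Linear() layers that bring the pruned outputs to joiner_dim.
encoder_linear = nn.Linear(vocab_size, joiner_dim)
decoder_linear = nn.Linear(vocab_size, joiner_dim)
am_pruned = encoder_linear(am_pruned)  # (B, T, prune_range, joiner_dim)
lm_pruned = decoder_linear(lm_pruned)  # (B, T, prune_range, joiner_dim)

# Joiner: tanh at joiner_dim, then project to vocab_size.
output_linear = nn.Linear(joiner_dim, vocab_size)
logits = output_linear(torch.tanh(am_pruned + lm_pruned))

pruned_loss = k2.rnnt_loss_pruned(
    logits=logits,
    symbols=symbols,
    ranges=ranges,
    termination_symbol=blank_id,
    boundary=boundary,
)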
Cool. We should experiment whether joiner_dim=512 or joiner_dim=1024 works better, e.g. with a few epochs. I imagine 1024 will be an easy win, but we'll see.
Why is this not merged yet? Was it worse? [oh, I see, this is not the latest pruned_transducer_stateless2 setup...]
Closing via #312
It also refactors the decoder and joiner to remove the extra nn.Linear() layer. Will try #229 with this PR.