Extract framewise alignment information using CTC decoding #39
Conversation
The following shows the probabilities and log_probabilities of the alignments at each frame after subsampling. You can see that the probability is very spiky: it is almost always one. If we use word alignment information, it is difficult, if not impossible, to insert blanks between words. From lhotse-speech/lhotse#378 (comment)
The word alignment from https://github.com/CorentinJ/librispeech-alignments assumes that a word's end time is the next word's start time. Furthermore, we have to break words into tokens, which makes the implementation more complicated than using framewise alignment directly.
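To make the extra complication concrete, here is a small sketch of breaking a word-level alignment into BPE pieces with sentencepiece; the model path and the example times are hypothetical, and the point is that the word-level times give no principled way to split the interval among the pieces or to place blanks between words.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe/bpe.model")  # hypothetical path

# One entry from a word-level alignment: only the word's start/end are known.
word, start, end = "SHOWED", 4.68, 5.10  # the end time is made up for illustration

pieces = sp.encode_as_pieces(word)
# The word alignment gives a single (start, end) interval for the whole word;
# how to distribute that interval over the BPE pieces, and where to insert
# blanks between words, is not determined by the word-level information.
print(pieces, start, end)
```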
Sure, I think this approach makes sense. Certainly we will need to have scripts to compute alignments, at some point.
Unlike features, I would propose to store framewise alignment information separately. We can have the following layout:
where alignments are indexed by utterance IDs, i.e., cut IDs (see Line 329 in 27a6d5e).
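A minimal sketch of what this separate storage could look like; the file name, helper names, and the use of torch.save are assumptions for illustration, not the actual layout used in this pull request.

```python
import torch

def save_alignments(ali: dict, filename: str) -> None:
    """ali maps cut_id (str) -> list of token IDs, one per subsampled frame."""
    torch.save(ali, filename)

def load_alignments(filename: str) -> dict:
    """Return the cut_id -> framewise token IDs mapping."""
    return torch.load(filename)

# Example usage (hypothetical values):
# ali = {"8224-274384-0008": [0, 0, 17, 17, 0]}
# save_alignments(ali, "ali/train-clean-100.pt")
```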
The alignment does not occupy too much memory. I think we can keep it in memory and look it up on the fly:
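A rough illustration of looking the alignment up on the fly during training; the file path and the way batches expose cut IDs are assumptions, not icefall's actual training loop.

```python
import torch

# Load the whole mapping once; it is small enough to keep in memory.
ali = torch.load("ali/train-clean-100.pt")  # cut_id -> framewise token IDs

def get_alignments(batch_cut_ids):
    """Look up the framewise alignments for the cuts in the current batch."""
    return [ali[cut_id] for cut_id in batch_cut_ids]
```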
Would there be interest in storing the alignments in Cuts using the proposed mechanisms described in lhotse-speech/lhotse#393?
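For reference, a sketch of what attaching an alignment to a supervision might look like under that proposal; the AlignmentItem fields and the with_alignment call are assumptions based on lhotse-speech/lhotse#393, not a confirmed API.

```python
from lhotse.supervision import AlignmentItem

def with_word_alignment(supervision, words):
    """words: list of (symbol, start, duration) tuples for this supervision.

    Returns a copy of the supervision carrying a "word" alignment, following
    the interface proposed in lhotse-speech/lhotse#393 (an assumption here).
    """
    items = [AlignmentItem(symbol=w, start=s, duration=d) for w, s, d in words]
    return supervision.with_alignment(kind="word", alignment=items)
```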
I'm personally OK with either method, but I'll let Fangjun do whatever is easiest for him.
For this specific task, i.e., using alignment information in MMI training, I feel it is easier to store the alignment separately. I agree the approach in lhotse-speech/lhotse#393 is more general. However, it needs more work, I think (I haven't figured out how it would be implemented).
For the following test wave
librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac
the framewise alignment computed using the model from #17 is
(the first column is in seconds; the second column is BPE tokens, i.e., lattice.labels. I have scaled the times by the subsampling factor, which is 4.)
The alignment information from https://github.com/CorentinJ/librispeech-alignments for this wave is
The following table compares the alignment information obtained with this pull request with the one from https://github.com/CorentinJ/librispeech-alignments:
Since we are using a subsampling factor of 4 in the model, the resolution of the alignment is 4 frames, i.e., 0.04 seconds, as the frame shift is 0.01 seconds.
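As a concrete illustration of that resolution, the time of a token can be recovered from the index of its subsampled frame roughly as follows; this is a sketch, and the actual script in this pull request may compute it differently.

```python
SUBSAMPLING_FACTOR = 4
FRAME_SHIFT = 0.01  # seconds

def frame_index_to_seconds(i: int) -> float:
    """Convert the index of a subsampled frame to a time in seconds."""
    return i * SUBSAMPLING_FACTOR * FRAME_SHIFT

# e.g. a token at subsampled frame 117 starts at 117 * 4 * 0.01 = 4.68 seconds
```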
To compare the alignment information in more detail, I select the subpart of the wave containing the word SHOWED.
The waveform and spectrogram of that part are shown in the following:
You can see that this pull request assigns 4.68 seconds as the start time of SHOWED, which is closer to the actual start.
@danpovey
Can we compute the alignment information ourselves using a pre-trained CTC model?
The reasons are that:
(1) It is framewise (after subsampling), so it is easier to use than word alignments.
(2) It is as accurate as the one computed with https://github.com/CorentinJ/librispeech-alignments, though I have only compared one wave.
(3) Users don't need to download extra alignment information, though they have to pre-train a CTC model. But for datasets that don't have alignment information publicly available, this is the only way to go, I think.
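For illustration only, here is a very simplified way to get a framewise alignment from a pre-trained CTC model by taking the per-frame argmax of the log-probabilities. The actual approach in this pull request decodes a lattice and reads lattice.labels, which also uses the transcript, so treat this as a sketch under assumed interfaces rather than the method used here.

```python
import torch

@torch.no_grad()
def framewise_argmax_alignment(model: torch.nn.Module,
                               features: torch.Tensor) -> torch.Tensor:
    """Return one token ID per subsampled frame (ID 0 is assumed to be the blank).

    The model is assumed to output (N, T', vocab_size) log-probabilities after
    subsampling; this interface is an assumption, not icefall's actual API.
    """
    log_probs = model(features)        # (N, T', vocab_size)
    return log_probs.argmax(dim=-1)    # (N, T') framewise token IDs
```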