Extract framewise alignment information using CTC decoding #39
Conversation
The following shows the probabilities and log_probabilities of the alignments at each frame after subsampling. You can see that the probability is very spiky: it is almost always one. If we use word alignment information, it is difficult, if not impossible, to insert blanks between words. From lhotse-speech/lhotse#378 (comment)
The word alignment from https://github.com/CorentinJ/librispeech-alignments assumes that a word's end time is the next word's start time. Furthermore, we have to break words into tokens, which makes the implementation more complicated than using framewise alignment directly.
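To make the extra complication concrete, here is a small sketch of breaking a word-level alignment into BPE pieces with sentencepiece; the model path and the example times are hypothetical, and the point is that the word-level times give no principled way to split the interval among the pieces or to place blanks between words.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe/bpe.model")  # hypothetical path

# One entry from a word-level alignment: only the word's start/end are known.
word, start, end = "SHOWED", 4.68, 5.10  # the end time is made up for illustration

pieces = sp.encode_as_pieces(word)
# The word alignment gives a single (start, end) interval for the whole word;
# how to distribute that interval over the BPE pieces, and where to insert
# blanks between words, is not determined by the word-level information.
print(pieces, start, end)
```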
Sure, I think this approach makes sense. Certainly we will need to have scripts to compute alignments, at some point.
Unlike features, I would propose to store framewise alignment information separately. We can have the following layout:
where alignments are indexed by utterance IDs, i.e., cut IDs (see Line 329 in 27a6d5e).
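A minimal sketch of what this separate storage could look like; the file name, helper names, and the use of torch.save are assumptions for illustration, not the actual layout used in this pull request.

```python
import torch

def save_alignments(ali: dict, filename: str) -> None:
    """ali maps cut_id (str) -> list of token IDs, one per subsampled frame."""
    torch.save(ali, filename)

def load_alignments(filename: str) -> dict:
    """Return the cut_id -> framewise token IDs mapping."""
    return torch.load(filename)

# Example usage (hypothetical values):
# ali = {"8224-274384-0008": [0, 0, 17, 17, 0]}
# save_alignments(ali, "ali/train-clean-100.pt")
```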
The alignment does not occupy too much memory. I think we can keep it in memory and look it up on the fly:
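A rough illustration of looking the alignment up on the fly during training; the file path and the way batches expose cut IDs are assumptions, not icefall's actual training loop.

```python
import torch

# Load the whole mapping once; it is small enough to keep in memory.
ali = torch.load("ali/train-clean-100.pt")  # cut_id -> framewise token IDs

def get_alignments(batch_cut_ids):
    """Look up the framewise alignments for the cuts in the current batch."""
    return [ali[cut_id] for cut_id in batch_cut_ids]
```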
Would there be interest in storing the alignments in Cuts using the proposed mechanisms described in lhotse-speech/lhotse#393?
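For reference, a sketch of what attaching an alignment to a supervision might look like under that proposal; the AlignmentItem fields and the with_alignment call are assumptions based on lhotse-speech/lhotse#393, not a confirmed API.

```python
from lhotse.supervision import AlignmentItem

def with_word_alignment(supervision, words):
    """words: list of (symbol, start, duration) tuples for this supervision.

    Returns a copy of the supervision carrying a "word" alignment, following
    the interface proposed in lhotse-speech/lhotse#393 (an assumption here).
    """
    items = [AlignmentItem(symbol=w, start=s, duration=d) for w, s, d in words]
    return supervision.with_alignment(kind="word", alignment=items)
```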
I'm personally OK with either method, but I'll let Fangjun do whatever is easiest for him.
For this specific task, i.e., using alignment information in MMI training, I feel it is easier to store the alignment separately. I agree the approach in lhotse-speech/lhotse#393 is more general. However, it needs more work, I think (I haven't figured out how it would be implemented).
For the following test wave
librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac
the framewise alignment computed using the model from #17 is
(the first column is in seconds; the second column is BPE tokens, i.e., lattice.labels. I have scaled the times by the subsampling factor, which is 4.)
The alignment information from https://github.com/CorentinJ/librispeech-alignments for this wave is
The following table compares the alignment information obtained with this pull request with the one from https://github.com/CorentinJ/librispeech-alignments:
Since we are using a subsampling factor of 4 in the model, the resolution of the alignment is 4 frames, i.e., 0.04 seconds, as the frame shift is 0.01 seconds.
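As a concrete illustration of that resolution, the time of a token can be recovered from the index of its subsampled frame roughly as follows; this is a sketch, and the actual script in this pull request may compute it differently.

```python
SUBSAMPLING_FACTOR = 4
FRAME_SHIFT = 0.01  # seconds

def frame_index_to_seconds(i: int) -> float:
    """Convert the index of a subsampled frame to a time in seconds."""
    return i * SUBSAMPLING_FACTOR * FRAME_SHIFT

# e.g. a token at subsampled frame 117 starts at 117 * 4 * 0.01 = 4.68 seconds
```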
To compare the alignment information in more detail, I select the subpart of the wave containing the word SHOWED.
The waveform and spectrogram of that part are shown in the following:
You can see that this pull request assigns 4.68 seconds as the start time of SHOWED, which is closer to the actual start.
@danpovey
Can we compute the alignment information ourselves using a pre-trained CTC model?
The reasons are that:
(1) It is framewise (after subsampling), so it is easier to use than word alignments.
(2) It is as accurate as the one computed with https://github.com/CorentinJ/librispeech-alignments, though I have only compared one wave.
(3) Users don't need to download extra alignment information, though they have to pre-train a CTC model. But for datasets that don't have alignment information publicly available, this is the only way to go, I think.
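For illustration only, here is a very simplified way to get a framewise alignment from a pre-trained CTC model by taking the per-frame argmax of the log-probabilities. The actual approach in this pull request decodes a lattice and reads lattice.labels, which also uses the transcript, so treat this as a sketch under assumed interfaces rather than the method used here.

```python
import torch

@torch.no_grad()
def framewise_argmax_alignment(model: torch.nn.Module,
                               features: torch.Tensor) -> torch.Tensor:
    """Return one token ID per subsampled frame (ID 0 is assumed to be the blank).

    The model is assumed to output (N, T', vocab_size) log-probabilities after
    subsampling; this interface is an assumption, not icefall's actual API.
    """
    log_probs = model(features)        # (N, T', vocab_size)
    return log_probs.argmax(dim=-1)    # (N, T') framewise token IDs
```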