
WIP: Add timestamp #52

Closed

Conversation

@csukuangfj (Collaborator) commented Jul 6, 2022

Here are some initial results.

For the following test wav (Note: Its name should be 1089-134686-0002.wav, not 1089-134686-0001.wav)
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/blob/main/test_wavs/1089-134686-0001.wav

The ground truth word timestamp from https://github.com/CorentinJ/librispeech-alignments
is

",AFTER,EARLY,NIGHTFALL,THE,YELLOW,LAMPS,WOULD,LIGHT,UP,HERE,AND,THERE,THE,SQUALID,QUARTER,OF,THE,BROTHELS,"                          
"0.360,0.730,1.040,1.770,1.900,2.160,2.590,2.760,3.070,3.270,3.520,3.660,4.090,4.210,4.780,5.310,5.420,5.500,6.160,6.625"

The predicted results of this PR using the pre-trained model from
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/tree/main/exp
are

2022-07-06 21:47:53,604 INFO [offline_server.py:667] Connected: ('127.0.0.1', 48118). Number of connections: 1/10
[('_AFTER', 0.0), ('_E', 0.28), ('AR', 0.44), ('LY', 0.56), ('_NIGHT', 0.88), ('F', 1.24), ('A', 1.36), ('LL', 1.44), ('_THE', 1.6), ('_YE', 1.76), ('LL', 1.8800000000000001), ('OW', 1.96), ('_LA', 2.16), ('M', 2.32), ('P', 2.44), ('S', 2.48), ('_WOULD', 2.6), ('_LIGHT', 2.8000000000000003), ('_UP', 3.08), ('_HE', 3.2800000000000002), ('RE', 3.4), ('_AND', 3.6), ('_THERE', 3.8000000000000003), ('_THE', 4.08), ('_S', 4.28), ('QUA', 4.32), ('LI', 4.48), ('D', 4.64), ('_', 4.84), ('QUA', 4.88), ('R', 5.04), ('TER', 5.12), ('_OF', 5.4), ('_THE', 5.5200000000000005), ('_B', 5.68), ('RO', 5.72), ('TH', 5.88), ('EL', 6.16), ('S', 6.36)]
[0.0, 0.28, 0.44, 0.56, 0.88, 1.24, 1.36, 1.44, 1.6, 1.76, 1.8800000000000001, 1.96, 2.16, 2.32, 2.44, 2.48, 2.6, 2.8000000000000003,
3.08, 3.2800000000000002, 3.4, 3.6, 3.8000000000000003, 4.08, 4.28, 4.32, 4.48, 4.64, 4.84, 4.88, 5.04, 5.12, 5.4, 5.5200000000000005, 5.68, 5.72, 5.88, 6.16, 6.36]
2022-07-06 21:47:54,929 INFO [offline_server.py:651] Disconnected: ('127.0.0.1', 48118). Number of connections: 0/10

For ease of reference, the above results are listed in the following table:

| word | CorentinJ/librispeech-alignments | this PR |
|---|---|---|
| AFTER | 0.36-0.73 | 0-0.28 |
| EARLY | 0.73-1.04 | 0.28-0.88 |
| NIGHTFALL | 1.04-1.77 | 0.88-1.60 |
| THE | 1.77-1.90 | 1.60-1.76 |
| YELLOW | 1.90-2.16 | 1.76-2.16 |
| LAMPS | 2.16-2.59 | 2.16-2.60 |
| WOULD | 2.59-2.76 | 2.60-2.80 |
| LIGHT | 2.76-3.07 | 2.80-3.08 |
| UP | 3.07-3.27 | 3.08-3.28 |
| HERE | 3.27-3.52 | 3.28-3.60 |
| AND | 3.52-3.66 | 3.60-3.80 |
| THERE | 3.66-4.09 | 3.80-4.08 |
| THE | 4.09-4.21 | 4.08-4.28 |
| SQUALID | 4.21-4.78 | 4.28-4.84 |
| QUARTER | 4.78-5.31 | 4.84-5.40 |
| OF | 5.31-5.42 | 5.40-5.52 |
| THE | 5.42-5.50 | 5.52-5.68 |
| BROTHELS | 5.50-6.16 | 5.68-6.36 |
| silence | 6.16-6.625 | N/A |

Since the model subsampling factor is 4 and frame shift is 0.01s, the resolution of the timestamp is 0.04s.
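As a quick sketch of that arithmetic (the helper below is illustrative, not the PR's actual code): a token emitted on output frame `k` gets timestamp `k * 4 * 0.01`. The odd-looking values in the log above (e.g. `1.8800000000000001`) are just floating-point artifacts of this multiplication.

```python
# Sketch only: an illustrative helper, not the actual code in this PR.
def frame_to_time(frame_index: int,
                  subsampling_factor: int = 4,
                  frame_shift: float = 0.01) -> float:
    """Map an output frame index to seconds.

    Each output frame covers `subsampling_factor` input frames, so the
    timestamp resolution is subsampling_factor * frame_shift = 0.04 s.
    """
    return frame_index * subsampling_factor * frame_shift

print(frame_to_time(47))  # 1.88 up to float rounding (cf. 1.8800000000000001 in the log)
```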

You can see that the predicted timestamps of the words in the middle of the utterance are very close to those from https://github.com/CorentinJ/librispeech-alignments

Also note that we are not doing forced alignment here. The words are decoded using greedy search, and it happens that all words are predicted correctly.

@csukuangfj (Collaborator, Author)

From #43, @ngoel17 asked:

> What kind of timestamp format do you recommend?

I am considering using the one from https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h#L168

An example message is given below (in JSON format):

{
  "timestamp": [
     {"token": "_AFTER", "start": 0.0},
     {"token": "_E", "start": 0.28},
     {"token": "AR", "start": 0.44},
     {"token": "LY", "start": 0.56},
     {"token": "_NIGHT", "start": 0.88}
   ],
  "text": "AFTER EARLY NIGHT"
}

The message would be constructed in Python, so it should be fairly straightforward to change its format.

Note that only the start time of a BPE token is given. You have to figure out the start and end time of a word from it.
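As a sketch of that reconstruction (assuming, as in the output above, that a token starting with the word-boundary marker `_` begins a new word; the helper name is my own):

```python
# Sketch: derive word-level start/end times from per-token start times.
# Assumption: tokens beginning with "_" start a new word, as in the output above.
def tokens_to_words(timestamp):
    words = []  # entries are [word, start, end]; end of the last word stays None
    for token, start in timestamp:
        if token.startswith("_"):
            if words:
                words[-1][2] = start  # previous word ends where this word starts
            words.append([token[1:], start, None])
        else:
            words[-1][0] += token  # continuation piece of the current word
    return [tuple(w) for w in words]

tokens = [("_AFTER", 0.0), ("_E", 0.28), ("AR", 0.44), ("LY", 0.56),
          ("_NIGHT", 0.88)]
print(tokens_to_words(tokens))
# [('AFTER', 0.0, 0.28), ('EARLY', 0.28, 0.88), ('NIGHT', 0.88, None)]
```

Since only start times are available, the end of the final word (and any trailing silence) cannot be recovered this way.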

@jtrmal (Collaborator) commented Jul 6, 2022 via email

@ngoel17 (Contributor) commented Jul 6, 2022

To create a CTM, both the start time and the duration are needed. The issue will be with words that have long silence in between, unless the silence token is kept rather than deleted.
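For reference, a CTM line is `<utterance-id> <channel> <start> <duration> <word>`, so given word-level (word, start, end) triples the conversion is mechanical (a sketch; the utterance id and triples below are examples, not output of this PR):

```python
# Sketch: format word-level (word, start, end) triples as CTM lines.
# CTM fields: <utterance-id> <channel> <start-time> <duration> <word>.
def to_ctm(utt_id, words, channel=1):
    lines = []
    for word, start, end in words:
        lines.append(f"{utt_id} {channel} {start:.2f} {end - start:.2f} {word}")
    return "\n".join(lines)

print(to_ctm("1089-134686-0002", [("AFTER", 0.0, 0.28), ("EARLY", 0.28, 0.88)]))
# 1089-134686-0002 1 0.00 0.28 AFTER
# 1089-134686-0002 1 0.28 0.60 EARLY
```

This only works if the end time of each word is known, which is exactly why long inter-word silences are a problem when the start of the next word is taken as the end of the previous one.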

@ahazned (Contributor) commented Jul 7, 2022

One problem with this is that the first word is always labeled with start:0.0, regardless of how much silence/non-speech there is before that word.

@csukuangfj (Collaborator, Author)

> One problem with this is the first word is always labeled with start:0.0 regardless of how much silence/non-speech there is before that word.

No.

It is 0 for this specific test wav, since the first decoded token happens to be on the first frame.

@ahazned (Contributor) commented Jul 7, 2022

> > One problem with this is the first word is always labeled with start:0.0 regardless of how much silence/non-speech there is before that word.
>
> No.
>
> It is 0 for this specific test wave since the first decoded token happens to be on the first frame.

In the test file, the first word "AFTER" starts at approximately 0.402 s, which is well after the first frame, I think.

Can you try the attached file? I added ~5 seconds of silence before the first word and it still gives start:0.0
silence_added_beginning_1089-134686-0001.zip

Here is what I got (on a different model than yours):

Original file (1089-134686-0001.wav): [('▁after', 0.0), ('▁e', 0.56), ('ar', 0.68), ('ly', 0.8), ('▁night', 1.04) .....
Silence added file: [('▁after', 0.0), ('▁e', 5.5200000000000005), ('ar', 5.64), ('ly', 5.76), ('▁night', 6.0) .....

@csukuangfj (Collaborator, Author)

> > > One problem with this is the first word is always labeled with start:0.0 regardless of how much silence/non-speech there is before that word.
> >
> > No.
> > It is 0 for this specific test wave since the first decoded token happens to be on the first frame.
>
> In the test file, the first word "AFTER" starts at approximately 0.402 s, which is well after the first frame, I think.
>
> Can you try the attached file? I added ~5 seconds of silence before the first word and it still gives start:0.0 silence_added_beginning_1089-134686-0001.zip
>
> Here is what I got (on a different model than yours):
>
> Original file (1089-134686-0001.wav): [('▁after', 0.0), ('▁e', 0.56), ('ar', 0.68), ('ly', 0.8), ('▁night', 1.04) ..... Silence added file: [('▁after', 0.0), ('▁e', 5.5200000000000005), ('ar', 5.64), ('ly', 5.76), ('▁night', 6.0) .....

Yes, you are right. I can reproduce the results. Both greedy search and modified_beam_search emit the first token on the first frame.

I think part of the reason is that the model uses global attention: the model can see all the remaining frames even at frame 0.

I will try to use the streaming model to test it.

@csukuangfj (Collaborator, Author) commented Jul 7, 2022

The following are the results for the streaming greedy search using the pre-trained model from
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/tree/main/exp

You can see that the first token is no longer decoded at frame 0.

Moreover, it shows that the model tends to delay the output by about 0.4 s.

Without prepended silence

2022-07-07 16:16:59,019 INFO [server.py:642] connection open
2022-07-07 16:16:59,019 INFO [streaming_server.py:414] Connected: ('127.0.0.1', 56324). Number of connections: 1/2
[('_AFTER', 0.48), ('_E', 1.04), ('AR', 1.12), ('LY', 1.28), ('_NIGHT', 1.44), ('F', 1.8), ('A', 1.84), 
('LL', 1.96), ('_THE', 2.08), ('_YE', 2.2), ('LL', 2.36), ('OW', 2.4), ('_LA', 2.52), ('M', 2.68), 
('P', 2.72), ('S', 2.8000000000000003),  ('_WOULD', 2.96), ('_LIGHT', 3.16), ('_UP', 3.4), ('_HE', 3.64), 
('RE', 3.72), ('_AND', 3.92),  ('_THERE', 4.12), ('_THE', 4.5200000000000005), 
('_S', 4.68), ('QUA', 4.72), ('LI', 4.92), ('D', 5.0), ('_', 5.2), ('QUA', 5.24), ('R', 5.44), ('TER', 5.48), 
('_OF', 5.64), ('_THE', 5.76), ('_B', 5.92), ('RA', 5.96), ('FF', 6.16), ('LE', 6.32), ('S', 6.48)]
2022-07-07 16:17:05,782 INFO [streaming_server.py:398] Disconnected: ('127.0.0.1', 56324). Number of connections: 0/2
2022-07-07 16:17:05,783 INFO [server.py:260] connection closed

With prepended silence

2022-07-07 16:18:19,143 INFO [server.py:642] connection open
2022-07-07 16:18:19,143 INFO [streaming_server.py:414] Connected: ('127.0.0.1', 57802). Number of connections: 1/2
[('_AFTER', 5.44), ('_E', 6.0), ('AR', 6.08), ('LY', 6.28), ('_NIGHT', 6.44), ('F', 6.8), ('A', 6.84), 
('LL', 6.96), ('_THE', 7.08), ('_YE', 7.2), ('LL', 7.32), ('OW', 7.4), ('_LA', 7.5200000000000005), ('M', 7.68), 
('P', 7.72), ('S', 7.8), ('_WOULD', 7.96), ('_LIGHT', 8.16), ('_UP', 8.4), ('_HE', 8.6), 
('RE', 8.72), ('_AND', 8.92), ('_THERE', 9.120000000000001), ('_THE', 9.52), 
('_S', 9.68), ('QUA', 9.72), ('LI', 9.92), ('D', 10.0), ('_', 10.200000000000001), ('QUA', 10.24), 
('R', 10.44), ('TER', 10.48), ('_OF', 10.64), ('_THE', 10.76), ('_B', 10.92), ('RA', 10.96), ('FF', 11.16), 
('LE', 11.28), ('S', 11.44)]
2022-07-07 16:18:31,113 INFO [streaming_server.py:398] Disconnected: ('127.0.0.1', 57802). Number of connections: 0/2
2022-07-07 16:18:31,114 INFO [server.py:260] connection closed

@csukuangfj (Collaborator, Author)

The following table summarizes the results so far for non-streaming and streaming decoding.

There are about 0.3 to 0.4 seconds delay for the streaming model.

| word | CorentinJ/librispeech-alignments | this PR (non-streaming greedy search) | streaming greedy search |
|---|---|---|---|
| AFTER | 0.36-0.73 | 0-0.28 | 0.48-1.04 |
| EARLY | 0.73-1.04 | 0.28-0.88 | 1.04-1.44 |
| NIGHTFALL | 1.04-1.77 | 0.88-1.60 | 1.44-2.08 |
| THE | 1.77-1.90 | 1.60-1.76 | 2.08-2.20 |
| YELLOW | 1.90-2.16 | 1.76-2.16 | 2.20-2.52 |
| LAMPS | 2.16-2.59 | 2.16-2.60 | 2.52-2.80 |
| WOULD | 2.59-2.76 | 2.60-2.80 | 2.96-3.16 |
| LIGHT | 2.76-3.07 | 2.80-3.08 | 3.16-3.40 |
| UP | 3.07-3.27 | 3.08-3.28 | 3.40-3.64 |
| HERE | 3.27-3.52 | 3.28-3.60 | 3.64-3.92 |
| AND | 3.52-3.66 | 3.60-3.80 | 3.92-4.12 |
| THERE | 3.66-4.09 | 3.80-4.08 | 4.12-4.52 |
| THE | 4.09-4.21 | 4.08-4.28 | 4.52-4.68 |
| SQUALID | 4.21-4.78 | 4.28-4.84 | 4.68-5.20 |
| QUARTER | 4.78-5.31 | 4.84-5.40 | 5.20-5.64 |
| OF | 5.31-5.42 | 5.40-5.52 | 5.64-5.76 |
| THE | 5.42-5.50 | 5.52-5.68 | 5.76-5.92 |
| BROTHELS | 5.50-6.16 | 5.68-6.36 | 5.92-6.48 |
| silence | 6.16-6.625 | N/A | N/A |
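The 0.3-0.4 s estimate can be checked with a quick sketch over the start times in the table (the values below are copied from the reference column and the streaming column above):

```python
# Sketch: mean start-time delay of the streaming model vs. the
# CorentinJ/librispeech-alignments reference, using the table above.
ref_starts = [0.36, 0.73, 1.04, 1.77, 1.90, 2.16, 2.59, 2.76, 3.07,
              3.27, 3.52, 3.66, 4.09, 4.21, 4.78, 5.31, 5.42, 5.50]
streaming_starts = [0.48, 1.04, 1.44, 2.08, 2.20, 2.52, 2.96, 3.16, 3.40,
                    3.64, 3.92, 4.12, 4.52, 4.68, 5.20, 5.64, 5.76, 5.92]
delays = [s - r for r, s in zip(ref_starts, streaming_starts)]
print(f"mean delay: {sum(delays) / len(delays):.2f} s")  # mean delay: 0.36 s
```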

@csukuangfj changed the title from "WIP: Add timestamp to non-streaming ASR." to "WIP: Add timestamp" on Jul 7, 2022
@ngoel17 (Contributor) commented Jul 7, 2022

If we assume that the chunk size is 40 ms and we have 4 chunks, then the left context is 160 ms; 320 ms would be double that. Any intuition for why we get an offset in this range?

@pkufool (Collaborator) commented Jul 8, 2022

> There are about 0.3 to 0.4 seconds delay for the streaming model.

Maybe we can try the models trained with a delay penalty and see if it helps.

@pkufool (Collaborator) commented Jul 11, 2022

The following are the timestamps of a model trained with a delay penalty (see k2-fsa/k2#976). The encoder is the reworked conformer.

You can see from the table that for the model trained without a delay penalty, the delay is about 0.3 to 0.4 seconds, while for the model trained with penalty=0.0015 the delay decreases to 0.1 to 0.2 seconds. So the delay penalty does help the model emit symbols earlier.

Apart from the delay penalty, the decoding chunk size and left context also affect the symbol delay. The 4th, 5th, and 6th columns were decoded with chunk-size=8 and left-context=32, while the 7th column was decoded with chunk-size=16 and left-context=64. Comparing the 6th and 7th columns, when the decoding chunk is larger, the symbol delay is slightly larger.

No penalty

2022-07-11 16:07:55,675 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 34732). Number of connections: 1/500
[('▁AFTER', 0.32), ('▁E', 1.04), ('AR', 1.12), ('LY', 1.24), ('▁NIGHT', 1.6), ('F', 1.72), ('A', 1.8), ('LL', 1.92), ('▁THE', 2.04), ('▁YE', 2.2), ('LL', 2.2800000000000002), ('OW', 2.36), ('▁LA', 2.52), ('M', 2.64), ('P', 2.68), ('S', 2.7600000000000002), ('▁WOULD', 2.92), ('▁LIGHT', 3.16), ('▁UP', 3.36), ('▁HE', 3.6), ('RE', 3.72), ('▁AND', 3.88), ('▁THERE', 4.12), ('▁THE', 4.48), ('▁S', 4.68), ('QUA', 4.72), ('LI', 4.84), ('D', 5.0), ('▁', 5.16), ('QUA', 5.2), ('R', 5.32), ('TER', 5.4), ('▁OF', 5.6000000000000005), ('▁THE', 5.72), ('▁B', 5.88), ('RA', 5.92), ('FF', 6.0), ('EL', 6.24), ('S', 6.48)]
2022-07-11 16:08:02,471 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 34732). Number of connections: 0/500
2022-07-11 16:08:02,472 INFO [server.py:260] connection closed

Penalty=0.001

2022-07-11 16:10:08,641 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 36388). Number of connections: 1/500
[('▁AFTER', 0.32), ('▁E', 0.96), ('AR', 1.0), ('LY', 1.12), ('▁NIGHT', 1.32), ('F', 1.68), ('A', 1.72), ('LL', 1.84), ('▁THE', 1.96), ('▁YE', 2.08), ('LL', 2.16), ('OW', 2.24), ('▁LA', 2.4), ('M', 2.52), ('P', 2.6), ('S', 2.68),
('▁WOULD', 2.84), ('▁LIGHT', 3.0), ('▁UP', 3.24), ('▁HE', 3.44), ('RE', 3.6), ('▁AND', 3.72), ('▁THERE', 3.92), ('▁THE', 4.28), ('▁S', 4.48), ('QUA', 4.5200000000000005), ('LI', 4.72), ('D', 4.84), ('▁', 5.0), ('QUA', 5.04), ('R', 5.2), ('TER', 5.24), ('▁OF', 5.44), ('▁THE', 5.5600000000000005), ('▁B', 5.76), ('RO', 5.8), ('TH', 5.96), ('EL', 6.16), ('S', 6.32)]
2022-07-11 16:10:15,437 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 36388). Number of connections: 0/500
2022-07-11 16:10:15,438 INFO [server.py:260] connection closed

Penalty=0.0015

2022-07-11 15:54:46,658 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 52946). Number of connections: 1/500
[('▁AFTER', 0.32), ('▁E', 0.8), ('AR', 0.84), ('LY', 1.08), ('▁NIGHT', 1.28), ('F', 1.6400000000000001), ('A', 1.68), ('LL', 1.76), ('▁THE', 1.92), ('▁YE', 2.04), ('LL', 2.08), ('OW', 2.24), ('▁LA', 2.36), ('M', 2.48), ('P', 2.52), ('S', 2.64), ('▁WOULD', 2.7600000000000002), ('▁LIGHT', 3.0), ('▁UP', 3.2), ('▁HE', 3.4), ('RE', 3.56), ('▁AND', 3.68), ('▁THERE', 3.88), ('▁THE', 4.28), ('▁S', 4.48), ('QUA', 4.5200000000000005), ('LI', 4.68), ('D', 4.72), ('▁', 4.92), ('QUA', 4.96), ('R', 5.16), ('TER', 5.2), ('▁OF', 5.4), ('▁THE', 5.5200000000000005), ('▁B', 5.68), ('RO', 5.72), ('TH', 5.92), ('EL', 6.04), ('S', 6.28)]
2022-07-11 15:54:53,462 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 52946). Number of connections: 0/500
2022-07-11 15:54:53,463 INFO [server.py:260] connection closed

Penalty=0.0015 (left-context=64, chunk-size=16)

2022-07-11 16:17:32,326 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 42458). Number of connections: 1/500
[('▁AFTER', 0.0), ('▁E', 0.8), ('AR', 0.84), ('LY', 1.04), ('▁NIGHT', 1.28), ('F', 1.6), ('A', 1.6400000000000001), ('LL', 1.76), ('▁THE', 1.92), ('▁YE', 2.04), ('LL', 2.08), ('OW', 2.24), ('▁LA', 2.4), ('M', 2.48), ('P', 2.52),
('S', 2.64), ('▁WOULD', 2.8000000000000003), ('▁LIGHT', 3.0), ('▁UP', 3.24), ('▁HE', 3.4), ('RE', 3.56), ('▁AND', 3.68), ('▁THERE', 3.92), ('▁THE', 4.28), ('▁S', 4.48), ('QUA', 4.5200000000000005), ('LI', 4.68), ('D', 4.72), ('▁', 4.96), ('QUA', 5.0), ('R', 5.16), ('TER', 5.2), ('▁OF', 5.4), ('▁THE', 5.5200000000000005), ('▁B', 5.68), ('RO', 5.72), ('TH', 5.96), ('EL', 6.12), ('S', 6.28)]
2022-07-11 16:17:39,075 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 42458). Number of connections: 0/500
2022-07-11 16:17:39,076 INFO [server.py:260] connection closed

| word | CorentinJ/librispeech-alignments | this PR (non-streaming greedy search) | streaming, no penalty | streaming, penalty=0.001 | streaming, penalty=0.0015 | streaming, penalty=0.0015 (left=64, chunk=16) |
|---|---|---|---|---|---|---|
| AFTER | 0.36-0.73 | 0-0.28 | 0.32-1.04 | 0.32-0.96 | 0.32-0.8 | 0.0-0.8 |
| EARLY | 0.73-1.04 | 0.28-0.88 | 1.04-1.6 | 0.96-1.32 | 0.8-1.28 | 0.8-1.28 |
| NIGHTFALL | 1.04-1.77 | 0.88-1.60 | 1.6-2.04 | 1.32-1.96 | 1.28-1.92 | 1.28-1.92 |
| THE | 1.77-1.90 | 1.60-1.76 | 2.04-2.20 | 1.96-2.08 | 1.92-2.04 | 1.92-2.04 |
| YELLOW | 1.90-2.16 | 1.76-2.16 | 2.20-2.52 | 2.08-2.4 | 2.04-2.36 | 2.04-2.4 |
| LAMPS | 2.16-2.59 | 2.16-2.60 | 2.52-2.92 | 2.4-2.84 | 2.36-2.76 | 2.4-2.8 |
| WOULD | 2.59-2.76 | 2.60-2.80 | 2.92-3.16 | 2.84-3.0 | 2.76-3.0 | 2.8-3.0 |
| LIGHT | 2.76-3.07 | 2.80-3.08 | 3.16-3.36 | 3.0-3.24 | 3.0-3.2 | 3.0-3.24 |
| UP | 3.07-3.27 | 3.08-3.28 | 3.36-3.6 | 3.24-3.44 | 3.2-3.4 | 3.24-3.4 |
| HERE | 3.27-3.52 | 3.28-3.60 | 3.6-3.88 | 3.44-3.72 | 3.4-3.68 | 3.4-3.68 |
| AND | 3.52-3.66 | 3.60-3.80 | 3.88-4.12 | 3.72-3.92 | 3.68-3.88 | 3.68-3.92 |
| THERE | 3.66-4.09 | 3.80-4.08 | 4.12-4.48 | 3.92-4.28 | 3.88-4.28 | 3.92-4.28 |
| THE | 4.09-4.21 | 4.08-4.28 | 4.48-4.68 | 4.28-4.48 | 4.28-4.48 | 4.28-4.48 |
| SQUALID | 4.21-4.78 | 4.28-4.84 | 4.68-5.16 | 4.48-5.0 | 4.48-4.92 | 4.48-4.96 |
| QUARTER | 4.78-5.31 | 4.84-5.40 | 5.16-5.6 | 5.0-5.44 | 4.92-5.4 | 4.96-5.4 |
| OF | 5.31-5.42 | 5.40-5.52 | 5.6-5.72 | 5.44-5.56 | 5.4-5.52 | 5.4-5.52 |
| THE | 5.42-5.50 | 5.52-5.68 | 5.72-5.88 | 5.56-5.76 | 5.52-5.68 | 5.52-5.68 |
| BROTHELS | 5.50-6.16 | 5.68-6.36 | 5.88-6.48 | 5.76-6.32 | 5.68-6.28 | 5.68-6.28 |
| silence | 6.16-6.625 | N/A | N/A | N/A | N/A | N/A |

@csukuangfj (Collaborator, Author)

I just found that a model trained using modified_transducer (from optimized_transducer)
won't predict the first token at frame 0.
See k2-fsa/icefall#239

@danpovey (Collaborator)

> I just found that a model trained using modified_transducer (from optimized_transducer) won't predict the first token at frame 0. See k2-fsa/icefall#239

I don't see a recent comment there. I don't understand what you are saying, is this a Sherpa-specific issue? Are you saying that empirically, the 1st token never seems to appear on frame 0?

@csukuangfj (Collaborator, Author)

> > I just found that a model trained using modified_transducer (from optimized_transducer) won't predict the first token at frame 0. See k2-fsa/icefall#239
>
> I don't see a recent comment there. I don't understand what you are saying, is this a Sherpa-specific issue? Are you saying that empirically, the 1st token never seems to appear on frame 0?

k2-fsa/icefall#239 uses a model trained with the standard RNN-T loss plus modified_transducer with prob 0.25 to do forced alignment. You can see that the first token does not start at the first frame.
The model is trained using k2-fsa/icefall#213. Also, it is a non-streaming model.


This PR uses a reworked conformer model trained with pruned RNN-T. It is also non-streaming.

#52 (comment) shows that for the given audio, the model from this PR always predicts the first token on the first frame, no matter whether you prepend it with 5 seconds of noise or not.

This PR and k2-fsa/icefall#239 are testing the same audio.


One difference between this PR and k2-fsa/icefall#239 is that this PR predicts the timestamps of the tokens, while k2-fsa/icefall#239 performs forced alignment.

@csukuangfj (Collaborator, Author)

Already supported.
