
WIP: Add timestamp #52

Closed

Conversation

@csukuangfj (Collaborator) commented Jul 6, 2022

Here are some initial results.

For the following test wav (Note: Its name should be 1089-134686-0002.wav, not 1089-134686-0001.wav)
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/blob/main/test_wavs/1089-134686-0001.wav

The ground truth word timestamp from https://github.com/CorentinJ/librispeech-alignments
is

",AFTER,EARLY,NIGHTFALL,THE,YELLOW,LAMPS,WOULD,LIGHT,UP,HERE,AND,THERE,THE,SQUALID,QUARTER,OF,THE,BROTHELS,"                          
"0.360,0.730,1.040,1.770,1.900,2.160,2.590,2.760,3.070,3.270,3.520,3.660,4.090,4.210,4.780,5.310,5.420,5.500,6.160,6.625"

The predicted results of this PR using the pre-trained model from
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/tree/main/exp
are

2022-07-06 21:47:53,604 INFO [offline_server.py:667] Connected: ('127.0.0.1', 48118). Number of connections: 1/10
[('_AFTER', 0.0), ('_E', 0.28), ('AR', 0.44), ('LY', 0.56), ('_NIGHT', 0.88), ('F', 1.24), ('A', 1.36), ('LL', 1.44), ('_THE', 1.6), ('_YE', 1.76), ('LL', 1.8800000000000001), ('OW', 1.96), ('_LA', 2.16), ('M', 2.32), ('P', 2.44), ('S', 2.48), ('_WOULD', 2.6), ('_LIGHT', 2.8000000000000003), ('_UP', 3.08), ('_HE', 3.2800000000000002), ('RE', 3.4), ('_AND', 3.6), ('_THERE', 3.8000000000000003), ('_THE', 4.08), ('_S', 4.28), ('QUA', 4.32), ('LI', 4.48), ('D', 4.64), ('_', 4.84), ('QUA', 4.88), ('R', 5.04), ('TER', 5.12), ('_OF', 5.4), ('_THE', 5.5200000000000005), ('_B', 5.68), ('RO', 5.72), ('TH', 5.88), ('EL', 6.16), ('S', 6.36)]
[0.0, 0.28, 0.44, 0.56, 0.88, 1.24, 1.36, 1.44, 1.6, 1.76, 1.8800000000000001, 1.96, 2.16, 2.32, 2.44, 2.48, 2.6, 2.8000000000000003,
3.08, 3.2800000000000002, 3.4, 3.6, 3.8000000000000003, 4.08, 4.28, 4.32, 4.48, 4.64, 4.84, 4.88, 5.04, 5.12, 5.4, 5.5200000000000005, 5.68, 5.72, 5.88, 6.16, 6.36]
2022-07-06 21:47:54,929 INFO [offline_server.py:651] Disconnected: ('127.0.0.1', 48118). Number of connections: 0/10

For ease of reference, the above results are listed in the following table:

| word | CorentinJ/librispeech-alignments | this PR |
|---|---|---|
| AFTER | 0.36-0.73 | 0-0.28 |
| EARLY | 0.73-1.04 | 0.28-0.88 |
| NIGHTFALL | 1.04-1.77 | 0.88-1.60 |
| THE | 1.77-1.90 | 1.60-1.76 |
| YELLOW | 1.90-2.16 | 1.76-2.16 |
| LAMPS | 2.16-2.59 | 2.16-2.60 |
| WOULD | 2.59-2.76 | 2.60-2.80 |
| LIGHT | 2.76-3.07 | 2.80-3.08 |
| UP | 3.07-3.27 | 3.08-3.28 |
| HERE | 3.27-3.52 | 3.28-3.60 |
| AND | 3.52-3.66 | 3.60-3.80 |
| THERE | 3.66-4.09 | 3.80-4.08 |
| THE | 4.09-4.21 | 4.08-4.28 |
| SQUALID | 4.21-4.78 | 4.28-4.84 |
| QUARTER | 4.78-5.31 | 4.84-5.40 |
| OF | 5.31-5.42 | 5.40-5.52 |
| THE | 5.42-5.50 | 5.52-5.68 |
| BROTHELS | 5.50-6.16 | 5.68-6.36 |
| silence | 6.16-6.625 | N/A |

Since the model subsampling factor is 4 and frame shift is 0.01s, the resolution of the timestamp is 0.04s.
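As a quick sketch of that arithmetic (the helper below is illustrative, not the PR's actual code): a token emitted on output frame `k` gets timestamp `k * 4 * 0.01`. The odd-looking values in the log above (e.g. `1.8800000000000001`) are just floating-point artifacts of this multiplication.

```python
# Sketch only: an illustrative helper, not the actual code in this PR.
def frame_to_time(frame_index: int,
                  subsampling_factor: int = 4,
                  frame_shift: float = 0.01) -> float:
    """Map an output frame index to seconds.

    Each output frame covers `subsampling_factor` input frames, so the
    timestamp resolution is subsampling_factor * frame_shift = 0.04 s.
    """
    return frame_index * subsampling_factor * frame_shift

print(frame_to_time(47))  # 1.88 up to float rounding (cf. 1.8800000000000001 in the log)
```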

You can see that the predicted timestamps of the words in the middle of the utterance are very close to those from https://github.com/CorentinJ/librispeech-alignments

Also note that we are not doing forced alignment here. The words are decoded using greedy search, and it happens that all words are predicted correctly.

@csukuangfj (Collaborator, Author)

From #43, @ngoel17 asked:

> What kind of timestamp format do you recommend?

I am considering using the one from https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h#L168

An example message is given below (in JSON format):

{
  "timestamp": [
     {"token": "_AFTER", "start": 0.0},
     {"token": "_E", "start": 0.28},
     {"token": "AR", "start": 0.44},
     {"token": "LY", "start": 0.56},
     {"token": "_NIGHT", "start": 0.88}
   ],
  "text": "AFTER EARLY NIGHT"
}

The message would be constructed in Python, so it should be fairly straightforward to change its format.

Note that only the start time of a BPE token is given. You have to figure out the start and end time of a word from it.
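As a sketch of that reconstruction (assuming, as in the output above, that a token starting with the word-boundary marker `_` begins a new word; the helper name is my own):

```python
# Sketch: derive word-level start/end times from per-token start times.
# Assumption: tokens beginning with "_" start a new word, as in the output above.
def tokens_to_words(timestamp):
    words = []  # entries are [word, start, end]; end of the last word stays None
    for token, start in timestamp:
        if token.startswith("_"):
            if words:
                words[-1][2] = start  # previous word ends where this word starts
            words.append([token[1:], start, None])
        else:
            words[-1][0] += token  # continuation piece of the current word
    return [tuple(w) for w in words]

tokens = [("_AFTER", 0.0), ("_E", 0.28), ("AR", 0.44), ("LY", 0.56),
          ("_NIGHT", 0.88)]
print(tokens_to_words(tokens))
# [('AFTER', 0.0, 0.28), ('EARLY', 0.28, 0.88), ('NIGHT', 0.88, None)]
```

Since only start times are available, the end of the final word (and any trailing silence) cannot be recovered this way.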

@jtrmal (Collaborator) commented Jul 6, 2022 via email

@ngoel17 (Contributor) commented Jul 6, 2022

To create a CTM, both the start time and the duration are needed. The issue will be with words that have long silence in between, unless the silence token is kept rather than deleted.
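For reference, a CTM line is `<utterance-id> <channel> <start> <duration> <word>`, so given word-level (word, start, end) triples the conversion is mechanical (a sketch; the utterance id and triples below are examples, not output of this PR):

```python
# Sketch: format word-level (word, start, end) triples as CTM lines.
# CTM fields: <utterance-id> <channel> <start-time> <duration> <word>.
def to_ctm(utt_id, words, channel=1):
    lines = []
    for word, start, end in words:
        lines.append(f"{utt_id} {channel} {start:.2f} {end - start:.2f} {word}")
    return "\n".join(lines)

print(to_ctm("1089-134686-0002", [("AFTER", 0.0, 0.28), ("EARLY", 0.28, 0.88)]))
# 1089-134686-0002 1 0.00 0.28 AFTER
# 1089-134686-0002 1 0.28 0.60 EARLY
```

This only works if the end time of each word is known, which is exactly why long inter-word silences are a problem when the start of the next word is taken as the end of the previous one.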

@ahazned (Contributor) commented Jul 7, 2022

One problem with this is that the first word is always labeled with start:0.0, regardless of how much silence/non-speech there is before that word.

@csukuangfj (Collaborator, Author)

> One problem with this is the first word is always labeled with start:0.0 regardless of how much silence/non-speech there is before that word.

No.

It is 0 for this specific test wav, since the first decoded token happens to be on the first frame.

@ahazned (Contributor) commented Jul 7, 2022

> > One problem with this is the first word is always labeled with start:0.0 regardless of how much silence/non-speech there is before that word.
>
> No.
>
> It is 0 for this specific test wave since the first decoded token happens to be on the first frame.

In the test file, the first word "AFTER" starts at approximately 0.402 s, which is well after the first frame, I think.

Can you try the attached file? I added ~5 seconds of silence before the first word and it still gives start:0.0
silence_added_beginning_1089-134686-0001.zip

Here is what I got (on a different model than yours):

Original file (1089-134686-0001.wav): [('▁after', 0.0), ('▁e', 0.56), ('ar', 0.68), ('ly', 0.8), ('▁night', 1.04) .....
Silence added file: [('▁after', 0.0), ('▁e', 5.5200000000000005), ('ar', 5.64), ('ly', 5.76), ('▁night', 6.0) .....

@csukuangfj (Collaborator, Author)

> > > One problem with this is the first word is always labeled with start:0.0 regardless of how much silence/non-speech there is before that word.
> >
> > No.
> > It is 0 for this specific test wave since the first decoded token happens to be on the first frame.
>
> In the test file, the first word "AFTER" starts at approximately 0.402 s, which is well after the first frame, I think.
>
> Can you try the attached file? I added ~5 seconds of silence before the first word and it still gives start:0.0 silence_added_beginning_1089-134686-0001.zip
>
> Here is what I got (on a different model than yours):
>
> Original file (1089-134686-0001.wav): [('▁after', 0.0), ('▁e', 0.56), ('ar', 0.68), ('ly', 0.8), ('▁night', 1.04) ..... Silence added file: [('▁after', 0.0), ('▁e', 5.5200000000000005), ('ar', 5.64), ('ly', 5.76), ('▁night', 6.0) .....

Yes, you are right. I can reproduce the results. Both greedy search and modified_beam_search emit the first token on the first frame.

I think part of the reason is that the model uses global attention: the model can see all the remaining frames even at frame 0.

I will try to use the streaming model to test it.

@csukuangfj (Collaborator, Author) commented Jul 7, 2022

The following are the results for the streaming greedy search using the pre-trained model from
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/tree/main/exp

You can see that the first token is no longer decoded at frame 0.

Moreover, it shows that the model tends to delay the output by about 0.4 s.

Without prepended silence

2022-07-07 16:16:59,019 INFO [server.py:642] connection open
2022-07-07 16:16:59,019 INFO [streaming_server.py:414] Connected: ('127.0.0.1', 56324). Number of connections: 1/2
[('_AFTER', 0.48), ('_E', 1.04), ('AR', 1.12), ('LY', 1.28), ('_NIGHT', 1.44), ('F', 1.8), ('A', 1.84), 
('LL', 1.96), ('_THE', 2.08), ('_YE', 2.2), ('LL', 2.36), ('OW', 2.4), ('_LA', 2.52), ('M', 2.68), 
('P', 2.72), ('S', 2.8000000000000003),  ('_WOULD', 2.96), ('_LIGHT', 3.16), ('_UP', 3.4), ('_HE', 3.64), 
('RE', 3.72), ('_AND', 3.92),  ('_THERE', 4.12), ('_THE', 4.5200000000000005), 
('_S', 4.68), ('QUA', 4.72), ('LI', 4.92), ('D', 5.0), ('_', 5.2), ('QUA', 5.24), ('R', 5.44), ('TER', 5.48), 
('_OF', 5.64), ('_THE', 5.76), ('_B', 5.92), ('RA', 5.96), ('FF', 6.16), ('LE', 6.32), ('S', 6.48)]
2022-07-07 16:17:05,782 INFO [streaming_server.py:398] Disconnected: ('127.0.0.1', 56324). Number of connections: 0/2
2022-07-07 16:17:05,783 INFO [server.py:260] connection closed

With prepended silence

2022-07-07 16:18:19,143 INFO [server.py:642] connection open
2022-07-07 16:18:19,143 INFO [streaming_server.py:414] Connected: ('127.0.0.1', 57802). Number of connections: 1/2
[('_AFTER', 5.44), ('_E', 6.0), ('AR', 6.08), ('LY', 6.28), ('_NIGHT', 6.44), ('F', 6.8), ('A', 6.84), 
('LL', 6.96), ('_THE', 7.08), ('_YE', 7.2), ('LL', 7.32), ('OW', 7.4), ('_LA', 7.5200000000000005), ('M', 7.68), 
('P', 7.72), ('S', 7.8), ('_WOULD', 7.96), ('_LIGHT', 8.16), ('_UP', 8.4), ('_HE', 8.6), 
('RE', 8.72), ('_AND', 8.92), ('_THERE', 9.120000000000001), ('_THE', 9.52), 
('_S', 9.68), ('QUA', 9.72), ('LI', 9.92), ('D', 10.0), ('_', 10.200000000000001), ('QUA', 10.24), 
('R', 10.44), ('TER', 10.48), ('_OF', 10.64), ('_THE', 10.76), ('_B', 10.92), ('RA', 10.96), ('FF', 11.16), 
('LE', 11.28), ('S', 11.44)]
2022-07-07 16:18:31,113 INFO [streaming_server.py:398] Disconnected: ('127.0.0.1', 57802). Number of connections: 0/2
2022-07-07 16:18:31,114 INFO [server.py:260] connection closed

@csukuangfj (Collaborator, Author)

The following table summarizes the results so far for non-streaming and streaming decoding.

There are about 0.3 to 0.4 seconds delay for the streaming model.

| word | CorentinJ/librispeech-alignments | this PR (non-streaming greedy search) | streaming greedy search |
|---|---|---|---|
| AFTER | 0.36-0.73 | 0-0.28 | 0.48-1.04 |
| EARLY | 0.73-1.04 | 0.28-0.88 | 1.04-1.44 |
| NIGHTFALL | 1.04-1.77 | 0.88-1.60 | 1.44-2.08 |
| THE | 1.77-1.90 | 1.60-1.76 | 2.08-2.20 |
| YELLOW | 1.90-2.16 | 1.76-2.16 | 2.20-2.52 |
| LAMPS | 2.16-2.59 | 2.16-2.60 | 2.52-2.80 |
| WOULD | 2.59-2.76 | 2.60-2.80 | 2.96-3.16 |
| LIGHT | 2.76-3.07 | 2.80-3.08 | 3.16-3.40 |
| UP | 3.07-3.27 | 3.08-3.28 | 3.40-3.64 |
| HERE | 3.27-3.52 | 3.28-3.60 | 3.64-3.92 |
| AND | 3.52-3.66 | 3.60-3.80 | 3.92-4.12 |
| THERE | 3.66-4.09 | 3.80-4.08 | 4.12-4.52 |
| THE | 4.09-4.21 | 4.08-4.28 | 4.52-4.68 |
| SQUALID | 4.21-4.78 | 4.28-4.84 | 4.68-5.20 |
| QUARTER | 4.78-5.31 | 4.84-5.40 | 5.20-5.64 |
| OF | 5.31-5.42 | 5.40-5.52 | 5.64-5.76 |
| THE | 5.42-5.50 | 5.52-5.68 | 5.76-5.92 |
| BROTHELS | 5.50-6.16 | 5.68-6.36 | 5.92-6.48 |
| silence | 6.16-6.625 | N/A | N/A |
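The 0.3-0.4 s estimate can be checked with a quick sketch over the start times in the table (the values below are copied from the reference column and the streaming column above):

```python
# Sketch: mean start-time delay of the streaming model vs. the
# CorentinJ/librispeech-alignments reference, using the table above.
ref_starts = [0.36, 0.73, 1.04, 1.77, 1.90, 2.16, 2.59, 2.76, 3.07,
              3.27, 3.52, 3.66, 4.09, 4.21, 4.78, 5.31, 5.42, 5.50]
streaming_starts = [0.48, 1.04, 1.44, 2.08, 2.20, 2.52, 2.96, 3.16, 3.40,
                    3.64, 3.92, 4.12, 4.52, 4.68, 5.20, 5.64, 5.76, 5.92]
delays = [s - r for r, s in zip(ref_starts, streaming_starts)]
print(f"mean delay: {sum(delays) / len(delays):.2f} s")  # mean delay: 0.36 s
```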

@csukuangfj changed the title from "WIP: Add timestamp to non-streaming ASR." to "WIP: Add timestamp" on Jul 7, 2022
@ngoel17 (Contributor) commented Jul 7, 2022

If we assume that the chunk size is 40 ms and we have 4 chunks, then the left context is 160 ms; 320 ms would be double that. Any intuition for why we get an offset in this range?

@pkufool (Collaborator) commented Jul 8, 2022

> There are about 0.3 to 0.4 seconds delay for the streaming model.

Maybe we can try the models trained with a delay penalty and see if it helps.

@pkufool (Collaborator) commented Jul 11, 2022

The following are the timestamps of a model trained with a delay penalty (see k2-fsa/k2#976). The encoder is the reworked conformer.

You can see from the table that for the model trained without a delay penalty, the delay is about 0.3 to 0.4 seconds, while for the model trained with penalty=0.0015 the delay decreases to 0.1 to 0.2 seconds. So the delay penalty does help the model emit symbols earlier.

Apart from the delay penalty, the decoding chunk size and left context also affect the symbol delay. The 4th, 5th, and 6th columns were decoded with chunk-size=8 and left-context=32, while the 7th column was decoded with chunk-size=16 and left-context=64. Comparing the 6th and 7th columns, when the decoding chunk is larger, the symbol delay is slightly larger.

No penalty

2022-07-11 16:07:55,675 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 34732). Number of connections: 1/500
[('▁AFTER', 0.32), ('▁E', 1.04), ('AR', 1.12), ('LY', 1.24), ('▁NIGHT', 1.6), ('F', 1.72), ('A', 1.8), ('LL', 1.92), ('▁THE', 2.04), ('▁YE', 2.2), ('LL', 2.2800000000000002), ('OW', 2.36), ('▁LA', 2.52), ('M', 2.64), ('P', 2.68), ('S', 2.7600000000000002), ('▁WOULD', 2.92), ('▁LIGHT', 3.16), ('▁UP', 3.36), ('▁HE', 3.6), ('RE', 3.72), ('▁AND', 3.88), ('▁THERE', 4.12), ('▁THE', 4.48), ('▁S', 4.68), ('QUA', 4.72), ('LI', 4.84), ('D', 5.0), ('▁', 5.16), ('QUA', 5.2), ('R', 5.32), ('TER', 5.4), ('▁OF', 5.6000000000000005), ('▁THE', 5.72), ('▁B', 5.88), ('RA', 5.92), ('FF', 6.0), ('EL', 6.24), ('S', 6.48)]
2022-07-11 16:08:02,471 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 34732). Number of connections: 0/500
2022-07-11 16:08:02,472 INFO [server.py:260] connection closed

Penalty=0.001

2022-07-11 16:10:08,641 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 36388). Number of connections: 1/500
[('▁AFTER', 0.32), ('▁E', 0.96), ('AR', 1.0), ('LY', 1.12), ('▁NIGHT', 1.32), ('F', 1.68), ('A', 1.72), ('LL', 1.84), ('▁THE', 1.96), ('▁YE', 2.08), ('LL', 2.16), ('OW', 2.24), ('▁LA', 2.4), ('M', 2.52), ('P', 2.6), ('S', 2.68),
('▁WOULD', 2.84), ('▁LIGHT', 3.0), ('▁UP', 3.24), ('▁HE', 3.44), ('RE', 3.6), ('▁AND', 3.72), ('▁THERE', 3.92), ('▁THE', 4.28), ('▁S', 4.48), ('QUA', 4.5200000000000005), ('LI', 4.72), ('D', 4.84), ('▁', 5.0), ('QUA', 5.04), ('R', 5.2), ('TER', 5.24), ('▁OF', 5.44), ('▁THE', 5.5600000000000005), ('▁B', 5.76), ('RO', 5.8), ('TH', 5.96), ('EL', 6.16), ('S', 6.32)]
2022-07-11 16:10:15,437 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 36388). Number of connections: 0/500
2022-07-11 16:10:15,438 INFO [server.py:260] connection closed

Penalty=0.0015

2022-07-11 15:54:46,658 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 52946). Number of connections: 1/500
[('▁AFTER', 0.32), ('▁E', 0.8), ('AR', 0.84), ('LY', 1.08), ('▁NIGHT', 1.28), ('F', 1.6400000000000001), ('A', 1.68), ('LL', 1.76), ('▁THE', 1.92), ('▁YE', 2.04), ('LL', 2.08), ('OW', 2.24), ('▁LA', 2.36), ('M', 2.48), ('P', 2.52), ('S', 2.64), ('▁WOULD', 2.7600000000000002), ('▁LIGHT', 3.0), ('▁UP', 3.2), ('▁HE', 3.4), ('RE', 3.56), ('▁AND', 3.68), ('▁THERE', 3.88), ('▁THE', 4.28), ('▁S', 4.48), ('QUA', 4.5200000000000005), ('LI', 4.68), ('D', 4.72), ('▁', 4.92), ('QUA', 4.96), ('R', 5.16), ('TER', 5.2), ('▁OF', 5.4), ('▁THE', 5.5200000000000005), ('▁B', 5.68), ('RO', 5.72), ('TH', 5.92), ('EL', 6.04), ('S', 6.28)]
2022-07-11 15:54:53,462 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 52946). Number of connections: 0/500
2022-07-11 15:54:53,463 INFO [server.py:260] connection closed

Penalty=0.0015 (left-context=64, chunk-size=16)

2022-07-11 16:17:32,326 INFO [streaming_server.py:472] Connected: ('127.0.0.1', 42458). Number of connections: 1/500
[('▁AFTER', 0.0), ('▁E', 0.8), ('AR', 0.84), ('LY', 1.04), ('▁NIGHT', 1.28), ('F', 1.6), ('A', 1.6400000000000001), ('LL', 1.76), ('▁THE', 1.92), ('▁YE', 2.04), ('LL', 2.08), ('OW', 2.24), ('▁LA', 2.4), ('M', 2.48), ('P', 2.52),
('S', 2.64), ('▁WOULD', 2.8000000000000003), ('▁LIGHT', 3.0), ('▁UP', 3.24), ('▁HE', 3.4), ('RE', 3.56), ('▁AND', 3.68), ('▁THERE', 3.92), ('▁THE', 4.28), ('▁S', 4.48), ('QUA', 4.5200000000000005), ('LI', 4.68), ('D', 4.72), ('▁', 4.96), ('QUA', 5.0), ('R', 5.16), ('TER', 5.2), ('▁OF', 5.4), ('▁THE', 5.5200000000000005), ('▁B', 5.68), ('RO', 5.72), ('TH', 5.96), ('EL', 6.12), ('S', 6.28)]
2022-07-11 16:17:39,075 INFO [streaming_server.py:456] Disconnected: ('127.0.0.1', 42458). Number of connections: 0/500
2022-07-11 16:17:39,076 INFO [server.py:260] connection closed

| word | CorentinJ/librispeech-alignments | this PR (non-streaming greedy search) | streaming, no penalty | streaming, penalty=0.001 | streaming, penalty=0.0015 | streaming, penalty=0.0015 (left=64, chunk=16) |
|---|---|---|---|---|---|---|
| AFTER | 0.36-0.73 | 0-0.28 | 0.32-1.04 | 0.32-0.96 | 0.32-0.8 | 0.0-0.8 |
| EARLY | 0.73-1.04 | 0.28-0.88 | 1.04-1.6 | 0.96-1.32 | 0.8-1.28 | 0.8-1.28 |
| NIGHTFALL | 1.04-1.77 | 0.88-1.60 | 1.6-2.04 | 1.32-1.96 | 1.28-1.92 | 1.28-1.92 |
| THE | 1.77-1.90 | 1.60-1.76 | 2.04-2.20 | 1.96-2.08 | 1.92-2.04 | 1.92-2.04 |
| YELLOW | 1.90-2.16 | 1.76-2.16 | 2.20-2.52 | 2.08-2.4 | 2.04-2.36 | 2.04-2.4 |
| LAMPS | 2.16-2.59 | 2.16-2.60 | 2.52-2.92 | 2.4-2.84 | 2.36-2.76 | 2.4-2.8 |
| WOULD | 2.59-2.76 | 2.60-2.80 | 2.92-3.16 | 2.84-3.0 | 2.76-3.0 | 2.8-3.0 |
| LIGHT | 2.76-3.07 | 2.80-3.08 | 3.16-3.36 | 3.0-3.24 | 3.0-3.2 | 3.0-3.24 |
| UP | 3.07-3.27 | 3.08-3.28 | 3.36-3.6 | 3.24-3.44 | 3.2-3.4 | 3.24-3.4 |
| HERE | 3.27-3.52 | 3.28-3.60 | 3.6-3.88 | 3.44-3.72 | 3.4-3.68 | 3.4-3.68 |
| AND | 3.52-3.66 | 3.60-3.80 | 3.88-4.12 | 3.72-3.92 | 3.68-3.88 | 3.68-3.92 |
| THERE | 3.66-4.09 | 3.80-4.08 | 4.12-4.48 | 3.92-4.28 | 3.88-4.28 | 3.92-4.28 |
| THE | 4.09-4.21 | 4.08-4.28 | 4.48-4.68 | 4.28-4.48 | 4.28-4.48 | 4.28-4.48 |
| SQUALID | 4.21-4.78 | 4.28-4.84 | 4.68-5.16 | 4.48-5.0 | 4.48-4.92 | 4.48-4.96 |
| QUARTER | 4.78-5.31 | 4.84-5.40 | 5.16-5.6 | 5.0-5.44 | 4.92-5.4 | 4.96-5.4 |
| OF | 5.31-5.42 | 5.40-5.52 | 5.6-5.72 | 5.44-5.56 | 5.4-5.52 | 5.4-5.52 |
| THE | 5.42-5.50 | 5.52-5.68 | 5.72-5.88 | 5.56-5.76 | 5.52-5.68 | 5.52-5.68 |
| BROTHELS | 5.50-6.16 | 5.68-6.36 | 5.88-6.48 | 5.76-6.32 | 5.68-6.28 | 5.68-6.28 |
| silence | 6.16-6.625 | N/A | N/A | N/A | N/A | N/A |

@csukuangfj (Collaborator, Author)

I just found that a model trained using modified_transducer (from optimized_transducer)
won't predict the first token at frame 0.
See k2-fsa/icefall#239

@danpovey (Collaborator)

> I just found that a model trained using modified_transducer (from optimized_transducer) won't predict the first token at frame 0. See k2-fsa/icefall#239

I don't see a recent comment there. I don't understand what you are saying, is this a Sherpa-specific issue? Are you saying that empirically, the 1st token never seems to appear on frame 0?

@csukuangfj (Collaborator, Author)

> > I just found that a model trained using modified_transducer (from optimized_transducer) won't predict the first token at frame 0. See k2-fsa/icefall#239
>
> I don't see a recent comment there. I don't understand what you are saying, is this a Sherpa-specific issue? Are you saying that empirically, the 1st token never seems to appear on frame 0?

k2-fsa/icefall#239 uses a model trained with the standard RNN-T loss plus modified_transducer with prob 0.25 to do forced alignment. You can see that the first token does not start at the first frame.
The model is trained using k2-fsa/icefall#213. Also, it is a non-streaming model.


This PR uses a reworked conformer model trained with pruned RNN-T. It is also non-streaming.

#52 (comment) shows that for the given audio, the model from this PR always predicts the first token on the first frame, no matter whether you prepend it with 5 seconds of noise or not.

This PR and k2-fsa/icefall#239 are testing the same audio.


One difference between this PR and k2-fsa/icefall#239 is that this PR predicts the timestamps of the tokens, while k2-fsa/icefall#239 performs forced alignment.

@csukuangfj (Collaborator, Author)

Already supported.
