
Inconsistent number of segments error #64

Closed
olevanss opened this issue Mar 15, 2023 · 27 comments

Comments

@olevanss

Hi!

Recently I launched a transcription and received this error:

File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 259, in transcribe_timestamped
(transcription, words) = _transcribe_timestamped_efficient(model, audio,
File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 851, in _transcribe_timestamped_efficient
assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"
AssertionError: Inconsistent number of segments: whisper_segments (57) != timestamped_word_segments (56)

Do you know the reason behind it?

If you need any more details, please let me know.

@Jeronymous
Member

Jeronymous commented Mar 15, 2023

This is a duplicate of #59

I fixed this issue recently, and the fix landed in master a few minutes ago.
Can you please update and retry?
pip install --upgrade --no-deps --force-reinstall git+https://github.com/linto-ai/whisper-timestamped

@Jeronymous
Member

I'm closing this, assuming it is fixed.
If it still fails for you, you can reopen and give the output of whisper_timestamped --versions
(this gives whisper_timestamped.__version__ as well as whisper.__version__)
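
For reference, a minimal sketch of checking the same versions from Python (assuming both packages are installed):

import whisper
import whisper_timestamped

# These are the same versions that `whisper_timestamped --versions` reports
print("whisper_timestamped:", whisper_timestamped.__version__)
print("whisper:", whisper.__version__)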

@darnn

darnn commented Mar 15, 2023

Still happening for me with both Whisper and Whisper Timestamped updated:
1.12.3 -- Whisper 20230314 in C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper

@Jeronymous Jeronymous reopened this Mar 15, 2023
@Jeronymous
Member

Thanks!
Oh! Whisper released a new version yesterday... That may explain it.

If it's a blocker for you, you can try pip install whisper==20230308 (or even version 20230124, which is not bad) and tell us if that resolves it.

@Jeronymous
Member

Jeronymous commented Mar 15, 2023

However, I'm not seeing anything in particular in the last release that would explain the failure...
Is there any chance you can share the audio and the details of all the options you use, so we can reproduce? (at least all the options)

@darnn

darnn commented Mar 15, 2023

Sure. Audio:
https://drive.google.com/file/d/1Gws313lBSie3HswzkhiOKMOf0HS6yMH8

Command and error:
C:\downloaded>whisper_timestamped efrat.wav --model tiny --output_dir c:\victor
Detected language: Hebrew
100%|██████████████████████████████████████████████| 109316/109316 [03:30<00:00, 519.28frames/s]
WARNING:whisper_timestamped:Inconsistent number of segments: whisper_segments (621) != timestamped_word_segments (620)
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\User\AppData\Local\Programs\Python\Python39\Scripts\whisper_timestamped.exe_main
.py", line 7, in
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper_timestamped\transcribe.py", line 2127, in cli
result = transcribe_timestamped(
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper_timestamped\transcribe.py", line 259, in transcribe_timestamped
(transcription, words) = _transcribe_timestamped_efficient(model, audio,
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper_timestamped\transcribe.py", line 851, in _transcribe_timestamped_efficient
assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"
AssertionError: Inconsistent number of segments: whisper_segments (621) != timestamped_word_segments (620)

@Jeronymous
Member

Thanks a lot @darnn! Now I can reproduce :)

I will work on this soon.

@Jeronymous
Member

This should finally be fixed in version 1.12.5.

(Sorry about the inconvenience in previous versions; it took me some time to find the right solution to a corner case, but now I think I got it right.)

Thanks again for reporting this issue so well, @darnn

@Jeronymous Jeronymous reopened this Mar 16, 2023
@Jeronymous
Member

Still a work in progress, actually. I encountered another corner case that fails.

@darnn

darnn commented Mar 18, 2023

ty!

@jeremymatt

Still a work in progress, actually. I encountered another corner case that fails.

Is this error still considered a work-in-progress? If it is, my thanks for your work and please disregard the info below (unless it's useful to you).

If not, I'm still encountering it using the medium model (I'm currently trying the other model sizes to see if they fail):

  File ~\Anaconda3\envs\stt\lib\site-packages\whisper_timestamped\transcribe.py:860 in _transcribe_timestamped_efficient
    assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"

AssertionError: Inconsistent number of segments: whisper_segments (1118) != timestamped_word_segments (1117)

Another bit of information: the raw version of this audio stream does not crash the transcription script. However, the file is noisy and the transcription quality isn't great (lots of repeated text), so I ran the logmmse version of the Kalman filter on it. This substantially improved the audio quality, but transcribing now fails.
The logmmse settings for this particular run are (I'm also trying a few different noise thresholds to see which works best for my dataset - I'm not sure if other noise thresholds cause failure or not):

import librosa
import logmmse
import soundfile

sr = 16_000
raw_audio, sr = librosa.load(audio_path, sr=sr)
filtered_audio = logmmse.logmmse(raw_audio, sr, initial_noise=6, window_size=0, noise_threshold=0.01)
# Save the filtered file for subsequent use (e.g., loading into Whisper for transcription - I use librosa for that as well)
soundfile.write(filtered_audio_output_path, filtered_audio, sr)

Versions:

# Name                    Version                   Build  Channel
openai-whisper            20230314                 pypi_0    pypi
whisper                   1.1.10                   pypi_0    pypi
whisper-timestamped       1.12.8                   pypi_0    pypi

@Jeronymous
Member

Oh dear, I was not aware this could fail again.

This kind of error really depends on what is transcribed by the inner Whisper model, with a "butterfly effect" that makes the issue hard to reproduce.
Is there any chance you can share the "filtered file" along with all the options you give to whisper_timestamped.transcribe?

@Jeronymous Jeronymous reopened this Mar 24, 2023
@stungkuling

stungkuling commented Mar 25, 2023

Hello, this did the trick for me.

Just add the options beam_size=5, best_of=5 in the call to the module's transcribe method:

results = whisper_timestamped.transcribe(model, audio, verbose=True, beam_size=5, best_of=5)

I hope this helps.

@jeremymatt

jeremymatt commented Mar 30, 2023

Oh dear, I was not aware this could fail again.

This kind of error really depends on what is transcribed by the inner Whisper model, with a "butterfly effect" that makes the issue hard to reproduce. Is there any chance you can share the "filtered file" along with all the options you give to whisper_timestamped.transcribe?

Sorry for the delay; I've been busy with other stuff. Unfortunately I can't share the file (it's a HIPAA-protected recording of a healthcare conversation).

I've updated to version 1.12.8 and am still encountering this error (although with a different file now - the other one started working when I switched condition_on_previous_text from True to False, which also helped with hallucination problems).

The call to whisper is as follows:

import whisper_timestamped as whisper
options = {"task":"transcribe",
                               "language":"English",
                               "fp16":fp16,
                               'no_speech_threshold':0.1,
                               "condition_on_previous_text": False,
                               "logprob_threshold": -1.00}
result = whisper.transcribe(model, audio=audio, verbose=False, **options)

@Jeronymous
Member

Thank you @jeremymatt for your feedback.
Unfortunately, I don't have enough elements to reproduce.
But I modified something in the latest version (1.12.10) that might resolve this bug.
Can you please retry where it was failing?

If it still fails, could you please use the --debug option and send me the stderr (it can be by email: my email is in the commit logs of this repo).
The --debug option is for the CLI, but if you are in Python you can activate the debug logs using:

import logging
logging.basicConfig()
logger = logging.getLogger("whisper_timestamped")
logger.setLevel(logging.DEBUG)

Finally, if it's really a blocker for you, a workaround is to disable efficient decoding, as spotted by @stungkuling. This can be done in Python by using one of these options with whisper-timestamped's transcribe() function:

  • naive_approach = True
  • beam_size = 5
  • temperature = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), best_of = 5

The only cost is that decoding time will be higher. But transcription results can also be better (especially with beam_size = 5, temperature = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), best_of = 5, which is the default in OpenAI's whisper lib); see the sketch below.
Independently of that workaround, I'm still interested in solving this bug :) meaning interested in reproducing it (so any help to do so is welcome).
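
A minimal sketch of applying these workarounds (the model size and audio path here are placeholders):

import whisper_timestamped as whisper

model = whisper.load_model("tiny")
audio = whisper.load_audio("audio.wav")  # placeholder path

# Workaround 1: bypass the efficient decoding path entirely
result = whisper.transcribe(model, audio, naive_approach=True)

# Workaround 2: beam search with temperature fallback
# (the default decoding in OpenAI's whisper lib)
result = whisper.transcribe(model, audio,
                            beam_size=5,
                            temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                            best_of=5)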

@Jeronymous
Member

I finally identified something that could cause this error.

Fingers crossed very hard that this bug is finally solved in the new version 1.12.11.

@jeremymatt

Thanks for your hard work on this! It's a super useful tool. It's helping me out a ton, and I'll be using it for at least one paper.

I'll re-try the problematic file in a bit and will let you know how it goes.

Another solution (sort of) is to transcribe in parts and then just join the transcripts. This is similar to how I'm dealing with the hallucinations. Hallucinations are easy to detect since they consist of repeated phrases - at least for my transcripts, when there isn't phrase repetition, the transcription quality is acceptable. There's some funkiness, such as when a word shows up more than once within a phrase: for example, "I think that I should I think that I should I think that I should" is a period-5 repetition, but "I" has a 3/2/3 pattern. Anyway, I just find the repetition, clip it out of the transcript, and re-transcribe only that section of audio (a rough sketch is below).
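
A rough sketch of that repetition detection (my own illustration, not code from this repo): treat the transcript as a list of words and scan for an n-gram that repeats back to back.

def find_repetitions(words, n=5, min_repeats=3):
    """Return (start, end) word-index spans where the same n-gram
    repeats at least min_repeats times in a row."""
    spans = []
    i = 0
    while i + n * min_repeats <= len(words):
        ngram = words[i:i + n]
        count = 1
        while words[i + count * n:i + (count + 1) * n] == ngram:
            count += 1
        if count >= min_repeats:
            spans.append((i, i + count * n))
            i += count * n
        else:
            i += 1
    return spans

# Example: the period-5 repetition mentioned above
words = "I think that I should I think that I should I think that I should".split()
print(find_repetitions(words))  # [(0, 15)]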

@eloukas

eloukas commented Oct 2, 2023

Any updates on this?

@Jeronymous
Member

We had no feedback on whether it was fixed for @jeremymatt.
And as nobody reported this error anymore, we assumed it was fixed after April 3rd (version 1.12.11 and higher).

Do you have such an exception, @eloukas? If yes, can you give more details and maybe a way to reproduce?
If there is, we can re-open this issue or open another.

@iampickle

iampickle commented Jan 31, 2024

versions:
- python

whisper-timestamped==1.14.4
torch==1.13.0

- system

nvidia-cuda-toolkit==11.8

got the same error:
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3901509/3901509 [01:46<00:00, 36483.15frames/s]
Inconsistent number of segments: whisper_segments (1388) != timestamped_word_segments (1109)
Traceback (most recent call last):
File "/home/tbot/twitchbot/test.py", line 7, in

File "/home/tbot/miniconda3/envs/tbot/lib/python3.11/site-packages/whisper_timestamped/transcribe.py", line 285, in transcribe_timestamped
(transcription, words) = _transcribe_timestamped_efficient(model, audio,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tbot/miniconda3/envs/tbot/lib/python3.11/site-packages/whisper_timestamped/transcribe.py", line 903, in _transcribe_timestamped_efficient
assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"

this is my code:

import whisper_timestamped as whisper

audio = whisper.load_audio("/media/raid/twitch/papaplatte/papaplatte-stream-2024-01-30/temp_1.5_15.22.mp4")

model = whisper.load_model("tiny", device="cuda")

result = whisper.transcribe(model, audio, language='de')

import json
print(json.dumps(result, indent = 2, ensure_ascii = False))

When the AssertionError was commented out, the code was able to print results as JSON. But I'm not sure whether they're reliable.

blob of the data: https://pastes.io/embed/bsmewxtuyd

@Jeronymous
Member

Thanks @iampickle

I'm reopening this issue, which is also being discussed here: #79

Having your openai-whisper version would also help us understand.
And is it possible to share the audio, so we can reproduce?
(otherwise, see the last comment in the discussion linked above: there are ways to get more debug output)

And I think this bug is problematic for the result (which is probably wrong).
A possible workaround is to use naive_approach=True, as commented here: #64 (comment)
(things will just be slower with this workaround)

@Jeronymous Jeronymous reopened this Jan 31, 2024
@iampickle

iampickle commented Feb 1, 2024

Sure.
The download for the mp4 is 22 GB!?
At first I used the newest version and then tried the version (whisper==20230308) mentioned above. Both gave the same result.
Debug output: clio.txt

@lumpidu

lumpidu commented Feb 24, 2024

So I tested this module to see if I get anywhere with my finetuned whisper-v2 model. Unfortunately, the timestamps are often bad, especially if I am using beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), with or without VAD.

  1. The segments don't seem to overlap enough, i.e. the words at those boundaries are not correct. I don't know if you can work around this with VAD or not, but IMHO one should either overlap the segments by at least 5 seconds and synchronize the generated text, OR use VAD to find a speech pause before the 30-second boundary, so that Whisper doesn't need to transcribe an utterance that has been cut off in the middle.
  2. One should add a consistency check for the generated segments. Problematic segments often don't add up with the previously generated segments, i.e. the end time of the previous segment doesn't match the start time of the next segment.
  3. I often see that segments with problems also have a very low confidence level for most words, even though the words themselves are transcribed correctly.

As you often ask about concrete audio files: these are audios generated via Microsoft TTS with an Icelandic voice. The text itself is nothing specific. My guess is that you can use the same approach (a TTS system) to generate enough test data yourself.

The problem does not lie in the TTS audio files: these are very clear, with consistent timing and pauses, no background noise at all, etc.

@Jeronymous
Member

@lumpidu
If I understand correctly, your problem is not related to the current issue (which is a failure that can happen in some corner cases and that I could not reproduce yet), but to the quality of the timestamps with a finetuned model?
(maybe due to alignment heads that have to be re-estimated for this model).

Concerning 1: do you mean you need overlapping segments/words?
Concerning 2: there are already many consistency checks in the code. Are you suggesting here that segments should be contiguous (start where the previous one ended) when VAD does not detect silence?

Anyway, this description is not clear enough to me to understand the suggestion.

  1. If you don't see an assertion failure ("Inconsistent number of segments"), please open a separate issue.
  2. A concrete example would help clarify (e.g.: here is the audio, here is the output of whisper-timestamped, and I'm not satisfied with what happens in this segment... and this one...).
    If you're using examples from a TTS system, I guess there is no problem sharing them.
    We have a lot of test data already (coming from real use cases); we are not going to run Microsoft TTS (particularly because it's not free and not open-source).

@lumpidu

lumpidu commented Feb 25, 2024

Yes, maybe it's a different bug, but maybe it's also related - you need to decide. I see, e.g., the following problems when looking at the segments:

"segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 30.0,
      "text": "afbrot og refsjábyrgð eitt efnisyfirlit ...",
      "tokens": [...],
      "temperature": 0.0,
      "avg_logprob": -0.024239512714179786,
      "compression_ratio": 1.8644067796610169,
      "no_speech_prob": 8.609986252849922e-05,
      "confidence": 0.988,
      "words": [ ... ]
     ....
    },
    {
      "id": 1,
      "seek": 3000,
      "start": 30.0,
      "end": 31.58,
      "text": "ilög á grundvelli þjóðréttarsamninga tuttugu og tvö þrjú íslensk refsilög og áhrif mannréttindareglna...",
      "tokens": [ ... ],
      "confidence": 0.031,
     ...
    },
    {
      "id": 2,
      "seek": 6000,
      "start": 59.74,
      "end": 60.8,
      "text": "fsiréttar í fræðikerfi lögfræðinnar tuttugu og sjö fjögur grundvallarhugtökin afbrot og refsing tuttugu og sjö...",
      "tokens": [ ... ],
      "confidence": 0.011,
     ...
    },
...
]

Take a look at the start and end segment data:

  • these don't line up for segments 1 and 2, i.e. the start of id 2 doesn't begin at the end of segment 1
  • segment id 1's start-to-end span is 1.58 seconds, but the actual segment length is 29.74 seconds, as can be seen from segment 2's start time of 59.74
  • segment id 2's start-to-end span is 1.06 seconds, but the actual segment length is 29.76 seconds, as segment 3 starts at 89.5
  • the first transcribed words of segment 1 (ilög) and segment 2 (fsiréttar) aren't correct, because these segments start in the middle of a spoken word
  • the confidence of all segments with wrong start/end timings is very low; in the above case it's << 0.1, while for non-problematic segments it's often close to 1.0, e.g. 0.986

There is no warning on stderr/stdout about non-aligning segments or low confidence values in the transcripts. There is also no way any ASR system can generate correct first or last words if segments start or end in the middle of a spoken word. Therefore my suggestion is to use a less naive approach, either via VAD or via overlapping segments; a rough sketch of the consistency check from point 2 is below. It's not clear to me which of these approaches has already been implemented by whisper_timestamped.
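
As an illustration of the consistency check suggested above (a sketch assuming the JSON structure shown earlier, not code from whisper_timestamped):

def check_segments(segments, max_gap=1.0, min_confidence=0.1):
    """Warn about non-contiguous consecutive segments and low-confidence segments."""
    for prev, cur in zip(segments, segments[1:]):
        gap = cur["start"] - prev["end"]
        if gap > max_gap:
            print(f"Warning: {gap:.2f} s gap between segment {prev['id']} and segment {cur['id']}")
    for seg in segments:
        if seg.get("confidence", 1.0) < min_confidence:
            print(f"Warning: segment {seg['id']} has very low confidence ({seg['confidence']:.3f})")

# Usage, e.g. on the output of whisper_timestamped:
# result = whisper.transcribe(model, audio)
# check_segments(result["segments"])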

@Jeronymous
Member

OK @lumpidu, so it's another issue, about the quality of the timestamps.
It is normal for consecutive segments not to be contiguous: there can be silence in between.
And the low quality of the alignment may be due to the fact that you are using a finetuned model without having adapted the alignment heads.
If you want this to be investigated, please open a new issue, providing the audio and the exact thing that you run, for reproduction.

@Jeronymous
Member

@iampickle The failure should not happen anymore (in the new version 1.15.0 of whisper-timestamped).

Thank you for having provided everything needed to reproduce and investigate this properly.
And sorry it took me some time to investigate; handling the 10-hour audio was tricky.

Note that the transcription results are rather poor on your audio with music (it transcribes only "Musik").
This is partly because you are using a "tiny" model (and moreover you are transcribing with the default greedy decoding).
And of course, transcribing music is challenging for the model.
But at least it allowed us to spot a possible corner case of failure.
