Replies: 13 comments 14 replies
-
Whisper cannot do this today. You could post-process the text Whisper generates and create paragraphs based on sentence similarity. See for example: https://stackoverflow.com/questions/65199011/is-there-a-way-to-check-similarity-between-two-full-sentences-in-python
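A rough sketch of that post-processing idea using sentence-transformers (the model name, the 0.3 threshold, and the toy sentences are placeholder choices for illustration, not anything Whisper or the linked answer prescribes):

from sentence_transformers import SentenceTransformer, util

# Sentences from the Whisper transcript, e.g. one per segment or from a sentence splitter
sentences = ["First sentence.", "Second sentence.", "A new topic starts here.", "More about that topic."]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between each sentence and the one that follows it
sims = [util.cos_sim(embeddings[i], embeddings[i + 1]).item()
        for i in range(len(sentences) - 1)]

# Start a new paragraph wherever the similarity to the previous sentence drops below a threshold
threshold = 0.3  # tune on your own transcripts
paragraphs, current = [], [sentences[0]]
for sentence, sim in zip(sentences[1:], sims):
    if sim < threshold:
        paragraphs.append(" ".join(current))
        current = []
    current.append(sentence)
paragraphs.append(" ".join(current))

print("\n\n".join(paragraphs))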
-
In my app Whisper Memos, I use GPT-3 with the edit model:

await openai.createEdit({
  model: "text-davinci-edit-001",
  input: content,
  instruction: "split text into short paragraphs",
  temperature: 0.7,
})
-
Pinned to this question. I wonder if there is a way to hook https://github.com/UKPLab/sentence-transformers or something similar up to Whisper's output, and then write out paragraphs.
-
I'm also desperately in need of a paragraph chunker. I found another lead on some code, https://github.com/flippedAben/texttiling, but couldn't get it to work either. If anyone can get it to work, please tell me how!
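For what it's worth, NLTK ships its own implementation of the TextTiling algorithm. A minimal sketch, assuming a transcript file with one Whisper segment per line (the path is a placeholder, the text needs to be reasonably long for the default window sizes, and as far as I can tell the tokenizer snaps its boundaries to blank lines, hence the double newlines):

import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")  # the tokenizer's defaults use the English stopword list

# One Whisper segment per line in transcript.txt (placeholder path)
with open("transcript.txt") as f:
    segments = [line.strip() for line in f if line.strip()]

# Put a blank line between every segment so TextTiling has candidate
# paragraph breaks to choose from, then let the algorithm decide which to keep.
text = "\n\n".join(segments)

tt = TextTilingTokenizer()   # default block-comparison parameters
tiles = tt.tokenize(text)    # each returned tile is one topical "paragraph"

for tile in tiles:
    print(tile.replace("\n\n", " ").strip())
    print()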
-
I simply post-process it: if a line ends with '.', then print a new line.
-
For what it's worth, ChatGPT does it well if you give it this command: "Split this block of text in several paragraphs, and don't change the text at all:" and then the text. :-)
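If you want to script that rather than paste into the ChatGPT UI, here is a minimal sketch with the OpenAI Python client (openai>=1.0); the model choice is arbitrary, the transcript path is a placeholder, and long transcripts will need to be sent in chunks that fit the context window:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def split_into_paragraphs(text: str) -> str:
    # Same instruction as above, sent through the chat completions API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # arbitrary choice for this sketch
        messages=[{
            "role": "user",
            "content": "Split this block of text in several paragraphs, "
                       "and don't change the text at all:\n\n" + text,
        }],
        temperature=0,
    )
    return response.choices[0].message.content

print(split_into_paragraphs(open("transcript.txt").read()))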
-
By the way, this seems to work OK; check the paper. They also offer the code, but I didn't try it:
-
State of the art seems to be "Auxiliary Loss for BERT-Based Paragraph Segmentation" Zhuo et al. 2023 |
-
For now, I'm processing the output with:

awk '{if (substr($0, length($0), 1) != ".") printf "%s ",$0; else {print $0; print ""}} END {print ""}' input_file > output_file

This just adds a line break every time there's a "." at the end of a line. Otherwise, it removes the line break so that the two lines are merged, because I found that Whisper was finishing many lines with a "," or with nothing at all.
-
I've been considering approaches like this: we want to split sentences into paragraphs at the places where they are least tightly semantically coupled, without making paragraphs too short or too long.
-
Did that get cut off?
…On Fri, Mar 31, 2023 at 3:17 AM jonathanjfshaw ***@***.***> wrote:
ChatGPT gave me this which is a start:
import spacy
from transformers import AlbertTokenizer, AlbertForNextSentencePrediction
import torch.nn.functional as F
import numpy as np
from scipy.stats import poisson

# Load the Spacy model for English
nlp = spacy.load('en_core_web_sm')

# Load the Albert tokenizer and pre-trained model for next sentence prediction
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertForNextSentencePrediction.from_pretrained('albert-base-v2')

# Define the input text and split it into individual sentences using Spacy
text = "This is the first sentence. This is the second sentence. This is the third sentence. This is the fourth sentence. This is the fifth sentence. This is the sixth sentence. This is the seventh sentence. This is the eighth sentence. This is the ninth sentence. This is the tenth sentence. This is the eleventh sentence."
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

# Define the minimum and maximum expected length of paragraphs, as well as the maximum expected total length of paragraphs
min_paragraph_len = 5
max_paragraph_len = 10
max_total_len = 30

# Define the beam width for the beam search
beam_width = 3

# Compute the probability of each possible sentence segmentation using dynamic programming
n = len(sentences)
dp = np.zeros((n + 1, n + 1))
cuts = [[0 for j in range(n)] for i in range(n)]
for i in range(n):
    dp[i][i] = 1
    for j in range(i+1, n):
        # Encode the pair of sentences as input to the pre-trained model
        input_ids = tokenizer.encode(sentences[j-1], sentences[j], return_tensors='pt')
        # Compute the probabilities of the next sentence prediction
        logits = model(input_ids)[0]
        probabilities = F.softmax(logits, dim=1).tolist()[0]
        # Update the dynamic programming table and the cut points
        if probabilities[1] > probabilities[0]:
            dp[i][j] = dp[i][j-1] * probabilities[1]
            cuts[i][j-1] = j
        else:
            dp[i][j] = dp[i][j-1] * probabilities[0]

# Perform a beam search to find the best segmentation of the document based on the expected paragraph length
beam = [(0, [], np.zeros(n))]
for i in range(n):
    new_beam = []
    for score, seg, lengths in beam:
        for j in range(i + min_paragraph_len - 1, min(n, i + max_paragraph_len)):
            if dp[i][j] == 0:
                continue
            # Compute the score for the new segment using a Poisson distribution to model the distribution of paragraph lengths
            new_score = score - np.log(poisson.pmf(j-i+1, lengths.sum() + j-i+1))
            # Update the segment and the lengths
            new_seg = seg + [(i, j)]
            new_lengths = np.zeros(n)
            for (p, q) in new_seg:
                new_lengths[p:q+1] += (q-p+1)
            # Compute the score for the new lengths using a Poisson distribution to model the distribution of paragraph lengths
            new_score -= np.log(poisson.pmf(j+1, new_lengths.sum()))
            # Add the new segment to the beam
            new_beam.append((new_score, new_seg, new_lengths))
#
-
Even if we post-process it, we can't do that without breaking the timestamps. Do we have any workaround for that?
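One workaround is to only ever merge whole Whisper segments, so each paragraph keeps the start time of its first segment and the end time of its last. A minimal sketch (the audio path, model size, and the crude "sentence end plus roughly 60 seconds" grouping rule are placeholders; any of the text-based approaches above could supply the break decision instead):

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Group consecutive segments into paragraphs without losing timing information
paragraphs = []
current = {"start": None, "end": None, "text": ""}
for seg in result["segments"]:
    if current["start"] is None:
        current["start"] = seg["start"]
    current["end"] = seg["end"]
    current["text"] += seg["text"]
    long_enough = current["end"] - current["start"] > 60
    if seg["text"].rstrip().endswith(".") and long_enough:
        paragraphs.append(current)
        current = {"start": None, "end": None, "text": ""}
if current["start"] is not None:
    paragraphs.append(current)

for p in paragraphs:
    print(f"[{p['start']:.1f}s - {p['end']:.1f}s] {p['text'].strip()}\n")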
-
I've had a good level of success using the Segment any Text / wtpsplit library to paragraph Whisper output in my project.
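A minimal sketch of that, based on my recollection of the wtpsplit README; the model name, the do_paragraph_segmentation flag, and the transcript path are assumptions to double-check against the current release:

from wtpsplit import SaT

sat = SaT("sat-3l-sm")  # small Segment-any-Text checkpoint; larger ones exist

with open("transcript.txt") as f:  # placeholder: the raw Whisper text
    text = f.read()

# With do_paragraph_segmentation, split() returns a list of paragraphs,
# each of which is a list of sentences
paragraphs = sat.split(text, do_paragraph_segmentation=True)

print("\n\n".join(" ".join(sentences) for sentences in paragraphs))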
-
Great ML model, but for audio 20 minutes long it produces one long block of text with no paragraphs.
Is there any way to make it more human-readable, with paragraphs, breaks, etc.?