Replies: 13 comments 14 replies
-
Whisper cannot do this today. You could post-process the text Whisper generates and create paragraphs based on sentence similarity. See for example: https://stackoverflow.com/questions/65199011/is-there-a-way-to-check-similarity-between-two-full-sentences-in-python
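A rough sketch of that post-processing idea using sentence-transformers (the model name, the 0.3 threshold, and the toy sentences are placeholder choices for illustration, not anything Whisper or the linked answer prescribes):

from sentence_transformers import SentenceTransformer, util

# Sentences from the Whisper transcript, e.g. one per segment or from a sentence splitter
sentences = ["First sentence.", "Second sentence.", "A new topic starts here.", "More about that topic."]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between each sentence and the one that follows it
sims = [util.cos_sim(embeddings[i], embeddings[i + 1]).item()
        for i in range(len(sentences) - 1)]

# Start a new paragraph wherever the similarity to the previous sentence drops below a threshold
threshold = 0.3  # tune on your own transcripts
paragraphs, current = [], [sentences[0]]
for sentence, sim in zip(sentences[1:], sims):
    if sim < threshold:
        paragraphs.append(" ".join(current))
        current = []
    current.append(sentence)
paragraphs.append(" ".join(current))

print("\n\n".join(paragraphs))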
-
In my app Whisper Memos, I use GPT-3 with the edit model:

await openai.createEdit({
  model: "text-davinci-edit-001",
  input: content,
  instruction: "split text into short paragraphs",
  temperature: 0.7,
})
-
Pinned to this question. I wonder if there is a way to hook https://github.com/UKPLab/sentence-transformers or something similar up to Whisper's output, and then write out paragraphs.
-
I'm also desperately in need of a paragraph chunker. I found another lead on some code, https://github.com/flippedAben/texttiling, but couldn't get it to work either. If anyone can get it to work, please tell me how!
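For what it's worth, NLTK ships its own implementation of the TextTiling algorithm. A minimal sketch, assuming a transcript file with one Whisper segment per line (the path is a placeholder, the text needs to be reasonably long for the default window sizes, and as far as I can tell the tokenizer snaps its boundaries to blank lines, hence the double newlines):

import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")  # the tokenizer's defaults use the English stopword list

# One Whisper segment per line in transcript.txt (placeholder path)
with open("transcript.txt") as f:
    segments = [line.strip() for line in f if line.strip()]

# Put a blank line between every segment so TextTiling has candidate
# paragraph breaks to choose from, then let the algorithm decide which to keep.
text = "\n\n".join(segments)

tt = TextTilingTokenizer()   # default block-comparison parameters
tiles = tt.tokenize(text)    # each returned tile is one topical "paragraph"

for tile in tiles:
    print(tile.replace("\n\n", " ").strip())
    print()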
-
I simply post-process it: if a line ends with '.', then print a new line.
-
For what it's worth, ChatGPT does it well if you give it this command: "Split this block of text in several paragraphs, and don't change the text at all:" and then the text. :-)
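If you want to script that rather than paste into the ChatGPT UI, here is a minimal sketch with the OpenAI Python client (openai>=1.0); the model choice is arbitrary, the transcript path is a placeholder, and long transcripts will need to be sent in chunks that fit the context window:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def split_into_paragraphs(text: str) -> str:
    # Same instruction as above, sent through the chat completions API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # arbitrary choice for this sketch
        messages=[{
            "role": "user",
            "content": "Split this block of text in several paragraphs, "
                       "and don't change the text at all:\n\n" + text,
        }],
        temperature=0,
    )
    return response.choices[0].message.content

print(split_into_paragraphs(open("transcript.txt").read()))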
-
By the way, this seems to work OK; check the paper. They also offer the code, but I didn't try it:
-
State of the art seems to be "Auxiliary Loss for BERT-Based Paragraph Segmentation" Zhuo et al. 2023 |
-
For now, I'm processing the output with:

awk '{if (substr($0, length($0), 1) != ".") printf "%s ",$0; else {print $0; print ""}} END {print ""}' input_file > output_file

This just adds a line break every time there's a "." at the end of a line. Otherwise, it removes the line break so that the two lines are merged, because I found that Whisper was finishing many lines with a "," or with nothing at all.
-
I've been considering approaches like this: we want to split sentences into paragraphs at the places where they are least tightly semantically coupled, without making paragraphs too short or too long.
-
Did that get cut off?
…On Fri, Mar 31, 2023 at 3:17 AM jonathanjfshaw ***@***.***> wrote:
ChatGPT gave me this which is a start:
import spacy
from transformers import AlbertTokenizer, AlbertForNextSentencePrediction
import torch.nn.functional as F
import numpy as np
from scipy.stats import poisson

# Load the Spacy model for English
nlp = spacy.load('en_core_web_sm')

# Load the Albert tokenizer and pre-trained model for next sentence prediction
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertForNextSentencePrediction.from_pretrained('albert-base-v2')

# Define the input text and split it into individual sentences using Spacy
text = "This is the first sentence. This is the second sentence. This is the third sentence. This is the fourth sentence. This is the fifth sentence. This is the sixth sentence. This is the seventh sentence. This is the eighth sentence. This is the ninth sentence. This is the tenth sentence. This is the eleventh sentence."
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

# Define the minimum and maximum expected length of paragraphs, as well as the maximum expected total length of paragraphs
min_paragraph_len = 5
max_paragraph_len = 10
max_total_len = 30

# Define the beam width for the beam search
beam_width = 3

# Compute the probability of each possible sentence segmentation using dynamic programming
n = len(sentences)
dp = np.zeros((n + 1, n + 1))
cuts = [[0 for j in range(n)] for i in range(n)]
for i in range(n):
    dp[i][i] = 1
    for j in range(i+1, n):
        # Encode the pair of sentences as input to the pre-trained model
        input_ids = tokenizer.encode(sentences[j-1], sentences[j], return_tensors='pt')
        # Compute the probabilities of the next sentence prediction
        logits = model(input_ids)[0]
        probabilities = F.softmax(logits, dim=1).tolist()[0]
        # Update the dynamic programming table and the cut points
        if probabilities[1] > probabilities[0]:
            dp[i][j] = dp[i][j-1] * probabilities[1]
            cuts[i][j-1] = j
        else:
            dp[i][j] = dp[i][j-1] * probabilities[0]

# Perform a beam search to find the best segmentation of the document based on the expected paragraph length
beam = [(0, [], np.zeros(n))]
for i in range(n):
    new_beam = []
    for score, seg, lengths in beam:
        for j in range(i + min_paragraph_len - 1, min(n, i + max_paragraph_len)):
            if dp[i][j] == 0:
                continue
            # Compute the score for the new segment using a Poisson distribution to model the distribution of paragraph lengths
            new_score = score - np.log(poisson.pmf(j-i+1, lengths.sum() + j-i+1))
            # Update the segment and the lengths
            new_seg = seg + [(i, j)]
            new_lengths = np.zeros(n)
            for (p, q) in new_seg:
                new_lengths[p:q+1] += (q-p+1)
            # Compute the score for the new lengths using a Poisson distribution to model the distribution of paragraph lengths
            new_score -= np.log(poisson.pmf(j+1, new_lengths.sum()))
            # Add the new segment to the beam
            new_beam.append((new_score, new_seg, new_lengths))
#
-
Even if we post-process it, we can't do that without breaking the timestamps. Do we have any workaround for that?
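One workaround is to only ever merge whole Whisper segments, so each paragraph keeps the start time of its first segment and the end time of its last. A minimal sketch (the audio path, model size, and the crude "sentence end plus roughly 60 seconds" grouping rule are placeholders; any of the text-based approaches above could supply the break decision instead):

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Group consecutive segments into paragraphs without losing timing information
paragraphs = []
current = {"start": None, "end": None, "text": ""}
for seg in result["segments"]:
    if current["start"] is None:
        current["start"] = seg["start"]
    current["end"] = seg["end"]
    current["text"] += seg["text"]
    long_enough = current["end"] - current["start"] > 60
    if seg["text"].rstrip().endswith(".") and long_enough:
        paragraphs.append(current)
        current = {"start": None, "end": None, "text": ""}
if current["start"] is not None:
    paragraphs.append(current)

for p in paragraphs:
    print(f"[{p['start']:.1f}s - {p['end']:.1f}s] {p['text'].strip()}\n")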
-
I've had a good level of success using the Segment any Text / wtpsplit library to paragraph Whisper output in my project.
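A minimal sketch of that, based on my recollection of the wtpsplit README; the model name, the do_paragraph_segmentation flag, and the transcript path are assumptions to double-check against the current release:

from wtpsplit import SaT

sat = SaT("sat-3l-sm")  # small Segment-any-Text checkpoint; larger ones exist

with open("transcript.txt") as f:  # placeholder: the raw Whisper text
    text = f.read()

# With do_paragraph_segmentation, split() returns a list of paragraphs,
# each of which is a list of sentences
paragraphs = sat.split(text, do_paragraph_segmentation=True)

print("\n\n".join(" ".join(sentences) for sentences in paragraphs))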
-
Great ML model, but for audio 20 minutes long it produces one long block of text with no paragraphs.
Is there any way to make it more human-readable, with paragraphs, breaks, etc.?