TTS streaming does not work #864
Comments
Thanks for the bug report, we're working on a fix. |
Anything I might be able to do to help? I see that there is no mention of streaming in the TTS REST API endpoint, so I'm assuming it doesn't actually support this feature? |
All good! The issue here is that the http client reads the entire response body before returning. The "stream" terminology here is maybe a little confusing as it's different from the |
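For reference, a minimal httpx sketch of the two behaviours being contrasted here; the url, headers and payload below are placeholders, not the SDK's internals:

import httpx

url = "https://api.openai.com/v1/audio/speech"  # placeholder endpoint
headers = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {"model": "tts-1", "voice": "alloy", "input": "hello"}

# Non-streaming: the call only returns once the whole body has been downloaded
resp = httpx.post(url, headers=headers, json=payload)
audio = resp.content

# Streaming: chunks become available as the server sends them
with httpx.stream("POST", url, headers=headers, json=payload) as resp:
    for chunk in resp.iter_bytes():
        ...  # play or forward each chunk immediately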
Once it is fixed, would this allow us to send streamed response from |
Seems to be fixed in #866. I hacked it in earlier in #724 and results were good when I just passed the stream=True parameter: even for a long text (30s or more of audio), I started hearing it in a browser client in about 1s. @RobertCraigie here seems to say, though, that it only starts when the whole audio is completed on the server? Apparently the generation is quick, then? |
@antont oh shoot! I didn't see your existing issue about this because I searched for "TTS" instead of just "Speech" 🤦🏻. Also, great news about it being possibly fixed! Thanks for verifying :). |
@amarbayar, I'm super interested in this! I don't think it will work because the speech endpoint would need to accept a |
@antont out of curiosity, how did you play the streaming audio in the browser? Do you have example code you could share for the benefit of others? |
Ah, I'm sorry to say the fix in #866 is being reverted; upon further discussion, we've found a better way that should be available in the coming days. Thank you for your patience. |
I'll be honest, I'm quite new to GitHub as I don't really collab with anyone. Would definitely like to know when this works, though. Would love to shave the time off my generated audio. In the meantime, while we are waiting for this bugfix, is there any other way to stream the response from the TTS endpoint? |
Well it's very simple: just return audio/mpeg from the http server, and stream the response. Browsers handle that by showing an audio player, so I didn't need to do anything on the browser side. I used FastAPI, so the http handler func is:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI, AsyncStream
from openai.types.chat import ChatCompletionChunk

client = AsyncOpenAI()  # async client used by the handler below
app = FastAPI()

@app.get("/stream")
async def stream():
    text = "test. "
    text *= 10  # NOTE: I tested with 100 here too to make sure that it streams, i.e. still starts quickly, in about 1s
    # was without stream=True param:
    # speech_stream: HttpxBinaryResponseContent = await text_to_speech_stream_openai(text)
    speech_stream: AsyncStream[ChatCompletionChunk] = await text_to_speech_stream_openai(text)  # is not the actual type
    return StreamingResponse(speech_stream.response.aiter_bytes(), media_type="audio/mpeg")

async def text_to_speech_stream_openai(text: str):
    print('Generating audio from text review using Open AI API')
    # without stream=True this is: response: HttpxBinaryResponseContent
    stream: AsyncStream[ChatCompletionChunk] = await client.audio.speech.create(
        model="tts-1",
        voice="echo",
        input=text,
        stream=True
    )  # type: ignore
    # print(type(response), dir(response), response)
    print(type(stream), dir(stream), stream)
    return stream

This requires the stream param in speech.create, so either my quick hack PR #724 or the later, more proper one from OpenAI, #866. I guess that one works too, even though they reverted it now. I have this live at https://id-longtask-3q4kbi7oda-uc.a.run.app/stream . The first query to the server takes time as it starts up the instance, but it starts playing audio in about 1s in later requests. Don't bomb it too much so I don't need to take it down due to too much API usage.. |
Yes proper support for such pipelining within the OpenAI backend would be great. It might be possible to hack a somewhat working system now
If you'd do the splits when a sentence ends it might be bearable, even though it might have awkward pauses too. I'm not sure how the timings would go actually, might be even quite fine, maybe even no pauses? Perhaps with a slow speaker :D |
I did the split-on-sentences approach with ElevenLabs before OpenAI's TTS endpoint came out. It did work, but it was quite choppy. I'll just wait for the time being and hope it's implemented soon. Regarding your prior post: so did they add the stream parameter to create? Because when I tried to use it in my current version, which was updated maybe a week or two ago, it said there was no stream parameter. |
It's not in any released version. I added it myself and put up a PR, mostly for info and to get some feedback too, and that's what I have been using, only for that test. So if you use that branch, it's there. Now they added it too, so you'd have it in a roughly two-day-old version from openai's repo, but they then reverted the addition as they are reworking it to be somehow better, so it's not in current main. That's what I tried to say in:
Cool to hear that you actually implemented that splitting thing! |
@antont For sentence-based splitting, at least using the chat completions endpoint, all I did was stream the response and, as each token came in, add it to a string. That string is checked against a regex to see if it matches a sentence structure (I got GPT to write the regex). If the string contained a full sentence, it would push that sentence through to my TTS endpoint and, in a thread, generate the relevant audio file for that sentence. Of course there were more steps to organizing it all, and it was a bit of a pain, and there was always a noticeable pause between each sentence (more than there should have been). But it did work. For example, I also needed to check API call timestamps to ensure that if a thread was out of sequence, say the second sentence finishing faster than the first, it wouldn't play them in the wrong order.

I am much happier to stream the audio out and run the audio on the full turn instead of each sentence. The problem is I can't seem to get anything asynchronously from the assistant. I figured if I called .aiter_bytes() off of the tts ... create() function it would immediately return the async iterator, which I could continuously check for new chunks, but it only seems to return the iterator once the entire response is complete... I logged the time it took to get the first chunk from aiter_bytes() and the time it took for stream_to_file() to complete, and the first chunk from aiter_bytes() was always slower than the full completion from stream_to_file(). I can only conclude it's because aiter_bytes() doesn't return the async iterator until the full response is received. I am a rookie at async as I have always used processes or threads, but even GPT expected my solution would work and thinks the endpoint is currently just not capable of streaming. Anyone more experienced in the matter know how to make it work? |
Yes that's correct and what I do in what I pasted above. So did you pass the |
Cool, thanks. It doesn't seem to use streaming to get the audio from TTS, but maybe it's fine if getting the chunks as a whole is fast enough:

const arrayBuffer = await response.arrayBuffer();
const blob = new Blob([arrayBuffer], { type: 'audio/mpeg' });
const url = URL.createObjectURL(blob); |
@antont No, I am super new to using GitHub so I am not quite sure how it works yet. Am I able to just look up a specific branch by id and pull that down as my new openai library? Even if it's not the best implementation, as long as other things don't significantly break I'd 100% be willing to work with it temporarily if it gives me a TTS streamable response. Just finished learning how websockets work and was finally, after like 6 hours, able to set up a websocket connection to Twilio. Their docs are really dated, which made it much harder... |
Yes, you can get that version and install it: https://github.com/openai/openai-python/tree/b2b4239bc95a2c81d9db49416ec4095f8a72d5e2 . Maybe there are nice instructions somewhere. @EigenSpan here are brief instructions for you:
oh or actually just this works too:
|
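The exact commands aren't preserved in this capture, but installing straight from that commit can generally be done with pip's git support, for example:

pip install git+https://github.com/openai/openai-python.git@b2b4239bc95a2c81d9db49416ec4095f8a72d5e2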
Great that this issue is being inspected! I've spent 5 hours (it was a week ago) trying to fix this. I've done the same as @antont and it worked! But when I tried to play the audio, it was so messy. Here is an example using just the requests library: https://gist.github.com/44-5-53-6-k/2df4f85d210d0ba80ff6335a78d872e5 Basically, it does this:
And at some point it just gives me this:
I believe that I am fundamentally wrong about how streaming chunks work, but I can't find any info on that. Could you share how you run these chunks? |
Seems like everyone just wants to be able to pipe the incoming text streaming chunks into the speech create function and for it to start generating speech as soon as a certain threshold is reached (probably a whole sentence, so it sounds natural). Hope this feature is added soon. |
@Narco121 Everyone absolutely wants to be able to pipe the GPT response stream into the TTS endpoint and receive a TTS stream back. If you use the chat completions endpoint, this can be done by using a generator and async or threads with a regex (see the sketch below). You add the incoming tokens to a string and check against a regex for when that string is a complete sentence (check for things like .?!…). Once it is, async/thread send that string to your TTS endpoint and organize the responses in order and play them. It's a pain to organize, but it did somewhat work when I did it. Sounded a bit choppy with 11labs; not sure how OpenAI's TTS would sound.

The main problem people are having stems from the fact that in the main branch you currently can't stream the TTS endpoint, at least every way I've tried, using threading or async etc. The actual endpoint, even when calling stream_to_file, doesn't stream anything. It only returns the file once all the audio is generated. There is no way to access the audio data before it is 100% completed. That being said, as antont pointed out, they did fix it in another branch but later reverted the change in favour of another method of doing it. For now, once my code is ready to handle the stream, I'll just be swapping to the branch that allows TTS streaming even if it isn't the "best" way of doing it. |
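A rough sketch of that sentence-splitting idea, assuming the openai Python client; the model name, regex and prompt are illustrative placeholders, not anything from the SDK:

import re
from openai import OpenAI

client = OpenAI()
SENTENCE_END = re.compile(r"[.!?…]\s*$")  # crude end-of-sentence check

def stream_sentences(prompt: str):
    # Accumulate streamed tokens and yield whenever a full sentence is buffered
    buffer = ""
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()

# Each yielded sentence can then be handed to the TTS endpoint, ideally in a
# worker thread so synthesis of sentence N overlaps with generation of N+1.
for sentence in stream_sentences("Tell me a short story."):
    print(sentence)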
I got the output of OAI TTS to stream. Here's an example:

import io
import requests

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": 'Bearer YOUR_API_KEY',
}
data = {
    "model": model,        # model, input_text and voice are defined elsewhere
    "input": input_text,
    "voice": voice,
    "response_format": "opus",
}
with requests.post(url, headers=headers, json=data, stream=True) as response:
    if response.status_code == 200:
        buffer = io.BytesIO()
        for chunk in response.iter_content(chunk_size=4096):
            buffer.write(chunk)

Hope that helps! |
Nice @cyzanfar ! In case anyone comes across this thread looking for a fully working solution, written in Python, with a sample Flask app and an audio player:

from flask import Flask, Response, render_template_string
import requests

app = Flask(__name__)

@app.route('/')
def index():
    # HTML template to render an audio player
    html = '''
    <!DOCTYPE html>
    <html>
    <body>
        <audio controls autoplay>
            <source src="/stream" type="audio/mpeg">
            Your browser does not support the audio element.
        </audio>
    </body>
    </html>
    '''
    return render_template_string(html)

@app.route('/stream')
def stream():
    def generate():
        url = "https://api.openai.com/v1/audio/speech"
        headers = {
            "Authorization": 'Bearer YOUR_SK_TOKEN',
        }
        data = {
            "model": "tts-1",
            "input": "YOUR TEXT THAT NEEDS TO BE TTSD HERE",
            "voice": "alloy",
            "response_format": "mp3",
        }
        with requests.post(url, headers=headers, json=data, stream=True) as response:
            if response.status_code == 200:
                for chunk in response.iter_content(chunk_size=4096):
                    yield chunk
    return Response(generate(), mimetype="audio/mpeg")

if __name__ == "__main__":
    app.run(debug=True, threaded=True)

And this is indeed working beautifully! |
@amarbayar Is it also possible to stream a TTS response in TypeScript? In my frontend I want to play the returned audio with minimum latency. |
Just tried the new raw response streaming API in the SDK (#1072) with FastAPI. This always results in

openai_client = OpenAI()

with openai_client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input=text,
    response_format="mp3"
) as response:
    if response.status_code == 200:
        def generate():
            for chunk in response.iter_bytes(chunk_size=2048):
                print(f"Chunk size: {len(chunk)}")  # Print the size of each chunk
                yield chunk
        return StreamingResponse(
            content=generate(),
            media_type="audio/mp3"
        )
    else:
        return HTTPException(status_code=500, detail="Failed to generate audio")

Making the API call directly seems to still be the best way to go. |
@matthiaskern that's happening because the context manager is exiting when you return. Like this example in the FastAPI docs, you'll need to make the API call within your generator function. Is there a reason that making the API call inside the generator doesn't work for you? |
Of course, thank you @RobertCraigie! I'm not sure about how to handle an exception from OpenAI in this case, but this is a great start. For reference:

def generate():
    with openai_client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=input,
        response_format="mp3"
    ) as response:
        if response.status_code == 200:
            for chunk in response.iter_bytes(chunk_size=2048):
                yield chunk

return StreamingResponse(
    content=generate(),
    media_type="audio/mp3"
) |
Great @matthiaskern! I'm not 100% sure how to handle exceptions - I'm not a FastAPI expert, but I think you could define middleware to handle exceptions? That way you don't have to handle it for every endpoint. It's also worth noting that the original code snippet you shared won't actually hit the error-handling path referenced here:
openai-python/src/openai/_base_client.py, line 940 (at 0c1e58d)
openai-python/src/openai/_exceptions.py, line 73 (at 0c1e58d) |
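One possible shape for that middleware idea, assuming FastAPI's exception_handler hook and the SDK's APIStatusError; note this only helps if the create() call runs before the StreamingResponse starts iterating, since errors raised mid-stream arrive after the headers are already sent:

import openai
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(openai.APIStatusError)
async def openai_status_error_handler(request: Request, exc: openai.APIStatusError):
    # Surface the upstream status code instead of a generic 500
    return JSONResponse(status_code=exc.status_code, content={"detail": str(exc)})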
As of today (
I found the message a bit cryptic and couldn't find any real documentation of this.

from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="""I see skies of blue and clouds of white
The bright blessed days, the dark sacred nights
And I think to myself
What a wonderful world""",
) as response:
    # This doesn't seem to be *actually* streaming, it just creates the file
    # and then doesn't update it until the whole generation is finished
    response.stream_to_file("speech.mp3")

But I wasn't able to achieve actual streaming with the Python library, only through the REST API (see my post on the OpenAI Forum). It would be great to have improved documentation and support for streaming TTS! |
I think you should not do that, but use something like speech_stream.response.aiter_bytes instead. My working thing from a few months back is in a comment above here and used that. |
Hey @nimobeeren, thanks for the repro. On my computer this does seem to be streaming to the file as expected :(. Would you be able to provide what system you're using (it shouldn't make a difference, but just in case) and how you were able to determine that the file wasn't streaming? |
Thanks for the handy link – we'll be releasing a new example file soon based on that thread which shows how to stream TTS to pyaudio with WAV. |
An example of how to stream a TTS response to your speakers with PyAudio is now available here: openai-python/examples/audio.py, lines 40 to 60 (at e41abf7)
|
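For anyone who can't follow the link, the gist of that example is to request raw PCM and write each chunk straight to a PyAudio output stream; this is a sketch along those lines, so treat the exact parameters as assumptions rather than the contents of the example file:

import pyaudio
from openai import OpenAI

client = OpenAI()

# OpenAI's pcm output is 16-bit signed, 24 kHz, mono
player = pyaudio.PyAudio().open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    response_format="pcm",
    input="Hello! This should start playing before generation finishes.",
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)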
@yjp20 It seems to be working for me too now! I'm just running the code and simultaneously doing |
@rattrayalex thank you Alex, that example works for me! I spent some time trying to figure out why. It might be nice to use the header to set the pyaudio options instead of hardcoding them, but I couldn't figure out how to do that (didn't manage to turn the response into something I can feed into |
This is a pretty long extension of the above, but I figured those on this thread may find it useful. It links together a whole chain such that you can stream the audio response to a prompt. It works by using one thread to stream the text reply into phrases which are enqueued for TTS, a second thread which TTS's each phrase as it completes, and a third thread which starts playing each phrase out loud as it's been TTS'd. The final effect is much like working with the ChatGPT app, where you get a "streaming audio response" to your question and don't have to wait for the full text to come back before you can start listening to audio. What's here I'm sure could be improved; it's primarily designed to show it all put together in a terminal. https://gist.github.com/Ga68/3862688ab55b9d9b41256572b1fedc67 |
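A compact sketch of that three-stage pipeline, reusing stream_sentences from the earlier sketch above and a PyAudio player like the one in the previous example; queue sentinels mark end-of-work:

import queue
import threading

import pyaudio
from openai import OpenAI

client = OpenAI()
player = pyaudio.PyAudio().open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
phrase_q = queue.Queue()
audio_q = queue.Queue()

def produce_phrases():
    # Stage 1: stream the chat reply and enqueue complete phrases
    for phrase in stream_sentences("Explain how rainbows form."):
        phrase_q.put(phrase)
    phrase_q.put(None)

def synthesize():
    # Stage 2: TTS each phrase in arrival order, keeping playback order stable
    while (phrase := phrase_q.get()) is not None:
        with client.audio.speech.with_streaming_response.create(
            model="tts-1", voice="alloy", response_format="pcm", input=phrase
        ) as response:
            audio_q.put(b"".join(response.iter_bytes()))
    audio_q.put(None)

def play():
    # Stage 3: play each synthesized phrase as soon as it is ready
    while (audio := audio_q.get()) is not None:
        player.write(audio)

threads = [threading.Thread(target=f) for f in (produce_phrases, synthesize, play)]
for t in threads:
    t.start()
for t in threads:
    t.join()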
I think this can be closed. |
Confirm this is an issue with the Python library and not an underlying OpenAI API
Describe the bug
When following the documentation on how to use client.audio.speech.create(), the returned response has a method called stream_to_file(file_path), which explains that when used, it should stream the content of the audio file as it's being created. This does not seem to work. I've used a rather large text input that generates a 3.5 minute sound file and the file is only created once the whole request is completed.
To Reproduce
Utilize the following code and replace the text input with a decently large amount of text.
Notice that when the script is run, the speech.mp3 file is only ever created after the request is fully completed.
Code snippets
No response
OS
macOS
Python version
Python 3.11.6
Library version
openai v1.2.4