Issue sending JSON and then audio Watson Speech to Text service #5
Hi! Many thanks for the detailed report; I managed to reproduce it without much hassle. Actually, it is a bit simpler than that. The problem occurs here: Rust's `read_line` assumes the input is UTF-8 and throws an error when it is not. I guess this is because the `String` struct is UTF-8 encoded by default. I'll research whether there is some other way to handle this.

Also: way cool idea you have here. Love it.
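A minimal demonstration of the failure mode (a sketch, not wsta's actual code): `BufRead::read_line` collects bytes into a `String`, which must be valid UTF-8, so raw audio bytes make the read fail with an `InvalidData` error.

```rust
use std::io::{BufRead, BufReader, ErrorKind};

// read_line appends to a String, which must be valid UTF-8; raw audio
// bytes therefore fail with ErrorKind::InvalidData
// ("stream did not contain valid UTF-8").
fn try_read_line(bytes: &[u8]) -> Result<String, ErrorKind> {
    let mut reader = BufReader::new(bytes);
    let mut line = String::new();
    match reader.read_line(&mut line) {
        Ok(_) => Ok(line),
        Err(e) => Err(e.kind()),
    }
}
```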
Aah, I see. Thanks for looking into it, and thanks for the props :)
So, I've looked into the matter. It is indeed possible; I just need to figure out a neat way to implement it.

Do you by any chance know how the Watson service determines how much data to put into each frame? It obviously does not make sense to use line breaks to separate frames in binary data. Have they simply decided to put 5.8 KB in each chunk and send that in a frame? It looks like it, judging by the image below. In which case, I'm thinking of an API built around a fixed frame size.

That API has a problem, however: you want to send a JSON message first and last, and it would not make sense to apply the same 5800-byte limitation there. Maybe a better option is to try to parse everything as UTF-8 and fall back to binary if that fails. That has its own issues, implementation-wise, however. Suggestions or ideas are welcome. I'll get back to you when I know more.
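One way the UTF-8-with-binary-fallback idea could look (a sketch, not wsta's implementation): a chunk that parses as UTF-8 becomes a text frame, anything else a binary frame.

```rust
// Hypothetical frame type for the fallback idea.
enum Frame {
    Text(String),
    Binary(Vec<u8>),
}

fn classify(chunk: Vec<u8>) -> Frame {
    match String::from_utf8(chunk) {
        // Valid UTF-8: could be sent as a WebSocket text message.
        Ok(text) => Frame::Text(text),
        // Not UTF-8: recover the bytes and send as a binary message.
        Err(e) => Frame::Binary(e.into_bytes()),
    }
}
```

The implementation-wise issue is visible here: a binary chunk that happens to be valid UTF-8 would be misclassified as text, so the heuristic alone cannot be fully reliable.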
I believe the Watson service can handle a pretty wide range of frame sizes, but I think the demo works on 8192-sample buffers of 16-bit mono audio. But, yeah, switching between text and binary is the tricky part; I'm not sure what to suggest there. Although, looking closer at your screenshot, I do see the 5.8 KB chunks... not sure what decided that.
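For what it's worth, 8192 samples of 16-bit mono audio works out to exactly the 16384-byte frame size used elsewhere in this thread:

```rust
// 8192 samples/buffer × 2 bytes/sample (16-bit) × 1 channel (mono).
const SAMPLES_PER_BUFFER: usize = 8192;
const BYTES_PER_SAMPLE: usize = 2;
const CHANNELS: usize = 1;
const FRAME_BYTES: usize = SAMPLES_PER_BUFFER * BYTES_PER_SAMPLE * CHANNELS;
```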
I messed around a bit more with the binary feature, and now I have something I can show. I did not succeed in getting the Watson service to respond, however. Are you pre-processing the audio in any way before it is sent over the wire? If you want to try it, you can have a look and run:

```shell
env WSTA_BINARY_FRAME_SIZE=16384 cargo run -- --binary 'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=...' "$(cat start.json)" < audio-file.wav
```
Oh, and then: I think WebSockets differentiate between UTF-8 and binary messages, and the opening JSON one may have to be UTF-8. I'm kind of slammed right now, so it might be a few days before I can test anything :/
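For reference, the text/binary distinction lives in the WebSocket frame header itself (RFC 6455): the opcode nibble of the first header byte is 0x1 for text and 0x2 for binary. A tiny illustration:

```rust
// First header byte of a single-frame WebSocket message (RFC 6455):
// the FIN bit set, plus opcode 0x1 (text) or 0x2 (binary). This is how
// the server tells the opening JSON apart from the audio that follows.
fn first_header_byte(is_binary: bool) -> u8 {
    const FIN: u8 = 0x80;
    let opcode: u8 = if is_binary { 0x2 } else { 0x1 };
    FIN | opcode
}
```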
Oh wow, it works! This is pretty much the coolest thing I have seen all month! Thanks for the update. I'll release a new version soon, so that you can enjoy this through your normal distribution channel when you feel like it.

```shell
$ arecord -D hw:3,0 --format=S16_LE --rate=44100 --channels=1 | env WSTA_BINARY_FRAME_SIZE=16384 target/release/wsta -b 'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=...' "$(cat start.json)" | jq .results[0].alternatives[0].transcript
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 44100 Hz, Mono
Connected to wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=...
null
"hello "
"hello this is me "
"hello this is me talking to "
"hello this is me talking to people "
"hello this is me talking to people "
"%HESITATION "
"all my "
"all my "
"well my "
"well my go through "
"well my go this "
"hold my ground this action "
"hold my ground this actually worked "
"hold my ground this actually works "
"or more "
"or more ago "
"well I got on the "
"well I got on the details "
"or more you know to tell someone about this "
"or more you know to tell someone about this "
```

As you can see, my reaction was pretty much summed up in the output above.
Hey, this is a follow-up to the comments I left on Hacker News. I'm trying to send an opening JSON message and then audio data to the Watson STT service. It was suggested that something like this would work:
(Getting a token requires some fiddling around with Bluemix for credentials and converting them to a token via either curl or your favorite SDK... or just go to the demo, open the dev console, and grab one; they're reusable for a short period of time. Or, if wsta supports custom headers, you can stick the credentials into a basic auth header and skip the token.)
And, start.json looks like this:
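A typical Watson Speech to Text start message looks something like this (an assumption based on the service's WebSocket interface, not necessarily the exact file used here):

```json
{
  "action": "start",
  "content-type": "audio/l16;rate=44100;channels=1",
  "interim_results": true
}
```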
However, when I do that, I get a ton of

```
error: stream did not contain valid UTF-8
```

messages, then the normal `{"state": "listening"}` message that acknowledges my initial JSON, and then a `"No JSON object could be decoded"` error.

My best guess is that wsta is correctly marking the opening JSON as a UTF-8 message, and then incorrectly marking all of the audio data as UTF-8 messages as well. Does this sound likely? Is that reasonably easy to fix?
FWIW, the service also expects a closing JSON message at the end, but that's not nearly as important because it will automatically kill the connection after 30 seconds of silence.
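For completeness, the closing message the service expects looks roughly like this (again an assumption based on the documented WebSocket interface):

```json
{ "action": "stop" }
```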