Use and parse SSML to change voices, pitch, rate #3
Comments
Thanks for the links! As I understand it, SSML itself isn't supported by browsers now, and the utterance value is text extracted from the SSML markup. Thus manual SSML parsing breaks up the sentence, and speech synthesis can't build a correct phrase consisting of several utterances with different voice settings. Am I correct? Also, as I understand it, SSML solves one part of the problem I've referred to: speech-related voice control. I think there should be more control at each level: sound output as an audio stream, filters for making a voice sound softer or more metallic, emotion control (in the future), etc. I think there is some progress on this, but I'm new to the question and don't know much, and I don't know where to look it up. I also think it is very important to separate artistic or research usage (book reading, games, personal communication) from service or everyday usage (work, learning, business communication). In the second case the publisher should use semantic and pronunciation markup without specifying voice characteristics, because users should have full control and be able to choose voices themselves. These voices should be set up in OS/browser settings. |
SSML parsing is not supported by browsers right now (https://bugs.chromium.org/p/chromium/issues/detail?id=795371; https://bugzilla.mozilla.org/show_bug.cgi?id=1425523). That is one reason composed an SSML parser. Yes, it is possible to construct sentences with breaks, pitch, rate, and voice changes. You can load https://github.com/guest271314/SpeechSynthesisSSMLParser/blob/master/SpeechSynthesisSSMLParserTest.html and observe the output for yourself. So far have implemented
That is possible by adjusting the voice, pitch, and rate, and using
Depending on the requirement it is possible to bypass Web Speech API (including the issues with implementations https://stackoverflow.com/questions/48504228/how-can-i-make-my-web-browser-speak-programmatically/48504311#48504311) altogether and use |
I have read the source of the SSML Parser, and according to it you create a queue of utterances and then play them. But there is an issue with that: my browser pronounces such sentences with pauses between words. For example, the phrase "Hello, World!" is pronounced differently when presented as one string or as two. Here is a JSFiddle example: https://jsfiddle.net/rumkin/o3x5Lf96/. Have you checked this difference with your solution? I think native SSML support will fix this issue. There is basic SSML support in Chromium, but I haven't yet checked what it can do.
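For reference, the queueing approach discussed here can be sketched roughly as follows. This is only an illustration, not the parser's actual code: `queueSegments` and the segment shape (`text`, `voice`, `pitch`, `rate`) are hypothetical names, and the synthesizer and utterance constructor are passed in as parameters so the function can also be exercised outside a browser. Since every segment becomes its own utterance, any gap the engine inserts between utterances would be heard at the segment boundaries.

```javascript
// Sketch: queue text segments as separate utterances, each with its own settings.
// `synth` is expected to behave like window.speechSynthesis and `Utterance`
// like SpeechSynthesisUtterance; both are injected (an assumption of this sketch).
function queueSegments(segments, synth, Utterance) {
  const queued = [];
  for (const { text, voice, pitch = 1, rate = 1 } of segments) {
    const utterance = new Utterance(text);
    if (voice) utterance.voice = voice; // leave the default voice when none is given
    utterance.pitch = pitch;
    utterance.rate = rate;
    synth.speak(utterance); // speak() appends the utterance to the synthesis queue
    queued.push(utterance);
  }
  return queued;
}

// In a browser, the two variants from the JSFiddle could be compared like this:
// queueSegments([{ text: "Hello, World!" }], speechSynthesis, SpeechSynthesisUtterance);
// queueSegments([{ text: "Hello," }, { text: "World!" }], speechSynthesis, SpeechSynthesisUtterance);
```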
I need to get deeper into SSML to be more confident with the terminology. I saw params such as age and gender in the spec, and it seems pretty interesting to control these on the fly. But it's based on the current generation of TTS technologies, which is still rigid. I think that if it is enhanced with neural networks, we can achieve a more flexible solution and control more specific characteristics like emotion, tooth count, and other physical params.
I'm aiming to enhance the specification and work with the W3C committee to make it a standard, not some kind of hack.
Currently it's hard to say without going deeper into this question. But the global goal is to make the browser a complete solution for creating, testing, and using new TTS solutions, and to give all engineers equal access to this technology. So the browser should be able to:
This project's goals are to promote the idea of a two-voice model and to demonstrate how easy it is to work and experiment with the Speech API in the browser. The next step is to create a website speech API accessibility debugger, and I think your solution fits well for this. But it should be reworked for better maintainability. I will think about how I can help with that. |
What is the expected result?
Are you certain? The last time checked Chromium had not implemented SSML parsing for Web Speech API.
Web Speech API is currently under WICG umbrella. Am banned from WICG for 1,000 years for fraudulent reasons concocted by that body. W3C did not contest that had signed up correctly, deleted account had created, and have content published under their umbrella right now which they cited as the reason as an issue with own account, thus, their conduct is fraudulent in nature and substance as well. Will not be able to contribute to the current specification as it is under WICG control right now. If am able to help in any way give a ping. |
Have you listened to the example I attached? It shows pretty clearly how different it sounds when you speak the whole phrase versus word by word. In the second case there are additional pauses between words. The phrase doesn't sound like regular speech and falls apart. That could change the meaning of the message at sentence boundaries.
Well, I did not check meticulously, but it accepted
Sad to hear that. I think the situation could be solved with working code and community support.
Sure. Thanks! |
That depends on what the input and expected result are. If the requirement is to input a sentence without distinguishable gaps between the words, that is possible using the linked parser code. If the requirement is a break The SSML within the linked code are tests. Can you file an issue at the linked repository if the output is not as expected?
Chromium appears to have at least stripped XML from input text. Does not parse SSML
Am still not sure exactly what output you are expecting that you are not able to achieve now? |
What local TTS engine is |
So that we are prospectively using the same code you can install |
@rumkin Initial implementation of the proof-of-concept at previous post https://github.com/guest271314/native-messaging-espeak-ng. Provides a means to input SSML as a string or XML
|
Hi, thanks for this. I wish it were a more widely adopted solution that everyone could run in their browser without needing to install something on their system. But I understand that's not possible at the moment, so I'll be looking for a way to communicate with WHATWG. I'll reopen this to help others read this issue. Please don't close it. |
That is the purpose of the solution in the format provided by Native Messaging. Either way code has to be installed in their system. The code can be shipped in Chromium source code, as it is with Chromium OS (https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome), however, AFAICT SSML parsing is not enabled by default (pettarin/espeakng.js-cdn#1). Or, the code can be installed by the user and executed utilizing Native Messaging. The former requires asking Chromium authors to ship existing code to achieve the requirement by default, which already have, more than once. The latter provides front-end control over the entire process. Native Messaging might appear to be an "installation" initially, though that is not necessarily the case. The code can be maintained and implemented by front-end for the front-end, as described in a linked answer above. https://github.com/simov/native-messaging includes a Firefox version as well. Have not yet thoroughly tested the SSML parsing output of If you find any errors with the code (during testing) do not hesitate to file an issue. Having completed the initial version am now exploring creating a virtual device for the audio output. Am able to get a direct In any event, do not hesitate to file issue, PR, feature request, to improve the code. |
Opus installation (for the purposes of the initial code included primarily to reduce file size) is not necessary. |
Assumes It should be possible to substitute a host written in C, C++, Python, Rust, etc. for "native-messaging-host-bash.sh" (https://github.com/guest271314/native-messaging-espeak-ng/blob/bash-audioworklet/host/native-messaging-host-bash.sh). Note: We do not actually send the input text to the Native Messaging host or send back audio output as a file using the Native Messaging protocol due to the limitations on message size (https://developer.chrome.com/extensions/nativeMessaging) and processing input with
Instead we send a single character |
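To illustrate the size limitation mentioned above: Chrome's native messaging documentation describes each message as UTF-8 encoded JSON preceded by a 32-bit length in native byte order, with a 1 MB cap on a single message from the host and a 4 GB cap on a message to the host. A rough sketch of that framing is below. `frameNativeMessage` is a hypothetical helper, not part of the linked repository, and little-endian byte order is assumed, which holds on common x86 and ARM systems.

```javascript
// Documented limit for one message sent from the native host to the browser.
const MAX_HOST_TO_BROWSER_BYTES = 1024 * 1024;

// Sketch: frame a value the way a native messaging host would write it to stdout.
// Layout: 4-byte unsigned length (assumed little-endian) followed by UTF-8 JSON.
function frameNativeMessage(message, littleEndian = true) {
  const body = new TextEncoder().encode(JSON.stringify(message));
  if (body.length > MAX_HOST_TO_BROWSER_BYTES) {
    throw new RangeError(`message is ${body.length} bytes, above the 1 MB host limit`);
  }
  const framed = new Uint8Array(4 + body.length);
  new DataView(framed.buffer).setUint32(0, body.length, littleEndian);
  framed.set(body, 4); // JSON payload starts right after the length prefix
  return framed;
}
```

Any audio output of realistic length would exceed the 1 MB cap if returned as one message, which is consistent with not sending audio back through the protocol.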
The goal of this repository described at #2 (comment) is achievable by parsing an HTML or SSML document, see WICG/speech-api#10, https://github.com/mhakkinen/SSML-issues, https://github.com/alia11y/SSMLinHTMLproposal, https://github.com/guest271314/SpeechSynthesisSSMLParser.
Changing voices is possible at any time after the voices are loaded with getVoices() and using the onvoiceschanged event, and/or by parsing an SSML element where the voice is set, e.g., <voice name="english_rp" languages="en-US" required="name">${Math.E}</voice> https://github.com/guest271314/SpeechSynthesisSSMLParser/blob/master/SpeechSynthesisSSMLParserTest.html#L525.
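A minimal sketch of the voice-loading step, assuming a synthesizer object that behaves like window.speechSynthesis. `loadVoices` and `findVoice` are hypothetical helper names and are not taken from the linked parser. Some browsers return an empty list from getVoices() until the voiceschanged event has fired, so the sketch waits for that event when needed.

```javascript
// Sketch: resolve with the voice list, waiting for voiceschanged if it is not ready yet.
// `synth` is injected so the helper can be tried with a stand-in object outside a browser.
function loadVoices(synth) {
  return new Promise((resolve) => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }
    // Voices not loaded yet: read the list again once the browser signals a change.
    synth.onvoiceschanged = () => resolve(synth.getVoices());
  });
}

// Sketch: look up a voice by the name used in a <voice name="..."> element.
function findVoice(voices, name) {
  return voices.find((voice) => voice.name === name) || null;
}

// In a browser:
// loadVoices(speechSynthesis).then((voices) => {
//   const utterance = new SpeechSynthesisUtterance(String(Math.E));
//   const voice = findVoice(voices, "english_rp");
//   if (voice) utterance.voice = voice;
//   speechSynthesis.speak(utterance);
// });
```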