Extending Media Capture and Streams with MediaStreamTrack kind TTS #654
Note, am asking this question more for other users and use cases than for self: for individuals who are more comfortable using clearly defined specifications and official implementations by browsers than rolling their own. Some users appear to actually expect these specifications to meet their needs, and/or do not want to "install" anything; rather, the code is expected to be already implemented in the browser, for whatever their reasons are. Nonetheless, some users still apparently believe that the browser should be able to output what they expect it to, given the state of the art. That is a reasonable expectation, one that have abandoned for self; until that option is foreclosed, which the closure of the related issues effectively does, will ask one more time, on this occasion, for the canonical procedure to extend:

> Basic model of
> Single-use
> Persistent

> Production of a MediaStreamTrack from TTS should be an extension spec for the Text-To-Speech API, not a feature of the MediaStreamTrack API.
@alvestrand Can you answer this supplemental question, where the scope is beyond only a TTS use case? Since Chromium and Chrome already have source code to create a "Fake" device and read a local file:
Extending Media Capture and Streams with MediaStreamTrack kind TTS or: What is the canonical procedure to programmatically create a virtual media device where the source is a local file or piped output from a local application?
The current iteration of the specification includes the language
https://w3c.github.io/mediacapture-main/#dfn-source
and
https://w3c.github.io/mediacapture-main/#extensibility
In pertinent part
and
https://w3c.github.io/mediacapture-main/#defining-a-new-media-type-beyond-the-existing-audio-and-video-types
The list items under the above section are incorporated by reference herein.
Problem
Web Speech API (W3C) is dead.
The model is based on communication with `speech-dispatcher` to output audio from the sound card, where the user (consumer) has no control over the output.

This proposal is simple:
Extend `MediaStreamTrack` to include a `kind` `"TTS"` where the `source` is output from a local TTS (Text-To-Speech; speech synthesis) engine.

The model is also simple: assuming there is a local `.txt` or `.xml` document, the input text is read by the TTS application from the local file. The output is a `MediaStream` containing a single `MediaStreamTrack` of `type` and `label` `"TTS"`. The `source` file is read and output as a `MediaStreamTrack` within a `MediaStream` after the `getUserMedia()` prompt. When the read of the file reaches `EOF`, the `MediaStreamTrack` of `kind` `"TTS"` automatically stops, similar to "MediaRecorder: Implements spontaneous stopping".

Such functionality exists for testing; in brief:
For example
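The Chromium testing switches referenced later in this issue can be combined as follows. This is a sketch, not a supported production configuration: the binary name and file path are placeholders, and the capture file is expected to be a WAV file.

```shell
# Launch Chromium with a fake media capture device whose audio source is a
# local file instead of a real microphone (binary name and path illustrative).
chromium \
  --use-fake-device-for-media-stream \
  --use-fake-ui-for-media-stream \
  --use-file-for-fake-audio-capture=/path/to/tts-output.wav
```

With these switches, `navigator.mediaDevices.getUserMedia({audio: true})` resolves with a track backed by the file rather than the sound card, which is the behavior the testing code relies on.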
One problem with using that testing code in production, to meet the requirement of outputting the result of TTS, is that there is no way to determine `EOF` without getting the `duration` of the file before playback as a `MediaStream`; since SSML can include `<break time="5000ms"/>`, analyzing the audio output stream for silence can lead to prematurely executing `MediaStreamTrack.stop()` to end the track. When called multiple times in succession, even after having `stat`ed the file twice to get the `duration`, after two to three calls no sound is output. macOS also has an issue with that flag: `--use-file-for-fake-audio-capture` does not work on Chrome.

We can write the input `.txt` or `.xml` file to the local filesystem using the File API or Native File System API, therefore input is not an issue.

Why Media Capture and Streams and not Web Speech API?
W3C Web Speech API is dead.
W3C Web Speech API was not initially written to provide such functionality, even though the underlying speech synthesis application installed on the local machine might have such functionality.
Even if Web Speech API does become un-dead and moves to, or provides, `MediaStream` and `MediaStreamTrack` as output options, some form of collaboration with, and reliance on, this governing specification will be required. Thus it is reasonable to simply begin from the Media Capture and Streams specification re "Extensibility" and work backwards, or rather, work from both ends towards the middle; attempting to perform either modification in isolation might prove to be inadequate.

If there is any objection to W3C Web Speech API being dead, re the suggestion to deal with speech synthesis in the Web Speech API specification, then that objection must include the reason why Web Speech API has not implemented the SSML parsing flag when the patch has been available for some time https://bugs.chromium.org/p/chromium/issues/detail?id=795371#c18, and why, instead of actually using the Web Speech API, ChromiumOS authors decided to use `wasm` and `espeak-ng` to implement TTS https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome, essentially abandoning Web Speech API usage?

--
An alternative approach to solve the use case is for the specification to compose the formal steps necessary to create a virtual media device that `getUserMedia()` provides access to as a "microphone" (because the device is virtual, we can assign said device as a microphone, which should be listed at the `getUserMedia()` prompt and listed by `enumerateDevices()`), e.g., https://stackoverflow.com/a/40783725, in order to not have to ask this body to specify the same in the official standard: just patch the virtual device into the existing infrastructure.
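On Linux with PulseAudio, the "patch in a virtual device" approach can be sketched with no browser changes at all. The module name and its parameters are from PulseAudio's `module-pipe-source`; the source name, pipe path, sample format, and choice of `espeak-ng` as the engine are illustrative assumptions.

```shell
# Create a pipe-backed virtual microphone; enumerateDevices() and the
# getUserMedia() prompt should then list it as an ordinary audio input.
pactl load-module module-pipe-source source_name=tts_mic \
  file=/tmp/tts-pipe format=s16le rate=22050 channels=1

# Feed the pipe with raw PCM from a local TTS engine (espeak-ng shown as
# one possible engine; any program writing matching PCM would work).
espeak-ng --stdout "hello world" | \
  ffmpeg -i pipe:0 -f s16le -ar 22050 -ac 1 pipe:1 > /tmp/tts-pipe
```

The sample format declared to `module-pipe-source` must match what is written into the pipe, otherwise the captured audio is garbled.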
--
Use cases
For some reason, users appear to feel more comfortable using standardized APIs rather than rolling their own. For those users, a canonical means to patch into the existing formal API, without that functionality being officially written, might provide the assurance they seem to want that the means used are appropriate and should "work". Indeed, some users appear not to be aware that currently Web Speech API itself does not provide any algorithm to synthesize text to speech; it is hard to say.
and
The latter case should be easily solved by implementing SSML parsing. However, that has not been done, even though the patch to do so exists https://bugs.chromium.org/p/chromium/issues/detail?id=795371#c18 and the maintainers of `speech-dispatcher` (`speechd`) are very helpful. Tired of waiting for Web Speech API to be un-dead, wrote an SSML parser from scratch using JavaScript.
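The part of such a parser that bears on the `duration` problem described above can be sketched in a few lines (function names here are hypothetical, not from the actual parser): summing the silence that `<break>` elements insert shows why analyzing the output stream for silence cannot reliably detect `EOF`.

```javascript
// Hypothetical helpers: compute the total silence, in milliseconds, that
// SSML <break time="..."/> elements add to synthesized output.
function breakTimeToMs(value) {
  // SSML time values are written as e.g. "5000ms" or "5s".
  const match = /^(\d+(?:\.\d+)?)(ms|s)$/.exec(value.trim());
  if (!match) return 0;
  const n = parseFloat(match[1]);
  return match[2] === "s" ? n * 1000 : n;
}

function totalBreakMs(ssml) {
  // Naive scan for <break/> elements; a real implementation would use an
  // XML parser rather than a regular expression.
  const re = /<break\b[^>]*\btime="([^"]+)"[^>]*\/?>/g;
  let total = 0;
  let m;
  while ((m = re.exec(ssml)) !== null) {
    total += breakTimeToMs(m[1]);
  }
  return total;
}

// A 5000ms break plus a 2s break adds 7 seconds of intentional silence,
// which silence detection would misread as the end of the stream.
const silentMs = totalBreakMs(
  '<speak>Hello<break time="5000ms"/>world<break time="2s"/></speak>'
);
```

Any `duration`-based stop logic would need to add this value to the length of the synthesized speech itself before deciding when to call `MediaStreamTrack.stop()`.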
So, no,
is not applicable anymore. Why would users have any confidence that the Web Speech API is un-dead and will eventually address the issue?
Besides, in order to get output as a `MediaStream`, this specification would need to be involved in some non-trivial way as a reference.

--
The purpose of this issue is to get clarity on precisely what is needed for `getUserMedia()` to list a created virtual device for purposes of speech synthesis output, a device of the kind `getUserMedia()` is currently specified to list and have access to, so that we can feed that device the input from a file or pipe directly to the `MediaStreamTrack`, and so that users can implement the necessary code properly themselves.

The use cases exist. The technology exists. Am attempting to bridge the gap between an active and well-defined specification and an ostensibly non-active and ill-defined specification, incapable of being "fixed" properly without rewriting the entire specification (in which this user cannot participate, due to the fraudulent 1,000 year ban placed on this user from contributing to WICG/speech-api).
What are the canonical procedures to 1) extend (as defined in this specification) `MediaStreamTrack` to include a `"TTS"` `kind` and `label`, with speech synthesis engine output as the source (as defined in this specification); and 2) programmatically create a virtual input device that `getUserMedia({audio: true})` will recognize, list, and have access to?