A node.js wrapper library around the Google Speech API for automatic speech recognition.
- Unlimited requests to Google automatic speech recognition servers
- Access to timed transcripts
For clips under 60 seconds, usage is simple:
var gspeech = require('gspeech-api');

gspeech.recognize('path/to/my/file', function(err, data) {
  if (err) {
    console.error(err);
    return;
  }
  console.log("Final transcript is:\n" + data.transcript);
});
Google's servers ignore clips over 60 seconds, so longer clips have to be split into pieces before transcription. You can specify exactly how your audio file should be split, or rely on the package defaults, in which case the same code shown above for short clips works unchanged.
Processing speed varies, but in general one hour of audio takes a couple of minutes to process.
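As a quick illustration (the file path below is a placeholder), a long recording is passed exactly like a short one, and you can time the round trip yourself:

var gspeech = require('gspeech-api');

// With no segment options, the package splits the file into pieces itself
// (see the options documented below); the path is a placeholder.
var started = Date.now();
gspeech.recognize('path/to/long-recording', function(err, data) {
  if (err) {
    console.error(err);
    return;
  }
  var seconds = Math.round((Date.now() - started) / 1000);
  console.log('Transcribed in ' + seconds + ' seconds');
  console.log(data.transcript);
});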
gspeech-api relies on fluent-ffmpeg to handle different audio formats, which in turn depends on ffmpeg.
Unfortunately, ffmpeg is a little tricky to install on Ubuntu 12.04 and 14.04. I followed the instructions for installing from source on the ffmpeg Installation page.
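A quick way to confirm that the ffmpeg binary is actually reachable (this uses plain Node and is not part of gspeech-api):

var execFile = require('child_process').execFile;

// Prints the installed ffmpeg version, or an error if ffmpeg is missing.
execFile('ffmpeg', ['-version'], function(err, stdout) {
  if (err) {
    console.error('ffmpeg not found on PATH - install it before using gspeech-api');
    return;
  }
  console.log(stdout.split('\n')[0]);
});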
All other dependencies should be automatically handled by npm.
npm install gspeech-api
This package exposes one main method, `gspeech.recognize(options, callback)`, which takes a file and returns an array of timed captions along with a final transcript. If `options` is passed as a `String`, it is taken as the path to the file. Otherwise, it should be an `Object` with the following attributes:
- `file` - path to the audio file to be transcribed.
- `segments` (optional) - `[start]`: specifies how to divide the audio file into segments before transcription. Each `start` is a `float` giving a track time in seconds. If this argument is not specified, the audio is split into 60-second segments (the maximum length allowed by Google's servers). Any specified segment longer than 60 seconds is itself split into 60-second segments.
- `maxDuration` (optional) - any segment longer than `maxDuration` will be split into segments of `maxDuration` seconds. Defaults to 15 seconds.
- `maxRetries` (optional) - sometimes Google's servers do not respond correctly. If a segment is not processed correctly, it will be sent again up to `maxRetries` more times. Defaults to 1.
- `limitConcurrent` (optional) - caps the number of requests in flight at once. Google's servers communicate through separate GET and POST requests, and sending too many requests at once may cause the two to not line up, resulting in errors. Defaults to 20.
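Putting the attributes together, a call with an explicit options object might look like the sketch below; the values are only illustrative and mirror the defaults listed above:

var gspeech = require('gspeech-api');

gspeech.recognize({
  'file': 'path/to/my/file',   // required: audio file to transcribe
  'segments': [0, 15, 20, 30], // optional: explicit segment start times in seconds
  'maxDuration': 15,           // optional: split longer segments into 15-second pieces
  'maxRetries': 1,             // optional: resend a failed segment once
  'limitConcurrent': 20        // optional: cap on simultaneous requests
}, function(err, data) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(data.transcript);
});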
The callback is a function that is called after all requests to the Google speech servers have completed. It is passed two parameters, `callback(err, data)`:
- `err` - contains an error message if an error occurs.
- `data` - contains an `Object` with the following fields if all requests are successful:
  - `transcript` - `String`: text of the complete transcript.
  - `timedTranscript` - `[{'start': float, 'text': String}]`: unprocessed text of each segment transcribed separately by the Google speech servers. If `options.segments` is specified, each object corresponds to a segment; otherwise, each object reflects how the audio file was split before being transcribed.
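For example, to print each timed segment in `data.timedTranscript`: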
var gspeech = require('gspeech-api');

gspeech.recognize('path/to/my/file', function(err, data) {
  if (err) {
    console.error(err);
    return;
  }
  // Print each timed segment of the transcript
  for (var i = 0; i < data.timedTranscript.length; i++) {
    console.log(data.timedTranscript[i].start + ': '
        + data.timedTranscript[i].text + '\n');
  }
});
If you would like to generate a timed transcript and already know where the fragments start, pass those start times to the library:
var gspeech = require('gspeech-api');

var segTimes = [0, 15, 20, 30];
gspeech.recognize({
  'file': 'path/to/my/file',
  'segments': segTimes
}, function(err, data) {
  if (err) {
    console.error(err);
    return;
  }
  for (var i = 0; i < data.timedTranscript.length; i++) {
    console.log(data.timedTranscript[i].start + ': '
        + data.timedTranscript[i].text + '\n');
  }
});
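The timed transcript can also be written out as a simple caption file. The sketch below is my own post-processing, not part of gspeech-api; `formatTime` and the `[mm:ss]` line format are arbitrary choices:

var fs = require('fs');
var gspeech = require('gspeech-api');

// Hypothetical helper: render a start time in seconds as mm:ss.
function formatTime(seconds) {
  var mins = Math.floor(seconds / 60);
  var secs = Math.floor(seconds % 60);
  return (mins < 10 ? '0' : '') + mins + ':' + (secs < 10 ? '0' : '') + secs;
}

gspeech.recognize({
  'file': 'path/to/my/file',
  'segments': [0, 15, 20, 30]
}, function(err, data) {
  if (err) {
    console.error(err);
    return;
  }
  // One caption line per timed segment, e.g. "[00:15] hello world"
  var captions = data.timedTranscript.map(function(seg) {
    return '[' + formatTime(seg.start) + '] ' + seg.text;
  }).join('\n');
  fs.writeFileSync('captions.txt', captions);
});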
This is not an officially supported Google API, and should only be used for personal purposes. The API is subject to change, and should not be relied upon by any crucial services. I'm also not actively maintaining this repo - for modifications, the best option is probably to make your own fork.