[TOC]
- Adding a subtitle for a video automatically
- Extract sound from video
- Input: video
- Format: any video format allowed by FFmpge
- Size: no longer than 1min, if it is longer than 1min, we may upload output audio into Google Cloud Storage, and use
gcs_uri
in convert speech step with Google Cloud Speech-To-Text API
- Output: audio
- Format:
- Ecoding: FLAC
- sample_rate_hertz=16000 or more
- Language=Mandarin
- Format:
- Input: video
- Convert speech into text
- Input: audio
- Format: above output audio format
- Output: text
- Format: Plain text and
str
file
- Format: Plain text and
- Input: audio
- Generate subtitle using above source
.str
file is the target subtitle file
- Extract sound from video
1. Extract sound from video using FFmpeg
-
Install:
-
$ brew install ffmpeg
-
Usage:
ffmpeg -i video.mp4 -f mp3 -ab 192000 -vn audio.mp3
-i
input file-f
convert to format, flac is recommanded by Google Cloud Speech-To-Tex API-ar
conver to sampleRateHertz, 16000 is recommanded by Google Cloud Speech-To-Tex API-vn
output file is not video-ac
1 only 1 channel audio would be allowed by Cloud Speech
-
Allow format
ffmpeg -formats
D 3dostr 3DO STR E 3g2 3GP2 (3GPP2 file format) E 3gp 3GP (3GPP file format) D 4xm 4X Technologies E a64 a64 - video for Commodore 64 D aa Audible AA format files D aac raw ADTS AAC (Advanced Audio Coding) DE ac3 raw AC-3 D acm Interplay ACM ...
2. Convert audio into text using Google Cloud Speech-To-Text API
-
Install: follow official website Set up a GCP Console projec, Set the environment variable GOOGLE_APPLICATION_CREDENTIALS and install and initialize Google Cloud SDK
Install the client library (for python):
pip3 install --upgrade google-cloud-speech
-
In this project, our test video is longer than 1 min, we need the asynchronously transcribes and use
gcs_uri
for speech file -
we refer the following code from official website
-
# [START speech_transcribe_async_gcs] def transcribe_gcs(gcs_uri): """Asynchronously transcribes the audio file specified by the gcs_uri.""" from google.cloud import speech from google.cloud.speech import enums from google.cloud.speech import types client = speech.SpeechClient() audio = types.RecognitionAudio(uri=gcs_uri) config = types.RecognitionConfig( encoding=enums.RecognitionConfig.AudioEncoding.FLAC, sample_rate_hertz=16000, language_code='en-US') operation = client.long_running_recognize(config, audio) print('Waiting for operation to complete...') response = operation.result(timeout=90) # Each result is for a consecutive portion of the audio. Iterate through # them to get the transcripts for the entire audio file. for result in response.results: # The first alternative is the most likely one for this portion. print(u'Transcript: {}'.format(result.alternatives[0].transcript)) print('Confidence: {}'.format(result.alternatives[0].confidence)) # [END speech_transcribe_async_gcs]
-
In config,
-
We use as following:
- sample_rate_hertz=16000
- language_code='zh'
- encoding='FLAC'
- this config would set several phrases for specific vedio "Savvy _June Cut_final.mp4"
- enable_word_time_offsets=True, which include timestamp used for generate subtitle
- enable_automatic_punctuation=True, which include punctuation in transcript field
-
config = types.RecognitionConfig( encoding=enums.RecognitionConfig.AudioEncoding.FLAC, sample_rate_hertz=16000, language_code='zh', speech_contexts=[ speech.types.SpeechContext(phrases=[ '思睿', '在思睿', '海外教育', '双师', '辅导', '授课', '云台录播', '讲义', '赢取' ]) ], enable_word_time_offsets=True, enable_automatic_punctuation=True)
-
python3 extract-audio.py Savvy\ _June\ Cut_final.mp4
-
This script would recogonize whether the file exists, whether the format allowed by
ffmpeg
-
Furthermore, this script can handle the input name with whitespace and output coutain original file name
-
The output audio is obtain each config which needed by Google Cloud Speech-To-Text API(see: Component2)
-
The output file name would be
audio-inputFileName.flac
, which would also be upload intogs://test-convert-audio/audio-inputFileName.flac
(used in convert step)
For shorter audio (no longer 1 min) using synchronous speech recognition.
python3 audio-to-text.py localFile.flac
For longer audio (longer than 1 min) using asynchronous speech recognition.
python3 audio-to-text.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
Note: if filename with whitespace plase use ""
This script would send a recognize request to Cloud Speech-to-Text and obtain the response, and we can also write into a plain text in output folder ( transcript-text.txt
)
operation = client.long_running_recognize(config, audio)
response = operation.result(timeout=90)
This script add sequence number for each line and timpesamp for each line (e.g 0:00:1.012 —> 0: 00: 3.211
) and the words, which format is need by .str
file
helper modual timestr.py
would help us convert the start_time
or end_time
as allowed string, start_time
or end_time
both are from response
information
Because the response.result.alternatives[0].word
only contain word information, so reading output file transcript-text.txt
, which including punctuation, and add punctuation into subtitle file or leave white space at punctuation position.
-
There provide two version audio-to-text.py available
-
Subtitle with punctuation
audio-to-text-with-punctuation.py
:-
python3 audio-to-text-with-punctuation.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
-
-
Subtitle no punctuation
audio-to-text-no-punctuation.py
-
python3 audio-to-text-no-punctuation.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
-
-