- Adding a subtitle for a video automatically
  - Extract sound from video
    - Input: video
      - Format: any video format allowed by FFmpeg
      - Size: no longer than 1 min; if it is longer than 1 min, we may upload the output audio to Google Cloud Storage and use it in the convert speech step with the Google Cloud Speech-To-Text API
    - Output: audio
      - Format:
        - Encoding: FLAC
        - sample_rate_hertz=16000 or more
        - Language=Mandarin
  - Convert speech into text
    - Input: audio
      - Format: above output audio format
    - Output: text
      - Format: Plain text and
  - Generate subtitle using above source
    file is the target subtitle file
1. Extract sound from video using FFmpeg
$ brew install ffmpeg
ffmpeg -i video.mp4 -f mp3 -ab 192000 -vn audio.mp3
Options:
-i: input file
-f: output format; flac is recommended by the Google Cloud Speech-To-Text API
-ar: audio sample rate (sampleRateHertz); 16000 is recommended by the Google Cloud Speech-To-Text API
-vn: the output file contains no video
-ac 1: number of audio channels; only 1-channel audio is allowed by Cloud Speech
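The options above can be combined into a single FFmpeg call. The following is a minimal sketch, assuming FFmpeg is installed and on the PATH; the function names `build_ffmpeg_command` and `extract_audio` are hypothetical and are not taken from the project's scripts.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate=16000):
    """Build an FFmpeg argument list that extracts mono FLAC audio."""
    return [
        "ffmpeg",
        "-i", video_path,          # input file
        "-f", "flac",              # output format
        "-ar", str(sample_rate),   # sample rate in Hz
        "-ac", "1",                # one audio channel
        "-vn",                     # no video in the output
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    # Passing a list (no shell) keeps file names with spaces intact.
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

Building the command as a list rather than a shell string means a name such as "Savvy _June Cut_final.mp4" needs no escaping.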
Allowed formats
ffmpeg -formats
 D  3dostr          3DO STR
  E 3g2             3GP2 (3GPP2 file format)
  E 3gp             3GP (3GPP file format)
 D  4xm             4X Technologies
  E a64             a64 - video for Commodore 64
 D  aa              Audible AA format files
 D  aac             raw ADTS AAC (Advanced Audio Coding)
 DE ac3             raw AC-3
 D  acm             Interplay ACM
 ...
2. Convert audio into text using Google Cloud Speech-To-Text API
Install: follow the official website to set up a GCP Console project, set the environment variable GOOGLE_APPLICATION_CREDENTIALS, and install and initialize the Google Cloud SDK.
Install the client library (for python):
pip3 install --upgrade google-cloud-speech
In this project, our test video is longer than 1 min, so we need asynchronous transcription and use a Google Cloud Storage URI for the speech file.
We refer to the following code from the official website:
# [START speech_transcribe_async_gcs]
def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types

    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US')

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u'Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))
# [END speech_transcribe_async_gcs]
In the config, we use the following settings:
- sample_rate_hertz=16000
- language_code='zh'
- encoding='FLAC'
- speech_contexts, which sets several phrase hints for the specific video "Savvy _June Cut_final.mp4"
- enable_word_time_offsets=True, which includes the timestamps used to generate the subtitle
- enable_automatic_punctuation=True, which includes punctuation in the transcript field
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code='zh',
    speech_contexts=[
        speech.types.SpeechContext(phrases=[
            '思睿', '在思睿', '海外教育', '双师', '辅导',
            '授课', '云台录播', '讲义', '赢取'
        ])
    ],
    enable_word_time_offsets=True,
    enable_automatic_punctuation=True)
python3 extract-audio.py Savvy\ _June\ Cut_final.mp4
This script checks whether the file exists and whether its format is allowed by FFmpeg.
Furthermore, the script can handle input names containing whitespace, and the output name contains the original file name.
The output audio is produced with each setting required by the Google Cloud Speech-To-Text API (see: Component2).
The output file is also uploaded to gs://test-convert-audio/audio-inputFileName.flac (used in the convert step).
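The naming rule described above can be sketched as follows. This is only an illustration: the helper names `audio_name_for`, `gcs_uri_for` and `check_input` are hypothetical, and the `audio-<name>.flac` pattern is inferred from the Cloud Storage URIs shown in this README rather than from the script itself.

```python
import os

# Bucket name taken from the gs:// URI used in this README.
BUCKET = "test-convert-audio"


def audio_name_for(video_path):
    """Return the FLAC file name derived from the video file name."""
    stem = os.path.splitext(os.path.basename(video_path))[0]
    return "audio-{}.flac".format(stem)


def gcs_uri_for(video_path):
    """Return the Cloud Storage URI the audio would be uploaded to."""
    return "gs://{}/{}".format(BUCKET, audio_name_for(video_path))


def check_input(video_path):
    """Raise FileNotFoundError if the input video does not exist."""
    if not os.path.isfile(video_path):
        raise FileNotFoundError(video_path)
```

Whitespace in the input name is preserved, so the resulting URI has to be quoted when it is passed on the command line.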
For shorter audio (no longer than 1 min), use synchronous speech recognition:
python3 audio-to-text.py localFile.flac
For longer audio (longer than 1 min), use asynchronous speech recognition:
python3 audio-to-text.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
Note: if the file name contains whitespace, please wrap it in double quotes ("").
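One way a single script could choose between the two modes is to look at its argument. The sketch below assumes the choice is made from the `gs://` prefix; that rule and the name `recognition_mode` are assumptions for illustration, not a description of how audio-to-text.py actually decides.

```python
def recognition_mode(audio_path):
    """Return 'async' for Cloud Storage URIs and 'sync' for local files."""
    if audio_path.startswith("gs://"):
        # Long audio stored in Cloud Storage: client.long_running_recognize
        return "async"
    # Short local audio: client.recognize
    return "sync"
```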
This script sends a recognize request to Cloud Speech-to-Text and obtains the response; it can also write the transcript to a plain-text file in the output folder (transcript-text.txt).
operation = client.long_running_recognize(config, audio)
response = operation.result(timeout=90)
This script adds a sequence number and a timestamp to each line (e.g. 0:00:1.012 --> 0:00:3.211), followed by the words, which is the format required by .srt files.
The helper module timestr.py converts start_time or end_time into an allowed string; both start_time and end_time come from the response.
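As an illustration of the timestamp conversion and the numbered-line layout, here is a minimal sketch. It is not the contents of timestr.py: the function names are hypothetical, it assumes each offset is available as whole seconds plus nanoseconds, and it writes the common SubRip form `HH:MM:SS,mmm`, which is stricter than the example shown above.

```python
def format_timestamp(seconds, nanos=0):
    """Format an offset (integer seconds and nanoseconds) as HH:MM:SS,mmm."""
    total_ms = seconds * 1000 + nanos // 1_000_000
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, secs, millis)


def format_cue(index, start, end, text):
    """Render one numbered subtitle entry.

    start and end are (seconds, nanos) pairs.
    """
    return "{}\n{} --> {}\n{}\n".format(
        index, format_timestamp(*start), format_timestamp(*end), text)
```

Entries produced by `format_cue` would be written one after another, separated by a blank line, to form the subtitle file.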
Because result.alternatives[0].words contains only word information, the script reads the output file transcript-text.txt, which includes punctuation, and either adds the punctuation to the subtitle file or leaves white space at the punctuation positions.
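One possible way to line up the timed words with the punctuated transcript is sketched below. It is an assumption about the approach, not the project's code: the name `group_by_punctuation` is hypothetical, each subtitle line is ended at a punctuation mark, words are joined without spaces (reasonable for Mandarin), and the no-punctuation variant simply omits the mark.

```python
PUNCTUATION = set("，。？！、；：,.?!;:")


def group_by_punctuation(transcript, words, keep_punctuation=True):
    """Group timed words into subtitle lines, breaking at punctuation.

    transcript: punctuated text, e.g. read from transcript-text.txt
    words: list of (word, start, end) in spoken order, without punctuation
    Returns a list of (start, end, text) tuples.
    """
    lines = []
    current = []        # words collected for the line in progress
    line_start = None   # start time of the line in progress
    pos = 0             # search position inside the transcript
    for word, start, end in words:
        found = transcript.find(word, pos)
        if found != -1:
            pos = found + len(word)
        if line_start is None:
            line_start = start
        current.append(word)
        # Character right after the matched word, if there is one.
        mark = transcript[pos] if found != -1 and pos < len(transcript) else ""
        if mark in PUNCTUATION:
            text = "".join(current) + (mark if keep_punctuation else "")
            lines.append((line_start, end, text))
            current, line_start = [], None
    if current:
        # Trailing words that were not followed by punctuation.
        lines.append((line_start, words[-1][2], "".join(current)))
    return lines
```

For example, `group_by_punctuation("你好，世界。", [("你好", 0, 1), ("世界", 1, 2)])` yields one line per clause, each carrying the start time of its first word and the end time of its last word.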
Two versions of audio-to-text.py are available:
Subtitle with punctuation
python3 audio-to-text-with-punctuation.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
Subtitle without punctuation
python3 audio-to-text-no-punctuation.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"