This tool transcribes audio to text using OpenAI's Speech to text API, and post process it using Text generation API.
Before you begin, ensure you have met the following requirements:
- You have
ffmpeg
installed. Please refer to Getting ffmpeg set up - You have a valid OpenAI API key and configurated.
- You have Python 3.9 or higher installed on your machine.
-
Clone the repository:
git clone [email protected]:actuallyyun/JoyScribe.git
-
Navigate to the project directory:
cd JoyScribe
-
Install the required dependencies. You can do this by running:
pip install -r requirements.txt
-
Open a terminal and navigate to the project directory.
-
Run the script with the following command:
python transcribe.py --file <path_to_audio_file> --output <output_directory>
Replace
<path_to_audio_file>
with the path to your audio file and<output_directory>
with the directory where you want to save the output files. -
Wait for the the program to finish. You can follow the terminal to see the progress.
To customize the prompt in the post_processing.py
file, you can modify the system_prompt
variable.
- Test it with short audio
By default, the Whisper API only supports files that are less than 25 MB.
This is a limitation I have to address since my audio files are bigger than 25 MB.
But first, I want to make sure my setup with OpenAI works.
So I tested it with a shorter audio, run the script, waited for a few seconds, and woala, it worked.
- Research solutions for longer audio inputs
OpenAI's documentation recommened the to use the PyDub open source Python package to split the audio
But this packages has some dependencies, and one of which is ffmpeg
. The offical ffmpeg website is a bit confusing and only provides download option.
I use Mac OS and perfer install packages with brew
. And indeed, brew
has this package.
- The logic is straightforward: cut the audio into smaller chunks, save it to a directory and pass the segement one by one to
whisper
.
Up until this point,it does the job transcribing it. However, the response is not easy to read. It does not include puncutations and does not have formatting.
Using gpt
to post process the transcript is a common practice.
To simplify things, I decided to use one assistant for the job. It does the following:
- Format the text
- Convert traditional Chinese characters to simplified Chinese
- Format English with Chinese translations included in parenthese
- Extract subheading every 4 paragraphs
-
Keep fine tuning the assistant to be more useful.
-
Clean the audio before segementing. For example, trim leading silence in audio, this increases Wisper's transcribing performance.
For any inquiries or feedback, please reach out to:
- Name: Yun Ji
- Email: [email protected]
- LinkedIn: your-linkedin-profile
Feel free to connect with me!