{"id":10750,"date":"2023-09-29T18:00:00","date_gmt":"2023-09-29T18:00:00","guid":{"rendered":"https:\/\/businessyield.com\/tech\/?p=10750"},"modified":"2023-09-27T20:33:55","modified_gmt":"2023-09-27T20:33:55","slug":"openai-whisper-how-does-openai-whisper-work","status":"publish","type":"post","link":"https:\/\/businessyield.com\/tech\/technology\/openai-whisper-how-does-openai-whisper-work\/","title":{"rendered":"OpenAI Whisper: How Does OpenAI Whisper Work?","gt_translate_keys":[{"key":"rendered","format":"text"}]},"content":{"rendered":"\n

OpenAI recently launched the Whisper API, a hosted version of its open-source Whisper speech-to-text model, to coincide with the release of the ChatGPT API.

Priced at $0.006 per minute, Whisper is an automatic speech recognition system that OpenAI claims enables “robust” transcription in multiple languages as well as translation from those languages into English. It takes files in a variety of formats, including M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM.

Countless organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon, and Meta. However, what sets Whisper apart is that it was trained on 680,000 hours of multilingual and “multitask” data collected from the web.

This leads to improved recognition of unique accents, background noise, and technical jargon.

Overview of OpenAI Whisper

Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web. According to OpenAI, the model is robust to accents, background noise, and technical language. In addition, it supports transcription in 99 different languages and translation from those languages into English.

Whisper comes in five model sizes (see the table referenced below, available on OpenAI’s GitHub page). Four of the sizes also have English-only versions, denoted with a .en suffix. According to OpenAI, the English-only models tend to perform better at the tiny.en and base.en sizes, but the differences become less significant for the small.en and medium.en models.

\"\"
Ref: OpenAI\u2019s GitHHub Page<\/a><\/figcaption><\/figure>\n\n\n\n
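For local experiments, the open-source whisper Python package lets you pick any of these checkpoints by name. A minimal sketch, assuming the package is installed and the weights download on first use:

```python
# pip install -U openai-whisper
import whisper

# English-only checkpoint (".en" suffix), often the better choice
# at the tiny/base sizes for English audio
model_en = whisper.load_model("base.en")

# Multilingual checkpoint of the same size
model_multi = whisper.load_model("base")

print(model_en.is_multilingual)     # False
print(model_multi.is_multilingual)  # True
```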

The Whisper models are trained for speech recognition and translation tasks: they can transcribe speech audio into text in the language it is spoken (ASR) as well as translate it into English (speech translation). Whisper is an encoder-decoder model, trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

Transcription is the process of converting spoken language into text. In the past it was done manually; now there are AI-powered tools like Whisper that can accurately understand spoken language. With a basic knowledge of Python, you can integrate the OpenAI Whisper API into your application.

The Whisper API is part of openai/openai-python, which allows you to access various OpenAI services and models.
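As a minimal sketch of such an integration (assuming the pre-1.0 openai package, an OPENAI_API_KEY environment variable, and a hypothetical local file named meeting.m4a), a transcription call looks roughly like this:

```python
# pip install "openai<1.0"
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Any supported format works (M4A, MP3, MP4, MPEG, MPGA, WAV, WEBM)
with open("meeting.m4a", "rb") as audio_file:
    # "whisper-1" is the hosted Whisper model behind the speech-to-text API
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])
```

At the quoted price of $0.006 per minute, a one-hour recording would cost roughly $0.36 to transcribe.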

What are good use cases for transcription?
1. Transcribing interviews, meetings, lectures, and podcasts for analysis, easy access, and record keeping.
2. Real-time speech transcription for subtitles (YouTube), captioning (Zoom meetings), and translation of spoken language.
3. Speech transcription for personal and professional use: transcribing voice notes, messages, reminders, memos, and feedback.
4. Transcription for people with hearing impairments.
5. Transcription for voice-based applications that require text input, such as chatbots, voice assistants, and language translation (see the sketch after this list).
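As an illustration of that last use case, here is a rough sketch that chains two APIs: Whisper turns a spoken request into text, and a chat model answers it. The file name and prompt are made up, and the pre-1.0 openai package is assumed:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# 1. Turn the spoken request into text with Whisper
with open("voice_note.mp3", "rb") as audio_file:
    text = openai.Audio.transcribe("whisper-1", audio_file)["text"]

# 2. Hand the text to a chat model, as a voice assistant would
reply = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": text},
    ],
)

print(reply["choices"][0]["message"]["content"])
```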

How does Whisper work?

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
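The open-source whisper package exposes this preprocessing step directly. A small sketch (the file name speech.wav is just an example):

```python
import whisper

model = whisper.load_model("base")

# Load the recording and pad or trim it to a single 30-second chunk
audio = whisper.load_audio("speech.wav")
audio = whisper.pad_or_trim(audio)

# Convert the chunk into a log-Mel spectrogram, the encoder's input
mel = whisper.log_mel_spectrogram(audio).to(model.device)
print(mel.shape)  # 80 Mel bands x 3000 frames for 30 seconds of audio
```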

Put more simply, OpenAI Whisper is built on the transformer architecture, stacking encoder blocks and decoder blocks, with the attention mechanism propagating information between them.

Whisper takes the audio recording, splits it into 30-second chunks, and processes them one by one. For each 30-second chunk, the encoder produces a representation of the audio that preserves where in the clip each sound occurs. The decoder then uses this encoded information to work out what was said.

From this information, the decoder predicts what we call tokens, which roughly correspond to the words that were said. It then repeats the process for the next word, using the same encoded information plus the previously predicted words, which helps it guess the next token that makes the most sense.
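The same package makes this decoding step visible. A self-contained sketch (again with a hypothetical speech.wav), in which whisper.decode runs the autoregressive token prediction described above:

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Identify the spoken language from the encoded spectrogram
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# The decoder predicts tokens one at a time, each conditioned on the
# audio encoding and on the tokens it has already produced
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```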

OpenAI trained Whisper’s audio model in a similar way to GPT-3, on data available on the internet. This makes it a large and general audio model, and it also makes the model far more robust than others. In fact, according to OpenAI, Whisper approaches human-level robustness because it was trained on such a diverse set of data, ranging from clips and TED talks to podcasts, interviews, and more.

All of this represents real-world-like data, some of it transcribed by machine-learning models rather than by humans.

How to use OpenAI Whisper

The speech-to-text API provides two endpoints, transcriptions and translations, based on OpenAI’s state-of-the-art open-source large-v2 Whisper model. They can be used to: