
Transcribe any audio file with speaker diarization

Uses Whisper Large V3 Turbo + Pyannote Speaker Diarization Community-1

Creates transcripts with speaker labels, timestamps, and word-level timing. Runs on faster-whisper 1.2.1 and pyannote.audio 4.0.4, with pyannote/speaker-diarization-community-1 as the diarization model.

Last update: 10 June 2026

Now uses Pyannote Speaker Diarization Community-1, updated Faster Whisper, newer dependencies, local model loading, and improved audio handling.

Input

file_string: str: Base64-encoded audio file.
file_url: str: Direct URL to an audio file.
file: Path: Audio file upload.
num_speakers: int: Number of speakers. Leave empty to auto-detect. Must be between 1 and 50.
translate: bool: Translate speech into English.
language: str: Spoken language as a language code such as en. Leave empty to auto-detect.
prompt: str: Vocabulary hints: provide names, acronyms, and loanwords. Use punctuation for best accuracy.

Provide exactly one of file_string, file_url, or file.
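
The input rules above can be checked before a request is sent. The following is a minimal sketch, not part of the model itself: build_input and encode_audio are hypothetical helper names, and the sketch only enforces the constraints stated in this README (exactly one audio source, num_speakers between 1 and 50). Optional fields that are left empty are omitted so that the model's own defaults and auto-detection apply.

```python
import base64
from typing import Any, Dict, Optional


def encode_audio(data: bytes) -> str:
    """Base64-encode raw audio bytes for the file_string input."""
    return base64.b64encode(data).decode("ascii")


def build_input(
    file_string: Optional[str] = None,
    file_url: Optional[str] = None,
    file: Optional[str] = None,
    num_speakers: Optional[int] = None,
    translate: Optional[bool] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate and assemble the input fields."""
    sources = {"file_string": file_string, "file_url": file_url, "file": file}
    provided = {name: value for name, value in sources.items() if value}
    # The README requires exactly one of the three audio sources.
    if len(provided) != 1:
        raise ValueError("Provide exactly one of file_string, file_url, or file.")
    # The README limits num_speakers to the range 1..50.
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50.")

    payload: Dict[str, Any] = dict(provided)
    optional = {
        "num_speakers": num_speakers,
        "translate": translate,
        "language": language,
        "prompt": prompt,
    }
    # Omit empty fields so the model falls back to auto-detection.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

For example, build_input(file_url="https://example.com/meeting.mp3", num_speakers=2) returns a dict with only those two keys, while passing both file_url and file_string raises ValueError. The URL here is a placeholder.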

Output

segments: List[Dict]: Transcript segments with speaker, text, start time, end time, duration, and word-level details. Each segment includes avg_logprob. Each word includes probability, timestamps, and a speaker label.
num_speakers: int: Number of speakers detected, unless specified in input.
language: str: Spoken language detected, unless specified in input.
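
As an illustration of working with the output, the sketch below turns segments into a readable transcript and lists low-confidence words. The key names used here (speaker, text, start, end, words, word, probability) are assumptions based on the field descriptions above, not confirmed names. Check them against a real response before relying on this.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments: List[Dict[str, Any]]) -> str:
    """One line per segment: [start - end] speaker: text."""
    lines = []
    for seg in segments:
        # Assumed keys: "start", "end", "speaker", "text".
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['speaker']}: {seg['text'].strip()}")
    return "\n".join(lines)


def low_confidence_words(
    segments: List[Dict[str, Any]], threshold: float = 0.5
) -> List[str]:
    """Collect words whose probability falls below the threshold."""
    # Assumed keys: "words" on each segment, "word" and "probability" on each word.
    return [
        word["word"]
        for seg in segments
        for word in seg.get("words", [])
        if word["probability"] < threshold
    ]
```

The threshold of 0.5 is an arbitrary starting point for review, not a value recommended by the model.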

On an L40S GPU, transcribing and diarizing a 25-minute MP3 of 2 people speaking English takes under 1 minute.

Contact me if you’d like a demo or want to know more:
