Readme
Transcribe any audio file with speaker diarization
Uses Whisper Large V3 Turbo + Pyannote Speaker Diarization Community-1
Create transcripts with speaker labels, timestamps, and word-level timing. Uses faster-whisper 1.2.1 and pyannote.audio 4.0.4 with pyannote/speaker-diarization-community-1 under the hood.
Last update: 10 June 2026
Now uses Pyannote Speaker Diarization Community-1, updated Faster Whisper, newer dependencies, local model loading, and improved audio handling.
Usage
Input
- file_string: str: Base64-encoded audio file.
- file_url: str: Direct audio file URL.
- file: Path: Audio file upload.
- num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
- translate: bool: Translate speech into English.
- language: str: Language of spoken words as a language code like en. Leave empty to auto-detect.
- prompt: str: Vocabulary: provide names, acronyms, and loanwords. Use punctuation for best accuracy.
Provide exactly one of file_string, file_url, or file.
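As a rough illustration of the input rules above, the following sketch builds an input dictionary and checks it on the client side before it is sent. The helper names `build_input` and `encode_audio` are hypothetical and not part of this model; only the parameter names and the constraints (exactly one audio source, num_speakers between 1 and 50) come from the list above. The model itself may validate inputs differently.

```python
import base64
from typing import Any, Dict, Optional


def encode_audio(raw: bytes) -> str:
    """Base64-encode raw audio bytes for the file_string input."""
    return base64.b64encode(raw).decode("ascii")


def build_input(
    file_string: Optional[str] = None,
    file_url: Optional[str] = None,
    file: Optional[Any] = None,  # path or file handle, depending on your client
    num_speakers: Optional[int] = None,
    translate: bool = False,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble and sanity-check the model input."""
    sources = {"file_string": file_string, "file_url": file_url, "file": file}
    provided = {key: value for key, value in sources.items() if value}
    if len(provided) != 1:
        raise ValueError("Provide exactly one of file_string, file_url, or file.")
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50.")

    payload: Dict[str, Any] = dict(provided)
    payload["translate"] = translate
    # Optional fields are left out entirely so the model can auto-detect them.
    optional = {"num_speakers": num_speakers, "language": language, "prompt": prompt}
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

The resulting dictionary can then be handed to whichever client you use to call the model. Omitting num_speakers and language, rather than sending empty values, matches the "leave empty to autodetect" behavior described above.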
Output
- segments: List[Dict]: Transcript segments with speaker, text, start time, end time, duration, and word-level details.
  - Includes avg_logprob for each segment.
  - Includes probability, timestamps, and speaker labels for each word-level segment.
- num_speakers: int: Number of speakers detected, unless specified in input.
- language: str: Spoken language detected, unless specified in input.
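To show how the output might be consumed, here is a minimal sketch that turns segments into a readable transcript and lists low-confidence words. The dictionary keys used here (`speaker`, `text`, `start`, `end`, `words`, `word`, `probability`) are assumptions based on the field descriptions above; check an actual response for the exact key names before relying on them.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def render_transcript(segments: List[Dict[str, Any]]) -> str:
    """Build one line per segment: [start - end] speaker: text."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])  # assumed key name
        end = format_timestamp(seg["end"])      # assumed key name
        lines.append(f"[{start} - {end}] {seg['speaker']}: {seg['text'].strip()}")
    return "\n".join(lines)


def low_confidence_words(
    segments: List[Dict[str, Any]], threshold: float = 0.5
) -> List[str]:
    """Collect words whose probability falls below the threshold."""
    return [
        word["word"]
        for seg in segments
        for word in seg.get("words", [])  # assumed key name
        if word["probability"] < threshold
    ]
```

The 0.5 threshold is arbitrary and only meant as a starting point for flagging words worth a manual review.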
Made possible by
Speed
On an L40S GPU, it takes less than 1 minute to transcribe and diarize a 25-minute MP3 with 2 people speaking English.
About
Contact me if you’d like a demo or want to know more:
- thomasmol.com
- X/Twitter: x.com/thomas_mol
- LinkedIn: linkedin.com/in/thomas-mol