thomasmol/whisper-diarization

⚡️ Blazing fast audio transcription with speaker diarization | Whisper Large V3 Turbo & pyannote 4.0 community-1 | word & sentence level timestamps | prompt

8.4M runs

Run time and cost

This model costs approximately $0.046 to run on Replicate, or 21 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 48 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Transcribe any audio file with speaker diarization

Uses Whisper Large V3 Turbo + Pyannote Speaker Diarization Community-1

Create transcripts with speaker labels, timestamps, and word-level timing. Uses faster-whisper 1.2.1 and pyannote.audio 4.0.4 with pyannote/speaker-diarization-community-1 under the hood.

Last update: 10 June 2026

Now uses Pyannote Speaker Diarization Community-1, updated Faster Whisper, newer dependencies, local model loading, and improved audio handling.

Usage

Input

  • file_string: str: Base64 encoded audio file.
  • file_url: str: Direct audio file URL.
  • file: Path: Audio file upload.
  • num_speakers: int: Number of speakers. Leave empty to auto-detect. If set, must be between 1 and 50.
  • translate: bool: Translate speech into English.
  • language: str: Language of spoken words as a language code like en. Leave empty to auto-detect.
  • prompt: str: Vocabulary hint. Provide names, acronyms, and loanwords that occur in the audio. Use punctuation for best accuracy.

Provide exactly one of file_string, file_url, or file.
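
The input rules above can be checked on the client side before submitting a prediction. The following is a minimal sketch, not part of this model's code: build_input is a hypothetical helper, and the choice to omit unset optional fields (so the model's own defaults apply) is an assumption.

```python
from typing import Any, Dict, Optional


def build_input(
    file_string: Optional[str] = None,
    file_url: Optional[str] = None,
    file: Optional[Any] = None,
    num_speakers: Optional[int] = None,
    translate: Optional[bool] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate and assemble the model input dict."""
    # Exactly one audio source must be provided.
    sources = {"file_string": file_string, "file_url": file_url, "file": file}
    provided = {name: value for name, value in sources.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("Provide exactly one of file_string, file_url, or file.")

    # num_speakers is optional, but must be within 1..50 when given.
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50.")

    payload: Dict[str, Any] = dict(provided)
    optional = {
        "num_speakers": num_speakers,
        "translate": translate,
        "language": language,
        "prompt": prompt,
    }
    # Assumption: unset options are left out so the model's defaults apply.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


if __name__ == "__main__":
    # The URL is a placeholder, not a real audio file.
    print(build_input(file_url="https://example.com/meeting.mp3", num_speakers=2))
```

The resulting dict can then be passed as the input to whichever client you use to run the model.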

Output

  • segments: List[Dict]: Transcript segments with speaker, text, start time, end time, duration, and word-level details.
      • Includes avg_logprob for each segment.
      • Includes probability, timestamps, and speaker labels for each word-level segment.
  • num_speakers: int: Number of speakers detected, or the value specified in the input.
  • language: str: Spoken language detected, or the value specified in the input.
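
To illustrate how the output might be consumed, here is a minimal sketch that turns segments into readable transcript lines. The exact key names are not listed above, so "speaker", "text", "start", and "end" are assumptions; format_transcript and format_timestamp are hypothetical helpers, and the sample data is made up.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_transcript(output: Dict[str, Any]) -> List[str]:
    """Build one '[start-end] speaker: text' line per segment.

    Assumes each segment dict has 'speaker', 'text', 'start', and 'end'
    keys; check the actual output of a prediction before relying on this.
    """
    lines = []
    for segment in output["segments"]:
        start = format_timestamp(float(segment["start"]))
        end = format_timestamp(float(segment["end"]))
        text = segment["text"].strip()
        lines.append(f"[{start}-{end}] {segment['speaker']}: {text}")
    return lines


if __name__ == "__main__":
    # Made-up sample shaped like the output described above.
    sample = {
        "segments": [
            {"speaker": "SPEAKER_00", "text": " Hello there.", "start": 0.5, "end": 2.0},
            {"speaker": "SPEAKER_01", "text": "Hi.", "start": 62.3, "end": 63.0},
        ],
        "num_speakers": 2,
        "language": "en",
    }
    for line in format_transcript(sample):
        print(line)
```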

Made possible by

Speed

On an L40S GPU, transcribing and diarizing a 25-minute MP3 of two people speaking English takes under 1 minute.

About

Contact me if you’d like a demo or want to know more:
