Cohere Unveils a New State-of-the-Art in Open Source Speech Recognition - Transcribe

Karan Bhatia
Mar 27

Cohere has announced Transcribe, a state-of-the-art automatic speech recognition (ASR) model that is open source and available for download. The company, led by Aidan Gomez, Nick Frosst, Ivan Zhang, and the team, builds foundational models and AI solutions that help teams turn everyday effort into extraordinary impact.

Speech is rapidly emerging as a core modality for AI-driven workflows, from transcription and analytics to real-time support agents.

Cohere developed Cohere Transcribe with a clear goal: maximize real-world ASR accuracy while remaining production-ready. Trained from scratch with a focus on minimizing word error rate (WER), the model balances research-grade performance with practical deployment.
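For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Cohere's or the leaderboard's evaluation code. The function name is hypothetical, and the sketch splits on whitespace without the text normalization (casing, punctuation) that real evaluations typically apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the transcript contains many extra words.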

Cohere Transcribe is open source, giving teams full control over their infrastructure. It is optimized for efficient inference on GPUs and local systems, and it is also available via Model Vault for secure, managed deployment. It currently ranks #1 on Hugging Face’s Open ASR Leaderboard, setting a new benchmark for real-world transcription performance.

This marks a key step in bringing high-performance speech recognition into enterprise AI workflows.

Model Overview — Cohere Transcribe

Name: cohere-transcribe-03-2026
Architecture: Conformer-based encoder–decoder
Input: Audio waveform → log-Mel spectrogram
Output: Transcribed text
Model Size: 2B parameters
Model Design: Large Conformer encoder for acoustic feature extraction, paired with a lightweight Transformer decoder for token generation
Training Objective: Supervised cross-entropy on output tokens (trained from scratch)
Languages (14):
- European: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish
- APAC: Chinese, Japanese, Korean, Vietnamese
- MENA: Arabic
License: Apache 2.0

Model Performance — Accuracy

Cohere Transcribe sets a new benchmark in English speech recognition, ranking #1 on the Hugging Face Open ASR Leaderboard with an average word error rate (WER) of 5.42%.

It outperforms leading models such as Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B, demonstrating strong real-world performance across:

Multi-speaker environments
Boardroom-style acoustics (AMI)
Diverse accents (VoxPopuli)

Top Models by Average WER (%, lower is better):

Cohere Transcribe: 5.42
Zoom Scribe v1: 5.47
IBM Granite 4.0 1B Speech: 5.52
NVIDIA Canary Qwen 2.5B: 5.63
Qwen3-ASR-1.7B: 5.76
ElevenLabs Scribe v2: 5.83
Whisper Large v3: 7.44

This performance highlights its versatility and reliability across varied, real-world speech scenarios, setting a new standard for production-grade ASR systems.
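To put the leaderboard averages in perspective, it can help to express each gap as a relative WER reduction rather than an absolute difference. The snippet below is a small illustrative calculation using only the averages listed in this article. The helper name is hypothetical, and the per-dataset scores behind each average are not reproduced here.

```python
# Average WER (%) as listed in the article; lower is better.
AVERAGE_WER = {
    "Cohere Transcribe": 5.42,
    "Zoom Scribe v1": 5.47,
    "IBM Granite 4.0 1B Speech": 5.52,
    "NVIDIA Canary Qwen 2.5B": 5.63,
    "Qwen3-ASR-1.7B": 5.76,
    "ElevenLabs Scribe v2": 5.83,
    "Whisper Large v3": 7.44,
}


def relative_wer_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` lowers WER relative to `baseline`."""
    return (baseline - candidate) / baseline * 100.0


best = AVERAGE_WER["Cohere Transcribe"]
for name, wer in AVERAGE_WER.items():
    if name != "Cohere Transcribe":
        reduction = relative_wer_reduction(wer, best)
        print(f"{name}: {reduction:.1f}% relative WER reduction")
```

Against Whisper Large v3, for example, the gap is (7.44 - 5.42) / 7.44, or roughly 27% relative, while the gap to the second-place model is under 1% relative.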

Critically, these gains extend beyond benchmarks. Cohere Transcribe also delivers state-of-the-art results in human evaluations, where reviewers assess transcripts of real-world audio for accuracy, coherence, and usability.

This consistency across both benchmark and human testing confirms that its performance translates reliably into practical enterprise deployments.

Throughput

In production environments, ASR systems must meet strict latency and throughput demands; high accuracy alone isn’t sufficient if transcription is slow or resource-intensive.

Cohere Transcribe pushes the Pareto frontier of accuracy versus speed. Within the 1B+ parameter model category, it combines state-of-the-art accuracy (low WER) with best-in-class throughput (high RTFx, the inverse real-time factor), supporting efficient, real-time performance at scale.
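RTFx is the inverse real-time factor: the seconds of audio transcribed per second of processing time, so higher is faster. The sketch below shows the arithmetic only. The function name is hypothetical and the numbers are made up for illustration; they are not measurements of Cohere Transcribe or any other model.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if audio_seconds < 0 or processing_seconds <= 0:
        raise ValueError("audio_seconds must be >= 0 and processing_seconds > 0")
    return audio_seconds / processing_seconds


# Hypothetical example: ten minutes of audio transcribed in two seconds
# of wall-clock time. These are illustrative values, not benchmark results.
example = rtfx(audio_seconds=600.0, processing_seconds=2.0)
print(f"RTFx = {example:.0f}")
```

An RTFx of 1 means transcription takes exactly as long as the audio itself, so values well above 1 are what make real-time and batch workloads practical.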

“We’re genuinely impressed with what Cohere has built with Transcribe,” said Paige Dickie, Vice-President at Radical Ventures. “The speed is exceptional, turning minutes of audio into usable transcripts in seconds, unlocking new possibilities for real-time products and workflows. In our testing, the model handled everyday speech well and delivered reliable transcription quality. We’re excited to partner with Cohere and explore what we can build with this technology.”

Zero to One — and Beyond

Cohere is expanding Cohere Transcribe beyond transcription, with deeper integration planned into North, its AI agent orchestration platform. The goal is to evolve it into a broader foundation for enterprise-grade speech intelligence.

Getting Started

Cohere Transcribe is available on Hugging Face for local or edge deployment. It can also be accessed via API for quick experimentation, with rate-limited usage.

For production, Cohere offers Model Vault, enabling low-latency, private cloud inference without infrastructure overhead, with pricing based on hourly instances and discounted long-term plans.
