Transcription Accuracy Tested Across 8 Providers
We ran identical meeting audio through 8 transcription APIs and measured Word Error Rate, diarization accuracy, latency, and cost. Here is every number.
Why we ran this test
Every transcription vendor publishes accuracy numbers. Deepgram claims "industry-leading accuracy." AssemblyAI says "best-in-class." OpenAI published Whisper benchmarks that compared favorably against commercial APIs. The problem: each vendor tests against different audio, under different conditions, using different metrics. Their numbers can't be compared side by side.
We wanted one dataset, one methodology, applied to every provider. No vendor-provided audio. No cherry-picked samples. Just the same meeting recordings fed through each API, with results measured against human-verified ground truth.
This article presents the full results. The methodology is open source. You can run the same tests yourself.
Methodology
We tested 8 transcription providers using 5 audio samples designed to simulate real meeting conditions. Each sample was 10 minutes long:
- Sample 1: Clean single speaker. One person presenting, good microphone, quiet room. The baseline.
- Sample 2: Two-person dialogue. Engineering standup, natural turn-taking, some technical vocabulary ("Kubernetes," "OAuth," "CI pipeline").
- Sample 3: Four-person group call. Product planning meeting with frequent interruptions and crosstalk. Laptop microphones, varying distances.
- Sample 4: Non-native English speaker. Moderate accent, clear speech, standard business vocabulary.
- Sample 5: Background noise. Coffee shop environment, one speaker on a phone call. Air conditioning hum, ambient chatter.
All audio was recorded as 16 kHz mono WAV. Each file was submitted to all 8 providers using their default settings: no custom vocabulary, no language hints, standard/best available model. Ground truth transcripts were created by two independent human transcribers, with disagreements resolved by a third.
Providers tested: Deepgram (Nova-2), AssemblyAI (Universal-2), OpenAI Whisper (large-v3 via API), Rev.ai, Speechmatics, Google Cloud Speech-to-Text (Chirp 2), Azure Speech Services, and Otter.ai.
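The harness pattern behind this setup is simple: one adapter per provider that wraps the vendor SDK behind a common interface, so every file takes an identical path through the pipeline. A minimal sketch of that shape (the `transcribe` callables and file names here are placeholders, not the actual adapters from the benchmark repo):

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TranscriptResult:
    text: str
    latency_s: float

def run_benchmark(audio_path: str,
                  providers: dict[str, Callable[[str], str]]) -> dict[str, TranscriptResult]:
    """Submit the same audio file to every provider and record wall-clock latency.

    Each value in `providers` is a function taking an audio path and returning
    a transcript string (in practice, a thin wrapper around a vendor SDK call).
    """
    results = {}
    for name, transcribe in providers.items():
        start = time.perf_counter()
        text = transcribe(audio_path)
        results[name] = TranscriptResult(text, time.perf_counter() - start)
    return results

# Stand-in provider for illustration; real adapters call vendor APIs.
demo = run_benchmark("sample1.wav", {"mock": lambda path: "hello world"})
```

Keeping every adapter behind the same signature is what makes "one dataset, one methodology" enforceable in code.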
Overall WER results
Word Error Rate (WER) measures the percentage of words incorrectly transcribed compared to the ground truth: substitutions, deletions, and insertions divided by the number of reference words. Lower is better. Averaged across all 5 audio samples, results ranged from 4.7% to 8.3%; the full per-condition breakdown appears in the table in the next section.
The spread between best (4.7%) and worst (8.3%) is 3.6 percentage points. That sounds small, but in practice it means Otter produces nearly twice as many errors per paragraph as Deepgram. For a 30-minute meeting generating ~5,000 words, the difference is roughly 180 additional errors.
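WER itself is straightforward to reproduce: it is the word-level Levenshtein edit distance divided by the reference length. A self-contained sketch (a production scorer would also normalize casing and punctuation before comparing, which this omits):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over 6 reference words: about 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```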
Performance by audio condition
Average WER masks important differences. Some providers handle clean audio well but fall apart with crosstalk. Others are surprisingly resilient to background noise but struggle with accents. Here's how each provider performed across the 5 conditions:
| Provider | Clean | Dialogue | Crosstalk | Accent | Noise | Avg |
|---|---|---|---|---|---|---|
| Deepgram | 2.8% | 3.9% | 6.2% | 6.1% | 4.5% | 4.7% |
| AssemblyAI | 3.1% | 4.2% | 5.8% | 6.8% | 5.6% | 5.1% |
| Speechmatics | 3.4% | 4.5% | 7.1% | 6.4% | 5.6% | 5.4% |
| Google Chirp | 3.2% | 5.1% | 8.4% | 6.9% | 5.4% | 5.8% |
| Whisper | 3.0% | 4.8% | 7.8% | 9.8% | 5.6% | 6.2% |
| Azure Speech | 3.6% | 5.3% | 8.6% | 8.2% | 6.8% | 6.5% |
| Rev.ai | 4.1% | 5.8% | 9.2% | 8.7% | 7.7% | 7.1% |
| Otter.ai | 4.5% | 6.2% | 10.8% | 14.2% | 5.8% | 8.3% |
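The Avg column is the unweighted mean of the five condition scores, which is easy to verify from the table; for example, using three of the rows above:

```python
# Per-condition WER (%) in table order: clean, dialogue, crosstalk, accent, noise.
scores = {
    "Deepgram":   [2.8, 3.9, 6.2, 6.1, 4.5],
    "AssemblyAI": [3.1, 4.2, 5.8, 6.8, 5.6],
    "Otter.ai":   [4.5, 6.2, 10.8, 14.2, 5.8],
}
# Unweighted mean, rounded to one decimal like the table's Avg column.
averages = {name: round(sum(v) / len(v), 1) for name, v in scores.items()}
```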
Three patterns stand out:
Crosstalk is the great equalizer. Every provider's error rate roughly doubles when handling four overlapping speakers versus clean audio. The best performer on crosstalk (AssemblyAI at 5.8%) is still worse than every provider's clean-audio score. If your team has group calls, this column matters more than the average.
Accent handling varies wildly. Whisper's 9.8% on the accent sample is surprising given OpenAI's focus on multilingual models. Otter's 14.2% is the single worst result in the entire dataset. Deepgram and Speechmatics both stayed below 6.5%, suggesting their training data includes more diverse speech patterns.
Background noise is more manageable than expected. Most providers handled the coffee shop sample within 2-3 points of their clean-audio score. Modern noise suppression has made significant progress. This is the condition that matters least when choosing a provider.
Speaker diarization
Diarization means identifying who said what. For meeting transcripts, this matters as much as raw accuracy. A transcript where every word is correct but nobody knows who spoke is still hard to use.
We measured diarization accuracy as the percentage of total audio time where the provider correctly identified the speaker, after optimal label mapping.
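One way to compute this metric: sum the time where reference and hypothesis segments agree under every possible mapping of hypothesis labels to reference speakers, and keep the best mapping. A brute-force sketch (workable for meeting-sized speaker counts; a large-scale evaluator would use the Hungarian algorithm instead of enumerating permutations):

```python
from itertools import permutations

def diarization_accuracy(ref: list[tuple[float, float, str]],
                         hyp: list[tuple[float, float, str]]) -> float:
    """Fraction of reference audio time with a matching speaker label,
    maximized over mappings of hypothesis labels to reference labels.
    Segments are (start_s, end_s, speaker)."""
    def matched_time(mapping: dict[str, str]) -> float:
        total = 0.0
        for rs, re_, r_spk in ref:
            for hs, he, h_spk in hyp:
                if mapping.get(h_spk) == r_spk:
                    total += max(0.0, min(re_, he) - max(rs, hs))
        return total

    ref_spks = sorted({s for _, _, s in ref})
    hyp_spks = sorted({s for _, _, s in hyp})
    total_ref = sum(e - s for s, e, _ in ref)
    best = 0.0
    # Try every assignment of hypothesis labels to reference speakers.
    for perm in permutations(ref_spks, min(len(hyp_spks), len(ref_spks))):
        best = max(best, matched_time(dict(zip(hyp_spks, perm))))
    return best / total_ref
```

Note the label mapping: a provider calling the speakers "1" and "2" instead of "A" and "B" is not an error, so accuracy is scored after the most favorable relabeling.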
AssemblyAI leads on diarization (91.2%), which explains its strong crosstalk WER score. The two metrics are related: better speaker separation means the model can apply speaker-specific context to improve word recognition.
OpenAI's Whisper API does not return speaker labels. If you need diarization with Whisper, you have to run a separate model (pyannote, NeMo) on the audio, which adds complexity, latency, and cost. Otter.ai was excluded from this comparison because its diarization is part of its proprietary pipeline and cannot be isolated for measurement.
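If you do pair Whisper with an external diarization model, the usual join is by timestamp: each word from the ASR output takes the label of the diarization segment it overlaps most. A minimal sketch of that merge step (the data shapes are illustrative, not any particular library's output format):

```python
def assign_speakers(words: list[tuple[str, float, float]],
                    segments: list[tuple[float, float, str]]) -> list[tuple[str, str]]:
    """Attach a speaker label to each timestamped word (word, start_s, end_s)
    by picking the diarization segment (start_s, end_s, speaker) with the
    greatest temporal overlap."""
    labeled = []
    for word, ws, we in words:
        best_spk, best_overlap = "unknown", 0.0
        for ss, se, spk in segments:
            overlap = min(we, se) - max(ws, ss)
            if overlap > best_overlap:
                best_spk, best_overlap = spk, overlap
        labeled.append((word, best_spk))
    return labeled
```

The complexity, latency, and cost this article mentions come from running both models and reconciling their timelines, and small timestamp disagreements at turn boundaries are a common source of mislabeled words.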
Latency comparison
We measured wall-clock time from API call to complete transcript for each 10-minute audio file. For providers offering both streaming and batch modes, we tested batch (the common path for post-meeting transcription).
Deepgram is significantly faster than the rest. At 4.2 seconds for a 10-minute file, it processes audio at roughly 140x real-time. This matters for applications that need near-instant transcripts after a call ends (CRM updates, action item extraction, meeting summaries).
AssemblyAI's 15.2 seconds includes polling time. Their upload-then-poll architecture adds overhead compared to Deepgram's synchronous endpoint. Rev.ai has a similar model. For batch processing of recorded calls, these delays are acceptable. For real-time use cases, they're not.
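For the upload-then-poll providers, reported latency includes that polling overhead. A sketch of the generic loop being described (the `check_status` callable stands in for a vendor-specific status request):

```python
import time

def poll_until_done(check_status, interval_s: float = 1.0, timeout_s: float = 300.0):
    """Generic upload-then-poll loop. `check_status()` returns the finished
    transcript, or None while the job is still processing. The returned
    elapsed time is the wall-clock latency we report: queueing, processing,
    and polling overhead included."""
    start = time.perf_counter()
    while time.perf_counter() - start < timeout_s:
        result = check_status()
        if result is not None:
            return result, time.perf_counter() - start
        time.sleep(interval_s)
    raise TimeoutError("transcription job did not finish in time")
```

With a 1-second polling interval, up to a full second of the measured latency is polling granularity rather than processing time, which is part of why synchronous endpoints look so much faster on short files.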
Whisper via OpenAI's API is the slowest at 22.6 seconds. Self-hosted Whisper on an A100 GPU can be faster, but then you're paying for GPU time instead of API fees.
Cost analysis
Pricing as of April 2026. All prices are per minute of audio processed, on standard/pay-as-you-go tiers.
| Provider | Per minute | Per hour | 1,000 hrs/mo | Diarization |
|---|---|---|---|---|
| Deepgram | $0.0043 | $0.26 | $260 | Included |
| AssemblyAI | $0.0050 | $0.30 | $300 | Included |
| Rev.ai | $0.0047 | $0.28 | $280 | Included |
| Google Chirp | $0.0060 | $0.36 | $360 | Included |
| Whisper API | $0.0060 | $0.36 | $360 | No |
| Azure Speech | $0.0067 | $0.40 | $400 | Included |
| Speechmatics | $0.0070 | $0.42 | $420 | Included |
| Otter.ai | Subscription ($16.99/mo per user) | — | — | Included |
Otter's subscription model makes direct comparison difficult. At $16.99/user/month, it's cheaper for light usage (a few hours per month) but quickly becomes more expensive than API-based providers for heavy processing. It also lacks an API, making it unsuitable for automated pipelines.
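The break-even point between a flat subscription and pay-as-you-go pricing is simple arithmetic; for example, against Deepgram's per-minute rate from the table above:

```python
OTTER_MONTHLY = 16.99      # $ per user per month (subscription)
DEEPGRAM_PER_MIN = 0.0043  # $ per minute of audio (pay-as-you-go)

def breakeven_hours(monthly_fee: float, per_min_rate: float) -> float:
    """Hours of audio per month at which a flat subscription costs the
    same as metered API pricing."""
    return monthly_fee / (per_min_rate * 60)

print(round(breakeven_hours(OTTER_MONTHLY, DEEPGRAM_PER_MIN), 1))  # → 65.9
```

Below roughly 66 hours of audio per user per month, the subscription is the cheaper option; above it, the per-minute API pulls ahead, and the gap widens quickly at pipeline-scale volumes.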
Accuracy vs. cost tradeoff
Plotting WER against cost per hour reveals which providers offer the best value:
Deepgram sits alone in the lower-left: cheapest and most accurate. AssemblyAI is close behind, with slightly higher cost and error rate but the best diarization. These two are the clear leaders on the accuracy-to-cost ratio.
Rev.ai offers competitive pricing but trails on accuracy. Speechmatics delivers good accuracy but at the highest API cost. Google and Azure are middle-of-the-road on both axes. Whisper is penalized by its lack of diarization; if you add a third-party diarization service, the effective cost rises above $0.50/hr.
Best for each use case
Best overall accuracy: Deepgram Nova-2. Lowest WER, fastest processing, lowest cost. Hard to argue against.
Best for multi-speaker meetings: AssemblyAI. Best diarization accuracy (91.2%) and best crosstalk WER (5.8%). If your primary use case is group calls with 3+ speakers, AssemblyAI's speaker separation gives it an edge.
Best for multilingual teams: Speechmatics. Supports 50+ languages with consistent quality. Premium pricing, but the breadth of language support is unmatched among the providers tested.
Best free/self-hosted option: OpenAI Whisper large-v3. 6.2% WER with zero API cost if you self-host. The tradeoff: no built-in diarization, slower processing, and you're responsible for GPU infrastructure.
Best for real-time streaming: Deepgram. Sub-second latency in streaming mode is 3-5x faster than the next competitor. Essential for live captioning, real-time assistants, and voice agents.
Best for budget at scale: Deepgram at $0.0043/min. At 1,000 hours/month, you're paying $260 compared to $420 for Speechmatics or $400 for Azure. That adds up.
Methodology notes
A few caveats worth noting:
- Default settings only. We did not test custom vocabulary, language hints, or fine-tuned models. Providers that offer these features may perform better in practice for teams that invest time configuring them.
- English only. All samples were in English. Multilingual performance would produce different rankings.
- API pricing changes. We verified prices as of April 2026. Volume discounts and enterprise agreements will produce lower rates than those shown here.
- Otter.ai is consumer-grade. It's designed for individual users joining live meetings, not for API-based batch processing. Comparing it directly against developer APIs is not entirely fair, but it's included because teams commonly consider it alongside these alternatives.
The benchmark methodology is open source. You can run the same tests, add your own audio samples, or contribute adapters for providers we haven't tested. See the benchmark page for setup instructions.
If you want us to test a provider we missed, submit a suggestion.