Voice AI Agents in Meetings: Hype Check
Vapi, Retell, Bland, and Synthflow are raising money to put AI agents in your meetings. Can they actually run a meeting? We checked.
Every voice AI pitch deck makes the same promise: "Our agent can handle any conversation, from sales demos to board meetings, with human-level fluency." I tested five voice AI platforms over two weeks. Most of them made me want to hang up. The reality is more complicated, and more interesting, than the marketing suggests.
Voice AI agents are real technology solving real problems. They book appointments, confirm meetings, and handle scripted outbound calls with surprising competence. But the gap between "handle a structured phone call" and "participate in a business meeting" is enormous. The funding is flowing anyway.
The funding map
The voice AI agent category has attracted over $100M in venture funding in the past 18 months. Most of that money is concentrated in five companies.
ElevenLabs dominates voice synthesis and cloning but operates more as infrastructure than an agent platform. Vapi has positioned itself as the developer-first API layer. Retell is shipping features fast and targeting mid-market sales teams. Bland AI focuses almost entirely on outbound calling. Synthflow targets non-technical users with a drag-and-drop builder.
The money is real. The question is whether "voice AI agent" is a product category or a feature that gets absorbed into existing platforms like Salesforce, HubSpot, and the meeting tools themselves. As the AI notetaker price war shows, features that start as standalone products often collapse into platform plays.
What works today
I tested all five platforms on real business workflows. Four use cases consistently produced good results:
1. Outbound cold calls with scripted flows. This is the sweet spot. The agent calls a prospect, delivers a 30-second pitch, handles basic objections ("not interested," "send me an email," "who is this?"), and either books a meeting or moves on. Bland AI and Vapi both handled this well. Completion rates were comparable to junior SDRs on simple scripts.
2. Appointment booking and confirmation. "Hi, this is Sarah from Dr. Miller's office confirming your appointment on Thursday at 2 PM. Can you make it?" These calls are short, predictable, and high-volume. Every platform I tested handled them reliably.
3. Customer support triage. Routing calls to the right department, collecting account numbers, describing the issue before transfer. The agents handled this as well as a basic IVR system, with the added benefit of natural conversation instead of "press 1 for billing."
4. Post-meeting follow-up calls. "Hi, just following up on the demo yesterday. Do you have 15 minutes this week to discuss next steps?" Simple, structured, one question to answer. The agents did fine here.
Notice the pattern. Every successful use case is a two-party conversation with a narrow scope and predictable responses. The agent knows what it wants. The human's responses fall into a small number of categories. When the conversation goes off-script, the agent either redirects or gracefully hands off to a human.
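That narrow-scope pattern can be sketched as a small state machine: the agent classifies each human response into one of a few expected intents, advances the script, or hands off when it hears anything else. A minimal illustrative sketch follows; the intent labels, script lines, and function names are all invented for this example, not any vendor's API:

```python
# Minimal sketch of a scripted outbound-call flow: narrow scope,
# a handful of expected intents, and a human handoff for anything else.
# All names here are illustrative, not tied to any real platform.

INTENTS = {
    "not interested": "close_politely",
    "send me an email": "collect_email",
    "who is this": "re_introduce",
    "yes": "book_meeting",
}

def classify(response: str) -> str:
    """Toy intent classifier: keyword match against expected responses."""
    text = response.lower()
    for phrase, intent in INTENTS.items():
        if phrase in text:
            return intent
    return "handoff_to_human"  # off-script -> graceful handoff

def next_step(response: str) -> str:
    """Map a classified intent to the agent's next scripted line."""
    steps = {
        "close_politely": "Thanks for your time. Goodbye.",
        "collect_email": "Sure, what address should I use?",
        "re_introduce": "This is Sam calling from Acme about your trial.",
        "book_meeting": "Great, does Thursday at 2 PM work?",
        "handoff_to_human": "Let me connect you with a teammate.",
    }
    return steps[classify(response)]
```

The point of the sketch is the fallback branch: everything the agent does not expect routes to a human, which is exactly why these narrow use cases succeed and open-ended meetings do not.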
What doesn't work
I also tested these platforms on scenarios closer to actual business meetings. The results were bad. Not "needs improvement" bad. "I apologized to the person on the other end" bad.
Multi-party conversations break these agents. Current voice AI can reliably track only one speaker at a time. When three people talk over each other in a product review meeting, the agent cannot determine who said what, who to respond to, or when to speak. I tested Vapi's multi-party mode and it consistently talked over participants or went silent at the wrong moments.
Latency kills natural conversation. The best voice AI agents have roughly 400-800ms of response latency. In a scripted call where the agent asks a question and waits, that delay is acceptable. In a fast-moving meeting where people build on each other's ideas, half a second of silence after every statement feels robotic. People stop treating the agent as a participant and start treating it as a broken speakerphone. I found myself repeating "hello?" more than once, which tells you everything.
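To see why the 400-800ms range is hard to escape, consider a rough per-turn budget: endpoint detection, speech recognition, language-model inference, and speech synthesis mostly run in sequence, and each stage contributes. The stage breakdown and numbers below are my own illustrative assumptions, not any vendor's published figures:

```python
# Rough per-turn latency budget for a voice-agent pipeline.
# Stage timings are illustrative assumptions, not measured vendor numbers.

budget_ms = {
    "endpoint_detection": 150,  # deciding the human has stopped talking
    "speech_to_text": 100,      # finalizing the transcript
    "llm_first_token": 250,     # model latency to start a reply
    "text_to_speech": 100,      # synthesizing the first audio chunk
    "network": 50,              # round trips between services
}

total = sum(budget_ms.values())
print(f"per-turn latency: ~{total} ms")  # lands mid-range of 400-800ms
```

Even with every stage well optimized, the stages add up, which is why the delay shows up on every platform rather than just the slow ones.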
Context from shared screens and documents is invisible. Half of what happens in a business meeting references something on screen: a slide, a Figma mockup, a spreadsheet, a code diff. Voice AI agents hear words but cannot see screens. When someone says "let's go back to slide 4" or "the numbers in row 12 look off," the agent has nothing to work with.
Nuance and negotiation require reading the room. Knowing when to push, when to concede, when to stay quiet. These are high-context skills that depend on tone, pauses, and social dynamics. Voice AI agents process words. They do not process the silence after a lowball offer.
The real opportunity
Voice AI will not replace meeting participants. Not this year, probably not in the next three years for anything beyond simple structured calls. But that is not where the real value is anyway.
The valuable work around meetings is not the meeting itself. It is everything before and after: scheduling, rescheduling, confirming attendance, sending agendas, taking notes, distributing action items, updating CRM records, booking follow-ups. Most of that work is structured, predictable, and mind-numbing. Voice AI can handle all of it.
The agent that calls your prospect to confirm a demo time? That works today. The agent that calls every attendee 24 hours before a meeting to confirm or reschedule? That works. The agent that calls a customer after a support ticket closes to check satisfaction? Works fine.
The agent that runs the demo? Not yet.
The companies that will win this market are the ones building around meetings, not trying to replace the humans in them. That means integrating with calendars, CRMs, and meeting platforms. It means treating voice AI as the connective tissue between meetings, not the meeting itself.
Who to watch
Vapi has the best developer platform in the category. Their API is clean, well-documented, and flexible enough to build custom workflows. If a meeting tool wants to add voice AI features, Vapi is the most likely infrastructure layer they will build on. The risk is that Vapi is a platform play, not a product play. They need other companies to build the meeting-specific use cases, which means their success depends on an ecosystem that does not fully exist yet.
Retell moves faster than anyone else I tested. They went from launch to production-ready in under a year, and their latency numbers are the best in the group (consistently under 500ms). They are targeting sales teams directly with pre-built templates for cold calling and appointment setting, which gives them revenue now while competitors build infrastructure. If Vapi's ecosystem catches up with comparable templates, Retell's narrow focus becomes a vulnerability.
ElevenLabs produces the most natural-sounding voices, and it is not close. Their voice cloning can replicate a specific person's speech patterns with minutes of training audio. For meeting use cases, voice quality matters because participants need to feel comfortable talking to the agent. The limitation is that ElevenLabs is a voice engine, not an agent platform. They need partners to build the conversation logic, which puts them one step removed from the end user.
Bland AI owns outbound calling and is not pretending otherwise. Their pricing model (per-minute, no platform fee) makes them the cheapest option for high-volume campaigns. They have built integrations with most major CRMs. I did not see any indication they plan to move toward multi-party or meeting-adjacent use cases, which means they stay in their lane and do well there, or the lane gets absorbed by a bigger player.
Synthflow is the Zapier of voice AI: no-code builder, pre-built templates, one-click deployment. They target small businesses that want a phone agent without hiring a developer. No-code platforms hit a ceiling when conversations get complex, and their voice quality trails ElevenLabs and Retell. But for the "I just need someone to answer the phone" use case, Synthflow is the fastest path from zero to working agent.
The most interesting scenario: one of the major meeting platforms (Zoom, Teams, Google Meet) acquires a voice AI company and builds native agent capabilities. That has not happened yet. When it does, the category map changes overnight.