Voice-agent QA for real audio

AudioWorld

We find the failures your scripted tests miss: missed barge-ins, false activations, noise breakdowns, addressing errors and failed recovery when a realtime voice agent is deployed in cars, transit, meetings and other messy acoustic worlds.

Request a Pulse pilot
See the sample report

No model weights required
Endpoint or response logs
Report, clips and repro logs

What we sell

A diagnosis, not a dataset.

You give us a realtime voice endpoint or response logs. We replay controlled real-world scenarios against it, score the agent's behavior, and return a scorecard, hearable failures, reproducible logs and concrete fixes.

Does it stop when a user cuts in mid-response?
Does it ignore nearby speech not addressed to it?
Does it stay concise and safe in car or transit settings?
Does it hold the goal across noise, backchannels and corrections?

What the world measures

Behavior under real acoustic pressure.

Listen through noise and motion
Know when speech is addressed to the agent
Yield, recover and keep the task state
Return failures as scorecards and clips

How it works

Connect, run, receive the failures your team can fix.

Send an endpoint or logs

No model weights required. Start with a realtime endpoint, a webhook adapter, or recorded response logs from your stack.

Run real-world scenarios

We replay seeded audio worlds: interruptions, bystander speech, car cabins, transit noise, meetings and long-session memory pressure.
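As a rough illustration of what "seeded" can mean in practice, the sketch below derives every scenario parameter from a single seed, so the same seed always reproduces the same run. The field names (environment, snr_db, interrupt_at_s), the environment labels and the value ranges are hypothetical and are not AudioWorld's actual schema.

```python
import random
from dataclasses import dataclass

# Hypothetical environment labels, loosely based on the scenario list above.
ENVIRONMENTS = ["car_cabin", "transit", "meeting"]


@dataclass(frozen=True)
class Scenario:
    seed: int
    environment: str
    snr_db: float          # assumed: speech-to-noise ratio of the mix
    interrupt_at_s: float  # assumed: when the simulated user cuts in


def make_scenario(seed: int) -> Scenario:
    """Derive every scenario parameter from the seed alone."""
    rng = random.Random(seed)  # local RNG, so global random state cannot leak in
    return Scenario(
        seed=seed,
        environment=rng.choice(ENVIRONMENTS),
        snr_db=round(rng.uniform(0.0, 20.0), 1),
        interrupt_at_s=round(rng.uniform(0.5, 4.0), 2),
    )


# Re-running with the same seed yields an identical scenario.
first = make_scenario(1234)
second = make_scenario(1234)
print(first == second)
```

Keeping the random generator local to the function is one way to make a repro seed meaningful: a failing run can be replayed from the seed alone, without depending on whatever else consumed random numbers beforehand.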

Get a repairable report

You receive pass/fail scorecards, failure clips, JSON/CSV logs, repro seeds and specific recommendations for model or product fixes.

What you receive

A pilot report designed for product and model teams.

Executive scorecard

Overall verdict, failure families, confidence, scenario coverage and the top fixes to prioritize.
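To make the scorecard idea concrete, here is a minimal sketch of how per-trial pass/fail results could be rolled up into a pass rate per failure family. The record keys ("family", "passed") and the family names are assumptions for illustration; the real report format may differ.

```python
from collections import defaultdict


def pass_rates_by_family(trials):
    """Return {family: fraction of trials passed} for a list of trial dicts."""
    totals = defaultdict(lambda: [0, 0])  # family -> [passed, total]
    for trial in trials:
        counts = totals[trial["family"]]
        counts[0] += int(trial["passed"])
        counts[1] += 1
    return {family: passed / total for family, (passed, total) in totals.items()}


# Hypothetical trial results, not real measurements.
trials = [
    {"family": "barge_in", "passed": True},
    {"family": "barge_in", "passed": False},
    {"family": "false_activation", "passed": True},
]
print(pass_rates_by_family(trials))
# → {'barge_in': 0.5, 'false_activation': 1.0}
```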

Hearable failures

Short audio clips that show exactly where the agent kept talking, activated falsely or missed the task.

Reproducible logs

Scenario IDs, seeds, timing traces, transcripts, expected behavior and machine-readable score outputs.
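A machine-readable log entry covering the items above might look like the following sketch. The class name, field names and example values are all hypothetical, chosen only to mirror the list (scenario ID, seed, timing trace, transcript, expected behavior, score); they do not describe the actual log schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TrialRecord:
    scenario_id: str
    seed: int
    expected_behavior: str
    transcript: str
    timing_ms: dict  # assumed: event name -> milliseconds from trial start
    passed: bool


def to_json_line(record: TrialRecord) -> str:
    """Serialize one trial as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only, not real results.
example = TrialRecord(
    scenario_id="car-bargein-003",
    seed=42,
    expected_behavior="stop speaking when the user cuts in",
    transcript="agent: Your route is ... user: wait, stop",
    timing_ms={"user_onset": 1000, "agent_audio_end": 1320},
    passed=True,
)
print(to_json_line(example))
```

One record per line keeps the log easy to diff between runs and easy to load into CSV or dataframe tooling.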

Fix path

Recommendations for endpoint behavior, prompt policy, VAD/cancel settings, data collection or fine-tuning.

Products

Start with Pulse. Pay for depth.

Pulse is the intake report. Duplex measures live interruption behavior. Arena, Boardroom and Forge extend the same evaluation lineage into reliability, enterprise scenes and training data.

Free starter

Pulse

Ten short scenarios in one environment. A fast scorecard for noise robustness, false activation, silence compliance and basic task behavior.

Input: endpoint or logs
Output: scorecard + clips
Turnaround: 48h target

Primary pilot

Duplex

Live realtime barge-in measurement: stop latency, cancel verdict and grace-contract pass rate, benchmarked against reference systems.

Input: realtime agent
Output: stop-latency traces
Best for: launch readiness
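The page does not define stop latency precisely. The sketch below assumes it is the time from the onset of the user's interrupting speech to the moment the agent's audio output ends, and it derives a simple verdict from that. The function names, the verdict labels and the 500 ms budget are illustrative assumptions, not Duplex's actual thresholds.

```python
from typing import Optional


def stop_latency_ms(user_onset_ms: float,
                    agent_audio_end_ms: Optional[float]) -> Optional[float]:
    """Time from user barge-in onset to the end of agent audio.

    Assumes the agent was still speaking at user_onset_ms.
    Returns None if the agent never stopped within the observed window.
    """
    if agent_audio_end_ms is None:
        return None
    return agent_audio_end_ms - user_onset_ms


def cancel_verdict(latency_ms: Optional[float], budget_ms: float = 500.0) -> str:
    """Classify a trial; budget_ms is an illustrative threshold."""
    if latency_ms is None:
        return "never_stopped"
    return "stopped" if latency_ms <= budget_ms else "late"


print(cancel_verdict(stop_latency_ms(1000.0, 1320.0)))  # → stopped
print(cancel_verdict(stop_latency_ms(1000.0, 1900.0)))  # → late
print(cancel_verdict(stop_latency_ms(1000.0, None)))    # → never_stopped
```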

Closed loop

Arena

Agent-conditioned user simulation over multiple seeds for pass^k reliability, interruption recovery and goal completion.

Input: endpoint + goals
Output: pass^k reliability
Best for: regression QA
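pass^k is not defined on this page. The sketch below assumes the common reading: the probability that k independent runs of the same scenario all pass. Given n seeded runs with c passes, it uses the standard unbiased estimate C(c, k) / C(n, k). The function name and the example numbers are illustrative.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k randomly chosen runs out of `trials` all passed."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(successes, k) is 0 when k > successes, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


# Example: a scenario passed 8 of 10 seeded runs.
print(pass_hat_k(8, 10, 1))  # → 0.8
print(round(pass_hat_k(8, 10, 2), 3))  # → 0.622
```

Under this reading, an agent that passes 80% of single runs passes two in a row only about 62% of the time, which is why a pass^k figure can be a stricter reliability signal than a plain pass rate.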

Enterprise

Boardroom

Multi-party and long-form scenarios for addressing, silence compliance, false activations and custom acoustic environments.

Input: target workflow
Output: custom eval pack
Best for: enterprise pilots

Training add-on

Forge

Supervised, preference and duplex-timing training data, suitable for fine-tuning or RL, generated from evaluation failures and reproducible simulation runs.

Input: failure families
Output: SFT/preference/timing data
Best for: model improvement

Evidence

Public sample, private raw captures.

The public site shows synthetic speech and speech-free ambience demos. The private pipeline already produces measurable failure cases across static items, duplex trials, long-session cards and Boardroom scenes.

Sample diagnosis: gpt-realtime-mini
Interactive audio-world demos
Request a pilot scope

169 static items across failure families
257 speech-free ambience stems
100 long-session endurance cards
100 Boardroom audio scenes
20 live Duplex barge-in runs

Security and data handling

Built for evaluation work with sensitive audio.

Customer boundaries

Customers can provide endpoints or logs without sending model weights. Pilot scope defines retention, deletion and allowed artifacts before runs begin.

Public release gate

Published demos use TTS voices and speech-free ambience. Identifiable raw voices and customer outputs are not published without explicit approval.

Lineage and deletion

Scenario assets track source, processing version, consent scope and export policy. Deletion requests are handled at the pilot artifact level.

Why this becomes infrastructure

Realtime voice needs evals that are native to sound and timing.

Text benchmarks do not catch the failures that make voice agents feel broken: late cancellation, accidental activation, missed addressee, noise collapse and long-session drift. AudioWorld turns those failures into repeatable QA, procurement evidence and training data.

Market wedge

Teams shipping realtime agents into support, mobility, meetings, wearables and embedded environments.

Moat

Scenario generator, field-recording-derived ambience, timing instrumentation and failure-to-data pipeline.

Current ask

We are seeking design partners for Pulse and Duplex pilots. Deeper enterprise packs are scoped from real customer failures.