Real-world audio arena

AudioWorld

A controlled benchmark and training environment for full-duplex voice agents operating in subways, cars, offices, cafes, streets, and homes.

Live Arena: 42s scene loop

Agent:
  • Subway announcement ignored
  • Barge-in detected at 30.0s
  • Task state retained

Eval: Task success, barge-in, latency, safety
Train: State updates, tool decisions, turn boundaries
World: Real backgrounds with synthetic interaction scenes

Full-duplex eval · Real background stems · Synthetic speech expansion · Behavior scoring

Product

What AudioWorld sells

Not a noisy ASR dataset. A behavior arena.

AudioWorld packages real-world audio scenes where voice agents must listen, track state, reject distractors, use tools, stop for barge-in, and answer safely under realistic acoustic stress.

Eval Packs

Replayable benchmark scenes with gold behavior labels.

Training Packs

Audio-to-action items for full-duplex SLM improvement.

Failure Reports

Agent behavior breakdowns tied to timestamps and tasks.

Synthetic Expansion

Real backgrounds with consented or safe synthetic speech.

Eval answers whether the agent can operate.

Each scene includes audio, turn timing, expected agent state, ignored events, tool-call targets, latency budgets, and safety constraints.
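To make the scene contents concrete, here is a minimal sketch of how one scene record could be represented. The class names, field names, file path, tool name, and budget values are all hypothetical placeholders chosen for illustration; they are not AudioWorld's published schema.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str              # assumed roles: "user", "agent", or "distractor"
    start_s: float            # turn start, seconds from scene start
    end_s: float              # turn end, seconds from scene start
    is_barge_in: bool = False


@dataclass
class Scene:
    scene_id: str
    audio_path: str
    turns: list               # list of Turn
    expected_state: dict      # agent state expected at the end of the scene
    ignored_events: list      # audio events the agent must not react to
    tool_call_targets: list   # expected tool names and arguments
    latency_budget_ms: dict   # budget per latency type, in milliseconds
    safety_constraints: list = field(default_factory=list)

    def validate(self):
        """Return a list of problems; an empty list means the record is consistent."""
        problems = []
        for i, turn in enumerate(self.turns):
            if turn.end_s <= turn.start_s:
                problems.append(f"turn {i}: end must be after start")
        for key, budget in self.latency_budget_ms.items():
            if budget <= 0:
                problems.append(f"latency budget '{key}' must be positive")
        return problems


# Illustrative values only; none of these numbers come from a real pack.
scene = Scene(
    scene_id="subway_lookup_014",
    audio_path="scenes/subway_lookup_014.wav",
    turns=[
        Turn("user", 2.0, 6.5),
        Turn("distractor", 12.0, 18.0),
        Turn("user", 30.0, 32.5, is_barge_in=True),
    ],
    expected_state={"task": "route_lookup"},
    ignored_events=["subway_announcement"],
    tool_call_targets=[{"name": "lookup_route", "args": {"line": "2"}}],
    latency_budget_ms={"first_audio": 800, "stop": 300, "tool": 1500},
    safety_constraints=["no_personal_data_in_reply"],
)

print(scene.validate())
```

Keeping the expected behavior in the same record as the audio reference is one way to let a scene be replayed and scored without extra lookups.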

  • 01 Task success under noise
  • 02 Multi-turn state retention
  • 03 Barge-in and correction handling
  • 04 Non-user speech rejection
  • 05 Tool timing and argument accuracy
  • 06 First-audio, stop, and tool latency
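Two of the dimensions above, stop latency after a barge-in and tool argument accuracy, can be sketched as small scoring functions. This is an illustrative sketch under assumed definitions, not AudioWorld's actual scoring code: the function names, the exact-match rule for arguments, and the example timestamps and budget are assumptions.

```python
def stop_latency_ms(barge_in_s, agent_stop_s):
    """Milliseconds between the user's barge-in and the agent going silent."""
    return round((agent_stop_s - barge_in_s) * 1000)


def within_budget(measured_ms, budget_ms):
    """True when a measured latency meets its budget."""
    return measured_ms <= budget_ms


def tool_args_accuracy(expected, actual):
    """Fraction of expected tool arguments that the agent matched exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


# Illustrative timestamps: barge-in at 30.0s, agent audio stops at 30.41s.
latency = stop_latency_ms(30.0, 30.41)
print(latency, within_budget(latency, budget_ms=500))

# One of two expected arguments is correct.
accuracy = tool_args_accuracy(
    {"station": "Central", "line": "2"},
    {"station": "Central", "line": "3"},
)
print(accuracy)
```

Exact matching is the strictest choice for arguments; a real scorer might normalize values such as station names before comparing.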

Training data comes from replayable failures.

Agent logs become labeled items for ASR under noise, turn boundaries, intent slots, state updates, tool decisions, barge-in policy, distractor rejection, and response policy.
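As a rough illustration of that conversion, the sketch below turns one failing log event into a labeled training item. The log format, the failure label names, and the mapping to skill categories are assumptions made for this example; only the skill categories themselves echo the list above.

```python
# Hypothetical mapping from a failure label to the skill it trains.
FAILURE_TO_SKILL = {
    "responded_to_distractor": "distractor_rejection",
    "missed_barge_in": "barge_in_policy",
    "wrong_tool_args": "tool_decisions",
    "lost_state": "state_updates",
}


def to_train_item(scene_id, log_event):
    """Convert a failing log event into a training item; return None if it passed."""
    label = log_event.get("failure")
    if label is None:
        return None
    return {
        "scene_id": scene_id,
        "t_s": log_event["t_s"],
        "skill": FAILURE_TO_SKILL.get(label, "unlabeled"),
        "failure_label": label,
        "observed": log_event["action"],
        "target": log_event["expected_action"],
    }


# Illustrative log: the agent kept talking through a barge-in.
log = [
    {"t_s": 12.0, "action": "ignore", "expected_action": "ignore"},
    {"t_s": 30.0, "action": "continue_speaking", "expected_action": "stop",
     "failure": "missed_barge_in"},
]

items = [item for event in log
         if (item := to_train_item("subway_lookup_014", event)) is not None]
print(items)
```

Because each item keeps the scene identifier and timestamp, the failure stays replayable against the original audio.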

Real scene → Agent log → Failure label → Train item

1. Capture

Subway, car vibration, office, cafe, street, and home audio.

2. Separate

Speech target and residual background stems with lineage.

3. Synthesize

Timed multilingual turns over real acoustic backgrounds.

4. Score

Replay into voice agents and score behavior, not just words.

Build the benchmark before the market standardizes.

AudioWorld is preparing pilot packs for voice-agent teams that need hard, realistic, repeatable audio interaction tests.

contact@audioworld.ai: routes to the founder inbox for dataset licensing, custom capture, and agent evaluation pilots.

Arena model

Replay the same world into every voice agent.

AudioWorld turns real recordings into timed interaction scenes with expected state, ignored audio events, tool-use targets, and latency budgets. The same scene can evaluate OpenAI, Gemini, local SLMs, or a customer's internal stack.

scene: subway_lookup_014
turn: barge-in at 30.0s
expected: stop, update state, ask one question
score: task 0.84 · latency 410ms
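The replay idea can be sketched as a thin adapter interface: every agent, whatever its vendor, receives the same scene audio and returns the actions it took, which are then compared with the expected behavior. The `VoiceAgent` protocol, the `StubAgent` class, the action names, and the simple hit-rate score are hypothetical and stand in for whatever a real harness would use.

```python
from typing import Protocol


class VoiceAgent(Protocol):
    """Assumed adapter interface; each vendor stack would get its own wrapper."""
    name: str

    def run(self, audio_path: str) -> list: ...


class StubAgent:
    """Stand-in agent that returns a fixed list of actions, for illustration."""

    def __init__(self, name, actions):
        self.name = name
        self._actions = actions

    def run(self, audio_path):
        return list(self._actions)


def replay(audio_path, expected_actions, agents):
    """Play one scene into every agent and return the share of expected actions seen."""
    results = {}
    for agent in agents:
        observed = agent.run(audio_path)
        hits = sum(1 for action in expected_actions if action in observed)
        results[agent.name] = hits / len(expected_actions)
    return results


expected = ["stop", "update_state", "ask_one_question"]
agents = [
    StubAgent("agent_a", ["stop", "update_state", "ask_one_question"]),
    StubAgent("agent_b", ["stop"]),
]
results = replay("scenes/subway_lookup_014.wav", expected, agents)
print(results)
```

Holding the scene and the expected actions fixed is what makes scores comparable across agents; only the adapter changes per stack.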

World coverage

Scenes where agents fail in production

Subway

Announcements, crowd speech, tunnel rumble, station names.

Car

Road noise, vibration, hands-free microphones, safety constraints.

Office

Keyboards, room echo, side conversations, low-volume turns.

Cafe

Music, dishes, overlapping speakers, privacy-sensitive audio.

Street

Wind, sirens, traffic spikes, unstable turn boundaries.

Home

TV, family voices, appliances, command-vs-distractor decisions.