Eleven v3 Audio Tags: Giving situational awareness to AI audio

Enhance AI speech with Eleven v3 Audio Tags. Control tone, emotion, and pacing for natural conversation. Add situational awareness to your text to speech.


Audio Tags are a fundamental part of the new Eleven v3 (alpha) Text to Speech model. They let you control how lines are delivered — shifting tone, emotion, and pacing to reflect real-world context.

At their simplest, Audio Tags are words in square brackets. The model interprets these as performance cues. That means you can adjust the delivery mid-sentence to reflect emotional beats or situational shifts — giving the AI a degree of situational awareness.
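
To make this concrete, here is a minimal sketch of sending a tagged line to the ElevenLabs Text to Speech REST endpoint with Python's requests library. The API key, voice ID, and the "eleven_v3" model identifier are placeholders and assumptions for illustration; check the current API reference for the exact values available to your account.

```python
# Minimal sketch: post a line containing Audio Tags to the ElevenLabs
# Text to Speech endpoint. Key, voice ID, and model ID are placeholders.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # assumption: your own API key
VOICE_ID = "YOUR_VOICE_ID"           # assumption: any v3-capable voice

# Audio Tags are plain square-bracketed cues embedded in the text itself.
text = "[WHISPERING] I think someone's in the house. [PAUSE] Stay quiet."

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": text,
        "model_id": "eleven_v3",  # assumption: confirm the v3 model ID in the docs
    },
)
response.raise_for_status()

# The endpoint returns audio bytes (MP3 by default).
with open("line.mp3", "wb") as f:
    f.write(response.content)
```

The tags travel inside the same text field as the dialogue itself; there is no separate markup channel, which is what lets you change delivery mid-sentence.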

What is situational awareness in AI speech?


Situational awareness means the AI adapts its delivery to fit the moment. With Audio Tags, you control not just what the model says — but how it responds.

Whether you're adding urgency with a [SHOUTING] tag, softening a warning with a [WHISPER], or signaling hesitation with [SIGH], tags transform narration into performance. They’re especially valuable in high-context or dynamic scenes.

Performance, not just reading

Imagine you’re scripting a Veo 3 highlight video of a football match between 11 United and 12 United. You want the intensity to rise with the action: “He cuts past one defender — [EXCITED] here comes the cross — [SHOUTING] GOAAAL!”

Or you’re voicing a suspenseful moment in an audiobook: “[WHISPERING] I think someone’s in the house. [PAUSE] Stay quiet.”

These aren't stylistic add-ons. They define the moment and drive how it feels. The model doesn't read — it performs.

Common tags for situational use

Audio Tags let you simulate a range of emotional and physical cues:

  • Emotional tone: [EXCITED], [NERVOUS], [FRUSTRATED], [TIRED]
  • Reactions: [GASP], [SIGH], [LAUGHS], [GULPS]
  • Volume & energy: [WHISPERING], [SHOUTING], [QUIETLY], [LOUDLY]
  • Pacing & rhythm: [PAUSES], [STAMMERS], [RUSHED]

Tags can be layered to add nuance: “[NERVOUSLY] I... I’m not sure this is going to work. [GULPS] But let’s try anyway.”
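
As a small sketch of layering in practice, the lines below pair an emotional tag with a reaction cue and join them into a single prompt; the resulting string is what you would pass as the text field in the request shown earlier. The line breaks and tag choices are illustrative, not a fixed syntax.

```python
# Sketch: layer an emotional tag with a reaction cue, then join the
# lines into one prompt string for the text-to-speech request above.
lines = [
    "[NERVOUSLY] I... I'm not sure this is going to work.",
    "[GULPS] But let's try anyway.",
]

script = " ".join(lines)
print(script)
# [NERVOUSLY] I... I'm not sure this is going to work. [GULPS] But let's try anyway.
```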

Performance you can steer

Eleven v3 interprets these tags with deeper contextual understanding. It can shift tone mid-line, handle interruptions, and maintain flow, giving you delivery that feels more natural without rewriting the script.

For voice designers, game developers, and storytellers, this unlocks a new creative layer. You’re not just writing lines. You’re directing them.

Selecting the right voice

Professional Voice Clones (PVCs) are not yet fully optimized for Eleven v3, so clone quality may be lower than with earlier models. During this research preview, it is best to use an Instant Voice Clone (IVC) or a designed voice for projects that need v3 features. PVC optimization for v3 is coming soon.
