VibeVoice Text-to-Speech

A novel framework for generating expressive, long-form, multi-speaker conversational audio from text. It combines ultra-low frame rate tokenizers with next-token diffusion to synthesize up to 90 minutes of high-quality speech with up to 4 distinct speakers.

Key Features

  • Long-form conversational audio (up to 90 minutes)
  • Multi-speaker support (up to 4 distinct speakers)
  • Ultra-low frame rate tokenizers (7.5 Hz)
  • Next-token diffusion framework

🎁 Experience the future of conversational text-to-speech technology

Try VibeVoice Online

Experience the power of VibeVoice text-to-speech directly in your browser. No installation required.

What is VibeVoice

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio from text. It addresses significant challenges in traditional TTS systems, particularly in scalability, speaker consistency, and natural turn-taking.

  • Continuous Speech Tokenizers
    Uses acoustic and semantic tokenizers operating at an ultra-low frame rate of 7.5 Hz, preserving audio fidelity while keeping long sequences computationally manageable.
  • Next-Token Diffusion Framework
    Uses a large language model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
  • Long-Form Multi-Speaker Support
    Synthesizes speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limit of prior models.

Benefits

Why Choose VibeVoice

VibeVoice targets the scalability, speaker consistency, and natural turn-taking that long, multi-speaker conversations demand from a text-to-speech system.

Ultra-Long Form Generation

Generate conversational audio up to 90 minutes long, perfect for podcasts, interviews, and extended dialogues.

Multi-Speaker Conversations

Computational Efficiency

What makes VibeVoice special

VibeVoice pairs continuous speech tokenizers with a next-token diffusion architecture to generate long, multi-speaker conversational audio from text.

Continuous Speech Tokenizers

Acoustic and semantic tokenizers operate at an ultra-low 7.5 Hz frame rate for efficient long-sequence processing
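
To make the frame-rate claim concrete, the short calculation below counts how many tokenizer frames 90 minutes of audio needs at 7.5 Hz. The 50 Hz comparison rate is an assumption chosen purely for illustration; it does not come from this page, and the helper name frames_for is hypothetical.

```python
FRAME_RATE_HZ = 7.5        # stated on this page
MAX_MINUTES = 90           # stated on this page
COMPARISON_RATE_HZ = 50.0  # assumption: an illustrative higher-rate tokenizer


def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Return the number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


vibevoice_frames = frames_for(MAX_MINUTES, FRAME_RATE_HZ)
comparison_frames = frames_for(MAX_MINUTES, COMPARISON_RATE_HZ)

print(vibevoice_frames)   # 40500
print(comparison_frames)  # 270000
```

Under these assumptions, a 90-minute session fits in 40,500 frames per tokenizer stream, which is the kind of sequence length a language model can plausibly attend over.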

Next-Token Diffusion Framework

Combines LLM understanding with a diffusion head for high-fidelity acoustic detail generation
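
The control flow behind this idea can be sketched as a toy loop: for each frame, a context model summarizes what has been generated so far, and a per-frame head refines random noise into the next latent over several steps. Everything below is a hypothetical stand-in with no learned weights (toy_context, toy_target, toy_diffusion_head, LATENT_DIM, and DENOISE_STEPS are all invented for illustration). It shows only the shape of the loop, not VibeVoice's actual model or API.

```python
import random

LATENT_DIM = 4     # hypothetical latent size
DENOISE_STEPS = 8  # hypothetical number of refinement steps per frame


def toy_context(history: list[list[float]]) -> list[float]:
    """Stand-in for the LLM: condense previous latents into a conditioning vector.

    Here it is simply the per-dimension mean of the history (zeros if empty).
    """
    if not history:
        return [0.0] * LATENT_DIM
    return [sum(column) / len(history) for column in zip(*history)]


def toy_target(condition: list[float]) -> list[float]:
    """Stand-in for the point a trained head would steer the latent toward."""
    return [c * 0.5 + 1.0 for c in condition]


def toy_diffusion_head(condition: list[float], rng: random.Random) -> list[float]:
    """Start from Gaussian noise and move toward the target in small steps."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    target = toy_target(condition)
    for step in range(DENOISE_STEPS):
        remaining = DENOISE_STEPS - step
        # Close 1/remaining of the gap each step, so the last step lands on the target.
        latent = [x + (t - x) / remaining for x, t in zip(latent, target)]
    return latent


def generate(num_frames: int, seed: int = 0) -> list[list[float]]:
    """Autoregressive loop: each new latent is conditioned on all previous ones."""
    rng = random.Random(seed)
    history: list[list[float]] = []
    for _ in range(num_frames):
        condition = toy_context(history)
        history.append(toy_diffusion_head(condition, rng))
    return history


frames = generate(3)
print(len(frames), len(frames[0]))
```

In a real system the context model and the head would be trained networks and the latents would be decoded to audio; the sketch keeps only the order of operations.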

Long-Form Multi-Speaker Support

Generate up to 90 minutes of audio with up to 4 distinct speakers in natural conversations

Expressive Conversational Audio

Designed specifically for podcasts, interviews, and multi-speaker dialogues with natural turn-taking

Scalable Architecture

Addresses traditional TTS challenges in scalability, speaker consistency, and natural dialogue flow

Research Framework

Open-source research framework intended to advance collaboration in the speech synthesis community

FAQ

Frequently Asked Questions About VibeVoice

Have another question? Contact us by email.

1. What is VibeVoice designed for?

VibeVoice is designed for generating expressive, long-form, multi-speaker conversational audio such as podcasts, interviews, and extended dialogues from text input.

2. How long can VibeVoice generate audio?

VibeVoice can synthesize speech up to 90 minutes long, significantly longer than traditional TTS systems, which typically handle much shorter sequences.

3. How many speakers can VibeVoice handle?

VibeVoice supports up to 4 distinct speakers in a single conversation, with natural turn-taking and speaker consistency throughout the entire audio.
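
Before submitting a long transcript, it can help to check that it stays within the 4-speaker limit. The sketch below assumes each turn is written on its own line as "Speaker <n>: <text>". That format is an assumption made for illustration, since this page does not define the input format, and parse_script is a hypothetical helper rather than part of VibeVoice.

```python
import re

MAX_SPEAKERS = 4  # limit stated on this page
# Assumption: one turn per line, written as "Speaker <n>: <text>".
TURN_PATTERN = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a transcript into (speaker_id, text) turns and enforce the speaker limit."""
    turns: list[tuple[int, str]] = []
    for line_number, line in enumerate(script.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        match = TURN_PATTERN.match(line)
        if match is None:
            raise ValueError(f"line {line_number}: expected 'Speaker <n>: <text>'")
        turns.append((int(match.group(1)), match.group(2)))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, limit is {MAX_SPEAKERS}")
    return turns


example = """Speaker 1: Welcome back to the show.
Speaker 2: Thanks for having me.
Speaker 1: Let's get started."""

for speaker, text in parse_script(example):
    print(speaker, text)
```

Rejecting an over-limit script up front is cheaper than discovering the problem partway through a long generation run.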

4. What languages does VibeVoice support?

VibeVoice currently supports English and Chinese. Transcripts in other languages may result in unexpected audio outputs.

5. Is VibeVoice suitable for commercial use?

VibeVoice is intended for research and development purposes only. We do not recommend using it in commercial or real-world applications without further testing and development.

Ready to try VibeVoice?

Experience the power of conversational text-to-speech technology.