
Alibaba has just released Qwen3-TTS, a powerful open-source text-to-speech system designed for speed, fidelity and global accessibility. It is a significant upgrade in speech synthesis technology, aiming to preserve the “soul” of speech, such as emotion and acoustic environment, without sacrificing performance.
Qwen3-TTS uses efficient encoding. With the Qwen3-TTS-Tokenizer-12Hz, the system compresses speech signals into a robust representation that retains tone, rhythm, emotion and acoustic environment features.
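To give a feel for how compact such a representation is, here is a minimal arithmetic sketch. It assumes that the "12Hz" in the tokenizer name means roughly 12 token frames per second of audio, and it uses a 24 kHz sample rate purely as an illustrative figure; neither number is confirmed here, and the function names are hypothetical.

```python
import math

# Assumption: "12Hz" in the tokenizer name is read as 12 token frames per second.
FRAME_RATE_HZ = 12


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover `seconds` of audio."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(seconds * frame_rate_hz)


def samples_per_frame(sample_rate_hz: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """How many raw waveform samples a single token frame stands for."""
    return sample_rate_hz / frame_rate_hz


# Ten seconds of speech at the assumed frame rate.
print(frames_for_duration(10))
# Samples covered by one frame at an assumed 24 kHz sample rate.
print(samples_per_frame(24_000))
```

Under these assumptions, each frame stands in for a couple of thousand waveform samples, which is why a low frame rate keeps the sequence short for the language model that consumes it.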
Qwen3-TTS uses dual-track modelling, an architecture that allows for bidirectional streaming. The system can produce the first packet of audio after analysing just a single character, so playback can start almost instantly.
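The streaming idea can be illustrated with a toy sketch. This is not the Qwen3-TTS API: `stream_packets` and `text_source` are hypothetical names, and the "audio packet" is just placeholder bytes. The only point is the control flow, where output becomes available after one input character has been read instead of after the whole text.

```python
from typing import Iterable, Iterator, List

consumed: List[str] = []  # records which characters have been read so far


def text_source(text: str) -> Iterator[str]:
    """Hypothetical input stream that delivers text one character at a time."""
    for ch in text:
        consumed.append(ch)
        yield ch


def stream_packets(chars: Iterable[str], packet_bytes: int = 4) -> Iterator[bytes]:
    """Toy stand-in for a streaming synthesizer.

    Yields one placeholder packet per character without waiting for the
    rest of the text. A real model would emit encoded audio instead.
    """
    for ch in chars:
        yield ch.encode("utf-8")[:1] * packet_bytes


packets = stream_packets(text_source("Hi!"))
first_packet = next(packets)  # available after a single character was read
print(first_packet, consumed)
```

Because both functions are generators, asking for the first packet pulls exactly one character from the input, which mirrors the low first-packet latency described above.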
Qwen3-TTS uses high-fidelity reconstruction. Unlike many modern systems that rely on heavy Diffusion Transformers (DiT), Qwen3-TTS uses a lightweight non-DiT architecture to achieve high-fidelity audio with minimal latency.
Qwen3-TTS has multilingual support, covering up to 10 languages (English, Chinese, Japanese, European languages) and various dialects. The model understands context, adapting its rhythm and emotional expression to the text’s meaning and to specific user instructions. Qwen3-TTS is also highly resistant to noise in the input text, producing smooth audio even when the text is messy.
The series is available in two open-source versions, a 1.7B model and a 0.6B model. The 1.7B model is the flagship version, offering maximum control and peak audio performance. The 0.6B model is a streamlined version optimized for efficiency while maintaining high quality.
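For a rough sense of what the two sizes mean in practice, the sketch below estimates the memory needed just to hold the weights. It assumes "1.7B" and "0.6B" are parameter counts in billions and that weights are stored in a 16-bit format (2 bytes per parameter). It ignores activations, caches, the tokenizer and any other runtime overhead, so treat the numbers as a lower bound rather than a requirement. The names used are hypothetical.

```python
# Assumption: the model names give parameter counts in billions.
MODEL_PARAMS = {
    "1.7B": 1_700_000_000,
    "0.6B": 600_000_000,
}


def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    bytes_per_param=2 corresponds to 16-bit weights, which is an assumption.
    """
    return num_params * bytes_per_param / 1e9


for name, params in MODEL_PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights at 16-bit precision")
```

On this estimate the smaller model needs roughly a third of the weight memory of the flagship, which is consistent with its positioning as the efficiency-oriented option.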
