MOSS‑TTS Family: Open-Source Speech and Sound Synthesis for Real-World AI

MOSS‑TTS is a family of open‑source models built by MOSI.AI and the OpenMOSS team. It targets high‑fidelity, expressive, context‑aware speech and sound generation, covering long‑form speech, multi‑speaker dialogue, voice and character design, environmental sound effects, and real‑time streaming TTS. The models are modular: each can run standalone or be composed into an end‑to‑end pipeline for demanding production scenarios.

The MOSS‑TTS Family at a Glance

The MOSS‑TTS Family is a coordinated set of five production‑grade models that cover a wide range of real‑world needs. Each model is designed to be used independently or together to form a complete voice pipeline:

MOSS‑TTS: The flagship production model. It delivers high‑fidelity, expressive synthesis with robust zero‑shot voice cloning capabilities. It supports long speech, fine‑grained control over pronunciation (Pinyin, phonemes), duration, and multilingual or code‑switched synthesis.
MOSS‑TTSD: A spoken dialogue engine for expressive, multi‑speaker conversations and ultra‑long dialogues. The v1.0 release reports top objective metrics and strong subjective performance against leading closed‑source models.
MOSS‑VoiceGenerator: A voice design model that generates diverse voices and styles directly from text prompts, without requiring reference speech. It serves as a voice design layer for downstream TTS.
MOSS‑TTS‑Realtime: A multi‑turn, context‑aware TTS engine for real‑time voice agents. It uses incremental synthesis to keep replies natural and coherent across turns while keeping latency low.
MOSS‑SoundEffect: A specialized model that generates sound effects across broad categories—from natural environments to human actions and musical fragments—ideal for film, games, and interactive experiences.

In addition, the family relies on specialized components like MOSS‑Audio‑Tokenizer, and optional backends such as llama.cpp for torch‑free inference and SGLang for accelerated inference. The combination provides a complete toolkit for production‑level TTS and audio generation.

News and Milestones

The ecosystem has a vibrant release and update cadence, with notable milestones in 2026:

May 26, 2026: Release of MOSS‑SoundEffect‑v2.0, a 48 kHz bilingual text‑to‑audio model with a DiT backbone and Flow Matching, capable of generating sound effects up to 30 seconds.
May 26, 2026: Release of MOSS‑TTS‑v1.5, featuring stronger multilingual synthesis when language tags are provided, more stable voice cloning, improved long‑reference/short‑text cloning, punctuation‑driven prosody, and explicit pause control via [pause X.Ys].
May 6, 2026: MOSS‑TTS and MOSS‑Audio‑Tokenizer gain support in mlx‑audio, the MLX‑based audio library for Apple silicon.
April 29, 2026: MOSS‑TTS 2.0 on the horizon; community feedback is being gathered to shape the next generation.
April 13, 2026: MOSS‑TTS‑Nano, a compact ~100M parameter model, becomes available, featuring multilingual voice cloning, 48 kHz stereo input/output, and streaming output on only four CPU cores.
March 18–31, 2026: A series of arXiv papers and tutorials land, including the MOSS‑TTSD technical report, the MOSS‑VoiceGenerator paper, and end‑to‑end guides for fine‑tuning and deployment.
March 18, 2026: A first‑class llama.cpp end‑to‑end implementation emerges, with runnable pipelines for GGUF backbone inference and ONNX audio decoding.
February 10, 2026: The project announces the launch of the MOSS‑TTS family on the HuggingFace ecosystem and ModelScope, alongside a vibrant blog and demo pages.

These releases illustrate a clear emphasis on production readiness, multilingual capabilities, voice design, and real‑time streaming performance.

Model Architecture: Core Concepts

The MOSS‑TTS family leverages two complementary baselines to guide deployment and research:

MossTTSDelay: Focuses on long‑context stability, inference speed, and production readiness.
MossTTSLocal: Emphasizes lightweight flexibility and strong objective performance for streaming‑oriented systems.

Additionally, MossTTSRealtime is a capability‑driven design for voice agents, modeling multi‑turn context from both prior text and user acoustics to deliver low‑latency, coherent speech across turns.

Key architectural ideas:

A multi‑head, delay‑pattern approach for robust long‑form generation (Delay).
Time‑synchronous RVQ blocks within a transformer backbone for streaming consistency (Local).
A hierarchical, context‑aware design for real‑time interactions (Realtime).
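To make the delay‑pattern idea concrete, here is a minimal sketch of the generic technique used by multi‑codebook audio language models: codebook k is shifted right by k frames so that all codebooks can be predicted in a single autoregressive pass. The padding id and function names are hypothetical, and the exact layout used by MossTTSDelay is not confirmed by this overview.

```python
from typing import List

PAD = -1  # hypothetical padding id, chosen only for this illustration


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k frames. codes[k][t] is codebook k at frame t."""
    n_books = len(codes)
    n_frames = len(codes[0])
    total = n_frames + n_books - 1  # extra steps needed to flush the last codebook
    delayed = []
    for k, row in enumerate(codes):
        delayed.append([pad] * k + list(row) + [pad] * (total - n_frames - k))
    return delayed


def undo_delay_pattern(delayed: List[List[int]], n_frames: int) -> List[List[int]]:
    """Invert apply_delay_pattern by slicing each row back to its own frames."""
    return [row[k:k + n_frames] for k, row in enumerate(delayed)]


# Three codebooks, three frames.
codes = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed, n_frames=3)
for row in delayed:
    print(row)
```

At each decoding step the model then emits one token per codebook, with coarser codebooks running ahead of finer ones.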

A brief view of the architectural family shows how these components complement each other to cover a wide range of use cases—from long, coherent monologues to on‑the‑fly dialogue with voice consistency.

The architecture details for MossTTSDelay and MossTTSLocal, along with the related documentation, cover the deeper technical specifics.

Released Models: What’s Available

The MOSS‑TTS family ships several model variants, each tuned for specific deployment goals:

MOSS‑TTS‑v1.5: Uses the MossTTSDelay architecture with 8B parameters. It supports multilingual synthesis with language tags, improved cloning stability, long‑reference handling, and explicit pause control.
MOSS‑TTS‑v1.0: The original 8B‑parameter MossTTSDelay variant, with HuggingFace and ModelScope coverage.
MOSS‑TTS‑Local‑Transformer: A 1.7B parameter variant focusing on streaming performance and solid objective metrics, with HuggingFace and ModelScope availability.
MOSS‑TTSD‑V1.0: An 8B model variant for spoken dialogue generation with strong objective results and competitive subjective quality.
MOSS‑VoiceGenerator: A 1.7B model designed for voice design and synthesis from text alone, outperforming other voice design models on arena ratings.
MOSS‑SoundEffect: An 8B model for sound‑effect generation. The separate v2.0 pipeline uses a DiT backbone with a dedicated audio tokenizer.
MOSS‑SoundEffect‑v2.0: A 1.3B DiT pipeline with high‑quality audio reconstruction for sound effects.
MOSS‑TTS‑Realtime: A 1.7B model optimized for real‑time, multi‑turn voice agents with low latency.
MOSS‑TTS‑Nano: A tiny ~100M model designed for CPU‑first, streaming deployment, with multilingual voice cloning, 48 kHz stereo I/O, and streaming output on as few as four CPU cores.

For quick reference, model cards and pages are hosted on HuggingFace and ModelScope, with example model cards and versions linked in the release notes.

Supported Languages

MOSS‑TTS v1.5 currently supports 31 languages, expanding beyond the original 20 languages of MOSS‑TTS v1.0. The expansion includes Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, Vietnamese, and more. The TTSD and TTS‑Realtime components publish their own language coverage in their respective model cards.

A language table in the project materials lists the supported languages with their codes and flags, reflecting multilingual capabilities that include code‑switching and cross‑lingual synthesis. Coverage is actively extended through continued multilingual training and language tagging to improve pronunciation accuracy, prosody, and intelligibility in each target language.

MOSS‑TTS v1.5: What’s New

Compared with the original MOSS‑TTS, v1.5 emphasizes several practical improvements:

Stronger multilingual synthesis with explicit language tags to guide pronunciation and prosody.
More stable voice cloning, with reduced variance across repeated generations.
Better long‑reference / short‑text cloning for more reliable identity transfer when reference audio length varies.
More stable punctuation‑driven prosody, delivering more natural pauses and intonation in longer sentences.
Explicit pause control using inline markers like [pause 3.2s], enabling precise rhythm in generated speech.

These updates aim to deliver more usable, production‑grade results while preserving the flexibility and openness that users expect from the MOSS‑TTS ecosystem.
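Because the pause marker is plain inline text, it is easy to validate or inspect on the client side before sending a prompt to the model. The sketch below assumes the marker grammar is exactly `[pause <seconds>s]` as in the example above; the helper names are hypothetical, and the model itself is what interprets the marker during synthesis.

```python
import re
from typing import List, Union

# Assumed grammar: "[pause 3.2s]" with a non-negative decimal number of seconds.
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def split_pauses(text: str) -> List[Union[str, float]]:
    """Split text into spoken chunks (str) and pause durations in seconds (float)."""
    parts: List[Union[str, float]] = []
    pos = 0
    for match in PAUSE_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            parts.append(chunk)
        parts.append(float(match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(tail)
    return parts


def total_pause_seconds(text: str) -> float:
    """Sum all explicit pauses, e.g. to estimate the minimum output duration."""
    return sum(p for p in split_pauses(text) if isinstance(p, float))


print(split_pauses("Hello. [pause 3.2s] Welcome back."))
```

A check like this is useful for rejecting malformed markers early or for budgeting total duration when pauses are combined with duration control.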

Quickstart: Getting Hands‑On

OpenClaw API Skills

The MOSS‑TTS skills are available in ClawHub under OpenClaw, with API access via MOSI AI Studio. Practical entries include:
feishu‑voice‑tts: Send voice messages in Feishu.
moss‑tts‑voice: Call the MOSS‑TTS API to generate speech.

Environment Setup

The recommended workflow uses a clean Python environment with Transformers 5.0.0 to avoid conflicts.
Conda and uv are supported options; both paths guide you through installing dependencies, cloning the repository, and installing the torch runtime.
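A small preflight check can catch a mismatched Transformers install before any weights are downloaded. The sketch below uses only the Python standard library; the package name and the 5.0.0 pin come from the text above, while the helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional

PINNED_TRANSFORMERS = "5.0.0"  # version recommended in the setup notes above


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str) -> bool:
    """Compare versions, ignoring any local build suffix such as '+cu121'."""
    return installed is not None and installed.split("+")[0] == pinned


found = installed_version("transformers")
if matches_pin(found, PINNED_TRANSFORMERS):
    print("transformers matches the recommended version")
else:
    print(f"transformers is {found!r}; recommended is {PINNED_TRANSFORMERS}")
```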

Optional acceleration and speed enhancements

FlashAttention 2 can be installed if your hardware supports it, to reduce peak VRAM usage and accelerate inference.
For machines with limited memory or many CPU cores, build parallelism can be capped (MAX_JOBS=4) to keep builds manageable.

Torch‑free and hardware‑accelerated paths

Torch‑free inference with llama.cpp + ONNX Runtime is supported for edge deployments, with quantized GGUF weights available for backbone models and ONNX audio tokenizers.
SGLang backend (Accelerated Inference) provides an end‑to‑end fused path for MOSS‑TTS Delay, enabling efficient inference with the SGLang ecosystem.

Quick Start Snippet (high level)

Install dependencies, download weights, and run a simple CLI or Python invocation to generate speech from text.
The ecosystem provides a range of sample scripts for four main models (MOSS‑TTS, MOSS‑TTSD, MOSS‑VoiceGenerator, MOSS‑SoundEffect) to demonstrate direct generation, voice cloning, duration control, and explicit pauses.

MOSS‑TTS Basic Usage

The system offers a generation interface that handles Chinese, English, multilingual text with language tags, Pinyin, and IPA.
It supports voice cloning with or without references, duration control via tokens, and explicit pause markers for timing control.

llama.cpp Backend (Torch‑Free Inference)

A torch‑free path using llama.cpp enables lightweight deployment on edge devices.
Configs control the backend choices (heads_backend, audio_backend) and low‑memory behavior to fit variable VRAM environments.

SGLang Backend (Accelerated Inference)

SGLang enables highly efficient inference by fusing the MOSS‑TTS model with the audio tokenizer for faster generation, especially under production‑scale loads.

Model Weights and Configurations

The ecosystem provides weights for GGUF backbones and ONNX audio tokenizers, with configuration presets (default.yaml, trt.yaml, trt‑8gb.yaml, cpu‑only.yaml) to tailor resource usage.
Several weight quantization options (e.g., q8_0, q4_0) reduce VRAM use, allowing more model instances to run in parallel on the same hardware.
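As a rough guide to what quantization buys, the arithmetic below estimates weight storage for an 8B‑parameter backbone. It assumes every weight is stored in the named format and uses the nominal bits per weight of the ggml block layouts (32 weights plus one 16‑bit scale per block for Q8_0 and Q4_0). Real GGUF files keep some tensors at higher precision, and the estimate ignores the KV cache and activations, so treat the numbers as lower bounds.

```python
# Nominal bits per weight: Q8_0 = 34 bytes per 32 weights, Q4_0 = 18 bytes per 32 weights.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def weight_gib(n_params: float, quant: str) -> float:
    """Approximate weight storage in GiB under a uniform-quantization assumption."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


for quant in ("f16", "q8_0", "q4_0"):
    print(f"{quant}: {weight_gib(8e9, quant):.2f} GiB")
```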

Reference Examples

The Quick Start sections include representative use cases, such as direct generation, voice cloning, and explicit pauses. The examples illustrate how language tags, reference audio, and duration tokens influence the output.

MOSS‑TTS Realtime and Dialogue

MOSS‑TTS‑Realtime is designed for multi‑turn contexts, enabling coherent replies with low latency. When it is paired with an LLM (e.g., one served via vLLM), the combined system can be profiled for time to first sentence and related latency metrics, which determine how responsive a live agent feels.
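Time to first sentence can be measured on any streamed LLM output by accumulating tokens until the first sentence terminator appears. The sketch below is a generic illustration with a simulated token stream; the function name, the terminator set, and the injectable clock are assumptions, not part of the MOSS‑TTS or vLLM APIs.

```python
import re
import time
from typing import Callable, Iterable, Optional, Tuple

# Assumed sentence terminators (Latin and CJK); a real system may need smarter rules.
SENTENCE_END = re.compile(r"[.!?。！？]")


def first_sentence_latency(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[str], float]:
    """Return the first complete sentence and the seconds it took to arrive."""
    start = clock()
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        if match:
            return buffer[:match.end()], clock() - start
    # Stream ended without a terminator: return whatever text arrived.
    return (buffer or None), clock() - start


def fake_stream():
    """Stand-in for a streaming LLM response."""
    for token in ["Hel", "lo", " there", ".", " How", " are", " you?"]:
        yield token


sentence, seconds = first_sentence_latency(fake_stream())
print(sentence, f"{seconds:.6f}s")
```

The first sentence is the earliest point at which a streaming TTS engine has enough text to start speaking, which is why it is tracked separately from total LLM generation time.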

MOSS‑TTS‑Nano: Lightweight, CPU‑First Real‑Time TTS

Introduction

MOSS‑TTS‑Nano is a compact model designed for CPU‑first real‑time deployment. It focuses on essential aspects of speech generation—low footprint, streaming delivery, and competent voice cloning—without requiring GPUs.
It is built on a pure autoregressive Audio Tokenizer + LLM pipeline, maintaining a simple deployment stack while staying capable for local demos and lightweight production use.

Key Features

Approximately 0.1B parameters, enabling low memory usage and cost‑effective hosting.
Real‑time streaming on as few as four CPU cores, making it practical for CPU‑only environments.
Multilingual voice cloning support, enabling cross‑language synthesis from a single reference speaker.
48 kHz stereo input/output to preserve high fidelity in reference and final audio.

Architecture Visual

Image: MOSS‑TTS‑Nano architecture, illustrating how the lightweight design achieves streaming capability while maintaining voice quality.

Model Weights and Availability

MOSS‑TTS‑Nano is available on HuggingFace and ModelScope with model cards and previews to help you begin experimentation quickly.

MOSS‑Audio‑Tokenizer: A Unified Audio Interface

Introduction

MOSS‑Audio‑Tokenizer is the unified discrete audio interface that powers the entire MOSS‑TTS family. It uses the CAT architecture (Causal Audio Tokenizer with Transformer), a 1.6B‑parameter, CNN‑free, homogeneous tokenizer built from causal Transformer blocks.
It provides a single, shared audio representation across MOSS‑TTS, MOSS‑TTSD, MOSS‑VoiceGenerator, MOSS‑SoundEffect, and MOSS‑TTS‑Realtime.

Key Capabilities

Unified discrete bridge across all family models for consistent audio representation.
Extreme compression with high fidelity: 24 kHz audio compressed to as low as 12.5 Hz frame rate using 32‑layer RVQ, with flexible bitrates from 0.125 kbps to 4 kbps.
Massively scaled audio training: trained on 3 million hours of diverse data (speech, sound effects, music).
Native streaming design for low‑latency inference and production workflows.
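The bitrate range quoted above follows directly from the frame rate and the number of RVQ codebooks. The sketch below works through the arithmetic; the codebook size of 1024 entries (10 bits per code) is an assumption inferred from the stated 0.125 kbps floor, not a figure given in this overview.

```python
import math

SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
MAX_CODEBOOKS = 32
CODEBOOK_SIZE = 1024  # assumption: 10 bits per code, inferred from 0.125 kbps at 1 codebook


def bitrate_kbps(n_codebooks: int) -> float:
    """Bitrate when only the first n_codebooks RVQ layers are kept."""
    if not 1 <= n_codebooks <= MAX_CODEBOOKS:
        raise ValueError("n_codebooks must be between 1 and 32")
    bits_per_frame = n_codebooks * math.log2(CODEBOOK_SIZE)
    return FRAME_RATE_HZ * bits_per_frame / 1000.0


samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # waveform samples covered by one token frame
print(samples_per_frame)
for n in (1, 8, 32):
    print(n, bitrate_kbps(n))
```

Under that assumption, one codebook gives the 0.125 kbps floor and all 32 give the 4 kbps ceiling, with each frame of tokens covering 1920 waveform samples.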

Architecture Image

Image: MOSS‑Audio‑Tokenizer architecture, showing the tokenizer’s structure and its role in the system.

Weights and Access

The tokenizer weights and corresponding model weights are hosted on HuggingFace and ModelScope, enabling straightforward integration with the TTS system.

Objective Reconstruction and Evaluation

The MOSS Audio Tokenizer is evaluated against open‑source tokenizers on LibriSpeech test‑clean, using metrics such as SIM, STOI, and PESQ, with bitrate controlled via the number of RVQ codebooks.

Evaluation image: LibriSpeech audio tokenizer metrics

Evaluation and Community: What People Experience

Evaluation in the MOSS‑TTS ecosystem spans objective metrics, subjective judgments, and practical, real‑world benchmarks:

Objective evaluation highlights for MOSS‑TTSD show strong speaker attribution accuracy, speaker similarity, and low word error rate compared with both open‑source and proprietary baselines.
Subjective evaluation uses ranking and Elo‑based methods to measure overall preference, voice similarity, prosody, and quality, with open‑source models and proprietary systems contrasted in user studies.
Visual comparison assets show how MOSS‑TTSD and related models stack up against various competitors, highlighting the strengths and areas for future improvement.
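For readers unfamiliar with Elo‑based ranking, the sketch below shows the standard Elo update applied to one pairwise listening comparison. The K‑factor of 32 and the 1500 starting rating are conventional defaults chosen for illustration; the exact protocol used in the MOSS‑TTS evaluations is not specified in this overview.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that system A is preferred over system B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# One comparison between two equally rated systems, where A wins.
a, b = update_elo(1500.0, 1500.0, score_a=1.0)
print(a, b)
```

Repeating this update over many listener judgments yields a ranking in which rating gaps translate into expected preference rates.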

Images: open‑source vs. proprietary comparison; proprietary comparison; voice generator win rate.

MOSS‑VoiceGenerator emphasizes the ability to create voices with naturalness and instruction following, as reflected in user preferences and model ratings.

MOSS‑TTS‑Realtime demonstrates strong low‑latency performance, including measured TTFB and Real‑Time Factor metrics on high‑end GPUs, with a detailed breakdown of the LLM‑first sentence timing in combined systems.
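TTFB (time to first byte of audio) and Real‑Time Factor (synthesis time divided by audio duration) can both be measured on any chunked audio stream. The sketch below assumes 16‑bit mono PCM at 24 kHz and uses a simulated chunk generator; the function name and stream format are hypothetical, not the MOSS‑TTS‑Realtime API.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    sample_rate: int = 24_000,
    bytes_per_sample: int = 2,
    channels: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttfb_seconds, real_time_factor) for a stream of raw PCM chunks."""
    start = clock()
    ttfb = None
    n_bytes = 0
    for chunk in chunks:
        if ttfb is None:
            ttfb = clock() - start  # latency until the first audio arrives
        n_bytes += len(chunk)
    elapsed = clock() - start
    audio_seconds = n_bytes / (sample_rate * bytes_per_sample * channels)
    if ttfb is None or audio_seconds == 0:
        raise ValueError("stream produced no audio")
    return ttfb, elapsed / audio_seconds


def fake_audio(n_chunks: int = 4):
    """Stand-in for a streaming TTS response: 0.5 s of silence per chunk."""
    for _ in range(n_chunks):
        yield b"\x00" * 24_000  # 12,000 samples * 2 bytes


ttfb, rtf = measure_stream(fake_audio())
print(f"TTFB={ttfb:.6f}s RTF={rtf:.6f}")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the basic requirement for uninterrupted streaming.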

MOSS‑TTS‑Nano shows compelling performance for CPU‑based deployment, confirming that real‑time, multilingual, and cloneable voices can be produced without GPUs in many practical scenarios.

More Information and Community

Community projects and ecosystem growth are a vital part of MOSS‑TTS:

ComfyUI‑MOSS‑TTS: A ComfyUI extension enabling visual workflows for MOSS‑TTS.
MOSS‑TTS‑OpenAI: An OpenAI‑style API wrapper for MOSS‑TTS.
AnyPod: A podcast generation tool that uses MOSS‑TTS and MOSS‑TTSD as the backend.
Norwegian LoRA for MOSS‑TTS: A community‑trained LoRA adapter fine‑tuned on Norwegian speech datasets, enabling domain‑specific voice customization.

LoRA weights and training scripts are available through community channels to encourage experimentation and customization.

Licensing, Citations, and Community Acknowledgments

Licensing: Models in the MOSS‑TTS Family are released under the Apache License 2.0.
Citations: Technical reports and arXiv papers detail the architecture, training, and evaluation results:
MOSS‑TTS Technical Report (arXiv:2603.18090)
MOSS‑TTSD: Text‑to‑Spoken Dialogue
MOSS‑VoiceGenerator: Create Realistic Voices with Natural Descriptions
Star History: The project’s star history chart provides a visual timeline of community engagement and project popularity.

Visual Tour: Selected Images You’ll See in the Stack

The images referenced throughout this overview are:
A central product image for the MOSS‑TTS family overview.
The MOSS‑TTS‑Nano architecture, highlighting its CPU‑friendly streaming approach.
The MOSS‑Audio‑Tokenizer architecture, illustrating the 1.6B, CNN‑free, causal Transformer design.
Evaluation visuals comparing open‑source and proprietary models.
Voice design performance.
Audio tokenization and objective reconstruction results.

Quick Recap: Why the MOSS‑TTS Family Matters

Real‑world readiness: Five core models cover long‑form speech, dialogue, voice design, real‑time agents, and environmental sounds, all designed for stability and production deployment.
Multilingual ambition: 31 languages supported in v1.5, with continued training to broaden coverage and improve pronunciation and prosody across languages.
Flexible deployment: Torch‑free inference with llama.cpp, optional FlashAttention, and SGLang accelerated backends provide multiple paths to deployment across devices and scales.
Lightweight options: MOSS‑TTS‑Nano demonstrates what is possible with a tiny parameter budget and CPU‑friendly streaming, opening up local demos and edge deployments.
Unified audio representation: MOSS‑Audio‑Tokenizer provides a single, high‑fidelity audio representation across all models, enabling cohesive pipelines and consistent quality.

If you’re exploring open‑source TTS, the MOSS‑TTS Family offers a rich toolkit—from novice experiments to production‑grade deployments. The combination of high fidelity, expressive control, multilingual support, and real‑time capabilities makes it a compelling option for researchers, developers, and creative teams looking to bring natural, engaging speech and sound to their applications.

For further details, you can browse the project pages on HuggingFace and ModelScope, read the arXiv technical reports, and explore the community projects that extend the ecosystem with practical tooling and demos. The community is actively expanding, and the landscape continues to evolve with new models, tools, and usage patterns that push the boundaries of open‑source speech and audio generation.
