ChatTTS is a generative text-to-speech model designed for everyday dialogue scenarios such as large language model (LLM) assistants. It supports both English and Chinese and was trained on more than 100,000 hours of speech in those two languages.
The open-source version on HuggingFace is a smaller, 40,000-hour pre-trained model released without supervised fine-tuning (SFT). Even with fewer training hours, it gives researchers and developers a base to experiment with and to build on the work of the ChatTTS development team.

Image: Advanced text-to-speech model for daily dialogue scenarios
Key Features
Bilingual Capability: ChatTTS supports both English and Chinese, so a single model can serve applications aimed at either audience.
Extensive Training Data: The full model was trained on more than 100,000 hours of English and Chinese speech, with the goal of natural-sounding conversational output.
Open-Source Availability: The model is accessible on HuggingFace as a 40,000-hour pre-trained version, fostering community engagement and collaborative improvement.
Ethical Considerations and Limitations
The team behind ChatTTS places a strong emphasis on **responsible and ethical use** of the model. To limit potential misuse, they added a small amount of high-frequency noise during training and compressed the audio quality using the MP3 format. These measures deliberately lower output quality so that the technology is harder for malicious actors to exploit.
In addition to these safeguards, the team has developed a detection model internally, which they plan to open-source in the future. This will further aid in identifying and mitigating any misuse of the ChatTTS model, reflecting their commitment to ethical AI development.
Usage Roadmap and Technical Requirements
Generating a 30-second audio clip requires at least 4GB of GPU memory. On a 4090D GPU, the model generates audio corresponding to roughly 7 semantic tokens per second, with a Real-Time Factor (RTF) of around 0.65. Users should confirm they have comparable GPU resources before deploying the model.
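As a rough illustration of what an RTF of around 0.65 implies, the sketch below assumes the conventional definition of RTF (processing time divided by the duration of the generated audio). The article does not define the term, and the function names here are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_processing_seconds(audio_seconds: float, rtf: float = 0.65) -> float:
    """Estimate generation time for a clip, given an RTF (0.65 is the figure quoted above)."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be non-negative and rtf must be positive")
    return audio_seconds * rtf


if __name__ == "__main__":
    clip = 30.0  # seconds of audio, matching the 30-second example above
    estimate = estimated_processing_seconds(clip)
    print(f"Estimated generation time for a {clip:.0f}s clip: {estimate:.1f}s")
```

Under that assumption, an RTF below 1.0 means the model produces audio faster than it plays back, and a 30-second clip would take roughly 19.5 seconds to generate on the hardware described.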
Additional Features and Future Updates
Currently, the released model includes limited token-level control units such as [laugh], [uv_break], and [lbreak]. Future versions are expected to introduce additional emotional control capabilities, expanding the range of expressive output and enhancing interaction quality. With continuous development, the ChatTTS model promises to become even more adept at generating realistic and responsive dialogue.
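To make the idea of token-level control concrete, here is a minimal sketch of how input text might be checked before it is passed to the model. It only handles the three tokens named above; the helper name, the regular expression, and the sample sentences are assumptions for illustration, and the model's own inference call is intentionally not shown.

```python
import re

# The three control tokens the article lists for the released model.
SUPPORTED_TOKENS = {"[laugh]", "[uv_break]", "[lbreak]"}

# Assumption: a control token is a bracketed run of letters, digits, or underscores.
TOKEN_PATTERN = re.compile(r"\[[A-Za-z0-9_]+\]")


def find_unsupported_tokens(text: str) -> list[str]:
    """Return bracketed tokens in `text` that are not in SUPPORTED_TOKENS."""
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in SUPPORTED_TOKENS]


if __name__ == "__main__":
    ok_text = "What is [uv_break]your favorite English food?[laugh][lbreak]"
    bad_text = "That is so sad [cry] to hear.[lbreak]"

    print(find_unsupported_tokens(ok_text))   # no unsupported tokens expected
    print(find_unsupported_tokens(bad_text))  # flags the hypothetical [cry] token
```

A check like this is useful mainly as a guard: text containing tokens the released model does not recognize can be caught before synthesis rather than discovered in the audio.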
Acknowledgements and Community Interaction
The authors of ChatTTS encourage academic and research use, stressing that the repo is meant solely for these purposes. They welcome contributions and issue submissions through GitHub, fostering a collaborative environment where improvements and innovations can thrive.
Remember these 3 key ideas for your startup:
Empower Global Communication: Incorporate ChatTTS into your customer service or communication tools to give both English- and Chinese-speaking users natural spoken responses.
Innovate Responsibly: Leverage the ethical and responsible AI practices embedded within ChatTTS, setting a standard for how technology can be advanced while mitigating risks of misuse.
Maximize Open-Source Potential: Engage with the open-source community around ChatTTS available on platforms like HuggingFace to customize, improve, and stay at the forefront of generative speech technologies, tailoring solutions specific to your business needs.
By integrating ChatTTS into your operations, your startup can significantly enhance its communication capabilities, streamline workflow, and adhere to ethical AI practices, setting a benchmark in the industry.
For more details, see the original source.