CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

[Paper] [Code] [HuggingFace] [ModelScope]


FunAudioLLM Team

SpeechLab@Tongyi, Alibaba Group

Abstract: In our previous work, we proposed CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By performing progressive semantic decoding with two popular generative models, language models (LMs) and flow matching, CosyVoice achieved high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, multi-modal large language models (LLMs) have advanced significantly, and in these systems the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. In this work, we therefore introduce an improved streaming speech synthesis model, CosyVoice 2, with comprehensive and systematic optimizations. First, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. Second, we simplify the architecture of the text-speech LM so that pre-trained LLMs can be used directly as the backbone. Third, we design a chunk-aware causal flow matching model to accommodate different synthesis scenarios. As a result, a single model can perform both streaming and non-streaming synthesis. Trained on a large-scale multilingual dataset, CosyVoice 2 achieves human-comparable synthesis quality with very low response latency and real-time factor.
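To make the finite-scalar quantization step in the abstract more concrete, here is a minimal sketch of the generic FSQ recipe: bound each dimension, round it to a small set of integer levels, and read the resulting digits as one token index. The `tanh` bounding, the odd level counts, and the function names are assumptions for illustration; the abstract does not state the configuration CosyVoice 2 uses.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each dimension of z to one of levels[d] integers (odd level counts assumed)."""
    z = np.asarray(z, dtype=np.float64)
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0        # e.g. 3 levels -> codes in {-1, 0, 1}
    bounded = np.tanh(z) * half      # squash every dimension into [-half, half]
    return np.round(bounded).astype(np.int64)

def fsq_index(codes, levels):
    """Read per-dimension codes as mixed-radix digits to get a single token index."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = codes + half            # shift codes to {0, ..., levels[d] - 1}
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * bases).sum(axis=-1)

# Hypothetical 3-dimensional latent with 3 levels per dimension (27 possible tokens).
levels = [3, 3, 3]
codes = fsq_quantize([[-5.0, 0.0, 5.0]], levels)
token = fsq_index(codes, levels)
```

Because every combination of levels is a valid code, the implicit codebook has no entries that can go unused by construction, which is the property the abstract refers to as improved codebook utilization.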

Ultra-Low Latency: CosyVoice 2.0 integrates offline and streaming modeling in a single large-scale speech generation model and supports bidirectional streaming speech synthesis. First-packet synthesis latency can be as low as 150 ms with minimal loss in quality.
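First-packet latency and real-time factor can both be measured on any synthesizer that yields audio chunks. The sketch below assumes a hypothetical iterable of sample chunks and is not the CosyVoice 2 API. Real-time factor is taken here as wall-clock synthesis time divided by the duration of the audio produced.

```python
import time

def measure_stream(chunks, sample_rate):
    """Return (first_packet_latency_seconds, real_time_factor) for a chunk iterable.

    `chunks` is assumed to yield sequences of audio samples as they are synthesized.
    """
    start = time.perf_counter()
    first_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_latency is None:
            # Time from request start until the first audio chunk arrives.
            first_latency = time.perf_counter() - start
        total_samples += len(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    rtf = elapsed / audio_seconds if audio_seconds > 0 else float("inf")
    return first_latency, rtf

# Stand-in generator: two one-second chunks of silence at 16 kHz.
fake_stream = ([0.0] * 16000 for _ in range(2))
latency, rtf = measure_stream(fake_stream, 16000)
```

A real-time factor below 1 means audio is produced faster than it plays back, which is the precondition for uninterrupted streaming.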

High Accuracy: Compared to CosyVoice 1.0, CosyVoice 2.0 reduces pronunciation errors in synthesized audio by 30% to 50%. It achieves the lowest character error rate to date on the hard test set of the Seed-TTS evaluation set.
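Character error rate is commonly computed as the character-level edit distance between a reference text and the transcript of the synthesized audio, divided by the reference length. The sketch below shows only that calculation; the speech recognizer and text normalization used by the Seed-TTS evaluation are not described here, and the function name is hypothetical.

```python
def char_error_rate(reference, hypothesis):
    """Levenshtein distance over characters, divided by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hyp) + 1))          # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One substituted character out of five.
cer = char_error_rate("今天是周一", "今天是周日")
```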

Strong Stability: CosyVoice 2.0 ensures excellent consistency in timbre for zero-shot voice generation and cross-language speech synthesis. It shows significant improvement in cross-language synthesis compared to version 1.0.

Natural Experience: The prosody, sound quality, and emotional alignment of audio synthesized by CosyVoice 2.0 are significantly improved over version 1.0. The MOS evaluation score increased from 5.4 to 5.53 (a commercial large-scale speech synthesis model scores a comparable 5.52). CosyVoice 2.0 also upgrades its controllable audio generation, supporting finer-grained emotion control and dialect and accent adjustment.

Zero-shot In-context Generation

Language Prompt Generated 1 Generated 2
ZH
对,这就是我,万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。

突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"

不少人从四面八方赶来,只为目睹我的风采。看,他们眼中流露出的崇敬,足以让我感到自豪。我微微一笑,挥手致意,心中默念着:责任重大,不容懈怠。

啊,事情怎么这么多啊,明天又要开会了啊,做不完啊,说了你又不听,听又不懂,懂又不做,做又做错,错又不认,认又不改,改又不服,不服又不说,你说叫我怎么办

这次机会让我能够在新的领域中不断学习和成长,同时也激励我去克服自身的不足。

我觉得这种运动其实不是说靠机会的,我觉得对每个人来讲,像我们歌手来讲,我觉得其实都是你要自己去努力,然后才可以达到自己的梦想。

周日被我射熄火了,所以今天是周一。

有一种撕心裂肺的感觉,是辣椒,我加了辣椒!

新的一周开始了,我计划好好利用这段时间来完成手头的工作和项目。

今夜的月光如此清亮,不做些什么真是浪费。随我一同去月下漫步吧,不许拒绝。

梯度是一个多变量微积分中的概念,用于描述一个标量场在某一点处的最大变化率,以及变化最快的方向。在物理学中,梯度通常用来表示某个物理量的空间变化情况。

在这宁静的夜晚,我们可以沿着小路慢慢走,感受微风拂面的轻柔,与自然融为一体。
EN
I think people online have actually assembled videos showing every launch and it just gets like crazy fast as you get to twenty twenty three. So yeah, so we've done a nineteen three flight. We're now qualifying Falcon nine to be able to do forty flights.

In the quest for sustainable energy, Tesla leads the charge; every electric vehicle on the road is an emissary saluting clearer skies, collectively weaving the tapestry of our planet's verdant future.

From space exploration to subterranean tunnels, from AI to the neurotechnology revolution, my pursuit transcends mere technological frontiers; it's about carving out unprecedented realms of existence and progress for mankind.

I'm so happy I got to do this. I really wanted to work with Tom Hooper. I know that he records live and he films and records your vocals live. It's such an interesting thing to me and I wanted to see him work. I had actually done screen tests for Les Mis.

Every stage is a fresh adventure, and as the lights ignite, it's an unspoken pact between me and the audience, weaving unforgettable nights where dreams meet reality.

Creating is my way of extracting magic from life's moments. Whether it's joy or tears, I embrace it all, transmuting those feelings into notes, with the hope of touching the depths of every soul.
JP
の匂いを嗅ぎつけて現場に赴き、モテる感覚の全てを使って犯人を割り出し、食らいついたら相手が観念するまで証拠という鋭い歯を食い込ませるそれが探偵さん。

投資で安定収入を得たい人達で情報交換をしませんか?

自分でもユナに提案してからやっぱり暑すぎるか。

どうして、どうしてお姉ちゃんを助けてくれなかったの?

クレジットカード現金化の店舗のスタッフブログです。

某ハンドメイドブログの別館ともなっております。
KO
괴로우나 즐거우나.

장로와 다른 교인들이 들어와 병원으로 가기를 번차례로 권하였다.

비상한 흥분과 그 흥분에 반비례되는 시원치 않은 결과, 이러한 불만의 십 년이 지났습니다.

그는 무슨 난처한 일이나 있는 듯이 유쾌하지 못한 표정으로 술만 따른다.

그날 저녁에 실례한 것은 이 사람이었소이다.

기천은 상을 물리고 담배를 붙여 물었다.

Crosslingual In-context Generation

ZH EN JP KO

老兵不死,只会逐渐凋零

Old soldiers never die; they just fade away.

老兵は死なず、ただ消え去るのみ。

노병은 죽지 않고, 다만 사라져갈 뿐이다.

在那之后,完全收购那家公司。因此,保持管理层的一致性,利益与即将加入家族的资产保持一致,这就是我们有时不买下全部的原因。

And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that's coming into the family is a reason why sometimes we don't buy the whole thing.

その後、その会社を完全に買収する。だから、経営陣を一列に並べ、家族に入る資産との利益を一致させることが、私たちが全てを買わない理由の一つです。

그리고 나중에, 그 회사를 완전히 인수하게 됩니다. 그래서 경영진을 일치시키고, 가족에 들어오는 자산과의 이익을 일치시키는 것이 우리가 가끔 전부를 사지 않는 이유입니다.

只是雨滴有什么麻烦的?这还没有打雷呢!

Rainfall alone does not constitute a storm. Thunder is required.

雨なんて大したことありません。まだ雷も鳴っていないのですから。

고작 빗방울로 호들갑은, 아직 번개도 치지 않았는데.

如果能对小事感到感激和满足,那他就是幸福的人。

If one knows how to be grateful and content with small things, then he is a happy person.

小さなことに感謝し満足することができれば、その人は幸せな人です。

작은 것을 가지고도 고마워하고 만족할 줄 안다면 그는 행복한 사람이다.

Mixlingual In-context Generation

Prompt Generated

晴空万里不如你心情愉悦,今天有什么开心的事吗?

你昨天的 presentation がよかったので, 오늘도 좋은 피드백을 받을 거예요。

그런 걸 별짓을 다해 가면서 억지루 시작을 했었지요.

최근에 한국어를 배우기 시작했어, I find it fascinating, 特に文化の違いを理解するのが 재미있어.

…덥다… 이렇게 독한 햇빛이라면 서리꽃도 녹아버리겠지…? 아, 생각해보니까 그럴 일은 없겠네. 하하… 아, 으윽…

生活は簡単ではないが, happiness 是自己追求的 목표, 맞죠?

The world looks glorious in the snow. Pure white, like the light of the moon. A perfect backdrop for bloodshed.

내일의 meeting は几点开始?一緒に行きますか?

Emotionally Expressive Voice Generation

Emotion Prompt Generated
Happy
能和大家在一起,我好开心啊。

不久以后,王后果然生下了一个可爱的小公主。
Sad
Dogs are sitting by the door.

Their heads lowered and eyes drooping, letting out soft whimpers as they wait for their owner to return.
Surprise
天呐,这么好一部小说,竟然出自一个怨碎的小作者之手。

进入小木屋后,里面竟然整齐排列着七张小小的床!
Neutral
Dogs are sitting by the door.

Their expressions steady and unmoved, simply observing the world outside with quiet interest.
Angry
刚才还好好的,一眨眼又消失了,真的是要气死我了。

可恶的恶魔!你胡作非为,竟敢抓走公主。

Instructed Voice Generation

Role-playing Control

Text CosyVoice 2 CosyVoice-Instruct
机器人<|endofprompt|>欢迎使用智能助理系统,我是您的虚拟助手,可以协助您完成多种任务,无论是安排日程、管理电子邮件,还是查询路线、预定餐厅,我都能为您提供高效的服务。
小猪佩奇<|endofprompt|>大家好,我是小猪佩奇,今天我和苏西羊一起去公园,我们在秋千上荡来荡去,开心极了,还一起玩了捉迷藏,真是个快乐的下午。
神秘<|endofprompt|>她总是穿着一件神秘的黑色斗篷,没人知道她的真实身份。
凶猛<|endofprompt|>那头狮子的吼声凶猛,仿佛能震撼整个大地。
好奇<|endofprompt|>他对宇宙充满好奇,总是夜晚仰望星空,希望能揭开其中的奥秘。
优雅<|endofprompt|>他的举止优雅,言谈间透出浓厚的书卷气息。
孤独<|endofprompt|>在异国他乡时,他常常感到孤独,思念家乡的亲人和朋友。
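The Text column above follows one pattern: an instruction, the `<|endofprompt|>` marker, then the text to synthesize. Below is a minimal sketch of composing and splitting such strings. The helper names are hypothetical and not part of the CosyVoice code base, and how the model consumes the string is not shown on this page.

```python
END_OF_PROMPT = "<|endofprompt|>"

def build_instructed_text(instruction, text):
    """Join an instruction and the text to synthesize with the marker."""
    if END_OF_PROMPT in instruction or END_OF_PROMPT in text:
        raise ValueError("marker must not appear inside the instruction or text")
    return f"{instruction}{END_OF_PROMPT}{text}"

def split_instructed_text(line):
    """Return (instruction, text); instruction is None when no marker is present."""
    instruction, sep, text = line.partition(END_OF_PROMPT)
    if not sep:
        return None, line
    return instruction, text

line = build_instructed_text("神秘", "她总是穿着一件神秘的黑色斗篷,没人知道她的真实身份。")
instruction, text = split_instructed_text(line)
```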

Dialect Control

Text CosyVoice 2 CosyVoice-Instruct
粤语<|endofprompt|>哎呀,上个星期我喺旺角嗰头搵到间好巴闭嘅冰室,啲奶茶真系“啱啱掂”,滑到像丝绸咁,配埋个菠萝包,真系顶呱呱!
四川话<|endofprompt|>哎呀,晓不晓得,前几天我在成都春熙路上头吃了一碗担担面,那味道简直巴适得板,辣得人直呼过瘾。
上海话<|endofprompt|>侬晓得伐,上礼拜我去淮海路的小马路上头捡漏,居然淘到一只老克勒的手表,侬讲得好伐?
郑州话<|endofprompt|>最近我在家头学做烩面,那手擀面条筋道得真是咥得不赖,搭配那牛骨汤,简直是绝了。
长沙话<|endofprompt|>哎呀,前几天去坡子街吃夜宵,那口味虾辣得我直冒汗,嘴巴烧得像火,但就是停不下来,实在霸蛮。
天津话<|endofprompt|>我最近在家学做煎饼果子,虽然手法还不太纯熟,但那摊出来的鸡蛋酥香得让人直流口水。

Fine-grained Control

Text CosyVoice 2 CosyVoice-Instruct
在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。
他搞的一个恶作剧,让大家<laughter>忍俊不禁</laughter>。
追求卓越不是终点,它需要你每天都<strong>付出</strong>和<strong>精进</strong>,最终才能达到巅峰。
当你用心去倾听一首音乐时[breath],你会开始注意到那些细微的音符变化[breath],并通过它们感受到音乐背后的情感。
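The rows above mix two kinds of inline markers: bracketed events such as `[laughter]` and `[breath]`, and paired tags such as `<strong>…</strong>` and `<laughter>…</laughter>`. As a sketch of how such text could be inspected before synthesis, the following splits a string into plain text and markers. The function name and the marker grammar (lowercase ASCII names only) are assumptions inferred from these examples, not a documented specification.

```python
import re

# Matches [name] events, or <name> / </name> span tags, with lowercase ASCII names.
MARKER = re.compile(r"\[([a-z_]+)\]|<(/?)([a-z_]+)>")

def split_markers(text):
    """Split text into (kind, value) pairs; kind is 'text', 'event', 'open' or 'close'."""
    parts, pos = [], 0
    for m in MARKER.finditer(text):
        if m.start() > pos:
            parts.append(("text", text[pos:m.start()]))
        if m.group(1):
            parts.append(("event", m.group(1)))
        else:
            parts.append(("close" if m.group(2) else "open", m.group(3)))
        pos = m.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts

parts = split_markers("他<strong>付出</strong>了[breath]")
```

The pattern deliberately does not match `<|endofprompt|>`, so instruction markers pass through as ordinary text.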

Speaking Style Control

Text CosyVoice 2 CosyVoice-Instruct
开心<|endofprompt|>参加朋友的婚礼,看着新人幸福的笑脸,我感到无比开心。这样的爱与承诺,总是令人心生向往。
伤心<|endofprompt|>收到拒信的那一刻,我感到无比伤心。虽然知道失败是成长的一部分,但仍然难以掩饰心中的失落。
惊讶<|endofprompt|>走进家门,看见墙上挂满了我的照片,我惊讶得愣住了。原来家人悄悄为我准备了一个惊喜的纪念墙。
生气<|endofprompt|>在交通高峰期,遭遇到一位鲁莽的司机插队,我感到非常生气。这种不文明的行为总让人无奈。
恐惧<|endofprompt|>看恐怖电影时,那突如其来的惊悚画面让我感到无比恐惧,心脏几乎要跳出胸口。
恶心<|endofprompt|>听到关于人体实验的细节描述,我感到非常恶心。这样的内容让人心生不适。
冷静<|endofprompt|>在争论中,我试图让自己冷静下来,理智地表达我的观点。只有冷静,才能有效沟通。
严肃<|endofprompt|>这个安全隐患问题必须严肃处理,我们不能掉以轻心,必须采取有效措施加以解决。
快速<|endofprompt|>这款新应用程序加载速度极快,让用户体验得到了极大的提升,使用起来更加流畅便捷。
非常快速<|endofprompt|>这款新应用程序加载速度极快,让用户体验得到了极大的提升,使用起来更加流畅便捷。
慢速<|endofprompt|>听着轻柔的音乐,我在画布上慢慢地涂抹色彩,让每一笔都充满灵感和思考。
非常慢速<|endofprompt|>听着轻柔的音乐,我在画布上慢慢地涂抹色彩,让每一笔都充满灵感和思考。

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.