Runtime quality control is interesting

https://www.linkedin.com/posts/yongyi-zang_github-resemble-aichatterbox-sota-open-source-activity-7333625257456480256-XfT8

Got very curious, so I started to look into the source code of Resemble AI's newly released open TTS model Chatterbox (https://lnkd.in/gzmCFFaQ), which claims to outperform ElevenLabs. Here's a (too quick, hopefully not wrong) tech deep-ish dive into its architecture:

High-level overview: text -> semantic tokens -> flow matching for Mel -> Mel to waveform. For voice conversion, speech -> semantic tokens, then the rest of the same pipeline. Speaker embedding conditioning is applied both at text -> semantic tokens and at semantic tokens -> Mel.
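
To make that data flow concrete, here is a minimal glue-code sketch of the stages as I understand them; every name below is a placeholder I made up, not the actual Chatterbox API:

```python
from typing import Callable

# Hypothetical glue code: each callable stands for one component described in
# this post; the names are placeholders, not the real Chatterbox modules.

def synthesize(text, reference_wav,
               speaker_encoder_a: Callable,  # GE2E-style embedding (network A)
               s3_tokenizer: Callable,       # ASR-trained semantic tokenizer
               llama_decoder: Callable,      # text + prompt tokens + spk emb -> speech tokens
               speaker_encoder_b: Callable,  # CAM++ x-vector (network B)
               cfm_mel_decoder: Callable,    # speech tokens + spk emb -> Mel
               hift_vocoder: Callable,       # Mel -> waveform
               watermarker: Callable):       # waveform -> watermarked waveform
    spk_a = speaker_encoder_a(reference_wav)
    prompt_tokens = s3_tokenizer(reference_wav)
    speech_tokens = llama_decoder(text, prompt_tokens, spk_a)
    spk_b = speaker_encoder_b(reference_wav)
    mel = cfm_mel_decoder(speech_tokens, spk_b)
    return watermarker(hift_vocoder(mel))

# Voice conversion would skip llama_decoder entirely and feed
# s3_tokenizer(source_wav) straight into cfm_mel_decoder.
```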

Given a reference audio, it is encoded by a speaker embedding network A (will explain in a second) and by a speech tokenizer. The speech tokenizer is S3Tokenizer, which looks very much like CosyVoice's with some twists (basically just very semantic, because it's trained on an ASR objective).

These two embeddings are then sent into the core sequence-modeling model (a Llama). CFG is applied by running the model twice within the same batch, once without the speaker embedding conditioning. The two runs are then blended with a weight to form the final logits. The text tokens and speech tokens have separate absolute positional embeddings. (Why not RoPE, BTW?)
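
The CFG step, as I read it, looks roughly like the sketch below; the exact weighting formula and the names are my assumption, not copied from the source:

```python
import torch

# Blend conditional and unconditional logits with a guidance weight.
def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               cfg_weight: float = 0.5) -> torch.Tensor:
    # Weighted average of the two forward passes (assumed formula).
    return cfg_weight * cond_logits + (1.0 - cfg_weight) * uncond_logits

# Example: batch of 2, row 0 ran with the speaker embedding, row 1 without.
logits = torch.randn(2, 1024)
final = cfg_logits(logits[0], logits[1], cfg_weight=0.5)
next_token = torch.argmax(final)
```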

A runtime quality-control module inspects alignment by looking at the attention map of layer 9 of the sequence model during generation (this is genius actually, they actually went in and looked at the attention maps!) and makes sure the generated tokens are attending to the right words (the final text token is not attended to for too long, previously covered tokens are not attended to again). It can't really fix anything, but it can make the speech end immediately by setting the <EOS> token's logit to ... 2^15...
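
Roughly, the idea is something like the sketch below; the thresholds and bookkeeping are my assumptions, only the core trick (watch the layer-9 attention, then force <EOS>) is from the actual code:

```python
import torch

# Hedged sketch of a runtime alignment check over one decoding step.
def check_alignment_and_maybe_force_eos(
    attn_row: torch.Tensor,    # attention from the current step to the text tokens
    last_text_pos: int,        # index of the final text token
    state: dict,               # running counters kept across decoding steps
    logits: torch.Tensor,      # next-token logits, modified in place
    eos_token_id: int,
    max_tail_steps: int = 16,  # assumed threshold
) -> None:
    focus = int(attn_row.argmax())
    # Stuck on the final text token for too many consecutive steps?
    state["tail_steps"] = state.get("tail_steps", 0) + 1 if focus == last_text_pos else 0
    # Jumped back to a word that was already covered (a sign of looping)?
    repeated = focus < state.get("max_focus", 0)
    state["max_focus"] = max(state.get("max_focus", 0), focus)

    if state["tail_steps"] > max_tail_steps or repeated:
        # Can't repair the generation, but can end it immediately.
        logits[eos_token_id] = 2.0 ** 15
```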

So the speech tokens are generated by the sequence model, and a speaker embedding model B is used to extract a speaker embedding again, which conditions the decoder. Remember speaker embedding network A? That's GE2E (so basically 3 LSTMs). Network B is a CAM++ x-vector; if they didn't change it, then it's likely a model pre-trained with an AAM-Softmax objective. Why two embedding networks? I can't really figure this one out.

So the decoder is a two-stage system: it first generates a Mel spectrogram, then generates the waveform from the Mel. The Mel generation part is identical to CosyVoice, where a conditional flow matching (CFM) model is used; they use HiFTNet instead of HiFiGAN as the vocoder, though.
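
For reference, a generic CFM sampler looks roughly like this (an Euler-integration sketch under my own assumptions, not the CosyVoice code; velocity_net is a placeholder):

```python
import torch

@torch.no_grad()
def cfm_sample(velocity_net, speech_tokens, spk_emb, mel_shape, n_steps: int = 10):
    x = torch.randn(mel_shape)                          # x_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((mel_shape[0],), i * dt)         # flow time in [0, 1)
        v = velocity_net(x, t, speech_tokens, spk_emb)  # predicted velocity dx/dt
        x = x + dt * v                                  # Euler step towards the data
    return x                                            # approximate Mel spectrogram
```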

HiFTNet is a neural source-filter network with several stages. It starts by predicting F0 from the Mel, then uses that F0 to generate a source signal. The source signal then conditions an inverse-STFT network (the *neural filter*), which predicts frame-wise magnitude and phase.
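
A toy version of that source-filter + iSTFT idea, just to illustrate the mechanics (not HiFTNet's actual layers; filter_net is a placeholder):

```python
import math
import torch

def sine_source(f0: torch.Tensor, sr: int = 24000, hop: int = 256) -> torch.Tensor:
    # f0: (frames,) in Hz; upsample to sample rate and integrate phase into a sine.
    f0_up = f0.repeat_interleave(hop)
    phase = 2 * math.pi * torch.cumsum(f0_up / sr, dim=0)
    return torch.sin(phase)

def synthesize_wave(mel, source, filter_net, n_fft: int = 1024, hop: int = 256):
    # filter_net (placeholder) maps (mel, source) -> per-frame log-magnitude and
    # phase, each shaped (n_fft // 2 + 1, frames).
    log_mag, phase = filter_net(mel, source)
    spec = torch.exp(log_mag) * torch.exp(1j * phase)   # complex STFT frames
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))
```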

When the sequence-modeling part is skipped (for voice conversion), the speech tokens can also be extracted directly from the source speech and then go through the same decoder.

The watermarking model is a separate model that runs at the waveform level. It embeds a watermark into the magnitude spectrogram, then uses 3 branches of convolutions to model watermark presence at different time scales.
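
The detection side could look roughly like this multi-scale sketch (my own illustration of the "3 branches" idea, not Resemble's actual watermarker):

```python
import torch
import torch.nn as nn

class MultiScaleWatermarkDetector(nn.Module):
    def __init__(self, n_bins: int = 513):
        super().__init__()
        # Three 1-D conv branches over the magnitude spectrogram, each with a
        # different kernel size, i.e. a different temporal receptive field.
        self.branches = nn.ModuleList([
            nn.Conv1d(n_bins, 32, kernel_size=3, padding=1),
            nn.Conv1d(n_bins, 32, kernel_size=7, padding=3),
            nn.Conv1d(n_bins, 32, kernel_size=15, padding=7),
        ])
        self.head = nn.Linear(3 * 32, 1)

    def forward(self, mag_spec: torch.Tensor) -> torch.Tensor:
        # mag_spec: (batch, freq_bins, frames) -> watermark probability per clip.
        feats = [branch(mag_spec).mean(dim=-1) for branch in self.branches]
        return torch.sigmoid(self.head(torch.cat(feats, dim=-1)))
```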


