Runtime quality control is interesting
https://www.linkedin.com/posts/yongyi-zang_github-resemble-aichatterbox-sota-open-source-activity-7333625257456480256-XfT8
Got very curious, so I started looking into the source code of Resemble AI's newly released open TTS model Chatterbox (https://lnkd.in/gzmCFFaQ), which claims to outperform ElevenLabs. Here's a (too quick, hopefully not wrong) tech deep-ish dive into its architecture:
High-level overview: text -> semantic tokens -> flow matching for Mel -> Mel to waveform. For voice conversion, speech -> semantic tokens, then the rest of the same pipeline. Speaker embedding conditioning is applied both in text -> semantic tokens and in semantic tokens -> Mel.
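To keep the pieces straight, here is a minimal Python sketch of the two paths; every name below (`text_lm`, `tokenizer`, `spk_encoder`, `cfm_decoder`, `vocoder`) is a placeholder of mine, not Chatterbox's actual API:

```python
# Hypothetical outline of the TTS and voice-conversion paths; all callables are placeholders.

def tts(text_tokens, ref_audio, text_lm, tokenizer, spk_encoder, cfm_decoder, vocoder):
    """text -> semantic tokens -> Mel (flow matching) -> waveform."""
    spk_emb = spk_encoder(ref_audio)            # speaker embedding of the reference
    prompt_tokens = tokenizer(ref_audio)        # semantic tokens of the reference audio
    speech_tokens = text_lm.generate(text_tokens, prompt_tokens, spk_emb)
    mel = cfm_decoder(speech_tokens, spk_emb)   # CFM: semantic tokens -> Mel
    return vocoder(mel)                         # HiFTNet-style Mel -> waveform

def voice_conversion(src_audio, ref_audio, tokenizer, spk_encoder, cfm_decoder, vocoder):
    """speech -> semantic tokens -> Mel -> waveform; the text LM is skipped."""
    spk_emb = spk_encoder(ref_audio)
    speech_tokens = tokenizer(src_audio)        # tokens come straight from the source speech
    mel = cfm_decoder(speech_tokens, spk_emb)
    return vocoder(mel)
```

(In the real model, the LM and the vocoder stages use two different speaker embedding networks, as noted further down.)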
Given a reference audio, it is encoded by a speaker embedding network A (more on that in a second) and by a speech tokenizer. The speech tokenizer is S3Tokenizer, which looks very much like CosyVoice's with some twists (its tokens are basically very semantic, because it is trained on an ASR objective).
These two embeddings are then fed into the core sequence-modeling model (a Llama). CFG is applied by running the same input twice within one batch, once without the conditioning speaker embedding. The two runs are then blended with a weight to form the final logits. The text tokens and speech tokens have separate absolute positional embeddings. (Why not RoPE, BTW?)
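In code, the CFG step would look roughly like this (a sketch; `cfg_weight`, the batching layout, and the exact blending formula are assumptions of mine; the post only says the two runs are combined with a weight):

```python
import torch

def cfg_next_token_logits(llama_step, tokens, spk_emb, cfg_weight=0.5):
    # Conditional and unconditional passes run as one batch of size 2:
    # row 0 keeps the speaker-embedding conditioning, row 1 drops it (zeroed here).
    cond = torch.stack([spk_emb, torch.zeros_like(spk_emb)], dim=0)
    batch = tokens.unsqueeze(0).expand(2, -1)        # tokens: (seq_len,) -> (2, seq_len)
    logits = llama_step(batch, cond)                 # placeholder call -> (2, vocab)
    cond_logits, uncond_logits = logits[0], logits[1]
    # Blend the two runs into the final logits (standard CFG-style extrapolation).
    return cond_logits + cfg_weight * (cond_logits - uncond_logits)
```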
A runtime quality-control module inspects the alignment during generation by looking at the attention map of layer 9 of the sequence model (this is genius actually, they actually went in and looked at attention maps!) and makes sure the speech tokens are attending to the right words (the final text token is not attended to for too long, earlier tokens are not attended to again). It can't really fix anything, but it can make generation end immediately by setting the <EOS> token's logit to ... 2^15 ...
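A rough sketch of what such a check could amount to (tensor shapes, the dwell threshold, and the head averaging are illustrative assumptions of mine; only the layer-9 attention and the 2^15 EOS trick come from the post):

```python
import torch

class AlignmentGuard:
    """Watch the layer-9 attention during generation and force <EOS> when the
    alignment says the text has been fully spoken. Illustrative sketch only."""

    def __init__(self, eos_id: int, max_final_dwell: int = 10):
        self.eos_id = eos_id
        self.max_final_dwell = max_final_dwell
        self.dwell = 0   # steps spent attending to the final text token

    def __call__(self, attn_layer9: torch.Tensor, logits: torch.Tensor,
                 last_text_pos: int) -> torch.Tensor:
        # attn_layer9: (num_heads, text_len) attention of the current speech token
        # over the text prompt; average over heads to get one alignment row.
        attn = attn_layer9.mean(dim=0)
        if int(attn.argmax()) >= last_text_pos:
            self.dwell += 1
        else:
            self.dwell = 0
        if self.dwell > self.max_final_dwell:
            # The guard can't repair a broken alignment, but it can end generation
            # immediately by saturating the <EOS> logit (the 2^15 from the post).
            logits[self.eos_id] = 2.0 ** 15
        return logits
```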
So the speech tokens are generated by the sequence model, and a speaker embedding model B is used to extract the speaker embedding again, which is used as the condition for the vocoder. Remember speaker embedding network A? That's a GE2E model (so basically 3 LSTMs). Network B is a CAM++ x-vector; if they didn't change it, it is likely a model pre-trained with an AAM-Softmax objective. Why two embedding networks? I can't really figure this one out.
The vocoder is a two-stage system: first generate a Mel spectrogram, then generate the waveform from the Mel. The Mel generation part is identical to CosyVoice, where a CFM (conditional flow matching) model is used; they use HiFTNet instead of HiFi-GAN as the waveform vocoder, though.
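For context, inference with a CFM model amounts to integrating a learned velocity field from noise toward the Mel; a minimal Euler-solver sketch (the `velocity_net` signature, shapes, and step count are placeholders, not the actual CosyVoice/Chatterbox interface):

```python
import torch

@torch.no_grad()
def cfm_generate_mel(velocity_net, token_feats, spk_emb, mel_bins=80, n_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 (Gaussian noise) to t=1 (Mel)."""
    frames = token_feats.shape[0]
    x = torch.randn(frames, mel_bins)                 # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)
        v = velocity_net(x, t, token_feats, spk_emb)  # predicted velocity field
        x = x + dt * v                                # one Euler step toward the Mel
    return x
```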
HiFTNet is a neural source-filter network with several stages. It starts by predicting F0 from the Mel, then uses these F0s to generate a source signal. The source signal then conditions an inverse-STFT network (the *neural filter*), which predicts frame-wise magnitude and phase.
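Very roughly, that source-filter chain could be sketched like this (toy code; `f0_predictor` and `filter_net` are placeholders, and the real HiFTNet uses a harmonic-plus-noise source and learned networks at every stage):

```python
import torch

def hift_style_vocode(mel, f0_predictor, filter_net, hop=256, sr=24000, n_fft=1024):
    """Mel -> F0 -> sine source -> iSTFT "neural filter" -> waveform (toy sketch)."""
    f0 = f0_predictor(mel)                            # (frames,) predicted F0 in Hz
    # Source module: a sine at the predicted F0 plus a little noise
    # (a stand-in for the real harmonic-plus-noise source).
    f0_per_sample = f0.repeat_interleave(hop)
    phase = torch.cumsum(2 * torch.pi * f0_per_sample / sr, dim=0)
    source = torch.sin(phase) + 0.003 * torch.randn_like(phase)
    # Neural filter: predict frame-wise magnitude and phase conditioned on the Mel
    # and the source, then reconstruct the waveform with an inverse STFT.
    mag, pha = filter_net(mel, source)                # each (n_fft // 2 + 1, frames)
    spec = mag * torch.exp(1j * pha)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))
```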
When the sequence-modeling part is skipped (the voice conversion path), the speech tokens can also be extracted directly from the source speech and then sent through the same decoder network.
The watermarking model is a separate model that runs at the waveform level. It adds a watermark into the magnitude spectrogram, then uses 3 branches of convolutions to model watermark presence at different time scales.
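As a toy illustration of that detector shape (channel counts, kernel sizes, and the sigmoid head are guesses of mine, not the actual watermarking model):

```python
import torch
import torch.nn as nn

class MultiScaleWatermarkDetector(nn.Module):
    """Toy detector: magnitude spectrogram -> three conv branches with different
    receptive fields over time -> per-frame watermark presence probability."""

    def __init__(self, n_freq_bins: int = 513, hidden: int = 64):
        super().__init__()
        # Three 1D-conv branches, each looking at a different time scale.
        self.branches = nn.ModuleList([
            nn.Conv1d(n_freq_bins, hidden, kernel_size=k, padding=k // 2)
            for k in (3, 9, 27)
        ])
        self.head = nn.Conv1d(3 * hidden, 1, kernel_size=1)

    def forward(self, mag_spec: torch.Tensor) -> torch.Tensor:
        # mag_spec: (batch, n_freq_bins, frames)
        feats = [torch.relu(branch(mag_spec)) for branch in self.branches]
        logits = self.head(torch.cat(feats, dim=1))
        return torch.sigmoid(logits).squeeze(1)       # (batch, frames) in [0, 1]
```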