Comparison between base and CTC models #10

Open
hertz-pj opened this issue Jan 21, 2025 · 4 comments

@hertz-pj

Hi, I tried both the CTC model and the base model, and it seems the bug I encountered previously hasn't been fixed yet: the post-hoc bottleneck results are still somewhat better than the original FSQ.

Additionally, the CTC-loss model seems to produce clearer pronunciation and overall better performance. Have any specific evaluations or comparisons been done on this?

@hertz-pj
Author

test.zip
This is the result of the comparison.

@cantabile-kwok

cantabile-kwok commented Jan 21, 2025

I am also bothered by this. Here is what I found:

When not using a posthoc bottleneck configuration, the model actually calls the FSQ implemented in stable-audio-tools. This repository also has its own FSQ, in which the author has fixed a bug (#8), so there are two FSQ versions we can use. When a posthoc bottleneck configuration is used, the model uses the FSQ implemented in this repo.

After checking the source code, I found that the default FSQ setting (without the posthoc bottleneck) is [17,17,17,17,17,17], i.e. every FSQ dimension has 17 quantization levels. That works out to a vocabulary size of 17^6 = 24,137,569 and a bitrate of 625 bps. To compare the two FSQ versions, we can add this to stable_codec/model.py:

```python
self.preset_bottleneck_configs = {
    ...
    # Single 6-dim FSQ with 17 levels per dimension: 17^6 = 24,137,569 codes, 625 bps.
    "1x24137569_625bps": [
        ([17, 17, 17, 17, 17, 17], 1.0)
    ],
}
...
```
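As a quick sanity check on those numbers (a minimal sketch; the 25 tokens-per-second frame rate is my assumption from the preset naming, not something stated above):

```python
import math

levels = [17, 17, 17, 17, 17, 17]  # default FSQ levels, one entry per latent dimension
frame_rate_hz = 25                 # assumed token rate of the codec

vocab_size = math.prod(levels)                     # 17**6 = 24,137,569 possible codes
bits_per_token = math.ceil(math.log2(vocab_size))  # 25 bits to index a single code
bitrate_bps = bits_per_token * frame_rate_hz       # 25 bits * 25 Hz = 625 bps

print(vocab_size, bits_per_token, bitrate_bps)     # 24137569 25 625
```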

Then use model.set_posthoc_bottleneck("1x24137569_625bps") before reconstructing speech. If I understand correctly, this should behave identically to the default.
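For concreteness, a minimal reconstruction sketch along those lines; the StableCodec loading call, the encode/decode signatures with the posthoc_bottleneck flag, and the sample_rate attribute are assumed from this repo's README, and the paths are placeholders:

```python
import torch
import torchaudio
from stable_codec import StableCodec

# Loading API as assumed from the README; paths below are placeholders.
model = StableCodec(
    model_config_path="model_config.json",
    ckpt_path="model.ckpt",
    device=torch.device("cuda"),
)

# Route quantization through this repo's (bug-fixed) FSQ via the posthoc bottleneck,
# using the "1x24137569_625bps" preset added above.
model.set_posthoc_bottleneck("1x24137569_625bps")

latents, tokens = model.encode("input.wav", posthoc_bottleneck=True)
decoded_audio = model.decode(tokens, posthoc_bottleneck=True)

# Output tensor layout is an assumption; torchaudio.save expects (channels, samples).
torchaudio.save("reconstructed.wav", decoded_audio.squeeze(0).cpu(), model.sample_rate)
```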

However, testing on some utterances, I found that calling model.set_posthoc_bottleneck("1x24137569_625bps") gives much better intelligibility than the default: roughly 5% WER when calling it explicitly versus about 10% WER with the default. The default even seems worse than using a posthoc bottleneck at 400 bps, while calling it explicitly is better.

@julian-parker
Collaborator

Yes, this is a known upstream bug in stable-audio-tools. The FSQ quantizer has numerical issues with very large codebook size, which are fixed by changing calculations related to the indices to int64. This is fixed locally at the moment (as you noticed), and applied when using the post-hoc bottleneck. Getting the fix upstreamed into stable-audio-tools will be a little slower, as we need to plan around other features and the release cycle.
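To illustrate the kind of failure mode this describes (a sketch of the general numerical issue, not the actual upstream code): with 17^6 ≈ 2.4e7 codes, flattened indices exceed 2^24, the largest range in which float32 represents every integer exactly, so index arithmetic that passes through lower precision can silently land on a neighboring code.

```python
import torch

vocab_size = 17 ** 6  # 24,137,569 codes; float32 is only exact for integers up to 2^24

# Take some code indices near the top of the codebook.
idx = torch.arange(vocab_size - 1000, vocab_size, dtype=torch.int64)

# Round-tripping them through float32 corrupts them; float64/int64 keeps them exact.
bad = (idx.to(torch.float32).to(torch.int64) != idx).sum().item()
good = (idx.to(torch.float64).to(torch.int64) != idx).sum().item()
print(bad, good)  # roughly half the indices change under float32; none under float64
```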

Is there a practical reason to use anything other than the currently available posthoc bottlenecks? The 700 bps and 1000 bps residual versions both have the same accuracy as the original 17-level version, whilst having a manageable codebook size. If someone has a practical use case for 1x24137569_625bps, I'd suggest we add it to the bottleneck presets whilst waiting for the upstream fix.

@julian-parker
Collaborator

> Additionally, the CTC-loss model seems to produce clearer pronunciation and overall better performance. Have any specific evaluations or comparisons been done on this?

Objective metrics are available at the bottom of the README. The CTC version performs slightly worse according to those metrics, but I 100% agree that perceptually it's better.
