Comparison between base and CTC models #10

Open
hertz-pj opened this issue Jan 21, 2025 · 4 comments

@hertz-pj

Hi, I tried both the CTC model and the base model, and it seems the bug I encountered previously hasn't been fixed yet: the post-hoc bottleneck results are still somewhat better than the original FSQ.

Additionally, the CTC-loss model seems to produce clearer pronunciation and overall better performance. Have any specific evaluations or comparisons been done on this?

@hertz-pj
Author

test.zip
This is the result of the comparison.

@cantabile-kwok

cantabile-kwok commented Jan 21, 2025

I am also bothered by this. Here is what I found:

When not using a posthoc bottleneck configuration, the model actually calls the FSQ implemented in stable-audio-tools. This repository also has its own FSQ, in which the author has fixed a bug (#8), so there are two FSQ versions we can use. When a posthoc bottleneck configuration is used, the model uses the FSQ implemented in this repo.

After checking the source code, I found that the default FSQ setting (without the posthoc bottleneck) is [17,17,17,17,17,17], i.e. every FSQ dimension has 17 quantization levels. That works out to a vocabulary size of 17^6 = 24,137,569 and a bitrate of 625 bps. To compare the two FSQ versions, we can add this to stable_codec/model.py:

```python
self.preset_bottleneck_configs = {
    ...
    # Single 6-dim FSQ with 17 levels per dimension: 17^6 = 24,137,569 codes, 625 bps.
    "1x24137569_625bps": [
        ([17, 17, 17, 17, 17, 17], 1.0)
    ],
}
...
```
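As a quick sanity check on those numbers (a minimal sketch; the 25 tokens-per-second frame rate is my assumption from the preset naming, not something stated above):

```python
import math

levels = [17, 17, 17, 17, 17, 17]  # default FSQ levels, one entry per latent dimension
frame_rate_hz = 25                 # assumed token rate of the codec

vocab_size = math.prod(levels)                     # 17**6 = 24,137,569 possible codes
bits_per_token = math.ceil(math.log2(vocab_size))  # 25 bits to index a single code
bitrate_bps = bits_per_token * frame_rate_hz       # 25 bits * 25 Hz = 625 bps

print(vocab_size, bits_per_token, bitrate_bps)     # 24137569 25 625
```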

Then use model.set_posthoc_bottleneck("1x24137569_625bps") before reconstructing speech. If I understand correctly, this should behave identically to the default.
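For concreteness, a minimal reconstruction sketch along those lines; the StableCodec loading call, the encode/decode signatures with the posthoc_bottleneck flag, and the sample_rate attribute are assumed from this repo's README, and the paths are placeholders:

```python
import torch
import torchaudio
from stable_codec import StableCodec

# Loading API as assumed from the README; paths below are placeholders.
model = StableCodec(
    model_config_path="model_config.json",
    ckpt_path="model.ckpt",
    device=torch.device("cuda"),
)

# Route quantization through this repo's (bug-fixed) FSQ via the posthoc bottleneck,
# using the "1x24137569_625bps" preset added above.
model.set_posthoc_bottleneck("1x24137569_625bps")

latents, tokens = model.encode("input.wav", posthoc_bottleneck=True)
decoded_audio = model.decode(tokens, posthoc_bottleneck=True)

# Output tensor layout is an assumption; torchaudio.save expects (channels, samples).
torchaudio.save("reconstructed.wav", decoded_audio.squeeze(0).cpu(), model.sample_rate)
```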

However, testing on some utterances, I found that calling model.set_posthoc_bottleneck("1x24137569_625bps") gives much better intelligibility than the default: roughly 5% WER when calling it explicitly versus about 10% WER with the default. The default even seems worse than using a posthoc bottleneck at 400 bps, while calling it explicitly is better.

@julian-parker
Collaborator

Yes, this is a known upstream bug in stable-audio-tools. The FSQ quantizer has numerical issues with very large codebook size, which are fixed by changing calculations related to the indices to int64. This is fixed locally at the moment (as you noticed), and applied when using the post-hoc bottleneck. Getting the fix upstreamed into stable-audio-tools will be a little slower, as we need to plan around other features and the release cycle.
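To illustrate the kind of failure mode this describes (a sketch of the general numerical issue, not the actual upstream code): with 17^6 ≈ 2.4e7 codes, flattened indices exceed 2^24, the largest range in which float32 represents every integer exactly, so index arithmetic that passes through lower precision can silently land on a neighboring code.

```python
import torch

vocab_size = 17 ** 6  # 24,137,569 codes; float32 is only exact for integers up to 2^24

# Take some code indices near the top of the codebook.
idx = torch.arange(vocab_size - 1000, vocab_size, dtype=torch.int64)

# Round-tripping them through float32 corrupts them; float64/int64 keeps them exact.
bad = (idx.to(torch.float32).to(torch.int64) != idx).sum().item()
good = (idx.to(torch.float64).to(torch.int64) != idx).sum().item()
print(bad, good)  # roughly half the indices change under float32; none under float64
```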

Is there a practical reason to use anything other than the currently available posthoc bottlenecks? The 700 bps and 1000 bps residual versions both have the same accuracy as the original 17-level version, whilst having a manageable codebook size. If someone has a practical use case for 1x24137569_625bps, I'd suggest we add it to the bottleneck presets whilst waiting for the upstream fix.

@julian-parker
Collaborator

> Additionally, the CTC-loss model seems to produce clearer pronunciation and overall better performance. Have any specific evaluations or comparisons been done on this?

Objective metrics are available at the bottom of the README. The CTC version performs slightly worse according to those metrics, but I 100% agree that perceptually it's better.
