add last part in readme
cantabile-kwok committed Oct 8, 2023
1 parent 0540fcd commit 7a5dba4
Showing 2 changed files with 20 additions and 5 deletions.
17 changes: 16 additions & 1 deletion README.md
@@ -29,7 +29,22 @@ During the development, the following repositories were referred to:
* [GradTTS](https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS), where most of the model architecture and training pipelines are adopted.
* [VITS](https://github.com/jaywalnut310/vits), whose distributed bucket sampler is used.
* [CFM](https://github.com/atong01/conditional-flow-matching), for the ODE samplers.
## Citation

## Easter Eggs & Citation
This repository also contains some experimental functionalities. ⚠️ Warning: these are not guaranteed to be correct!
* **Voice conversion**. Since GlowTTS can perform voice conversion via the disentangling property of normalizing flows, it is reasonable to expect that flow matching can do the same. The method `model.tts.GradTTS.voice_conversion` makes a preliminary attempt (see the voice-conversion sketch after this list).

* **Likelihood estimation**. Differential equation-based generative models can estimate data likelihoods via the instantaneous change-of-variables formula
$$
\log p_0(\boldsymbol x(0)) = \log p_1(\boldsymbol x(1)) + \int _0^1 \nabla_{\boldsymbol x} \cdot {\boldsymbol v}(\boldsymbol x(t), t)\mathrm d t
$$
In practice, the integral is replaced by a summation, and the divergence by the Skilling-Hutchinson trace estimator. See Appendix D.2 of [Song et al.](https://arxiv.org/abs/2011.13456) for theoretical details. I implemented this in `model.tts.GradTTS.compute_likelihood` (see the likelihood sketch after this list).
* **Optimal transport**. The conditional flow matching used in this paper follows a **conditionally** optimal transport path, not a **marginally** optimal one. For marginal optimal transport, [Tong et al.](https://arxiv.org/abs/2302.00482) propose sampling $x_0, x_1$ jointly from the optimal transport coupling $\pi(x_0, x_1)$. I tried this in `model.cfm.OTCFM`, though it does not work very well for now (see the minibatch OT sketch after this list).
* **Different estimator architectures**. You can specify an estimator other than `GradLogPEstimator2d` via the `model.fm_net_type` configuration. Currently, [DiffSinger](https://ojs.aaai.org/index.php/AAAI/article/view/21350)'s estimator architecture is also supported. You can add more, e.g. the one introduced in [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
* 💡**Better alignment learning**. This repo supports supervised duration modeling together with monotonic alignment search (MAS) as in GradTTS. However, there may be a better MAS target for flow-matching TTS. `model.tts.GradTTS.forward` now supports a beta-binomial prior over alignment maps (see the prior sketch after this list); and if you want, you can change the variable `MAS_target` to something else, e.g. flow-transformed noise!
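
**Sketch: voice conversion.** A minimal, hypothetical sketch of the idea, assuming a learned velocity field `v(x, t, spk)` and (as in the change-of-variables formula above) that $t=0$ is the data side and $t=1$ the noise side; the actual `model.tts.GradTTS.voice_conversion` method may be organized differently.

```python
import torch

@torch.no_grad()
def voice_convert(v, y_src, spk_src, spk_tgt, n_steps=10):
    # Map the source mel into the shared noise space under the source speaker,
    # then integrate back to mel space under the target speaker (Euler steps).
    dt = 1.0 / n_steps
    x = y_src
    for i in range(n_steps):  # data -> noise, conditioned on the source speaker
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + v(x, t, spk_src) * dt
    for i in reversed(range(n_steps)):  # noise -> data, conditioned on the target speaker
        t = torch.full((x.shape[0],), (i + 1) * dt, device=x.device)
        x = x - v(x, t, spk_tgt) * dt
    return x
```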
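
**Sketch: likelihood estimation.** A minimal sketch of the formula above with Euler integration and Hutchinson probes; the velocity-field signature `v(x, t)` is an assumption, and the prior $p_1$ is assumed to be standard Gaussian.

```python
import math
import torch

def log_likelihood(v, x0, n_steps=100, n_probes=1):
    # Integrate log p_0(x(0)) = log p_1(x(1)) + \int_0^1 div v dt with Euler steps;
    # the divergence is estimated by E_eps[eps^T (dv/dx) eps] (Skilling-Hutchinson).
    batch = x0.shape[0]
    x = x0
    int_div = torch.zeros(batch, device=x0.device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((batch,), i * dt, device=x0.device)
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            out = v(x_in, t)
            div = torch.zeros(batch, device=x0.device)
            for _ in range(n_probes):
                eps = torch.randn_like(x_in)
                # one vector-Jacobian product gives eps^T (dv/dx)
                vjp = torch.autograd.grad(out, x_in, grad_outputs=eps, retain_graph=True)[0]
                div += (vjp * eps).flatten(1).sum(1)
        int_div += (div / n_probes) * dt
        x = x + out.detach() * dt
    log_p1 = (-0.5 * x ** 2 - 0.5 * math.log(2 * math.pi)).flatten(1).sum(1)
    return log_p1 + int_div
```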
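
**Sketch: minibatch OT pairing.** A minimal stand-in using an exact assignment on the squared Euclidean cost; Tong et al. use proper OT solvers (e.g. from the `POT` library), so treat this as illustrative only.

```python
import torch
from scipy.optimize import linear_sum_assignment

def ot_pair(x0, x1):
    # Re-pair the two minibatches so the pairing approximates the optimal
    # transport coupling pi(x0, x1) instead of the independent coupling.
    cost = torch.cdist(x0.flatten(1), x1.flatten(1)) ** 2
    row, col = linear_sum_assignment(cost.cpu().numpy())
    return x0[row], x1[col]
```

The re-paired endpoints then define the straight-line conditional paths for flow matching, exactly as in the independent-coupling case.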
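
**Sketch: beta-binomial alignment prior.** One common construction of such a prior (the exact parameterization in `model.tts.GradTTS.forward` may differ): each mel frame gets a beta-binomial distribution over text positions, concentrating mass near the diagonal.

```python
import numpy as np
from scipy.stats import betabinom

def beta_binomial_log_prior(text_len, mel_len, scaling=1.0):
    # Rows index mel frames, columns index text tokens; a larger `scaling`
    # makes the prior sharper around the diagonal.
    prior = np.zeros((mel_len, text_len))
    for t in range(1, mel_len + 1):
        a, b = scaling * t, scaling * (mel_len - t + 1)
        prior[t - 1] = betabinom(text_len - 1, a, b).pmf(np.arange(text_len))
    return np.log(prior + 1e-8)
```

Adding this matrix to the MAS log-prior biases the search toward near-diagonal alignments, which typically stabilizes early training.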

Feel free to cite this work if it helps 😄

```
@misc{guo2023voiceflow,
title={VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching},
8 changes: 4 additions & 4 deletions model/tts.py
@@ -173,11 +173,11 @@ def forward(self, x, x_lengths, y, y_lengths, noise, spk=None, out_size=None, us
# traj = self.decoder.backward(y, y_mask, torch.zeros_like(y), n_timesteps=2, spk=spk, solver="euler")
with torch.no_grad():
# target = traj[-1]
target = y
MAS_target = y
const = -0.5 * math.log(2 * math.pi) * self.n_feats
factor = -0.5 * torch.ones(mu_x.shape, dtype=mu_x.dtype, device=mu_x.device)
z_square = torch.matmul(factor.transpose(1, 2), target ** 2)
z_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), target)
z_square = torch.matmul(factor.transpose(1, 2), MAS_target ** 2)
z_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), MAS_target)
mu_square = torch.sum(factor * (mu_x ** 2), 1).unsqueeze(-1)
log_prior = z_square - z_mu_double + mu_square + const
# it's actually the log likelihood of target given the Gaussian with (mu_x, I)
@@ -193,7 +193,7 @@ def forward(self, x, x_lengths, y, y_lengths, noise, spk=None, out_size=None, us

# compute MLE loss
mu_y_uncut = torch.matmul(attn.squeeze(1).transpose(1, 2), mu_x.transpose(1, 2)).transpose(1, 2) # here mu_x is not cut.
l_mle = mle_loss(target, mu_y_uncut, torch.zeros_like(mu_y_uncut), y_mask)
l_mle = mle_loss(MAS_target, mu_y_uncut, torch.zeros_like(mu_y_uncut), y_mask)

else:
attn = generate_path(durs, attn_mask.squeeze(1)).detach()
