add last part in readme
cantabile-kwok committed Oct 8, 2023
1 parent 0540fcd commit 7a5dba4
Showing 2 changed files with 20 additions and 5 deletions.
17 changes: 16 additions & 1 deletion README.md
@@ -29,7 +29,22 @@ During the development, the following repositories were referred to:
* [GradTTS](https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS), where most of the model architecture and training pipelines are adopted.
* [VITS](https://github.com/jaywalnut310/vits), whose distributed bucket sampler is used.
* [CFM](https://github.com/atong01/conditional-flow-matching), for the ODE samplers.
## Citation

## Easter Eggs & Citation
This repository also contains some experimental functionalities. ⚠️ Warning: these are not guaranteed to be correct!
* **Voice conversion**. Since GlowTTS can perform voice conversion via the disentangling property of normalizing flows, it is reasonable to expect that flow matching can do the same. The method `model.tts.GradTTS.voice_conversion` makes a preliminary attempt (see the voice-conversion sketch after this list).

* **Likelihood estimation**. Differential equation-based generative models can estimate data likelihoods via the instantaneous change-of-variables formula
$$
\log p_0(\boldsymbol x(0)) = \log p_1(\boldsymbol x(1)) + \int _0^1 \nabla_{\boldsymbol x} \cdot {\boldsymbol v}(\boldsymbol x(t), t)\mathrm d t
$$
In practice, the integral is replaced by a summation, and the divergence by the Skilling-Hutchinson trace estimator. See Appendix D.2 of [Song et al.](https://arxiv.org/abs/2011.13456) for theoretical details. I implemented this in `model.tts.GradTTS.compute_likelihood` (see the likelihood sketch after this list).
* **Optimal transport**. The conditional flow matching used in this paper follows a **conditionally** optimal transport path, not a **marginally** optimal one. For marginal optimal transport, [Tong et al.](https://arxiv.org/abs/2302.00482) propose sampling $x_0, x_1$ jointly from the optimal transport coupling $\pi(x_0, x_1)$. I tried this in `model.cfm.OTCFM`, though it does not work very well for now (see the minibatch OT sketch after this list).
* **Different estimator architectures**. You can specify an estimator other than `GradLogPEstimator2d` via the `model.fm_net_type` configuration. Currently, [DiffSinger](https://ojs.aaai.org/index.php/AAAI/article/view/21350)'s estimator architecture is also supported. You can add more, e.g. the one introduced in [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
* 💡**Better alignment learning**. This repo supports supervised duration modeling together with monotonic alignment search (MAS) as in GradTTS. However, there may be a better MAS target for flow-matching TTS. `model.tts.GradTTS.forward` now supports a beta-binomial prior over alignment maps (see the prior sketch after this list); and if you want, you can change the variable `MAS_target` to something else, e.g. flow-transformed noise!
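
**Sketch: voice conversion.** A minimal, hypothetical sketch of the idea, assuming a learned velocity field `v(x, t, spk)` and (as in the change-of-variables formula above) that $t=0$ is the data side and $t=1$ the noise side; the actual `model.tts.GradTTS.voice_conversion` method may be organized differently.

```python
import torch

@torch.no_grad()
def voice_convert(v, y_src, spk_src, spk_tgt, n_steps=10):
    # Map the source mel into the shared noise space under the source speaker,
    # then integrate back to mel space under the target speaker (Euler steps).
    dt = 1.0 / n_steps
    x = y_src
    for i in range(n_steps):  # data -> noise, conditioned on the source speaker
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + v(x, t, spk_src) * dt
    for i in reversed(range(n_steps)):  # noise -> data, conditioned on the target speaker
        t = torch.full((x.shape[0],), (i + 1) * dt, device=x.device)
        x = x - v(x, t, spk_tgt) * dt
    return x
```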
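
**Sketch: likelihood estimation.** A minimal sketch of the formula above with Euler integration and Hutchinson probes; the velocity-field signature `v(x, t)` is an assumption, and the prior $p_1$ is assumed to be standard Gaussian.

```python
import math
import torch

def log_likelihood(v, x0, n_steps=100, n_probes=1):
    # Integrate log p_0(x(0)) = log p_1(x(1)) + \int_0^1 div v dt with Euler steps;
    # the divergence is estimated by E_eps[eps^T (dv/dx) eps] (Skilling-Hutchinson).
    batch = x0.shape[0]
    x = x0
    int_div = torch.zeros(batch, device=x0.device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((batch,), i * dt, device=x0.device)
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            out = v(x_in, t)
            div = torch.zeros(batch, device=x0.device)
            for _ in range(n_probes):
                eps = torch.randn_like(x_in)
                # one vector-Jacobian product gives eps^T (dv/dx)
                vjp = torch.autograd.grad(out, x_in, grad_outputs=eps, retain_graph=True)[0]
                div += (vjp * eps).flatten(1).sum(1)
        int_div += (div / n_probes) * dt
        x = x + out.detach() * dt
    log_p1 = (-0.5 * x ** 2 - 0.5 * math.log(2 * math.pi)).flatten(1).sum(1)
    return log_p1 + int_div
```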
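
**Sketch: minibatch OT pairing.** A minimal stand-in using an exact assignment on the squared Euclidean cost; Tong et al. use proper OT solvers (e.g. from the `POT` library), so treat this as illustrative only.

```python
import torch
from scipy.optimize import linear_sum_assignment

def ot_pair(x0, x1):
    # Re-pair the two minibatches so the pairing approximates the optimal
    # transport coupling pi(x0, x1) instead of the independent coupling.
    cost = torch.cdist(x0.flatten(1), x1.flatten(1)) ** 2
    row, col = linear_sum_assignment(cost.cpu().numpy())
    return x0[row], x1[col]
```

The re-paired endpoints then define the straight-line conditional paths for flow matching, exactly as in the independent-coupling case.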
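
**Sketch: beta-binomial alignment prior.** One common construction of such a prior (the exact parameterization in `model.tts.GradTTS.forward` may differ): each mel frame gets a beta-binomial distribution over text positions, concentrating mass near the diagonal.

```python
import numpy as np
from scipy.stats import betabinom

def beta_binomial_log_prior(text_len, mel_len, scaling=1.0):
    # Rows index mel frames, columns index text tokens; a larger `scaling`
    # makes the prior sharper around the diagonal.
    prior = np.zeros((mel_len, text_len))
    for t in range(1, mel_len + 1):
        a, b = scaling * t, scaling * (mel_len - t + 1)
        prior[t - 1] = betabinom(text_len - 1, a, b).pmf(np.arange(text_len))
    return np.log(prior + 1e-8)
```

Adding this matrix to the MAS log-prior biases the search toward near-diagonal alignments, which typically stabilizes early training.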

Feel free to cite this work if it helps 😄

```
@misc{guo2023voiceflow,
title={VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching},
8 changes: 4 additions & 4 deletions model/tts.py
@@ -173,11 +173,11 @@ def forward(self, x, x_lengths, y, y_lengths, noise, spk=None, out_size=None, us
# traj = self.decoder.backward(y, y_mask, torch.zeros_like(y), n_timesteps=2, spk=spk, solver="euler")
with torch.no_grad():
# target = traj[-1]
target = y
MAS_target = y
const = -0.5 * math.log(2 * math.pi) * self.n_feats
factor = -0.5 * torch.ones(mu_x.shape, dtype=mu_x.dtype, device=mu_x.device)
z_square = torch.matmul(factor.transpose(1, 2), target ** 2)
z_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), target)
z_square = torch.matmul(factor.transpose(1, 2), MAS_target ** 2)
z_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), MAS_target)
mu_square = torch.sum(factor * (mu_x ** 2), 1).unsqueeze(-1)
log_prior = z_square - z_mu_double + mu_square + const
# it's actually the log likelihood of target given the Gaussian with (mu_x, I)
@@ -193,7 +193,7 @@ def forward(self, x, x_lengths, y, y_lengths, noise, spk=None, out_size=None, us

# compute MLE loss
mu_y_uncut = torch.matmul(attn.squeeze(1).transpose(1, 2), mu_x.transpose(1, 2)).transpose(1, 2) # here mu_x is not cut.
l_mle = mle_loss(target, mu_y_uncut, torch.zeros_like(mu_y_uncut), y_mask)
l_mle = mle_loss(MAS_target, mu_y_uncut, torch.zeros_like(mu_y_uncut), y_mask)

else:
attn = generate_path(durs, attn_mask.squeeze(1)).detach()
