Hi authors, thanks for open-sourcing the project code as well as your implementation of the STFT discriminator. I was able to train a good codec model with your STFT implementation.

However, I see that the output logits from the discriminator have the shape (b, c, t, w): b for batch size, c for the number of channels, t for timesteps/frames, and w for frequency bins. To compute the hinge loss, D(x) should return a shape of (b, c) or (b, c, t), right? So I am wondering how you aggregate the information along the frequency dimension (the last dimension).

My current implementation directly sums over the last dimension, and I suppose I could also use an nn.Linear layer to project the last dimension from w to 1. Could you provide more details on how the discriminator logits are used during training? Thanks!
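For reference, here is a minimal sketch of the two aggregation options mentioned above (summing over the frequency dimension vs. a learned `nn.Linear` projection), together with standard hinge GAN losses applied to the per-frame logits. The shapes and layer sizes are illustrative assumptions, not the authors' actual configuration:

```python
import torch
import torch.nn as nn

# Hypothetical discriminator output of shape (b, c, t, w); the values
# and sizes below are placeholders for illustration only.
b, c, t, w = 2, 1, 50, 64
logits = torch.randn(b, c, t, w)

# Option 1: sum over the frequency dimension -> (b, c, t).
# (A mean would work the same way and keeps the scale independent of w.)
agg_sum = logits.sum(dim=-1)

# Option 2: learned projection from w to 1 via nn.Linear -> (b, c, t).
proj = nn.Linear(w, 1)
agg_lin = proj(logits).squeeze(-1)

# Standard hinge GAN losses, averaged over all remaining elements:
def d_hinge_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    """Discriminator hinge loss: push real logits above +1, fake below -1."""
    return torch.relu(1 - real_logits).mean() + torch.relu(1 + fake_logits).mean()

def g_hinge_loss(fake_logits: torch.Tensor) -> torch.Tensor:
    """Generator hinge loss: maximize the discriminator's fake logits."""
    return (-fake_logits).mean()
```

Note that because the hinge loss is ultimately reduced with a mean, keeping the (b, c, t) shape (a "patch"-style discriminator over frames) is also common, so the aggregation mainly affects how much the frequency axis is pooled before the loss.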