This repository has been archived by the owner on Sep 1, 2024. It is now read-only.
Hello, I have been examining the effect of net_vocal_attributes within the full model framework.
At present, for embeddings extracted from the predicted sound, the distance of a negative sample pair (audio_embedding_A1_pred and audio_embedding_B1_pred) can reach 2, and the distance of a positive sample pair (audio_embedding_A1_pred and audio_embedding_A2_pred) can drop to about 0.
But after I changed the input of net_vocal to pure ground-truth sound, the distance of a negative sample pair (audio_embedding_A1_gt and audio_embedding_B_gt) only reaches 1. In other words, the sound feature extraction is worse when I train net_vocal alone.
One would expect features to be easier to extract from pure ground-truth voices than from predicted voices. I changed the training parameters (batch size, learning rate, etc.), but none of these changes solved the problem. What could be the reason?
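For reference, here is a minimal sketch of how pair distances like the ones quoted above could be measured. The metric used by the repository is not stated in this post, so both functions below are assumptions: `cosine_distance` and `euclidean_distance` are hypothetical helpers operating on L2-normalized vectors, and the toy vectors stand in for real embeddings. Both metrics range from 0 to 2 on normalized vectors; under the cosine form a value of 1 corresponds to orthogonal embeddings, while under the Euclidean form orthogonal embeddings give about 1.414.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed preprocessing, not confirmed by the repo)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity; lies in [0, 2] for any non-zero vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Euclidean distance between the unit-length versions; also lies in [0, 2]."""
    a, b = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 2-D vectors standing in for embeddings (hypothetical values).
same_direction = ([1.0, 0.0], [2.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 3.0])
opposite = ([1.0, 0.0], [-1.0, 0.0])

cos_same = cosine_distance(*same_direction)
cos_orth = cosine_distance(*orthogonal)
cos_opp = cosine_distance(*opposite)
euc_orth = euclidean_distance(*orthogonal)
euc_opp = euclidean_distance(*opposite)
```

Reporting which of these two metrics produced the values 2, 0, and 1 would make the numbers easier to interpret, since a reading of 1 means something different under each.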
Looking forward to your reply!