Question on Label Extraction #3
Hi @Nkcemeka! Thank you for your interest :) The label extraction code can be found here. You can have a look at the get_noisy_label_for_codec or get_midi_label_for_codec functions. These functions return a binary tensor which indicates which notes are active at each frame of a signal, i.e. these tensors have a dimensionality of [n_frames, n_notes]. Hope that helps and please let me know if you have more questions! Best wishes,
Thank you @marypilataki. It was really helpful, and I am currently working with it. I used the get_midi_label_for_codec function and passed the offset as HOP_DURATION * i (where i is the audio frame under consideration), the duration as EXCERPT_DURATION, and the codec rate as SAMPLE_RATE. I think your training script eventually chops the time axis of the piano roll to the number of frames from the encoder, which is a little confusing. How do we make the time steps match from the get-go? Do you recommend downsampling the piano roll, or is it okay to change the codec rate to a much lower value so that the issue resolves itself automatically?

You might also want to check the requirements file. The branch installation of basic_pitch and audiotools with pip does not work on my end: I think a git+ is missing and one of the paths is wrong. For example, in the setup.py for audiotools you use a different name, descript-audiotools, which causes a conflict. This works on my end: descript-audiotools @ git+https://github.com/marypilataki/audiotools_mir@mpe_labels. Correct me if I am wrong, but I presume that is the correct thing.

Also, a small question, if you don't mind: in the paper you linked to a page for the Mazurkas dataset, but it has been pretty hard to access it on the website. Do you have a more explicit link? And how do we get the Guitar dataset as well? I saw it was not available and I am unsure how to request it. Any advice on that would be really appreciated.
Just to add, there are some subtle errors in audiotools in the get_noisy_label_for_codec function which you referred me to. I think rate should be sample_rate, since rate was never defined (or maybe it should be the codec rate?). Also, num_samples should be int(duration * codec_rate): when creating the label, num_samples and self.n_notes are passed to torch.zeros, which requires ints (a tiny sketch of this is at the end of this comment).

Lastly, I have tried training the Padac model with the few datasets I have, and I get an error from the same function because codec_rate is None, which breaks the computation of num_samples. Any ideas as to what could be causing this? If you have a checkpoint, that might be helpful as well. Apologies for the many questions; I will wait for your response.
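For clarity, here is a tiny sketch of what I mean; the concrete values are only illustrative:

```python
import torch

# Tiny sketch of the fix described above; the concrete values are illustrative.
duration, codec_rate, n_notes = 1.0, 87, 128
num_samples = int(duration * codec_rate)   # cast needed: torch.zeros rejects float sizes
label = torch.zeros(num_samples, n_notes)  # per-excerpt binary piano roll [n_frames, n_notes]
print(label.shape)                         # torch.Size([87, 128])
```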
Hey @Nkcemeka! Thanks for noticing all those typos :) Indeed, I have corrected this.

Thanks again for your input! 🥇
@marypilataki Thank you so much for your responses; they are indeed helpful. I eventually defined the codec rate as the frame size (or number of time steps) from the encoder divided by the excerpt duration (in case it exceeds 1 s), as sketched below. I think I understand now, thank you. I have not trained the architecture fully, but I was able to get it to work to some degree on my end. However, I also noticed some possible sources of error you might find useful in case someone else tries the code.
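For concreteness, this is roughly what I mean by the codec-rate definition above; the encoder frame count here is just an illustrative value, not taken from the repo:

```python
# Rough sketch of the codec-rate definition mentioned above; the encoder frame
# count is an illustrative assumption, not a value taken from the repo.
EXCERPT_DURATION = 1.0                            # seconds per excerpt
n_encoder_frames = 87                             # time steps the encoder returns for one excerpt
codec_rate = n_encoder_frames / EXCERPT_DURATION  # label frames per second
print(codec_rate)                                 # 87.0
```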
Hope this helps. I will let you know if I encounter any other issues. Just one question, if you don't mind: I was trying to use the Slakh dataset for training the shallow transcriber. For the paper, did you use the mixed file for each track, or did you take the non-drum stems? I assumed the former because, from the training script, it seems your labels are 3D (batch_size, time_steps, n_notes) rather than 4D (batch_size, time_steps, n_notes, n_instruments). As a result, I was wondering how you handled the instrument issue for the Slakh dataset.

EDIT: I think you considered the instruments as well in the function you directed me to earlier; I can see a get_midi_path function that reads all_src.mid for Slakh. I will process it that way and see what I get. You could also help me clarify whether you used a 4D tensor or aggregated everything into a 3D tensor (pitch-only labels) when evaluating the shallow transcriber.
Hello @marypilataki, I tried training the shallow transcriber on MAESTRO and got a very low F1-score (about 3%). The precision was much higher (around 36%), so the issue was clearly the low recall, i.e. a lot of false negatives. For MAESTRO, I generated a training set of 27000 embeddings (approximately 7.5 hours, since each embedding corresponds to a 1 s audio chunk) and 2700 embeddings for the validation set. I was not sure what the issue was, so I reduced the threshold to 0.2 to improve the recall and got 27.7% precision and a 10.5% F-score, which was better although still low. I also converted the transcriber's predictions on one or two files to a MIDI file, and the output was reasonable: the melody was audible and correlated with the chunks. So the probe is actually making an effort, which was great to hear and visualize.

Do you have any insights, based on your experience, on what I could do to get better results? I used the parameters defined in the paper (learning rate, weight decay, etc.). I would appreciate any useful tips. Also, is there a cogent reason for using a threshold of 0.3 rather than 0.5, which is more standard? Maybe that could help me debug what I am doing wrong, since I generally got slightly better performance by reducing the threshold to 0.2. Pardon my many questions once again; I am learning a lot and just want to understand better.
You are right, thanks, I fixed those typos!
Please do!
I used the full mix in all cases. I extracted the ground truth for all instruments (except for drums) into a tensor of [number of frames x number of notes]. This is considered instrument-agnostic transcription, i.e. we only transcribe the pitch and do not 'care' about the instrument source. Although I did not take drums into account in the ground truth, they do appear in the mix (I did not remove the drum stems). I followed the same method both for pretraining and when training/evaluating the shallow transcriber. Hope that's clear; let me know if anything else comes up.
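To illustrate the idea, a minimal sketch of this kind of pitch-only aggregation might look like the following; pretty_midi and the codec rate here are assumptions, not necessarily what the repo uses:

```python
import numpy as np
import pretty_midi

# Minimal sketch of pitch-only (instrument-agnostic) ground truth: every
# non-drum instrument is merged onto a single [n_frames, n_notes] roll.
def instrument_agnostic_roll(midi_path, codec_rate=87, n_notes=128):
    pm = pretty_midi.PrettyMIDI(midi_path)
    n_frames = int(np.ceil(pm.get_end_time() * codec_rate))
    roll = np.zeros((n_frames, n_notes), dtype=np.float32)
    for inst in pm.instruments:
        if inst.is_drum:
            continue                               # drums stay in the mix but not in the labels
        for note in inst.notes:
            start = int(note.start * codec_rate)
            end = int(note.end * codec_rate)
            roll[start:end + 1, note.pitch] = 1.0  # pitch only, instrument identity discarded
    return roll
```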
Oh no :( The first thing that comes to my mind is a bug in the ground truth. Did you check that the labels you used are correct? Could you check that the active notes are indeed present in the labels? Or are the labels all zeros? It seems that the model can hardly predict any note :(
Great that you tried to listen to the results! Weird that you managed to do so with such low scores.
First and foremost, check that the ground truth is correct. Your results are far from mine, which is really odd. Also, which package did you use for evaluation? I use mir_eval.
I don't think there is a standard threshold value; it depends on the project's goals and the desirable balance between precision and recall. Many papers optimise with respect to the threshold and use the value that gives the best F-score.
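As a rough illustration, a sweep like the following could be used to pick the threshold on a validation set; sklearn is just an example here, and y_true / y_prob are assumed to be flattened binary labels and sigmoid outputs, one value per (frame, note) pair:

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustration of optimising the binarisation threshold on a validation set;
# this is an example, not the repo's evaluation code.
def best_threshold(y_true, y_prob, candidates=np.arange(0.1, 0.9, 0.05)):
    scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Call it with random data just to show usage:
rng = np.random.default_rng(0)
print(best_threshold(rng.integers(0, 2, size=1000), rng.random(1000)))
```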
Thank you @marypilataki. I will spend time this weekend looking into it to confirm there is no bug. However, just to clarify, I used the following function for the labels: get_midi_label_for_codec(self, sample_rate, offset, duration, path, codec_rate). sample_rate was not used, so I ignored it. My duration is one second and my path is the path to the MIDI file. What did you use for the offset? I used hop_length * frame_number for the audio frame under consideration. Would that be the correct thing to do? Please let me know if you approached this differently; it's possible I'm calculating this the wrong way.
Thanks for noticing, fixed that too.
The offset should be the time in seconds within the full track that your 1-second excerpt starts from. This is required so that the correct part of the MIDI is read and encoded into the label. Is your hop_length in seconds?
My hop length is 0.8 seconds. And yes, your explanation is what I thought it was. This is the function I used to get both the features and labels at the same time:
I didn't change the training script. All I did was get the labels and features. I will look into it carefully; maybe there is some subtle bug somewhere.
Now that you mention it, I can see that I am casting the offset to an int when passing it to the get_midi_label_for_codec function, instead of just multiplying i by HOP_DURATION, which is 0.8 s. Oops, I think that is the bug: I should not use the int (see the small sketch at the end of this comment). At i = 1 the position in seconds should be 0.8 s but ends up being 0, and at i = 2 it becomes 1 rather than 1.6 s. That might be the entire problem. I will fix this asap and get back to you. Please also let me know if I am doing anything else wrong.

EDIT: I am currently retraining, but I don't see any improvement in the precision, recall and F1-score curves. Maybe something else is wrong and I wonder what it is. By the way, the precision, recall and F1-score values I am reporting are from scikit-learn, which is part of the training script (the evaluate function); I reported the values at the 20th epoch and did not use mir_eval. Also, I had actually trained the code on the Slakh dataset in an instrument-agnostic manner (before noticing the above bug), and the F1-score was much higher although still low (about 36%); see the image below. What bugged me was why MAESTRO was so starkly different. Maybe I will redo this from scratch and verify every step of the process. But yes, like you said, the results are weird; hopefully I find the cause of the error.
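As referenced above, here is a small sketch of the offset bug, assuming the buggy version cast the offset to an int; HOP_DURATION and the number of excerpts are illustrative:

```python
# Small sketch of the offset bug described above.
HOP_DURATION = 0.8
n_excerpts = 4
buggy = [int(HOP_DURATION * i) for i in range(n_excerpts)]  # [0, 0, 1, 2] - truncated to whole seconds
fixed = [HOP_DURATION * i for i in range(n_excerpts)]       # approx [0.0, 0.8, 1.6, 2.4] seconds
print(buggy, fixed)
```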
Hey, spotted something else that is wrong and have updated it: the latents should be passed through the residual vector quantization module rather than taken straight from the encoder output.

Let me know if this fixes the problem! Best,
Hello @marypilataki, thank you for your help so far in debugging this with me. I don't think that is the issue, actually. Using your previous code and getting the latent space from model.encoder without passing it through the residual vector quantization module, I got things to work by training on the labels from the get_noisy_label_for_codec function rather than the get_midi_label_for_codec function. Here is a sample of my results when I trained with MAESTRO: you can see that the F1-score is above 65% and the precision is above 70%, which is good to see. To test further, I trained the probe using mel-spectrogram features on the labels predicted by Basic Pitch, and the results for the mel spectrogram were even better: the F1-score is above 70% and the precision is about 75% and above. This shows that the issue isn't the residual vector quantization module.

I read the get_midi_label_for_codec function and have a few questions, if you don't mind; they will clear things up. Here is the code for it:
From the above code, you only consider events whose note start time falls inside the considered window. This overlooks notes that start before the left boundary of the window but end inside it, as well as notes that start before the window and end beyond its right boundary (see the small sketch below). I don't think this is the cause of the problem; I just wonder whether this was a design decision to keep the training simpler, since we are dealing with chunks. Your explanation would be really appreciated. I will make another comment on what I suspect is the actual issue.
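Here is a small sketch of the boundary cases I mean: a note should be kept if its interval overlaps the window at all, not only when it starts inside the window. The names and values are illustrative, not the repo's:

```python
# A note overlaps the excerpt window if its [start, end) interval intersects it.
def note_overlaps_window(note_start, note_end, offset, duration):
    window_start, window_end = offset, offset + duration
    return note_start < window_end and note_end > window_start

# For a 1 s window starting at 4.0 s:
print(note_overlaps_window(3.5, 4.2, 4.0, 1.0))  # True: starts before, ends inside
print(note_overlaps_window(3.5, 6.0, 4.0, 1.0))  # True: spans the whole window
print(note_overlaps_window(4.1, 4.6, 4.0, 1.0))  # True: fully inside (the handled case)
print(note_overlaps_window(5.2, 5.9, 4.0, 1.0))  # False: starts after the window ends
```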
As for the cause of the problem, I have a clue that it probably comes from madmom's implementation. When I started going through the repo, I was initially implementing the logic for getting the chunks based on how the STFT is performed, but ditched that approach since you had a library that did it efficiently. It then occurred to me that for STFT algorithms the window is usually centred on the reference sample; to make this work seamlessly, the audio is padded by frame_size/2 samples on the left. Although this isn't an STFT, the logic is the same. Madmom has an origin parameter that is 0 by default, which means the first chunk of the audio contains 0.5 s of pure silence. This did not affect my training when I used the get_noisy_label function, since Basic Pitch was predicting the labels. To account for this, I ended up adjusting for the padding.

When I did this, over 20 epochs I didn't get the results I wanted, but they were much better: an F1-score of around 30% and a precision of about 50% (thereabouts), as you can see in the graph below. I still feel these results could be higher. What do you think? The results look more reasonable now. Is this what you did, and is there something else I should do? Below is an image showing that madmom pads the first audio chunk (a small code check of the same is at the end of this comment): the image is messy, but the 22050 you see was obtained by calling np.nonzero on the first chunk of the audio signal, which means 0.5 seconds of the first chunk is pure silence.

Although the results are reasonable now, I still expected the score from the ground truth to be higher than that of the noisy labels. I don't know if anything else is wrong, but I hope that was the cause of the issue. Just to add, I trained the model on the ground truth with your changes incorporating the RVQ module; however, the model with a 70% F1-score did not include that and was also based on the noisy labels. The mel spectrogram trained beautifully with the noisy labels and has the best performance so far. Thank you for your help so far, and yes, looking forward to your response.
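As referenced above, here is the small check of madmom's framing behaviour; the file path is a placeholder and the frame/hop sizes correspond to 1 s excerpts with a 0.8 s hop:

```python
import numpy as np
from madmom.audio.signal import Signal, FramedSignal

# Check of the centre-padding behaviour described above (origin=0 by default).
sig = Signal("some_track.wav", num_channels=1)
sr = int(sig.sample_rate)
frames = FramedSignal(sig, frame_size=sr, hop_size=int(0.8 * sr))
print(np.nonzero(frames[0])[0][:1])  # roughly sr / 2 (~22050 at 44.1 kHz) due to centre padding
```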
Hey @Nkcemeka, the missing notes in the get_midi_label_for_codec function are definitely not a design choice; it is a bug! I was working on a different repo for the paper, and when trying to clean things up many things got lost in the process. I have updated this function. I can't be sure regarding the issues with the results you shared. Thanks for noticing this implementation detail within the madmom library; I will have a look. Maybe it would be worth replacing the dataloader used for training the downstream model with the one used for pretraining and comparing the results? I did not do that myself. Also, it is great to hear that you got decent performance when using mel spectrograms; I did not try this myself! Let me know how you are getting on,
Thank you so much, I appreciate your willingness to help. I think I understand the whole pipeline much better than before. One final question: did you run your final checkpoint on the entire Slakh test set? I assume that taking 1-second excerpts at that hop size would produce a lot of embeddings, far exceeding 36 minutes. Or did you settle for a size similar to the validation dataset? Lastly, for the training data, I took the audio corresponding to the IDs you provided, did some random shuffling of these files, and then extracted excerpts until I reached 7200 embeddings, which is 2 hours. I did this for all three datasets and wanted to verify that this approach is okay. The same was done with the validation set to obtain 36 minutes' worth of embeddings.
Hello @marypilataki, here are some updates.

Training Process
Results
I will run the above experiments again to confirm the results. Comparing the above to the paper: for Slakh, you had a higher F1-score of 69.7% against my 62.65% (for the entire test set) and 63.23% for a subset of it (training set size). I think that is okay and was good to see. For the size of the validation set, I had a score of around 55.8%. For MusicNet, I had a much lower F1-score of 45.81% compared to yours, which was about 64%; the size of my MusicNet set was slightly above that of the validation size. Maybe the differences come down to implementation, which is why I wanted to verify that my approach to the training process above is similar to yours. That aside, the results look much better, especially those for the Slakh dataset, and I think it is safe to assume all is working fine now. I would appreciate any feedback on the details of the training process so I can ensure I am doing everything right. Also, if you don't mind, I checked the checkpoint: it has 86 frames instead of 87, which is not much of a problem. I just wanted to confirm whether these weights are for DAC, PADAC, PADAC(g) or PADAC(n); that was something I forgot to ask. Thanks,
@marypilataki Great work. I am currently going through your code and paper to understand the concept of using a probe. I just wanted to ask if perhaps you had a script similar to extract_features.py for labels. I assume a piano roll would have to be generated for each 1 s excerpt (please correct me if I misunderstand this). If you do have something that can do this, it would be really appreciated, as it will make it easy to explore your work instead of having to code it out. So far, I have found extract_features.py really helpful. If I missed it, please kindly let me know.
Looking forward to hearing from you.