
Question on Label Extraction #3

Open
Nkcemeka opened this issue Jan 14, 2025 · 19 comments

@Nkcemeka

Nkcemeka commented Jan 14, 2025

@marypilataki Great work. I am currently going through your code and paper to understand the concept of using a probe. I just wanted to ask whether you have a script similar to extract_features.py for the labels. I assume a piano roll would have to be generated for each 1 s excerpt (please correct me if I misunderstand this). If you do have something that can do this, it would be really appreciated, as it would make it much easier to explore your work without having to code it out myself. So far, I have found extract_features.py really helpful. If I missed it, please kindly let me know.

Looking forward to hearing from you.

@marypilataki
Owner

Hi @Nkcemeka! Thank you for your interest :)

The label extraction code can be found here. You can have a look at the get_noisy_label_for_codec or get_midi_label_for_codec functions.

These functions return a binary tensor which indicates which notes are active at each frame of a signal, i.e. these tensors have shape [n_frames, n_notes].
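For intuition, a minimal sketch of what such a tensor looks like and how to read the active notes back out of it (the 87 frames and the 88-note range below are illustrative assumptions, not values fixed by the repo):

import torch

# Hypothetical label tensor in the format described above: one row per frame,
# one column per note.
n_frames, n_notes = 87, 88
label = torch.zeros(n_frames, n_notes, dtype=torch.int32)
label[0:40, 39] = 1            # note index 39 active for the first 40 frames

print(label.shape)             # torch.Size([87, 88])
print(label.nonzero()[:3])     # (frame, note) pairs of active cells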

Hope that helps and please let me know if you have more questions!

Best wishes,
Mary

@Nkcemeka
Author

Nkcemeka commented Jan 17, 2025

Thank you @marypilataki. It was really helpful, and I am currently working with it. I used the get_midi_label_for_codec function and passed the offset as HOP_DURATION * i (where i is the index of the audio frame under consideration), the duration as EXCERPT_DURATION and the codec rate as SAMPLE_RATE. I think your training script eventually chops the time axis of the piano roll to the number of frames from the encoder, which is a little confusing. How do we make the time steps match from the get-go? Do you recommend downsampling the piano roll? Or is it okay to change the codec rate to a much lower value so that the issue resolves itself?

You might also want to check the requirements file. The branch installation of basic_pitch and audiotools with pip does not work on my end. I think a git+ is missing and one of the paths is wrong. For example, in the setup.py for audiotools, you have a different name: descript-audiotools which causes a conflict. This works on my end: descript-audiotools @ git+https://github.com/marypilataki/audiotools_mir@mpe_labels. Correct me if I am wrong but I presume that should be the correct thing.

Also, just a little question, if you don't mind: in the paper, you linked to a page for the Mazurkas dataset, but it has been pretty hard to access it on the website. Do you have a more explicit link? And how do we get the Guitar dataset as well? I saw it was not available and I am unsure how to request it. Any advice on that would be really appreciated.

@Nkcemeka
Author

Nkcemeka commented Jan 18, 2025

Just to add, there are some subtle errors in audiotools in the get_noisy_label_for_codec function which you referred me to. I think rate should be sample_rate (or maybe the codec rate?), since rate was never defined. Also, it should be num_samples = int(duration * codec_rate), because when creating the label, num_samples and self.n_notes are passed to torch.zeros, which requires ints.

Lastly, I have tried training the Padac model with the few datasets I have. I get an error from the same function because the codec_rate is None. This causes an error when getting num_samples. Any ideas as to what could be causing this?

If you have a checkpoint, that might be helpful as well.

Apologies for the many questions. I will wait for your response.

@marypilataki
Owner

Hey @Nkcemeka!

Thanks for noticing all those typos :)

Just to add, there are some subtle errors in audiotools in the get_noisy_label_for_codec function which you referred me to. I think rate should be sample_rate (or maybe the codec rate?), since rate was never defined. Also, it should be num_samples = int(duration * codec_rate), because when creating the label, num_samples and self.n_notes are passed to torch.zeros, which requires ints.

Indeed, I have corrected this to num_samples = int(duration * codec_rate)

Lastly, I have tried training the Padac model with the few datasets I have. I get an error from the same function because the codec_rate is None. This causes an error when getting num_samples. Any ideas as to what could be causing this?

codec_rate depends on the architecture we are extracting features from; in our case it should be the rate of DAC/PA-DAC, which corresponds to 87 frames per second. I had forgotten to update the config files, thanks for noticing this! I have now updated the AudioLoader arguments in the config.

If you have a checkpoint, that might be helpful as well.
I will upload a checkpoint of PA-DAC.

@marypilataki
Owner

Thank you @marypilataki. It was really helpful, and I am currently working with it. I used the get_midi_label_for_codec function and passed the offset as HOP_DURATION * i (where i is the index of the audio frame under consideration), the duration as EXCERPT_DURATION and the codec rate as SAMPLE_RATE. I think your training script eventually chops the time axis of the piano roll to the number of frames from the encoder, which is a little confusing. How do we make the time steps match from the get-go? Do you recommend downsampling the piano roll? Or is it okay to change the codec rate to a much lower value so that the issue resolves itself?
codec_rate here corresponds to the DAC/PA-DAC rate. I have updated the config file to include this. When creating a dataloader, the codec_rate member should be set to 87. You do not need to make any amendments to this. The encoder downsamples audio to 87 frames per second, hence the ground truth tensor for a 1-second excerpt has shape [87, number_of_notes]. Let me know if that makes sense.
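In other words, the frame count per excerpt is just the product of the two (a sketch of the arithmetic, reusing the names from this thread):

EXCERPT_DURATION = 1.0   # seconds per excerpt
CODEC_RATE = 87          # DAC/PA-DAC frames per second

n_frames = int(EXCERPT_DURATION * CODEC_RATE)
print(n_frames)          # 87, so each label has shape [87, number_of_notes]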

You might also want to check the requirements file. The branch installation of basic_pitch and audiotools with pip does not work on my end. I think a git+ is missing and one of the paths is wrong. For example, in the setup.py for audiotools, you have a different name: descript-audiotools which causes a conflict. This works on my end: descript-audiotools @ git+https://github.com/marypilataki/audiotools_mir@mpe_labels. Correct me if I am wrong but I presume that should be the correct thing.
Yep, you are right, thanks for noticing this! I have updated the requirements, let me know if you still have issues.

Also, just a little question, if you don't mind: in the paper, you linked to a page for the Mazurkas dataset, but it has been pretty hard to access it on the website. Do you have a more explicit link? And how do we get the Guitar dataset as well? I saw it was not available and I am unsure how to request it. Any advice on that would be really appreciated.
Unfortunately, the Mazurkas dataset is not publicly available. I have access to it for research purposes through my university.
The guitar dataset is also not publicly available (at least as of now). You could try contacting the first author of this paper who created it if you would like to ask more questions.

Thanks again for your input! 🥇

@Nkcemeka
Author

Nkcemeka commented Jan 21, 2025

@marypilataki Thank you so much for your responses, they are indeed helpful. I eventually defined the codec rate as the number of frames (time steps) from the encoder divided by the excerpt duration (in case it exceeds 1 s). I think I understand now. Thank you.
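Concretely, something along these lines (a sketch; the [batch, time, feat] layout for the permuted encoder output is an assumption, matching the snippet I share further down):

def infer_codec_rate(latent_space, excerpt_duration):
    """latent_space: encoder output permuted to [batch, time, feat]."""
    return latent_space.shape[1] / excerpt_duration   # ~87 for DAC/PA-DAC on a 1 s excerpt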

I have not trained the architecture fully, but I was able to get it to work to some degree on my end. However, I also noticed some possible sources of error that you might find useful in case someone else tries the code.

  1. In dac/model/dac.py, line 284, you had self.classifier instead of self.conditioner. I don't think it is much of an issue since the conditioner class and classifier class are essentially the same. It only throws an error because self.classifier was not defined.

  2. In scripts/train_padac.py (line 264 and 291), you had batch["pitch_labels"] which throws an error. An inspection of the batch passed to the train_loop function indicates it should be label and not pitch_labels. My code never got to the val_loop function but I think that would throw an error too since it had pitch_labels as well.

Hope this helps. I will let you know if I encounter any other issues.

Just one question, if you don't mind. I was trying to use the Slakh dataset for training the shallow transcriber. For the paper, did you use the mixed file for each track, or did you take the non-drum stems? I assumed so because, from the training script, it seems your labels are 3D (batch_size, time_steps, n_notes) rather than 4D (batch_size, time_steps, n_notes, n_instruments). As a result, I was wondering how you handled the instrument issue for the Slakh dataset.

EDIT: I think you considered the instruments as well from the function you directed me to earlier. I can see a get_midi_path function that reads all_src.mid for slakh. I will process it that way and see what I get. You could also help me clarify if you used a 4D tensor or aggregated everything into a 3D tensor (pitch-only labels) when evaluating the shallow transcriber.

@Nkcemeka
Author

Nkcemeka commented Jan 22, 2025

Hello @marypilataki,

I tried training the shallow transcriber on MAESTRO and had a super low f1-score (about 3%). The precision was much higher (36%-ish). The issue clearly was due to the recall being low. It means there are a lot of false negatives. For MAESTRO, I generated a training set of 27000 embeddings (approximately 7.5 hours since an embedding is for an audio chunk of 1s) and 2700 embeddings for the validation set.

I was not sure what the issue was so I reduced the threshold to 0.2 in order to improve the recall and I had 27.7% precision and 10.5% fscore which was way better (although still low). I also converted the predictions of the transcriber on one or two files to a MIDI file and the output was reasonable. The melody was audible and was correlated with the chunks. So, it means the probe is actually making an effort and that was great to hear and visualize.

Do you have any insights based on your experience on what I could do to get better results? I used the params defined in the paper (learning rate, weight decay etc). I would appreciate it if you could provide useful tips in any way. Also, is there a cogent reason for using a threshold of 0.3 rather than 0.5, which is typically standard? Maybe that could help me in debugging what I am doing wrong since I generally got a slightly better performance by reducing the threshold to 0.2.

Pardon my many questions once again. I am learning a lot and just want to understand better.

@marypilataki
Owner

marypilataki commented Jan 25, 2025

I have not trained the architecture fully, but I was able to get it to work to some degree on my end. However, I also noticed some possible sources of error that you might find useful in case someone else tries the code.

1. In dac/model/dac.py, line 284, you had _self.classifier_ instead of _self.conditioner_. I don't think it is much of an issue since the conditioner class and classifier class are essentially the same. It only throws an error because _self.classifier_ was not defined.

2. In scripts/train_padac.py (line 264 and 291), you had _batch["pitch_labels"]_ which throws an error. An inspection of the batch passed to the _train_loop_ function indicates it should be _label_ and not _pitch_labels_. My code never got to the _val_loop_ function but I think that would throw an error too since it had _pitch_labels_ as well.

You are right, thanks, I fixed those typos!

Hope this helps. I will let you know if I encounter any other issues.

Please do!

Just one question, if you don't mind. I was trying to use the Slakh dataset for training the shallow transcriber. For the paper, did you use the mixed file for each track, or did you take the non-drum stems? I assumed so because, from the training script, it seems your labels are 3D (batch_size, time_steps, n_notes) rather than 4D (batch_size, time_steps, n_notes, n_instruments). As a result, I was wondering how you handled the instrument issue for the Slakh dataset.

EDIT: I think you considered the instruments as well from the function you directed me to earlier. I can see a get_midi_path function that reads all_src.mid for slakh. I will process it that way and see what I get. You could also help me clarify if you used a 4D tensor or aggregated everything into a 3D tensor (pitch-only labels) when evaluating the shallow transcriber.

I used the full mix in all cases. I extracted the ground truth for all instruments (except for drums) into a tensor of [number of frames x number of notes]. This is known as instrument-agnostic transcription, i.e. we only transcribe the pitch and do not 'care' about the instrument source. Although I did not take drums into account in the ground truth, they do appear in the mix (I did not remove the drum stems). I followed the same method both for pretraining and when training/evaluating the shallow transcriber. Hope that's clear, and let me know if anything else comes up.
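A rough sketch of that aggregation step (not the exact code in the repo; the 87 frames per second, the 88-note range and the MIDI offset of 21 are assumptions):

import numpy as np
from pretty_midi import PrettyMIDI

def pitch_only_roll(midi_path, start, duration, frame_rate=87, n_notes=88, midi_offset=21):
    """Fold all non-drum instruments into one binary [n_frames, n_notes] roll."""
    n_frames = int(duration * frame_rate)
    roll = np.zeros((n_frames, n_notes), dtype=np.int32)
    for inst in PrettyMIDI(str(midi_path)).instruments:
        if inst.is_drum:
            continue                                   # drums stay in the mix, not in the labels
        for note in inst.notes:
            col = note.pitch - midi_offset
            on = int(round((note.start - start) * frame_rate))
            off = int(round((note.end - start) * frame_rate))
            if off <= 0 or on >= n_frames or not 0 <= col < n_notes:
                continue                               # note outside this excerpt or note range
            roll[max(on, 0):min(off, n_frames), col] = 1
    return roll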

@marypilataki
Owner

marypilataki commented Jan 25, 2025

I tried training the shallow transcriber on MAESTRO and had a super low f1-score (about 3%). The precision was much higher (36%-ish). The issue clearly was due to the recall being low. It means there are a lot of false negatives.

Oh no :( The first thing that comes to my mind is a bug in the ground truth. Did you check that the labels you used are correct? Could you check that the active notes are indeed present in the labels? Or are the labels all zeros? It seems that the model can hardly predict any note :(

I was not sure what the issue was so I reduced the threshold to 0.2 in order to improve the recall and I had 27.7% precision and 10.5% fscore which was way better (although still low). I also converted the predictions of the transcriber on one or two files to a MIDI file and the output was reasonable. The melody was audible and was correlated with the chunks. So, it means the probe is actually making an effort and that was great to hear and visualize.

Great that you tried to listen to the results! Weird that you managed to do so with such low scores.

Do you have any insights based on your experience on what I could do to get better results?

First and foremost, check that the ground truth is correct. Your results are very far from mine, which is really strange. Also, which package did you use for evaluation? I use mir_eval.

Also, is there a cogent reason for using a threshold of 0.3 rather than 0.5, which is typically standard?

I don't think there is a standard threshold value; it depends on the project's goals and the desirable balance between precision and recall. Many papers optimise with respect to the threshold and use the value that gives the best F-score.
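For example, such a sweep could look like this (a rough sketch using scikit-learn's frame-level F-score, not the repo's evaluation code):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def best_threshold(probs, labels, thresholds=np.arange(0.1, 0.95, 0.05)):
    """probs, labels: arrays of shape [n_frames, n_notes]; returns (threshold, f1)."""
    best = (None, -1.0)
    for t in thresholds:
        preds = (probs >= t).astype(int)
        _, _, f1, _ = precision_recall_fscore_support(
            labels.ravel(), preds.ravel(), average="binary", zero_division=0
        )
        if f1 > best[1]:
            best = (float(t), float(f1))
    return best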

@Nkcemeka
Author

Thank you @marypilataki. I will spend time this weekend looking into it to confirm there is no bug.

However, just to clarify, I used the below function for the labels:

get_midi_label_for_codec(self, sample_rate, offset, duration, path, codec_rate)

sample_rate was not used so I ignored it. My duration is one second. My path was the path to the midi file. What did you use for the offset? I used the hop_length * frame_number of the audio frame under consideration. Would that be the correct thing to do? Please, let me know if you approached this differently.

It's possible I'm calculating this the wrong way.

@marypilataki
Owner

However, just to clarify, I used the below function for the labels:
get_midi_label_for_codec(self, sample_rate, offset, duration, path, codec_rate)
sample_rate was not used so I ignored it

Thanks for noticing, fixed that too.

My duration is one second. My path was the path to the midi file. What did you use for the offset? I used the hop_length * frame_number of the audio frame under consideration. Would that be the correct thing to do? Please, let me know if you approached this differently.

The offset should be the time in seconds within the full track that your 1-second excerpt starts from. This is required so that the correct part of the MIDI is read and encoded into the label. Is your hop_length in seconds?
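In code terms, the idea is simply the hop (in seconds) times the excerpt index (a sketch; the 0.8 s hop is only an example value):

HOP_DURATION = 0.8   # hop between consecutive excerpts, in seconds

def excerpt_offset(i, hop_duration=HOP_DURATION):
    """Start time, in seconds, of the i-th 1-second excerpt within the full track."""
    return i * hop_duration   # e.g. excerpt_offset(2) -> 1.6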

@Nkcemeka
Author

Nkcemeka commented Jan 25, 2025

My hop length is 0.8 seconds. And yes, your explanation is what I thought it was. This is the function I used to get both the features and labels at the same time:

import numpy as np
import resampy
import soundfile as sf
import torch
from madmom.audio.signal import FramedSignal
from pathlib import Path
from tqdm import tqdm

# SAMPLE_RATE, EXCERPT_DURATION and HOP_DURATION (in seconds) are the constants from
# extract_features.py; get_midi_label_for_codec is the label function discussed above.

def process_features(audiofilenames, midifilenames, model, path, dataset_type="train"):
    # extract features
    count = 0
    for f, g in tqdm(zip(audiofilenames, midifilenames), total=len(audiofilenames)):
        track_id = f.stem
        midi_id = g.stem

        assert track_id == midi_id, "The audio stem and midi stem do not match."
        audio, sample_rate = sf.read(f)

        if sample_rate != SAMPLE_RATE:
            audio = resampy.resample(audio, sample_rate, SAMPLE_RATE)

        audio = torch.tensor(audio, dtype=torch.float)
        if len(audio.shape) == 2:
            # Convert stereo to mono
            audio = audio.mean(dim = 1, keepdim=True)

        audio = FramedSignal(audio.detach().cpu().numpy(), frame_size=SAMPLE_RATE*EXCERPT_DURATION, hop_size=SAMPLE_RATE*HOP_DURATION)


        for i, chunk in enumerate(audio):
            if count == 27000 and dataset_type=="train":
                break

            if count == 2700 and dataset_type=="valid":
                break

            embed_path = Path(path + f"features/{track_id}_{i}.pt")
            roll_path = Path(path + f"labels/{track_id}_{i}.npz")

            if not embed_path.exists() and not roll_path.exists():
                chunk = torch.FloatTensor(chunk).unsqueeze(0).unsqueeze(0) # [1 x 1 x 44100]
                if torch.cuda.is_available():
                    chunk = chunk.cuda()

                with torch.no_grad():
                    chunk = chunk.squeeze(-1)
                    latent_space = model.encoder(chunk)

                # batch x feat x time => batch x time x feat
                latent_space = latent_space.permute(0, 2, 1)
                batch, frames, feat = latent_space.shape
                codec_rate = frames/EXCERPT_DURATION
                roll = get_midi_label_for_codec(int(i*HOP_DURATION), EXCERPT_DURATION, g, codec_rate)

                # Save the embedding and piano roll
                latent_space_dict = {"latent_space": latent_space}
                roll_dict = {"pitch_label": roll}
                torch.save(latent_space_dict, embed_path)
                np.savez(roll_path, **roll_dict)
                count += 1

I didn't change the training script. All I did was get the labels and features. I will look into it carefully. Maybe there is some subtle bug somewhere.

@Nkcemeka
Author

Nkcemeka commented Jan 25, 2025

Now that you mention it, I can see I am doing

int(i*HOP_DURATION) 

when passing it to the get_midi_label_for_codec function, instead of multiplying i by HOP_DURATION, which is 0.8 s. Oops, I think that is a bug; I should not use int there. At i = 1, the offset should be 0.8 s, but it ends up being 0. At i = 2, it becomes 1 rather than 1.6 s. That might be the entire problem.
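A quick check makes the truncation obvious (HOP_DURATION = 0.8 s, as in my snippet above):

HOP_DURATION = 0.8
for i in range(3):
    print(i, int(i * HOP_DURATION), i * HOP_DURATION)
# 0 0 0.0
# 1 0 0.8
# 2 1 1.6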

I will fix this asap and get back to you. Hopefully it works now. Please also let me know if I am doing anything else wrong.

EDIT: I am currently retraining, but I don't think there are any improvements judging from the precision, recall and F1-score curves. Maybe something else is wrong and I wonder what it is. By the way, the precision, recall and F1-score values I am reporting are from scikit-learn, which is part of the training script (the evaluate function). I reported the values at the 20th epoch. I didn't use any from mir_eval.

Also, I actually trained the code on the Slakh dataset in an instrument-agnostic manner (before noticing the above bug). The F1-score was way higher, although still low (about 36%); see the image below. What puzzled me was why the MAESTRO results were so starkly different.

[Image: precision/recall/F1 curves for the Slakh run]

Maybe I will redo this from scratch and verify every step of the process. But yes, like you said, the results are weird. Hopefully, I find the cause of the error.

@marypilataki
Owner

marypilataki commented Jan 27, 2025

Hey, I spotted something else that is wrong. I have updated the extract_features.py script, please have a look at line 66.
There was a step missing in the audio preprocessing, hence the model input was different from what was expected. I believe this might have caused misalignments between the codec frames and the labels.

In the extract_features.py script, could you please replace:

                with torch.no_grad():
                    chunk = chunk.squeeze(-1)
                    latent_space = model.encoder(chunk)

with:

                with torch.no_grad():
                    _, _, _, _, _, _, latent_space = model.encode(chunk, return_latent_space=True)

Let me know if this fixes the problem!

Best,
Mary

@Nkcemeka
Author

Nkcemeka commented Jan 29, 2025

Hello @marypilataki,

Thank you for your help so far in debugging this with me.

I don't think that is the issue actually. Using your previous code and getting the latent space from model.encoder without passing it through the residual vector quantization module, I got things to work by training on the labels from the get_noisy_label_for_codec function rather than the get_midi_label_for_codec function.

Here is a sample of my results when I trained with MAESTRO:

[Image: precision/recall/F1 curves for the MAESTRO run]

You can see that the F1 score is above 65% and the precision is above 70%, which is good to see. To test further, I trained the probe using mel-spectrogram features with the labels predicted by Basic Pitch, and I got this:

[Image: precision/recall/F1 curves for the mel-spectrogram probe]

The results for the mel spectrogram were even better: the F1 score is above 70% and the precision is around 75% or higher. This shows that the issue isn't the residual vector quantization module. I read the get_midi_label_for_codec function and I have a few questions, if you don't mind; it will clear things up.

Here is the code for it:

def get_midi_label_for_codec(self, offset, duration, path, codec_rate):
    "Function to return ground truth label for codec."
    # todo: add support for multi-instrument roll
    num_samples = duration * codec_rate
    start_time = offset
    end_time = start_time + duration
    if self.n_instruments > 1:
        from ..data.vocabulary import program_to_index, program_to_name
        label = torch.zeros(num_samples, self.n_notes, self.n_instruments, dtype=torch.int32)
    else:
        label = torch.zeros(num_samples, self.n_notes, dtype=torch.int32)

    label_path = self.get_midi_path(path)
    midi_data = PrettyMIDI(str(label_path))

    for instrument in midi_data.instruments:
        if not instrument.is_drum:
            for note in instrument.notes:
                if note.start >= start_time:
                    note_start = librosa.time_to_samples(note.start - start_time, sr=codec_rate)
                    pitch_index = note.pitch - self.midi_offset
                    assert pitch_index >= 0, f'Pitch index is negative: {pitch_index}'

                    if note.end <= end_time:
                        note_end = librosa.time_to_samples(note.end - start_time, sr=codec_rate)
                        label[note_start:note_end, pitch_index] = 1
                    else:
                        label[note_start:, pitch_index] = 1
    return label

From the above code, you only consider events whose start time falls within the considered window. This approach misses notes that start before the left boundary of the window but end inside it, as well as notes that start before the window and end beyond its right boundary. I don't think this is the cause of the problem; I just wonder if this was a design decision to make the training simpler, since we are dealing with chunks. Your explanation would be really appreciated.
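For reference, here is a sketch of how notes overlapping either window boundary could be clipped into the label (my own illustration, reusing the variable names from the function above, not a fix taken from the repo):

import librosa
import torch

def clipped_label(notes, start_time, end_time, codec_rate, n_notes, midi_offset):
    """Mark every note that overlaps [start_time, end_time), clipping it to the window."""
    num_samples = int((end_time - start_time) * codec_rate)
    label = torch.zeros(num_samples, n_notes, dtype=torch.int32)
    for note in notes:                                      # pretty_midi Note objects
        if note.end <= start_time or note.start >= end_time:
            continue                                        # entirely outside the window
        on = librosa.time_to_samples(max(note.start - start_time, 0.0), sr=codec_rate)
        off = librosa.time_to_samples(min(note.end, end_time) - start_time, sr=codec_rate)
        label[on:min(off, num_samples), note.pitch - midi_offset] = 1
    return label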

I will make another comment on what I suspect is the issue.

@Nkcemeka
Author

Nkcemeka commented Jan 29, 2025

As to the cause of the problem, I suspect it comes from madmom's implementation. When I started going through the repo, I was initially implementing the logic for getting the chunks based on how the STFT is performed, but I ditched that approach since you had a library that did it efficiently.

It then occurred to me that for STFT algorithms, the window function might be centred on the reference sample. To make this work seamlessly, the audio is padded by frame_size/2 samples to the left. Although this isn't an STFT, I believe the logic is the same. Madmom has an origin parameter that is 0 by default, which means the first chunk of the audio contains 0.5 s of pure silence. This did not affect my training when I used the get_noisy_label_for_codec function, since Basic Pitch was predicting the labels.

To account for this, I ended up using:
offset = (i*HOP_DURATION) - (EXCERPT_DURATION/2).
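Here is a small sanity check of both points (a sketch; the max() clamp at zero for the very first excerpt is my own addition, and the constants follow my earlier snippet):

import numpy as np
from madmom.audio.signal import FramedSignal

SAMPLE_RATE, EXCERPT_DURATION, HOP_DURATION = 44100, 1, 0.8
audio = np.random.randn(5 * SAMPLE_RATE).astype(np.float32)
frames = FramedSignal(audio, frame_size=SAMPLE_RATE * EXCERPT_DURATION,
                      hop_size=int(SAMPLE_RATE * HOP_DURATION))

print(np.nonzero(frames[0])[0][0])   # ~22050: the first half of frame 0 is zero padding

for i in range(3):
    offset = max(i * HOP_DURATION - EXCERPT_DURATION / 2, 0.0)
    print(i, offset)                 # 0 -> 0.0, 1 -> ~0.3, 2 -> ~1.1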

When I did this over 20 epochs, I didn't get the results I wanted, but they were way better: an F1-score of around 30% and a precision of about 50% (thereabouts). You can see the graph in the image below:

[Image: training curves after the offset correction]

The results look more reasonable, but I still feel they could be higher. What do you think? Is this what you did, and is there something else I should do?

Below is an image that shows that madmom pads the first audio chunk. The image is messy, but the 22050 you see was obtained by calling np.nonzero on the first chunk of the audio signal. This means 0.5 seconds of the first chunk is pure silence.

[Image: np.nonzero output showing the first non-zero sample of chunk 0 at index 22050]

Although the results are reasonable now, I still expected the score from the ground-truth labels to be higher than that from the noisy labels. I don't know if anything else is wrong, but I hope that was the cause of the issue. Just to add, I trained the model on the ground truth using your changes incorporating the RVQ module. However, the model with a 70% F1 score did not include that; it was also based on the noisy labels. The mel spectrogram trained beautifully with the noisy labels and has the best performance so far.

Thank you for your help so far. And yes, looking forward to your response.

@marypilataki
Owner

Hey @Nkcemeka

Missing those notes in the get_midi_label_for_codec function is definitely not a design choice, it is a bug! I was working on a different repo for the paper, and when trying to clean things up many things got lost in the process. I have updated this function.

I can't be sure about the issues with the results you shared. Thanks for noticing this implementation detail within the madmom library, I will have a look. Maybe it would be worth replacing the dataloader used for training the downstream model with the one used for pretraining and comparing the results? I did not do that myself.

Also, it is great to hear that you got decent performance when using mel spectrograms. I did not try this myself!

Let me know how you are getting on,
Best wishes,
Mary

@Nkcemeka
Author

Nkcemeka commented Feb 7, 2025

Hey @Nkcemeka

Missing those notes in the get_midi_label_for_codec function is definitely not a design choice, it is a bug! I was working on a different repo for the paper, and when trying to clean things up many things got lost in the process. I have updated this function.

I can't be sure about the issues with the results you shared. Thanks for noticing this implementation detail within the madmom library, I will have a look. Maybe it would be worth replacing the dataloader used for training the downstream model with the one used for pretraining and comparing the results? I did not do that myself.

Also, it is great to hear that you got decent performance when using mel spectrograms. I did not try this myself!

Let me know how you are getting on,
Best wishes,
Mary

Thank you so much. I appreciate your willingness to help. I think I understand the whole pipeline way better than before.

One final question: did you run your final checkpoint on the entire Slakh test set? I assume that taking 1-second excerpts at that hop size would give a lot of embeddings, well beyond 36 minutes' worth. Or did you settle for a size similar to the validation dataset?

Lastly, for the training data, I took the audio corresponding to the IDs you provided, did some random shuffling of these files, and then extracted excerpts until I reached 7200 embeddings, which is 2 hours' worth. I did this for all three datasets and wanted to verify whether this approach is okay. The same thing was done with the validation set to obtain 36 minutes' worth of embeddings.

@Nkcemeka
Author

Nkcemeka commented Feb 10, 2025

Hello @marypilataki,

Here are some updates:

Training Process

  1. I got a list of all the IDs for each dataset from the repo
  2. I performed a random shuffle on the training list and validation list for each dataset.
  3. I took each audio file in the above lists and generated embeddings.
  4. For the training list for each dataset, when the number of embeddings reached 7200, I stopped the process, since 7200 embeddings is 2 hours' worth of data. I did this for Slakh, MusicNet and GuitarSet.
  5. I repeated the above for the validation list for each dataset and got 720 embeddings per dataset (36 minutes' worth of data).

Results

  • On the entire Slakh Test set: Precision is 63.47%, Recall is 61.87% and F1-score is 62.65%.
  • On a subset of the Slakh Test set (7200 embeddings, the same amount of data as the training set): Precision is 62.83%, Recall is 63.62% and F1-score is 63.23%. When I use the size of the validation set, I get roughly: Precision of 61.2%, Recall of 51.27% and F1-score of 55.8%.
  • On the entire MusicNet Test set directory: Precision is 47.08%, Recall is 44.62% and F1-score is 45.81%.

I will run the above experiments again to confirm the results. Comparing the above to the paper: for Slakh, you had a higher F1-score of 69.7% against my 62.65% on the entire test set and 63.23% on a subset of it (training-set size), which I think is okay and was good to see. For the size of the validation set, I had a score of around 55.8%.

For MusicNet, I had a much lower F1-score of 45.81% compared to yours, which was about 64%. The MusicNet test set I used was slightly larger than the validation-set size. Maybe the differences come down to implementation, which is why I wanted to verify whether my approach to the training process above is similar to yours. That aside, the results look much better, especially those for the Slakh dataset, and I think it is safe to assume everything is working fine now.

In any case, I would appreciate any feedback on the details of the training process, so I can ensure I am doing everything right.

Also, if you don't mind: I checked the checkpoint, and it produces 86 frames instead of 87, which is not much of a problem. I just wanted to confirm whether these weights are for DAC, PADAC, PADAC(g) or PADAC(n); that was something I forgot to ask.

Thanks,
Chukwuemeka
