
Cannot Reproduce Results. Release Evaluation Code #15

Open
wjc2830 opened this issue Aug 12, 2024 · 5 comments

Comments

@wjc2830

wjc2830 commented Aug 12, 2024

Thank you for your work on this paper.
However, I am unable to reproduce the main results reported in your paper, including the FID, onset detection accuracy, and AP.
For evaluating FID, I used the SpecVQGAN code, and for onset performance, I used the CondFoleyGen code.
Despite using these resources, the results obtained from this repository's inference code do not match those reported in your paper.
Could you please release your evaluation scripts to facilitate further investigation and ensure reproducibility?

@wjc2830 wjc2830 changed the title Release Evaluation Code Cannot Reproduce Results. Release Evaluation Code Aug 17, 2024
@ymzhang0319
Collaborator

Hi @wjc2830, thanks for your interest.

We use the same evaluation tools.
Could you please provide more details about your evaluation settings (we used semantic weight 1.0, temporal weight 0.2, ... in our experiments)?
Then I can try to help you figure it out.

@wjc2830
Author

wjc2830 commented Aug 19, 2024

Yes, ip_adapter_weight is set to 1.0 and controlnet_conditioning_scale is set to 0.2. I opted not to use a class prompt like "machine gun shooting" because I observed that the results generated without it were superior to those with it. With these settings, on AVSync15 (1500 samples) I got: onset acc: 0.1213, detection acc: 0.1347, detection AP: 0.6893, FID: 47.498411865917966, MKL: 5.17888476451238, KID: [0.046363522010469276 - 1.8900277153235862e-07].
Regarding FID computation, I want to clarify whether the reported FID score is averaged across all 15 classes, or computed over all samples with no notion of class.
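For reference on the class-averaging question: below is a minimal, hypothetical sketch of the standard Fréchet distance that FID-style metrics compute from embedding statistics. Whether the statistics are pooled over all samples or computed per class and then averaged is a choice of the evaluation script; the function names here are illustrative and are not taken from the SpecVQGAN codebase.

```python
# Sketch (assumption): FID as the Frechet distance between Gaussian fits
# of real vs. generated embedding sets. Names are illustrative only.
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; small imaginary
    # components from numerical error are discarded.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))


def fid_from_features(feats_real, feats_fake):
    """Pooled FID: one Gaussian fit per set, no per-class averaging."""
    mu1, sigma1 = feats_real.mean(0), np.cov(feats_real, rowvar=False)
    mu2, sigma2 = feats_fake.mean(0), np.cov(feats_fake, rowvar=False)
    return frechet_distance(mu1, sigma1, mu2, sigma2)
```

A per-class variant would instead call `fid_from_features` once per class and average the results, which generally yields a different (typically higher) number than the pooled version, so the two conventions are not directly comparable.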

@ymzhang0319
Collaborator

Thanks for your information! The evaluation experiments are conducted on the AVSync15 test set (150 samples). You can refer to the official link of AVSync15.
If you have any other questions, please feel free to contact us.

@wjc2830
Author

wjc2830 commented Aug 19, 2024

Thank you for the prompt reply. I have re-implemented the evaluation and obtained the following results: FID: 33.87400189673342, MKL: 5.159568889935811, KID: [0.053455384736237746 - 1.3165596698642787e-07], onset acc: 0.1007, detection acc: 0.1209, detection AP: 0.6936.
There are still discrepancies in metrics such as MKL and detection accuracy. To ensure consistency, I recommend releasing your evaluation script.

@Gloria2tt

@wjc2830 Can you provide your test code? I am also trying to do the same thing you did.
