
clip #116

Open
Muinez opened this issue Dec 25, 2024 · 4 comments
Labels
documentation Improvements or additions to documentation

Comments

@Muinez

Muinez commented Dec 25, 2024

Could you please try training Sana together with CLIP, similar to how it's done in SDXL? I experimented with fine-tuning Sana on CLIP embeddings (I modified the caption channels), and the model trained significantly better compared to using pure Gemma.
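
Roughly, the swap looks like the sketch below: encode the prompt with a CLIP text encoder and feed its hidden states to Sana, with `caption_channels` set to the CLIP hidden size instead of Gemma's. The checkpoint and wiring here are only illustrative, not the exact training setup.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed checkpoint; any CLIP text encoder with the desired token limit works the same way.
clip_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(clip_id)
text_encoder = CLIPTextModel.from_pretrained(clip_id).eval()

prompt = "a girl with detailed eyes and intricate textures"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 for this checkpoint
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    caption_embeds = text_encoder(**tokens).last_hidden_state  # shape (1, 77, 768)

# Sana's transformer would then be constructed with caption_channels matching the
# CLIP hidden size (768 here) instead of Gemma's, and caption_embeds fed in as the
# caption conditioning during training.
print(caption_embeds.shape)
```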

@lawrence-cj
Collaborator

Nice. Any comparison we could look at to see the improvement?

lawrence-cj added the documentation label on Jan 2, 2025
@Muinez
Author

Muinez commented Jan 3, 2025

> Nice. Any comparison we could look at to see the improvement?

https://wandb.ai/muinez/mysana

Run 1 is Gemma; runs 2 and 3 are CLIP.

Don't focus on the end of the 2nd run because I broke something there. Look at the 3rd run and the beginning to mid-point of the 2nd run.

@lawrence-cj
Collaborator

No idea what the improvement is. Can you explain more?

@Muinez
Author

Muinez commented Jan 3, 2025

> No idea what the improvement is. Can you explain more?

The model seems to generate more aesthetically pleasing art overall, with improvements in details like eyes and textures. Prompt following has gotten worse, though, because prompts often don't fit within the 64-token limit of the CLIP version I used for training. The art also looks more varied and possibly more lively; that could just be my impression, but I'm not the only one who noticed. I shared the results with others, and they also think the CLIP version performs better.

It isn't a huge resource hog either: I managed to do this on a modest A6000, and the model adapts quickly. I think it's worth experimenting with if you haven't tried it yet. If you decide to train, maybe try using the CLIP from the SDXL Animagine finetune, and then fine-tune it further on longer prompts to improve its understanding of them.
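
A rough sketch of pulling that CLIP text encoder out of an SDXL Animagine finetune (the repo id and the diffusers-style subfolder layout are assumptions; point it at whichever checkpoint you actually use):

```python
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed Animagine SDXL checkpoint in diffusers format with tokenizer/text_encoder subfolders.
repo = "cagliostrolab/animagine-xl-3.1"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# This encoder could replace Gemma as Sana's caption conditioning and then be
# fine-tuned jointly on longer prompts to work around the token limit mentioned above.
```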
