Fine-tuning ColPali to make it multilingual #163

Open
AndRossi opened this issue Jan 2, 2025 · 0 comments

Hey there! First of all, I wanted to thank you for your work. I'm a big fan of your ColPali and ColQwen models, and the fact that you not only open-sourced them but also released the code and your whole training set is an immense gift to the community.

I am reaching out because I wanted to ask your opinion on something. I would like to use ColPali/ColQwen2 in a multilingual RAG scenario. Since your training set is English-only, the model will probably not perform particularly well on multilingual tasks. Hence, I was thinking of doing some fine-tuning on multilingual data, maybe starting with just a few languages like EN, IT, FR, and ES.
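For context, here is roughly the kind of setup I had in mind: a minimal LoRA fine-tuning sketch on top of colpali-engine. To be clear, the checkpoint name, the LoRA hyperparameters, and the in-batch contrastive loss below are my own assumptions, not your official training recipe.

```python
# Minimal sketch, not the official ColPali training script.
# Assumed: the "vidore/colpali-v1.2" checkpoint, the LoRA targets,
# and the in-batch contrastive loss are all illustrative choices.
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # assumed checkpoint
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16)
processor = ColPaliProcessor.from_pretrained(model_name)

# Train only low-rank adapters on the attention projections;
# everything else (vision tower, base LLM) stays frozen.
lora = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)

def maxsim_scores(q_emb, d_emb):
    """Late-interaction (ColBERT-style) scoring.

    q_emb: (B, Nq, D) query token embeddings
    d_emb: (B, Nd, D) page patch embeddings
    Returns a (B, B) matrix of query-vs-page scores: each query token
    takes its max similarity over all page patches, then sums.
    """
    sim = torch.einsum("bnd,cmd->bcnm", q_emb, d_emb)  # (B, B, Nq, Nd)
    return sim.max(dim=3).values.sum(dim=2)            # (B, B)

def train_step(queries, pages, optimizer):
    # queries: list[str] in mixed languages; pages: list[PIL.Image]
    batch_q = processor.process_queries(queries).to(model.device)
    batch_p = processor.process_images(pages).to(model.device)
    q_emb = model(**batch_q)  # multi-vector query embeddings
    d_emb = model(**batch_p)  # multi-vector page embeddings
    scores = maxsim_scores(q_emb, d_emb)
    # In-batch negatives: page i is the positive for query i.
    # NOTE: a real loss would mask query padding tokens; omitted here.
    labels = torch.arange(scores.size(0), device=scores.device)
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The rationale for LoRA here is that training only low-rank adapters, with the vision tower and base language model frozen, should need far less data than your original training run did.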

I wanted to ask whether you think this is a reasonable idea, and whether you have any insight into the order of magnitude of samples I would need to gather. I know your original training set had around 130k samples, and I was hoping that, for fine-tuning, something between 1k and 10k query-page pairs would be enough.

Do you have any insights about this, or any general suggestions?
