Hey there! First of all, I wanted to thank you for your work. I'm a big fan of your ColPali and ColQwen models, and the fact that you not only open-sourced them but also released the code and your whole training set is an immense gift to the community.
I'm reaching out to ask your opinion on something. I would like to use ColPali/ColQwen2 in a multilingual RAG scenario. Since your training set is English-only, the model will likely not perform particularly well on multilingual tasks, so I was thinking of fine-tuning on multilingual data, perhaps starting with just a few languages such as EN, IT, FR, and ES.
I wanted to ask whether you think this is a reasonable idea, and whether you have any insights on the order of magnitude of samples I would need to gather. I know your original training set had around 130k samples, and I was hoping that, for fine-tuning, something between 1k and 10k query-page pairs might be enough.
Do you have any insights about this? Or any general suggestions?
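To make the question concrete, this is roughly the fine-tuning setup I had in mind: LoRA adapters on top of a released checkpoint, trained with an in-batch contrastive loss over late-interaction (MaxSim) scores. It's only a minimal sketch, assuming `colpali-engine` and `peft` are installed; the dataset schema (`"query"`/`"image"` fields), the base checkpoint, the LoRA target modules, and all hyperparameters are my own illustrative assumptions, not your training recipe.

```python
import torch
from torch.utils.data import DataLoader
from peft import LoraConfig, get_peft_model
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # assumed base checkpoint
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = ColPaliProcessor.from_pretrained(model_name)

# Attach LoRA adapters so only a small fraction of the weights is trained.
# Target modules and rank are illustrative guesses, not the original config.
lora = LoraConfig(
    r=32, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)

# Hypothetical multilingual set: a list of {"query": str, "image": PIL.Image} pairs.
train_pairs = [...]

def collate(batch):
    queries = processor.process_queries([ex["query"] for ex in batch])
    images = processor.process_images([ex["image"] for ex in batch])
    return queries, images

def in_batch_loss(q_emb, d_emb):
    # Late-interaction (MaxSim) score of every query against every page in the
    # batch, then cross-entropy with the matching page as the positive.
    scores = torch.einsum("bnd,csd->bcns", q_emb, d_emb).max(dim=3).values.sum(dim=2)
    labels = torch.arange(scores.size(0), device=scores.device)
    return torch.nn.functional.cross_entropy(scores.float(), labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loader = DataLoader(train_pairs, batch_size=8, shuffle=True, collate_fn=collate)

model.train()
for queries, images in loader:
    q_emb = model(**queries.to(model.device))  # (batch, query_tokens, dim)
    d_emb = model(**images.to(model.device))   # (batch, page_tokens, dim)
    loss = in_batch_loss(q_emb, d_emb)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

If something like this is sound, my main open question is whether 1k-10k multilingual pairs is enough signal for this kind of adapter training, or whether it risks degrading the English performance of the base model.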