Fine-tuning Sentence-Transformers for Multi-class Language Identification Task with PyTorch Lightning
This repository is for language identification using SentenceTransformer, a pre-trained transformer-based model for natural language processing.
A list of SentenceTransformer pre-trained models can be found here
I specifically used the task-agnostic (English) pre-trained SentenceTransformer model to extract features from 100 documents per language, then trained a single linear classifier on the extracted features.
Figure: Architecture for the approach. A pre-trained SentenceTransformer transforms the documents and the embeddings are used to train a single linear classifier.
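The two stages in the figure can be sketched roughly as follows. This is a minimal illustration in plain PyTorch, not the code in this repository (which uses PyTorch Lightning). The function names `embed_documents` and `train_linear_classifier`, the optimizer, the learning rate, and the epoch count are all assumptions made for the example.

```python
import torch
from torch import nn


def embed_documents(documents, model_name="all-MiniLM-L6-v2"):
    """Encode raw text with a frozen pre-trained SentenceTransformer (hypothetical helper)."""
    # Imported lazily so the classifier below can be tried without the library installed.
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer(model_name)
    # encode() returns one fixed-size vector per document as a NumPy array
    # (384 dimensions for all-MiniLM-L6-v2).
    return torch.tensor(encoder.encode(documents))


def train_linear_classifier(embeddings, labels, num_classes, epochs=200, lr=0.1):
    """Fit a single linear layer on pre-computed embeddings (hypothetical helper)."""
    classifier = nn.Linear(embeddings.shape[1], num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr)  # assumed optimizer
    loss_fn = nn.CrossEntropyLoss()  # standard multi-class loss on raw logits
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(classifier(embeddings), labels)
        loss.backward()
        optimizer.step()
    return classifier
```

With real data, `embed_documents` would be called once on the documents, and the resulting tensor passed to `train_linear_classifier` together with integer language labels. Only the linear layer is updated here; the encoder stays frozen.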
Python version: Python 3.10.8
- Clone the repo:
git clone kayodeolaleye/multilang-identification
cd ./multilang-identification
- Install requirements:
pip install -r requirements.txt
- Train the model:
python training.py --model_name all-MiniLM-L6-v2 --epochs 1000 --batch_size 32
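For readers wondering how the three flags in the training command might be consumed, here is a small sketch using the standard-library `argparse` module. It is an assumption about how a script like `training.py` could read its options, not a copy of the actual script. The defaults simply mirror the values in the command above, and `parse_args` is a hypothetical helper name.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags shown in the training command (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Train a linear language-identification classifier."
    )
    # Name of the pre-trained SentenceTransformer used as feature extractor.
    parser.add_argument("--model_name", type=str, default="all-MiniLM-L6-v2")
    # Number of passes over the training data (default is an assumption).
    parser.add_argument("--epochs", type=int, default=1000)
    # Number of documents per optimization step (default is an assumption).
    parser.add_argument("--batch_size", type=int, default=32)
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try outside the command line.
args = parse_args(["--model_name", "all-MiniLM-L6-v2", "--epochs", "5"])
print(args.model_name, args.epochs, args.batch_size)
```

Calling `parse_args()` with no argument would read the flags from the real command line instead.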