GitHub - kayodeolaleye/multilang-identification: A PyTorch Lightning Implementation of Multi-Language Identification using a SentenceTransformer model pre-trained on English. Work done while interning at ByteFuse.

Fine-tuning Sentence-Transformers for Multi-class Language Identification Task with PyTorch Lightning

This repository performs language identification using SentenceTransformers, a library of pre-trained transformer-based models that map text to fixed-size embeddings.

A list of SentenceTransformer pre-trained models can be found here

I specifically used a task-agnostic SentenceTransformer model pre-trained on English to extract features from 100 documents per language, then trained a single linear classifier on the extracted features.

Figure: Architecture for the approach. A pre-trained SentenceTransformer transforms the documents and the embeddings are used to train a single linear classifier.
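
The pipeline described above can be sketched roughly as follows. This is a minimal illustration in plain PyTorch, not the repository's PyTorch Lightning code: the function names `encode_documents` and `train_linear_classifier`, the Adam optimizer, the learning rate, and full-batch training are all assumptions made for the example.

```python
import torch
from torch import nn


def encode_documents(docs, model_name="all-MiniLM-L6-v2"):
    """Embed documents with a frozen pre-trained SentenceTransformer."""
    # Imported lazily because loading the model downloads weights on first use.
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer(model_name)
    # encode() returns a NumPy array of shape (num_docs, embedding_dim) by default.
    return torch.tensor(encoder.encode(docs), dtype=torch.float32)


def train_linear_classifier(embeddings, labels, num_classes, epochs=200, lr=0.1):
    """Fit one linear layer on fixed embeddings (hypothetical helper).

    Returns the trained classifier and the per-epoch training loss.
    """
    classifier = nn.Linear(embeddings.shape[1], num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)  # assumed optimizer
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(classifier(embeddings), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return classifier, losses
```

Only the linear layer receives gradients here; the encoder is used purely as a feature extractor, which matches the figure. Mini-batching, validation, and checkpointing are left out to keep the sketch short.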

Python version: Python 3.10.8

Train from Scratch

Clone the repo:

git clone https://github.com/kayodeolaleye/multilang-identification.git
cd ./multilang-identification

Install requirements:

pip install -r requirements.txt

Train the model:

python training.py --model_name all-MiniLM-L6-v2 --epochs 1000 --batch_size 32
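
For orientation, the three flags in the command above could be parsed along these lines. This is a hypothetical sketch, not the contents of `training.py`: the function name `parse_args`, the help strings, and the defaults (copied from the example command) are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags shown in the training command (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Train a linear language-identification classifier."
    )
    parser.add_argument(
        "--model_name",
        type=str,
        default="all-MiniLM-L6-v2",  # assumed default, taken from the example command
        help="Pre-trained SentenceTransformer used as the feature extractor",
    )
    parser.add_argument(
        "--epochs", type=int, default=1000, help="Number of training epochs"
    )
    parser.add_argument(
        "--batch_size", type=int, default=32, help="Documents per training batch"
    )
    # argv=None makes argparse read sys.argv[1:], as a script normally would.
    return parser.parse_args(argv)
```

Swapping the feature extractor would then only require a different `--model_name` value, for example `all-MiniLM-L12-v2`.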

Embeddings from pre-trained models: all-MiniLM-L6-v2 and all-MiniLM-L12-v2, respectively

Learning curves for the single Linear Classifier

Performance on test set

ToDo: Example Usage

Add code snippets for loading the model weights and assessing performance on test samples in Google Colab

