This repository was created as auxiliary material for the presentation "Do You Speak Allegro? Large-Scale Language Modeling Using Allegro Offers Data", given by Riccardo Belluzzo at Allegro Tech Meeting 2021. You can find the slides of the presentation here.
If you have followed the presentation, you may find this repository useful to better understand the code presented. Otherwise, you may use this repository as a starting point for a project involving Language Model training.
Language Models are the foundation of modern NLP systems. Many state-of-the-art results on NLP tasks such as Text Classification, Summarization, and Machine Translation have been achieved by fine-tuning Language Models pre-trained on large corpora of data.
In this repository, you will find the most compact code possible showing how you can train a BERT-based Language Model on your domain-specific data. We have written this code using 🤗transformers and pytorch-lightning, two libraries known for being very intuitive and easy to use.
In order to train a Language Model, you need:
- A training corpus: a large set of strings representing the "sentences" of the language you want to model. If your domain data were e-commerce user reviews, for example, you could use each review as a sentence.
- A tokenizer: an object converting strings into sequences of numbers (tokens).
- A model: a neural network trained on the Masked Language Model objective, i.e., trying to predict the masked tokens in a given sentence.
In this repository we provide code for each of the aforementioned steps.
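For orientation, here is a minimal sketch of how those three pieces interact in 🤗transformers. The pre-trained checkpoint name is only a placeholder; the repository trains its own tokenizer and model from scratch, as shown in the following steps.

```python
# Minimal sketch: the tokenizer turns text into tokens, the data collator
# masks some of them, and the model is trained to predict the masked positions.
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")  # placeholder
model = RobertaForMaskedLM.from_pretrained("roberta-base")        # placeholder
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

encodings = [tokenizer("a sentence from your training corpus")]
batch = collator(encodings)   # randomly masks ~15% of the tokens, builds labels
loss = model(**batch).loss    # cross-entropy on the masked tokens only
```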
Install the environment by running:
conda env create -f environment.yaml
This will create a conda environment called lm-training-demo.
Activate it by running:
conda activate lm-training-demo
Under the activated environment, run:
python ./corpus/run_corpus_generation.py
This will download and extract the amazon_reviews_us dataset to the specified cache_dir, process it, and save it into the specified output_dir.
In particular, the corpus will be split into two sets (training and validation) of several .txt files.
Each .txt file contains one sentence per line and should look like this:
- Britax Roundabout G4 Convertible Car Seat Onyx Prior Model
- Bebamour New Style Designer Sling and Baby Carrier 2 in 1
- Spry Kids Xylitol Tooth Gel 3 Pack Bubble Gum Original Strawberry Banana
- Graco Fastaction Fold Duo Click Connect Stroller
- JJ Cole Mode Diaper Tote Bag Mixed Leaf Discontinued by Manufacturer
- The creative energy saving mushroom energysaving lamps LED touch small night light the head of a bed bedroom lamp multi color
- Little Giraffe Luxe Solid Blanky
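For reference, the script's logic can be sketched roughly as follows with 🤗datasets; the dataset name, configuration, field, and split ratio used here are assumptions for illustration, not the script's exact parameters.

```python
# Rough sketch of corpus generation: download reviews, keep one sentence
# per line, and split into training and validation files.
from datasets import load_dataset

# Dataset name/config and the use of product titles are assumptions.
ds = load_dataset("amazon_us_reviews", "Baby_v1_00", cache_dir="cache_dir")
sentences = [row["product_title"] for row in ds["train"]]

split = int(0.9 * len(sentences))  # assumed 90/10 train/validation split
for name, chunk in [("training", sentences[:split]), ("validation", sentences[split:])]:
    with open(f"output_dir/{name}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(chunk) + "\n")
```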
Under the activated environment, run:
python ./tokenization/run_tokenizer_training.py --corpus-dir path/to/training_corpus_dir
where training_corpus_dir is the training corpus directory generated in the previous step.
If training succeeds, you should see output like this:
[00:00:04] Pre-processing files (86 Mo) ████████████████████████████████████████████████████████████████ 100%
[00:00:00] Tokenize words ████████████████████████████████████████████████████████████████ 40676 / 40676
[00:00:00] Count pairs ████████████████████████████████████████████████████████████████ 40676 / 40676
[00:00:00] Compute merges ████████████████████████████████████████████████████████████████ 29874 / 29874
and a file called tokenizer.json should be visible in the specified output_dir.
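The progress bars above come from the 🤗tokenizers library; a byte-level BPE training run of the kind the script performs can be sketched like this (the vocabulary size and special tokens are assumptions):

```python
# Sketch of training a byte-level BPE tokenizer on the corpus .txt files.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

files = [str(p) for p in Path("path/to/training_corpus_dir").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=30_000,  # assumed; choose to match your corpus size
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa-style
)
tokenizer.save("output_dir/tokenizer.json")
```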
Under the activated environment, run:
python ./masked_language_model/main.py \
    --job-dir path/to/job_dir \
    --path-to-train-set path/to/training_corpus \
    --path-to-val-set path/to/validation_corpus \
    --path-to-tokenizer path/to/tokenizer.json
This will train a RoBERTa model from scratch on the masked language modeling objective.
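The training loop itself is handled by pytorch-lightning; here is a hedged sketch of the kind of LightningModule the script might define (the class name and hyperparameters are illustrative, not the repository's exact code):

```python
# Illustrative LightningModule wrapping a RoBERTa model trained from scratch
# on the masked language modeling objective.
import pytorch_lightning as pl
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

class MaskedLMModule(pl.LightningModule):
    def __init__(self, vocab_size: int, lr: float = 5e-5):
        super().__init__()
        self.model = RobertaForMaskedLM(RobertaConfig(vocab_size=vocab_size))
        self.lr = lr

    def training_step(self, batch, batch_idx):
        loss = self.model(**batch).loss  # loss computed on masked positions only
        self.log("train_loss", loss)     # picked up by the tensorboard logger
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self.model(**batch).loss)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```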
You can monitor the training through tensorboard by launching:
tensorboard --logdir path/to/job_dir/tensorboard
where path/to/job_dir/tensorboard points to the tensorboard logs (automatically saved during training).
You can try out the Language Model you have trained by running:
python masked_language_model/test_lm.py
Specify an input string to be filled, like t-shirt stan <MASK> (Polish for "t-shirt, condition <MASK>"), and the script will produce output like the following:
[{'sequence': 't-shirt stan nowy', 'score': 0.9219169, 'token': 267, 'token_str': ' nowy'},
{'sequence': 't-shirt stan używany', 'score': 0.063408, 'token': 717, 'token_str': ' używany'},
{'sequence': 't-shirt stan idealny', 'score': 0.00275171, 'token': 9666, 'token_str': ' idealny'}]
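Under the hood, a test script like this can be implemented with the 🤗transformers fill-mask pipeline; a plausible sketch follows (the paths are placeholders, and note that RoBERTa's mask token is conventionally spelled <mask>):

```python
# Plausible sketch of filling a masked token with the trained model.
from transformers import PreTrainedTokenizerFast, RobertaForMaskedLM, pipeline

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="path/to/tokenizer.json",
    mask_token="<mask>",
    pad_token="<pad>",
)
model = RobertaForMaskedLM.from_pretrained("path/to/job_dir/checkpoint")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("t-shirt stan <mask>"))  # top candidate tokens with scores
```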