This is based on the ideas in Facebook's FastText but implemented in Tensorflow. However, it is not an exact replica of fastText.
Classification is done by embedding each word, taking the mean embedding over the full text, and classifying that using a linear classifier. The embedding is trained jointly with the classifier. You can also specify 2+ character ngrams; these ngrams are hashed and then embedded in a similar manner to the original words. Note that ngrams make training much slower but give only marginal improvements in performance, at least in English.
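As a rough illustration of that idea (this is a minimal sketch, not the code in this repo; the vocabulary size, number of hash bins, embedding dimension and label count below are made-up placeholders):

    import tensorflow as tf

    VOCAB_SIZE = 50000     # placeholder vocabulary size
    NGRAM_BINS = 100000    # placeholder number of ngram hash bins
    EMBED_DIM = 100        # placeholder embedding dimension
    NUM_LABELS = 10        # placeholder number of classes

    word_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="word_ids")
    ngram_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="ngram_ids")

    # Embed words and hashed ngrams, then average everything into a single
    # fixed-size vector for the whole text.
    word_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(word_ids)
    ngram_emb = tf.keras.layers.Embedding(NGRAM_BINS, EMBED_DIM)(ngram_ids)
    mean_emb = tf.keras.layers.GlobalAveragePooling1D()(
        tf.keras.layers.Concatenate(axis=1)([word_emb, ngram_emb]))

    # A single linear layer on top of the mean embedding does the classification.
    logits = tf.keras.layers.Dense(NUM_LABELS)(mean_emb)
    model = tf.keras.Model([word_ids, ngram_ids], logits)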
I may implement skipgram and cbow training later, or preloading of embedding tables.
<< Still WIP >>
- classification of text using word embeddings
- char ngrams, hashed to n bins
- train and test program
- preprocess facebook-format or plain text input into tensorflow records
- separate word vector training
- hierarchical softmax
- quantize models
The following are examples of how to use the applications. Get full help with the --help option on any of the programs.
To transform input data into tensorflow Example format:
    process_input.py --facebook_input=queries.txt --model_dir=. --ngrams=2,3,4
Or, using a text file with one example per line plus a separate file of labels:
    process_input.py --text_input=queries.txt --labels=labels.txt --model_dir=.
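Either way, the output is a file of tensorflow Example records. If you want to sanity-check what was written, a small snippet like the following (assuming Tensorflow 2.x eager execution; the exact feature keys depend on process_input.py) prints the feature keys of the first record:

    import tensorflow as tf

    # Inspect the first record written by process_input.py to see which
    # feature keys it contains.
    for raw in tf.data.TFRecordDataset("queries.tfrecords").take(1):
        example = tf.train.Example()
        example.ParseFromString(raw.numpy())
        print(list(example.features.feature.keys()))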
To train a text classifier:
    classifier.py \
      --train_records=queries.tfrecords \
      --eval_records=queries.tfrecords \
      --label_file=labels.txt \
      --vocab_file=vocab.txt \
      --model_dir=model \
      --export_dir=model
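Note that this example points --train_records and --eval_records at the same file purely for brevity; in practice you would evaluate on a held-out set processed the same way. The --export_dir is where the saved model used by the predictor below is written.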
To predict classifications for text, use a saved model exported by classifier.py. Passing --export_dir to classifier.py stores a saved model in a numbered directory below export_dir. Pass that directory to the following to use the model for predictions:
    predictor.py \
      --saved_model=model/12345678 \
      --text="some text to classify" \
      --signature_def=proba
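If you are unsure what inputs and outputs the exported model expects, Tensorflow's standard saved_model_cli tool can list its signatures (the directory name is the example one from above):

    saved_model_cli show --dir model/12345678 --all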
To export the embedding for a text, you can also use predictor.py. Note that this will only be the text embedding, not the ngram embeddings.
    predictor.py \
      --saved_model=model/12345678 \
      --text="some text to classify" \
      --signature_def=embedding
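One thing you might do with the exported embeddings is compare texts. The exact output format depends on predictor.py, but assuming you end up with one embedding vector per text as a numpy array, a cosine-similarity comparison is a rough sketch like:

    import numpy as np

    # Placeholder vectors; in practice these would be the embeddings returned
    # by two predictor.py calls with --signature_def=embedding.
    embedding_a = np.array([0.1, 0.3, -0.2], dtype=np.float32)
    embedding_b = np.array([0.2, 0.1, -0.1], dtype=np.float32)

    cosine = np.dot(embedding_a, embedding_b) / (
        np.linalg.norm(embedding_a) * np.linalg.norm(embedding_b))
    print(cosine)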
Use the provided script to train easily:
    train_classifier.sh path-to-data-directory
<< NOT IMPLEMENTED YET >>
You can compare with Facebook's fastText by running examples similar to those provided in their repository.
    ./classification_example.sh
    ./classification_results.sh