We split the original BEST-2010 corpus into training and validation sets; we call this the raw data. Essentially, one can create such a dataset by concatenating the original corpus files into a single text file. We can share the splits that we made; please get in touch.
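As a minimal sketch (the paths here are hypothetical; BEST-2010 ships as per-genre text files), the concatenation could look like:

```bash
# Concatenate the per-genre BEST-2010 files into a single raw text file.
cat ./BEST-2010/*/*.txt > ./data/best-raw.txt
```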
Note to self: the main data directory is `./data/best-syllable-big`.
Please install all necessary packages via `pip install -r requirements.txt`.
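A typical setup (our habit, not something the repo prescribes) is to do this inside a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```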
```bash
python ./scripts/train.py --model-name seq_sy_ch_conv_3lv \
  --model-params "embc:8|embt:8|embs:8|conv:8|l1:6|do:0.1|oc:BI" \
  --data-dir ./data/best-syllable-big \
  --output-dir ./artifacts/model-test \
  --epoch 1 \
  --batch-size 128 \
  --lr 0.001
```
Available models and their configurations can be found in `./attacut/models`.
```bash
python ./scripts/attacut-cli ../docker-thai-tokenizers/data/wisesight-1000/input.txt \
  --model=./artifacts/model-xx
```
```bash
python ./scripts/benchmark.py \
  --label ../docker-thai-tokenizers/data/tnhc/tnhc.label \
  --input=../docker-thai-tokenizers/data/tnhc/input_tokenised-deepcut-deepcut.txt
```
```bash
# This script runs segmentation and benchmarking in one shot.
python ./scripts/eval.py \
  --model <path-to-model> \
  --dataset <dataset>
```
We use a cluster provided by GWDG to run the random search; the cluster uses Slurm as its queue manager. The script below submits one Slurm job per parameter configuration (see `./scripts/hyper-configs`):
```bash
python ./scripts/hyperopt.py --config=./scripts/hyper-configs/seq_ch_conv_3lv.yaml \
  --N=20 \
  --max-epoch=20
```
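`hyperopt.py` handles the job submission itself; once the jobs are queued, the standard Slurm tools can be used to keep an eye on them, for example:

```bash
squeue -u $USER    # list your pending and running jobs
sacct -j <job-id>  # inspect the status and resource usage of a finished job
```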
- `./scripts/writing`: scripts for generating the LaTeX tables used in the paper. These scripts are invoked via Make commands.
- `./scripts/data-related`: a couple of scripts for
  - computing the number of words and characters in a dataset (see the sketch after this list);
  - preprocessing the TNHC dataset.
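For reference, word and character counts for a BEST-style file can be approximated from the shell. This is a minimal sketch, assuming word boundaries are marked with `|` and a file named `train.txt` (both are assumptions, not the actual script's interface):

```bash
# Word count: every "|" closes a word, so split on it and count non-empty lines.
tr '|' '\n' < ./data/best-syllable-big/train.txt | sed '/^$/d' | wc -l
# Character count: drop the separators and newlines, then count characters.
tr -d '|\n' < ./data/best-syllable-big/train.txt | wc -m
```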
Please see https://github.com/heytitle/tokenization-speed-benchmark.
| File | Description |
|---|---|
| `viz-plot-hyperopt-results.ipynb` | makes the plot of expected validation performance, i.e. Figure 3 |
| `x_attacut_captum.ipynb` | explains model decisions, i.e. Figures 4 and 5 |
| `extract-syllable-dict.ipynb` | as the name suggests |
| `convert-raw-to-syllable-and-label.ipynb` | converts the raw BEST-2010 dataset to the dataset with syllable labels |
All data files and models are backed up at `s3://[backup-bucket]/projects/2020-Syllable-based-Neural-Thai-Word-Segmentation`.
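Assuming the AWS CLI is configured with access to that bucket, the backup can be mirrored locally with a standard sync:

```bash
# Pull the project backup into ./backup (bucket placeholder left as-is).
aws s3 sync s3://[backup-bucket]/projects/2020-Syllable-based-Neural-Thai-Word-Segmentation ./backup
```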
Install PyTorch with GPU support on GWDG's cluster:

```bash
pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html
```