This repository contains the source code for the paper: Automated ICD Coding using Extreme Multi-label Long Text Transformer-based Models.
Restore MIMIC-III v1.4 data into a Postgres database.
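If it helps to verify the restore, the note table can be queried directly from Python. This is only a sketch; it assumes the standard mimic-code PostgreSQL build (tables in the `mimiciii` schema) and uses placeholder connection details:

```python
# Quick sanity check of the restored database; connection parameters and the
# "mimiciii" schema are assumptions based on the standard mimic-code build.
import psycopg2

conn = psycopg2.connect(dbname="mimic", user="postgres", password="postgres", host="localhost")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT hadm_id, text
        FROM mimiciii.noteevents
        WHERE category = 'Discharge summary'
        LIMIT 5
        """
    )
    for hadm_id, text in cur.fetchall():
        print(hadm_id, len(text))
conn.close()
```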
Prepare the MIMIC-III text for language model pretraining:

```shell
python3 lm/prepare_mimic_data_for_pretrain_lm.py
```

Pretrain the language model with masked language modelling:

```shell
python3 lm/run_mlm.py \
    --model_name_or_path=google/bigbird-roberta-base \
    --train_file=[data_dir]/mimic3_uncased_preprocessed_total.txt \
    --output_dir=[model_dir] \
    --max_seq_length=4096 \
    --line_by_line=True \
    --do_train=True \
    --do_eval=True \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=4 \
    --gradient_accumulation_steps=32 \
    --overwrite_output_dir=True \
    --max_steps=500000 \
    --save_steps=1000 \
    --warmup_steps=10000 \
    --learning_rate=2e-05 \
    --save_total_limit=20
```
The pretrained ClinicalBIGBIRD can be downloaded here.
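Once pretrained (or downloaded), the checkpoint can be loaded like any Hugging Face model. A minimal sketch, assuming `[model_dir]` points at the output directory from the pretraining run above:

```python
from transformers import AutoModel, AutoTokenizer

# Load the ClinicalBIGBIRD checkpoint; [model_dir] is the pretraining output directory above.
tokenizer = AutoTokenizer.from_pretrained("[model_dir]")
model = AutoModel.from_pretrained("[model_dir]")

# BigBird accepts sequences up to 4096 tokens, matching --max_seq_length above.
inputs = tokenizer("chief complaint: shortness of breath",
                   return_tensors="pt", truncation=True, max_length=4096)
hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, seq_len, hidden_size)
```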
To prepare the training data for the baseline model, use mimic3_data_preparer.py from HiLAT with the following command-line flags:
| Name | Value |
|---|---|
| pre_process_level | raw |
| segment_text | False |
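For context, `segment_text=False` keeps each discharge summary as one long string rather than splitting it into fixed-length chunks. The rough idea of chunked segmentation, shown only as an illustration and not HiLAT's exact logic, looks like this:

```python
def segment_text(note: str, chunk_size: int = 512):
    """Split a note into fixed-size word chunks. With segment_text=False this step is
    skipped and the raw note is kept whole (illustration only, not HiLAT's exact logic)."""
    words = note.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

note = "admission date: [**2151-7-16**] discharge date: [**2151-8-4**] chief complaint: chest pain"
print(segment_text(note, chunk_size=6))
```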
- Download the pretrained Transformer RoBERTa-PM-M3
- Training: `python3 xr-lat/run_coding.py config-baseline.json`
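A common building block in this family of ICD coding models is label-wise attention over the encoder's token representations. The snippet below is only a conceptual sketch of that general mechanism (CAML/LAAT style); it is not the architecture implemented in xr-lat/run_coding.py:

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Per-label attention over token states (conceptual sketch, not the repo's model)."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden_size))
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden)
        scores = torch.einsum("bsh,lh->bls", token_states, self.label_queries)
        attn = torch.softmax(scores, dim=-1)          # (batch, labels, seq_len)
        label_repr = torch.bmm(attn, token_states)    # (batch, labels, hidden)
        # One logit per label: dot each label representation with its own classifier row.
        logits = (label_repr * self.classifier.weight).sum(-1) + self.classifier.bias
        return logits

# Toy usage with random encoder outputs.
layer = LabelWiseAttention(hidden_size=768, num_labels=50)
print(layer(torch.randn(2, 128, 768)).shape)  # (2, 50)
```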
- Data preparation for XR-LAT is the same as for the baseline model.
- Generate the hierarchical label tree (see the sketch after this list): `python3 data_processing/generate_hierarchical_label_tree.py`
- Training: `python3 xr-lat/run_coding.py config-xrlat.json`
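The hierarchical label tree arranges the ICD codes into nested groups so the classifier can work through the label space level by level. A minimal sketch of one common way to build such a tree, recursive 2-means clustering over label feature vectors; the actual scheme in data_processing/generate_hierarchical_label_tree.py may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_label_tree(label_features: np.ndarray, labels: list, max_leaf: int = 8, seed: int = 0):
    """Recursively split labels into two clusters until groups are small
    (a sketch only; the repo's script may use a different clustering scheme)."""
    if len(labels) <= max_leaf:
        return labels  # leaf node: a small group of ICD codes
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(label_features)
    children = []
    for c in (0, 1):
        idx = np.where(km.labels_ == c)[0]
        children.append(build_label_tree(label_features[idx], [labels[i] for i in idx], max_leaf, seed))
    return children

# Toy example: random "label embeddings" for 20 hypothetical ICD codes.
rng = np.random.default_rng(0)
tree = build_label_tree(rng.normal(size=(20, 16)), [f"code_{i}" for i in range(20)])
print(tree)
```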
- Prepare the XR-Transformer training data: `python3 data_processing/prepare_xrtransformer_data.py`
- Install libpecos using the "install and develop locally" option.
- Replace pecos/xmc/xtransformer/network.py in the installed libpecos source with xr-transformer/network.py from this repository.
- Create TF-IDF features (a sketch for inspecting the resulting matrices follows this list):

```shell
python3 -m pecos.utils.featurization.text.preprocess build --text-pos 0 --input-text-path ../data/mimic3/full-xr/train.txt --output-model-folder ./tfidf-model
python3 -m pecos.utils.featurization.text.preprocess run --text-pos 0 --input-preprocessor-folder ./tfidf-model --input-text-path ../data/mimic3/full-xr/train.txt --output-inst-path ./data/train.tfidf.npz
python3 -m pecos.utils.featurization.text.preprocess run --text-pos 0 --input-preprocessor-folder ./tfidf-model --input-text-path ../data/mimic3/full-xr/test.txt --output-inst-path ./data/test.tfidf.npz
```
- Training: `xr-transformer/train_and_predict.sh`
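As a quick sanity check on the TF-IDF step above, the generated .npz files can be loaded back as sparse matrices. A minimal sketch, assuming libpecos is installed as described and the output paths were left unchanged:

```python
# Inspect the TF-IDF feature matrices produced by the pecos preprocess commands above;
# adjust the paths if you changed the output locations.
from pecos.utils import smat_util

X_train = smat_util.load_matrix("./data/train.tfidf.npz")
X_test = smat_util.load_matrix("./data/test.tfidf.npz")
print(X_train.shape, X_test.shape)  # (num_train_docs, vocab_size), (num_test_docs, vocab_size)
```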