This repository contains the source code for the paper: Automated ICD Coding using Extreme Multi-label Long Text Transformer-based Models.
Restore MIMIC-III v1.4 data into a Postgres database.
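If it helps to verify the restore, the note table can be queried directly from Python. This is only a sketch; it assumes the standard mimic-code PostgreSQL build (tables in the `mimiciii` schema) and uses placeholder connection details:

```python
# Quick sanity check of the restored database; connection parameters and the
# "mimiciii" schema are assumptions based on the standard mimic-code build.
import psycopg2

conn = psycopg2.connect(dbname="mimic", user="postgres", password="postgres", host="localhost")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT hadm_id, text
        FROM mimiciii.noteevents
        WHERE category = 'Discharge summary'
        LIMIT 5
        """
    )
    for hadm_id, text in cur.fetchall():
        print(hadm_id, len(text))
conn.close()
```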
Prepare the MIMIC-III text for language model pretraining:

```shell
python3 lm/prepare_mimic_data_for_pretrain_lm.py
```

Pretrain the language model with masked language modelling:

```shell
python3 lm/run_mlm.py \
    --model_name_or_path=google/bigbird-roberta-base \
    --train_file=[data_dir]/mimic3_uncased_preprocessed_total.txt \
    --output_dir=[model_dir] \
    --max_seq_length=4096 \
    --line_by_line=True \
    --do_train=True \
    --do_eval=True \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=4 \
    --gradient_accumulation_steps=32 \
    --overwrite_output_dir=True \
    --max_steps=500000 \
    --save_steps=1000 \
    --warmup_steps=10000 \
    --learning_rate=2e-05 \
    --save_total_limit=20
```
The pretrained ClinicalBIGBIRD can be downloaded here.
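Once pretrained (or downloaded), the checkpoint can be loaded like any Hugging Face model. A minimal sketch, assuming `[model_dir]` points at the output directory from the pretraining run above:

```python
from transformers import AutoModel, AutoTokenizer

# Load the ClinicalBIGBIRD checkpoint; [model_dir] is the pretraining output directory above.
tokenizer = AutoTokenizer.from_pretrained("[model_dir]")
model = AutoModel.from_pretrained("[model_dir]")

# BigBird accepts sequences up to 4096 tokens, matching --max_seq_length above.
inputs = tokenizer("chief complaint: shortness of breath",
                   return_tensors="pt", truncation=True, max_length=4096)
hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, seq_len, hidden_size)
```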
To prepare the training data for the baseline model, use mimic3_data_preparer.py from HiLAT with the following command-line flags:
| Name | Value |
|---|---|
| pre_process_level | raw |
| segment_text | False |
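For context, `segment_text=False` keeps each discharge summary as one long string rather than splitting it into fixed-length chunks. The rough idea of chunked segmentation, shown only as an illustration and not HiLAT's exact logic, looks like this:

```python
def segment_text(note: str, chunk_size: int = 512):
    """Split a note into fixed-size word chunks. With segment_text=False this step is
    skipped and the raw note is kept whole (illustration only, not HiLAT's exact logic)."""
    words = note.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

note = "admission date: [**2151-7-16**] discharge date: [**2151-8-4**] chief complaint: chest pain"
print(segment_text(note, chunk_size=6))
```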
- Download the pretrained Transformer RoBERTa-PM-M3
- Training: `python3 xr-lat/run_coding.py config-baseline.json`
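A common building block in this family of ICD coding models is label-wise attention over the encoder's token representations. The snippet below is only a conceptual sketch of that general mechanism (CAML/LAAT style); it is not the architecture implemented in xr-lat/run_coding.py:

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Per-label attention over token states (conceptual sketch, not the repo's model)."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden_size))
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden)
        scores = torch.einsum("bsh,lh->bls", token_states, self.label_queries)
        attn = torch.softmax(scores, dim=-1)          # (batch, labels, seq_len)
        label_repr = torch.bmm(attn, token_states)    # (batch, labels, hidden)
        # One logit per label: dot each label representation with its own classifier row.
        logits = (label_repr * self.classifier.weight).sum(-1) + self.classifier.bias
        return logits

# Toy usage with random encoder outputs.
layer = LabelWiseAttention(hidden_size=768, num_labels=50)
print(layer(torch.randn(2, 128, 768)).shape)  # (2, 50)
```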
- Data preparation for XR-LAT is the same as for the baseline model.
- Generate the hierarchical label tree (see the sketch after this list): `python3 data_processing/generate_hierarchical_label_tree.py`
- Training: `python3 xr-lat/run_coding.py config-xrlat.json`
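The hierarchical label tree arranges the ICD codes into nested groups so the classifier can work through the label space level by level. A minimal sketch of one common way to build such a tree, recursive 2-means clustering over label feature vectors; the actual scheme in data_processing/generate_hierarchical_label_tree.py may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_label_tree(label_features: np.ndarray, labels: list, max_leaf: int = 8, seed: int = 0):
    """Recursively split labels into two clusters until groups are small
    (a sketch only; the repo's script may use a different clustering scheme)."""
    if len(labels) <= max_leaf:
        return labels  # leaf node: a small group of ICD codes
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(label_features)
    children = []
    for c in (0, 1):
        idx = np.where(km.labels_ == c)[0]
        children.append(build_label_tree(label_features[idx], [labels[i] for i in idx], max_leaf, seed))
    return children

# Toy example: random "label embeddings" for 20 hypothetical ICD codes.
rng = np.random.default_rng(0)
tree = build_label_tree(rng.normal(size=(20, 16)), [f"code_{i}" for i in range(20)])
print(tree)
```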
- Prepare the XR-Transformer training data: `python3 data_processing/prepare_xrtransformer_data.py`
- Install libpecos using the "install and develop locally" option.
- Replace pecos/xmc/xtransformer/network.py in the installed libpecos source with xr-transformer/network.py from this repository.
- Create TF-IDF features (a sketch for inspecting the resulting matrices follows this list):

```shell
python3 -m pecos.utils.featurization.text.preprocess build --text-pos 0 --input-text-path ../data/mimic3/full-xr/train.txt --output-model-folder ./tfidf-model
python3 -m pecos.utils.featurization.text.preprocess run --text-pos 0 --input-preprocessor-folder ./tfidf-model --input-text-path ../data/mimic3/full-xr/train.txt --output-inst-path ./data/train.tfidf.npz
python3 -m pecos.utils.featurization.text.preprocess run --text-pos 0 --input-preprocessor-folder ./tfidf-model --input-text-path ../data/mimic3/full-xr/test.txt --output-inst-path ./data/test.tfidf.npz
```
- Training: `xr-transformer/train_and_predict.sh`
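As a quick sanity check on the TF-IDF step above, the generated .npz files can be loaded back as sparse matrices. A minimal sketch, assuming libpecos is installed as described and the output paths were left unchanged:

```python
# Inspect the TF-IDF feature matrices produced by the pecos preprocess commands above;
# adjust the paths if you changed the output locations.
from pecos.utils import smat_util

X_train = smat_util.load_matrix("./data/train.tfidf.npz")
X_test = smat_util.load_matrix("./data/test.tfidf.npz")
print(X_train.shape, X_test.shape)  # (num_train_docs, vocab_size), (num_test_docs, vocab_size)
```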