```
Code
├───data_processing
│       al_data_selection.ipynb
│       combine_test_data.ipynb
│       inspect_data.ipynb
│       preprocessing.ipynb
├───evaluation
│       evaluation_error_export.ipynb
└───fine-tuning
        predict.py
        train_model.py
Thesis report
└───Murat_Ertas_MA_Thesis.pdf
.gitignore
LICENSE
README.md
requirements.txt
```
- preprocessing.ipynb : The raw unstructured data consists of Electronic Health Records (EHR) in CSV format, containing various kinds of information about each record. The classifiers used in this project have been trained and fine-tuned on anonymized sentences. Therefore, the unstructured data had to be stripped of unnecessary information, the text had to be segmented into sentences, and person and location entities had to be anonymized. This notebook includes the steps and the code for this process.
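A minimal sketch of the segmentation and anonymization steps, assuming a Dutch spaCy model; the file name, column name, placeholder tags, and entity labels are illustrative assumptions, not the notebook's exact pipeline:

```python
import pandas as pd
import spacy

# Assumes a Dutch pipeline is installed: python -m spacy download nl_core_news_sm
nlp = spacy.load("nl_core_news_sm")

def mask(sent):
    """Replace person and location entities in one sentence with placeholder tags."""
    text = sent.text
    for ent in reversed(list(sent.ents)):  # reverse so earlier char offsets stay valid
        if ent.label_ in ("PERSON", "PER"):
            tag = "[PERSOON]"
        elif ent.label_ in ("GPE", "LOC"):
            tag = "[LOCATIE]"
        else:
            continue
        start, end = ent.start_char - sent.start_char, ent.end_char - sent.start_char
        text = text[:start] + tag + text[end:]
    return text

notes = pd.read_csv("ehr_notes.csv")  # hypothetical file with a 'note_text' column
sentences = [
    mask(sent)
    for doc in nlp.pipe(notes["note_text"].astype(str))
    for sent in doc.sents
]
```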
- inspect_data.ipynb : Generic data analysis code used in this research; it calculates category and pattern distributions and supports picking and comparing individual instances.
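A minimal sketch of these analyses, assuming a pickled dataframe with hypothetical 'text' and 'label' columns:

```python
import pandas as pd

df = pd.read_pickle("labeled_sentences.pkl")  # hypothetical file name

# Category distribution: counts and relative frequencies per label.
dist = df["label"].value_counts()
print(pd.DataFrame({"count": dist, "share": (dist / len(df)).round(3)}))

# Pattern distribution: how often an illustrative regex occurs per category.
df["has_pattern"] = df["text"].str.contains(r"\bpijn\b", case=False, regex=True)
print(df.groupby("label")["has_pattern"].mean())

# Pick individual instances for side-by-side comparison.
print(df[df["has_pattern"]].sample(3, random_state=0)[["text", "label"]])
```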
- al_data_selection.ipynb : This notebook includes the code used for data querying within the Active Learning (AL) scheme; a combined sketch of both strategies follows this list.
  - Uncertainty-based selection: code that samples instances whose prediction confidence falls within a defined range.
  - Cosine-similarity-based selection: code that represents sentences as averaged word embeddings and applies DBSCAN clustering, aiming to obtain the most informative representative instances.
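A minimal sketch of both strategies; the file and column names, confidence bounds, and DBSCAN parameters are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

pool = pd.read_pickle("unlabeled_pool.pkl")  # hypothetical 'text', 'confidence', 'embedding' columns

# Uncertainty-based selection: keep instances inside a defined confidence range.
low, high = 0.4, 0.6  # illustrative bounds, not the thesis's values
uncertain = pool[pool["confidence"].between(low, high)]

# Cosine-similarity-based selection: cluster averaged word embeddings with
# DBSCAN (cosine metric), then take one representative per cluster.
X = np.vstack(pool["embedding"].to_numpy())  # each row: mean of the sentence's word vectors
pool["cluster"] = DBSCAN(eps=0.15, min_samples=5, metric="cosine").fit_predict(X)
representatives = pool[pool["cluster"] != -1].groupby("cluster").head(1)  # -1 marks noise
```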
- combine_test_data.ipynb : This project combined three independent test sets to create more balanced test data. This notebook demonstrates the process of combining these datasets.
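A minimal sketch of the combination step, with hypothetical file names (the actual test sets are not public):

```python
import pandas as pd

parts = [pd.read_pickle(p) for p in ("test_a.pkl", "test_b.pkl", "test_c.pkl")]

# Concatenate the three sets, drop duplicate sentences, and reset the index.
combined = pd.concat(parts, ignore_index=True).drop_duplicates(subset="text")
combined.to_pickle("combined_test_data.pkl")
```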
The scripts in the fine-tuning folder are taken directly from the previous A-Proof research repository. train_model.py is used to fine-tune a pre-trained model; predict.py is used to create predictions and output the updated evaluation data. Both use the simpletransformers Python library.
- train_model.py : This script is used to fine-tune a pre-trained model. It needs access to a model folder, and the training hyperparameters can be defined within the script. It writes the new fine-tuned model to the defined output path.
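A minimal sketch of the simpletransformers fine-tuning flow; the paths, hyperparameter values, multi-label setup, and number of categories are assumptions for illustration:

```python
import pandas as pd
from simpletransformers.classification import MultiLabelClassificationModel

train_df = pd.read_pickle("train_data.pkl")  # expects 'text' and 'labels' columns

model_args = {
    "num_train_epochs": 3,  # illustrative hyperparameters
    "train_batch_size": 16,
    "learning_rate": 4e-5,
    "output_dir": "models/fine_tuned",  # where the fine-tuned model is written
    "overwrite_output_dir": True,
}

model = MultiLabelClassificationModel(
    "roberta",
    "models/pretrained",  # hypothetical path to the pre-trained model folder
    num_labels=9,         # assumed number of categories
    args=model_args,
    use_cuda=False,       # set True when a GPU is available
)
model.train_model(train_df)
```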
- predict.py : This script is used to create predictions from a model for given evaluation/test data. The data the model predicts over must be a dataframe stored as a pickle, with the text instances in a column named 'text'. The script takes the pickled dataframe as input, adds new columns for the classification and confidence values, and writes the dataframe back out as a pickle over the original input.
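A minimal sketch of that flow; the model path and the names of the added columns are assumptions:

```python
import pandas as pd
from simpletransformers.classification import MultiLabelClassificationModel

df = pd.read_pickle("eval_data.pkl")  # must contain a 'text' column

model = MultiLabelClassificationModel("roberta", "models/fine_tuned", use_cuda=False)
predictions, raw_outputs = model.predict(df["text"].tolist())

df["predictions"] = predictions       # per-category 0/1 decisions
df["confidence"] = list(raw_outputs)  # raw probabilities used as confidence values
df.to_pickle("eval_data.pkl")         # overwrite the original input, as the script does
```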
- evaluation_error_export.ipynb : This notebook includes the code for evaluating predictions. It has three parts (a combined sketch follows this list): Classification Report, Confusion Matrices, and Error Export.
  - Classification Report: calculates evaluation metrics such as precision and recall and creates a classification report.
  - Confusion Matrices: creates confusion matrices for the given evaluation.
  - Error Export: exports the false negatives and false positives of a given evaluation into separate CSV files per category.
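A minimal sketch of the three parts using scikit-learn, assuming single-label 'label' and 'prediction' columns for simplicity:

```python
from pathlib import Path

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_pickle("eval_data.pkl")  # hypothetical file and column names
y_true, y_pred = df["label"], df["prediction"]

# Part 1: classification report with precision, recall, and F1 per category.
print(classification_report(y_true, y_pred))

# Part 2: confusion matrix for the evaluation.
print(confusion_matrix(y_true, y_pred))

# Part 3: export false negatives and false positives per category as CSV.
Path("errors").mkdir(exist_ok=True)
for category in sorted(y_true.unique()):
    df[(y_true == category) & (y_pred != category)].to_csv(f"errors/{category}_false_negatives.csv", index=False)
    df[(y_true != category) & (y_pred == category)].to_csv(f"errors/{category}_false_positives.csv", index=False)
```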
The full thesis can be found in the Thesis report folder.
The data used in this thesis consists of Electronic Health Records (EHR) from Amsterdam University Medical Centers (AUMC) and VU Medical Center (VUMC). Due to privacy and confidentiality agreements, these datasets are not publicly accessible.