- Stephanie Wu
- Albert Halim
- Rongze Liu
- Ziyuan Zhao
The Heart Disease Predictor project aims to build a reliable machine learning model that predicts the presence of heart disease based on patient health measurements. It involves data wrangling, exploratory data analysis (EDA), and classification techniques to find meaningful patterns and build an accurate model.
The dataset is sourced from the UCI Machine Learning Repository and contains 303 patient records with 13 attributes related to heart health. Our model’s goal is to predict heart disease presence to assist clinicians in assessing risk.
- Data Wrangling: Preprocess the raw dataset for analysis.
- Exploratory Data Analysis (EDA): Examine key relationships among features.
- Model Development: Train and evaluate a classification model to predict heart disease.
- Evaluation: Assess model performance using accuracy, confusion matrices, and other metrics.
Our current classifier achieves ~87% accuracy. Further improvements, especially reducing false negatives, would make it more useful clinically.
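To make the point about false negatives concrete, the sketch below computes accuracy, recall, and the false-negative rate from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not this project's results, and the class and function names are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts; 'positive' means heart disease present."""
    tp: int  # disease present, predicted present
    fp: int  # disease absent, predicted present
    fn: int  # disease present, predicted absent (a missed case)
    tn: int  # disease absent, predicted absent


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all predictions that are correct."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def recall(c: ConfusionCounts) -> float:
    """Fraction of true disease cases that the model catches."""
    return c.tp / (c.tp + c.fn)


def false_negative_rate(c: ConfusionCounts) -> float:
    """Fraction of true disease cases that the model misses (1 - recall)."""
    return c.fn / (c.tp + c.fn)


# Hypothetical counts for illustration only.
counts = ConfusionCounts(tp=40, fp=5, fn=10, tn=45)
print(f"accuracy            = {accuracy(counts):.2f}")
print(f"recall              = {recall(counts):.2f}")
print(f"false-negative rate = {false_negative_rate(counts):.2f}")
```

A model can score well on accuracy while still missing a meaningful share of true cases, which is why recall is worth tracking alongside accuracy in a screening setting.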
The dataset includes various features (e.g., age, sex, chest pain type, resting blood pressure, cholesterol, max heart rate) that have been used to study risk factors related to heart disease.
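As a rough illustration of how these features appear in the raw file, the sketch below parses rows in the comma-separated layout of processed.cleveland.data using only the standard library. The column names follow the UCI documentation for this dataset, and the treatment of `?` as a missing value reflects how that file marks gaps. The `parse_cleveland` function and the sample rows are illustrative and are not the project's own loading code.

```python
import csv
import io

# Column names as listed in the UCI documentation (an assumption about the
# order used here; the raw file itself has no header row).
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def parse_cleveland(lines):
    """Parse header-less CSV rows into dicts, mapping '?' to None."""
    records = []
    for raw in csv.reader(lines):
        if not raw:
            continue  # skip blank lines
        if len(raw) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(raw)}")
        records.append({
            name: (None if value.strip() == "?" else float(value))
            for name, value in zip(COLUMNS, raw)
        })
    return records


# Illustrative rows in the file's layout; the last one has a missing 'thal'.
SAMPLE = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "53.0,1.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
)

rows = parse_cleveland(SAMPLE)
missing = sum(1 for r in rows for v in r.values() if v is None)
print(f"parsed {len(rows)} rows, {missing} missing value(s)")
```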
A summary of the findings and model development can be found in the final report.
A complete list of dependencies is in the environment.yml file.
You can set up and run this project in two main ways:
- Local Environment (using Conda)
- Using Docker (with or without Docker Compose)
Choose the approach that best fits your environment.
- Install Conda to manage dependencies.
- Clone the Repository
git clone https://github.com/UBC-MDS/heart_disease_predictor_py.git
cd heart_disease_predictor_py
- Create and Activate the Environment
conda env create -f environment.yml
conda activate heart_disease_predictor
- Run the Analysis (Make Targets)
- To start fresh (remove previously generated files):
make clean
- To run the entire pipeline and produce the final outputs:
make all
This will download data, preprocess it, run EDA, train and evaluate models, and render the report.
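The preprocessing stage of this pipeline splits the data before any modelling, and the manual steps in this README pass `--seed=522` to make that split repeatable. The sketch below shows one way a seeded, stratified train/test split can work, using only the standard library. It is an assumption-laden illustration: the `stratified_split` function, the 20% test fraction, and the label counts are hypothetical, and the project's actual script may split differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=522):
    """Return (train_indices, test_indices) with similar class proportions.

    Each class is shuffled separately with a seeded generator, so the same
    seed always yields the same split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 60 patients without disease (0), 40 with disease (1).
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
print(f"train: {len(train_idx)} rows, test: {len(test_idx)} rows")
print(f"positives in test set: {sum(labels[i] for i in test_idx)}")
```

Stratifying keeps the share of positive cases roughly equal in both sets, which matters for a dataset of only a few hundred records.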
If you prefer a containerized environment, use our pre-built Docker image that includes all dependencies.
- Docker installed on your system.
- Run the Docker Container
Go to the root of this project in the terminal and then run:
docker compose up
- Run the Analysis (Make Targets)
- To start fresh (remove previously generated files):
make clean
- To run the entire pipeline and produce the final outputs:
make all
If you prefer not to use make, you can manually run each step after setting up your environment (via Conda or Docker):
- Download the Data
python scripts/download_data.py \
    --url="https://archive.ics.uci.edu/static/public/45/heart+disease.zip" \
    --path="data/raw"
- Split and Preprocess the Data
python scripts/split_n_preprocess.py \
    --input-path=data/raw/processed.cleveland.data \
    --data-dir=data/processed \
    --preprocessor-dir=results/models \
    --seed=522
- Perform EDA
python scripts/script_eda.py \
    --input_data_path=data/processed/heart_disease_train.csv \
    --output_prefix=results/
- Fit the Predictive Models
python scripts/fit_heart_disease_predictor.py \
    --train-set=data/processed/heart_disease_train.csv \
    --preprocessor=results/models/heart_disease_preprocessor.pickle \
    --pipeline-to=results/models \
    --table-to=results/tables \
    --seed=522
- Evaluate the Models and Generate Figures
python scripts/evaluate_heart_disease_predictor.py \
    --test-set=data/processed/heart_disease_test.csv \
    --pipeline-svc-from=results/models/heart_disease_svc_pipeline.pickle \
    --pipeline-lr-from=results/models/heart_disease_lr_pipeline.pickle \
    --table-to=results/tables \
    --plot-to=results/figures \
    --seed=522
- Render the Report
quarto render report/heart_disease_predictor_report.qmd --to html
quarto render report/heart_disease_predictor_report.qmd --to pdf
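If you would rather script the manual steps than type them one by one, a small driver along the following lines could run them in order and stop at the first failure. This is a sketch, not part of the repository: the `run_pipeline` function and its `dry_run` flag are hypothetical, while the commands and arguments are copied from the steps above. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Commands copied from the manual steps, in execution order.
STEPS = [
    ["python", "scripts/download_data.py",
     "--url=https://archive.ics.uci.edu/static/public/45/heart+disease.zip",
     "--path=data/raw"],
    ["python", "scripts/split_n_preprocess.py",
     "--input-path=data/raw/processed.cleveland.data",
     "--data-dir=data/processed",
     "--preprocessor-dir=results/models",
     "--seed=522"],
    ["python", "scripts/script_eda.py",
     "--input_data_path=data/processed/heart_disease_train.csv",
     "--output_prefix=results/"],
    ["python", "scripts/fit_heart_disease_predictor.py",
     "--train-set=data/processed/heart_disease_train.csv",
     "--preprocessor=results/models/heart_disease_preprocessor.pickle",
     "--pipeline-to=results/models",
     "--table-to=results/tables",
     "--seed=522"],
    ["python", "scripts/evaluate_heart_disease_predictor.py",
     "--test-set=data/processed/heart_disease_test.csv",
     "--pipeline-svc-from=results/models/heart_disease_svc_pipeline.pickle",
     "--pipeline-lr-from=results/models/heart_disease_lr_pipeline.pickle",
     "--table-to=results/tables",
     "--plot-to=results/figures",
     "--seed=522"],
    ["quarto", "render", "report/heart_disease_predictor_report.qmd",
     "--to", "html"],
    ["quarto", "render", "report/heart_disease_predictor_report.qmd",
     "--to", "pdf"],
]


def run_pipeline(steps=STEPS, dry_run=True):
    """Run each step in order; with dry_run=True, only print the commands.

    subprocess.run(..., check=True) raises CalledProcessError on a non-zero
    exit code, so later steps never run on top of a failed one.
    """
    for argv in steps:
        if dry_run:
            print("would run:", shlex.join(argv))
        else:
            subprocess.run(argv, check=True)
    return steps


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Call `run_pipeline(dry_run=False)` from the project root to execute the steps for real. Unlike make, this sketch does not skip steps whose outputs are already up to date.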
After ensuring that you are in the project root directory, you can run the tests in the terminal with the following command:
pytest
This executes all the test scripts in the tests/ directory. If you are using Docker, run the command inside the container.
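For readers new to pytest: it collects files named `test_*.py` and runs every function whose name starts with `test_`, treating a failed `assert` as a test failure. The sketch below shows the general shape of such a test. The `binarize_target` helper and the test names are hypothetical and do not come from this repository's tests/ directory; the 0 to 4 coding of the target follows the UCI dataset documentation, where 0 means no disease.

```python
def binarize_target(num):
    """Hypothetical helper: map the 0-4 diagnosis code to absent (0) / present (1)."""
    if num not in range(5):
        raise ValueError(f"diagnosis code must be 0-4, got {num!r}")
    return int(num > 0)


def test_zero_means_disease_absent():
    assert binarize_target(0) == 0


def test_any_positive_code_means_disease_present():
    assert all(binarize_target(code) == 1 for code in (1, 2, 3, 4))


def test_out_of_range_code_is_rejected():
    # Written without pytest.raises so the file has no imports at all.
    try:
        binarize_target(7)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an out-of-range code")
```

Saved as a file such as tests/test_example.py (a hypothetical name), these functions would be picked up by the `pytest` command with no extra configuration.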
- Add/Update Dependencies
Edit environment.yml and then regenerate the conda lock file:
conda-lock install --name heart_disease_predictor --file environment.yml
- Rebuild the Docker Image (if using Docker)
docker build -t achalim/heart_disease_predictor_py:latest .
docker push achalim/heart_disease_predictor_py:latest
- To clean generated files (figures, models, tables):
make clean
- To deactivate the conda environment:
conda deactivate
- To stop and remove Docker containers, press Ctrl + C in the terminal and then run:
docker compose down
All code in the Heart Disease Predictor project is licensed under the MIT License. The project report is licensed under the CC0 1.0 Universal License. If you use or re-mix any part of this project, please provide appropriate attribution.
- Dua, D., & Graff, C. (2017). UCI Machine Learning Repository. University of California, Irvine. https://archive.ics.uci.edu/ml
- Cleveland Clinic Foundation. (1988). Heart disease data set. In Proceedings of Machine Learning and Medical Applications.
- Attia, P. (2023, February 15). Peter on the four horsemen of chronic disease. PeterAttiaMD.com. https://peterattiamd.com/peter-on-the-four-horsemen-of-chronic-disease/
- Bui, T. (2024, October 15). Cardiovascular disease is rising again after years of improvement. Stat News. https://www.statnews.com/2024/10/15/cardiovascular-disease-rising-experts-on-causes/
- Centers for Disease Control and Prevention (CDC). (2022). Leading causes of death. National Center for Health Statistics. https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
- Detrano, R., Jánosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1988). Heart Disease UCI dataset. UC Irvine Machine Learning Repository. https://archive.ics.uci.edu/dataset/45/heart+disease
- Carlén, A., Gustafsson, M., Åström Aneq, M., & Nylander, E. (2019). Exercise-induced ST depression in an asymptomatic population without coronary artery disease. Scandinavian Cardiovascular Journal, 53(4), 206–212. https://doi.org/10.1080/14017431.2019.1626021
- Fuchs, F. D., & Whelton, P. K. (2020). High Blood Pressure and Cardiovascular Disease. Hypertension, 75(2), 285–292. https://doi.org/10.1161/HYPERTENSIONAHA.119.14240
- Regitz-Zagrosek, V., & Gebhard, C. (2023). Gender medicine: Effects of sex and gender on cardiovascular disease manifestation and outcomes. Nature Reviews Cardiology, 20(4), 236–247. https://doi.org/10.1038/s41569-022-00797-4