Project Title : Dyslipidemia Prediction using Machine Learning and Deep Learning
Overview:
Dyslipidemia is a significant health condition that can lead to severe cardiovascular diseases if not detected and managed properly. The goal of this project is to build and evaluate models that can accurately predict dyslipidemia using both machine learning (ML) and deep learning (DL) techniques. Given the potential risks of false negatives in medical diagnosis, the key focus is on maximizing recall, ensuring that the model identifies as many patients with dyslipidemia as possible, while minimizing the number of false negatives.
Project Structure:
The project is divided into the following sections:
Data Exploration and Preprocessing
Machine Learning Modeling
Deep Learning Modeling
Model Evaluation and Results
- Data Exploration and Preprocessing: We began by analyzing a dataset containing medical and lifestyle features related to dyslipidemia, with 131 columns and 531 records. The dataset included information such as age, gender, BMI, waist circumference, blood pressure history, dietary habits, and more. Several features were numeric, while others were categorical, requiring careful preprocessing.
Key preprocessing steps:
Binning: Continuous variables such as BMI, percentage of fat intake, and RFAC (lipid-related) values were divided into categorical bins to simplify the modeling process. Categories like “Very Low,” “Low,” “Normal,” “High,” and “Very High” were used to standardize values.
Handling Imbalanced Data: The target variable, dyslipidemia, was initially imbalanced, with fewer cases of the condition compared to non-cases. We addressed this imbalance using SMOTE (Synthetic Minority Over-sampling Technique), which oversamples the minority class (dyslipidemia cases).
Feature Selection: Medical and lifestyle features deemed most relevant for prediction were selected for use in both machine learning and deep learning models.
- Machine Learning Models:
Several machine learning models were applied to predict dyslipidemia, with a special focus on maximizing recall to reduce false negatives. XGBoost and CatBoost were among the key models tested.
Key steps:
Grid Search: We performed hyperparameter tuning using GridSearchCV to optimize each model's performance.
Evaluation Metrics:
Recall: As the critical metric, both training recall and validation recall were used to measure how well the model identified dyslipidemia cases. Accuracy, F1-score, Precision: These metrics were also tracked, but the model selection was based primarily on recall.
Model Results:
XGBoost: Provided the best recall, minimizing false negatives and achieving optimal results for identifying dyslipidemia cases.
CatBoost: Offered competitive performance but with slightly lower recall in validation.
- Deep Learning Models: We also implemented a deep learning approach using a Convolutional Neural Network (CNN) to handle the complexity of the data and extract patterns.
Key steps:
Model Architecture: The CNN model consisted of multiple convolutional layers, followed by fully connected layers and a final sigmoid activation for binary classification (predicting dyslipidemia presence).
Early Stopping: To avoid overfitting, we used early stopping with patience to monitor validation loss.
Dropout Layers: Dropout was incorporated to further prevent overfitting, allowing the model to generalize better to unseen data.
Evaluation:
Recall and Precision: As with the machine learning models, recall was the most important metric for evaluation. The model was designed to minimize false negatives, ensuring that most dyslipidemia cases were identified.
Validation and Confusion Matrix: The confusion matrix and classification report were used to analyze the performance of the model on validation data, confirming that it was able to maintain a good balance between recall and precision.
- Model Evaluation and Results: Across all models, the recall metric was the primary focus, particularly on the validation data, to ensure the model would perform well in real-world settings and avoid missing dyslipidemia cases. Both machine learning and deep learning models performed well, with XGBoost offering the best balance between accuracy and recall.
Overfitting Prevention:
Cross-Validation: All models were evaluated using cross-validation to prevent overfitting.
Regularization: Models were regularized using appropriate techniques to control for complexity and ensure robustness.
Dropout Layers in DL: This helped the deep learning model generalize better.
Conclusion: This project effectively demonstrates how machine learning and deep learning techniques can be applied to predict dyslipidemia, with a specific focus on maximizing recall and minimizing false negatives. The combination of careful data preprocessing, feature engineering, and model selection ensures that the models not only perform well on training data but also generalize to new, unseen data.
This project shows promise in providing a scalable solution for early detection of dyslipidemia, potentially improving patient outcomes and helping prevent severe cardiovascular diseases.