Classification of Cancerous Cells and Cancerous Nature of Tumors of Wisconsin Patients with Breast Cancer

This project runs naive Bayes, logistic regression, and support vector machine classifiers from scikit-learn on benign/malignant cancer cell data to predict whether a patient's tumor is cancerous. Data accessed from the Breast Cancer Wisconsin (Diagnostic) Data Set: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic).

Machine Learning final project.

Pride Y.

Training data

We will train three different classifier algorithms (Naive Bayes, Logistic Regression, and Support Vector Machines) on the data from this data set.
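As a minimal sketch of the setup, the snippet below loads the copy of the data set bundled with scikit-learn and holds out part of it for testing. The 50% split, the fixed seed, and the stratification are assumptions, not settings taken from classifier.py; a 50% split is simply consistent with the 285 test patients that the counts reported below add up to.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn.
data = load_breast_cancer()
X, y = data.data, data.target  # y: 0 = malignant, 1 = benign

# Assumption: an even train/test split with a fixed seed. The original
# program's split ratio and seed are not stated in this README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

print(X.shape)                   # (569, 30)
print(list(data.target_names))   # ['malignant', 'benign']
print(len(X_train), len(X_test))
```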

Data correlation

We find that some features in this data set are highly correlated, such as worst perimeter and worst radius, while others, such as mean symmetry and mean smoothness, are much less correlated.
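The correlations mentioned above can be checked with a pairwise Pearson correlation matrix. This is a sketch using pandas on scikit-learn's copy of the data; the variable names are illustrative and not taken from classifier.py.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Pairwise Pearson correlation between all 30 features.
corr = df.corr()

# Look up the two feature pairs mentioned in the text.
high = corr.loc["worst perimeter", "worst radius"]
low = corr.loc["mean symmetry", "mean smoothness"]
print(f"worst perimeter vs worst radius:  {high:.2f}")
print(f"mean symmetry vs mean smoothness: {low:.2f}")
```

The full matrix in `corr` can be passed to a heatmap function such as seaborn's `heatmap` to see all feature pairs at once.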

Visualization

We can also visualize how separable the two classes are in our training data.

Classifier

This program performs a rudimentary classification of cancer cell data as benign or malignant, using scikit-learn's existing breast cancer data set and Naive Bayes, Logistic Regression, and Support Vector Machine classifiers.

We will train each classifier, then compute a confusion matrix of its predictions to measure how accurately all three classifiers label the test data.
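The train-then-evaluate loop described above might look like the following sketch. Several choices here are assumptions rather than the original program's settings: `GaussianNB` as the naive Bayes variant, a linear SVM kernel, feature standardization for logistic regression and the SVM, and the split ratio and seed. The exact counts it prints will therefore differ from the results reported in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # y: 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0  # assumed split, not from classifier.py
)

# Assumed model settings; scaling keeps the linear models well conditioned.
classifiers = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(
        StandardScaler(), SVC(kernel="linear")
    ),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Rows are true classes, columns are predicted classes (0, then 1).
    cm = confusion_matrix(y_test, y_pred)
    results[name] = (cm, accuracy_score(y_test, y_pred))
    print(name, cm.tolist(), f"{results[name][1]:.2%}")
```

The diagonal of each matrix holds the correctly classified patients, and the off-diagonal entries hold the two kinds of misclassification.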

Naive Bayes

Naive Bayes correctly identified 89 patients as having malignant tumors and 180 patients as having benign tumors, for a classification accuracy of 94.39%.

7 patients were incorrectly identified as having malignant tumors and 9 patients were incorrectly identified as having benign tumors.

Logistic regression

Logistic regression correctly identified 90 patients as having malignant tumors and 180 patients as having benign tumors, for a classification accuracy of 94.74%.

7 patients were incorrectly identified as having malignant tumors and 8 patients were incorrectly identified as having benign tumors.

Support vector machines

The support vector machine correctly identified 93 patients as having malignant tumors and 182 patients as having benign tumors, for a classification accuracy of 96.49%.

5 patients were incorrectly identified as having malignant tumors and 5 patients were incorrectly identified as having benign tumors.

Conclusion

We can conclude that support vector machines are the most accurate at predicting the cancerous nature of a patient's tumor in this data set. Logistic regression falls behind support vector machines; however, both are more accurate than naive Bayes according to the confusion matrices.
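The accuracy figures quoted for each classifier follow directly from the confusion-matrix counts reported above: accuracy is the number of correct predictions divided by the total number of test patients. The short check below reproduces that arithmetic; the dictionary layout and function name are illustrative.

```python
# Counts reported above, per classifier:
# (correct malignant, correct benign, wrongly malignant, wrongly benign)
counts = {
    "Naive Bayes": (89, 180, 7, 9),
    "Logistic Regression": (90, 180, 7, 8),
    "Support Vector Machine": (93, 182, 5, 5),
}


def accuracy_percent(correct_mal, correct_ben, wrong_mal, wrong_ben):
    """Share of correctly classified patients, as a percentage."""
    total = correct_mal + correct_ben + wrong_mal + wrong_ben
    return 100.0 * (correct_mal + correct_ben) / total


for name, c in counts.items():
    print(f"{name}: {sum(c)} patients, {accuracy_percent(*c):.2f}% accurate")
# Naive Bayes: 285 patients, 94.39% accurate
# Logistic Regression: 285 patients, 94.74% accurate
# Support Vector Machine: 285 patients, 96.49% accurate
```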

Scrambled data

To test the algorithms in a less controlled manner, I scrambled the data and reran the classifiers. Surprisingly, the support vector machine and logistic regression results stayed closer to our previous results than the naive Bayes results did.

Perhaps this is due to chance; however, across 10 trials, SVMs and logistic regression showed greater accuracy than naive Bayes.
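A repeated-trial comparison of this kind could be sketched as follows. How the original program scrambles the data is not described here, so this sketch makes a loud assumption: "scrambling" is interpreted as reshuffling the rows with a different random seed before each train/test split. The model settings (`GaussianNB`, linear kernel, standardization) are likewise assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)


def make_classifiers():
    """Fresh, untrained models for each trial (assumed settings)."""
    return {
        "nb": GaussianNB(),
        "lr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "svm": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    }


trial_scores = []
for seed in range(10):
    # Assumption: each trial reshuffles the rows with a new seed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, shuffle=True, random_state=seed
    )
    scores = {
        name: clf.fit(X_train, y_train).score(X_test, y_test)
        for name, clf in make_classifiers().items()
    }
    trial_scores.append(scores)

# Count the trials in which each model beat naive Bayes.
svm_wins = sum(s["svm"] > s["nb"] for s in trial_scores)
lr_wins = sum(s["lr"] > s["nb"] for s in trial_scores)
print(f"SVM beat naive Bayes in {svm_wins} of {len(trial_scores)} trials")
print(f"Logistic regression beat naive Bayes in {lr_wins} of {len(trial_scores)} trials")
```

Counting wins across seeds gives a rough sense of whether a difference is due to chance, although 10 trials is still a small sample.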

Naive Bayes

Logistic regression

Support vector machines

Limitations

No CTC cell count - metastasis

Unfortunately, this data set does not contain a feature for CTC (circulating tumor cell) count, which makes it impossible in this scenario to predict the nature of metastasis within a patient.

Additionally, an arbitrary sample number provided by the program would not be an effective or viable substitute.

SVM decision boundary generation

Unfortunately, this program does not visualize the hyperplane or the decision margins of the support vector machine, because our attempts to compute and plot them did not produce an accurate visualization.
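One possible way to obtain a plottable boundary is to restrict a linear SVM to two features, where the hyperplane w . x + b = 0 is a straight line and the margins are the parallel lines w . x + b = +1 and -1. The sketch below computes those lines without plotting them. The choice of features (mean radius and mean texture), the standardization, and the helper name `boundary_x2` are all hypothetical; none of this is taken from classifier.py.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)

# Hypothetical choice of two features so the boundary is a 2-D line.
cols = [names.index("mean radius"), names.index("mean texture")]
X2 = StandardScaler().fit_transform(data.data[:, cols])
y = data.target

svm = SVC(kernel="linear").fit(X2, y)
w = svm.coef_[0]       # normal vector of the separating hyperplane
b = svm.intercept_[0]  # offset of the hyperplane

# Distance between the two margin lines.
margin_width = 2.0 / np.linalg.norm(w)


def boundary_x2(x1, offset=0.0):
    """Return x2 such that w . [x1, x2] + b == offset."""
    return (offset - b - w[0] * x1) / w[1]


x1 = np.linspace(X2[:, 0].min(), X2[:, 0].max(), 50)
hyperplane = boundary_x2(x1)        # decision boundary
margin_pos = boundary_x2(x1, 1.0)   # margin on the positive side
margin_neg = boundary_x2(x1, -1.0)  # margin on the negative side
print(f"margin width: {margin_width:.3f}")
```

The arrays `x1`, `hyperplane`, `margin_pos`, and `margin_neg` can then be drawn over a scatter plot of `X2` with matplotlib. With all 30 features the hyperplane cannot be drawn directly, so a two-feature or otherwise reduced view is needed.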

Scrambling of untrained data

Because much of the data in this data set is not binary (1s and 0s), scrambling untrained data could lead to many inaccuracies when fitting it on any of the classifiers.

Dependencies

scikit-learn

pip install scikit-learn

matplotlib

pip install matplotlib

pandas

pip install pandas

numpy

pip install numpy

seaborn

pip install seaborn
