The Breast Cancer Wisconsin (Diagnostic) dataset comes from the UCI Machine Learning Repository. It is also available through the UW CS FTP server.
This dataset comprises features computed from digitized images of fine needle aspirates (FNA) of breast masses. These features describe the characteristics of cell nuclei present in the images. The dataset is discussed in the paper: K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.
- ID number: Unique identifier for each instance.
- Diagnosis: Class label (M = malignant, B = benign).
- Features: Derived from the cell nuclei in the images, including:
  - Radius (mean of distances from center to points on the perimeter)
  - Texture (standard deviation of gray-scale values)
  - Perimeter
  - Area
  - Smoothness (local variation in radius lengths)
  - Compactness (perimeter^2 / area - 1.0)
  - Concavity (severity of concave portions of the contour)
  - Concave Points (number of concave portions of the contour)
  - Symmetry
  - Fractal Dimension ("coastline approximation" - 1)
Each of the 10 measurements above is reported in three forms: the mean, the standard error, and the "worst" value (the mean of the three largest values). This gives 30 features in total.
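As a rough illustration of how the 30 feature columns arise, the sketch below builds a header from the 10 measurements and the 3 statistics. The column names (`radius_mean`, `radius_se`, and so on) and the `build_column_names` helper are hypothetical labels chosen for this example, and the ordering (all means, then all standard errors, then all worst values) is an assumption to check against your copy of the data.

```python
# The 10 measurements computed for each cell nucleus.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Each measurement is reported as a mean, a standard error, and a "worst" value.
STATISTICS = ["mean", "se", "worst"]


def build_column_names():
    """Return a header: ID, diagnosis, then 10 x 3 = 30 feature columns."""
    # Assumed ordering: all means first, then all standard errors, then all worst values.
    feature_columns = [
        f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES
    ]
    return ["id", "diagnosis"] + feature_columns


columns = build_column_names()
print(len(columns) - 2, "feature columns")
```

A list like this can be passed to `pandas.read_csv` through its `names` parameter when the data file has no header row.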
- Data Quality: No missing values.
- Class Distribution: 357 benign cases, 212 malignant cases (569 instances in total).
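The data quality and class distribution figures above can be checked with pandas. The following is a minimal sketch, assuming a DataFrame with a `diagnosis` column; the `summarize` helper and the three sample rows are made up for illustration and are not records from the dataset.

```python
import io

import pandas as pd


def summarize(df):
    """Count missing cells and instances per diagnosis label."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "class_counts": df["diagnosis"].value_counts().to_dict(),
    }


# Placeholder rows standing in for the real file (values are invented).
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean\n"
    "1,M,17.9\n"
    "2,B,12.4\n"
    "3,B,11.8\n"
)
sample = pd.read_csv(sample_csv)

summary = summarize(sample)
print(summary)
```

Run against the full dataset, `summarize` should report no missing values, with 357 instances labeled `B` and 212 labeled `M`, if the figures above hold for your copy.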
To run the notebook, you need the following dependencies:
- Python 3.x
- Jupyter Notebook
- pandas
- seaborn
- matplotlib
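Before opening the notebook, it can help to confirm that the packages are importable. This is a small sketch using only the standard library; the `find_missing` helper is hypothetical, and `notebook` is assumed to be the import name of the Jupyter Notebook package in your environment.

```python
import importlib.util
import sys

# Import names for the dependencies listed above ("notebook" is an assumption).
REQUIRED_PACKAGES = ["notebook", "pandas", "seaborn", "matplotlib"]


def find_missing(packages):
    """Return the names of packages that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_PACKAGES)
print("Python", sys.version.split()[0])
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any names reported as missing can then be installed with your package manager of choice.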