Breast cancer classification using scikit-learn and Keras


This post is devoted to breast cancer classification, implemented using machine learning techniques and neural networks.

Introduction to Breast Cancer

The goal of the project is medical data analysis using artificial intelligence methods such as machine learning and deep learning to classify cancers as malignant or benign. Breast cancer is the most common cancer among women, and it is also the leading cause of cancer death worldwide. The most effective way to reduce the number of deaths is early detection.
Every 19 seconds, cancer in women is diagnosed somewhere in the world, and every 74 seconds someone dies from breast cancer.

Machine learning allows precise and fast classification of breast cancer based on numerical data (in our case) or images, without invasive procedures such as a surgical biopsy.

Data used for the project

For the project, I used a breast cancer dataset from Wisconsin University. The dataset contains 569 samples with 30 features computed from digitized images. Each sample describes the parameters of one patient.

Feature information:

  1. ID
  2. diagnosis
  3. radius
  4. texture
  5. perimeter
  6. area
  7. smoothness
  8. compactness
  9. concavity
  10. concave points
  11. symmetry
  12. fractal dimension

Python packages

I work daily with Python 3.6+ using a few packages to simplify everyday tasks in data science.

Below are the most important ones.

  • scikit-learn is a library for machine learning algorithms
  • Keras is a library for deep learning algorithms
  • Pandas is used for data processing
  • Seaborn is used for data visualization

All requirements are in the Ermlab repository in the requirements.txt file.

Data processing

First of all, we need to import our data using the Pandas module.

Before doing anything like feature selection, feature extraction or classification, we start with basic data analysis. Let's look at the features of the data.
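A minimal loading-and-inspection sketch. The post reads the Wisconsin data from a CSV file; to keep the example self-contained, the same dataset is loaded here through scikit-learn's bundled copy (so the column layout differs slightly from the CSV):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# scikit-learn ships the same Wisconsin dataset the post uses,
# which keeps this sketch self-contained.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target  # in sklearn's encoding: 0 = malignant, 1 = benign

print(df.shape)       # (569, 31)
print(df.head())      # first five rows
print(df.describe())  # basic statistics per feature
```

`describe()` is a quick way to spot scale differences between features — relevant later, when we standardize the data before PCA.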

Now we need to drop unused columns such as id (not used for classification), Unnamed: 32 (containing only NaN values) and diagnosis (this is our label). The next step is to convert the label strings (M, B) to integers (0, 1) using map(), and to define our features and labels.
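A sketch of this cleanup step on a tiny frame that mirrors the CSV's layout (the column names follow the Wisconsin CSV; the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Miniature stand-in for the real CSV (same column layout, made-up rows).
df = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 11.42],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Keep the label aside as integers, then drop the unused columns.
y = df["diagnosis"].map({"M": 0, "B": 1})
X = df.drop(columns=["id", "diagnosis", "Unnamed: 32"])

print(list(X.columns))  # ['radius_mean']
print(y.tolist())       # [0, 0, 1]
```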

First plot: the number of malignant and benign cancers.

Picture 1. Count of Benign and Malignant cancer

We have 357 benign and 212 malignant samples of cancer.

Next, we split our data into a training set and a test set and normalize them.
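A sketch of the split-and-normalize step. The split ratio and random seed are assumptions (the post does not state them); the important detail is fitting the scaler on the training set only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% for testing (ratio assumed, not given in the post).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (398, 30) (171, 30)
```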

Dimensionality Reduction

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm.

Another very useful piece of information is the explained variance ratio of each principal component. It indicates the proportion of the dataset's variance that lies along each component.


Picture 2. Variance ratio of PCA without Std

As you can see in Picture 2, without data normalization only one variable is necessary. But to learn more, let's apply data standardization, as presented in Picture 3.

Picture 3. Variance ratio of PCA with Std

As you can see in Picture 3, with data standardization only six variables are necessary to reach 95% of the variance.
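The comparison behind Pictures 2 and 3 can be sketched as follows. Without standardization, large-scale features such as area dominate the variance, so the first component captures almost everything; after standardization the variance spreads over many components (the exact number needed for 95% depends on which columns are kept in preprocessing):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Raw data: the first component alone explains nearly all variance.
raw_ratio = PCA().fit(X).explained_variance_ratio_
print(f"first component, raw data: {raw_ratio[0]:.3f}")

# Standardized data: every feature contributes on the same scale,
# so more components are needed to reach 95% of the variance.
X_std = StandardScaler().fit_transform(X)
std_ratio = PCA().fit(X_std).explained_variance_ratio_
n_95 = np.argmax(np.cumsum(std_ratio) >= 0.95) + 1
print(f"components needed for 95% of variance: {n_95}")
```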


Classification

In this section, we compare the classification results of several popular classifiers and neural networks with different architectures.

Support Vector Machines (SVM)

SVM accuracy: 98.83%
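A minimal SVM sketch on the same dataset. The kernel and hyperparameters are assumptions (the post does not state them), so the score will not match the post's figure exactly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Default RBF kernel; SVMs are sensitive to feature scale,
# hence the standardization above.
svm = SVC(kernel="rbf", random_state=42)
svm.fit(X_train, y_train)
acc = accuracy_score(y_test, svm.predict(X_test))
print(f"SVM accuracy: {acc:.4f}")
```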

K-Nearest Neighbours (K-NN)

K-NN accuracy: 96.74%

Decision Tree

A simple visualization of the Decision Tree:

Picture 4. Visualization of Decision Tree



Decision Tree accuracy: 96.24%

Random Forest

Random Forest accuracy: 95.90%

Naive Bayes Classifier

Naive Bayes Classifier accuracy: 95.38%
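The four remaining classifiers can be compared in one loop. Hyperparameters here are scikit-learn defaults plus common choices (k = 5 neighbours, 100 trees), which are assumptions — the post does not list its settings — so the scores will differ slightly from those above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifiers = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": GaussianNB(),
}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print(f"{name}: {scores[name]:.4f}")
```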

Neural Networks

Accuracy for 1-, 3- and 5-layer neural networks: 97.07%, 96.73% and 97.66%
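A Keras sketch of one such network. The layer sizes, epoch count and 3-hidden-layer shape are assumptions — the post does not describe its exact architectures — but the overall pattern (dense ReLU layers feeding a sigmoid output trained with binary cross-entropy) is the standard setup for this binary task:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Three hidden layers (sizes assumed); sigmoid output for the binary label.
model = keras.Sequential([
    keras.Input(shape=(30,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"test accuracy: {acc:.4f}")
```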

As we can see in this comparison of classifiers, the best classification results come from the SVM algorithm and the worst from the Naive Bayes classifier.

Classification metrics

Our classification metrics are computed for the classifier with the best accuracy score (the SVM algorithm).

Confusion Matrix

The Confusion Matrix is a performance measurement for machine learning classification problems, where the output can be two or more classes.

It’s useful for measuring Precision, Recall, F1 score, accuracy and AUC.

TP (True Positive) – you predicted positive and it is true,

FP (False Positive) – you predicted positive and it is false,

FN (False Negative) – you predicted negative and it is false,

TN (True Negative) – you predicted negative and it is true.


Picture 5. Visualization of Confusion Matrix
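A sketch of computing the matrix for the SVM model (default hyperparameters assumed, as before). In scikit-learn's layout, rows are true labels and columns are predictions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svm = SVC(random_state=42).fit(X_train, y_train)
y_pred = svm.predict(X_test)

# For a binary problem the layout is:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
print(cm)
```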

Precision, Recall & F1 Score

Precision: out of all the samples we predicted as positive, how many are actually positive.

Recall: out of all the actually positive samples, how many we predicted correctly.


F1-score is the harmonic mean of the precision and recall.
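The three metrics on a toy example, small enough to check the arithmetic by hand (labels are illustrative only):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# 4 actual positives, 4 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# TP = 3, FP = 1, FN = 1, so:
# precision = TP / (TP + FP) = 3 / 4
# recall    = TP / (TP + FN) = 3 / 4
# F1 = 2 * precision * recall / (precision + recall)
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```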

ROC Curve

The ROC curve (Receiver Operating Characteristic) is a performance measurement for classification problems at various threshold settings. It tells how well the model is capable of distinguishing between classes.


Picture 6. ROC Curve
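A sketch of the computation behind the curve, again using the SVM model (hyperparameters assumed). The continuous decision score is what the threshold is swept over; the area under the curve (AUC) summarizes it in one number:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# decision_function gives a continuous score per sample
# that roc_curve sweeps a threshold over.
svm = SVC(random_state=42).fit(X_train, y_train)
scores = svm.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.4f}")
# plt.plot(fpr, tpr) would draw the curve itself.
```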

Correlation Map


Picture 7. Visualization of Correlation Map for all features
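The map shows pairwise Pearson correlations between all 30 features. A sketch of the underlying computation (the rendering itself, typically done with Seaborn's heatmap, is left as a comment to keep the example free of plotting dependencies):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# 30 x 30 matrix of pairwise Pearson correlations.
corr = df.corr()
print(corr.shape)  # (30, 30)

# Rendering, as in Picture 7 (requires seaborn/matplotlib):
# import seaborn as sns
# sns.heatmap(corr, cmap="coolwarm")
```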


