Breast cancer classification using scikit-learn and Keras


This post is devoted to breast cancer classification, implemented using machine learning techniques and neural networks.

Introduction to Breast Cancer

The goal of the project is to analyse medical data using artificial intelligence methods such as machine learning and deep learning in order to classify cancers as malignant or benign. Breast cancer is the most common cancer among women and the leading cause of cancer death worldwide. The most effective way to reduce the number of deaths is early detection.
Every 19 seconds, cancer in women is diagnosed somewhere in the world, and every 74 seconds someone dies from breast cancer.

Machine learning enables precise and fast classification of breast cancer based on numerical data (in our case) or images, without the patient having to leave home, e.g. for a surgical biopsy.

Data used for the project

For the project, I used the Breast Cancer Wisconsin (Diagnostic) dataset from the University of Wisconsin. The dataset contains 569 samples with 30 features computed from digitized images; each sample describes the parameters of one patient.

Feature information:

  1. ID
  2. diagnosis
  3. radius
  4. texture
  5. perimeter
  6. area
  7. smoothness
  8. compactness
  9. concavity
  10. concave points
  11. symmetry
  12. fractal dimension
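
As a side note, the same Wisconsin Diagnostic Breast Cancer data also ships with scikit-learn, which makes it easy to inspect even before downloading the CSV file used below; a minimal sketch:

# Sketch: the same WDBC dataset bundled with scikit-learn
# (this post loads it from a CSV file instead, see "Data processing" below)
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.data.shape)    # (569, 30): 569 samples, 30 features
print(cancer.target_names)  # ['malignant' 'benign']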

Python packages

I work daily with Python 3.6+ using a few packages to simplify everyday tasks in data science.

Below are the most important ones.

  • scikit-learn is a library for machine learning algorithms
  • Keras is a library for deep learning algorithms
  • Pandas is used for data processing
  • Seaborn is used for data visualization

All requirements are listed in the Ermlab repository in the requirements.txt file.
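
For readability, the code snippets below leave out their imports; a plausible set covering all of them (a sketch of how the script might start, not the exact original) looks like this:

# Imports assumed by the snippets in this post (a sketch, not the original script)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz

from sklearn import svm, tree
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_curve)

from keras.models import Sequential
from keras.layers import Dense, Activation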

Data processing

First of all, we need to import our data using the Pandas module.

# Load data
data = pd.read_csv('Data/data.csv', delimiter=',', header=0)

Before doing anything like feature selection, feature extraction or classification, we start with basic data analysis. Let's look at the first few rows of the data.

# The head method shows the first 5 rows of the data
print(data.head())
         id diagnosis     ...       fractal_dimension_worst  Unnamed: 32
0    842302         M     ...                       0.11890          NaN
1    842517         M     ...                       0.08902          NaN
2  84300903         M     ...                       0.08758          NaN
3  84348301         M     ...                       0.17300          NaN
4  84358402         M     ...                       0.07678          NaN

Now we need to drop the unused columns: id (not used for classification), Unnamed: 32 (only NaN values) and diagnosis (this is our label). The next step is to convert the string labels (M, B) to integers (0, 1) using map(), and to define our features and labels.

# Drop unused columns
columns = ['Unnamed: 32', 'id', 'diagnosis']

# Convert strings -> integers
d = {'M': 0, 'B': 1}

# Define features and labels
y = data['diagnosis'].map(d)
X = data.drop(columns, axis=1)

First plot: the number of malignant and benign samples.

# Plot number of M - malignant and B - benign cancer

ax = sns.countplot(y, label="Count", palette="muted")
B, M = y.value_counts()
plt.savefig('count.png')
print('Number of benign cancer: ', B)
print('Number of malignant cancer: ', M)

Picture 1. Count of Benign and Malignant cancer

We have 357 benign and 212 malignant samples of cancer.

Next, we split our data into a training and a test set and normalize them.

# Split dataset into training (80%) and test (20%) set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Normalize data (the test set is scaled with training-set statistics)
X_train_N = (X_train - X_train.mean()) / (X_train.max() - X_train.min())
X_test_N = (X_test - X_train.mean()) / (X_train.max() - X_train.min())

Dimensionality Reduction

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm.

Another very useful piece of information is the explained variance ratio of each principal component: it indicates the proportion of the dataset's variance that lies along that component.
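
The post does not show the PCA code itself; below is a minimal sketch of how the explained variance ratios behind Pictures 2 and 3 could be computed, and of the pca object reused in the classification pipelines later on (the choice of n_components=6 is an assumption based on Picture 3):

# Sketch: explained variance ratios behind Pictures 2 and 3
pca_raw = PCA().fit(X)                                   # without standardization (Picture 2)
print(pca_raw.explained_variance_ratio_)

pca_std = PCA().fit(StandardScaler().fit_transform(X))   # with standardization (Picture 3)
print(pca_std.explained_variance_ratio_)

# PCA object reused in the classification pipelines below;
# n_components=6 is an assumption based on Picture 3 (~95% of the variance)
pca = PCA(n_components=6)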


Picture 2. Variance ratio of PCA without Std

As you can see in Picture 2, without data standardization only one principal component is needed, because the variance is dominated by the features with the largest scale. To learn more, let's standardize the data, as presented in Picture 3.

Picture 3. Variance ratio of PCA with Std

As you can see in Picture 3, with data standardization six principal components are needed to reach 95% of the variance.

Classification

In this section, we compare the classification results of several popular classifiers and of neural networks with different architectures.

Support Vector Machines (SVM)

svc = svm.SVC(kernel='linear', C=1)

# Pipeline
model = Pipeline([
    ('reduce_dim', pca),
    ('svc', svc)
])

# Fit
model.fit(X_train_N, y_train)
svm_score = cross_val_score(model, X, y, cv=10, scoring='accuracy')

SVM accuracy = 98.83%

K-Nearest Neighbours (K-NN)

def KnearestNeighbors():
    """
    Compute cross-validated accuracy using the k-NN algorithm
    for several values of k.
    :return: list of mean k-NN scores
    """
    scores = []
    for i in range(1, 5):
        knn = KNeighborsClassifier(n_neighbors=i)
        knnp = Pipeline([
            ('reduce_dim', pca),
            ('knn', knn)
        ])
        k_score = cross_val_score(knnp, X, y, cv=10, scoring="accuracy")
        scores.append(k_score.mean())
    return scores

K-NN accuracy: 96.74%

Decision Tree

trees = tree.DecisionTreeClassifier()
treeclf = trees.fit(X_train_N, y_train)
treep = Pipeline([
    ('reduce_dim', pca),
    ('trees', trees)
    ])
score_trees = cross_val_score(treep, X, y, cv=10)

A simple visualization of the Decision Tree:

feature_names = X.columns.values

def plot_decision_tree1(a, b):
    """
    Plot a fitted decision tree classifier.
    :param a: decision tree classifier
    :param b: feature names
    :return: graph
    """
    # out_file=None makes export_graphviz return the DOT source as a string
    dot_data = tree.export_graphviz(a, out_file=None,
                             feature_names=b,
                             class_names=['Malignant', 'Benign'],
                             filled=False, rounded=True,
                             special_characters=False)
    graph = graphviz.Source(dot_data)
    return graph

Picture 4. Visualization of Decision Tree


Decision Tree accuracy: 96.24%

Random Forest

rf = RandomForestClassifier()
rfp = Pipeline([
    ('reduce_dim', pca),
    ('rf', rf)
])
score_rf = cross_val_score(rfp, X, y, cv=10)

Random Forest accuracy = 95.9%

Naive Bayes Classifier

gnb = GaussianNB()
gnbclf = gnb.fit(X_train_N, y_train)
gnbp = Pipeline([
    ('reduce_dim', pca),
    ('gnb', gnb)
])
gnb_score = cross_val_score(gnbp, X, y, cv=10, scoring='accuracy')

Naive Bayes Classifier accuracy = 95.38%

Neural Networks

###### Neural Networks ######

scaler = StandardScaler()

num_epoch = 10

# 1-layer NN
def l1neuralNetwork():
    model = Sequential()
    model.add(Dense(input_dim=30, units=2))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    #model.summary()

    model.fit(scaler.fit_transform(X_train_N), y_train, epochs=num_epoch,
              shuffle=True)
    y_pred = model.predict_classes(scaler.transform(X_test_N.values))

# 3-layer NN
def l3neuralNetwork():
    model = Sequential()
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(input_dim=30, units=2))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    #model.summary()
    model.fit(scaler.fit_transform(X_train_N), y_train, epochs=num_epoch,
              shuffle=True)
    y_pred = model.predict_classes(scaler.transform(X_test_N.values))

# 5-layer NN
def l5neuralNetwork():
    model = Sequential()
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(input_dim=30, units=2))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    #model.summary()
    model.fit(scaler.fit_transform(X_train_N), y_train, epochs=num_epoch,
              shuffle=True)
    y_pred = model.predict_classes(scaler.transform(X_test_N.values))
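
The functions above compute y_pred but do not show the evaluation step; a minimal sketch of how the test accuracies below could be obtained (assuming y_pred is returned from one of the functions):

# Sketch: test-set accuracy for a network's predictions
# (assumes y_pred as computed in the functions above)
from sklearn.metrics import accuracy_score

print("Test accuracy: {:.2%}".format(accuracy_score(y_test, y_pred)))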

Accuracy for the 1-, 3- and 5-layer neural networks: 97.07%, 96.73% and 97.66%, respectively.

As we can see in this comparison, the best classification result is obtained with the SVM algorithm and the worst with the Naive Bayes classifier.

Classification metrics

The classification metrics below are computed for the model with the best accuracy score (the SVM pipeline).

Confusion Matrix

A confusion matrix is a performance measurement for a machine learning classification problem where the output can be two or more classes.

It’s useful for measuring Precision, Recall, F1 score, accuracy and AUC.

  • TP (True Positive) – you predicted positive and it is true,
  • FP (False Positive) – you predicted positive and it is false,
  • FN (False Negative) – you predicted negative and it is false,
  • TN (True Negative) – you predicted negative and it is true.

y_pred = model.predict(X_test_N)
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index=range(2), columns=range(2))
plt.figure(figsize=(10, 7))
sns.set(font_scale=1.4)  # for label size
cm_plot = sns.heatmap(df_cm, annot=True, fmt='n', annot_kws={"size": 12})


Picture 5. Visualization of Confusion Matrix
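
The four counts defined above can also be read directly off the matrix; a small sketch using the cm computed above (note that with the mapping d = {'M': 0, 'B': 1}, the positive class here is benign):

# Sketch: unpacking TN, FP, FN, TP from the binary confusion matrix above
# (with the mapping M -> 0, B -> 1, the "positive" class is benign)
tn, fp, fn, tp = cm.ravel()
print("TP: {}  FP: {}  FN: {}  TN: {}".format(tp, fp, fn, tn))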

Precision, Recall & F1 Score

Precision: out of all the samples predicted as positive, how many are actually positive, i.e. Precision = TP / (TP + FP).

Recall: out of all the actually positive samples, how many we predicted correctly, i.e. Recall = TP / (TP + FN).

F1 score is the harmonic mean of precision and recall: F1 = 2 * Precision * Recall / (Precision + Recall).

print("Precision score {}%".format(round(precision_score(y_test, y_pred),3)))
print("Recall score {}%".format(round(recall_score(y_test, y_pred),3)))
print("F1 Score {}%".format(round(f1_score(y_test, y_pred, average='weighted'),3)))

ROC Curve

The ROC curve (Receiver Operating Characteristic) is a performance measurement for a classification problem at various threshold settings. It tells how well the model is capable of distinguishing between the classes.

y_score = model.fit(X_train_N, y_train).decision_function(X_test_N)

fpr, tpr, thresholds = roc_curve(y_test, y_score)


fig, ax = plt.subplots(1, figsize=(12, 6))
plt.plot(fpr, tpr, color='blue', label='ROC curve for SVM')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate (1 - specificity)')
plt.ylabel('True Positive Rate (sensitivity)')
plt.title('ROC Curve for Breast Cancer Classifier')
plt.legend(loc="lower right")


Picture 6. ROC Curve
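
The post does not report an AUC value, but since y_score is already computed, the area under the curve is one extra call away; a minimal sketch:

# Sketch: area under the ROC curve for the same SVM decision scores
from sklearn.metrics import roc_auc_score

print("AUC: {:.3f}".format(roc_auc_score(y_test, y_score)))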

Correlation Map

f, ax = plt.subplots(figsize=(14, 14))
corr_plot = sns.heatmap(X.corr(), annot=False, linewidths=.5, fmt='.1f', ax=ax)


Picture 7. Visualization of Correlation Map for all features
