Breast cancer classification using scikit-learn and Keras
This post covers breast cancer classification using machine learning techniques and neural networks.
Introduction to Breast Cancer
The goal of the project is to analyze medical data with artificial intelligence methods, namely machine learning and deep learning, in order to classify tumors as malignant or benign. Breast cancer is the most common cancer among women and the leading cause of cancer death among women worldwide. The most effective way to reduce the number of deaths is early detection.
Every 19 seconds, cancer in women is diagnosed somewhere in the world, and every 74 seconds someone dies from breast cancer.
Machine learning enables fast and precise classification of breast cancer based on numerical data (as in our case) or on images, potentially without leaving home for a procedure such as a surgical biopsy.
Data used for the project
For the project, I used the breast cancer dataset from the University of Wisconsin. The dataset contains 569 samples and 30 features computed from digital images. Each sample holds the measurements for one patient.
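Feature information (besides ID and diagnosis, each of the ten measurements below is recorded as a mean, a standard error and a worst value, which gives the 30 features):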
- ID
- diagnosis
- radius
- texture
- perimeter
- area
- smoothness
- compactness
- concavity
- concave points
- symmetry
- fractal dimension
Python packages
I work daily with Python 3.6+ using a few packages to simplify everyday tasks in data science.
Below are the most important ones.
- scikit-learn is a library for machine learning algorithms
- Keras is a library for deep learning algorithms
- Pandas is used for data processing
- Seaborn is used for data visualization
All requirements are listed in the requirements.txt file in the Ermlab repository.
Data processing
First of all, we need to import our data using the Pandas module.
import pandas as pd

# Load data
data = pd.read_csv('Data/data.csv', delimiter=',', header=0)
Before doing anything like feature selection, feature extraction or classification, we start with basic data analysis. Let's look at the features of the data.
# head() shows the first 5 rows of the data
print(data.head())
         id diagnosis  ...  fractal_dimension_worst  Unnamed: 32
0    842302         M  ...                  0.11890          NaN
1    842517         M  ...                  0.08902          NaN
2  84300903         M  ...                  0.08758          NaN
3  84348301         M  ...                  0.17300          NaN
4  84358402         M  ...                  0.07678          NaN
Now we need to drop the columns that are not used as features: id (not used for classification), Unnamed: 32 (contains only NaN values) and diagnosis (this is our label). The next step is to convert the strings (M, B) to integers (0, 1) using map() and to define our features and labels.
# Drop unused columns
columns = ['Unnamed: 32', 'id', 'diagnosis']
# Convert strings -> integers
d = {'M': 0, 'B': 1}
# Define features and labels
y = data['diagnosis'].map(d)
X = data.drop(columns, axis=1)
First plot: the number of malignant and benign cases.
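import matplotlib.pyplot as plt
import seaborn as sns

# Plot number of M - malignant and B - benign cancer
# (recent seaborn versions expect the data as a keyword argument)
ax = sns.countplot(x=y, label="Count", palette="muted")
# value_counts() sorts by frequency, so the larger class (benign) comes first
B, M = y.value_counts()
plt.savefig('count.png')
print('Number of benign cancer: ', B)
print('Number of malignant cancer: ', M)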
Picture 1. Count of Benign and Malignant cancer
We have 357 benign and 212 malignant samples of cancer.
Next, we split the data into training and test sets and normalize them.
from sklearn.model_selection import train_test_split

# Split dataset into training (80%) and test (20%) set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Normalize data; the test set is scaled with the training-set statistics
X_train_N = (X_train - X_train.mean()) / (X_train.max() - X_train.min())
X_test_N = (X_test - X_train.mean()) / (X_train.max() - X_train.min())
Dimensionality Reduction
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm.
Another very useful piece of information is the Explained Variance Ratio of each principal component. It indicates the proportion of the dataset's variance that lies along that component.
Picture 2. Variance ratio of PCA without Std
As you can see in Picture 2, without data normalization a single component is enough to explain almost all of the variance. To learn more, let's standardize the data; the result is presented in Picture 3.
Picture 3. Variance ratio of PCA with Std
As you can see in Picture 3, with data standardization six components are necessary to reach 95% of the variance.
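The comparison in Pictures 2 and 3 can be reproduced roughly as in the sketch below. It makes two assumptions that are not from the original code: scikit-learn's bundled copy of the Wisconsin dataset (load_breast_cancer) stands in for Data/data.csv, and StandardScaler is used for the standardization. The helper name components_for_variance is hypothetical. The exact number of components depends on the scaling that is applied, so it does not have to match the figures above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset (569 samples, 30 features) replaces Data/data.csv.
X_demo = load_breast_cancer().data


def components_for_variance(features, threshold=0.95):
    """Return the explained variance ratios and the number of
    principal components needed to reach the given threshold."""
    ratios = PCA().fit(features).explained_variance_ratio_
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= threshold, turned into a count
    n_components = int(np.searchsorted(cumulative, threshold) + 1)
    return ratios, n_components


ratios_raw, n_raw = components_for_variance(X_demo)
ratios_std, n_std = components_for_variance(StandardScaler().fit_transform(X_demo))

print("Components for 95% variance without standardization:", n_raw)
print("Components for 95% variance with standardization:", n_std)
```

Without scaling, the features with the largest numeric range (such as area) dominate the variance, which is why a single component appears sufficient.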
Classification
In this section, we compare the classification results of several popular classifiers and of neural networks with different architectures.
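from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# pca is the PCA transformer from the dimensionality reduction step
svc = svm.SVC(kernel='linear', C=1)
# Pipeline
model = Pipeline([
    ('reduce_dim', pca),
    ('svc', svc)
])
# Fit
model.fit(X_train_N, y_train)
svm_score = cross_val_score(model, X, y, cv=10, scoring='accuracy')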
SVM accuracy = 98.83%
from sklearn.neighbors import KNeighborsClassifier

def KnearestNeighbors():
    """
    Compute accuracy using the k-NN algorithm for k = 1..4
    :return: list of mean k-NN scores, one per k
    """
    k_scores = []
    for i in range(1, 5):
        knn = KNeighborsClassifier(n_neighbors=i)
        knnp = Pipeline([
            ('reduce_dim', pca),
            ('knn', knn)
        ])
        k_score = cross_val_score(knnp, X, y, cv=10, scoring="accuracy")
        k_scores.append(k_score.mean())
    return k_scores
K-NN accuracy: 96.74%
from sklearn import tree

trees = tree.DecisionTreeClassifier()
treeclf = trees.fit(X_train_N, y_train)
treep = Pipeline([
    ('reduce_dim', pca),
    ('trees', trees)
])
score_trees = cross_val_score(treep, X, y, cv=10)
A simple visualization of the decision tree:
import graphviz

feature_names = X.columns.values

def plot_decision_tree1(a, b):
    """
    Plot a decision tree
    :param a: decision tree classifier
    :param b: feature names
    :return: graph
    """
    # out_file=None makes export_graphviz return the DOT source as a string;
    # with a file path it returns None and graphviz.Source would fail
    dot_data = tree.export_graphviz(a, out_file=None,
                                    feature_names=b,
                                    class_names=['Malignant', 'Benign'],
                                    filled=False, rounded=True,
                                    special_characters=False)
    graph = graphviz.Source(dot_data)
    return graph
Picture 4. Visualization of Decision Tree
Decision Tree accuracy: 96.24%
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rfp = Pipeline([
    ('reduce_dim', pca),
    ('rf', rf)
])
score_rf = cross_val_score(rfp, X, y, cv=10)
Random Forest accuracy = 95.9%
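from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnbclf = gnb.fit(X_train_N, y_train)
gnbp = Pipeline([
    ('reduce_dim', pca),
    ('gnb', gnb)
])
# Score the pipeline (gnbp), consistent with the other classifiers
gnb_score = cross_val_score(gnbp, X, y, cv=10, scoring='accuracy')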
Naive Bayes Classifier accuracy = 95.38%
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.preprocessing import StandardScaler

###### Neural Networks ######
scaler = StandardScaler()
num_epoch = 10

# 1-layer NN
def l1neuralNetwork():
    model = Sequential()
    model.add(Dense(input_dim=30, units=2))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    #model.summary()
    model.fit(scaler.fit_transform(X_train_N), y_train, epochs=num_epoch, shuffle=True)
    # predict_classes() is not available in newer Keras releases;
    # take the argmax of the softmax output instead
    y_pred = np.argmax(model.predict(scaler.transform(X_test_N)), axis=1)
    return y_pred

# 3-layer NN
def l3neuralNetwork():
    model = Sequential()
    # Only the first layer needs the input dimension
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(units=30))
    model.add(Dense(units=2))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    #model.summary()
    model.fit(scaler.fit_transform(X_train_N), y_train, epochs=num_epoch, shuffle=True)
    y_pred = np.argmax(model.predict(scaler.transform(X_test_N)), axis=1)
    return y_pred

# 5-layer NN
def l5neuralNetwork():
    model = Sequential()
    model.add(Dense(input_dim=30, units=30))
    model.add(Dense(units=30))
    model.add(Dense(units=30))
    model.add(Dense(units=30))
    model.add(Dense(units=2))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    #model.summary()
    model.fit(scaler.fit_transform(X_train_N), y_train, epochs=num_epoch, shuffle=True)
    y_pred = np.argmax(model.predict(scaler.transform(X_test_N)), axis=1)
    return y_pred
Accuracy for the 1-, 3- and 5-layer neural networks: 97.07%, 96.73% and 97.66%
As we can see, in this comparison of classifiers the SVM algorithm gives the best classification accuracy.
The Naive Bayes classifier gives the worst.
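A comparison like this can also be run in a single loop. The sketch below is not the original project code and makes several assumptions: it uses scikit-learn's bundled copy of the dataset (load_breast_cancer) instead of Data/data.csv, it replaces the PCA step with a StandardScaler because the PCA settings are not shown in this post, and the k-NN setting n_neighbors=3 is an arbitrary choice. The numbers it prints will therefore differ from the accuracies reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled dataset, with 0 = malignant and 1 = benign
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Hypothetical selection of classifiers; hyperparameters are illustrative
candidates = {
    "SVM": SVC(kernel="linear", C=1),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

mean_accuracy = {}
for name, clf in candidates.items():
    # Scaling inside the pipeline means it is refitted on each training fold,
    # so no information from the validation fold leaks into the scaler
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring="accuracy")
    mean_accuracy[name] = scores.mean()

# Print the classifiers from best to worst mean accuracy
for name, acc in sorted(mean_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```

Putting the preprocessing step inside the pipeline keeps the evaluation of every classifier on the same footing.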
Classification metrics
The classification metrics below are computed for the classifier with the best accuracy (the SVM algorithm).
Confusion Matrix
A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes.
It is the basis for computing precision, recall, F1 score and accuracy and, when evaluated over many thresholds, the ROC curve and AUC.
TP (True Positive) – you predicted positive and it is true,
FP (False Positive) – you predicted positive and it is false,
FN (False Negative) – you predicted negative and it is false,
TN (True Negative) – you predicted negative and it is true.
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test_N)
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)  # for label size
cm_plot = sns.heatmap(df_cm, annot=True, fmt='n', annot_kws={"size": 12})
Picture 5. Visualization of Confusion Matrix
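To make the four cells concrete, here is a minimal sketch on made-up labels (not results from the project). It follows the encoding used in this post, 0 = malignant and 1 = benign, so scikit-learn's "positive" class is benign here.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 0 = malignant, 1 = benign
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels the 2x2 matrix is [[TN, FP], [FN, TP]],
# so ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=4, FP=1, FN=1, TN=2
```

Rows of the matrix correspond to the true labels and columns to the predicted labels.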
Precision, Recall & F1 Score
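Precision: out of all the samples we predicted as positive, how many are actually positive.
Recall: out of all the actually positive samples, how many we predicted correctly.
F1-score is the harmonic mean of the precision and recall. Note that with the encoding used here (M = 0, B = 1), scikit-learn treats label 1, i.e. benign, as the positive class by default.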
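from sklearn.metrics import precision_score, recall_score, f1_score

# The scores are fractions in [0, 1]; multiply by 100 to print percentages
print("Precision score {}%".format(round(100 * precision_score(y_test, y_pred), 1)))
print("Recall score {}%".format(round(100 * recall_score(y_test, y_pred), 1)))
print("F1 Score {}%".format(round(100 * f1_score(y_test, y_pred, average='weighted'), 1)))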
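Because precision and recall are defined relative to a "positive" class, the choice of that class changes the numbers. The sketch below uses made-up labels (not project results) and the pos_label parameter of the scikit-learn metric functions to compute the scores once for benign (1) and once for malignant (0). The helper name summarize is hypothetical.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels for illustration only: 0 = malignant, 1 = benign
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 1]


def summarize(pos_label):
    """Return (precision, recall, F1) with pos_label as the positive class."""
    precision = precision_score(y_true_demo, y_pred_demo, pos_label=pos_label)
    recall = recall_score(y_true_demo, y_pred_demo, pos_label=pos_label)
    f1 = f1_score(y_true_demo, y_pred_demo, pos_label=pos_label)
    return precision, recall, f1


benign_metrics = summarize(pos_label=1)     # 4 TP, 1 FP, 1 FN
malignant_metrics = summarize(pos_label=0)  # 2 TP, 1 FP, 1 FN

print("Benign as positive:    P=%.3f R=%.3f F1=%.3f" % benign_metrics)
print("Malignant as positive: P=%.3f R=%.3f F1=%.3f" % malignant_metrics)
```

In a screening setting the recall for the malignant class is often the number of most interest, since it reflects how many malignant cases are missed.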
ROC Curve
The ROC (Receiver Operating Characteristic) curve is a performance measurement for classification problems at various threshold settings. It shows how well the model is able to distinguish between the classes.
from sklearn.metrics import roc_curve

# decision_function returns the signed distance to the SVM decision boundary
y_score = model.fit(X_train_N, y_train).decision_function(X_test_N)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
fig, ax = plt.subplots(1, figsize=(12, 6))
plt.plot(fpr, tpr, color='blue', label='ROC curve for SVM')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate (1 - specificity)')
plt.ylabel('True Positive Rate (sensitivity)')
plt.title('ROC Curve for Breast Cancer Classifier')
plt.legend(loc="lower right")
Picture 6. ROC Curve
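The curve is often summarized by a single number, the area under the curve (AUC). As a minimal sketch on four made-up scores (not project results), the example below lists the points of the ROC curve and computes the AUC with scikit-learn's roc_auc_score.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data for illustration only: true labels and classifier scores
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)
for f, t, th in zip(fpr_demo, tpr_demo, thresholds_demo):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# AUC is the probability that a random positive sample gets a higher
# score than a random negative one: here 3 of the 4 pairs are ordered correctly
auc_value = roc_auc_score(y_true_demo, y_score_demo)
print("AUC:", auc_value)  # AUC: 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random guessing), and 1.0 to a perfect separation of the classes.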
Correlation Map
# plt.subplots already creates the figure, so a separate plt.figure() call is not needed
f, ax = plt.subplots(figsize=(14,14))
corr_plot = sns.heatmap(X.corr(), annot=False, linewidths=.5, fmt='.1f', ax=ax)
Picture 7. Visualization of Correlation Map for all features