Skip to main content

Command Palette

Search for a command to run...

Machine Learning Intro to Breast Cancer Classification

Published
7 min read
Machine Learning  Intro to Breast Cancer Classification
R

Graduate with a passion for developing software solutions to creative problems. Proven ability to write clean, scalable, robust, and performant production code. Diligent and results-oriented, with a passion for continuous improvement and test-driven design. Proven ability to think big and creatively, while also staying grounded in real-world pragmatism. Proficient in Python, Java, JavaScript/NodeJS, and TypeScript. Strong team player and self-starter, with excellent attention to detail. Background in object-oriented programming (OOP), data structures, and algorithms.

Breast cancer is one of the most common cancers among women and men globally. Breast cancer arises when cells in the breast start to develop abnormally. Due to this cancer, there is a huge number of deaths every year. It is the most common type of all cancer and also the main cause of women’s death worldwide.

Nowadays Classification and data mining methods are very effective ways to classify data. Especially when we talk about the medical field, wherewith the help of machine learning we use to diagnose the disease and analysis also to make particular decisions.

So with the help of Machine learning if we can classify the patient having which type of cancer, then it will be easy for doctors to provide timely treatment to patients and improve the chance of survival.

In this Machine learning project, we are going to analyze and classify Breast Cancer (which breast cancer belongs to which category), as basically there are two categories of breast cancer that is:

  • Malignant type breast cancer

  • Benign-type breast cancer

So our main aim in this project is that with the help of a dataset, we will create a model which will correctly classify whether Breast Cancer is of malignant or benign type.

Breast Cancer Dataset

We will be using a breast cancer dataset which you can download from this link

DataSet Source

We are going to analyze the dataset completely, which will clear all your questions regarding what dataset we will be using, how many rows and columns are there, etc.

So there are 10 columns that are:

  • Id

  • Diagnosis

  • Radius mean

  • Texture means

  • Perimeter means

  • Area means

  • Smoothness mean

  • Compactness mean

  • Concavity mean

  • Concave points

Project Prerequisites

I am using Google Collabs, and I am importing the following libraries, they have already been installed in Google Collabs.

  • Numpy

  • Pandas

  • Matplotlib

  • Seaborn

  • Sklearn

  • Tensorflow

1) Now let’s start Analysing and Implementing our Breast Cancer Classification

Firstly we have to import all the required libraries that we have installed above. We will also import some libraries at the time we will use them.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2) Loading the dataset

In Google Collabs, you have to first upload the dataset to access that data. So to upload the dataset run the following command

# Load the dataset
from google.colab import files
uploaded = files.upload()

Now let's view our dataset using the `head()` method

df = pd.read_csv('data.csv')
df.head(10)

3) Analysing the data

Now we will analyze our dataset to see what is the shape of our data, and how many empty values are present in our dataset, and we will drop those missing values using various methods provided by pandas.

Count the number of rows and columns in the dataset

df.shape

Let’s create a pair plot that will show us the complete relationship between radius mean, texture means, perimeter means, area means and smoothness means on the basis of diagnosis type.

vars = ['radius_mean', 'texture_mean','perimeter_mean','area_mean','smoothness_mean']
sns.pairplot(
    df,hue = 'diagnosis', 
    palette= 'coolwarm', 
    vars = vars
)

# count the number of empty values in each columns:
df.isna().sum()
# drop the columns with all the missing values:
df = df.dropna(axis = 1)
df.shape
# Get the count of the number of Malignant(M) or Benign(B) cells
df['diagnosis'].value_counts()

Now we will visualize the diagnosis column in our dataset to see how many malignant and benign are present.

# visualize the count:
sns.countplot(df['diagnosis'], label = 'count')

In this whole analyzing process we are going to convert our data and perform some data processing so that we can build a model which can classify the type of Breast cancer using this preprocessed data.

I have written comments above each line of code about what we are doing and why we are doing it so that you can understand it better and easily.

# Rename the diagnosis data to labels:
df = df.rename(columns = {'diagnosis' : 'label'})
print(df.dtypes)

# define the dependent variable that need to predict(label)
y = df['label'].values
print(np.unique(y))

4) Encoding Categorical Data

Now we will convert our text (B and M) to integers (0 and 1) using LabelEncoder provided by sklearn library.

# Encoding categorical data from text(B and M) to integers (0 and 1)
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
Y = labelencoder.fit_transform(y) # M = 1 and B = 0
print(np.unique(Y))

5) Defined X

X will be our main features data which consists of all the columns except the label and id column. We also normalize our X data using MinMaxScaler provided by sklearn library.

from sklearn.preprocessing import MinMaxScaler

# define x and normalize / scale value:
# define the independent variables, Drop label and ID, and normalize other data:
X  = df.drop(labels=['label','id'], axis = 1)

scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)

print(X)Now we will split our data into training data and testing data using train_test_split.

6) Splitting Our Data

Now we will split our data into training data and testing data using train_test_split.

# Split data into training and testing data to verify accuracy after fitting the model
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    X,
    Y, 
    test_size = 0.25, 
    random_state=42
)
print('Shape of training data is: ',  x_train.shape)
print('Shape of testing data is: ',   x_test.shape)

7) Creating Model

In this step, we will be creating a Sequential model with the help of TensorFlow and Keras. In the model we have created three Dense layers in which one is the input layer with 128 hidden layers and the activation function is relu, and the second layer consists of 64 hidden layers with activation function relu and the third layer is final output layer and activation function is sigmoid.

import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(128, input_dim=30, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64,activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

8) Compile and Fit Machine Learning Model to Our Training Data

Compile the model and view a summary of the model

model.compile(
    loss = 'binary_crossentropy', 
    optimizer = 'adam' , 
    metrics = ['accuracy']
)
model.summary()

Fit the model to see the accuracy of training data

# fit with no early stopping or other callbacks:
history = model.fit(
    x_train,y_train,verbose = 1,
    epochs = 100, 
    batch_size = 64,
    validation_data = (x_test,y_test)
)

8) Visualizing Our Training Accuracy and Validation Accuracy

In this step, we will be analyzing our training accuracy and validation accuracy and also we will be plotting losses at each epoch with the help of matplotlib library.

# plot the training and validation accuracy and loss at each epochs:
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1,len(loss)+1)
plt.plot(epochs,loss,'y',label = 'Training loss')
plt.plot(epochs,val_loss,'r',label = 'Validation loss')
plt.title('TechVidvan Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
plt.plot(epochs,acc,'y',label = 'Training acc')
plt.plot(epochs,val_acc,'r',label = 'Validation acc')
plt.title('TechVidvan Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

10) Prediction and Visualizing our Model Accuracy on Test Data

This is the last step in which we will be seeing our model prediction of test data.

# Predicting the Test set results:
y_pred = model.predict(x_test)
y_pred = (y_pred > 0.5)

# Making the Confusion Matrix:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)

sns.heatmap(cm, annot = True)

In this, we can see that our model is working very efficiently and accurately in classifying whether the breast cancer is of Malignant type or Benign type.

Summary

We have created a Breast Cancer Classification project in a very easy way using a Neural network. Our model accuracy is 98.8 % on training data and 97.9% accuracy on validation data. As we have also seen that our model is classifying test data very efficiently and accurately.

So in this project, we have learned how to analyze and visualize the data using pandas and matplotlib libraries. We have also learned the use of LabelEncoder, MinMaxScaler.