Diabetes Prediction using Machine Learning

In this machine learning project, we build a model that predicts diabetes. We compare three algorithms: the Random Forest Classifier, the Support Vector Classifier, and the Gradient Boosting Classifier.

Diabetes Prediction

Diabetes is a serious chronic illness. It is associated with several risk factors, including obesity, and is characterized by high blood sugar. It involves the hormone insulin: when the body does not produce enough insulin, carbohydrate metabolism becomes erratic and blood sugar levels rise.

The World Health Organization (WHO) estimates that 422 million people globally, mostly in low- and middle-income countries, have diabetes. This number could rise to 490 million by the year 2030. Diabetes rates are high in countries such as Canada, China, and India. India, which has a population of over 1 billion, has about 40 million diabetics. Diabetes is one of the main causes of death around the globe, but it can be managed and controlled when it is detected at an early stage.

We predict diabetes from a number of diabetes-related characteristics. This project uses the Pima Indian Diabetes Dataset and applies a variety of machine learning classification and ensemble techniques to it. Models built this way may help in the diagnosis of diabetes. Many machine learning techniques can be used to make predictions, but picking the most effective one can be difficult. So we take one dataset and run several well-known classification and ensemble algorithms on it to compare them.

Choosing Model

K. Vijiya Kumar proposed using the Random Forest algorithm to build a system that can predict diabetes earlier in a patient’s life and with higher accuracy. The results indicated that the system predicts diabetes effectively, efficiently, and quickly, and that the proposed model yields the best results for diabetes prediction.

Nonso Nnamoko proposed predicting diabetes with an ensemble supervised learning approach in which five commonly used classifiers formed the ensembles and a meta-classifier combined their outputs. The results were presented and compared with those from previous studies that used the same dataset.

Studies have shown that diabetes is easier to manage when it is detected at an early stage. Tejas N. Joshi et al. presented a diabetes prediction study using machine learning approaches, which attempts to predict diabetes with three supervised machine learning techniques: SVM, Logistic Regression, and ANN. The results of the experiment point to a practical technique for early detection of diabetes. Dheeraj Shetty suggested using data mining to build an Intelligent Diabetic Illness Prediction System that analyzes the condition using the data of diabetic patients. His research suggests that diabetes patient databases can be analyzed with algorithms such as Bayesian classifiers and KNN (K-Nearest Neighbor) to forecast the development of diabetes.

Many algorithms could be used, but the Random Forest Classifier performed best for our project, with an accuracy of around 90%.
Let’s have a look at the model.

Random Forest Classifier

Random Forest is an ensemble learning technique that can be used for both classification and regression tasks. It is a popular ensemble method created by Leo Breiman. It often achieves higher accuracy than a single model, and it handles large datasets easily. By lowering variance, Random Forest improves on the performance of a single Decision Tree. During training, it builds a large number of decision trees. For classification, it outputs the class predicted by the most trees (the mode); for regression, it outputs the mean of the trees’ predictions.

Algorithms

a. Select “R” features at random from the total “m” features, where R < m.
b. Among the “R” features, find the feature and split point that give the best split for the node.
c. Use that split to divide the node into sub-nodes.
d. Repeat steps a through c until “l” nodes have been reached.
e. Build the forest by repeating steps a through d “n” times to create “n” trees.
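The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation scikit-learn uses: it trains each tree on a bootstrap sample of the rows, lets scikit-learn's DecisionTreeClassifier pick a random subset of features at every split through max_features, and combines the trees by majority vote. The helper names build_forest and forest_predict and the synthetic data are our own assumptions for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

def build_forest(X, y, n_trees=25, max_features="sqrt", seed=0):
    """Train n_trees decision trees, each on a bootstrap sample of the rows."""
    rng = np.random.default_rng(seed)
    forest = []
    for i in range(n_trees):
        # Bootstrap: draw len(X) row indices with replacement.
        rows = rng.integers(0, len(X), size=len(X))
        # max_features makes every split consider only a random subset
        # of R features out of the m available (R < m).
        tree = DecisionTreeClassifier(max_features=max_features, random_state=seed + i)
        forest.append(tree.fit(X[rows], y[rows]))
    return forest

def forest_predict(forest, X):
    """Majority vote over the trees (labels are assumed to be 0 and 1)."""
    votes = np.stack([tree.predict(X) for tree in forest])  # shape: (n_trees, n_samples)
    # With an odd number of trees there are no ties.
    return (votes.mean(axis=0) > 0.5).astype(int)

# Train on the first 200 rows, evaluate on the remaining 100.
demo_forest = build_forest(X_demo[:200], y_demo[:200])
demo_pred = forest_predict(demo_forest, X_demo[200:])
demo_accuracy = (demo_pred == y_demo[200:]).mean()
print(f"hand-rolled forest accuracy: {demo_accuracy:.2f}")
```

In the project itself we use RandomForestClassifier from sklearn, which handles all of this internally.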

Project Prerequisites

This project requires Python 3.8 or later installed on your computer, since the library versions listed below do not support older Python releases. I have used a Jupyter notebook for this project, but you can use any environment you like.
The required modules for this project are –

Numpy(1.22.4) – pip install numpy
Pandas(1.5.0) – pip install pandas
Seaborn(0.9.0) – pip install seaborn
Scikit-learn(1.1.1) – pip install scikit-learn
Matplotlib – pip install matplotlib

Diabetes Prediction Project & DataSet

We have provided the source code of the diabetes prediction project along with the dataset, a CSV file, that this machine learning project requires. You can download the dataset and the Jupyter notebook from the link below.

Implementing Machine Learning Model

1. Import the modules and the libraries. For this project, we are importing the libraries numpy, pandas, seaborn, sklearn, and matplotlib.

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

2. Here we read our dataset and print its first few rows.

dataframe = pd.read_csv('dataset.csv')
dataframe.head()

3. Here we use the seaborn library to plot a box plot of the Insulin column.

import seaborn as sns
sns.boxplot(x = dataframe["Insulin"]);
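The points a box plot draws beyond its whiskers are, by default, the values more than 1.5 times the interquartile range (IQR) away from the quartiles. As a rough sketch, the same rule can be applied by hand. The insulin_demo values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative values only (an assumption, not rows from dataset.csv).
insulin_demo = pd.Series([0, 0, 94, 168, 88, 543, 846, 175, 230, 83])

# Quartiles and interquartile range.
q1 = insulin_demo.quantile(0.25)
q3 = insulin_demo.quantile(0.75)
iqr = q3 - q1

# Values outside these bounds are the ones a box plot marks as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = insulin_demo[(insulin_demo < lower) | (insulin_demo > upper)]
print(outliers)
```

Replacing insulin_demo with dataframe["Insulin"] would apply the same check to the real column.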

4. We check whether the data contains any null values. As we can see, there are no null values in the dataset.

5. We print the dataset again.

6. We print the correlation matrix of the dataset to see how strongly the columns are related and which columns are least relevant.

# checking if there is any missing value in our dataset
dataframe.isnull().sum()

# there are no missing values in our dataset, so we print the dataset again
dataframe.head()

# printing the correlation of the dataframe to see the correlation of every column
dataframe.corr()
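One caveat about the null check: copies of the Pima Indian Diabetes Dataset are commonly reported to store missing measurements as 0 rather than as empty cells, so isnull() can report no missing values even when some are hidden. Whether this applies to your copy of dataset.csv is an assumption worth checking. The sketch below uses a tiny hand-made frame with the same column names (the values are illustrative only) to show one way to count such zeros and replace them with the column median.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame; values are illustrative, not taken from dataset.csv.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a value of 0 is not physically plausible (our assumption).
zero_as_missing = ["Glucose", "BloodPressure", "Insulin", "BMI"]

# isnull() finds nothing, because the gaps are stored as 0 rather than NaN.
null_counts = toy.isnull().sum()

# Count the zeros, then mark them as NaN and fill with each column's median.
zero_counts = (toy[zero_as_missing] == 0).sum()
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(
    cleaned[zero_as_missing].median()
)
print(zero_counts)
```

Median imputation is only one option; dropping the affected rows or using a model that tolerates missing values are alternatives.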

7. Here we plot a heatmap with the heatmap function of the seaborn library to see the correlations more clearly.

f, ax = plt.subplots(figsize= [20,15])
sns.heatmap(
    dataframe.corr(), 
    annot = True, 
    fmt = ".2f", 
    ax = ax, 
    cmap = "magma"
)
ax.set_title("Correlation Matrix", fontsize=20)
plt.show()

8. Here we plot a pie chart and a count plot of the ‘Outcome’ column, which is our target, followed by the distribution of each feature.

f,ax=plt.subplots(1,2,figsize=(18,8))
dataframe['Outcome'].value_counts().plot.pie(
    explode = [0,0.1],
    autopct = '%1.1f%%',
    ax = ax[0],
    shadow = True
)
ax[0].set_title('target')
ax[0].set_ylabel('')
sns.countplot(x = 'Outcome', data = dataframe, ax = ax[1])
ax[1].set_title('Outcome')
plt.show()

# plotting the distribution of every feature on a 4 x 2 grid
# note: distplot is deprecated from seaborn 0.11 onwards;
# on newer versions use sns.histplot(..., kde=True) instead
fig, ax = plt.subplots(4,2, figsize=(16,16))
sns.distplot(dataframe.Age, bins = 20, ax=ax[0,0])
sns.distplot(dataframe.Pregnancies, bins = 20, ax=ax[0,1])
sns.distplot(dataframe.Glucose, bins = 20, ax=ax[1,0])
sns.distplot(dataframe.BloodPressure, bins = 20, ax=ax[1,1])
sns.distplot(dataframe.SkinThickness, bins = 20, ax=ax[2,0])
sns.distplot(dataframe.Insulin, bins = 20, ax=ax[2,1])
sns.distplot(dataframe.DiabetesPedigreeFunction, bins = 20, ax=ax[3,0])
sns.distplot(dataframe.BMI, bins = 20, ax=ax[3,1])
plt.show()

9. Here we split the original dataset into the features X and the target y.

y = dataframe["Outcome"]
X = dataframe.drop(["Outcome"], axis = 1)

10. Here we use the train_test_split function from sklearn to divide the dataset into a training set and a testing set. One third of the rows (test_size=0.33) are held out for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
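When one class is less common than the other, a plain random split can leave the training and testing sets with slightly different class proportions. As an optional variation, train_test_split accepts a stratify argument that keeps the proportions the same in both sets. The sketch below uses made-up arrays (X_demo and y_demo are our own assumptions, with 35% positive labels) rather than the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 35% of the labels are 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 65 + [1] * 35)

# stratify=y_demo keeps the share of each class equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("train size:", len(X_tr), "positive share:", round(y_tr.mean(), 3))
print("test size:", len(X_te), "positive share:", round(y_te.mean(), 3))
```

In the project, this would amount to adding stratify=y to the call above.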

11. Here we train the Random Forest Classifier on the training set and measure its accuracy on the testing set. The accuracy of this algorithm comes out to be 89%. Because no random_state is passed to the classifier, the exact figure can vary slightly between runs.

# train a random forest and measure its accuracy on the test set
model = RandomForestClassifier()
clf = model.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
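Accuracy alone does not show which kind of mistakes a model makes. We imported confusion_matrix in step 1, and it can break the predictions down into true and false positives and negatives. The sketch below is self-contained and runs on synthetic data from make_classification (an assumption made so the example runs without dataset.csv); in the project you would pass y_test and y_pred instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_te, pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision, recall and F1 score for each class.
report = classification_report(y_te, pred)
print(report)
```

For a medical screening task, the false negatives (fn) are usually the number to watch, since they are the patients the model misses.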

12. Here we train the Gradient Boosting Classifier on the training set and measure its accuracy on the testing set. The accuracy of this algorithm comes out to be 88%.

# train a gradient boosting model and measure its accuracy on the test set
model = GradientBoostingClassifier()
clf = model.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
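The Random Forest and Gradient Boosting scores are only one point apart, and a gap that small on a single train/test split may be down to chance. One common way to get a steadier comparison is k-fold cross-validation. The sketch below compares the two ensembles with cross_val_score on synthetic data (an assumption made so the example is self-contained); in the project you would pass X and y instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validation: every row is used for testing exactly once.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the means is smaller than the standard deviations, the two models are hard to tell apart on this data.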

13. Here we train the Support Vector Classifier on the training set and measure its accuracy on the testing set. The accuracy of this algorithm comes out to be 66%.

# train a support vector classifier and measure its accuracy on the test set
model = SVC(gamma='auto')
clf = model.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
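One plausible reason for the low Support Vector Classifier score is that we feed it unscaled features. The RBF kernel works on distances, so columns with large values can dominate the rest, while tree-based models are largely unaffected. We did not test this on the project data, so treat it as a hypothesis. The sketch below uses synthetic data whose columns are deliberately put on very different scales (our own assumption) and compares a plain SVC with one wrapped in a StandardScaler pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
# Put the columns on very different scales, as raw measurements often are.
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Plain SVC on the raw features, as in step 13.
raw_svc = SVC(gamma="auto").fit(X_tr, y_tr)
raw_acc = accuracy_score(y_te, raw_svc.predict(X_te))

# Same SVC, but every feature is standardized first. The scaler is fitted
# on the training data only, so nothing leaks from the test set.
scaled_svc = make_pipeline(StandardScaler(), SVC(gamma="auto")).fit(X_tr, y_tr)
scaled_acc = accuracy_score(y_te, scaled_svc.predict(X_te))

print(f"raw features:    {raw_acc:.2f}")
print(f"scaled features: {scaled_acc:.2f}")
```

On data like this, the scaled pipeline usually scores noticeably higher, which suggests scaling is worth trying before ruling out the Support Vector Classifier.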

Conclusion

In this machine learning project, we built a diabetes prediction model. We compared the Random Forest Classifier, the Support Vector Classifier, and the Gradient Boosting Classifier, and the Random Forest Classifier gave the highest accuracy. We hope you have learned something new from this project.
