Evaluating the Performance of Machine Learning Models for Diabetes Prediction with Feature Selection and Missing Values Handling


Abstract


Diabetes is a growing health problem, and many people are at risk. Modern lifestyles have contributed to the spread of several lifestyle diseases. Obesity is one of them, and it leads to further conditions, diabetes among them. With the growth of Machine Learning (ML) techniques, researchers have in recent years been keen to apply them to diabetes prediction. For ML techniques to yield informative results, however, the dataset must first be pre-processed appropriately. This paper studies the effect of different techniques for pre-processing a diabetes dataset before applying ML classifiers to it. It evaluates the performance of two algorithms, Logistic Regression and Support Vector Classifier, in terms of Precision, Accuracy, Recall and F-measure. Compared with results reported in studies by other researchers, Logistic Regression performed better than all other classifiers, achieving the highest accuracy of 81% using only five features. The experiments were performed in Python 3.8 using Jupyter Notebook.
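As a point of reference for the four evaluation measures named above, the following is a minimal sketch of how they can be computed from binary predictions. The function name `classification_metrics` and the label lists are hypothetical and made up for illustration; they are not the paper's code, data or results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F-measure for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a denominator is empty.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up labels for illustration only (1 = diabetic, 0 = non-diabetic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Here F-measure is taken to be the harmonic mean of precision and recall (the F1 score), which is an assumption about the variant intended.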




Keywords


Driver smoking; feature selection; logistic regression; support vector classifier