dc.description.abstract |
Early detection of heart disease can help prevent its progression. Different risk factors are associated with heart disease prediction. This project focuses on multiple datasets in order to find the most valuable attributes and risk factors associated with heart disease. The first dataset, containing 14 attributes (including the target attribute) and 303 instances, is collected from the UCI Machine Learning Repository. The second, containing 10 attributes and 462 instances, is collected from the Kaggle repository. The third contains 12 attributes and 70,000 instances and is also available from the Kaggle repository. Seven different machine learning algorithms are applied to these three individual datasets to study the most influential attributes for heart disease prediction. One hybrid dataset is also generated using only the common attributes of two individual datasets. The scikit-learn library for the Python programming language is used for data analysis. A univariate feature selection algorithm is applied to find the most valuable attributes associated with heart disease. Heart disease is predicted using several machine learning algorithms, including support vector machine (SVM), decision tree, k-nearest neighbors (kNN), logistic regression, naïve Bayes, random forest, and majority voting. The training and testing portions of each dataset are separated using holdout and cross-validation methods. The parameters of each algorithm are varied to find the configuration that gives the highest accuracy. To evaluate the performance of the different algorithms, the classification report and confusion matrix are also calculated. 
It is shown that majority voting over logistic regression, SVM, and naïve Bayes achieves the best accuracy of 88.89% when applied to the first dataset. It is also shown that, for the hybrid dataset, the classification accuracy is lower than that of the individual datasets. Finally, the best result obtained in this project is compared with the results of existing similar research approaches. |
en_US |