Abstract:
This research applies machine learning algorithms to the prediction of cervical cancer. Early screening of vulnerable patients is essential for preventing cervical cancer, yet many developing countries lack the medical facilities needed for such screening. Hence, research on data-driven diagnosis of cervical cancer is needed. This thesis has two objectives: to introduce a dataset of cervical cancer patients whose attributes are suitable for Bangladeshi patients, and to classify the patients in the datasets using a new, efficient hybrid algorithm. First, an existing dataset from the University of California, Irvine (UCI) Machine Learning Repository, consisting of 36 attributes and 858 instances, is considered. To overcome the class imbalance in the data samples, the borderline Synthetic Minority Over-sampling Technique (Borderline-SMOTE) is used. Next, a new dataset of cervical cancer patients, collected from various hospitals in Bangladesh, is introduced. This new dataset consists of 21 attributes and 228 instances. The Recursive Feature Elimination (RFE) method is applied to both datasets to find the attributes that contribute most to cervical cancer. A number of classifiers, including base, ensemble, and hybrid algorithms, are then applied to the datasets. Finally, a two-stage hybrid algorithm is proposed in which ExtraTreeClassifier is used in the first stage and a stacking algorithm is used in the second stage. Of the classifiers applied, stacking as a combination of Random Forest, ExtraTreeClassifier, XGBoost, and Bagging yields the best classification accuracy on the first dataset, 95.3%, while AdaBoost yields the best classification accuracy on the second dataset, 95.6%. The proposed hybrid method achieves higher classification accuracies of 95.9% and 96.2% on the first and second datasets, respectively. Hence, the Bangladeshi dataset and the proposed hybrid algorithm can play an essential role in predicting cervical cancer.
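As a rough illustration of the feature-selection and stacking steps described above, the following sketch chains RFE with a stacked ensemble in scikit-learn. It is not the thesis implementation, and several details are assumptions: the data is synthetic (only its 858 x 36 shape mirrors the UCI dataset), `GradientBoostingClassifier` stands in for XGBoost, the ensemble `ExtraTreesClassifier` is assumed for "ExtraTreeClassifier", and the number of selected features, the RFE estimator, and the logistic-regression meta-learner are illustrative choices. Borderline-SMOTE is omitted to keep the example dependency-free; in practice it would be applied to the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in with the shape of the UCI dataset (858 x 36).
# This is NOT patient data; it only makes the sketch runnable.
X, y = make_classification(
    n_samples=858, n_features=36, n_informative=10,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base learners of the stacking ensemble.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    # Assumption: gradient boosting used here as a stand-in for XGBoost.
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("bag", BaggingClassifier(n_estimators=20, random_state=0)),
]

model = Pipeline([
    # RFE repeatedly drops the least important features (3 per round)
    # until 15 remain; 15 is an arbitrary illustrative choice.
    ("rfe", RFE(
        RandomForestClassifier(n_estimators=50, random_state=0),
        n_features_to_select=15, step=3,
    )),
    # Stacking: a meta-learner is trained on cross-validated predictions
    # of the base learners. The meta-learner choice is an assumption.
    ("stack", StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    )),
])

model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the data is synthetic, the printed accuracy says nothing about the figures reported in the abstract. On an imbalanced dataset such as this one, recall or F1 on the minority class would also be worth reporting alongside accuracy.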