Abstract:
Feature selection (FS), a crucial preprocessing step in machine learning, reduces the dimensionality of the data and improves model performance. The fundamental goal of FS is to choose an optimal subset of features by removing irrelevant and redundant features from the feature space. Feature weights reported in the literature indicate how important each feature is, but they do not guarantee a feature subset that classifies well. Feature interactions are complex: in seeking less redundant or purer features, valuable ones may be discarded, which can hinder data classification. A well-designed feature selection strategy is therefore essential. This research focuses on selecting features for medical data classification. In this work, a new ensemble FS method called PRG_Ensemble is proposed. It combines three FS methods to produce a stable and diverse subset of features. The primary goals of the ensemble FS method are to obtain an optimal subset of features and to overcome the shortcomings of any single FS method. The three filter FS approaches employed as base selectors are Pearson's correlation coefficient (PCC), ReliefF, and gain ratio (GR). Applied to a given dataset, these three approaches produce three distinct ranked lists in which each feature is ordered by importance or weight. The final subset of features is chosen using the average weight of each feature and the rank difference of that feature across the three ranked lists; these two quantities are used to eliminate unstable and less significant features from the feature space. Two well-known medical datasets, chronic kidney disease (CKD) and Lung Cancer, are used to evaluate the performance of the proposed technique. The data in both datasets are classified using logistic regression (LR).
The experimental results show that the proposed method achieves the highest accuracy, 99.25% for CKD and 93.5275% for Lung Cancer, compared with the three base FS methods on each dataset.
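
The selection rule described above (average weight plus rank difference across the three ranked lists) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the two criteria are combined, so the thresholds `min_avg_weight` and `max_rank_diff`, the function names, the definition of rank difference as the largest rank minus the smallest rank, and the toy weights are all assumptions made for illustration. The weights are assumed to be already scaled to a common range.

```python
from typing import Dict, List


def rank_features(weights: Dict[str, float]) -> Dict[str, int]:
    """Return 1-based ranks; rank 1 is the feature with the largest weight."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(weights, key=lambda f: (-weights[f], f))
    return {feature: pos for pos, feature in enumerate(ordered, start=1)}


def ensemble_select(
    weight_lists: List[Dict[str, float]],
    min_avg_weight: float,  # hypothetical threshold, not taken from the paper
    max_rank_diff: int,     # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Keep features that are both important on average and stably ranked."""
    rank_lists = [rank_features(w) for w in weight_lists]
    selected = []
    for feature in sorted(weight_lists[0]):
        avg_weight = sum(w[feature] for w in weight_lists) / len(weight_lists)
        ranks = [r[feature] for r in rank_lists]
        rank_diff = max(ranks) - min(ranks)  # assumed definition of rank difference
        if avg_weight >= min_avg_weight and rank_diff <= max_rank_diff:
            selected.append(feature)
    return selected


# Toy weights standing in for the outputs of PCC, ReliefF, and gain ratio.
pcc = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.1}
relieff = {"a": 0.8, "b": 0.7, "c": 0.5, "d": 0.2}
gain_ratio = {"a": 0.7, "b": 0.1, "c": 0.6, "d": 0.3}

selected = ensemble_select(
    [pcc, relieff, gain_ratio], min_avg_weight=0.4, max_rank_diff=1
)
print(selected)  # ['a', 'c']
```

In this toy example, feature `b` is dropped because its rank varies widely across the three lists and its average weight is low, while `d` is dropped for low average weight alone.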