Abstract:
Loan default risk, also known as credit risk, is a significant challenge for banks and financial institutions because it involves uncertainty about borrowers' ability to meet their contractual obligations. Banks and financial institutions rely on statistical and machine learning methods for loan default prediction to reduce potential losses on issued loans. However, these machine learning applications may not reach their full potential without the semantic context of the data.
A knowledge graph is a collection of linked entities and objects that includes semantic information to contextualize them. Knowledge graphs allow machines to incorporate human expertise into their decision-making and provide context for machine learning applications. A knowledge graph can semantically integrate diverse data and link knowledge from many areas without altering the original form of the data, enabling organizations to leverage the power of collective intelligence. Furthermore, knowledge graph embedding is now a widely adopted technique for representing knowledge. Such an embedding preserves the semantic information and structure of the original graph, so it can be a useful source of features for a subsequent machine learning classification task. A knowledge graph-based approach can therefore improve the performance and interpretability of a prediction model.
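As a concrete illustration of the ideas above, the following is a minimal sketch rather than the thesis implementation. It assumes a TransE-style translational embedding, which the abstract does not name, and every entity name, relation name, and vector value in it is hypothetical and hand-picked for illustration.

```python
import math

# Hypothetical knowledge graph triples (head, relation, tail).
triples = [
    ("loan_1", "hasBorrower", "borrower_A"),
    ("loan_1", "hasPurpose", "car_purchase"),
    ("borrower_A", "hasEmploymentStatus", "employed"),
]

# Hypothetical 2-dimensional embeddings; a real model would learn these.
entity_vec = {
    "loan_1": [0.0, 0.0],
    "borrower_A": [1.0, 0.0],
    "car_purchase": [0.0, 1.0],
    "employed": [1.0, 1.0],
}
relation_vec = {
    "hasBorrower": [1.0, 0.0],
    "hasPurpose": [0.0, 1.0],
    "hasEmploymentStatus": [0.0, 1.0],
}


def transe_distance(head, relation, tail):
    """Euclidean distance ||h + r - t||; a smaller value means a more plausible triple."""
    h = entity_vec[head]
    r = relation_vec[relation]
    t = entity_vec[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# A triple that is in the graph versus a corrupted one that is not.
true_score = transe_distance("loan_1", "hasBorrower", "borrower_A")
false_score = transe_distance("loan_1", "hasBorrower", "car_purchase")
```

With these hand-picked vectors, each triple in the graph has distance zero, while the corrupted triple has a larger distance. This is the sense in which an embedding can preserve graph structure: entities that are linked in the graph end up in geometrically related positions.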
In this thesis, we present a hybrid approach that combines a knowledge graph with machine learning to enhance the performance and interpretability of a loan default prediction model. To this end, we developed an ontology as the semantic data model. We then mapped a publicly available credit dataset onto this semantic data model to construct the knowledge graph. Next, we applied knowledge graph embedding (KGE) methods to capture the knowledge graph's semantic and structural content. Finally, we fed the vectors obtained from the graph embedding as features to machine learning classifiers to predict loan default. We evaluated several machine learning classifiers that have shown strong performance in credit default prediction, using accuracy, precision, recall, F1 score, Matthews correlation coefficient (MCC), and ROC AUC as evaluation metrics. The experimental results demonstrate that incorporating knowledge graph embeddings as features can boost the performance of conventional machine learning classifiers in predicting loan default risk. The “XGBoost + KGE” model performed best on all evaluation measures, with a ROC AUC of 0.836 (an increase of about 10.14% over the conventional technique).
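The feature-augmentation and evaluation steps described above can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the thesis code: the loan identifiers, feature values, embedding vectors, labels, and predictions are all hypothetical. The classifier training step (for example with XGBoost) is left out, as is ROC AUC, which needs predicted scores rather than hard labels.

```python
import math

# Hypothetical tabular features per loan (e.g. scaled amount and duration).
tabular = {"loan_1": [0.2, 0.5], "loan_2": [0.9, 0.1]}
# Hypothetical embedding vectors for the same loan entities.
embedding = {"loan_1": [0.1, -0.3, 0.7], "loan_2": [-0.4, 0.6, 0.0]}


def augment(loan_id):
    """Concatenate a loan's original features with its embedding vector."""
    return tabular[loan_id] + embedding[loan_id]


# Augmented feature matrix that a classifier would be trained on.
X = [augment(loan_id) for loan_id in sorted(tabular)]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical ground-truth labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Concatenation keeps the original features intact, so the same classifier can be trained with and without the embedding columns and the two runs compared on identical metrics.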