Abstract:
A data breach occurs when sensitive, protected, or confidential data is accessed without authorization. In today’s digital era, data has become one of an organization’s most significant assets. Any loss or misuse of data can cause serious damage to an organization, including significant reputational harm and financial loss. Data breaches pose a constant threat to all types of organizations. Regardless of the type of breach or the company affected, the consequences are equally severe. Over the years, various antivirus and security companies have published Data Breach Reports to help mitigate these impacts. These reports give security practitioners a data-driven, real-world view of the cybercrime incidents that commonly befall companies. What they lack, however, is a classification framework for analyzing these data breach incidents. This thesis aims to reduce the burden on antivirus and software companies by providing a classification framework for cyber security breaches. We train and evaluate our framework on a data set from Privacy Rights Clearinghouse covering 2008 to 2019. We train several machine learning algorithms on this data set, evaluate the framework using multiple performance metrics (e.g., accuracy, F1 score, confusion matrix), and identify the algorithm that gives the best result. We also use Latent Dirichlet Allocation (LDA) based topic modeling as a feature selector and find that it improves classification performance.