Abstract:
Frequent pattern mining is the task of finding the patterns in a transaction database, or support set, that occur more often than a specified threshold. Since its inception in the early 1990s, frequent pattern mining has been studied extensively and applied to a wide range of application domains, including consumer behavior analysis, web log mining, and gene expression profiling. While substantial research has produced a wide variety of frequent patterns, the evolution of existing application domains and the emergence of new ones call for new kinds of patterns that can reveal distinguishing characteristics of the underlying support set. For example, clickstream analysis seeks to segment users into meaningful clusters based on their click paths, which requires identifying the click sequences that contribute to user profiling. There are many such cases in which an analyst seeks patterns that reveal distinguishing characteristics of the underlying population. As far as the literature review shows, no recent work determines these characteristics. Many of them, however, can be identified using a newly proposed concept named endemism. If the constituent elements of a pattern are more likely to be found together and less likely to be found otherwise, this co-occurring tendency is referred to as endemism, and such a pattern is called an endemic pattern. This thesis introduces the concept of endemism to perform pattern-level grouping of records or users, which can provide valuable information about the underlying support set. It proposes two scoring strategies, Reluctancy Scoring and Affinity Scoring, to evaluate the endemism of frequent patterns.
This thesis also proposes three heuristics, TopK Selection, Optimized Search, and Random Selection, as alternatives to the costly Combinatorial Search method for the final grouping of the records. Experiments show that Reluctancy Scoring outperforms Affinity Scoring, and that Optimized Search provides the best results among the heuristics at a small cost in running time.