Abstract:
Document clustering is a popular tool for organizing a large collection of
documents. Clustering algorithms are usually applied to documents represented
as vectors in a high-dimensional term space. The two main problems
with this approach are accurately clustering correlated documents
and determining the proper number of clusters. The first problem is
being addressed in the current literature in different ways, including active clustering,
the partitional k-means algorithm, projection-based methods including
LSI, self-organizing maps, multidimensional scaling, graph-theoretic techniques,
and many more. As for the second problem, most clustering
approaches assume the number of clusters as a prerequisite quantity, as
in the case of Markov State Cluster, partitional methods, and most graph-theoretic
techniques. A few clustering algorithms that can automatically determine
the number of clusters have been analyzed. A popular approach
is based on an idea borrowed from Principal Component Analysis.
Another approach uses a self-refinement process of discriminative feature identification
and cluster label voting to converge to the optimal number of clusters.
In this work we have implemented an iterative solution with an inductive knowledge
base to achieve the optimal clustering. Both the inter-cluster distance
and the number of clusters are iteratively varied to achieve this optimization. This
new technique for determining the number of clusters and clustering documents
shows promising results, with 81% clustering accuracy. For classification,
we studied an unsupervised clustering technique together with the group
vector, which also minimizes the computational cost usually associated
with ordinary classification approaches. The outcome is comparable
to current practices and gives 78% classification accuracy.
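
As a rough illustration of what iteratively varying the inter-cluster distance and the number of clusters could look like, the sketch below merges document vectors until the closest pair of clusters is farther apart than a distance threshold, repeats this for several thresholds, and keeps the partition with the best separation-minus-cohesion score. Everything in it is an assumption made for illustration: the cosine distance, the centroid-based merging rule, the score, and all function names are hypothetical, and the inductive knowledge base mentioned in the abstract is not modeled. It is not the method of this work.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two term vectors (1.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_with_threshold(docs, threshold):
    """Merge the two closest clusters (by centroid distance) until the
    closest pair is farther apart than `threshold`. The number of
    clusters is therefore a result of the threshold, not an input."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        cents = [centroid([docs[k] for k in c]) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cosine_distance(cents[i], cents[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def score(docs, clusters):
    """Assumed quality measure: smallest centroid-to-centroid distance
    minus the mean distance of each document to its own centroid."""
    if len(clusters) < 2:
        return float("-inf")
    cents = [centroid([docs[i] for i in c]) for c in clusters]
    cohesion = sum(
        cosine_distance(docs[i], cents[k])
        for k, c in enumerate(clusters) for i in c
    ) / len(docs)
    separation = min(
        cosine_distance(cents[a], cents[b])
        for a in range(len(cents)) for b in range(a + 1, len(cents))
    )
    return separation - cohesion

def select_clustering(docs, thresholds):
    """Try each threshold and keep the best-scoring partition."""
    best = None
    for t in thresholds:
        clusters = cluster_with_threshold(docs, t)
        s = score(docs, clusters)
        if best is None or s > best[2]:
            best = (t, clusters, s)
    return best

# Toy term vectors (made-up data): two groups with disjoint vocabularies.
docs = [
    [2, 1, 0, 0], [3, 1, 0, 0], [2, 2, 0, 0],
    [0, 0, 1, 2], [0, 0, 2, 3], [0, 0, 1, 1],
]
threshold, clusters, quality = select_clustering(
    docs, [0.005, 0.02, 0.05, 0.2, 0.5, 1.5]
)
print(threshold, clusters, quality)
```

Here the number of clusters is never supplied directly; it follows from the threshold, and the score decides which threshold to keep. A real system would use weighted term vectors and a far cheaper merge loop than this cubic-time one.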