Abstract:
Of late, the advent of online social media has led to the inception of a new form of data stream
called multi-label data stream, where each stream record carries multiple class labels and requires a
classifier to associate multiple categories with each record. Data streams present several challenges that
any stream classification model has to deal with. Concept drift and infinite stream length under finite
memory and processing time are the challenges that have been addressed by the existing multi-label
data stream classification models in the literature. In real-world applications that generate data streams,
the amount of labeled data is usually very scarce compared to the entire stream. Moreover, with the
ever-changing nature of the Internet and social media, the emergence of new classes of data in the stream is a
common phenomenon. This phenomenon is known as concept evolution. When this emergence occurs
periodically for some classes of data, it is called class recurrence. None of the existing methodologies
address any of the issues of scarcity of labeled data, concept evolution and class recurrence.
This thesis proposes a layered ensemble-based classification framework (LEAD) for multi-label
data streams. The primary component of our LEAD framework is a two layer ensemble architecture.
The top layer of the ensemble architecture reflects the most recent concept of the data stream, whereas
the bottom layer represents the older concepts of the stream. As a result, the bottom layer enables
LEAD to classify recurrent class instances. Moreover, the layered approach also helps to differentiate
between recurrent and novel class instances, which significantly reduces the false alarm rate of novel
class instance identification. LEAD deploys a fuzzy novel class detection technique to identify the
emergence of novel concept(s) in the stream. The scarcity of labeled data is
handled by a deferred classification mechanism. By deferring classification, this mechanism allows more
labeled data to arrive in the stream, which may help to build a better-informed classifier. Experimental results show
clearly that LEAD exhibits better performance than the baseline methods.
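
The two-layer routing idea described above can be illustrated with a minimal sketch. This is not the LEAD implementation: the abstract does not specify the base classifiers, the fuzzy novel class detector, or the deferral rule, so the sketch assumes a toy model in which each ensemble member is a centroid with a fixed radius and a label set. The names ConceptModel, TwoLayerEnsemble, covers, and classify are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class ConceptModel:
    """Hypothetical ensemble member: a region (centroid + radius) and its label set."""
    centroid: tuple
    radius: float
    labels: frozenset

    def covers(self, x):
        # Assumption: Euclidean distance with a hard threshold; LEAD itself
        # uses a fuzzy detection technique that is not reproduced here.
        return dist(self.centroid, x) <= self.radius


@dataclass
class TwoLayerEnsemble:
    top: list = field(default_factory=list)     # models of the most recent concept
    bottom: list = field(default_factory=list)  # models of older concepts

    def classify(self, x):
        # Consult the top layer first, then fall back to the bottom layer.
        for layer, kind in ((self.top, "current"), (self.bottom, "recurrent")):
            labels = set()
            for model in layer:
                if model.covers(x):
                    labels |= model.labels  # multi-label: union over matching models
            if labels:
                return kind, labels
        # Covered by neither layer: a candidate novel class instance.
        return "novel-candidate", set()


ens = TwoLayerEnsemble(
    top=[ConceptModel((0.0, 0.0), 1.0, frozenset({"sports", "news"}))],
    bottom=[ConceptModel((5.0, 5.0), 1.0, frozenset({"elections"}))],
)
```

In this toy setting, an instance that falls outside the top layer but inside the bottom layer is reported as recurrent rather than novel, which is the intuition behind the reduced false alarm rate mentioned above.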