Abstract:
The way sensor data is collected will be transformed once the emerging
technology of distributed sensor networks becomes fully functional and widely available.
Distributed sensor networks are indeed an attractive technology, but the program/stack
memory and the battery life of today's nodes do not permit complex data mining at
run time. Effective data mining can instead be implemented on the central base station, where the
computational power is not generally constrained. Today's real-world databases are
highly susceptible to noisy, missing and inconsistent data because of their typically huge
size and their likely origin from multiple, heterogeneous sources. Low-quality data will
lead to low-quality mining results.
There are many possible reasons for noisy data (having incorrect attribute values). The
data collection sensor nodes used may be faulty. Errors in data transmission can also
occur. There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption. Incorrect data may also result from
inconsistencies in naming conventions or data codes used or inconsistent formats for
input fields. Duplicate tuples also require data cleaning.
Preprocessing is required to handle noisy, missing and inconsistent data so that
Wireless Sensor Network (WSN) data can be mined efficiently. A number of research works
have addressed mining WSN data, but we have found none on preprocessing
WSN data for efficient query processing. In this project, we have
evaluated a number of statistical techniques to handle missing data. Among these
techniques, the mean-before-after method was found most suitable for handling missing data.
We have implemented the Approximate Duplicate Record Detection method to remove
duplicate records from a dataset.
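
As an illustration of the missing-data step, the following is a minimal sketch of mean-before-after imputation. It assumes the usual reading of the term: a missing value is replaced by the mean of the nearest observed reading before the gap and the nearest observed reading after it. The function name `mean_before_after`, the use of `None` to mark a missing reading, and the fallback for gaps at the start or end of a series are assumptions made for this example, not details taken from the project.

```python
from typing import List, Optional


def mean_before_after(readings: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing reading (None) with the mean of the nearest
    observed values before and after it (hypothetical helper)."""
    filled = list(readings)
    n = len(readings)
    for i, value in enumerate(readings):
        if value is not None:
            continue
        # Nearest observed value before the gap, if any.
        before = next(
            (readings[j] for j in range(i - 1, -1, -1) if readings[j] is not None),
            None,
        )
        # Nearest observed value after the gap, if any.
        after = next(
            (readings[j] for j in range(i + 1, n) if readings[j] is not None),
            None,
        )
        if before is not None and after is not None:
            filled[i] = (before + after) / 2.0
        elif before is not None:
            # Assumption: a gap at the end of the series copies the last reading.
            filled[i] = before
        elif after is not None:
            # Assumption: a gap at the start of the series copies the first reading.
            filled[i] = after
    return filled


temperatures = [21.0, None, 23.0, None, None, 26.0]
print(mean_before_after(temperatures))
# [21.0, 22.0, 23.0, 24.5, 24.5, 26.0]
```

Both values in the two-reading gap receive the same estimate (24.5), because the sketch always looks at the original observed neighbours rather than at values it has already filled in.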
We have used some WSN datasets available on the Internet for our experiments. The K-means
algorithm has been applied to cluster the dataset. The cleaned and clustered
dataset has shown better query-processing performance than the dirty, non-clustered
data.
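
The clustering step can be sketched as a plain K-means (Lloyd's algorithm) loop. This is an illustrative sketch only: the abstract does not state which attributes were clustered, the value of k, or how the centroids were initialised, so the (temperature, humidity) pairs, `k=2`, and the deterministic "first k points" initialisation below are all assumptions, and the names `kmeans` and `squared_distance` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Cluster points into k groups; returns (centroids, labels)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: initialise centroids with the first k points so the
    # example is deterministic; real implementations usually randomise.
    centroids = [tuple(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments are stable, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical (temperature, humidity) readings from sensor nodes.
readings = [(20.0, 40.0), (30.0, 70.0), (20.5, 41.0), (30.5, 69.0)]
centroids, labels = kmeans(readings, k=2)
print(centroids)  # [(20.25, 40.5), (30.25, 69.5)]
print(labels)     # [0, 1, 0, 1]
```

Once records carry a cluster label, a query for readings near a given value only needs to scan the cluster whose centroid is closest, which is one plausible reason clustered data can answer queries faster than unclustered data.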