Abstract:
Businesses across all industries, academic institutions, and research organizations are gathering and storing more and more unstructured data every day. Unstructured data is constantly generated through call center logs, emails, web documents, blogs, tweets, customer comments, customer reviews, and so on. Unstructured data takes the lion’s share of the digital space, occupying approximately 80% by volume compared to only 20% for structured data. Until recently, technology had not evolved to support doing much with it beyond storing it or analyzing it manually. While the amount of unstructured data is increasing rapidly, it is becoming harder for businesses to summarize, understand, and make sense of such data in order to make better business decisions. Yet organizations urgently need to process and exploit unstructured data to gain a competitive edge. Some big data tools, primarily those based on Hadoop and MapReduce, are designed from the ground up to manage and analyze unstructured information. In this project, an attempt is made to determine the scalability of a Hadoop cluster on large volumes of unstructured textual data, using word count as the workload. For this purpose, a Hadoop cluster is established in a real environment and sample data sets are analyzed using the cluster. It is found that increasing the Hadoop cluster size significantly reduces the processing time needed to produce the desired output. The results show that performance, measured as task completion time, improves as the cluster size increases.
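For readers unfamiliar with the word count workload, the following is a minimal in-process sketch of its map, shuffle, and reduce phases. It is not the code run on the cluster in this project: it uses plain Python rather than the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper (illustrative): emit a (word, 1) pair per whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle (illustrative): group the emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reducer (illustrative): sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases sequentially on an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the project's data sets.
    sample = ["big data big cluster", "data on the cluster"]
    print(word_count(sample))
```

On a real Hadoop cluster the map and reduce steps would run in parallel across nodes, which is where the dependence of completion time on cluster size comes from; this sketch runs them one after another on a single machine.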