Abstract:
Performance is a critical concern when reading and writing data in a Big Data warehouse that stores billions of records. Many researchers have proposed techniques for improving query execution performance in distributed Big Data systems, such as indexing, caching, filtering, MapReduce, query execution planning, and data partitioning. In this thesis, we identify two further opportunities for improving query performance. The first is to improve the performance of lookup queries after data deletion in Big Data systems that use the eventual consistency model. We propose a scheme that replaces the Bloom filter with the Cuckoo filter, a probabilistic data structure that, unlike the Bloom filter, supports deletion of elements. The second is to avoid unnecessary network round trips when a query would be executed on remote nodes in a Big Data cluster that are known not to hold the requested data partition. We propose a scheme in which probabilistic filters are consulted before query execution is delegated to remote nodes, so that queries that would return no data are never sent over the network. We evaluate our schemes on a popular Big Data database (Cassandra) and show that each scheme can improve the performance of lookup queries by up to 100%. We also show that the proposed schemes do not degrade the performance of other data manipulation queries as a side effect.
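To make the two ideas concrete, the following is a minimal sketch of a Cuckoo filter with deletion support, together with a pre-delegation check that skips nodes whose filter rules out the partition. It is an illustration only, not the implementation evaluated in the thesis and not Cassandra code: the class `CuckooFilter`, the helper `nodes_to_query`, the bucket and fingerprint sizes, and the assumption that the coordinator holds one filter per remote node are all hypothetical.

```python
import hashlib
import random


class CuckooFilter:
    """Toy Cuckoo filter (hypothetical sketch): stores short fingerprints
    in buckets and, unlike a Bloom filter, supports deletion."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # A power-of-two bucket count keeps the XOR-based alternate index
        # in range and makes it its own inverse.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, key: str) -> int:
        # 16-bit fingerprint in the range 1..65535 (never zero).
        return self._hash(b"fp:" + key.encode()) % 0xFFFF + 1

    def _index(self, key: str) -> int:
        return self._hash(key.encode()) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the other candidate bucket depends
        # only on the current bucket and the fingerprint.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def insert(self, key: str) -> bool:
        fp = self._fingerprint(key)
        i1 = self._index(key)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and move it
        # to its alternate bucket, repeating up to max_kicks times.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Filter is considered full; in this toy version the last evicted
        # fingerprint is dropped.
        return False

    def contains(self, key: str) -> bool:
        fp = self._fingerprint(key)
        i1 = self._index(key)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, key: str) -> bool:
        # Only safe for keys that were actually inserted.
        fp = self._fingerprint(key)
        i1 = self._index(key)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


def nodes_to_query(partition_key, node_filters):
    """Return only the nodes whose filter says the partition may exist.

    node_filters is a hypothetical mapping of node name to CuckooFilter.
    """
    return [node for node, f in node_filters.items() if f.contains(partition_key)]
```

A negative answer from a filter is definite, so a skipped node cannot hold the partition; a positive answer may be a false positive, in which case the query is still sent and simply returns no data.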