Abstract:
As multi-core hardware becomes the norm, exploiting parallelism has become crucial for application performance. Python, known for its ease of development and wide adoption, is difficult to parallelize because of the Global Interpreter Lock (GIL), which pushes developers toward workarounds such as compiled languages or specialized libraries like Ray, Dask, and Joblib. Existing automatic parallelization tools focus primarily on libraries such as NumPy and fall short in handling Python’s advanced features and dynamic nature. This thesis aims to bridge this gap by developing a methodology for dynamically analyzing Python programs to identify dependencies and generate parallelization recommendations that can match the performance of expert-written code.
Our proposed solution has three key components. First, we built a custom tracer that logs execution time, memory footprint, and variable interactions as the program executes. This data is used to construct a detailed dependency graph, which guides the parallelization recommendation process. Second, we developed a novel heuristic that recommends which statements to parallelize and how to parallelize them. Third, we developed a program rewriter that applies these recommendations to the original source code while preserving the program’s behavior.
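To make the tracing and dependency-graph idea concrete, the following is a minimal sketch and not the implementation developed in this thesis. It assumes straight-line code with one statement per line, uses Python's standard `sys.settrace` hook to record which lines execute and when, and takes read and write sets from the `ast` module. The names `SOURCE`, `read_write_sets`, `trace`, `dependency_graph`, and `parallel_levels` are hypothetical. Only read-after-write dependencies on plain variable names are tracked. Memory footprint, loops, aliasing, and attribute mutation are out of scope for this sketch.

```python
import ast
import sys
import time
from collections import defaultdict

# Hypothetical input program, used only for illustration.
SOURCE = "a = 2\nb = 3\nc = a * a\nd = b + 1\ne = c + d\n"


def read_write_sets(source):
    """Map each top-level statement's line number to (names read, names written)."""
    sets = {}
    for stmt in ast.parse(source).body:
        reads, writes = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                target = writes if isinstance(node.ctx, ast.Store) else reads
                target.add(node.id)
        sets[stmt.lineno] = (reads, writes)
    return sets


def trace(source, filename="<traced>"):
    """Run `source` and record (event, line number, timestamp) for its own frame."""
    events = []

    def tracer(frame, event, arg):
        if frame.f_code.co_filename != filename:
            return None  # ignore frames that do not belong to the traced program
        if event in ("line", "return"):
            events.append((event, frame.f_lineno, time.perf_counter()))
        return tracer

    code = compile(source, filename, "exec")
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(None)
    return events


def dependency_graph(events, sets):
    """For each executed line, collect the lines that last wrote the names it reads."""
    deps = defaultdict(set)
    last_writer = {}
    for event, lineno, _ in events:
        if event != "line" or lineno not in sets:
            continue
        reads, writes = sets[lineno]
        deps[lineno]  # make sure the node exists even if it has no dependencies
        for name in reads:
            if name in last_writer:
                deps[lineno].add(last_writer[name])
        for name in writes:
            last_writer[name] = lineno
    return dict(deps)


def parallel_levels(deps):
    """Assign each line a level; lines on the same level do not depend on each other."""
    levels = {}
    for lineno in sorted(deps):  # valid for straight-line code only
        levels[lineno] = 1 + max((levels[d] for d in deps[lineno]), default=-1)
    return levels


deps = dependency_graph(trace(SOURCE), read_write_sets(SOURCE))
levels = parallel_levels(deps)
```

For the sample program, lines 3 and 4 land on the same level because neither reads a name the other writes, so a simple level-based heuristic would flag them as candidates to run concurrently. The timestamps in the recorded events could be differenced to estimate per-line cost, which helps decide whether parallelizing a statement is worth the overhead.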
We evaluated the system on a variety of real-world programs sourced from platforms like Kaggle. The evaluation demonstrated that the system can construct accurate dependency graphs even for complex codebases and generate parallelization recommendations that significantly reduce execution time. The results showed that our approach preserves the original logic and structure of the programs while improving their performance through effective use of parallel computing capabilities.
In conclusion, this thesis presents a novel approach to addressing Python’s parallelization challenges, offering a tool that dynamically analyzes and optimizes Python programs. Our work provides a valuable contribution to the field of parallel computing, particularly for developers seeking to improve the performance of Python applications in an increasingly multi-core world.