Dynamic parallelization in distributed join optimization
Abstract
Selecting an appropriate level of parallelism is critical in distributed join optimization for the efficient execution of data-intensive query workloads. Different join strategies use different communication patterns, so the level of parallelism must be determined separately for each strategy. Although most cost-based join optimizers have become better at modeling local computation and network communication costs, they still struggle to determine parallelism levels tailored to individual join strategies. In addition, the impact of the data replication level on join performance is often overlooked, despite its critical role in improving data availability during execution. We propose a cost model for optimizing join queries across various distributed join methods in data-intensive processing environments. In our approach, the optimal level of parallelism for each join strategy is determined individually from key factors such as data size, replication level, and query complexity. The strategy with the lowest cost under its own optimal parallelism configuration is then selected for execution. Our proposed method, DPJoin (Dynamic Parallelization Based Join Query Optimization), uses a saturation-based approach to estimate the parallelism-dependent cost of each strategy and leverages adaptive runtime statistics to optimize the physical plan. Experimental results indicate that DPJoin delivers the best performance on analytical queries, reducing execution time by approximately 9% on average compared to the closest baseline strategy. In addition, DPJoin achieves up to 60% faster execution times on low- and medium-scale data under a full data replication setting.
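The selection principle described in the abstract (find the best parallelism for each strategy, then compare strategies at their own optima) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the two strategy labels, and the toy cost curves (a compute term that shrinks with parallelism plus a communication term that grows with it) are hypothetical and are not the DPJoin cost model. The sketch also uses a plain exhaustive search over parallelism levels rather than the saturation-based estimation the method relies on.

```python
from typing import Callable, Dict, Tuple

# A cost function maps a parallelism level p to an estimated cost.
CostFn = Callable[[int], float]


def best_parallelism(cost: CostFn, max_parallelism: int) -> Tuple[int, float]:
    """Return (p, cost(p)) for the cheapest p in 1..max_parallelism.

    Exhaustive search, used here only as a stand-in for a real estimator.
    """
    best_p = min(range(1, max_parallelism + 1), key=cost)
    return best_p, cost(best_p)


def choose_strategy(
    strategies: Dict[str, CostFn], max_parallelism: int
) -> Tuple[str, int, float]:
    """Pick the strategy whose cost at its own optimal parallelism is lowest."""
    best = None
    for name, cost in strategies.items():
        p, c = best_parallelism(cost, max_parallelism)
        if best is None or c < best[2]:
            best = (name, p, c)
    return best


# Hypothetical toy cost curves (not taken from the paper):
# work / p models parallel computation, k * p models communication overhead.
def shuffle_cost(p: int) -> float:
    return 1200.0 / p + 3.0 * p


def broadcast_cost(p: int) -> float:
    return 1000.0 / p + 10.0 * p


strategies = {"shuffle": shuffle_cost, "broadcast": broadcast_cost}

if __name__ == "__main__":
    name, p, c = choose_strategy(strategies, max_parallelism=64)
    print(name, p, c)
```

With these toy curves, comparing both strategies at one fixed parallelism of 4 would favor the broadcast curve, whereas comparing each at its own optimum favors the shuffle curve. This is the kind of difference that motivates determining parallelism per strategy.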