Cumulative Pipelines for Performance and Storage Optimization in Near Real-Time Processing
Authors:
Arjun Reddy Lingala
Abstract: Efficient processing of large-scale datasets is pivotal in data-driven industries, where batch processing frameworks are widely used to aggregate, analyze, and transform data. With traditional pipelines, performance suffers when an analysis must combine historical data with the latest data. In modern data warehousing implementations on HDFS, such an analysis can require querying thousands of partitions, each of which is a folder on HDFS [4]. Reading these folders is expensive in compute and I/O and degrades overall query performance, because even a small query that needs very little data must scan multiple partitions and filter their contents. This paper introduces Cumulative Pipelines for batch processing, a methodology that optimizes data workflows by incrementally maintaining and reusing intermediate results across iterative computations. Cumulative pipelines address redundant computation, resource inefficiency, and latency, which often hinder traditional near real-time processing pipelines. The paper explores the architectural design of Cumulative Pipelines, emphasizing their modular structure and their compatibility with contemporary batch processing platforms such as Apache Hadoop and Spark. The design uses complex data types such as Maps, Structs, and Arrays to preserve history, and it bounds the size of these columns by deleting older historical data so that performance does not degrade. The paper concludes by discussing the advantages and disadvantages of the approach, including the trade-off between pipeline compute cost and end-user query performance.
Keywords: cumulative pipelines, data warehousing, Hadoop, performance, storage efficiency, cube tables, backfill, distributed processing, retention
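
The idea summarized in the abstract can be illustrated with a minimal sketch. It assumes the cumulative table holds one row per key, with a Map-like column from date to value. Plain Python dicts stand in for that column, and the names `merge_cumulative` and `retention_days`, as well as the sample data, are hypothetical and not taken from the paper. Each run folds the newest daily partition into the previous cumulative snapshot and drops entries older than the retention window, so a later query reads one snapshot instead of many daily partitions.

```python
from datetime import date, timedelta


def merge_cumulative(previous, daily, run_date, retention_days):
    """Fold one day's partition into the previous cumulative snapshot.

    previous: dict key -> {date: value}, the last cumulative snapshot
    daily:    dict key -> value, the newest daily partition
    Returns a new snapshot; entries older than the retention window are dropped.
    """
    # Oldest date that is still kept (the window includes run_date itself).
    cutoff = run_date - timedelta(days=retention_days - 1)
    merged = {}
    for key in previous.keys() | daily.keys():
        history = dict(previous.get(key, {}))  # copy, leave the input untouched
        if key in daily:
            history[run_date] = daily[key]
        # Retention: bound the size of the map column.
        history = {d: v for d, v in history.items() if d >= cutoff}
        if history:
            merged[key] = history
    return merged


# Hypothetical daily partitions: (partition date, {key: value}).
days = [
    (date(2024, 1, 1), {"u1": 3, "u2": 1}),
    (date(2024, 1, 2), {"u1": 2}),
    (date(2024, 1, 3), {"u2": 5}),
]

snapshot = {}
for run_date, daily in days:
    snapshot = merge_cumulative(snapshot, daily, run_date, retention_days=2)

# A trailing-window aggregate now reads only the latest snapshot.
trailing_total = {key: sum(history.values()) for key, history in snapshot.items()}
```

In this sketch the pipeline does a little more work on every run (one merge per day), and the reader of the table does less, which is the trade-off between compute cost and query performance that the abstract refers to.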