International Scientific Journal of Engineering and Management

An International Scholarly || Multidisciplinary || Open Access || Indexing in all major Database & Metadata
The journal follows the UGC Guidelines and is evaluated for inclusion in the Web of Science
ISSN: 2583-6129

Impact Factor: 7.839

Cumulative Pipelines for Performance and Storage Optimization in Near Real-Time Processing


Authors:

Arjun Reddy Lingala

arjunreddy.lingala@gmail.com

 

Abstract—The efficient processing of large-scale datasets is pivotal in data-driven industries, where batch processing frameworks are widely used to aggregate, analyze, and transform data. With traditional pipelines, performance suffers when an analysis must combine historical data with the latest data. In modern data warehousing implementations on HDFS, such an analysis has to query thousands of partitions, each of which is a folder on HDFS [4]. Reading HDFS folders incurs substantial compute and I/O cost and degrades overall query performance. Even small queries that need very little data must scan multiple partitions and filter the results. This paper introduces the concept of Cumulative Pipelines in batch processing, a methodology designed to optimize data workflows by incrementally maintaining and reusing intermediate results across iterative computations. Cumulative data pipelines aim to address critical challenges such as redundant computation, resource inefficiency, and latency, which often hinder traditional near real-time processing pipelines. This paper explores the architectural design of Cumulative Pipelines, emphasizing their modular structure and compatibility with contemporary batch processing platforms such as Apache Hadoop and Spark. The design uses complex data types such as Maps, Structs, and Arrays to preserve history, and it bounds the size of each column by deleting historical data that falls outside the retention period, which keeps performance stable. The paper concludes by discussing the advantages and disadvantages of this approach, including the trade-off between pipeline compute cost and end-user query performance.

Keywords—cumulative pipelines, data warehousing, hadoop, performance, storage efficiency, cube tables, backfill, distributed processing, retention
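
The idea summarized in the abstract can be illustrated with a minimal sketch in plain Python. It assumes that each run folds one day of aggregates into a single cumulative snapshot, that history is kept per key in a date-keyed map, and that entries older than a retention window are dropped. The function names (`merge_daily`, `total_over_window`) and the `retention_days` parameter are hypothetical and are not taken from the paper; a production version would express the same merge in Spark or Hive over map-typed columns.

```python
from datetime import date, timedelta


def merge_daily(cumulative, daily, run_date, retention_days):
    """Fold one day's aggregates into the cumulative snapshot.

    cumulative: {key: {iso_date: value}}  history map per key (yesterday's snapshot)
    daily:      {key: value}              aggregates computed for run_date only
    Returns a new snapshot; the inputs are not modified.
    """
    # Oldest date that is still inside the retention window.
    cutoff = run_date - timedelta(days=retention_days - 1)
    merged = {}
    for key in cumulative.keys() | daily.keys():
        history = dict(cumulative.get(key, {}))
        if key in daily:
            history[run_date.isoformat()] = daily[key]
        # Retention: drop entries older than the cutoff to bound the map size.
        history = {
            day: value
            for day, value in history.items()
            if date.fromisoformat(day) >= cutoff
        }
        if history:
            merged[key] = history
    return merged


def total_over_window(snapshot, key, end_date, days):
    """Answer a multi-day query from the latest snapshot alone."""
    start = end_date - timedelta(days=days - 1)
    return sum(
        value
        for day, value in snapshot.get(key, {}).items()
        if start <= date.fromisoformat(day) <= end_date
    )


# Three daily runs with a two-day retention window (made-up sample data).
snapshot = {}
runs = [
    (date(2024, 1, 1), {"u1": 3, "u2": 1}),
    (date(2024, 1, 2), {"u1": 2}),
    (date(2024, 1, 3), {"u2": 5}),
]
for run_date, daily in runs:
    snapshot = merge_daily(snapshot, daily, run_date, retention_days=2)

print(snapshot)
print(total_over_window(snapshot, "u1", date(2024, 1, 3), days=2))
```

In this sketch a query over the last two days reads only the most recent snapshot instead of one partition per day, which is the read-side saving the abstract describes. The cost moves to the write side, since every run rewrites the cumulative snapshot.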

 
