Performance Analysis of MapReduce Algorithms in Hadoop Clusters

Arjun Nair

Independent Researcher

India

Abstract

This manuscript analyzes the performance of MapReduce algorithms deployed on Hadoop clusters as of 2015, focusing on three metrics: job completion time, resource utilization, and scalability. Three representative MapReduce workloads (word count, sort, and graph processing) are evaluated on clusters of 4, 8, and 16 nodes with varying hardware configurations. The study identifies bottlenecks related to data skew, network I/O, and the allocation of map and reduce slots. Results show that tuning parameters such as block size and the number of mappers and reducers can reduce execution time by up to 35% without hardware changes. Key recommendations are to profile workloads carefully and to optimize configuration in order to maximize throughput and minimize latency.
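As a rough illustration of why block size and slot counts interact, the following minimal sketch estimates how many map tasks a job would launch and how many scheduling "waves" they would need. It assumes one input split per HDFS block, and the 100 GB input size and 4 map slots per node are hypothetical values chosen for illustration; the function names are likewise illustrative and do not reproduce the measurement method of the study.

```python
MB = 1024 * 1024
GB = 1024 * MB


def estimate_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """Approximate map task count, assuming one input split per HDFS block."""
    if input_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still needs a task.
    return (input_bytes + block_bytes - 1) // block_bytes


def estimate_map_waves(map_tasks: int, nodes: int, slots_per_node: int) -> int:
    """Number of scheduling waves needed to run all map tasks on the cluster."""
    total_slots = nodes * slots_per_node
    return (map_tasks + total_slots - 1) // total_slots


if __name__ == "__main__":
    input_size = 100 * GB  # hypothetical input size
    slots = 4              # hypothetical map slots per node
    for block_mb in (64, 128, 256):
        tasks = estimate_map_tasks(input_size, block_mb * MB)
        for nodes in (4, 8, 16):  # cluster sizes named in the abstract
            waves = estimate_map_waves(tasks, nodes, slots)
            print(f"block={block_mb}MB nodes={nodes} tasks={tasks} waves={waves}")
```

Under these assumptions, larger blocks mean fewer and longer map tasks, while more nodes mean fewer waves; the best trade-off for a real workload would still have to be found by profiling.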

Keywords

MapReduce, Hadoop Clusters, Performance Analysis, Distributed Systems, Big Data
