Diya Verma
Independent Researcher
India
Abstract
The exponential growth of textual data across diverse domains necessitates efficient large-scale text mining techniques. MapReduce, a programming model popularized by Google and implemented by open-source frameworks such as Apache Hadoop, has become a prominent solution for distributed processing of massive datasets. This paper evaluates the performance and suitability of the MapReduce framework for large-scale text mining applications, focusing on scalability, fault tolerance, and processing efficiency. Several text mining tasks, including tokenization, frequency analysis, and pattern detection, were implemented and tested on large datasets. Results indicate that MapReduce significantly reduces processing time and handles large volumes of data effectively, despite limitations in iterative processing and real-time analysis. The study concludes that MapReduce remains a strong candidate for batch-oriented large-scale text mining tasks, while hybrid approaches that integrate other frameworks may be required for more dynamic or iterative applications.
Keywords
MapReduce, Text Mining, Large-Scale Data Processing, Hadoop, Distributed Computing, Scalability, Big Data
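As a rough illustration of how the tokenization and frequency-analysis tasks named in the abstract map onto the MapReduce model, the following single-process Python sketch imitates the map, shuffle, and reduce phases. It is not the implementation evaluated in this paper: the function names (`map_tokens`, `shuffle`, `reduce_counts`, `word_count`) and the simple lowercase alphanumeric tokenizer are assumptions made for illustration, and a real Hadoop job would distribute these phases across cluster nodes.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed tokenizer: lowercase runs of letters and digits (illustrative only).
TOKEN_RE = re.compile(r"[a-z0-9]+")


def map_tokens(document: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (token, 1) pair for every token in one document."""
    for token in TOKEN_RE.findall(document.lower()):
        yield token, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the counts collected for one token."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory corpus."""
    pairs = (pair for doc in documents for pair in map_tokens(doc))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    corpus = ["MapReduce scales text mining.", "Text mining with MapReduce."]
    print(word_count(corpus))
```

Because each call to `map_tokens` depends only on its own document and each call to `reduce_counts` depends only on one key, both phases can in principle run in parallel, which is the property the framework exploits for batch workloads.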
REFERENCES
- Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113.
This seminal paper introduces the MapReduce programming model and its Google-scale implementation, demonstrating how it transparently handles parallelization, machine failures, and I/O scheduling on terabyte-scale datasets.
- Lin, J., & Dyer, C. (2010). Data-Intensive Text Processing with MapReduce. Morgan & Claypool.
A comprehensive, book-length treatment of how common text-mining algorithms (e.g., inverted indexing, n-gram extraction, language modeling, topic modeling) can be expressed and scaled via MapReduce.
- Ji, Y., Tian, Y., Shen, F., & Tran, J. (2016). Experimental Evaluations of MapReduce in Biomedical Text Mining. In Information Technology: New Generations (Vol. 448, pp. 665–675). Springer.
Presents two case studies (literature search and association mining) run on Amazon EMR, showing that computationally heavier tasks scale better than low-compute ones due to MapReduce overheads (JVM startup, scheduling, disk I/O).
- Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel Data Processing with MapReduce: A Survey. ACM SIGMOD Record, 40(4), 11–20.
Surveys a broad spectrum of MapReduce-based approaches, performance metrics, datasets, and open challenges across big-data applications, including text analytics pipelines.
- Doulkeridis, C., & Nørvåg, K. (2014). A Survey of Large-Scale Analytical Query Processing in MapReduce. The VLDB Journal, 23(3), 355–380.
Reviews query workloads (including text queries) on Hadoop, Hive, and Pig, and summarizes performance benchmarks, optimizations, and system trade-offs.
- Çatak, F. Ö., & Balaban, M. E. (2013). A MapReduce-Based Distributed SVM Algorithm for Binary Classification. arXiv preprint arXiv:1312.4108.
Describes how to train SVM classifiers on very large text feature sets by splitting and iteratively merging support vectors across a Hadoop cluster, with empirical accuracy and runtime results.
- Kolb, L., Thor, A., & Rahm, E. (2011). Load Balancing for MapReduce-Based Entity Resolution. arXiv preprint arXiv:1108.1631.
Although focused on entity resolution, this paper's methods for skew handling and data redistribution are directly applicable to large-scale text-mining tasks where key distributions are uneven.
- Heintz, B., Chandra, A., & Sitaraman, R. K. (2012). Optimizing MapReduce for Highly Distributed Environments. arXiv preprint arXiv:1207.7055.
Proposes end-to-end, model-driven optimizations for geographically distributed MapReduce jobs (e.g., text indexing across data centers), achieving up to 41% runtime reduction over vanilla Hadoop.
- Dolev, S., Florissi, P., Gudes, E., Sharma, S., & Singer, I. (2017). A Survey on Geographically Distributed Big-Data Processing using MapReduce. arXiv preprint arXiv:1707.01869.
Classifies and compares batch (MapReduce), stream (Spark), and SQL-style geo-distributed frameworks, highlighting where MapReduce still fits text-mining pipelines that span data centers.
- Zhang, Y., Chen, S., Wang, Q., & Yu, G. (2015). i2MapReduce: Incremental MapReduce for Mining Evolving Big Data. arXiv preprint arXiv:1501.04854.
Introduces fine-grained, key-value-level incremental processing for iterative text-mining algorithms (e.g., topic modeling), reducing I/O overhead when new documents arrive.