Published Paper PDF: https://ijrmeet.org/wp-content/uploads/2025/06/IJRMEET0625440051_Automating%20Metadata%20Lineage%20in%20Distributed%20ETL%20Pipelines%20Using%20Informatica.pdf
DOI: https://doi.org/10.63345/ijrmeet.org.v13.i6.5
Akshit Kohli
ABESIT Engineering College
Crossings Republik, Ghaziabad, Uttar Pradesh 201009
Abstract
Automating metadata lineage in distributed Extract, Transform, Load (ETL) pipelines is essential for data governance, traceability, and regulatory compliance in modern enterprise environments. This manuscript examines the design, implementation, and evaluation of an automated metadata lineage framework for Informatica PowerCenter deployed across complex distributed ETL workflows. We propose a hybrid approach that combines Informatica’s native metadata APIs, event-driven streaming mechanisms, a decentralized harvesting layer, and a sharded centralized metadata repository. The methodology covers automated metadata harvesting, transformation mapping analysis, dynamic dependency graph generation, and interactive lineage visualization. To quantify efficacy, we measured lineage extraction accuracy and performance overhead across varying pipeline scales and report the results in a summary table. In parallel, simulations emulated distributed workloads under varying numbers of concurrent workflows, node failures, and network latency conditions to assess scalability, fault tolerance, and system throughput. The framework consistently achieves over 95% lineage capture accuracy, with an average runtime overhead of under 10% relative to baseline ETL executions. The decentralized harvester design reduces latency by up to 30% compared with centralized polling solutions, and the sharded repository architecture sustains lineage throughput beyond 1,000 events per second without significant degradation. Under simulated node failures, the recovery mechanisms preserved all lineage data, with automatic reconnection within two polling cycles.
These findings underscore the framework’s practicality for real-world deployments, enabling near real-time lineage tracking and more efficient impact analysis and audit readiness. Finally, we discuss integration strategies with CI/CD pipelines and outline potential enhancements, such as support for streaming ETL services and machine-learning-based anomaly detection in lineage graphs, to further strengthen enterprise data governance and operational transparency.
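The dependency graph generation and impact analysis mentioned above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the edge records, field names, mapping names, and both helper functions are hypothetical, and the harvested metadata is assumed to have already been flattened into (source field, target field, mapping) triples.

```python
from collections import defaultdict, deque

# Hypothetical harvested records: (source_field, target_field, mapping_name).
# In a real deployment these would come from the metadata harvesting layer.
HARVESTED_EDGES = [
    ("src.orders.amount", "stg.orders.amount", "m_load_orders"),
    ("stg.orders.amount", "dw.sales.revenue", "m_build_sales"),
    ("stg.orders.currency", "dw.sales.revenue", "m_build_sales"),
    ("dw.sales.revenue", "rpt.kpi.total_revenue", "m_kpi_rollup"),
]


def build_lineage_graph(edges):
    """Build an adjacency map: source field -> list of (target field, mapping)."""
    graph = defaultdict(list)
    for source, target, mapping in edges:
        graph[source].append((target, mapping))
    return graph


def downstream_impact(graph, start):
    """Return, sorted, every field reachable downstream of `start` (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # .get avoids creating empty entries for leaf fields in the defaultdict.
        for target, _mapping in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)


graph = build_lineage_graph(HARVESTED_EDGES)
impacted = downstream_impact(graph, "src.orders.amount")
print(impacted)
```

Under these assumptions, a change to `src.orders.amount` is reported as affecting the staging, warehouse, and reporting fields derived from it, which is the kind of query an impact analysis or audit would run against the lineage repository.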
Keywords
Metadata lineage; ETL automation; Informatica PowerCenter; distributed pipelines; data governance