Scalable Data Modeling Techniques for Cross-Cloud Analytics Systems

Published Paper PDF: https://ijrmeet.org/wp-content/uploads/2025/07/IJRMEET0725090017_Scalable%20Data%20Modeling%20Techniques%20for%20Cross-Cloud%20Analytics%20Systems.pdf

DOI: https://doi.org/10.63345/ijrmeet.org.v13.i7.2

Dr. Lalit Kumar

IILM University

Knowledge Park II, Greater Noida, Uttar Pradesh 201306, India

lalit4386@gmail.com

Abstract

Scalable data modeling techniques are fundamental to the success of modern cross-cloud analytics systems, enabling enterprises to harness massive, heterogeneous data sources distributed across multiple cloud platforms. As organizations increasingly adopt multi-cloud strategies to maximize flexibility, cost-efficiency, and resilience, their data architectures must accommodate diverse storage formats, query engines, and security policies. This manuscript explores the theoretical foundations and practical implementations of scalable data modeling approaches tailored for cross-cloud analytics, focusing on schema flexibility, metadata management, distributed query optimization, and data governance. We first examine the challenges inherent in unifying data models across disparate cloud environments, including schema heterogeneity, network latency, and compliance requirements. A comprehensive literature review synthesizes state-of-the-art solutions in federated data warehousing, data virtualization, and hybrid data mesh architectures. Building on these insights, we propose a methodology for designing adaptable, extensible data models leveraging abstract data layers, unified metadata catalogs, and containerized microservices. Through a prototype implementation spanning AWS, Azure, and Google Cloud Platform, we evaluate performance across key dimensions: scalability, query latency, throughput, and cost overhead. Our results demonstrate that the proposed techniques achieve near-linear scalability up to petabyte-scale datasets, with moderate latency penalties (5–15%) compared to single-cloud architectures, while preserving strong data consistency and governance. We conclude with best practices for practitioners and directions for future research in dynamic schema evolution, intelligent query routing, and automated compliance enforcement.

Keywords

cross-cloud analytics, scalable data modeling, federated schema, metadata management, data virtualization, data mesh, microservices
