DevConf.US 2020 is the 3rd annual, free, Red Hat sponsored technology conference for community project and professional contributors to Free and Open Source technologies coming to a web browser near you!
The value of public cloud services is clear for analytic workloads. Specialized hardware, such as GPUs for AI/ML, may make more sense to lease as Opex than to buy as Capex when the infrastructure would not be continually utilized. However, it may not make sense to build large data sets inside public clouds, due both to the cost multiple compared to building out and maintaining private infrastructure and to the lock-in nature of public cloud services. These drivers lead toward a hybrid architecture in which large data sets are built and maintained in private clouds, while compute/analytic clusters are spun up in public clouds to run the actual analytics on these data sets.
Maintaining a hybrid architecture as described introduces latency and bandwidth challenges between the private data lake and the public cloud compute cluster. In this presentation we describe research being done by Mass Open Cloud and Red Hat researchers to build caching solutions that maximize the throughput of these leased analytics clusters and avoid re-reading the same data from the external private data lake.
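To make the caching idea concrete, here is a minimal sketch of a read-through cache that sits next to the compute cluster and only goes back to the remote data lake on a miss. It is an illustration under stated assumptions, not the design presented in the talk: the class name `ReadThroughCache`, the `fetch` callable standing in for a remote object read, and the least-recently-used eviction policy are all hypothetical choices made for this example.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Byte-bounded LRU cache in front of a slow remote store (illustrative only)."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int) -> None:
        # `fetch` is a hypothetical stand-in for a read from the private data lake.
        self._fetch = fetch
        self._capacity = capacity_bytes
        self._used = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._entries:
            # Cache hit: mark as most recently used and skip the remote read.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]

        # Cache miss: pay the latency and bandwidth cost once.
        self.misses += 1
        data = self._fetch(key)

        # Objects larger than the whole cache are returned but not stored.
        if len(data) <= self._capacity:
            # Evict least recently used entries until the new object fits.
            while self._used + len(data) > self._capacity:
                _, evicted = self._entries.popitem(last=False)
                self._used -= len(evicted)
            self._entries[key] = data
            self._used += len(data)
        return data
```

With a cache like this, repeated reads of the same object by the analytics cluster are served locally, so the ratio of `hits` to `misses` gives a rough measure of how much traffic to the private data lake is avoided.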