Google Professional Data Engineer


Your company's CTO is worried about the high costs associated with running data pipelines, particularly large batch processing jobs. These jobs don't need to run on a strict schedule, and the CTO is open to longer completion times if it means reducing expenses. You're currently using Cloud Dataflow for most pipelines and want to minimize costs without extensive changes. What's your best recommendation?
Explanation:

The optimal solution is Cloud Dataflow Flexible Resource Scheduling (FlexRS). FlexRS reduces the cost of batch processing by using advanced scheduling techniques and a mix of preemptible and regular VM instances, trading flexible start times for lower prices, which fits the CTO's willingness to accept longer completion times. The alternatives fall short: Dataflow Shuffle can speed up batch job execution but doesn't necessarily cut costs; the Streaming Engine is tailored to stream processing, not batch workloads; and switching to a different Apache Beam runner, such as Apache Flink on Compute Engine, would add management complexity and require pipeline changes. For more details, see the Cloud Dataflow FlexRS documentation.
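To illustrate, enabling FlexRS typically requires only adding the `--flexrs_goal` pipeline option when submitting an existing batch job, which is why it minimizes costs without extensive changes. The sketch below assumes a hypothetical Python Beam pipeline (`my_batch_pipeline.py`) and placeholder project, region, and bucket names:

```shell
# Submit an existing batch pipeline with FlexRS enabled.
# --flexrs_goal=COST_OPTIMIZED tells Dataflow it may delay the job
# within a scheduling window in exchange for discounted, preemptible-backed
# resources. All other flags are unchanged from a normal Dataflow run.
python my_batch_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/temp \
  --flexrs_goal=COST_OPTIMIZED
```

Note that FlexRS applies only to batch jobs and is available in a subset of regions; Dataflow handles the preemptible/regular VM mix automatically, so no pipeline code changes are needed.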