Google Professional Data Engineer

You recently migrated a Hadoop job from your on-premises cluster to Google Cloud Dataproc and Google Cloud Storage (GCS). The job runs a complex set of Spark analytical tasks that involve numerous shuffle operations, and the input dataset consists of Parquet files ranging from 200 MB to 400 MB each. Since the migration, you have observed a decline in performance on Dataproc and want to optimize the job. Because your organization is highly cost-sensitive, you must keep the workload on Dataproc using preemptible VMs, with only two non-preemptible workers allocated to it. What steps should you take to achieve this optimization?
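
For context, the cluster shape described above (two non-preemptible primary workers plus a pool of preemptible secondary workers) could be provisioned roughly as in the following sketch, which uses the google-cloud-dataproc Python client. The project ID, region, cluster name, machine types, and secondary-worker count are placeholder assumptions, not values given in the question.

```python
from google.cloud import dataproc_v1

# Placeholder identifiers; substitute your own project and region.
project_id = "my-project"
region = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "spark-analytics-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # Only two non-preemptible primary workers, matching the cost constraint.
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # The bulk of compute capacity comes from preemptible secondary workers;
        # the count of 8 is illustrative only.
        "secondary_worker_config": {
            "num_instances": 8,
            "preemptibility": dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
        },
    },
}

# Cluster creation is a long-running operation; wait for it to complete.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")
```

Preemptible secondary workers keep costs low but can be reclaimed at any time, which is why shuffle-heavy Spark jobs are particularly sensitive to this configuration: lost workers mean lost shuffle data and recomputation.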