
Google Professional Data Engineer
You've recently migrated a Hadoop job from your on-premises cluster to Google Cloud Dataproc and Google Cloud Storage (GCS). The job runs a complex set of Spark analytical tasks that include numerous shuffle operations. The input dataset consists of Parquet files, each between 200 and 400 MB in size. Post-migration, you've noticed a decline in performance on Dataproc and want to optimize it. Because your organization is highly cost-sensitive, you intend to keep running on Dataproc with preemptible VMs and only two non-preemptible workers allocated to this workload. What steps should you take to achieve this optimization?
Exam-Like
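
For context, the scenario hinges on shuffle-heavy Spark jobs running on a cluster where most workers are preemptible. Below is a minimal sketch (not an answer key) of how such a cluster could be created with the google-cloud-dataproc Python client, using Dataproc Enhanced Flexibility Mode (EFM) so that Spark shuffle data is kept on the non-preemptible primary workers. The project ID, region, cluster name, machine types, disk sizes, and worker counts are illustrative assumptions, not values given in the question.

```python
# Sketch only: create a Dataproc cluster with preemptible secondary workers
# and Enhanced Flexibility Mode (EFM), which writes Spark shuffle data to the
# two non-preemptible primary workers. Names, region, machine types, and
# counts below are illustrative placeholders, not values from the question.
from google.cloud import dataproc_v1

project_id = "my-project"   # assumed placeholder
region = "us-central1"      # assumed placeholder

# Regional endpoint is required when creating clusters outside the global region.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "spark-analytics-efm",  # assumed placeholder
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
        },
        # Two non-preemptible primary workers, as the scenario requires.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",
            # Local SSDs give the primary workers fast scratch space for shuffle files.
            "disk_config": {"boot_disk_size_gb": 500, "num_local_ssds": 1},
        },
        # Cheaper preemptible secondary workers supply the extra compute.
        "secondary_worker_config": {
            "num_instances": 8,
            "preemptibility": dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
        },
        "software_config": {
            "properties": {
                # EFM "primary-worker" shuffle: shuffle output lives on the
                # non-preemptible workers, so preemption of a secondary worker
                # does not destroy shuffle data and force stage retries.
                "dataproc:efm.spark.shuffle": "primary-worker",
            }
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```

The design point under these assumptions is that shuffle data never resides on VMs that can be preempted mid-job, which is the usual cause of shuffle fetch failures and repeated stage retries on heavily preemptible Dataproc clusters.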