
You've recently migrated a Hadoop job from your on-premises cluster to Google Cloud Dataproc and Google Cloud Storage (GCS). The job runs a complex set of Spark analytical tasks that include numerous shuffle operations. The initial dataset consists of Parquet files, each between 200 MB and 400 MB in size. Since the migration, you've noticed a decline in performance on Dataproc and want to optimize it. Because your organization is highly cost-sensitive, you want to keep using Dataproc with preemptible VMs and only two non-preemptible workers allocated to this workload. What should you do to achieve this optimization?
A
Increase the size of your Parquet files so that each is at least 1 GB.
B
Switch to the TFRecord format (approximately 200 MB per file) instead of Parquet files.
C
Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D
Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
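
For context, options C and D both involve provisioning SSDs on the Dataproc workers. The sketch below shows what such a cluster configuration could look like with the google-cloud-dataproc Python client: two non-preemptible primary workers plus preemptible secondary workers, all with SSD boot disks sized for shuffle spill. The project, region, cluster name, machine type, and instance counts are hypothetical placeholders, and this is not meant to indicate which option is correct; verify field names against the current Dataproc API documentation.

```python
# A minimal sketch, assuming the google-cloud-dataproc client library
# (pip install google-cloud-dataproc). All identifiers below are
# hypothetical placeholders, not values taken from the question.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"            # hypothetical
REGION = "us-central1"               # hypothetical
CLUSTER_NAME = "spark-shuffle-demo"  # hypothetical

# Dataproc regional endpoint for the cluster controller.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        # Two non-preemptible primary workers, as required by the scenario.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",  # hypothetical machine type
            "disk_config": {
                "boot_disk_type": "pd-ssd",  # SSD instead of the default pd-standard
                "boot_disk_size_gb": 500,
            },
        },
        # Preemptible (secondary) workers with larger SSD boot disks,
        # since Spark shuffle data spills to the workers' local disks.
        "secondary_worker_config": {
            "num_instances": 8,  # hypothetical count
            "preemptibility": "PREEMPTIBLE",  # secondary workers are preemptible by default
            "disk_config": {
                "boot_disk_type": "pd-ssd",
                "boot_disk_size_gb": 500,
            },
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
operation.result()  # block until the cluster is created
```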