
What is the most cost-effective way to optimize the performance of a complex analytical Spark job that involves shuffle operations and reads its input data in Parquet format (each file averages 200-400 MB), after migrating from an on-premises Hadoop cluster to Dataproc with the data stored on GCS, given the organization's cost sensitivity? The Spark job currently runs on preemptible VMs with only two non-preemptible workers.
A. Switch from Parquet files to the TFRecord format, with files of approximately 200 MB each.
B. Ensure that the Parquet files are at least 1 GB in size.
C. Change from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D. Change from HDDs to SSDs and modify the configuration of the preemptible VMs to increase the boot disk size.
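For context, a minimal PySpark sketch of the kind of shuffle-heavy analytical job the scenario describes is shown below. The bucket name, paths, column names, and the shuffle partition count are hypothetical, illustrative values, not part of the question.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("analytical-job")
    # Shuffle files and spill are written to the workers' local disks, which is
    # why disk type (HDD vs. SSD) and available disk space matter for a job like this.
    .config("spark.sql.shuffle.partitions", "400")  # illustrative value
    .getOrCreate()
)

# Hypothetical GCS paths; each Parquet file averages 200-400 MB.
# gs:// paths resolve through the GCS connector that Dataproc provides.
orders = spark.read.parquet("gs://example-bucket/orders/")
customers = spark.read.parquet("gs://example-bucket/customers/")

# Wide transformations (join + aggregation) force a shuffle across the cluster.
result = (
    orders.join(customers, "customer_id")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount"))
)

result.write.mode("overwrite").parquet("gs://example-bucket/output/daily_totals/")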