
Answer-first summary for fast verification
Answer: Ensure that the parquet files are at least 1 GB in size.
The correct answer is B. Because the job's shuffle operations cannot be avoided, the main lever is read efficiency: many small parquet files on GCS add per-file overhead, so compacting them into files of at least 1 GB reduces the file count and improves throughput. Option A swaps in TFRecord files of roughly the same size, which does not address the small-file problem. Options C and D add SSD and data-copy costs that conflict with the cost-sensitivity requirement. A compaction sketch follows the options below. For more details, refer to [Google's Dataproc documentation on Spark job tuning](https://cloud.google.com/dataproc/docs/support/spark-job-tuning#limit_the_number_of_files).
Author: LeetQuiz Editorial Team
After migrating a complex analytical Spark job with shuffle operations from an on-premises Hadoop cluster to Dataproc backed by GCS, how can you optimize its performance, given that the input data is in parquet format (files averaging 200-400 MB each) and your organization is cost-sensitive? The job currently runs on preemptible VMs with only two non-preemptible workers.
A. Switch from using parquet files to TFRecord format, at approximately 200 MB per file.
B. Ensure that the parquet files are at least 1 GB in size.
C. Change from using HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D. Change from using HDDs to SSDs and modify the configuration of the preemptible VMs to increase the boot disk size.