You're assisting a team of data analysts in setting up a Cloud Dataproc cluster for Spark-based analysis of large datasets. The cluster will handle numerous jobs. Adhering to Google's recommended best practices, which two actions would you prioritize?
Explanation:
Google recommends Cloud Storage over HDFS on local disks for persistent storage: data remains accessible after the cluster is shut down, with no migration step required. For secondary workers, Google suggests capping preemptible VMs at roughly 30% of the cluster's total workers to balance cost against reliability. Autoscaling is beneficial for clusters handling many jobs, and it works best when persistent data lives in Cloud Storage rather than on cluster disks. For more detail, see Google's Dataproc Best Practices Guide.
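As a rough illustration of these recommendations, the sketch below uses the google-cloud-dataproc Python client to create a cluster whose secondary (preemptible) workers sit near the 30% mark and which references a pre-created autoscaling policy. The project ID, region, cluster name, worker counts, and policy name are placeholders chosen for the example, not values from the question.

```python
# Sketch only: placeholder project, region, cluster name, worker counts,
# and autoscaling policy name -- adjust to your environment.
from google.cloud import dataproc_v1

project_id = "my-project"      # placeholder
region = "us-central1"         # placeholder
policy = f"projects/{project_id}/regions/{region}/autoscalingPolicies/spark-policy"

# Dataproc clients are regional, so point at the regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "spark-analytics",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # 7 primary workers + 3 secondary workers (preemptible by default)
        # keeps preemptible capacity at ~30% of all workers.
        "worker_config": {"num_instances": 7, "machine_type_uri": "n1-standard-4"},
        "secondary_worker_config": {"num_instances": 3},
        # Attach a pre-created autoscaling policy so the cluster resizes
        # with job load.
        "autoscaling_config": {"policy_uri": policy},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is provisioned

# Jobs submitted to this cluster should read and write gs:// (Cloud Storage)
# paths rather than hdfs:// paths so data outlives the cluster.
```

If you prefer the CLI, the same shape is typically expressed with the --num-workers, --num-secondary-workers, and --autoscaling-policy flags of gcloud dataproc clusters create.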