Ultimate access to all questions.
You're setting up a managed Hadoop system for your data lake, using Cloud Storage for input, output, and intermediary data to separate storage from compute. However, a specific Hadoop job runs much slower on Cloud Dataproc than on an on-premises bare-metal Hadoop setup with 8-core nodes and 100-GB RAM, suggesting it's disk I/O intensive. What's the best way to improve the performance of this Hadoop job on Cloud Dataproc?