
Answer-first summary for fast verification
Answer (two correct choices):
(C) Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
(D) Leverage the Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate the external Hive tables to native ones.
Options C and D are the correct answers. Option C copies the ORC files to the master node first and then transfers them into HDFS, which maximizes performance by serving reads from the cluster's local Hadoop Distributed File System. Option D uses the Cloud Storage connector for Hadoop (preinstalled on Dataproc) to mount the ORC files as external Hive tables directly from Cloud Storage, then replicates those external tables to native HDFS-backed tables for better performance. Options A and B are incorrect because gsutil cannot write directly to HDFS, and copying files to an arbitrary worker node does not place them in HDFS either. Option E, while technically possible, adds unnecessary steps through BigQuery and does not maximize performance in this scenario.
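The Option C workflow can be sketched as the following commands, run on the Dataproc master node. The bucket name, paths, and table schema are hypothetical placeholders, not part of the question:

```shell
# --- Option C sketch (hypothetical bucket, paths, and schema) ---

# 1. Copy the ORC files from Cloud Storage to the master node's local disk.
gsutil -m cp -r gs://example-bucket/warehouse/orders/ /tmp/orders/

# 2. Copy them from local disk into HDFS with the Hadoop utility.
hadoop fs -mkdir -p /user/hive/warehouse/orders
hadoop fs -put /tmp/orders/*.orc /user/hive/warehouse/orders/

# 3. Mount a Hive table over the HDFS location (schema is illustrative).
hive -e "
CREATE TABLE orders (order_id BIGINT, total DOUBLE)
STORED AS ORC
LOCATION 'hdfs:///user/hive/warehouse/orders';
"
```

The intermediate copy to local disk is what distinguishes C from A and B: gsutil understands only local filesystems and Cloud Storage, so a separate `hadoop fs -put` (or `hadoop distcp` for large datasets) is needed to land the files in HDFS.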
Author: LeetQuiz Editorial Team
You are planning to transition an existing on-premises Hadoop infrastructure to Cloud Dataproc. The current setup mainly utilizes Hive, and the data is stored in Optimized Row Columnar (ORC) format. All ORC files have already been transferred to a Cloud Storage bucket. In order to enhance performance, you need to duplicate certain data into the cluster’s local Hadoop Distributed File System (HDFS). What are two methods to begin working with Hive on Cloud Dataproc? (Choose two.)
A
Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.
B
Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
C
Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
D
Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.
E
Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.
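For reference, the Option D workflow can be sketched in Hive DDL. Because the Cloud Storage connector ships with Dataproc, `gs://` paths work directly as table locations; the bucket name and schema below are hypothetical:

```shell
# --- Option D sketch (hypothetical bucket and schema) ---
hive -e "
-- External table reads the ORC files in place from Cloud Storage
-- via the Cloud Storage connector (gs:// scheme).
CREATE EXTERNAL TABLE orders_ext (order_id BIGINT, total DOUBLE)
STORED AS ORC
LOCATION 'gs://example-bucket/warehouse/orders/';

-- Replicate to a native (HDFS-backed) table for hot data,
-- so frequently queried rows are served from local HDFS.
CREATE TABLE orders STORED AS ORC
AS SELECT * FROM orders_ext;
"
```

This lets you start querying immediately against Cloud Storage while selectively duplicating performance-critical tables into HDFS, matching the question's requirement.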