
NO.36 Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?
Explanation:
To minimize storage costs when migrating an on-premises Hadoop cluster to Google Cloud Dataproc, the most cost-effective approach is to use Google Cloud Storage (GCS) instead of Persistent Disk for HDFS data.
Cost Efficiency: Google Cloud Storage is significantly cheaper per gigabyte than Persistent Disk for large-scale data, and you pay only for the bytes you actually store rather than for provisioned block-storage capacity.
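As a rough illustration of the gap at the 50 TB-per-node scale in the question, the sketch below compares the two; the per-gigabyte rates are assumed placeholders, not current list prices, and actual pricing varies by region, disk type, and storage class:

```python
# Illustrative monthly storage cost per node; the $/GB-month rates are assumed
# placeholders, not current list prices -- check the pricing pages before deciding.
GB_PER_TB = 1024
data_gb = 50 * GB_PER_TB      # 50 TB per node, from the question

pd_rate = 0.04                # assumed standard Persistent Disk rate, $/GB-month
gcs_rate = 0.02               # assumed Standard-class Cloud Storage rate, $/GB-month

print(f"Persistent Disk per node: ~${data_gb * pd_rate:,.0f}/month")
print(f"Cloud Storage per node:   ~${data_gb * gcs_rate:,.0f}/month")
```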
HDFS Compatibility: Dataproc can use GCS as the underlying storage layer through the Cloud Storage connector, which is preinstalled on Dataproc clusters and provides HDFS-compatible access (gs:// paths) to data stored in GCS buckets.
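For example, because the connector is already on the classpath of a Dataproc cluster, a Spark job can reference gs:// paths exactly as it would hdfs:// paths; the bucket and prefixes below are hypothetical:

```python
from pyspark.sql import SparkSession

# Runs on a Dataproc cluster, where the Cloud Storage connector is preinstalled.
spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

# Read input directly from a Cloud Storage bucket (hypothetical bucket and prefix).
events = spark.read.parquet("gs://example-datalake/raw/events/")

# Write results back to Cloud Storage instead of local HDFS.
(events.groupBy("event_type")
       .count()
       .write.mode("overwrite")
       .parquet("gs://example-datalake/curated/event_counts/"))
```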
Managed Service Benefits: GCS is a fully managed service that eliminates storage capacity management, scales automatically, and provides high durability and availability.
Separation of Compute and Storage: By using GCS, you decouple storage from compute, so you can resize the Dataproc cluster, or delete it entirely when it is idle, without losing data, which can lead to additional cost savings.
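A sketch of that ephemeral-cluster pattern using the google-cloud-dataproc client library follows; the project, region, bucket, and cluster names are hypothetical, and the request shapes follow the library's documented dict-style calls, so they may need adjusting to your environment:

```python
from google.cloud import dataproc_v1

project_id, region = "example-project", "us-central1"   # hypothetical values
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create a short-lived cluster sized for the job, not for the data.
cluster = {
    "project_id": project_id,
    "cluster_name": "ephemeral-etl",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
    },
}
clusters.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Run the job; all input and output lives in Cloud Storage, not on the cluster.
job = {
    "placement": {"cluster_name": "ephemeral-etl"},
    "pyspark_job": {"main_python_file_uri": "gs://example-datalake/jobs/etl.py"},
}
jobs.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Delete the cluster; the data in Cloud Storage remains.
clusters.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": "ephemeral-etl"}
).result()
```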
Options B, D, and F (Persistent Disk variants): All Persistent Disk options are more expensive than GCS for large-scale data storage and require manual management of storage capacity.
Option C (Local SSD): Local SSDs are ephemeral storage whose contents are lost when an instance is stopped or deleted, making them unsuitable for durable HDFS data.
Option E (Cloud Storage FUSE): Cloud Storage FUSE mounts a bucket as a local file system; it could work, but it adds overhead and is slower and less reliable for Hadoop I/O than the native Cloud Storage connector.
This approach provides the best balance of cost savings, performance, and operational simplicity for Hadoop workloads on Google Cloud.