
NO.36 Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?
Explanation:
To minimize storage costs when migrating an on-premises Hadoop cluster to Google Cloud Dataproc, the most cost-effective approach is to use Google Cloud Storage (GCS) instead of Persistent Disk for HDFS data.
Cost Efficiency: Google Cloud Storage is significantly cheaper per gigabyte than Persistent Disk for large-scale data, and you pay only for the bytes you actually store rather than for provisioned block-storage capacity.
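As a rough illustration of the gap at the 50 TB-per-node scale in the question, the sketch below compares the two; the per-gigabyte rates are assumed placeholders, not current list prices, and actual pricing varies by region, disk type, and storage class:

```python
# Illustrative monthly storage cost per node; the $/GB-month rates are assumed
# placeholders, not current list prices -- check the pricing pages before deciding.
GB_PER_TB = 1024
data_gb = 50 * GB_PER_TB      # 50 TB per node, from the question

pd_rate = 0.04                # assumed standard Persistent Disk rate, $/GB-month
gcs_rate = 0.02               # assumed Standard-class Cloud Storage rate, $/GB-month

print(f"Persistent Disk per node: ~${data_gb * pd_rate:,.0f}/month")
print(f"Cloud Storage per node:   ~${data_gb * gcs_rate:,.0f}/month")
```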
HDFS Compatibility: Dataproc can use GCS as the underlying storage layer through the Cloud Storage connector, which is preinstalled on Dataproc clusters and provides HDFS-compatible access (gs:// paths) to data stored in GCS buckets.
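For example, because the connector is already on the classpath of a Dataproc cluster, a Spark job can reference gs:// paths exactly as it would hdfs:// paths; the bucket and prefixes below are hypothetical:

```python
from pyspark.sql import SparkSession

# Runs on a Dataproc cluster, where the Cloud Storage connector is preinstalled.
spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

# Read input directly from a Cloud Storage bucket (hypothetical bucket and prefix).
events = spark.read.parquet("gs://example-datalake/raw/events/")

# Write results back to Cloud Storage instead of local HDFS.
(events.groupBy("event_type")
       .count()
       .write.mode("overwrite")
       .parquet("gs://example-datalake/curated/event_counts/"))
```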
Managed Service Benefits: GCS is a fully managed service that eliminates storage capacity management, scales automatically, and provides high durability and availability.
Separation of Compute and Storage: By using GCS, you decouple storage from compute, so you can resize the Dataproc cluster, or delete it entirely when it is idle, without losing data, which can lead to additional cost savings.
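A sketch of that ephemeral-cluster pattern using the google-cloud-dataproc client library follows; the project, region, bucket, and cluster names are hypothetical, and the request shapes follow the library's documented dict-style calls, so they may need adjusting to your environment:

```python
from google.cloud import dataproc_v1

project_id, region = "example-project", "us-central1"   # hypothetical values
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create a short-lived cluster sized for the job, not for the data.
cluster = {
    "project_id": project_id,
    "cluster_name": "ephemeral-etl",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
    },
}
clusters.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Run the job; all input and output lives in Cloud Storage, not on the cluster.
job = {
    "placement": {"cluster_name": "ephemeral-etl"},
    "pyspark_job": {"main_python_file_uri": "gs://example-datalake/jobs/etl.py"},
}
jobs.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Delete the cluster; the data in Cloud Storage remains.
clusters.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": "ephemeral-etl"}
).result()
```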
Options B, D, and F (Persistent Disk variants): All Persistent Disk options are more expensive than GCS for large-scale data storage and require manual management of storage capacity.
Option C (Local SSD): Local SSDs are ephemeral storage whose contents are lost when an instance is stopped or deleted, making them unsuitable for durable HDFS data.
Option E (Cloud Storage FUSE): Cloud Storage FUSE mounts a bucket as a local file system; it could work, but it adds overhead and is slower and less reliable for Hadoop I/O than the native Cloud Storage connector.
This approach provides the best balance of cost savings, performance, and operational simplicity for Hadoop workloads on Google Cloud.