
Google Professional Data Engineer
You are tasked with creating a storage solution for a new data pipeline project on Google Cloud. The project involves managing 20 TB of input data stored in CSV format. Your primary objective is to minimize the cost associated with querying aggregate values for multiple users. These users will be running their queries on data stored in Cloud Storage using various query engines. Given these requirements, which storage service and schema design would you recommend?
Explanation:
The correct answer is C: store the data in Cloud Storage and link it as permanent external tables in BigQuery for querying. Cloud Storage is a highly durable, low-cost object store, so the 20 TB of CSV input can remain in place and stay accessible to the various query engines the users rely on. Defining permanent external tables in BigQuery over those Cloud Storage files lets multiple users compute aggregate values with standard SQL without loading or duplicating the data and without provisioning dedicated compute. Because the tables are permanent, they live in a dataset, and dataset-level access controls can be used to manage permissions for the different users. Querying cost is minimized because BigQuery charges only for the bytes scanned per query, whereas an option such as Cloud Bigtable would require maintaining a provisioned cluster, which is comparatively expensive for this analytical workload.
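As a rough illustration of this setup, the sketch below uses the google-cloud-bigquery Python client to create a permanent external table over CSV files in Cloud Storage. The project ID, bucket path, and dataset/table names are hypothetical placeholders, and schema autodetection is assumed to be acceptable for the CSV layout.

```python
from google.cloud import bigquery

# Hypothetical project ID; replace with your own.
client = bigquery.Client(project="my-project")

# Describe the CSV files sitting in Cloud Storage (hypothetical bucket/path).
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/pipeline-input/*.csv"]
external_config.autodetect = True              # infer the schema from the files
external_config.options.skip_leading_rows = 1  # skip the CSV header row

# Define a permanent external table in a dataset (hypothetical dataset.table).
table = bigquery.Table("my-project.analytics.pipeline_input")
table.external_data_configuration = external_config

# Create the table; the data itself stays in Cloud Storage.
client.create_table(table)
```

Once created, users with access to the dataset can query the table like any native BigQuery table (for example, `SELECT col, SUM(value) FROM analytics.pipeline_input GROUP BY col`), and they are billed only for the bytes scanned by each query.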