
You are developing a new application that requires a scalable data collection mechanism. The application generates data continuously throughout the day, and projections indicate that by the end of the year the system will produce roughly 150 GB of JSON data per day. The following requirements must be met:
✑ Decouple the data producer from the data consumer to ensure independent scaling and fault tolerance.
✑ Implement a storage system for the raw ingested data that is both space- and cost-efficient, with the capability to store this data indefinitely.
✑ Enable near real-time querying using SQL to analyze the incoming data quickly.
✑ Retain at least 2 years of historical data, and allow the stored data to be queried with SQL for insights spanning that period.
Which pipeline should you use to fulfill these requirements?
A. Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files.
B. Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.
C. Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.
D. Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.
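To make the trade-offs concrete, here is a minimal, non-authoritative sketch of the architecture described in option D. The project, topic, subscription, bucket, and table names, as well as the event schema, are assumptions made for illustration; the question itself does not specify them.

On the producer side, the application publishes each JSON event to a Cloud Pub/Sub topic, which decouples it from whatever consumes the data:

```python
# Hypothetical producer: publish one JSON event to a Pub/Sub topic.
# "my-project" and "events" are placeholder names.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")

event = {"user_id": "u-123", "action": "click", "ts": "2024-01-01T00:00:00Z"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print(future.result())  # Blocks until Pub/Sub acknowledges and returns the message ID.
```

On the consumer side, a streaming Dataflow job (sketched here with the Apache Beam Python SDK) reads from a subscription, parses the JSON, streams rows into BigQuery for near real-time SQL, and writes windowed Avro files to Cloud Storage for compact, indefinite retention of the raw data:

```python
# Sketch of a streaming Beam/Dataflow pipeline for option D.
# The subscription, table, bucket, and schema below are illustrative assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window


def run():
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        events = (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        )

        # Near real-time SQL: stream each parsed event into BigQuery.
        events | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",
            schema="user_id:STRING,action:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )

        # Cost-efficient raw retention: window the stream and write compact
        # Avro files to Cloud Storage. A production streaming job may prefer
        # beam.io.fileio.WriteToFiles for file output; this is simplified.
        (
            events
            | "FixedWindows" >> beam.WindowInto(window.FixedWindows(5 * 60))
            | "WriteAvro" >> beam.io.WriteToAvro(
                file_path_prefix="gs://my-bucket/raw/events",
                schema={
                    "type": "record",
                    "name": "Event",
                    "fields": [
                        {"name": "user_id", "type": "string"},
                        {"name": "action", "type": "string"},
                        {"name": "ts", "type": "string"},
                    ],
                },
                file_name_suffix=".avro",
            )
        )


if __name__ == "__main__":
    run()
```

Compared with options A-C, this design keeps the producer and consumer independent (Pub/Sub), stores the raw data compactly and indefinitely (Avro on Cloud Storage), and supports SQL over both fresh and multi-year data (BigQuery), which is why option D aligns most closely with all four requirements.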