
Answer-first summary for fast verification
Answer: Process the messages with a Dataflow streaming pipeline using Apache Beam's PubSubIO package, and write the output to storage.
The question requires processing Pub/Sub messages exactly once while using the cheapest and simplest solution.

- **Option A (Dataproc)**: Dataproc provides managed Hadoop/Spark clusters, primarily for batch workloads. While Spark Streaming can handle streams, managing clusters adds operational complexity and cost, and ensuring exactly-once semantics requires additional setup. This is not the simplest solution.
- **Option B (Dataflow)**: Dataflow is a serverless streaming service with native support for exactly-once processing via Apache Beam's PubSubIO. It deduplicates redelivered messages automatically using their unique message IDs and checkpointed state, simplifying the architecture. This aligns with the requirements for simplicity, cost-effectiveness, and guaranteed exactly-once processing.
- **Option C (Cloud Functions)**: Cloud Functions triggered by Pub/Sub may process a message multiple times, because Pub/Sub delivery is at-least-once. Writing to BigQuery and then running a deduplication job adds extra steps and complexity, making it less simple than a single Dataflow pipeline.
- **Option D (Two Dataflow pipelines + Bigtable)**: Using two pipelines plus Bigtable introduces unnecessary complexity and cost. Dataflow can deduplicate messages within a single pipeline, making this approach redundant and more expensive.

**Conclusion**: Option B provides a fully managed, serverless solution with built-in exactly-once semantics, minimal complexity, and cost efficiency.
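The key mechanism behind Option B is deduplication by unique message ID. The following is a minimal, hypothetical Python sketch (not the Beam or Pub/Sub API; `Message` and `process_exactly_once` are invented for illustration) of the dedup-by-ID behavior that Dataflow's PubSubIO performs for you automatically, and that Option C would force you to implement by hand:

```python
# Hypothetical sketch: simulates Pub/Sub's at-least-once delivery and the
# message-ID deduplication that Dataflow's PubSubIO handles automatically
# to achieve exactly-once processing. Not Google Cloud client code.
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str  # Pub/Sub assigns a unique ID to each published message
    data: str


def process_exactly_once(deliveries, seen_ids, sink):
    """Process each unique message once, skipping redeliveries."""
    for msg in deliveries:
        if msg.message_id in seen_ids:
            continue                      # redelivery: already processed
        seen_ids.add(msg.message_id)      # checkpoint the ID
        sink.append(msg.data.upper())     # stand-in for "write to storage"


# At-least-once delivery means "m1" can arrive more than once.
deliveries = [
    Message("m1", "order-created"),
    Message("m2", "order-paid"),
    Message("m1", "order-created"),  # duplicate redelivery
]
sink, seen = [], set()
process_exactly_once(deliveries, seen, sink)
print(sink)  # each message's output appears exactly once
```

In a real pipeline this bookkeeping lives inside Dataflow's checkpointed state, which is why a single Dataflow pipeline needs no external dedup store such as the Bigtable of Option D.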
Author: LeetQuiz Editorial Team
Your team builds services on Google Cloud. You need to process messages from a Pub/Sub topic and store them, ensuring each message is processed exactly once to prevent data duplication or conflicts. The solution must be the simplest and most cost-effective. What is the recommended approach?
A
Process the messages with a Dataproc job, and write the output to storage.
B
Process the messages with a Dataflow streaming pipeline using Apache Beam's PubSubIO package, and write the output to storage.
C
Process the messages with a Cloud Function, and write the results to a BigQuery location where you can run a job to deduplicate the data.
D
Retrieve the messages with a Dataflow streaming pipeline, store them in Cloud Bigtable, and use another Dataflow streaming pipeline to deduplicate messages.