
Answer-first summary for fast verification
Answer: Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
The correct answer is B. ParDo is the core parallel-processing operation in the Apache Beam SDKs, on which Cloud Dataflow is built. A ParDo applies custom processing logic to every element of a PCollection, so it can validate each record and simply not emit the corrupt ones, removing them from the pipeline before the data reaches BigQuery. This makes it the most straightforward and efficient way to clean the stream. By contrast, a SideInput (A) only supplies extra data to a transform and does not by itself remove elements, a Partition (C) still produces a corrupt-data PCollection that you would then have to discard, and GroupByKey (D) groups elements by key rather than filtering them.
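For concreteness, a minimal sketch of such a filter using the Apache Beam Python SDK is shown below. The Pub/Sub topic, the BigQuery table, and the validation rule inside DiscardCorrupt are hypothetical placeholders, not part of the question; substitute your own.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class DiscardCorrupt(beam.DoFn):
    """Yields only elements that parse and pass validation; corrupt ones are dropped."""

    def process(self, element):
        try:
            record = json.loads(element.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            return  # corrupt payload: yield nothing, so the element is discarded
        # Hypothetical validity check; replace with your own rule.
        if "device_id" in record and "timestamp" in record:
            yield record


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/iot-events")
        | "DropCorrupt" >> beam.ParDo(DiscardCorrupt())
        # Schema and write dispositions omitted for brevity.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery("my-project:iot.events")
    )

Returning from process() without yielding is all it takes to drop an element, since a DoFn may emit zero, one, or many outputs per input.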
Author: LeetQuiz Editorial Team
In the context of developing a data processing pipeline using Google Cloud services, you are tasked with streaming IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you observe that approximately 2% of it is corrupt. How would you update the Cloud Dataflow pipeline to filter out the corrupt data?
A. Add a SideInput that returns a Boolean if the element is corrupt.
B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
C. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
D. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.