
Answer-first summary for fast verification
Answer: Add a try... catch block to your DoFn that transforms the data, and use a sideOutput to create a PCollection that can be stored to Pub/Sub later.
The correct answer is D. Adding a try... catch block to the DoFn that transforms the data, and using a sideOutput to collect erroneous records into a separate PCollection, allows the failing data to be reprocessed without creating a bottleneck in the pipeline. Keeping the erroneous records in their own PCollection also makes them easy to debug, analyze, and replay as needed. In contrast, writing to Pub/Sub directly from inside the DoFn (option C) adds blocking I/O to every failing element, which can throttle the pipeline. Option A, simply filtering out erroneous data, discards the failing records entirely and therefore risks data loss. Option B, extracting erroneous rows from logs, handles the failures outside the pipeline rather than within it, making reliable reprocessing much harder.
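The routing logic behind option D can be sketched in plain Python. In a real Apache Beam pipeline the same pattern lives inside a DoFn with tagged outputs (for example `pvalue.TaggedOutput` in the Python SDK, or `TupleTag` with `withOutputTags` in Java); here the Beam machinery is replaced by two ordinary lists so the control flow is easy to follow. The record schema and `transform` function are illustrative assumptions, not part of the question.

```python
def transform(record):
    """Stand-in for the transform step; raises on malformed input."""
    return {"user": record["user"], "amount": float(record["amount"])}

def process(records):
    """Route each record to the main output or the error side output."""
    main_output, error_output = [], []
    for record in records:
        try:
            main_output.append(transform(record))
        except (KeyError, ValueError, TypeError) as exc:
            # Instead of failing the whole bundle, capture the bad record
            # together with the error so it can be published to Pub/Sub
            # later and reprocessed once the issue is fixed.
            error_output.append({"record": record, "error": str(exc)})
    return main_output, error_output

good, bad = process([
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "not-a-number"},  # triggers ValueError
])
print(good)  # one successfully transformed record
print(bad)   # one captured failure, ready for a dead-letter topic
```

In Beam itself, `bad` would be the side-output PCollection, and a downstream `WriteToPubSub` transform (rather than I/O inside the DoFn) would publish it, which is exactly why D avoids the bottleneck that C introduces.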
Author: LeetQuiz Editorial Team
Your team has been tasked with developing and maintaining Extract, Transform, and Load (ETL) processes within your company’s data infrastructure. Currently, you have a Dataflow job that is experiencing failures due to errors in the input data. To improve the reliability and resilience of this ETL pipeline, including the ability to reprocess all the failing data, what steps should you take?
A
Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
B
Add a try... catch block to your DoFn that transforms the data, extract erroneous rows from logs.
C
Add a try... catch block to your DoFn that transforms the data, write erroneous rows to Pub/Sub directly from the DoFn.
D
Add a try... catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to Pub/Sub later.