
Answer-first summary for fast verification
Answer: Load only the new dataset and the updated subset of the existing dataset into Spark, perform an inner join, and then update the matching records. This approach minimizes data redundancy and ensures only relevant data is processed.
Option D is the most efficient and cost-effective approach for incremental processing under the given constraints. By loading only the new dataset and the relevant subset of the existing dataset, it minimizes data movement and processing volume, in line with the stated data governance policy. The inner join ensures that only records with updates are processed, avoiding unnecessary computation. Options A, B, and C either process excessive data or may not accurately reflect the updates, leading to inefficiency or incorrect results in incremental processing.
Author: LeetQuiz Editorial Team
You are tasked with optimizing a Spark job for incremental processing on a large dataset in Azure Databricks. The scenario involves a new dataset that contains updates to a subset of an existing dataset. Your goal is to ensure efficient processing while minimizing data redundancy and cost. Consider the following constraints: the existing dataset is very large, and the new dataset is significantly smaller but contains critical updates. Additionally, the solution must comply with data governance policies that require minimal data movement. Given these constraints, which of the following approaches is the BEST to achieve efficient incremental processing? (Choose one option.)
A. Load the entire existing dataset and the new dataset into Spark, perform a full join, and then filter out the unchanged records. This approach ensures all possible updates are considered but may not be cost-effective due to the large volume of data processed.
B. Load only the new dataset into Spark, perform a left join with the existing dataset, and then update the matching records. This approach may miss updates if the new dataset contains records not present in the existing dataset.
C. Load only the new dataset into Spark, perform a right join with the existing dataset, and then update the matching records. This approach may incorrectly process records from the existing dataset that do not have updates in the new dataset.
D. Load only the new dataset and the updated subset of the existing dataset into Spark, perform an inner join, and then update the matching records. This approach minimizes data redundancy and ensures only relevant data is processed.