
You are tasked with optimizing a Spark job for incremental processing on a large dataset in Azure Databricks. A new dataset contains updates to a subset of an existing dataset, and your goal is to process the updates efficiently while minimizing data redundancy and cost. Consider the following constraints: the existing dataset is very large, the new dataset is significantly smaller but contains critical updates, and the solution must comply with data governance policies that require minimal data movement. Given these constraints, which of the following approaches is the BEST way to achieve efficient incremental processing? (Choose one option.)
A. Load the entire existing dataset and the new dataset into Spark, perform a full join, and then filter out the unchanged records. This approach ensures all possible updates are considered but may not be cost-effective due to the large volume of data processed.

B. Load only the new dataset into Spark, perform a left join with the existing dataset, and then update the matching records. This approach may miss updates if the new dataset contains records not present in the existing dataset.

C. Load only the new dataset into Spark, perform a right join with the existing dataset, and then update the matching records. This approach may incorrectly process records from the existing dataset that do not have updates in the new dataset.

D. Load only the new dataset and the updated subset of the existing dataset into Spark, perform an inner join, and then update the matching records. This approach minimizes data redundancy and ensures only relevant data is processed.
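For concreteness, here is a minimal PySpark sketch of the approach described in option D. It assumes hypothetical Delta table paths, a key column `id`, a partition column `event_date`, and a payload column `value`; none of these names come from the question itself.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-updates").getOrCreate()

# Hypothetical paths and column names, for illustration only.
updates = spark.read.format("delta").load("/mnt/data/updates")

# Identify which partitions of the existing dataset the updates touch;
# the new dataset is small, so collecting the distinct values is cheap.
touched = [r["event_date"] for r in updates.select("event_date").distinct().collect()]

# Load only the affected subset: the partition filter lets Spark prune
# untouched partitions instead of scanning the full existing table.
existing_subset = (
    spark.read.format("delta")
    .load("/mnt/data/existing")
    .filter(F.col("event_date").isin(touched))
)

# An inner join on the key keeps exactly the records that have an update;
# the refreshed values are taken from the updates side of the join.
refreshed = existing_subset.join(updates, on=["id", "event_date"], how="inner").select(
    "id",
    "event_date",
    updates["value"].alias("value"),
)

refreshed.write.format("delta").mode("overwrite").save("/mnt/data/refreshed")
```

In practice on Databricks, this matched-update pattern is often expressed with Delta Lake's MERGE INTO, which applies the updates in place; the sketch above only illustrates the join-and-select logic that option D describes.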