When designing an ETL pipeline in Databricks to enrich raw event data with reference data from an external REST API, which strategy ensures efficient integration of these external data lookups within your Spark job?
A. Storing reference data in Delta Lake and utilizing Delta caching to optimize join performance
B. Implementing an RDD-based approach to manually manage lookups and parallelize requests
C. Using a broadcast join to distribute reference data across all nodes for efficient lookups
D. Leveraging Spark's mapPartitions to perform batched API requests for each partition