
Explanation:
D: Leveraging Spark's mapPartitions to perform batched API requests for each partition.
This is the most efficient and recommended strategy for enriching large-scale raw event data with lookups from an external REST API inside a Spark job.
API call efficiency: Row-by-row UDFs (or map()) issue one HTTP request per row, which on large event datasets means millions of individual requests and leads to throttling, high latency, and connection overhead. mapPartitions lets you process an entire partition at once: inside the partition iterator, you can group rows into batches and send one API request per batch instead of one per row.
Scalability with Spark's distributed model: Partitions are naturally processed in parallel across executors. This aligns with Spark's execution model without unnecessary data movement.
Common Databricks/Spark pattern: This is a standard approach for external system enrichment (REST APIs, databases without native connectors, ML inference, etc.). Databricks documentation and community resources frequently highlight it for scalable REST interactions.
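The batching described under "API call efficiency" can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `fetch_reference_batch`, the `event_key` field, the `reference` output field, and the batch size of 100 are all assumptions made for the example, and the REST call is stubbed so the sketch runs without network access.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

BATCH_SIZE = 100  # assumed value; tune to the API's request limits


def fetch_reference_batch(keys: List[str]) -> Dict[str, dict]:
    """Hypothetical stand-in for one batched REST call.

    A real job would send `keys` to the lookup endpoint in a single
    request and parse the response. Stubbed here so the sketch runs offline.
    """
    return {k: {"label": f"ref-{k}"} for k in keys}


def make_enricher(
    fetch: Callable[[List[str]], Dict[str, dict]] = fetch_reference_batch,
    batch_size: int = BATCH_SIZE,
) -> Callable[[Iterable[dict]], Iterator[dict]]:
    """Build a function suitable for passing to mapPartitions."""

    def enrich_partition(rows: Iterable[dict]) -> Iterator[dict]:
        it = iter(rows)
        while True:
            # Pull up to batch_size rows without materializing the partition.
            batch = list(islice(it, batch_size))
            if not batch:
                break
            # De-duplicate keys (order preserved) so each key is requested
            # at most once per batch.
            keys = list(dict.fromkeys(row["event_key"] for row in batch))
            reference = fetch(keys)  # one call per batch, not per row
            for row in batch:
                yield {**row, "reference": reference.get(row["event_key"])}

    return enrich_partition
```

In a Spark job this might be applied as `events_df.rdd.map(lambda r: r.asDict()).mapPartitions(make_enricher())`, where `events_df` is an assumed DataFrame of raw events. The generator yields rows as it goes, so only one batch per partition is held in memory at a time.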
A: Storing reference data in Delta Lake + caching is excellent if the reference data is static or can be periodically materialized (e.g., daily refresh). However, the question specifies "from an external REST API," implying dynamic/on-the-fly lookups. Delta caching optimizes joins but doesn't address live API calls.
B: RDD-based manual management is low-level, error-prone, and bypasses the optimized DataFrame API (Catalyst, Tungsten, etc.). It's rarely the best choice in modern Spark/Databricks workflows.
C: Broadcast join is ideal for small, static reference data that fits in memory (e.g., a small lookup table fetched once and broadcast). It doesn't directly handle dynamic external API calls per event. If the reference data can be fully fetched upfront, you could load it into a small DataFrame and broadcast, but this isn't the general case for ongoing REST enrichment.
For streaming scenarios, the same batching can be applied inside foreachBatch. Overall, mapPartitions balances performance, reliability, and maintainability for production ETL pipelines.
When designing an ETL pipeline in Databricks to enrich raw event data with reference data from an external REST API, which strategy ensures efficient integration of these external data lookups within your Spark job?
A. Storing reference data in Delta Lake and utilizing Delta caching to optimize join performance
B. Implementing an RDD-based approach to manually manage lookups and parallelize requests
C. Using a broadcast join to distribute reference data across all nodes for efficient lookups
D. Leveraging Spark's mapPartitions to perform batched API requests for each partition