
Answer-first summary for fast verification
Answer: Utilize production-sized datasets and production-grade clusters while using the **Run All** execution mode to measure performance.
### Why Option B is Correct

* **Representative workload:** Performance characteristics such as task parallelism, shuffle patterns, and I/O overhead change significantly when moving from small "toy" datasets to production-scale data. Testing on realistic volumes is essential to surface bottlenecks that do not appear in small-scale development.
* **Consistent runtime mode:** Executing cells one by one interactively introduces driver-client round-trip latency and prevents Spark from optimizing the entire execution path. **Run All** executes the notebook as a contiguous batch, so Spark's internal optimizations, such as pipelining and whole-stage code generation, are applied uniformly.

### Why the Other Options are Incorrect

* **Option A:** PySpark and Spark SQL are first-class citizens on Databricks. Because they use the same Catalyst optimizer and JVM backend as Scala, restructuring them into JARs is generally unnecessary for performance evaluation.
* **Option C:** Open-source Spark lacks Databricks-specific optimizations such as the Photon engine, Delta caching, and specialized cloud I/O integrations, so local benchmarks will not accurately reflect Databricks production performance.
* **Option D:** While it is true that `display()` triggers jobs and that caching can skew repeated manual runs, this merely describes the current problem rather than offering a proactive adjustment that improves measurement accuracy.
* **Option E:** The Photon engine can be enabled on both interactive and job clusters; it is not restricted to the Jobs UI. Job clusters are recommended for production, but they are not the only way to get Photon acceleration.
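The lazy-evaluation point underlying Option D (transformations only build a logical plan; work happens when an action runs) can be illustrated outside Databricks. Below is a minimal plain-Python sketch that mimics Spark's transformation/action split with a toy class; `LazyPipeline`, its methods, and the `_plan` attribute are invented for illustration and are not Spark's API:

```python
class LazyPipeline:
    """Toy stand-in for Spark's lazy evaluation: transformations only
    record steps in a plan; nothing executes until an action is called."""

    def __init__(self, data):
        self._data = data
        self._plan = []  # recorded transformations, like a logical plan

    def map(self, fn):
        self._plan.append(("map", fn))
        return self

    def filter(self, pred):
        self._plan.append(("filter", pred))
        return self

    def collect(self):
        # The "action": only now is the whole plan executed end-to-end.
        result = iter(self._data)
        for kind, fn in self._plan:
            result = map(fn, result) if kind == "map" else filter(fn, result)
        return list(result)


pipeline = LazyPipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet -- only the plan exists.
print(len(pipeline._plan))   # 2 recorded steps, nothing executed
print(pipeline.collect())    # [0, 4, 16, 36, 64]
```

This is why timing individual cells is misleading: each `display()` call forces a separate action, whereas a batch run lets the engine execute the full plan once.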
Author: LeetQuiz Editorial Team
A data engineer is troubleshooting performance bottlenecks in their pipeline logic. Currently, they develop interactively by executing notebook cells one by one and using `display()` calls to validate each step. To estimate production execution time, they manually re-run cells multiple times.
Which of the following adjustments would provide the most precise evaluation of how the code will perform once deployed to production?
A
Restructure all PySpark and Spark SQL logic into Scala JARs, as Scala is the only language that allows for accurate performance benchmarking and optimal execution in interactive notebooks.
B
Utilize production-sized datasets and production-grade clusters while using the Run All execution mode to measure performance.
C
Perform benchmarking within an Integrated Development Environment (IDE) against local builds of open-source Spark and Delta Lake to establish a performance baseline.
D
Continue using `display()` calls to trigger jobs manually, while accounting for the fact that Spark only contributes to the logical query plan until an action is called.
E
Execute the notebook via the Jobs UI to monitor timing, as the Photon acceleration engine can only be enabled on clusters launched for scheduled jobs.
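The caching skew mentioned in the discussion of Option D is easy to reproduce in any runtime. The plain-Python sketch below uses `functools.lru_cache` as a stand-in for Spark/Delta caching (the function names and the 50 ms "scan" are illustrative assumptions) to show why timing a manual re-run of already-cached work understates the cold-start cost a production job would pay:

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_read(partition: int) -> int:
    # Stand-in for an expensive data scan; the real cost is paid only once,
    # after which results are served from cache.
    time.sleep(0.05)
    return partition * 2


def timed(fn, *args):
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


cold = timed(expensive_read, 1)   # first run pays the full scan cost
warm = timed(expensive_read, 1)   # manual re-run hits the cache
print(f"cold={cold:.3f}s warm={warm:.3f}s")
# The warm run is dramatically faster, so repeatedly re-running cells
# measures cache performance, not production performance.
```

This is the same reason the recommended approach (Option B) measures a single **Run All** pass on production-scale data rather than averaging repeated interactive executions.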