
A data engineer is troubleshooting performance bottlenecks in their pipeline logic. Currently, they are developing interactively by executing notebook cells one by one and using display() calls to validate each step. To estimate production execution time, they manually re-run cells multiple times.
Which of the following adjustments would provide the most precise evaluation of how the code will perform once deployed to production?
A
Restructure all PySpark and Spark SQL logic into Scala JARs, as Scala is the only language that allows for accurate performance benchmarking and optimal execution in interactive notebooks.
B
Utilize production-sized datasets and production-grade clusters while using the Run All execution mode to measure performance.
C
Perform benchmarking within an Integrated Development Environment (IDE) against local builds of open-source Spark and Delta Lake to establish a performance baseline.
D
Continue using display() calls to trigger jobs manually, while accounting for the fact that Spark only builds up the logical query plan until an action is called.
E
Execute the notebook via the Jobs UI to monitor timing, as the Photon acceleration engine can only be enabled on clusters launched for scheduled jobs.
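For context on option D: Spark transformations are lazy — chaining them only extends the query plan, and no computation runs until an action (such as count(), collect(), or a write) is invoked. A minimal pure-Python analogy of this transformation/action split (this is an illustrative sketch, not Spark itself; the LazyPipeline class is hypothetical):

```python
# Simplified analogy of Spark's lazy evaluation: "transformations"
# only record steps in a plan; nothing executes until an "action".
class LazyPipeline:
    def __init__(self, data):
        self._data = data
        self._plan = []          # recorded steps, not yet executed

    def map(self, fn):           # transformation: appends to the plan
        self._plan.append(("map", fn))
        return self

    def filter(self, pred):      # another lazy transformation
        self._plan.append(("filter", pred))
        return self

    def collect(self):           # action: runs the whole plan now
        result = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

pipe = LazyPipeline(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# At this point no work has happened -- only the plan exists.
print(pipe.collect())  # executes the plan: [12, 14, 16, 18]
```

This is why timing individual display() calls in a notebook can mislead: each action may re-trigger upstream work, so per-cell timings do not add up to the end-to-end runtime of a scheduled production run.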