
In your role as a Databricks Certified Data Engineer, you are tasked with optimizing the performance of a Databricks notebook that processes large datasets. The notebook is part of a critical data pipeline that feeds into a real-time analytics dashboard. The dashboard's performance has been degrading, and initial investigation suggests the notebook is the bottleneck. You need to ensure the solution is cost-effective, scalable, and compliant with the organization's data governance policies. Which of the following approaches would BEST address these requirements? Choose one option.
A
Profile the notebook using the Databricks UI to identify slow-running cells, then optimize the code by removing unnecessary operations and using more efficient algorithms, without considering the impact on cluster resources or data governance.
B
Use the Databricks notebook's built-in visualizations to identify performance bottlenecks, and switch all data processing tasks to Databricks SQL, assuming it will automatically optimize performance without further analysis.
C
Enable the Databricks runtime metrics dashboard to monitor cluster performance in real time, optimize the code by caching frequently used DataFrames and using broadcast joins for small reference data, and review the solution's compliance with data governance policies.
D
Analyze the Spark UI to identify stages with high execution time, then optimize the code by arbitrarily increasing the cluster size and applying partitioning techniques without assessing their suitability for the workload.
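For context, here is a minimal PySpark sketch of the optimization pattern described in option C (caching a frequently reused DataFrame and broadcasting a small reference table). The table names, column name, and app name are hypothetical, chosen only for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# In a Databricks notebook, `spark` is already provided; building a session
# here just keeps the sketch self-contained for standalone execution.
spark = SparkSession.builder.appName("pipeline-optimization").getOrCreate()

# Hypothetical large fact table reused by several downstream steps.
events = spark.read.table("analytics.events")

# Cache the frequently used DataFrame so repeated actions reuse the
# materialized data instead of recomputing the full lineage each time.
events.cache()

# Hypothetical small reference (dimension) table.
regions = spark.read.table("analytics.dim_region")

# Broadcast the small table so the join ships it to every executor,
# avoiding a shuffle of the large `events` side.
enriched = events.join(broadcast(regions), on="region_id", how="left")

enriched.write.mode("overwrite").saveAsTable("analytics.events_enriched")
```

This sketch covers the code-level half of option C; the other half, monitoring cluster metrics and reviewing the change against governance policies, happens outside the notebook and is what distinguishes C from the narrower approaches in A, B, and D.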