Databricks Certified Data Engineer - Associate

How can you parameterize a query to filter data based on a batch date that changes with each run, without manually altering the code each time?
A. Store the batch date in the Spark configuration and use a Spark DataFrame to filter the data based on the Spark configuration.

B. Create a notebook parameter for batch date, assign its value to a Python variable, and use a Spark DataFrame to filter the data based on this variable.

C. Manually edit the code every time to change the batch date.

D. Create a dynamic view that automatically calculates the batch date and use this view to query the data.

E. There is no way to combine a Python variable and Spark code for filtering.

Explanation:

✅ B. Create a notebook parameter for batch date, assign its value to a Python variable, and use a Spark DataFrame to filter the data based on this variable.

Databricks notebooks support notebook parameters, which can be set during job runs or interactively. By defining batch_date as a notebook parameter, you can pass different values each time the program runs. This value can be accessed within the notebook using Databricks utilities (dbutils.widgets.get("batch_date")), assigned to a Python variable, and used in a Spark DataFrame's where clause for filtering. This method is flexible, clean, and integrates well with Databricks job scheduling.
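A minimal sketch of this pattern is below. The table name (`sales`) and column name (`sale_date`) are hypothetical placeholders; `dbutils` and `spark` are provided automatically in Databricks notebooks.

```python
from pyspark.sql.functions import col

# Create the notebook parameter (widget) with a default value; a scheduled
# job run can override this value on each execution.
dbutils.widgets.text("batch_date", "2024-01-01")

# Read the parameter into a Python variable.
batch_date = dbutils.widgets.get("batch_date")

# Use the Python variable in a Spark DataFrame filter.
df = spark.table("sales").where(col("sale_date") == batch_date)
```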

❌ A. Store the batch date in the Spark configuration and use a Spark DataFrame to filter the data based on the Spark configuration. While possible, using the Spark configuration for a dynamically changing batch date is less straightforward and less conventional than using notebook parameters, and it adds unnecessary complexity, as the sketch below illustrates.
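For contrast, this is roughly what option A would look like. The configuration key (`custom.batchDate`) is an arbitrary, hypothetical name, and the value still has to be set somewhere upstream before the notebook runs.

```python
# Set the batch date in the Spark session configuration (would happen
# upstream, e.g. in cluster or job configuration).
spark.conf.set("custom.batchDate", "2024-01-01")

# Retrieve it and filter; functionally similar to option B, but less
# discoverable than a notebook parameter.
batch_date = spark.conf.get("custom.batchDate")
df = spark.table("sales").where(f"sale_date = '{batch_date}'")
```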

❌ C. Manually edit the code every time to change the batch date. This approach is inefficient, error-prone, and contradicts the requirement of avoiding manual changes.

❌ D. Create a dynamic view that automatically calculates the batch date and use this view to query the data. While views can simplify querying, this option doesn't directly address the need for a parameterized batch date that changes with each run. The logic for determining the batch date would still need to be defined, potentially leading back to a solution similar to option B.
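To make the shortcoming of option D concrete, here is a hypothetical sketch: the view bakes the date logic (here, `current_date()`) into its definition, so a batch date that does not follow that fixed rule still requires a code change.

```python
# The view's filter logic is fixed at definition time; it cannot accept a
# per-run parameter the way a notebook widget can.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW daily_sales AS
    SELECT * FROM sales WHERE sale_date = current_date()
""")
```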

❌ E. There is no way to combine a Python variable and Spark code for filtering. This is incorrect. Databricks notebooks seamlessly integrate Python and Spark, allowing Python variables to be used within Spark code, including DataFrame filtering operations.