
Answer-first summary for fast verification
Answer: (B) Create a temporary SQL UDF by specifying the fully qualified name of the UDF class and the path to the JAR file containing the UDF implementation, then use this UDF in a SELECT statement to calculate the moving average over the specified window. (D) Implement the moving-average logic directly in the DataFrame API using a window specification and an aggregate function, then register the result as a temporary view for SQL queries.
Option B is correct because a temporary SQL UDF is reusable across different DataFrames, does not require the DataFrame to be cached, and can be maintained and updated without redeploying the entire application. Option D is also correct: it performs the calculation with the DataFrame API, which is efficient and does not require caching, and the result can be exposed to SQL queries by registering it as a temporary view. Option A is incorrect because, while a pandas_udf is efficient, it requires moving the data into Python memory, which may not be ideal for all scenarios. Option C is incorrect because it does not meet the requirement of being reusable across different DataFrames without redefining the query.
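Option B's temporary SQL UDF can be sketched as follows. The class name, JAR path, and table name are illustrative placeholders, and this assumes the class implements a Hive-style UDAF so it can be used with an OVER clause:

```sql
-- Hypothetical: register a JVM UDF from a JAR as a temporary SQL function.
CREATE TEMPORARY FUNCTION moving_avg
  AS 'com.example.udfs.MovingAverage'
  USING JAR '/path/to/udfs.jar';

-- Use it over a window of the current row and the 4 preceding rows.
SELECT id,
       value,
       moving_avg(value) OVER (
         ORDER BY id
         ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
       ) AS mov_avg
FROM events;
```

Because the function is registered at the session level rather than baked into the application, it can be swapped for a new JAR without redeploying the job.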
Author: LeetQuiz Editorial Team
In a data engineering project using Databricks, you are tasked with calculating the moving average of a column named 'value' over a window of 5 rows in a Spark SQL DataFrame. The solution must meet several key requirements: it should be efficient, reusable across different DataFrames without caching, and easily maintainable and updatable without redeploying the entire application. It should also leverage Spark's capabilities to ensure scalability and performance. Considering these constraints, which of the following approaches best meet all the specified requirements? (Choose two options)
A
Define a Python UDF (User-Defined Function) using the pandas_udf decorator with a Series input and Series output type, then register it as a SQL function. Use this function in a SQL query with the OVER clause to specify the window frame.
B
Create a temporary SQL UDF by specifying the fully qualified name of the UDF class and the path to the JAR file containing the UDF implementation. Use this UDF in a SELECT statement to calculate the moving average over the specified window.
C
Use the built-in AVG function in Spark SQL with the OVER clause to specify the window frame directly in the SQL query without defining any UDF.
D
Implement the moving average calculation logic directly in the DataFrame API using the window function and the aggregate function, then register the result as a temporary view for SQL queries.