
Answer-first summary for fast verification
Answer: (B) Create a temporary SQL UDF by specifying the fully qualified name of the UDF class and the path to the JAR file containing the UDF implementation, then use this UDF in a SELECT statement to calculate the moving average over the specified window. (D) Implement the moving-average logic directly in the DataFrame API using a window specification and an aggregate function, then register the result as a temporary view for SQL queries.
Option B is correct because a temporary SQL UDF is reusable across different DataFrames, does not require the DataFrame to be cached, and can be maintained and updated without redeploying the entire application. Option D is also correct: it performs the calculation with the DataFrame API, which is efficient and does not require caching, and the result can be exposed to SQL queries by registering it as a temporary view. Option A is incorrect because, while a pandas_udf is efficient, it requires moving the data into Python memory, which may not be ideal for all scenarios. Option C is incorrect because it does not meet the requirement of being reusable across different DataFrames without redefining the query.
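Option B's temporary SQL UDF can be sketched as follows. The class name, JAR path, and table name are illustrative placeholders, and this assumes the class implements a Hive-style UDAF so it can be used with an OVER clause:

```sql
-- Hypothetical: register a JVM UDF from a JAR as a temporary SQL function.
CREATE TEMPORARY FUNCTION moving_avg
  AS 'com.example.udfs.MovingAverage'
  USING JAR '/path/to/udfs.jar';

-- Use it over a window of the current row and the 4 preceding rows.
SELECT id,
       value,
       moving_avg(value) OVER (
         ORDER BY id
         ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
       ) AS mov_avg
FROM events;
```

Because the function is registered at the session level rather than baked into the application, it can be swapped for a new JAR without redeploying the job.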
Author: LeetQuiz Editorial Team
In a data engineering project using Databricks, you are tasked with calculating the moving average of a column named 'value' over a window of 5 rows in a Spark SQL DataFrame. The solution must meet several key requirements: it should be efficient, reusable across different DataFrames without caching, and easily maintainable and updatable without redeploying the entire application. It should also leverage Spark's capabilities to ensure scalability and performance. Considering these constraints, which of the following approaches best meet all the specified requirements? (Choose two options)
A
Define a Python UDF (User-Defined Function) using the pandas_udf decorator with a Series input and Series output type, then register it as a SQL function. Use this function in a SQL query with the OVER clause to specify the window frame.
B
Create a temporary SQL UDF by specifying the fully qualified name of the UDF class and the path to the JAR file containing the UDF implementation. Use this UDF in a SELECT statement to calculate the moving average over the specified window.
C
Use the built-in AVG function in Spark SQL with the OVER clause to specify the window frame directly in the SQL query without defining any UDF.
D
Implement the moving average calculation logic directly in the DataFrame API using the window function and the aggregate function, then register the result as a temporary view for SQL queries.