
Answer-first summary for fast verification
Answer: B
result = dbutils.data.Summary(df, 'age')
print(result['mean'], result['median'], result['stddev'])
The intended approach is to use Databricks dbutils data summaries rather than plain Spark aggregations, and option B is the only choice that does so: it calls the summary utility on the DataFrame with the column name, then reads the statistics from the result by key. Note, however, that the utility documented by Databricks is dbutils.data.summarize(df), which renders an interactive summary of the DataFrame (count, mean, percentiles including the median, standard deviation, and more) in the notebook output rather than returning a dictionary; the Summary(df, 'age') spelling here follows the quiz's own notation. Option A is incorrect for this question because it computes the statistics with built-in Spark SQL functions instead of dbutils data summaries. Option C is incorrect because df.describe() returns a summary DataFrame, not a nested dictionary that can be indexed as shown, and describe() does not report the median (df.summary() is needed for percentiles). Option D is incorrect because it uses the agg method instead of dbutils data summaries.
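As a sanity check on what the three statistics mean, they can be computed without Spark at all. The sketch below uses only the Python standard library; the ages list is made up for illustration and is not part of the question:

```python
import statistics

# Hypothetical sample of 'age' values, purely for illustration.
ages = [23, 31, 31, 40, 52]

mean_age = statistics.mean(ages)      # arithmetic mean
median_age = statistics.median(ages)  # middle value of the sorted data
stddev_age = statistics.stdev(ages)   # sample standard deviation

print(mean_age, median_age, round(stddev_age, 4))
```

A quick cross-check like this is useful when validating that a Spark aggregation or a notebook summary is reporting the statistic you expect (e.g., sample vs. population standard deviation).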
Author: LeetQuiz Editorial Team
You are given a Spark DataFrame 'df' with a numerical column 'age'. Write a code snippet that computes the mean, median, and standard deviation of the 'age' column using dbutils data summaries, and explain the steps involved.
A
from pyspark.sql.functions import mean, median, stddev
result = df.select(mean('age'), median('age'), stddev('age'))
result.show()
B
result = dbutils.data.Summary(df, 'age')
print(result['mean'], result['median'], result['stddev'])
C
result = df.describe()
print(result['age']['mean'], result['age']['50%'], result['age']['stddev'])
D
result = df.agg(mean('age'), median('age'), stddev('age'))
result.show()