
Answer-first summary for fast verification
Answer:
from pyspark.sql.functions import covar_samp
result = df.select(covar_samp('price', 'quantity'))
result.show()
The correct approach is to use the 'covar_samp' aggregate function from the 'pyspark.sql.functions' module (PySpark has no function named 'cov' there; the module provides 'covar_samp' for sample covariance and 'covar_pop' for population covariance, and 'df.stat.cov(col1, col2)' is an equivalent shortcut that returns the sample covariance as a float). Option A does this correctly: import the aggregate, apply it to the two columns with 'select', and display the single-row result. Option B is incorrect because 'df.stat.corr' computes the Pearson correlation coefficient, not the covariance. Option C is incorrect because it computes the covariance between a derived column ('price' * 'quantity') and 'price', which is not the covariance of 'price' and 'quantity' that was asked for (and 'cov' is not a real function in 'pyspark.sql.functions'). Option D is incorrect as written: it relies on the nonexistent 'cov' function, and the empty 'groupBy()' is unnecessary anyway, since an aggregate applied via 'select' or 'agg' already operates over the whole DataFrame.
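The sample covariance that 'covar_samp' (and 'df.stat.cov') returns follows the standard formula sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1). A minimal pure-Python sketch of that formula, useful for sanity-checking the Spark result on small data (the helper name 'sample_cov' and the example values are illustrative, not part of PySpark):

```python
# Pure-Python sample covariance: the same quantity PySpark's
# covar_samp / df.stat.cov computes (note the n - 1 denominator).
def sample_cov(xs, ys):
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

price = [10.0, 12.0, 14.0]
quantity = [3.0, 2.0, 1.0]
print(sample_cov(price, quantity))  # -2.0
```

On a Spark DataFrame built from these same rows, df.stat.cov('price', 'quantity') should return the same value.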
Author: LeetQuiz Editorial Team
You are given a Spark DataFrame 'df' with a numerical column 'price'. Write a code snippet that computes the covariance between the 'price' column and another numerical column 'quantity', and explain the steps involved.
A
from pyspark.sql.functions import covar_samp
result = df.select(covar_samp('price', 'quantity'))
result.show()
B
result = df.stat.corr('price', 'quantity')
print(result)
C
result = df.withColumn('price_quantity', df.price * df.quantity)
result = result.select(cov('price_quantity', 'price'))
result.show()
D
result = df.groupBy().agg(cov('price', 'quantity'))
result.show()