
Answer-first summary for fast verification
Answer: B
    mean = df.temperature.mean()
    std_dev = df.temperature.stddev()
    lower_bound = mean - 3 * std_dev
    upper_bound = mean + 3 * std_dev
    df = df.filter((df.temperature > lower_bound) & (df.temperature < upper_bound))
    print(B)
To remove outliers beyond 3 standard deviations from the mean, first compute the mean and standard deviation of the 'temperature' column. Then derive the lower bound by subtracting 3 times the standard deviation from the mean, and the upper bound by adding it. Finally, filter the DataFrame to keep only the rows strictly between these bounds. Option B follows exactly these steps. Option A is incorrect because it applies only the upper bound, so unusually low values are never removed. Option C is incorrect because it calls 'agg' on the column; 'agg' is a DataFrame method that returns a DataFrame, not a scalar that can be used in the bound arithmetic. Option D is incorrect for a similar reason: 'summary' is a DataFrame method that returns a DataFrame of string-formatted statistics, not numeric values. The options are written in simplified form: in actual PySpark a Column has no mean() or stddev() method, so the two statistics are obtained through a DataFrame aggregation, for example with pyspark.sql.functions.mean and pyspark.sql.functions.stddev.
Author: LeetQuiz Editorial Team
You are given a Spark DataFrame 'df' with a numerical column 'temperature'. Write a code snippet that removes outliers from the 'temperature' column that are beyond 3 standard deviations from the mean, and explain the steps involved.
A
    mean = df.temperature.mean()
    std_dev = df.temperature.stddev()
    threshold = mean + 3 * std_dev
    df = df.filter(df.temperature < threshold)
    print(A)
B
    mean = df.temperature.mean()
    std_dev = df.temperature.stddev()
    lower_bound = mean - 3 * std_dev
    upper_bound = mean + 3 * std_dev
    df = df.filter((df.temperature > lower_bound) & (df.temperature < upper_bound))
    print(B)
C
    mean = df.temperature.agg('mean')
    std_dev = df.temperature.agg('stddev')
    df = df.filter((df.temperature > (mean - 3 * std_dev)) & (df.temperature < (mean + 3 * std_dev)))
    print(C)
D
    mean = df.temperature.summary('mean')
    std_dev = df.temperature.summary('stddev')
    df = df.filter((df.temperature > (mean - 3 * std_dev)) & (df.temperature < (mean + 3 * std_dev)))
    print(D)