You are given a Spark DataFrame 'df' with a numerical column 'temperature'. Write a code snippet that removes outliers from the 'temperature' column that are beyond 3 standard deviations from the mean, and explain the steps involved.
A
mean = df.temperature.mean()
std_dev = df.temperature.stddev()
threshold = mean + 3 * std_dev
df = df.filter(df.temperature < threshold)
B
lower_bound = mean - 3 * std_dev
upper_bound = mean + 3 * std_dev
df = df.filter((df.temperature > lower_bound) & (df.temperature < upper_bound))
C
mean = df.temperature.agg('mean')
std_dev = df.temperature.agg('stddev')
df = df.filter((df.temperature > (mean - 3 * std_dev)) & (df.temperature < (mean + 3 * std_dev)))
D
mean = df.temperature.summary('mean')
std_dev = df.temperature.summary('stddev')