
Answer-first summary for fast verification
Answer: A
    Q1 = df.height.quantile(0.25)
    Q3 = df.height.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df.filter((df.height >= lower_bound) & (df.height <= upper_bound))
    print(A)
The interquartile range (IQR) method removes outliers in four steps. First, compute the first quartile (Q1) and the third quartile (Q3) of 'height'. Second, compute the IQR as Q3 - Q1. Third, set the lower bound to Q1 - 1.5 * IQR and the upper bound to Q3 + 1.5 * IQR. Finally, filter the DataFrame to keep only the rows whose 'height' lies within these bounds. Option A follows these steps and is the answer the quiz marks as correct. The quiz rejects Option B for calling 'approxQuantile' instead of 'quantile', and Option C for calling 'percentile', which Spark DataFrame columns do not provide. Option D is incorrect because it computes the IQR as the sum Q3 + Q1 rather than the difference Q3 - Q1, which for positive heights pushes both bounds far too wide. One caveat: the options are best read as pseudocode. In PySpark, a Column such as df.height has no 'quantile' or 'approxQuantile' method; quantiles are obtained from the DataFrame itself with df.approxQuantile(col, probabilities, relativeError), which returns a list of values.
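The four steps can be sketched against the PySpark DataFrame API. This is a minimal sketch, not the quiz's own answer: it assumes `df` is a PySpark DataFrame, and it uses `DataFrame.approxQuantile` and `Column.between` (inclusive on both ends) in place of the column-level `quantile` call shown in the options. The function name `remove_iqr_outliers` and the `relative_error` default are illustrative choices.

```python
def remove_iqr_outliers(df, column="height", relative_error=0.01):
    """Keep only rows whose `column` value lies within the IQR fences.

    Assumes `df` is a PySpark DataFrame with at least one non-null value in
    `column`; on an empty column approxQuantile returns an empty list and the
    unpacking below fails.
    """
    # Step 1: approximate Q1 and Q3 in one pass over the data.
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], relative_error)

    # Step 2: the IQR is the difference between the quartiles.
    iqr = q3 - q1

    # Step 3: fences at 1.5 * IQR beyond each quartile.
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Step 4: between() is inclusive, matching the >= and <= in Option A.
    return df.filter(df[column].between(lower_bound, upper_bound))
```

With a live SparkSession this would be called as `df = remove_iqr_outliers(df)`. Setting `relative_error` to 0 requests exact quantiles, which can be expensive on large data.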
Author: LeetQuiz Editorial Team
You are given a Spark DataFrame 'df' with a numerical column 'height'. Write a code snippet that removes outliers from the 'height' column based on the interquartile range (IQR) method, and explain the steps involved.
A
    Q1 = df.height.quantile(0.25)
    Q3 = df.height.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df.filter((df.height >= lower_bound) & (df.height <= upper_bound))
    print(A)
B
    Q1 = df.height.approxQuantile(0.25, 0.01)
    Q3 = df.height.approxQuantile(0.75, 0.01)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df.filter((df.height > lower_bound) & (df.height < upper_bound))
    print(B)
C
    Q1 = df.height.percentile(0.25)
    Q3 = df.height.percentile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df.filter((df.height > lower_bound) & (df.height < upper_bound))
    print(C)
D
    Q1 = df.height.approxQuantile(0.25, 0.01)
    Q3 = df.height.approxQuantile(0.75, 0.01)
    IQR = Q3 + Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df.filter((df.height >= lower_bound) & (df.height <= upper_bound))
    print(D)
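To see why the sum in Option D matters, the bounds can be worked out by hand. The short example below uses plain Python with hypothetical quartiles (Q1 = 160 and Q3 = 180, imagined as heights in centimetres); the helper `iqr_bounds` and its `use_sum` flag are illustrative names, not part of any library.

```python
def iqr_bounds(q1, q3, use_sum=False):
    """Return (lower, upper) fences; use_sum=True reproduces Option D's mistake."""
    iqr = q3 + q1 if use_sum else q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical quartiles, chosen only for illustration.
q1, q3 = 160.0, 180.0

correct = iqr_bounds(q1, q3)                 # IQR = 20
option_d = iqr_bounds(q1, q3, use_sum=True)  # "IQR" = 340

print(correct)   # (130.0, 210.0)
print(option_d)  # (-350.0, 690.0)
```

With the sum, the accepted range runs from -350 to 690, so no realistic height would ever be flagged as an outlier and the filter removes nothing.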