
Answer-first summary for fast verification
Answer:
lower_bound = df.salary.approxQuantile(0.1, 0.01)
upper_bound = df.salary.approxQuantile(0.9, 0.01)
df = df.filter((df.salary >= lower_bound) & (df.salary <= upper_bound))
print(D)
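The correct approach to removing outliers based on percentiles is to first calculate the lower and upper bounds with the 'approxQuantile' method, which approximates quantile values within a specified relative error (0.01 here), and then filter the DataFrame to keep only the rows whose salary lies within those bounds, inclusive. Option D is the only option that follows this approach. Note that in actual PySpark, 'approxQuantile' is a DataFrame method rather than a column method: it is called as df.approxQuantile('salary', [0.1, 0.9], 0.01) and returns a list with one value per requested probability, so option D as written should be read as shorthand rather than runnable code. Option A is incorrect because it uses the 'percentile' method, which is not available in Spark DataFrames. Option B is incorrect because it uses the minimum and maximum values instead of percentiles, so its strict comparisons would only drop the rows equal to the extremes. Option C is incorrect because it uses the 'quantile' method, which is not available in Spark DataFrames.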
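The two steps described above (compute the bounds, then filter) can be sketched in runnable PySpark as follows. This is a minimal sketch, not the quiz's official solution: the helper name remove_percentile_outliers, its parameter names, and the demo function are hypothetical and chosen only for illustration. It assumes the column contains at least one non-null numeric value, so that approxQuantile returns two bounds.

```python
def remove_percentile_outliers(df, column, lower_q=0.1, upper_q=0.9, rel_error=0.01):
    """Keep only rows whose `column` value lies between two approximate quantiles.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    # DataFrame.approxQuantile takes a column name, a list of probabilities,
    # and a relative error; it returns one value per probability.
    lower_bound, upper_bound = df.approxQuantile(column, [lower_q, upper_q], rel_error)
    # Inclusive bounds: rows exactly at either percentile are kept.
    return df.filter((df[column] >= lower_bound) & (df[column] <= upper_bound))


def demo():
    """Hypothetical usage; requires a local PySpark installation."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("outlier-demo").getOrCreate()
    # Salaries 1.0 .. 100.0 in a single 'salary' column.
    df = spark.createDataFrame([(float(s),) for s in range(1, 101)], ["salary"])
    remove_percentile_outliers(df, "salary").show()
    spark.stop()
```

Both bounds are requested in a single approxQuantile call so that Spark scans the column once. Passing a relative error of 0.0 asks for exact quantiles, at a potentially much higher cost on large data.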
Author: LeetQuiz Editorial Team
You are given a Spark DataFrame 'df' with a numerical column 'salary'. Write a code snippet that removes outliers from the 'salary' column that are less than the 10th percentile or greater than the 90th percentile, and explain the steps involved.
A
lower_bound = df.salary.percentile(0.1)
upper_bound = df.salary.percentile(0.9)
df = df.filter((df.salary >= lower_bound) & (df.salary <= upper_bound))
print(A)
B
lower_bound = df.salary.min()
upper_bound = df.salary.max()
df = df.filter((df.salary > lower_bound) & (df.salary < upper_bound))
print(B)
C
lower_bound = df.salary.quantile(0.1)
upper_bound = df.salary.quantile(0.9)
df = df.filter((df.salary > lower_bound) & (df.salary < upper_bound))
print(C)
D
lower_bound = df.salary.approxQuantile(0.1, 0.01)
upper_bound = df.salary.approxQuantile(0.9, 0.01)
df = df.filter((df.salary >= lower_bound) & (df.salary <= upper_bound))
print(D)