Databricks Certified Machine Learning - Associate


You are given a Spark DataFrame 'df' with a numerical column 'salary'. Write a code snippet that removes outliers from the 'salary' column that are less than the 10th percentile or greater than the 90th percentile, and explain the steps involved.




Explanation:

To remove outliers by percentile, first compute the lower and upper bounds with the DataFrame's 'approxQuantile' method, which returns approximate quantile values within a specified relative error. Then filter the DataFrame to keep only the rows whose 'salary' value falls within those bounds. Option D does this correctly. Option A is incorrect because it calls a 'percentile' method, which Spark DataFrames do not provide. Option B is incorrect because it uses the minimum and maximum values instead of percentiles. Option C is incorrect because it calls a 'quantile' method, which Spark DataFrames do not provide either.
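
The two steps described above can be sketched roughly as follows. This is an illustration only, not the text of Option D, which is not reproduced here. The function name 'remove_salary_outliers' and the relative error of 0.01 are assumptions made for the example; 'approxQuantile' and 'filter' are the standard DataFrame methods.

```python
def remove_salary_outliers(df, column="salary", lower=0.10, upper=0.90,
                           relative_error=0.01):
    """Keep rows whose `column` value lies between the two quantile bounds.

    Hypothetical helper for illustration; `df` is expected to be a Spark
    DataFrame with a numerical column named by `column`.
    """
    # Step 1: approxQuantile returns one value per requested probability,
    # in the same order. A smaller relative_error is more precise but
    # more expensive; 0.0 requests exact quantiles.
    low, high = df.approxQuantile(column, [lower, upper], relative_error)

    # Step 2: keep only the rows inside the bounds (inclusive). Rows where
    # the column is null do not satisfy the comparison and are dropped.
    return df.filter((df[column] >= low) & (df[column] <= high))
```

With an existing DataFrame, the call would look like 'df_clean = remove_salary_outliers(df)'. Whether the bounds themselves are kept (>= and <=) or excluded is a design choice; the inclusive form matches the question's wording, which removes only values strictly below the 10th or above the 90th percentile.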