You are given a Spark DataFrame 'df' with a numerical column 'salary'. Write a code snippet that removes outliers from the 'salary' column that are less than the 10th percentile or greater than the 90th percentile, and explain the steps involved.


lower_bound = df.salary.percentile(0.1)
upper_bound = df.salary.percentile(0.9)
df = df.filter((df.salary >= lower_bound) & (df.salary <= upper_bound))
print(A)

27.5%

lower_bound = df.salary.min()
upper_bound = df.salary.max()
df = df.filter((df.salary > lower_bound) & (df.salary < upper_bound))
print(B)

2.0%

lower_bound = df.salary.quantile(0.1)
upper_bound = df.salary.quantile(0.9)
df = df.filter((df.salary > lower_bound) & (df.salary < upper_bound))
print(C)

27.5%

lower_bound = df.salary.approxQuantile(0.1, 0.01)
upper_bound = df.salary.approxQuantile(0.9, 0.01)
df = df.filter((df.salary >= lower_bound) & (df.salary <= upper_bound))
print(D)

43.1%

Databricks Certified Machine Learning - Associate

