Databricks Certified Machine Learning - Associate

Get started today

Ultimate access to all questions.

Explanation:

The most suitable imputation strategy for a categorical variable such as 'happiness_tier' is to use the mode, which is the most frequent value in the dataset. This approach is preferred because:

Categorical variables represent classifications that don't have a numerical meaning, making methods like mean or median inappropriate.
Mode ensures that the imputed value is a valid category, maintaining the integrity of the data.
No imputation can lead to issues in analyses that require complete datasets, while mean/median imputation could introduce nonsensical values into categorical data.

Thus, imputing missing values in categorical columns with the mode is the most effective strategy.

Explanation:

The most suitable imputation strategy for a categorical variable such as 'happiness_tier' is to use the mode, which is the most frequent value in the dataset. This approach is preferred because:

Categorical variables represent classifications that don't have a numerical meaning, making methods like mean or median inappropriate.
Mode ensures that the imputed value is a valid category, maintaining the integrity of the data.
No imputation can lead to issues in analyses that require complete datasets, while mean/median imputation could introduce nonsensical values into categorical data.

Thus, imputing missing values in categorical columns with the mode is the most effective strategy.

Comments (0)

No comments yet.

When dealing with missing values in a dataset that includes a categorical variable like 'happiness_tier', what is the most appropriate imputation strategy?

Real Exam

No imputation is necessary for categorical variables.

3.0%

Impute using the median.

6.1%

Impute using the mean.

6.1%

Impute using the most common value (or mode).

84.8%