
Answer-first summary for fast verification
Answer: happiness_tier
The most appropriate columns for imputing missing values with the most common value (mode) are those that are categorical or discrete. In this scenario:

- **happiness_tier (Option B)**: This column is likely categorical (given its string data type and name), making mode imputation a standard and effective approach. It ensures the imputed values align with the existing data distribution without introducing anomalies.
- **Units (Option D)**: Although 'units' is an integer column, it probably represents a count or quantity. For such numerical data, mean or median imputation might be more suitable, depending on the data's distribution.
- **Customer_id (Options C and E)**: As the primary key, 'customer_id' should uniquely identify each row. Missing values here may indicate data collection issues that should be resolved directly rather than through imputation.
- **Spend (Option A)**: Represented as a double, 'spend' is a continuous numerical value. Mean or median imputation would typically be preferred over mode for such data.

In summary, mode imputation is most fitting for categorical data like 'happiness_tier', while numerical data may require different imputation strategies.
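The reasoning in this explanation can be illustrated with a minimal sketch using only the Python standard library. The sample rows, the `impute` helper, and the choice of median for the numeric columns are assumptions made for illustration; they are not part of the original question, and a real pipeline would more likely use a DataFrame library.

```python
from statistics import median, mode

# Hypothetical sample rows following the schema in the question;
# None marks a missing value.
rows = [
    {"customer_id": "c1", "spend": 10.0, "units": 1, "happiness_tier": "high"},
    {"customer_id": "c2", "spend": None, "units": 3, "happiness_tier": "low"},
    {"customer_id": "c3", "spend": 30.0, "units": None, "happiness_tier": None},
    {"customer_id": "c4", "spend": 50.0, "units": 2, "happiness_tier": "high"},
]


def impute(rows, column, strategy):
    """Return copies of rows with missing values in `column` filled.

    `strategy` is a function such as statistics.mode or statistics.median
    that is applied to the observed (non-missing) values of the column.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill = strategy(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in rows
    ]


# Categorical column: fill with the most common value (mode).
imputed = impute(rows, "happiness_tier", mode)
# Numeric columns: median is assumed here; mean is another common choice.
imputed = impute(imputed, "spend", median)
imputed = impute(imputed, "units", median)
# customer_id is the primary key, so it is deliberately left untouched.
```

Passing the statistic in as a function keeps the per-column decision explicit: only 'happiness_tier' receives the mode, while the key column is never imputed.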
Author: LeetQuiz Editorial Team
A data scientist is working with a feature set that includes the following schema: customer_id STRING, spend DOUBLE, units INTEGER, happiness_tier STRING. The customer_id column serves as the primary key. Each column in the feature set contains some missing values. The scientist plans to replace these missing values by imputing a common value for each feature. Which columns from the feature set are best suited for imputation using the most common value (mode) of the column?
A. Spend
B. happiness_tier
C. Customer_id, happiness_tier
D. Units
E. Customer_id