
Explanation:
The most appropriate columns for imputing missing values with the most common value (mode) are those that are categorical or discrete. In this scenario:
happiness_tier (Option B): This column is a string and, judging by its name, holds a small set of category labels, so mode imputation is the standard approach. The imputed value is always a category that already occurs in the data.
units (Option D): 'units' is an integer column that probably represents a count or quantity. For such numerical data, mean or median imputation is usually more suitable, depending on the data's distribution.
customer_id (Options C and E): As the primary key, 'customer_id' must uniquely identify each row, so filling every gap with the same value would create duplicate keys. Missing values here indicate a data collection issue that should be resolved at the source rather than through imputation.
spend (Option A): Stored as a double, 'spend' is a continuous numerical value. Mean or median imputation is typically preferred over mode for such data.
In summary, mode imputation is most fitting for categorical data like 'happiness_tier', while numerical data may require different imputation strategies.
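The reasoning above can be illustrated with a minimal sketch in plain Python. The rows, the helper name fill_missing, and the choice of median for the numeric columns are assumptions made for illustration only; the question itself does not specify any data or tooling.

```python
from statistics import median, mode

# Hypothetical toy rows following the schema in the question; None marks a missing value.
rows = [
    {"customer_id": "c1", "spend": 10.0, "units": 1, "happiness_tier": "high"},
    {"customer_id": "c2", "spend": None, "units": 3, "happiness_tier": "low"},
    {"customer_id": "c3", "spend": 30.0, "units": None, "happiness_tier": None},
    {"customer_id": "c4", "spend": 80.0, "units": 2, "happiness_tier": "high"},
]


def fill_missing(rows, column, strategy):
    """Return copies of rows with None in `column` replaced by strategy(observed values)."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = strategy(observed)
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in rows
    ]


# Categorical column: the most common label is a sensible fill value.
imputed = fill_missing(rows, "happiness_tier", mode)
# Numeric columns: median is used here as one reasonable alternative to mode.
imputed = fill_missing(imputed, "spend", median)
imputed = fill_missing(imputed, "units", median)
# customer_id is deliberately left alone: a shared fill value would break key uniqueness.
```

In this sketch only happiness_tier is filled with its mode; the numeric columns use a different statistic, and the primary key is not imputed at all.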
A data scientist is working with a feature set that includes the following schema: customer_id STRING, spend DOUBLE, units INTEGER, happiness_tier STRING. The customer_id column serves as the primary key. Each column in the feature set contains some missing values. The scientist plans to replace these missing values by imputing a common value for each feature. Which columns from the feature set are best suited for imputation using the most common value (mode) of the column?
A. spend
B. happiness_tier
C. customer_id, happiness_tier
D. units
E. customer_id