
Explanation:
Correct Option: C. One-hot encoding
One-hot encoding is the most suitable technique for managing the categorical data in this scenario because it converts a categorical variable into numeric columns that a linear regression model can use directly. It creates one binary column per category, marking the presence of that category with a 1 and its absence with a 0. Because no artificial ordering is imposed on the categories, each coefficient can be read as the effect of a single category, which is particularly useful when the interpretability of a linear regression model is crucial.
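As a rough illustration of the transformation described above, the following minimal sketch one-hot encodes a 'Product Category' column in plain Python. The helper name `one_hot_encode` and the sample values are assumptions made for this example only; in practice a library routine such as pandas `get_dummies` or scikit-learn `OneHotEncoder` would normally be used instead.

```python
def one_hot_encode(values):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    # For each input value, emit 1 in the matching category column and 0 elsewhere.
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Hypothetical sample values for the 'Product Category' feature.
product_category = ["Electronics", "Clothing", "Home Appliances", "Clothing"]

columns, encoded = one_hot_encode(product_category)
print(columns)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, so the encoding carries no implied order between categories. When the regression includes an intercept, one of the binary columns is commonly dropped so that the remaining columns are not perfectly collinear with it.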
Why other options are incorrect:
In the context of preparing data for a machine learning model, you are working with a dataset that includes categorical variables such as 'Product Category' with values like 'Electronics', 'Clothing', and 'Home Appliances'. The dataset also contains numerical features. Your goal is to preprocess this data to ensure optimal performance of a linear regression model, considering constraints like computational efficiency and the interpretability of the model. Which of the following techniques should you employ to manage the categorical data effectively? Choose the best option.
A. Data augmentation to artificially increase the size of the dataset by creating variations of the existing data points.
B. Normalization to scale all numerical features to a range between 0 and 1, without addressing the categorical data.
C. One-hot encoding to transform each category value into a new binary column, enabling the model to interpret categorical variables numerically.
D. Standardization to adjust all features to have a mean of 0 and a standard deviation of 1, focusing solely on numerical features.
E. Both one-hot encoding for categorical variables and standardization for numerical features to ensure all data is appropriately scaled and interpretable by the model.