
Answer-first summary for fast verification
Answer: C. One-hot encoding to transform each category value into a new binary column, enabling the model to interpret categorical variables numerically. E (both One-hot encoding for categorical variables and Standardization for numerical features) is also listed as correct in the broader context of data preprocessing.
**Correct Option: C. One-hot encoding**

One-hot encoding is the most suitable technique for managing categorical data in this scenario because it converts categorical variables into a numerical form that ML algorithms can use directly. It creates one binary column per category and marks the presence of that category with a '1' and its absence with a '0'. This is particularly useful for linear regression, where interpretability is crucial: each binary column receives its own coefficient, so the effect of each category can be read off directly.

**Why other options are incorrect:**
- **A. Data augmentation**: This technique is not relevant to handling categorical data; it increases the diversity of the training data without collecting new data.
- **B. Normalization**: Normalization is important for scaling numerical features, but it does not convert categorical data into a numerical format.
- **D. Standardization**: Like normalization, standardization scales numerical features and does not help in managing categorical data.
- **E. Both One-hot encoding and Standardization**: This option pairs the correct approach for categorical data with appropriate scaling for numerical features, but the question asks specifically for the technique to manage categorical data, which makes 'C' the best single answer. 'E' is nonetheless correct in the broader context of data preprocessing.
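To make the binary-column idea concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper name `one_hot_encode` and the four sample rows are hypothetical, chosen only to mirror the 'Product Category' values in the question. In practice a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row holds one 0/1 flag per category."""
    # Sort the distinct values so the column order is reproducible.
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # mark the single category that is present
        rows.append(row)
    return categories, rows


# Hypothetical sample column, using the category names from the question.
product_category = ["Electronics", "Clothing", "Home Appliances", "Clothing"]
categories, encoded = one_hot_encode(product_category)

for original, row in zip(product_category, encoded):
    print(f"{original:16s} -> {row}")
```

Each row contains exactly one '1'. When the regression also fits an intercept, the full set of binary columns is perfectly collinear with it, so one column is commonly dropped; the remaining coefficients are then read relative to the dropped category.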
Author: LeetQuiz Editorial Team
In the context of preparing data for a machine learning model, you are working with a dataset that includes categorical variables such as 'Product Category' with values like 'Electronics', 'Clothing', and 'Home Appliances'. The dataset also contains numerical features. Your goal is to preprocess this data to ensure optimal performance of a linear regression model, considering constraints like computational efficiency and the interpretability of the model. Which of the following techniques should you employ to manage the categorical data effectively? Choose the best option.
A
Data augmentation to artificially increase the size of the dataset by creating variations of the existing data points.
B
Normalization to scale all numerical features to a range between 0 and 1, without addressing the categorical data.
C
One-hot encoding to transform each category value into a new binary column, enabling the model to interpret categorical variables numerically.
D
Standardization to adjust all features to have a mean of 0 and a standard deviation of 1, focusing solely on numerical features.
E
Both One-hot encoding for categorical variables and Standardization for numerical features to ensure all data is appropriately scaled and interpretable by the model.
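Option E combines two steps, and the following sketch shows how they might fit together using only the Python standard library. The `price` column, its values, and the helper names `standardize` and `one_hot` are assumptions made for illustration; the question does not name its numerical features. Standardization here means subtracting the mean and dividing by the population standard deviation.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def one_hot(column):
    """Return (levels, flags) with one 0/1 flag per distinct level."""
    levels = sorted(set(column))
    flags = [[1 if value == level else 0 for level in levels] for value in column]
    return levels, flags


# Hypothetical data: category names from the question, made-up prices.
product_category = ["Electronics", "Clothing", "Home Appliances", "Clothing"]
price = [200.0, 40.0, 120.0, 40.0]

levels, flags = one_hot(product_category)
scaled_price = standardize(price)

# Final feature rows: the binary category flags followed by the scaled price.
features = [flag_row + [p] for flag_row, p in zip(flags, scaled_price)]
```

Only the numerical column is scaled; the binary flags are left as 0 and 1 so that they stay directly interpretable.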