
Answer-first summary for fast verification
Answer: Convert the categorical string data to one-hot hash buckets.
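Of the listed options, the most suitable way to encode categorical columns with more than 10,000 unique values is one-hot hash buckets. Each string is hashed, and the hash modulo a fixed bucket count selects the position that is set to 1. The input width is therefore determined by the number of buckets rather than by the size of the vocabulary, and values not seen during training still map to a valid bucket. The trade-off is that distinct values can collide in the same bucket. Plain integer encoding (option A) implies an ordering among categories that does not exist. A vector of boolean values (option C) amounts to a full one-hot encoding, which needs one dimension per unique value. Run-length encoding (option D) is a compression scheme, not a feature encoding for model input. Therefore, option B, converting the categorical string data to one-hot hash buckets, is the correct answer.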
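A minimal sketch of the hash-bucket idea in plain Python, for illustration only. The helper names (hash_bucket, one_hot_hash), the bucket count of 1000, and the sample city values are assumptions and are not taken from the question. A real pipeline would normally use the framework's own hashing layer and a sparse representation.

```python
import hashlib
from typing import List


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a string to a stable bucket index in [0, num_buckets)."""
    # hashlib is used instead of the built-in hash(), which is salted per process.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def one_hot_hash(value: str, num_buckets: int) -> List[int]:
    """Return a one-hot vector whose width is the bucket count, not the vocabulary size."""
    vec = [0] * num_buckets
    vec[hash_bucket(value, num_buckets)] = 1
    return vec


NUM_BUCKETS = 1000  # assumed value; tune against the acceptable collision rate
cities = ["tokyo", "lagos", "lima", "tokyo"]  # hypothetical sample column
encoded = [one_hot_hash(c, NUM_BUCKETS) for c in cities]
```

Every vector has length NUM_BUCKETS regardless of how many distinct strings appear, and the same string always lands in the same bucket. Different strings may share a bucket, so the bucket count is a trade-off between input width and collision rate.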
Author: LeetQuiz Editorial Team
You are developing a deep neural network classification model, and your dataset contains several columns with categorical input values. Some of these columns have a high cardinality, with more than 10,000 unique values. Considering the complexity and the amount of unique categorical values, what is the most suitable method to encode these categorical values for input into the model?
A
Convert each categorical value into an integer value.
B
Convert the categorical string data to one-hot hash buckets.
C
Map the categorical variables into a vector of boolean values.
D
Convert each categorical value into a run-length encoded string.