
Answer-first summary for fast verification
Answer: Use one-hot hash buckets to convert the categorical string data, efficiently managing the dimensionality by representing each unique value as a binary vector.
For categorical columns with high cardinality (over 10,000 unique values), one-hot hash buckets are an efficient way to convert string data to numerical input. Each string is passed through a hash function and mapped to one of a fixed number of buckets, and the bucket index is then one-hot encoded: one element of the vector is set to 1 and the rest to 0. The vector's length therefore equals the number of buckets, which you choose, rather than the number of unique values in the column. This keeps dimensionality bounded for high-cardinality data, at the cost that distinct values can collide in the same bucket. For more details, refer to [Google Cloud's documentation on XGBoost](https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost) and discussions on [Stack Overflow](https://stackoverflow.com/questions/26473233/in-preprocessing-data-with-high-cardinality-do-you-hash-first-or-one-hot-encode).
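The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular framework: the bucket count `NUM_BUCKETS`, the helper names `hash_bucket` and `one_hot_hash`, and the choice of MD5 as a stable hash are all assumptions made for the example.

```python
import hashlib
from typing import List

# Hypothetical bucket count, chosen only for illustration; in practice it is
# tuned to trade memory against hash collisions.
NUM_BUCKETS = 1000


def hash_bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a string to a stable bucket index in [0, num_buckets)."""
    # hashlib gives the same result across runs, unlike Python's built-in
    # hash(), which is salted per process for strings.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def one_hot_hash(value: str, num_buckets: int = NUM_BUCKETS) -> List[int]:
    """One-hot encode the bucket index of a categorical string value."""
    vector = [0] * num_buckets
    vector[hash_bucket(value, num_buckets)] = 1
    return vector


encoded = one_hot_hash("user_123")
print(len(encoded), sum(encoded))
```

The encoded vector has `NUM_BUCKETS` entries regardless of how many distinct strings appear in the column, and exactly one entry is set. Two different strings may land in the same bucket, so a larger bucket count reduces collisions at the cost of a wider input.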
Author: LeetQuiz Editorial Team
In the process of developing a deep neural network classification model, you encounter a dataset with categorical input values where some columns have a high cardinality of over 10,000 unique values. Considering the need for efficiency and scalability in model training, along with the constraints of managing high-dimensional data, what is the most effective method to encode these categorical values? Choose the best option.
A
Transform each categorical value into an integer value, which may lead to an arbitrary ordering of categories and potentially mislead the model.
B
Encode the categorical variables into a vector of boolean values, which could result in an excessively sparse and high-dimensional representation.
C
Use one-hot hash buckets to convert the categorical string data, efficiently managing the dimensionality by representing each unique value as a binary vector.
D
Apply run-length encoding to convert each categorical value into a string, which is not suitable for non-sequential data and may not capture the categorical nature effectively.