
Answer-first summary for fast verification
Answer: B — Utilize one-hot encoding with hash buckets to efficiently represent the categorical data.
One-hot encoding with hash buckets (the "hashing trick") is the most effective method for high-cardinality categorical data: each category is hashed into one of a fixed number of buckets, which bounds the input dimensionality regardless of vocabulary size. This avoids storing a 10,000+-entry vocabulary per column, keeps the representation compact for model training, and scales to unseen categories at serving time — all of which matter for a real-time, cost-sensitive system. Option E proposes a hybrid: exact one-hot encoding for columns with fewer than 100 unique values and hash buckets above that threshold. This is worth considering when a dataset mixes low- and high-cardinality features, since small vocabularies gain nothing from hashing and lose nothing to collisions. For the specific scenario described, however, where the problematic columns all exceed 10,000 unique values, one-hot encoding with hash buckets (Option B) remains the optimal choice.
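To make the bucketing concrete, here is a minimal sketch of hash-bucket one-hot encoding. The function name, the use of MD5, and the bucket count are illustrative choices, not a prescribed API; production systems typically use a framework's built-in hashed feature column instead.

```python
import hashlib

def hash_bucket_one_hot(value, num_buckets=1000):
    """Map a categorical string to a fixed-size one-hot vector via hashing.

    num_buckets is an illustrative default; in practice it is tuned to
    trade off hash collisions against vector size.
    """
    # Use a stable hash (the built-in hash() is salted per process,
    # so it would encode the same category differently across runs).
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    vec = [0] * num_buckets
    vec[bucket] = 1
    return vec

# A column with 10,000+ unique values collapses to num_buckets
# dimensions, and unseen categories at serving time still map
# deterministically to some bucket.
v = hash_bucket_one_hot("customer_segment_4821", num_buckets=8)
```

Note that distinct categories can collide in the same bucket; with a reasonable bucket count this acts as a mild, acceptable regularizer rather than a correctness problem.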
Author: LeetQuiz Editorial Team
In the context of developing a deep neural network for classification tasks, you are presented with a dataset that includes categorical inputs. Some of these categorical columns contain over 10,000 unique values, posing a challenge for effective encoding. The dataset is part of a larger project aimed at predicting customer behavior in real-time, requiring the solution to be scalable and cost-efficient. Given these constraints, which of the following methods is MOST effective for encoding these high-cardinality categorical values as model inputs? Choose the best option.
A
Assign a unique numerical value to each categorical value, maintaining a direct mapping.
B
Utilize one-hot encoding with hash buckets to efficiently represent the categorical data.
C
Convert each categorical variable into a vector of boolean values, indicating the presence or absence of each category.
D
Apply run-length encoding to compress the categorical string data, reducing the input size.
E
Implement both one-hot encoding for categories with fewer than 100 unique values and hash buckets for those exceeding this threshold.