
Ultimate access to all questions.
In the context of developing a deep neural network for classification tasks, you are presented with a dataset that includes categorical inputs. Some of these categorical columns contain over 10,000 unique values, posing a challenge for effective encoding. The dataset is part of a larger project aimed at predicting customer behavior in real-time, requiring the solution to be scalable and cost-efficient. Given these constraints, which of the following methods is MOST effective for encoding these high-cardinality categorical values as model inputs? Choose the best option.
A
Assign a unique numerical value to each categorical value, maintaining a direct mapping.
B
Utilize one-hot encoding with hash buckets to efficiently represent the categorical data.
C
Convert each categorical variable into a vector of boolean values, indicating the presence or absence of each category.
D
Apply run-length encoding to compress the categorical string data, reducing the input size.
E
Implement both one-hot encoding for categories with fewer than 100 unique values and hash buckets for those exceeding this threshold.