
You are tasked with developing a natural language processing (NLP) model for text classification on a dataset that includes millions of product descriptions and 100,000 unique words. The model will be implemented using a recurrent neural network (RNN). Given the scale of the dataset and the complexity of the task, which preprocessing method is most appropriate for preparing the words as inputs to the RNN? Consider the need for efficiency, scalability, and the ability to capture semantic relationships between words. Choose the best option from the following:
A
Generate a one-hot encoding for each word, resulting in sparse vectors of length 100,000, and use these encodings as inputs to your model.
B
Utilize word embeddings from a pre-trained model such as Word2Vec or GloVe, which provide dense vector representations that capture semantic relationships between words, and feed these embeddings into your model.
C
Assign a unique numerical identifier to each word, ranging from 1 to 100,000, and use these identifiers directly as inputs to your model.
D
Order the words based on their frequency of occurrence in the dataset and use the rank order as the encoding for each word in your model.
E
Combine both one-hot encoding and word embeddings by first generating one-hot encodings and then applying a dimensionality reduction technique to obtain dense vectors.
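To make the trade-off between options A and B concrete, the sketch below contrasts the two input representations: a one-hot vector whose length equals the 100,000-word vocabulary versus a dense embedding lookup. The embedding matrix here is randomly initialized purely for illustration; a real pipeline would load pre-trained weights (e.g. GloVe), and the 300-dimension size is an assumption based on common pre-trained vector sizes, not something specified in the question.

```python
import numpy as np

VOCAB_SIZE = 100_000   # vocabulary size given in the question
EMBED_DIM = 300        # assumed dimension, typical for Word2Vec/GloVe vectors

# Option A: one-hot encoding -- each word becomes a sparse vector of
# length VOCAB_SIZE with a single 1 at the word's index.
def one_hot(word_id: int, vocab_size: int = VOCAB_SIZE) -> np.ndarray:
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[word_id] = 1.0
    return vec

# Option B: dense embeddings -- each word maps to a short real-valued
# vector via a lookup table. Randomly initialized here for illustration;
# pre-trained weights would replace this matrix in practice.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

def embed(word_id: int) -> np.ndarray:
    return embedding_matrix[word_id]

sparse_vec = one_hot(42)
dense_vec = embed(42)
print(sparse_vec.shape, dense_vec.shape)  # (100000,) (300,)
```

The size difference per word (100,000 values, all but one of them zero, versus 300 dense values) is one axis of the comparison; the other is that pre-trained embeddings place semantically related words near each other in the vector space, whereas one-hot vectors are all mutually orthogonal.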