
You are working on a project to develop a natural language processing (NLP) model for a large e-commerce platform. The model's task is to classify millions of product descriptions into predefined categories. The dataset includes product descriptions with a vocabulary of 100,000 unique words. The platform requires the model to be efficient, scalable, and capable of understanding the semantic relationships between words to improve classification accuracy. Given these requirements, which preprocessing approach should you adopt for feeding the words into a recurrent neural network (RNN)? Choose the best option.
A. Sort the words by their frequency of occurrence and use these frequencies as encodings, arguing that more frequent words are more important for classification.
B. Generate a one-hot encoding for each word, so that each word is represented as a unique vector in a 100,000-dimensional space, maintaining distinctiveness between words.
C. Assign a unique numerical identifier to each word from 1 to 100,000, treating each word as an independent category without considering semantic relationships.
D. Utilize pre-trained word embeddings to represent each word, capturing semantic similarities and differences between words based on their usage in a large corpus of text.
E. Combine one-hot encoding for less frequent words with pre-trained word embeddings for more frequent words, to balance computational efficiency and semantic richness.
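For reference, the embedding-based approach in option D can be illustrated with a minimal PyTorch sketch. The class name, embedding dimension, hidden size, and class count below are illustrative assumptions, not part of the question: the 100,000-word vocabulary is mapped to pre-trained vectors through an embedding layer, which feeds an LSTM whose final hidden state drives the category classifier.

```python
import torch
import torch.nn as nn

class ProductClassifier(nn.Module):
    """RNN classifier whose embedding layer is initialized from pre-trained word vectors."""
    def __init__(self, pretrained_vectors, num_classes, hidden_size=128):
        super().__init__()
        # pretrained_vectors: float tensor of shape (vocab_size, embedding_dim),
        # e.g. GloVe or fastText vectors aligned to the platform's 100,000-word vocabulary.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.rnn = nn.LSTM(pretrained_vectors.size(1), hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)        # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.rnn(embedded)         # hidden: (1, batch, hidden_size)
        return self.classifier(hidden.squeeze(0))   # (batch, num_classes)

# Usage with random stand-in vectors (replace with real pre-trained embeddings);
# the 50 categories and batch shape are made up for the example.
vectors = torch.randn(100_000, 300)
model = ProductClassifier(vectors, num_classes=50)
logits = model(torch.randint(0, 100_000, (8, 40)))  # 8 descriptions, 40 tokens each
```

Because the embedding table is shared and dense, each word is represented by a few hundred floats rather than a 100,000-dimensional one-hot vector, which keeps the model scalable while preserving semantic relationships learned from a large corpus.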