
Answer-first summary for fast verification
Answer: D — utilize pre-trained word embeddings to represent each word, capturing semantic similarities learned from a large corpus. Option E (one-hot encoding for less frequent words combined with pre-trained embeddings for more frequent words) is an acceptable alternative when a trade-off between computational efficiency and semantic richness is required.
The most effective strategy for this scenario is to **utilize pre-trained word embeddings** (Option D): they encode the semantic relationships between words learned from a large corpus, enabling the model to build more meaningful representations and converge faster during training. When computational resources are constrained and a balance between efficiency and semantic richness is desired, **combining one-hot encoding for less frequent words with pre-trained word embeddings for more frequent words** (Option E) can be a viable alternative. This approach leverages the compactness gained by reserving dense vectors for common words while still keeping rare words distinguishable.

### Why not the other options?

- **Sorting words by frequency** (Option A) organizes words by occurrence but captures no semantic relationships, making it unsuitable for tasks that require understanding word meanings.
- **One-hot encoding** (Option B) treats words as isolated entities, disregarding semantic connections and producing 100,000-dimensional input vectors that are computationally expensive to process.
- **Assigning numerical identifiers** (Option C) imposes an arbitrary ordering that carries no semantic information, limiting the model's ability to understand text meaningfully.
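As a concrete illustration of Option D, the sketch below feeds pre-trained embeddings into a GRU-based classifier using PyTorch. The embedding matrix here is random stand-in data (real usage would load GloVe, word2vec, or similar vectors), and the class name, dimensions, and category count are hypothetical choices for this example, not values from the question.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 100,000-word vocabulary (as in the question),
# 300-dim vectors, 20 product categories. Random matrix stands in
# for real pre-trained embeddings such as GloVe or word2vec.
vocab_size, embed_dim, hidden_dim, num_classes = 100_000, 300, 128, 20
pretrained = torch.randn(vocab_size, embed_dim)

class ProductClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize from pre-trained vectors; freeze=False allows
        # fine-tuning the embeddings during training.
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, h = self.rnn(emb)              # h: (1, batch, hidden_dim)
        return self.fc(h.squeeze(0))      # (batch, num_classes)

model = ProductClassifier()
batch = torch.randint(0, vocab_size, (4, 12))  # 4 descriptions, 12 tokens each
logits = model(batch)
print(logits.shape)  # torch.Size([4, 20])
```

Note the efficiency contrast with Option B: each token enters the RNN as a 300-dimensional dense vector rather than a 100,000-dimensional one-hot vector, and the embedding lookup is a simple index into the weight matrix.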
Author: LeetQuiz Editorial Team
You are working on a project to develop a natural language processing (NLP) model for a large e-commerce platform. The model's task is to classify millions of product descriptions into predefined categories. The dataset includes product descriptions with a vocabulary of 100,000 unique words. The platform requires the model to be efficient, scalable, and capable of understanding the semantic relationships between words to improve classification accuracy. Given these requirements, which preprocessing approach should you adopt for feeding the words into a recurrent neural network (RNN)? Choose the best option.
A
Sort the words by their frequency of occurrence and use these frequencies as encodings in your model, arguing that more frequent words are more important for classification.
B
Generate a one-hot encoding for each word, ensuring that each word is represented as a unique vector in a high-dimensional space, to maintain distinctiveness between words.
C
Assign a unique numerical identifier to each word from 1 to 100,000, treating each word as an independent category without considering semantic relationships.
D
Utilize pre-trained word embeddings to represent each word, capturing semantic similarities and differences between words based on their usage in a large corpus of text.
E
Combine the approaches of one-hot encoding for less frequent words and pre-trained word embeddings for more frequent words, to balance between computational efficiency and semantic richness.