
Google Professional Machine Learning Engineer
You are tasked with developing a natural language processing (NLP) model for text classification on a dataset that includes millions of product descriptions and 100,000 unique words. The model will be implemented using a recurrent neural network (RNN). Given the scale of the dataset and the complexity of the task, which preprocessing method is most appropriate for preparing the words as inputs to the RNN? Consider the need for efficiency, scalability, and the ability to capture semantic relationships between words. Choose the best option from the following:
A. Represent each word as a one-hot encoded vector over the 100,000-word vocabulary.
B. Represent each word with a dense embedding from a pre-trained model such as Word2Vec or GloVe.
C. Map each word to an arbitrary numerical identifier.
D. Encode each word by its frequency of occurrence in the corpus.
E. Use pre-trained word embeddings followed by an additional dimensionality-reduction step.

Correct answer: B
Explanation:
Option B is the most appropriate. Pre-trained embeddings such as Word2Vec or GloVe map each word to a dense, low-dimensional vector (typically 100–300 dimensions) that captures semantic relationships learned from large corpora, so words with similar meanings receive similar vectors. This representation is efficient and scalable for a 100,000-word vocabulary and lets the RNN leverage knowledge learned outside the training set. Option E would also work, but the extra dimensionality-reduction step is usually unnecessary because pre-trained embeddings are already compact. One-hot encodings (Option A) produce 100,000-dimensional sparse vectors that are memory-inefficient and treat every pair of distinct words as equally unrelated. Numerical identifiers (Option C) impose an arbitrary ordering that carries no semantic meaning, and frequency-based encodings (Option D) conflate unrelated words that happen to occur with similar frequency, so neither is effective for this task.
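The contrast between Options A and B can be made concrete with a small sketch. The toy vocabulary and embedding vectors below are hypothetical stand-ins (real Word2Vec or GloVe vectors would be loaded from a published file); the point is that one-hot vectors are mutually orthogonal, so every word pair looks equally unrelated, while dense embeddings place semantically similar words close together:

```python
import numpy as np

# Hypothetical stand-ins for pre-trained vectors; real Word2Vec/GloVe
# embeddings would be loaded from a published file and have ~100-300 dims.
vocab = ["laptop", "notebook", "banana"]
pretrained = {
    "laptop":   np.array([0.9, 0.1, 0.0, 0.2]),
    "notebook": np.array([0.8, 0.2, 0.1, 0.1]),  # near "laptop"
    "banana":   np.array([0.0, 0.9, 0.8, 0.0]),  # semantically distant
}

word_to_id = {w: i for i, w in enumerate(vocab)}
# Embedding matrix: row i holds the dense vector for word ID i.
emb_matrix = np.stack([pretrained[w] for w in vocab])

def one_hot(word):
    # Option A: a sparse vector as wide as the vocabulary.
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

def embed(word):
    # Option B: the lookup an RNN's embedding layer performs per word ID.
    return emb_matrix[word_to_id[word]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors are orthogonal: related and unrelated pairs both score 0.
assert cosine(one_hot("laptop"), one_hot("notebook")) == 0.0
# Dense embeddings rank the related pair above the unrelated one.
assert cosine(embed("laptop"), embed("notebook")) > cosine(embed("laptop"), embed("banana"))
print("embeddings capture similarity; one-hot does not")
```

At the scale in the question, the gap widens: one-hot input vectors would be 100,000-dimensional and almost entirely zeros, whereas a pre-trained embedding layer feeds the RNN a fixed, small dense vector per token.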