In the context of a Natural Language Processing (NLP) research project aimed at predicting the political affiliation of authors based on their written articles, your team has access to a large and diverse dataset. The dataset includes articles from multiple authors, each with a known political affiliation. To ensure the model's ability to generalize and to prevent data leakage, you decide to follow the standard 80%-10%-10% data distribution across training, testing, and evaluation subsets. Considering the importance of maintaining the integrity of the dataset's distribution and the need for the model to learn generalizable patterns, which of the following strategies is the MOST appropriate for distributing the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion? Choose one correct option.

Real Exam

Randomly distribute individual texts across the subsets, ensuring that each subset contains a mix of texts from all authors without considering the author's political affiliation.

4.5%

Randomly distribute sentences from the texts across the subsets, breaking the continuity of the articles to ensure a diverse representation in each subset.

9.1%

Distribute paragraphs of texts (i.e., chunks of consecutive sentences) across the subsets, aiming to maintain some context within the subsets but risking the inclusion of multiple paragraphs from the same author in different subsets.

13.6%

Randomly distribute authors across the subsets, ensuring that all texts from a single author are contained within the same subset to prevent data leakage and to maintain the integrity of the author's writing style and political affiliation across subsets.

72.7%

Google Professional Machine Learning Engineer

Get started today

Comments