
Answer-first summary for fast verification
Answer: Randomly distribute authors across the subsets, ensuring that all texts from a single author are contained within the same subset to prevent data leakage and to maintain the integrity of the author's writing style and political affiliation across subsets.
The optimal strategy is to split at the author level: randomly assign authors to the train, test, and eval subsets so that every text by a given author lands in exactly one subset. Because each author has a single known political affiliation and a recognizable writing style, a model that sees the same author in both training and evaluation can score well by recognizing the author rather than by learning patterns that actually indicate affiliation. Keeping authors disjoint across subsets prevents this data leakage, so test and eval performance reflects how well the model generalizes to unseen authors. With a large and diverse set of authors, random assignment also tends to produce a similar mix of political affiliations in each subset, although it does not guarantee it. Distributing individual texts, sentences, or paragraphs at random (options A, B, and C) places material from the same author in more than one subset, which leaks author identity and inflates evaluation scores.
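The author-level split described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `split_by_author`, the `(author, text)` record layout, and the toy data are all assumptions made for the example. Note that the 80-10-10 proportion is applied to the number of authors, so the proportion of texts per subset is only approximate when authors contribute different numbers of articles.

```python
import random
from collections import defaultdict


def split_by_author(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole authors to train/test/eval so no author spans two subsets.

    `records` is an iterable of (author, text) pairs (hypothetical layout).
    """
    # Group every text under its author.
    by_author = defaultdict(list)
    for author, text in records:
        by_author[author].append(text)

    # Shuffle the authors (not the texts) reproducibly.
    authors = sorted(by_author)
    random.Random(seed).shuffle(authors)

    # Cut the shuffled author list at the 80% and 90% marks.
    n_train = round(len(authors) * fractions[0])
    n_test = round(len(authors) * fractions[1])
    groups = {
        "train": authors[:n_train],
        "test": authors[n_train:n_train + n_test],
        "eval": authors[n_train + n_test:],
    }

    # Expand each group of authors back into (author, text) records.
    return {
        name: [(a, t) for a in members for t in by_author[a]]
        for name, members in groups.items()
    }


# Toy data: 20 hypothetical authors with 3 texts each.
records = [(f"author_{i}", f"text {j}") for i in range(20) for j in range(3)]
splits = split_by_author(records)

# Collect the set of authors in each subset to check for leakage.
author_sets = {name: {a for a, _ in rows} for name, rows in splits.items()}
```

In this sketch, `author_sets["train"]`, `author_sets["test"]`, and `author_sets["eval"]` are pairwise disjoint, which is the property that prevents leakage. If the affiliation mix across subsets matters, one possible refinement is to run the same procedure separately within each affiliation group.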
Author: LeetQuiz Editorial Team
In the context of a Natural Language Processing (NLP) research project aimed at predicting the political affiliation of authors based on their written articles, your team has access to a large and diverse dataset. The dataset includes articles from multiple authors, each with a known political affiliation. To ensure the model's ability to generalize and to prevent data leakage, you decide to follow the standard 80%-10%-10% data distribution across training, testing, and evaluation subsets. Considering the importance of maintaining the integrity of the dataset's distribution and the need for the model to learn generalizable patterns, which of the following strategies is the MOST appropriate for distributing the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion? Choose one correct option.
A
Randomly distribute individual texts across the subsets, ensuring that each subset contains a mix of texts from all authors without considering the author's political affiliation.
B
Randomly distribute sentences from the texts across the subsets, breaking the continuity of the articles to ensure a diverse representation in each subset.
C
Distribute paragraphs of texts (i.e., chunks of consecutive sentences) across the subsets, aiming to maintain some context within the subsets but risking the inclusion of multiple paragraphs from the same author in different subsets.
D
Randomly distribute authors across the subsets, ensuring that all texts from a single author are contained within the same subset to prevent data leakage and to maintain the integrity of the author's writing style and political affiliation across subsets.