
Answer-first summary for fast verification
Answer: Distribute authors randomly across the train-test-eval subsets: Train set: [TextA1, TextA2, TextD1, TextD2, ...] Test set: [TextB1, TextB2, ...] Eval set: [TextC1, TextC2, ...]
The correct answer is B. Assigning whole authors, rather than individual texts, to the train, test, and eval subsets ensures that every text by a given author lands in exactly one subset, which prevents data leakage. Each author has a distinct writing style, and the label (political affiliation) is defined per author, so if the same author's texts appear in both the training data and the held-out data, the model can score well by recognizing the author instead of learning to predict political affiliation from the content. Splitting texts, sentences, or paragraphs at random (options A, C, and D) allows exactly this overlap. With authors segregated, the test and eval scores measure how well the model generalizes to new, unseen authors.
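As an illustration of the author-level split described above, here is a minimal sketch in plain Python. The function name `split_by_author`, the `(author_id, text)` record layout, and the fixed seed are assumptions made for this example and are not part of the original question. Note that the 80-10-10 proportion is applied to authors here, so the proportion of texts in each subset is only approximate when authors have written different numbers of articles.

```python
import random
from collections import defaultdict


def split_by_author(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (author_id, text) records so that no author spans two subsets.

    Hypothetical helper: the fractions apply to the number of authors,
    not to the number of texts.
    """
    # Group every text under its author.
    by_author = defaultdict(list)
    for author, text in records:
        by_author[author].append(text)

    # Shuffle the authors (not the texts) reproducibly.
    authors = sorted(by_author)
    random.Random(seed).shuffle(authors)

    # Cut the shuffled author list into three contiguous slices.
    n_train = round(len(authors) * fractions[0])
    n_test = round(len(authors) * fractions[1])
    groups = {
        "train": authors[:n_train],
        "test": authors[n_train:n_train + n_test],
        "eval": authors[n_train + n_test:],
    }

    # Expand each author slice back into its (author, text) records.
    return {
        name: [(a, t) for a in members for t in by_author[a]]
        for name, members in groups.items()
    }


# Toy data: 10 authors with 3 texts each.
records = [(f"author{i}", f"Text{i}_{j}") for i in range(10) for j in range(3)]
splits = split_by_author(records)
author_sets = {name: {a for a, _ in rows} for name, rows in splits.items()}
```

In this sketch, `author_sets["train"]`, `author_sets["test"]`, and `author_sets["eval"]` are pairwise disjoint, which is the property option B requires. If scikit-learn is available, a group-aware splitter such as `GroupShuffleSplit` with the author as the group key is a common alternative.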
Author: LeetQuiz Editorial Team
Your team is working on a Natural Language Processing (NLP) research project that aims to predict the political affiliation of authors from the articles they have written. You have a large training dataset consisting of numerous articles written by various authors. Following best practices, you decide to split the dataset into training, testing, and evaluation subsets with an 80%-10%-10% distribution, respectively. Given this setup, how should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?
A
Distribute texts randomly across the train-test-eval subsets: Train set: [TextA1, TextB2, ...] Test set: [TextA2, TextC1, TextD2, ...] Eval set: [TextB1, TextC2, TextD1, ...]
B
Distribute authors randomly across the train-test-eval subsets: Train set: [TextA1, TextA2, TextD1, TextD2, ...] Test set: [TextB1, TextB2, ...] Eval set: [TextC1, TextC2, ...]
C
Distribute sentences randomly across the train-test-eval subsets: Train set: [SentenceA11, SentenceA21, SentenceB11, SentenceB21, SentenceC11, SentenceD21 ...] Test set: [SentenceA12, SentenceA22, SentenceB12, SentenceC22, SentenceC12, SentenceD22 ...] Eval set: [SentenceA13, SentenceA23, SentenceB13, SentenceC23, SentenceC13, SentenceD31 ...]
D
Distribute paragraphs of texts (i.e., chunks of consecutive sentences) across the train-test-eval subsets: Train set: [SentenceA11, SentenceA12, SentenceD11, SentenceD12 ...] Test set: [SentenceA13, SentenceB13, SentenceB21, SentenceD23, SentenceC12, SentenceD13 ...] Eval set: [SentenceA11, SentenceA22, SentenceB13, SentenceD22, SentenceC23, SentenceD11 ...]