
Ultimate access to all questions.
Your team is working on a Natural Language Processing (NLP) research project aimed at predicting the political affiliation of authors based on the articles they have written. You have a large training dataset consisting of numerous articles penned by various authors. Following best practices, you decided to split your dataset into training, testing, and evaluation subsets with an 80%-10%-10% distribution, respectively. Considering this setup, how should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?
A
Distribute texts randomly across the train-test-eval subsets: Train set: [TextA1, TextB2, ...] Test set: [TextA2, TextC1, TextD2, ...] Eval set: [TextB1, TextC2, TextD1, ...]
B
Distribute authors randomly across the train-test-eval subsets: Train set: [TextA1, TextA2, TextD1, TextD2, ...] Test set: [TextB1, TextB2, ...] Eval set: [TextC1, TextC2, ...]
C
Distribute sentences randomly across the train-test-eval subsets: Train set: [SentenceA11, SentenceA21, SentenceB11, SentenceB21, SentenceC11, SentenceD21 ...] Test set: [SentenceA12, SentenceA22, SentenceB12, SentenceC22, SentenceC12, SentenceD22 ...] Eval set: [SentenceA13, SentenceA23, SentenceB13, SentenceC23, SentenceC13, SentenceD31 ...]
D
Distribute paragraphs of texts (i.e., chunks of consecutive sentences) across the train-test-eval subsets: Train set: [SentenceA11, SentenceA12, SentenceD11, SentenceD12 ...] Test set: [SentenceA13, SentenceB13, SentenceB21, SentenceD23, SentenceC12, SentenceD13 ...] Eval set: [SentenceA11, SentenceA22, SentenceB13, SentenceD22, SentenceC23, SentenceD11 ...]