Google Professional Machine Learning Engineer

Get started today

Ultimate access to all questions.

Your team is working on a Natural Language Processing (NLP) research project aimed at predicting the political affiliation of authors based on the articles they have written. You have a large training dataset consisting of numerous articles penned by various authors. Following best practices, you decided to split your dataset into training, testing, and evaluation subsets with an 80%-10%-10% distribution, respectively. Considering this setup, how should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?

Exam-Like

Distribute texts randomly across the train-test-eval subsets: Train set: [TextA1, TextB2, ...] Test set: [TextA2, TextC1, TextD2, ...] Eval set: [TextB1, TextC2, TextD1, ...]

14.0%

Distribute authors randomly across the train-test-eval subsets: Train set: [TextA1, TextA2, TextD1, TextD2, ...] Test set: [TextB1, TextB2, ...] Eval set: [TextC1, TextC2, ...]

Comments

Loading comments...

Distribute sentences randomly across the train-test-eval subsets: Train set: [SentenceA11, SentenceA21, SentenceB11, SentenceB21, SentenceC11, SentenceD21 ...] Test set: [SentenceA12, SentenceA22, SentenceB12, SentenceC22, SentenceC12, SentenceD22 ...] Eval set: [SentenceA13, SentenceA23, SentenceB13, SentenceC23, SentenceC13, SentenceD31 ...]

15.1%

Distribute paragraphs of texts (i.e., chunks of consecutive sentences) across the train-test-eval subsets: Train set: [SentenceA11, SentenceA12, SentenceD11, SentenceD12 ...] Test set: [SentenceA13, SentenceB13, SentenceB21, SentenceD23, SentenceC12, SentenceD13 ...] Eval set: [SentenceA11, SentenceA22, SentenceB13, SentenceD22, SentenceC23, SentenceD11 ...]

14.0%