
Answer-first summary for fast verification
Answer: Use stratified sampling based on key features to ensure representation.
Stratified sampling ensures that each subset (training and testing) contains approximately the same percentage of samples of each target class as the complete set, which is crucial for maintaining the representativeness of the dataset. Ignoring data skew or using a simple random split could lead to subsets that are not representative of the overall dataset.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
Consider a scenario where you are tasked with splitting a large distributed dataset using Spark ML. The dataset contains 10 million records and is stored in a Hive table. Describe the steps you would take to ensure an effective split while minimizing data skew and ensuring that the training and testing subsets are representative of the overall dataset. Additionally, discuss potential challenges and how you would address them.
A
Use randomSplit() with a fixed seed and ignore data skew.
B
Use stratified sampling based on key features to ensure representation.
C
Manually partition the data based on a hash of the primary key.
D
Split the data without considering the distribution of key features.