
Explanation:
In a distributed environment, feature engineering must be both efficient and scalable. Spark ML provides feature transformers such as Bucketizer, which discretizes a continuous feature into buckets defined by split points, and VectorAssembler, which combines multiple columns into a single vector column. These transformers operate on DataFrames and are applied in parallel across the partitions held on each node, so the work scales with the cluster instead of requiring the data to be collected onto a single machine.
Given a scenario where you need to perform feature engineering on a dataset that is distributed across multiple nodes, describe how you would approach this task using Spark ML. What specific features of Spark ML would you utilize and why?
A
Use Spark ML's DataFrame API for feature extraction due to its simplicity.
B
Leverage Spark ML's feature transformation functions like Bucketizer and VectorAssembler for efficient distributed processing.
C
Implement custom Python scripts for feature engineering to ensure flexibility.
D
Rely on manual data aggregation and then apply scikit-learn for feature engineering.