
Answer-first summary for fast verification
Answer: Create a Spark MLlib Transformer class for the feature engineering logic.
The recommended approach for encapsulating feature engineering logic into a reusable component in Databricks is to create a Spark MLlib Transformer class. Transformers are designed specifically for building reusable stages in machine learning pipelines: they provide a standardized interface (a DataFrame in, a DataFrame out), typed parameters, and built-in persistence, which makes the encapsulated logic reusable, maintainable, scalable, and easy to document.

The alternatives fall short. A custom PySpark UDF does not integrate cleanly with Spark MLlib pipelines and incurs serialization and deserialization overhead between the JVM and the Python worker. The `map` function operates row by row and is suited to simple one-to-one transformations, not complex, parameterized feature engineering. Saving and loading intermediate DataFrames duplicates storage and creates brittle dependencies between notebooks. Creating a Spark MLlib Transformer class is therefore the most effective way to encapsulate feature engineering logic in a Databricks notebook, promoting a modular and maintainable machine learning workflow.
Author: LeetQuiz Editorial Team
A data scientist is developing a Databricks notebook that requires extensive feature engineering, such as creating new columns and applying transformations. They aim to encapsulate this feature engineering logic into a reusable component. What is the best approach to accomplish this?
A. Save the intermediate DataFrame and load it in other notebooks as needed.
B. Use the `map` function to apply the feature engineering transformations.
C. Define a custom PySpark UDF (User-Defined Function) and apply it to the DataFrame.
D. Create a Spark MLlib Transformer class for the feature engineering logic.