
Answer-first summary for fast verification
Answer: `df.select("customer_id", model(*columns).alias("predictions"))`
## Explanation

**Correct Answer: B**

**Why Option B is correct:**

1. `model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn/prod")` creates a Spark UDF from the MLflow model.
2. The UDF can be used directly in DataFrame operations.
3. `model(*columns)` applies the UDF to the specified columns, using the unpacking operator `*` to pass each column name as a separate argument.
4. `.alias("predictions")` renames the resulting column to "predictions".
5. `df.select("customer_id", model(*columns).alias("predictions"))` selects both the `customer_id` column and the predictions column, producing the desired schema.

**Why the other options are incorrect:**

- **Option A**: `df.map()` is not how UDFs are applied to Spark DataFrames; `map()` is an RDD operation, not a DataFrame operation.
- **Option C**: `model.predict(df, columns)` is not a valid call for an MLflow Spark UDF. The object returned by `mlflow.pyfunc.spark_udf()` is a callable column function, not an object with a `.predict()` method.
- **Option D**: Wrapping the model in `pandas_udf()` is unnecessary, since `mlflow.pyfunc.spark_udf()` already returns a Spark UDF; this would be redundant UDF wrapping.
- **Option E**: `df.apply()` is not a valid PySpark DataFrame method. Functions are applied with `withColumn()` or by using UDFs inside `select()`.

**Key Concepts:**

- MLflow's `pyfunc.spark_udf()` creates a Spark user-defined function (UDF) that can be used directly in DataFrame operations.
- Multiple columns can be passed to a Spark UDF by unpacking a list of column names with the `*` operator.
- The `select()` method chooses specific columns from a DataFrame.
- The `alias()` method renames a column in the resulting DataFrame.
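The unpacking behavior in step 3 can be illustrated without Spark. This is a minimal sketch in plain Python: `mock_model` below is a hypothetical stand-in for the callable returned by `mlflow.pyfunc.spark_udf()`, not the real MLflow API, but the `*columns` argument-passing works the same way.

```python
# Hypothetical stand-in for the Spark UDF: like the real UDF, it is a
# callable that accepts one positional argument per feature column.
def mock_model(*cols):
    # In real Spark each argument would be a column name or Column
    # object; here we just record what was received.
    return f"udf({', '.join(cols)})"

columns = ["account_age", "time_since_last_seen", "app_rating"]

# model(*columns) unpacks the list into three positional arguments,
# equivalent to mock_model("account_age", "time_since_last_seen", "app_rating").
result = mock_model(*columns)
print(result)  # → udf(account_age, time_since_last_seen, app_rating)
```

Passing `mock_model(columns)` instead would hand the whole list over as a single argument, which is why the `*` matters when the UDF expects one argument per column.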
Author: Keng Suppaseth
The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.
The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.
```python
model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn/prod")
df = spark.table("customers")
columns = ["account_age", "time_since_last_seen", "app_rating"]
```
Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?
- **A**: `df.map(lambda x: model(x[columns])).select("customer_id, predictions")`
- **B**: `df.select("customer_id", model(*columns).alias("predictions"))`
- **C**: `model.predict(df, columns)`
- **D**: `df.select("customer_id", pandas_udf(model, columns).alias("predictions"))`
- **E**: `df.apply(model, columns).select("customer_id, predictions")`