
The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.
The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.
model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn/prod")
df = spark.table("customers")
columns = ["account_age", "time_since_last_seen", "app_rating"]
Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?
A
df.map(lambda x: model(x[columns])).select("customer_id, predictions")
B
df.select("customer_id", model(*columns).alias("predictions"))
C
model.predict(df, columns)
D
df.select("customer_id", pandas_udf(model, columns).alias("predictions"))
E
df.apply(model, columns).select("customer_id, predictions")
Explanation:
Correct Answer: B
Why Option B is correct:
- model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn/prod") creates a Spark UDF from the MLflow model.
- model(*columns) applies the UDF to the specified columns; the unpacking operator * expands the list into one positional argument per column.
- .alias("predictions") renames the resulting column to "predictions".
- df.select("customer_id", model(*columns).alias("predictions")) selects both the customer_id column and the predictions column, producing the desired schema "customer_id LONG, predictions DOUBLE".
Why the other options are incorrect:
- A: df.map() is not how UDFs are applied to DataFrames. map() is an RDD operation, not a DataFrame method.
- C: model.predict(df, columns) is not valid for MLflow Spark UDFs. The object returned by mlflow.pyfunc.spark_udf() is a callable function, not an object with a .predict() method.
- D: wrapping the model in pandas_udf() is unnecessary, since mlflow.pyfunc.spark_udf() already returns a Spark UDF. This would be redundant UDF wrapping.
- E: df.apply() is not a valid DataFrame method in PySpark. Functions are applied with withColumn() or with UDFs inside select() statements.
Key concepts:
- mlflow.pyfunc.spark_udf() creates a Spark user-defined function (UDF) that can be used in DataFrame operations.
- The unpacking operator * passes multiple columns as separate arguments.
- select() chooses specific columns from a DataFrame.
- alias() renames a column in the resulting DataFrame.
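The mechanics of option B can be sketched in plain Python without a Spark cluster. The model function below is a hypothetical stand-in for the callable returned by mlflow.pyfunc.spark_udf (its summing logic is invented for illustration); the list comprehension plays the role of select(). The point is how * unpacks the column list into separate arguments and how the output keeps only customer_id and predictions.

```python
# Hypothetical stand-in for the Spark UDF returned by mlflow.pyfunc.spark_udf:
# a real UDF receives one column argument per feature and returns a DOUBLE.
def model(*feature_values):
    return float(sum(feature_values))  # toy scoring logic, not the real model

columns = ["account_age", "time_since_last_seen", "app_rating"]

# A couple of rows standing in for the customers DataFrame.
rows = [
    {"customer_id": 1, "account_age": 12, "time_since_last_seen": 3, "app_rating": 4.0},
    {"customer_id": 2, "account_age": 30, "time_since_last_seen": 1, "app_rating": 2.5},
]

# Analogue of df.select("customer_id", model(*columns).alias("predictions")):
# the * operator unpacks the three feature values into three positional
# arguments, and the result column is named "predictions".
result = [
    {"customer_id": row["customer_id"],
     "predictions": model(*(row[c] for c in columns))}
    for row in rows
]
print(result)
# -> [{'customer_id': 1, 'predictions': 19.0}, {'customer_id': 2, 'predictions': 33.5}]
```

Calling model(columns) instead of model(*columns) would pass a single list as one argument, which is exactly the mistake the unpacking operator avoids.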