
Explanation:
Cmd 5 (display(finalDF)) should be removed. display() is an interactive command for visualizing data in a notebook; in a scheduled job no user is viewing the rendered output, so the call is unnecessary and triggers an extra Spark action just to compute rows that nobody inspects. Cmd 2 (printSchema()) is also a debugging aid, but it only prints the schema to the driver output, without running a Spark job, and does not interfere with job execution. The remaining commands (Cmd 1, Cmd 3, Cmd 4, Cmd 6) read the source table, apply the transformations, and write the output, so the pipeline needs all of them.
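As an illustration of the reasoning above, the scheduled version of the notebook could be reduced to the commands the pipeline actually needs. This is only a minimal sketch: the function name run_pipeline and its parameters are hypothetical, and it assumes the caller passes in the SparkSession that a notebook normally exposes as spark.

```python
def run_pipeline(spark, source_table="raw_data", target_table="flat_data"):
    """Job-ready sketch of the notebook with the interactive display() call removed.

    `run_pipeline`, `source_table` and `target_table` are hypothetical names
    introduced for this example; the table names are the ones in the question.
    """
    # Cmd 1: read the source table.
    raw_df = spark.table(source_table)

    # Cmd 2 (printSchema) is optional debugging output and is omitted here.

    # Cmd 3: keep every column and expand the fields of the "values" struct.
    flattened_df = raw_df.select("*", "values.*")

    # Cmd 4: drop the original struct column now that its fields are flattened.
    final_df = flattened_df.drop("values")

    # Cmd 5 (display) is intentionally left out: nobody views the output of a job.

    # Cmd 6: append the flattened rows to the target table.
    final_df.write.mode("append").saveAsTable(target_table)
    return final_df
```

In a notebook this would be invoked as run_pipeline(spark). Wrapping the steps in a function is a design choice for the sketch only; leaving them as separate cells without Cmd 5 gives the same result.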
A data engineer has prepared a notebook to be scheduled as part of a data pipeline. The following commands produce correct results when executed as shown:
Cmd 1:
rawDF = spark.table("raw_data")
Cmd 2:
rawDF.printSchema()
Cmd 3:
flattenedDF = rawDF.select("*", "values.*")
Cmd 4:
finalDF = flattenedDF.drop("values")
Cmd 5:
display(finalDF)
Cmd 6:
finalDF.write.mode("append").saveAsTable("flat_data")
Which command should be excluded from the notebook before scheduling it as a job?
A. Cmd 2
B. Cmd 3
C. Cmd 4
D. Cmd 5