
Answer-first summary for fast verification
Answer: D (Cmd 5)
Cmd 5 (`display(finalDF)`) should be removed. `display()` is an interactive notebook command for visualizing data; in a scheduled job nobody is watching the rendered output, so the call is unnecessary, and it also triggers an extra Spark action to compute rows that are never viewed. Cmd 2 (`printSchema()`) is also a debugging aid, but it only prints the schema to the driver output and does not run a computation over the data, so it does not interfere with job execution. Cmd 1 loads the source table, and Cmd 3, Cmd 4, and Cmd 6 are the transformation and write steps the pipeline depends on.
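To make the reasoning above concrete, here is a minimal sketch of what the notebook could look like once Cmd 5 is dropped. The function name `run_pipeline` and the `spark` parameter are illustrative assumptions, not part of the original question; in a Databricks notebook `spark` is already available as a global session, so the commands would normally run at the top level of each cell.

```python
def run_pipeline(spark):
    """Scheduled-job version of the notebook, with Cmd 5 (display) omitted.

    `spark` is assumed to be an active SparkSession (hypothetical parameter;
    a Databricks notebook provides it as a global).
    """
    # Cmd 1: load the source table.
    rawDF = spark.table("raw_data")

    # Cmd 2: print the schema to the driver output (debugging aid, kept
    # because it only reads metadata).
    rawDF.printSchema()

    # Cmd 3: expand the fields of the "values" struct into top-level columns.
    flattenedDF = rawDF.select("*", "values.*")

    # Cmd 4: drop the original nested column.
    finalDF = flattenedDF.drop("values")

    # Cmd 5 (display(finalDF)) is intentionally left out: it is only useful
    # when someone is looking at the notebook interactively.

    # Cmd 6: append the flattened rows to the target table.
    finalDF.write.mode("append").saveAsTable("flat_data")
    return finalDF
```

Wrapping the steps in a function is a design choice made only for this sketch, so the pipeline can be exercised without a live cluster; the question itself simply asks which cell to delete.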
Author: LeetQuiz Editorial Team
A data engineer has prepared a notebook to be scheduled as part of a data pipeline. The following commands produce correct results when executed as shown:
Cmd 1:
rawDF = spark.table("raw_data")
Cmd 2:
rawDF.printSchema()
Cmd 3:
flattenedDF = rawDF.select("*", "values.*")
Cmd 4:
finalDF = flattenedDF.drop("values")
Cmd 5:
display(finalDF)
Cmd 6:
finalDF.write.mode("append").saveAsTable("flat_data")
Which command should be excluded from the notebook before scheduling it as a job?
A. Cmd 2
B. Cmd 3
C. Cmd 4
D. Cmd 5