
Answer-first summary for fast verification
Answer: df.dropDuplicates(['id'])
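The correct answer is C. `dropDuplicates(['id'])` compares rows on the `id` column only and keeps one row per distinct `id`, which is exactly what the question asks for. Option A also compares `name`, so rows that share an `id` but differ in `name` are all kept. Option B, `distinct()`, compares every column, so rows with the same `id` but a different `name` or `timestamp` are not treated as duplicates. Option D removes the `id` column itself rather than any rows. Note that Spark does not guarantee which of the duplicate rows is retained.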
Author: LeetQuiz Editorial Team
Given a DataFrame df with columns id, name, and timestamp, how would you create a new DataFrame that removes duplicate rows based on the id column? Provide the Spark code to achieve this.
A. df.dropDuplicates(['id', 'name'])
B. df.distinct()
C. df.dropDuplicates(['id'])
D. df.drop('id')