
Explanation:
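The correct answer is C. dropDuplicates(['id']) returns a new DataFrame that keeps one row for each distinct value of id, which is exactly what the question asks for. Option A deduplicates on the combination of id and name, so rows that share an id but differ in name would all remain. Option B, distinct(), only removes rows that are identical in every column. Option D drops the id column itself and removes no rows.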
Given a DataFrame df with columns id, name, and timestamp, how would you create a new DataFrame that removes duplicate rows based on the id column? Provide the Spark code to achieve this.
A
df.dropDuplicates(['id', 'name'])
B
df.distinct()
C
df.dropDuplicates(['id'])
D
df.drop('id')
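
As a rough illustration of option C, the sketch below wraps the call in a small helper and runs it on a tiny sample. The helper name dedupe_by_id, the demo function, and the sample rows are hypothetical and not part of the question; only dropDuplicates, distinct, and the column names id, name, and timestamp come from the text above. It assumes PySpark is installed and a local session can be started.

```python
def dedupe_by_id(df):
    """Return a new DataFrame with at most one row per distinct id value.

    When several rows share an id, which one is kept is not guaranteed.
    """
    return df.dropDuplicates(["id"])


def demo():
    # PySpark is imported here so dedupe_by_id can be imported without it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("dedupe-demo")
        .getOrCreate()
    )

    # Hypothetical sample data: id 1 appears twice with different timestamps.
    df = spark.createDataFrame(
        [
            (1, "alice", "2024-01-01"),
            (1, "alice", "2024-01-02"),
            (2, "bob", "2024-01-01"),
        ],
        ["id", "name", "timestamp"],
    )

    # Option C: one row per id, so two rows remain.
    dedupe_by_id(df).show()

    # Option B for comparison: no two rows match in every column,
    # so distinct() keeps all three rows.
    df.distinct().show()

    spark.stop()
```

Calling demo() in an environment with PySpark available prints both results. If a specific row per id must survive (for example, the one with the latest timestamp), dropDuplicates alone does not express that, and a different approach would be needed.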