A data engineer is examining the distinct() and dropDuplicates() methods in Spark for de-duplicating a DataFrame. Which statement accurately describes the use of these methods for de-duplication?
A
The distinct() method can be used to remove duplicates based on specific columns by passing column names as arguments.
B
In Databricks, the distinct() method is deprecated, leaving dropDuplicates() as the only supported method for de-duplication.
C
The methods dropDuplicates() and drop_duplicates() are interchangeable, as per the official Spark documentation.
D
Both distinct() and dropDuplicates() methods allow for the removal of duplicates based on specific columns.
E
The dropDuplicates() method is restricted to RDDs, while the distinct() method is exclusively for DataFrames.