
Explanation:
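The correct choice is pyspark.sql.DataFrame.dropDuplicates (option C). It returns a new DataFrame with duplicate rows removed and accepts an optional subset argument, a list of column names, so that only those columns are compared when identifying duplicates. The other options do not fit: distinct also removes duplicate rows but always compares entire rows, drop removes columns rather than rows, and dropna and na.drop remove rows that contain null values. For more details, refer to the official PySpark documentation for DataFrame.dropDuplicates.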
In the context of PySpark, which function is designed to generate a new DataFrame by eliminating duplicate rows, with the option to consider only specific columns for identifying duplicates?
A. pyspark.sql.DataFrame.drop
B. pyspark.sql.DataFrame.distinct
C. pyspark.sql.DataFrame.dropDuplicates
D. pyspark.sql.DataFrame.na.drop
E. pyspark.sql.DataFrame.dropna