
You are working with a Delta Lake table 'transactions' that contains duplicate rows. Which of the following PySpark code snippets deduplicates the rows based on all columns and saves the result back to the same table?
A
df = spark.read.format('delta').load('/path/to/transactions')
df = df.dropDuplicates()
df.write.format('delta').mode('overwrite').save('/path/to/transactions')
B
df = spark.read.format('delta').load('/path/to/transactions')
df = df.distinct()
df.write.format('delta').mode('overwrite').save('/path/to/transactions')
C
df = spark.read.format('delta').load('/path/to/transactions')
df = df.dropDuplicates(['column1', 'column2'])
df.write.format('delta').mode('overwrite').save('/path/to/transactions')
D
df = spark.read.format('delta').load('/path/to/transactions')
df = df.distinct(['column1', 'column2'])
df.write.format('delta').mode('overwrite').save('/path/to/transactions')
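
For reference: options A and B are functionally equivalent, since dropDuplicates() with no arguments compares entire rows and distinct() is an alias for the same operation. Option C deduplicates on only a subset of columns, and option D raises a TypeError because distinct() accepts no column list. Below is a minimal, self-contained sketch of the full-row dedup-and-overwrite pattern; it assumes a SparkSession configured with the delta-spark package and uses the path from the question, which stands in for a real table location.

from pyspark.sql import SparkSession

# These two settings enable Delta Lake (delta-spark package).
spark = (
    SparkSession.builder
    .appName("dedup-transactions")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

path = "/path/to/transactions"  # placeholder path from the question

df = spark.read.format("delta").load(path)

# No arguments means the comparison covers every column of the row.
deduped = df.dropDuplicates()

# Overwriting the same path you read from works with Delta: the read is
# pinned to a table snapshot, so the overwrite commits a new version
# atomically (plain Parquet would refuse this read/overwrite cycle).
deduped.write.format("delta").mode("overwrite").save(path)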