
Answer-first summary for fast verification
Answer: `sensor_readings.dropDuplicates('key_column').write.format('delta').mode('append').saveAsTable('sensor_readings')`
The correct approach is option D: `sensor_readings.dropDuplicates('key_column').write.format('delta').mode('append').saveAsTable('sensor_readings')`. It is preferred because:
- `dropDuplicates('key_column')` keeps one row per value of the key column, which is exactly the deduplication the question asks for. Note that many PySpark versions expect the subset as a list, `dropDuplicates(['key_column'])`, and may reject a bare string.
- `write.format('delta')` writes the data in Delta format, matching the target table.
- `mode('append')` adds the deduplicated rows to the existing table instead of replacing it, so data already in the table is preserved. Append does not itself compare incoming rows with rows already stored; the deduplication happens in the DataFrame before the write.
The other options have drawbacks:
- Option A uses `distinct()`, which removes only rows that are identical in every column rather than rows that repeat the key column, and `overwrite` replaces the existing table contents, which can lead to data loss.
- Option B does the same thing through SQL (`SELECT DISTINCT *` with `overwrite`), so it shares both problems.
- Option C relies on `mode('upsert')`, which is not a save mode the DataFrame writer supports (the supported modes are `append`, `overwrite`, `ignore`, and `error`/`errorifexists`). Upserts on Delta tables are expressed with MERGE, and the snippet performs no deduplication.
Thus, option D is the only choice that deduplicates on the key column and writes to the Delta table without discarding existing data.
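As a minimal sketch of the pattern in option D, the helper below wraps the same call chain in a function. The function name `dedupe_and_append` and its parameters are hypothetical, and `df` is assumed to be a `pyspark.sql.DataFrame` in a session where Delta Lake is available. The key is passed to `dropDuplicates` as a list, since that form is accepted across PySpark versions.

```python
def dedupe_and_append(df, key_columns, table_name):
    """Drop rows that repeat a key within `df`, then append the rest to a Delta table.

    Assumptions (not from the original question): `df` is a pyspark.sql.DataFrame
    and `table_name` refers to a Delta table in the current catalog.
    This only deduplicates within `df`; it does not compare against rows
    that are already stored in the table.
    """
    # Accept a single column name or a list of names; always pass a list on.
    if isinstance(key_columns, str):
        key_columns = [key_columns]

    deduped = df.dropDuplicates(list(key_columns))

    # Append keeps the existing table contents and adds the deduplicated rows.
    deduped.write.format("delta").mode("append").saveAsTable(table_name)
    return deduped
```

In a Databricks notebook this might be called as `dedupe_and_append(sensor_readings, "key_column", "sensor_readings")`, where `sensor_readings` is the incoming DataFrame from the question.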
Author: LeetQuiz Editorial Team
A data engineering team is utilizing Databricks to write data to a Delta table named 'sensor_readings'. Their goal is to ensure that any duplicate records, based on a specific key column, are removed before writing to the table. Which of the following code snippets should they use to achieve this efficiently?
A
sensor_readings.distinct().write.format('delta').mode('overwrite').saveAsTable('sensor_readings')
B
spark.sql('SELECT DISTINCT * FROM sensor_readings').write.format('delta').mode('overwrite').saveAsTable('sensor_readings')
C
sensor_readings.write.format('delta').mode('upsert').option('key_column', 'value').save()
D
sensor_readings.dropDuplicates('key_column').write.format('delta').mode('append').saveAsTable('sensor_readings')
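Option C gestures at an upsert, which Delta Lake expresses with MERGE rather than a save mode. If the goal were also to skip keys that already exist in the table, an insert-only MERGE is one way to do it. The sketch below is an illustration under assumptions: the function name `append_new_keys`, the temporary view name `incoming_readings`, and the aliases are hypothetical, and `spark` is assumed to be the active `SparkSession` with Delta Lake SQL support.

```python
def append_new_keys(spark, df, key_column, table_name, view_name="incoming_readings"):
    """Insert only rows whose key is not already present in the Delta table.

    Assumptions (not from the original question): `spark` is the active
    SparkSession, `df` is a pyspark.sql.DataFrame from that session, and the
    identifiers passed in are trusted (they are interpolated into SQL as-is).
    """
    # First remove duplicates of the key within the incoming batch.
    deduped = df.dropDuplicates([key_column])

    # Expose the batch to SQL under a temporary view name.
    deduped.createOrReplaceTempView(view_name)

    # Insert-only MERGE: rows whose key matches an existing row are left alone.
    merge_sql = f"""
        MERGE INTO {table_name} AS target
        USING {view_name} AS source
        ON target.{key_column} = source.{key_column}
        WHEN NOT MATCHED THEN INSERT *
    """
    spark.sql(merge_sql)
    return merge_sql
```

Compared with option D, this trades a plain append for a join against the target table, so it costs more per write but keeps the key column unique across batches.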