
Explanation:
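The correct approach is option D: sensor_readings.dropDuplicates('key_column').write.format('delta').mode('append').saveAsTable('sensor_readings'). This method is preferred because:
dropDuplicates('key_column') removes duplicate rows based on the specified column, so at most one row per key value in the incoming DataFrame is written. Depending on the PySpark version, the column may need to be passed as a list: dropDuplicates(['key_column']).
write.format('delta') writes the data in Delta format, which supports transactional appends to an existing table.
mode('append') adds the deduplicated rows to the existing table without rewriting it, preserving the original data. Append does not compare incoming rows with rows already in the table, so the deduplication applies only to the batch being written.
Other options have drawbacks:
Options A and B use distinct() or SELECT DISTINCT *, which remove only rows that are identical in every column rather than duplicates of the key column. They also write with overwrite, which replaces the existing table contents and can lead to data loss.
Option C uses mode('upsert'), which is not a valid save mode for the DataFrame writer.
Thus, option D is the most efficient and reliable of the listed methods for deduplicating and writing data to the Delta table while keeping the existing data intact.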
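The difference between distinct() and key-based dropDuplicates can be illustrated with a small plain-Python model. This is only a sketch of the row semantics, not Spark code: the helper names distinct and drop_duplicates_by_key, the tuple layout (key_column, reading), and all values are made up for illustration.

```python
# Plain-Python model of the row semantics; no Spark session is needed.
# Rows are (key_column, reading) tuples; all values are made up.

def distinct(rows):
    """Model of DataFrame.distinct(): drop rows identical in every column."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def drop_duplicates_by_key(rows):
    """Model of dropDuplicates on the key column: one row per key.

    This model keeps the first row seen; Spark does not guarantee which
    of the duplicate rows survives.
    """
    seen_keys = set()
    result = []
    for row in rows:
        key = row[0]
        if key not in seen_keys:
            seen_keys.add(key)
            result.append(row)
    return result


incoming = [("s1", 20.5), ("s1", 20.5), ("s1", 21.0), ("s2", 19.0)]
existing_table = [("s2", 18.0)]

whole_row = distinct(incoming)            # "s1" still appears twice
by_key = drop_duplicates_by_key(incoming)  # one row per key

# Model of mode('append'): incoming rows are added as-is,
# rows already in the table are not checked.
appended = existing_table + by_key

print(whole_row)
print(by_key)
print(appended)
```

In this model the key "s2" ends up in the table twice after the append, because the key was already present before the write. When keys must stay unique across the whole table, a Delta MERGE keyed on that column is the usual alternative to a plain append.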
A data engineering team is utilizing Databricks to write data to a Delta table named 'sensor_readings'. Their goal is to ensure that any duplicate records, based on a specific key column, are removed before writing to the table. Which of the following code snippets should they use to achieve this efficiently?
A
sensor_readings.distinct().write.format('delta').mode('overwrite').saveAsTable('sensor_readings')
B
spark.sql('SELECT DISTINCT * FROM sensor_readings').write.format('delta').mode('overwrite').saveAsTable('sensor_readings')
C
sensor_readings.write.format('delta').mode('upsert').option('key_column', 'value').save()
D
sensor_readings.dropDuplicates('key_column').write.format('delta').mode('append').saveAsTable('sensor_readings')