Databricks Certified Data Engineer - Associate


You are working on a project that requires processing a large dataset stored in Azure Databricks. The dataset contains a primary key 'id' and a timestamp column 'event_time'. Your task is to create a new table that ensures data uniqueness based on the 'id' column and converts the 'event_time' to a timestamp format for accurate time-based analysis. Considering the requirements for data uniqueness, correct timestamp conversion, and optimal performance, which of the following Spark SQL queries would you choose? (Choose one option)




Explanation:

Option C is correct: it uses the DISTINCT keyword to eliminate duplicate rows and the CAST function to convert 'event_time' to a timestamp, which provides both data uniqueness and an accurate time representation. Note that DISTINCT compares all selected columns, so 'id' ends up unique only when duplicate records are identical in every selected column. Option A does not cast 'event_time'. Option B uses GROUP BY, which is unnecessary for simply removing duplicates. Option D applies FROM_UNIXTIME, which expects a numeric Unix epoch value in seconds and returns a formatted string, so it does not suit other timestamp formats.
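
The answer options themselves are not shown above, so the following is only a minimal sketch of the kind of query the explanation describes, not the exact text of Option C. The table names `events_raw`, `events_clean`, and `events_clean_by_id` are hypothetical. The second statement is not one of the options; it is included as an assumption-labeled variant for the case where the same 'id' can appear with different 'event_time' values, which DISTINCT alone would not collapse.

```sql
-- Sketch of the DISTINCT + CAST approach described in the explanation.
-- events_raw and events_clean are hypothetical table names.
CREATE OR REPLACE TABLE events_clean AS
SELECT DISTINCT
  id,
  CAST(event_time AS TIMESTAMP) AS event_time
FROM events_raw;

-- Variant (not among the options): keep exactly one row per id,
-- here the most recent event, when duplicates differ in event_time.
CREATE OR REPLACE TABLE events_clean_by_id AS
SELECT id, event_time
FROM (
  SELECT
    id,
    CAST(event_time AS TIMESTAMP) AS event_time,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY CAST(event_time AS TIMESTAMP) DESC
    ) AS rn
  FROM events_raw
) AS ranked
WHERE rn = 1;
```

The first form is enough when duplicate records are exact copies; the second trades a window function for a guarantee of one row per 'id'.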