Databricks Certified Data Engineer - Associate


In a Databricks environment, you are working with a dataset that includes a 'user_activity' column. This column contains JSON objects with various user activity data, including a 'login_time' field formatted as 'yyyy-MM-dd HH:mm:ss'. Your task is to ensure data quality by validating that the 'login_time' field is not null for any row in the dataset. Considering the nuances of querying semi-structured JSON data in Databricks SQL and the need for efficient data processing, which of the following Spark SQL queries would you use to achieve this validation? Choose the best option from the following:




Explanation:

Option C is correct because it uses the colon (:) operator, the Databricks SQL syntax for extracting a field from a JSON string column. The expression user_activity:login_time returns the value of the login_time field as a string, or NULL when the field is missing or holds a JSON null. Filtering on user_activity:login_time IS NOT NULL therefore returns only the rows whose JSON contains a non-null login_time. Because the operator reads the field directly from the JSON string, no separate parsing step or predefined schema is needed, which makes it a concise and recommended way to query semi-structured JSON data in Databricks SQL.
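The check described above can be sketched as follows. This is a minimal illustration, not the exam's exact option text: the view name events, the user_id column, and the sample rows are all assumptions made for the example, and the colon operator requires Databricks SQL or Databricks Runtime (it is not part of open-source Spark SQL).

```sql
-- Hypothetical sample data: 'events', 'user_id', and the rows are illustrative only.
CREATE OR REPLACE TEMPORARY VIEW events AS
SELECT * FROM VALUES
  (1, '{"login_time": "2024-01-15 08:30:00", "page": "home"}'),  -- valid login_time
  (2, '{"login_time": null, "page": "settings"}'),               -- JSON null
  (3, '{"page": "search"}')                                      -- field missing
AS t(user_id, user_activity);

-- Keep only rows whose JSON has a non-null login_time.
-- The colon operator extracts the field from the JSON string as a string.
SELECT
  user_id,
  user_activity:login_time AS login_time
FROM events
WHERE user_activity:login_time IS NOT NULL;

-- Data-quality view of the same rule: count the rows that violate it.
SELECT
  count(*)                                    AS total_rows,
  count_if(user_activity:login_time IS NULL)  AS missing_login_time
FROM events;
```

With this sample data, rows 2 and 3 are expected to be filtered out by the first query, since Databricks documents that a JSON null is surfaced as a SQL NULL rather than as the text 'null'. Outside Databricks, get_json_object(user_activity, '$.login_time') is the closest standard Spark SQL equivalent for the extraction step.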