Databricks Certified Data Engineer - Associate

You are working with Azure Databricks and need to process a large dataset stored in a CSV file named 'sales_data.csv', located in the 'data' directory of your Azure Blob Storage. The dataset contains sales transactions with a header row, and you want the schema to be inferred automatically so you can avoid defining it manually. You also need to create a table named 'sales_table' for further analysis. Considering the requirements of cost efficiency, performance, and schema inference, which of the following Spark queries would you use to read the CSV file and create the table? Choose the best option and explain why it is the most suitable for this scenario.




Explanation:

The correct answer is A, as it uses the 'USING CSV' clause to specify the data source type as CSV, which is appropriate for reading CSV files directly. The OPTIONS clause supplies the path to the file in Azure Blob Storage, sets 'header' to 'true' so the first row is used for column names, and sets 'inferSchema' to 'true' so the data type of each column is inferred automatically. This approach is cost-efficient and performs well for the given scenario because it reads the CSV file in place and avoids the overhead of first converting the data to another format such as Delta, Parquet, or ORC.
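
The answer options are not reproduced above, but based on the explanation, option A would look roughly like the following minimal sketch. The storage path is a placeholder assumption; substitute your own container, storage account, and file location.

```sql
-- Sketch of the query described in the explanation (option A).
-- The path below is illustrative only; replace the container and
-- storage account with your actual Azure Blob Storage location.
CREATE TABLE sales_table
USING CSV
OPTIONS (
  path 'wasbs://<container>@<storage_account>.blob.core.windows.net/data/sales_data.csv',
  header 'true',       -- treat the first row as column names
  inferSchema 'true'   -- infer column data types automatically
);
```

Because the table is defined directly over the CSV source, Spark reads the file where it sits; no intermediate conversion to Delta, Parquet, or ORC is required before the data becomes queryable as 'sales_table'.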