Databricks Certified Data Engineer - Professional


You are designing a data lake solution for a multinational retail company that processes millions of sales transactions daily. The dataset includes columns such as 'transaction_id', 'product_id', 'date', 'region', and 'amount'. The company requires efficient query performance for analyzing sales trends over time and across different regions, while also considering cost optimization and scalability. Given these requirements, which of the following partitioning strategies would you recommend to best meet the company's needs? Choose the single best option.




Explanation:

The correct answer is B. Partitioning by 'date' and 'region' matches the columns the company filters on when analyzing sales trends over time and across regions, so queries can skip partitions outside the requested dates or regions instead of scanning the full dataset. Time and geography are also the natural segmentation for analysis in retail scenarios. Option A is less effective because 'transaction_id' and 'product_id' do not support the analytical queries the company is likely to perform, and a near-unique column such as 'transaction_id' would produce a very large number of tiny partitions. Option C is inappropriate because 'amount' does not provide a meaningful way to segment data for analysis. Option D is incorrect because partitioning is essential for optimizing query performance and cost in large datasets, even with advanced query engines.
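
The effect of partitioning by 'date' and 'region' can be illustrated with a small pure-Python sketch. It is not Databricks or Spark code: the sample rows, the helper names `partition_rows` and `scan`, and the Hive-style `key=value` path layout are assumptions made for illustration. The point is only that a filter on a partition column lets the reader skip every partition whose path does not match.

```python
from collections import defaultdict

# Hypothetical sample rows using the columns named in the question.
transactions = [
    {"transaction_id": 1, "product_id": "p1", "date": "2024-01-01", "region": "EU", "amount": 10.0},
    {"transaction_id": 2, "product_id": "p2", "date": "2024-01-01", "region": "US", "amount": 20.0},
    {"transaction_id": 3, "product_id": "p1", "date": "2024-01-02", "region": "EU", "amount": 15.0},
    {"transaction_id": 4, "product_id": "p3", "date": "2024-01-02", "region": "US", "amount": 5.0},
]


def partition_rows(rows, keys):
    """Group rows under Hive-style paths such as 'date=2024-01-01/region=EU'."""
    partitions = defaultdict(list)
    for row in rows:
        path = "/".join(f"{key}={row[key]}" for key in keys)
        partitions[path].append(row)
    return dict(partitions)


def scan(partitions, filters):
    """Read only partitions whose path satisfies every filter.

    Assumes every filter column is a partition key.
    Returns the matching rows and the number of partitions read.
    """
    wanted = [f"{key}={value}" for key, value in filters.items()]
    rows, partitions_read = [], 0
    for path, part_rows in partitions.items():
        segments = path.split("/")
        if all(segment in segments for segment in wanted):
            partitions_read += 1
            rows.extend(part_rows)
    return rows, partitions_read


by_date_region = partition_rows(transactions, ["date", "region"])
eu_rows, partitions_read = scan(by_date_region, {"region": "EU"})
total_eu = sum(row["amount"] for row in eu_rows)
print(partitions_read, "of", len(by_date_region), "partitions read; EU total =", total_eu)
```

With this sample data, a query for the EU region touches two of the four partitions. On a real table with millions of rows per day, the same skipping is what reduces scan time and cost.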