
Answer-first summary for fast verification
Answer: Partition the data by month and use data tiering, placing the current month's data in high-performance storage and older data in more cost-effective storage.
Partitioning by month lets the query engine prune partitions: a query for the latest month reads only that month's partition instead of scanning years of history. Data tiering then keeps the current month's partition in high-performance storage for fast access and moves older partitions to cheaper storage, so historical data stays available at lower cost. Together, these two choices balance query performance against storage cost. Normalizing the dataset (option A) ignores storage cost and may slow access to the latest month because queries must join multiple tables. Storing all data in the cheapest storage (option B) could degrade query performance for recent data whenever the cache misses. Duplicating recent data (option D) adds complexity, since a second storage system must be maintained and combined at query time, and raises costs as the dataset grows.
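The explanation above can be illustrated with a minimal sketch in plain Python. It is not tied to any lakehouse engine: the tier labels `hot` and `cold`, the `YYYY-MM` partition key format, and the helper names `month_key`, `assign_tier`, and `prune_partitions` are all assumptions made for illustration.

```python
from datetime import date

HOT_TIER = "hot"    # hypothetical label for high-performance storage
COLD_TIER = "cold"  # hypothetical label for cost-effective storage


def month_key(d: date) -> str:
    """Return the monthly partition key for a date, e.g. '2024-02'."""
    return f"{d.year:04d}-{d.month:02d}"


def assign_tier(partition: str, today: date) -> str:
    """Place the current month's partition in the hot tier, older ones in cold."""
    return HOT_TIER if partition == month_key(today) else COLD_TIER


def prune_partitions(partitions: list[str], start: date, end: date) -> list[str]:
    """Keep only the partitions whose month falls within [start, end].

    Zero-padded 'YYYY-MM' keys sort chronologically as plain strings,
    so a string comparison is enough here.
    """
    lo, hi = month_key(start), month_key(end)
    return [p for p in partitions if lo <= p <= hi]


# Illustrative data: four monthly partitions, with "today" in February 2024.
partitions = ["2023-11", "2023-12", "2024-01", "2024-02"]
today = date(2024, 2, 15)

tiers = {p: assign_tier(p, today) for p in partitions}
latest = prune_partitions(partitions, date(2024, 2, 1), today)
```

In this sketch, a query for the latest month touches a single partition, and that partition is the only one held in the hot tier. In a real lakehouse, the table format would typically handle partition pruning and a storage lifecycle policy would handle tiering, rather than application code.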
Author: LeetQuiz Editorial Team
In a data lakehouse environment, how should you model a dataset that is frequently queried for the latest month's data but also contains years of historical data, to balance query performance and storage cost?
A
Normalize the dataset into several tables based on access patterns, without considering the impact on storage costs.
B
Store all data in the most cost-effective storage available and rely heavily on caching to optimize query performance.
C
Partition the data by month and use data tiering, placing the current month's data in high-performance storage and older data in more cost-effective storage.
D
Create a duplicate of the latest month's data in a separate high-performance storage system and combine it with historical data at query time to reduce costs.