
Databricks Certified Data Engineer - Professional
You are a data engineer working with a large dataset in Delta Lake on Microsoft Azure. Your organization requires efficient archiving or deletion of old data to comply with data retention policies and optimize storage costs. The dataset includes a timestamp column recording when each record was created. Which of the following strategies should you implement to achieve this goal? Choose the best option and explain why it is the most suitable. (Choose one option)
Explanation:
The most effective strategy for managing old data in Delta Lake, when compliance and cost optimization are the priorities, is a data retention policy keyed on a time-based column such as the creation timestamp. Filtering on that column systematically identifies the records that exceed the defined retention period, so they can be archived or deleted and only data inside the retention window is kept. Option A is incorrect because random selection guarantees neither compliance with retention policies nor efficient storage use. Option B is flawed because unique identifiers do not necessarily correlate with the age of the data. Option D is inappropriate because it is based on access patterns rather than data age, so it does not address the requirement to archive or delete old data.
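As a rough illustration of a timestamp-based retention policy, the sketch below computes a cutoff and builds the Delta Lake SQL statements that would archive and then delete expired rows. It is a minimal sketch under stated assumptions, not a definitive implementation: the table name `events`, the column `created_at`, the archive table `events_archive`, and both helper functions are hypothetical names chosen for this example, and the 30-day retention period is arbitrary.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(now: datetime, retention_days: int) -> datetime:
    """Return the timestamp before which records count as expired."""
    return now - timedelta(days=retention_days)


def build_retention_statements(table, ts_column, cutoff,
                               archive_table=None, vacuum_hours=168):
    """Build SQL statements that archive (optionally) and delete expired rows.

    All table and column names are hypothetical; the archive table is
    assumed to already exist with the same schema as the source table.
    """
    literal = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    predicate = f"{ts_column} < TIMESTAMP '{literal}'"
    statements = []
    if archive_table is not None:
        # Copy expired rows to the archive before deleting them.
        statements.append(
            f"INSERT INTO {archive_table} SELECT * FROM {table} WHERE {predicate}"
        )
    # DELETE removes the rows logically from the current table version.
    statements.append(f"DELETE FROM {table} WHERE {predicate}")
    # VACUUM removes unreferenced data files older than the retention threshold.
    statements.append(f"VACUUM {table} RETAIN {vacuum_hours} HOURS")
    return statements


cutoff = retention_cutoff(datetime(2024, 1, 31, tzinfo=timezone.utc), 30)
statements = build_retention_statements(
    "events", "created_at", cutoff, archive_table="events_archive"
)
for stmt in statements:
    print(stmt)
    # On a cluster, assuming an active SparkSession named `spark`
    # with Delta Lake enabled, each statement could be run with:
    # spark.sql(stmt)
```

Note that `DELETE` alone does not reduce storage: the old data files stay on disk for time travel until `VACUUM` removes files that are no longer referenced and are older than the retention threshold (7 days, or 168 hours, by default). Partitioning the table by a date derived from the timestamp column is a common way to make such deletes cheaper, though whether it suits a given table depends on data volume and query patterns.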