
Answer-first summary for fast verification
Answer: Partition the Delta table by the `topic` field. This enables the use of directory-level Access Control Lists (ACLs) or Unity Catalog row filters for access restriction and allows for efficient, metadata-only `DELETE` operations based on partition boundaries.
### Explanation

**Correct Choice: A**

Partitioning by the `topic` column is the most efficient solution because:

- **Access Control:** A partitioned Delta table stores each partition value in its own physical subdirectory (e.g., `.../topic=registration/`). You can apply storage-level permissions to these paths, or Unity Catalog row filters keyed on `topic`, to restrict access to PII.
- **Efficient Deletion:** When a `DELETE` predicate references only partition columns, Delta Lake performs a metadata-only operation. It removes the matching files from the table state rather than rewriting them, which stays performant even at massive scale. For a predicate such as `WHERE topic = 'registration' AND event_date < current_date() - 14`, partition pruning limits the work to the `registration` partition; the delete is fully metadata-only only if `event_date` is also a partition column.
- **Single Table:** This approach allows all data to reside in one logical table, satisfying the requirement for a unified schema.

**Why others are incorrect:**

- **B:** Time Travel is intended for auditing and recovering from accidental changes, not for long-term data retention. Data files that the current table version no longer references are eventually removed by `VACUUM` once they exceed the `delta.deletedFileRetentionDuration` threshold, and the transaction log history is itself bounded by `delta.logRetentionDuration`. Until then, the deleted PII remains recoverable.
- **C:** A single Delta table cannot span multiple separate object storage containers or buckets while maintaining ACID guarantees and standard metadata management.
- **D:** PII is defined by the content of the data, not the data type. Storing PII in a `binary` field does not exempt it from compliance regulations like GDPR or CCPA.
- **E:** While similar to A, partitioning by a specific boolean flag is less flexible than partitioning by the source `topic` itself, and the original schema identifies `topic` as the distinguishing attribute.
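To illustrate why a partition-aligned `DELETE` needs no file rewrites, here is a minimal toy model in plain Python. It is not Delta Lake's implementation. The `DataFile` class, the `plan_partition_delete` helper, the `clicks` topic, the file paths, and the dates are all hypothetical, and the sketch assumes `event_date` is a partition column alongside `topic`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataFile:
    """One data file tracked by a toy table log (hypothetical model)."""
    path: str
    topic: str         # partition value, identical for every row in the file
    event_date: date   # ASSUMPTION: also a partition column in this sketch
    num_rows: int


def plan_partition_delete(files, topic, cutoff):
    """Split files into (dropped, kept) for the partition-only predicate
    topic = <topic> AND event_date < <cutoff>.

    Every row in a file shares the file's partition values, so each file
    matches the predicate either entirely or not at all. Matching files
    are simply dropped from the table state; nothing is rewritten.
    """
    def matches(f):
        return f.topic == topic and f.event_date < cutoff

    dropped = [f for f in files if matches(f)]
    kept = [f for f in files if not matches(f)]
    return dropped, kept


# Hypothetical table contents: two registration files and one clicks file.
files = [
    DataFile("topic=registration/event_date=2024-01-01/part-0",
             "registration", date(2024, 1, 1), 100),
    DataFile("topic=registration/event_date=2024-01-20/part-0",
             "registration", date(2024, 1, 20), 80),
    DataFile("topic=clicks/event_date=2024-01-01/part-0",
             "clicks", date(2024, 1, 1), 500),
]

dropped, kept = plan_partition_delete(files, "registration", date(2024, 1, 15))
print([f.path for f in dropped])
print(sum(f.num_rows for f in kept))
```

In this model, only the old `registration` file is removed and the other two files are left untouched. If the predicate filtered on a column that is not a partition column, a single file could hold both matching and non-matching rows, and the surviving rows would have to be written to a new file.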
Author: LeetQuiz Editorial Team
A data engineering team is managing a single Delta Lake table that ingests data from several Kafka topics. They have two primary requirements for data governance and compliance:
Which of the following solutions satisfies these requirements while maintaining high performance?
A
Partition the Delta table by the `topic` field. This enables the use of directory-level Access Control Lists (ACLs) or Unity Catalog row filters for access restriction and allows for efficient, metadata-only `DELETE` operations based on partition boundaries.
B
Implement a biweekly job that deletes all data in the table, then use Delta Lake's Time Travel functionality to recover and provide access to the non-PII historical records.
C
Store the PII and non-PII data in the same table but use separate underlying object storage containers for each partition to ensure isolation at the physical storage layer.
D
Since the PII data is stored within a binary type field, it is automatically excluded from PII classification, meaning no specific access controls or deletion policies are required.
E
Partition the data using a custom boolean field named `is_registration`. Apply Access Control Lists (ACLs) to the resulting directory and execute `DELETE` statements against this flag.