
Answer-first summary for fast verification
Answer: Distribute the data validation process across the nodes in the data lake to parallelize the process and handle large volumes of data efficiently.
Option B is correct: it leverages the distributed nature of the data lake to parallelize validation, so large volumes of data can be checked efficiently where they already reside. Option A routes everything through a single centralized tool, which becomes a bottleneck and may not scale for large data lakes. Option C is wrong because data quality and skew directly affect the correctness and performance of analytics. Option D is not scalable, and inspecting only a sample may miss issues that appear elsewhere in the dataset.
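To make the idea in Option B concrete, here is a minimal sketch of distributed validation: each partition of the data is validated independently, and the per-partition summaries are then reduced into one report. This is a hypothetical illustration, not a specific data lake API; the field names (`id`, `amount`), the validation rules, and the use of a local thread pool as a stand-in for validation running on the nodes that hold the data are all assumptions for demonstration.

```python
from concurrent.futures import ThreadPoolExecutor

def validate_partition(rows):
    """Validate one partition: completeness (no missing fields)
    and consistency (amount must be non-negative).
    Returns a per-partition summary dict."""
    errors = 0
    for row in rows:
        if row.get("id") is None or row.get("amount") is None:
            errors += 1          # incomplete record
        elif row["amount"] < 0:
            errors += 1          # inconsistent value
    return {"checked": len(rows), "errors": errors}

def distributed_validate(partitions):
    # Each partition is independent, so validation parallelizes.
    # A thread pool stands in here for the data lake's own nodes,
    # which would each validate the partitions they store locally.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(validate_partition, partitions))
    # Reduce the per-partition summaries into a single report.
    return {
        "checked": sum(r["checked"] for r in results),
        "errors": sum(r["errors"] for r in results),
    }

if __name__ == "__main__":
    partitions = [
        [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}],
        [{"id": 3, "amount": 7.5}, {"id": None, "amount": 1.0}],
    ]
    print(distributed_validate(partitions))  # {'checked': 4, 'errors': 2}
```

Because the map step touches only local data and the reduce step combines small summaries, the pattern scales with the number of nodes rather than the size of any single machine.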
Author: LeetQuiz Editorial Team
In a scenario where you are tasked with optimizing a data lake for analytics, how would you approach ensuring data quality and handling data skew in a distributed storage system?
A
Use a centralized data validation tool to check for data completeness, consistency, accuracy, and integrity before ingestion into the data lake.
B
Distribute the data validation process across the nodes in the data lake to parallelize the process and handle large volumes of data efficiently.
C
Ignore data quality and skew issues, focusing only on the storage capacity and performance of the data lake.
D
Manually inspect a sample of the data to ensure quality and consistency before ingestion into the data lake.
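The question also asks about data skew. A common mitigation in distributed storage is key salting: a hot key is split into several sub-keys so its records spread across partitions instead of overloading one node. The sketch below is a hedged illustration under assumed parameters; the key format (`key#salt`), `num_salts`, and `num_partitions` are made up for the example, and a real system would merge the partial results per salted key afterwards.

```python
import random
from collections import Counter

def salted_key(key: str, num_salts: int = 4) -> str:
    # Appending a small random salt splits one hot key into
    # num_salts sub-keys, so a skewed key no longer forces all
    # of its records onto a single partition.
    return f"{key}#{random.randrange(num_salts)}"

def partition_for(key: str, num_partitions: int = 8) -> int:
    # Hash-partitioning on the (salted) key. Downstream
    # aggregation must merge the num_salts partial results.
    return hash(key) % num_partitions

if __name__ == "__main__":
    # 1,000 records all share the hot key "user_42".
    placements = Counter(
        partition_for(salted_key("user_42")) for _ in range(1000)
    )
    # Unsalted, all 1,000 records hash to one partition;
    # salted, the load spreads over up to 4 partitions.
    print(placements)
```

The trade-off is an extra merge step at read or aggregation time, which is usually cheap compared to the hotspot the salt removes.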