
Answer-first summary for fast verification
Answer: Partition the output into a Parquet table based on one or more columns to optimize query performance through predicate pushdown and minimize data scanned.
Option C is the correct choice. Partitioning the output on one or more columns lets query engines skip entire partitions whose values do not match a filter (partition pruning), which complements Parquet's built-in row-group predicate pushdown and significantly reduces the amount of data scanned, thereby improving performance. This approach also scales well as the dataset grows, maintaining efficient query performance without a proportional increase in cost. Option A, while cost-effective in terms of storage, does not optimize for read performance and creates a single-file bottleneck for parallel reads. Option B improves read parallelism but risks excessive metadata overhead and the classic "small files" problem. Option D offers flexibility, but repartitioning at read time means every query still scans the full dataset first, so it cannot match the performance of partitioning at write time.
Author: LeetQuiz Editorial Team
You are designing a Spark application to process a large dataset with complex transformations. The final output needs to be stored in a Parquet format for efficient querying in a data lake environment. The dataset is expected to grow over time, and the solution must support high performance for both write and read operations, while also being cost-effective. Considering these requirements, which of the following strategies would you choose and why? (Choose one option.)
A
Store the output in a single large Parquet file to reduce storage costs and simplify file management.
B
Distribute the output across multiple small Parquet files to enhance read parallelism but risk increasing metadata overhead.
C
Partition the output into a Parquet table based on one or more columns to optimize query performance through predicate pushdown and minimize data scanned.
D
Use a non-partitioned Parquet table and dynamically repartition the data at read time to adjust to varying query patterns.