
Answer-first summary for fast verification
Answer: Set `spark.sql.files.maxPartitionBytes` to 512 MB, ingest the data, execute the narrow transformations, and then write to Parquet.
The goal is to write a 1 TB JSON dataset into Parquet files of ~512 MB each without shuffling. Shuffling is expensive and should be avoided; the key is to control the number of partitions during data ingestion and processing.

- **Option A**: Setting `spark.sql.files.maxPartitionBytes` to 512 MB ensures that Spark creates read partitions of up to 512 MB when ingesting the JSON data. Narrow transformations (e.g., `filter`, `select`) preserve the partition count, and writing to Parquet produces one file per partition, achieving the target size without a shuffle (see the sketch below).
- **Option B**: Sorting triggers a shuffle, violating the no-shuffling requirement.
- **Option C**: `spark.sql.adaptive.advisoryPartitionSizeInBytes` applies to adaptive query execution during shuffles, which are absent here. Coalescing to 2,048 partitions also requires the existing partition count to be higher, which isn't guaranteed without a shuffle.
- **Option D**: `repartition(2048)` causes a full shuffle, which is explicitly avoided for performance.

Thus, **Option A** is correct: it directly controls input partitioning, avoids shuffling, and achieves the target file size.
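A minimal PySpark sketch of the Option A approach, assuming hypothetical input/output paths, app name, and column names (`event_id`, `event_type`, `ts`):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-parquet-512mb")  # hypothetical app name
    # Cap each input split at 512 MB so Spark creates ~512 MB read partitions.
    .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
    .getOrCreate()
)

# Ingest the 1 TB JSON dataset; partition sizing is driven by maxPartitionBytes.
df = spark.read.json("s3://bucket/raw/events/")  # hypothetical path

# Narrow transformations preserve the partition count,
# so no shuffle occurs between read and write.
cleaned = (
    df.filter(df["event_type"].isNotNull())  # hypothetical column
      .select("event_id", "event_type", "ts")
)

# One Parquet file is written per partition, yielding files near the target size.
cleaned.write.mode("overwrite").parquet("s3://bucket/curated/events/")  # hypothetical path
```

Note that the resulting Parquet files will typically be somewhat smaller than 512 MB on disk, since Parquet's columnar encoding and compression shrink the data relative to the raw JSON input splits.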
Author: LeetQuiz Editorial Team
What is the optimal strategy for writing a 1 TB JSON dataset to Parquet with a target file size of 512 MB per partition, while avoiding data shuffling, when Delta Lake's built-in file-sizing features like Auto-Optimize and Auto-Compaction are unavailable?
A
Set `spark.sql.files.maxPartitionBytes` to 512 MB, ingest the data, execute the narrow transformations, and then write to Parquet.
B
Set `spark.sql.shuffle.partitions` to 2,048 partitions (1 TB × 1024 × 1024 / 512 MB), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to Parquet.
C
Set `spark.sql.adaptive.advisoryPartitionSizeInBytes` to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1 TB × 1024 × 1024 / 512 MB), and then write to Parquet.
D
Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1 TB × 1024 × 1024 / 512 MB), and then write to Parquet.