
Answer-first summary for fast verification
Answer: Configure `spark.sql.files.maxPartitionBytes` to 512 MB, ingest the data, perform only narrow transformations, and then write the result to Parquet.
To achieve a specific output file size without a shuffle, you must control the size of the initial input partitions. The configuration `spark.sql.files.maxPartitionBytes` caps the number of bytes packed into a single partition when reading file-based data sources. Since Spark typically writes one output file per task (partition) in the absence of a shuffle, setting this value to 512 MB for a 1 TB dataset creates approximately 2,048 partitions (1 TB ÷ 512 MB). As long as only narrow transformations (like `filter` or `select`) are applied, those partitions keep their size, resulting in ~2,048 Parquet part-files of ~512 MB each with zero shuffle overhead.

**Why the other options are incorrect:**

* **`repartition()` and sort:** Both operations inherently trigger a shuffle, which the question requires avoiding.
* **`spark.sql.shuffle.partitions` and `spark.sql.adaptive.advisoryPartitionSizeInBytes`:** These settings only affect the behavior of shuffle stages. In a pipeline with only narrow transformations, they are ignored.
* **`coalesce`:** While `coalesce` can reduce the partition count without a full shuffle, it is unnecessary if the initial scan is already configured to the correct partition size.
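The arithmetic above can be sketched as follows. This is a minimal illustration, not taken from the question: the `spark.conf.set` call and the input/output paths shown in comments are hypothetical placeholders, and the config value must be supplied in bytes (or with a size suffix such as `512m`), not as the string "512 MB".

```python
# Compute the byte value for spark.sql.files.maxPartitionBytes and
# estimate the resulting number of input partitions / output files.
target_part_file = 512 * 1024 * 1024          # 512 MB expressed in bytes
dataset_size = 1 * 1024 ** 4                  # 1 TB expressed in bytes
estimated_partitions = dataset_size // target_part_file

# In a live PySpark session this would look roughly like (hypothetical paths):
#   spark.conf.set("spark.sql.files.maxPartitionBytes", str(target_part_file))
#   df = spark.read.json("s3://bucket/input/")
#   df.filter("value IS NOT NULL").write.parquet("s3://bucket/output/")  # narrow ops only

print(target_part_file)        # 536870912
print(estimated_partitions)    # 2048
```

Because no wide transformation intervenes, each ~512 MB scan partition flows through to a single write task, so the estimated partition count is also the approximate number of Parquet part-files.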
Author: LeetQuiz Editorial Team
A data engineering team needs to ingest a 1 TB JSON dataset and convert it into Parquet format with a target part-file size of approximately 512 MB. Given that Delta Lake features like Auto-Optimize are unavailable, how can they achieve this target size with optimal performance while strictly avoiding any data shuffling?
A
Set spark.sql.shuffle.partitions to 2,048 before ingestion, perform narrow transformations, and apply a sort operation to organize the data before writing to Parquet.
B
Configure spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, perform only narrow transformations, and then write the result to Parquet.
C
Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, perform narrow transformations, and use coalesce to reach 2,048 partitions before writing.
D
Ingest the data, perform narrow transformations, and use repartition(2048) to set the number of output files based on the target size before writing to Parquet.
E
Configure spark.sql.shuffle.partitions to 512 before ingestion, perform narrow transformations, and write the result directly to the destination.