
Explanation:
To achieve a specific output file size without a shuffle, you must control the size of the initial input partitions. The configuration spark.sql.files.maxPartitionBytes determines the maximum number of bytes per partition when reading file-based data sources.
Since Spark typically writes one output file per task (partition) in the absence of a shuffle, setting this value to 512 MB for a 1 TB dataset will create approximately 2,048 partitions (1 TB ÷ 512 MB). As long as only narrow transformations (like filter or select) are applied, these partitions remain at the desired size, resulting in ~2,048 Parquet part-files of ~512 MB each with zero shuffle overhead.
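The arithmetic and the read/write flow described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names estimated_partitions and json_to_parquet and the path arguments are hypothetical, and `spark` is assumed to be an existing SparkSession. The partition count is only an estimate, since Spark also takes spark.sql.files.openCostInBytes and the available parallelism into account when it splits files, and Parquet compression usually makes the written files smaller than the input partitions they came from.

```python
import math

MIB = 1024 ** 2
TIB = 1024 ** 4


def estimated_partitions(total_bytes: int, max_partition_bytes: int) -> int:
    """Rough estimate: total input size divided by the per-partition cap."""
    return math.ceil(total_bytes / max_partition_bytes)


def json_to_parquet(spark, source_path: str, target_path: str,
                    max_partition_bytes: int = 512 * MIB) -> None:
    """Read JSON and write Parquet using only narrow operations.

    `spark` is assumed to be an existing SparkSession; both paths are
    hypothetical placeholders supplied by the caller.
    """
    # Cap the number of bytes packed into each input partition at scan time.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))
    df = spark.read.json(source_path)
    # No repartition, sort, or join is applied, so no shuffle is introduced
    # and each scan partition is written by its own task.
    df.write.parquet(target_path)


# 1 TB of input with a 512 MB cap per partition
print(estimated_partitions(1 * TIB, 512 * MIB))  # 2048
```

Setting the value on the session before the read matters, because the cap is applied when the file scan is planned, not when the result is written.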
Why other options are incorrect:
Although coalesce can reduce the number of partitions without a full shuffle, it is unnecessary if the initial scan is already configured to the correct partition size.
A data engineering team needs to ingest a 1 TB JSON dataset and convert it into Parquet format with a target part-file size of approximately 512 MB. Given that Delta Lake features like Auto-Optimize are unavailable, how can they achieve this target size with optimal performance while strictly avoiding any data shuffling?
A
Set spark.sql.shuffle.partitions to 2,048 before ingestion, perform narrow transformations, and apply a sort operation to organize the data before writing to Parquet.
B
Configure spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, perform only narrow transformations, and then write the result to Parquet.
C
Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, perform narrow transformations, and use coalesce to reach 2,048 partitions before writing.
D
Ingest the data, perform narrow transformations, and use repartition(2048) to set the number of output files based on the target size before writing to Parquet.
E
Configure spark.sql.shuffle.partitions to 512 before ingestion, perform narrow transformations, and write the result directly to the destination.