
Explanation:
The most efficient way to control output file size while avoiding shuffles is to control the input partition size.
spark.sql.files.maxPartitionBytes: Setting this configuration to 512 MB instructs Spark to pack at most 512 MB of source data into each input partition when reading the JSON files.
Narrow transformations: Operations such as select, filter, or map are narrow transformations. They preserve the existing partitioning of the DataFrame because they do not require data to be moved across the network. Each partition is then written by a single task as its own part-file, so the partitioning chosen at read time carries through to the output files.
Why other options are less ideal:
maxPartitionBytes. Additionally, advisoryPartitionSizeInBytes is an AQE setting typically used to optimize shuffle partitions, not the initial scan.
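The approach described in the explanation can be sketched in PySpark roughly as follows. This is a minimal illustration, not a reference implementation: the function names, the paths passed in, and the event_id column are hypothetical, and the 512 MB value is written out in bytes to avoid relying on any size-string parsing.

```python
TARGET_PARTITION_BYTES = 512 * 1024 * 1024  # 512 MB, expressed in bytes


def scan_conf(target_bytes: int = TARGET_PARTITION_BYTES) -> dict:
    """Return the reader setting that caps how much input each partition holds."""
    return {"spark.sql.files.maxPartitionBytes": str(target_bytes)}


def convert_json_to_parquet(src_path: str, dst_path: str) -> None:
    # Imported here so the helper above can be used without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()
    for key, value in scan_conf().items():
        spark.conf.set(key, value)

    # The scan packs at most TARGET_PARTITION_BYTES of JSON into each partition.
    df = spark.read.json(src_path)

    # Narrow transformation only: no shuffle, so the partitioning is unchanged.
    # "event_id" is a hypothetical column name used for illustration.
    cleaned = df.filter(df["event_id"].isNotNull())

    # Each partition is written by one task as one part-file.
    cleaned.write.mode("overwrite").parquet(dst_path)
```

Note that the setting bounds the amount of JSON read per partition. Parquet is compressed and columnar, so the resulting part-files will generally not be the same size as the input they came from, and the value may need tuning against observed output sizes.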
A data engineering team is tasked with converting a 1 TB JSON dataset into Parquet format. The goal is to produce part-files that are approximately 512 MB each. Given that built-in Databricks features like Auto-Optimize and Auto-Compaction are not available for this workload, which strategy provides the most efficient performance by ensuring the target file size is met without triggering a data shuffle?
A. Ingest the data, perform the necessary narrow transformations, and then use df.repartition(2048) to create 2,048 partitions (calculated as 1 TB / 512 MB) before writing to Parquet.
B. Configure spark.sql.files.maxPartitionBytes to 512 MB, ingest the JSON data, perform narrow transformations, and then write the resulting DataFrame to Parquet.
C. Set spark.sql.shuffle.partitions to 2,048, ingest the data, apply narrow transformations, and perform an orderBy operation to ensure data is sorted and repartitioned before writing to Parquet.
D. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, apply narrow transformations, and then use df.coalesce(2048) to reduce the partition count before writing to Parquet.