
Answer-first summary for fast verification
Answer: Configure `spark.sql.files.maxPartitionBytes` to 512 MB, ingest the JSON data, perform narrow transformations, and then write the resulting DataFrame to Parquet.
The most efficient way to control output file size while avoiding shuffles is to control the **input partition size**.

1. **`spark.sql.files.maxPartitionBytes`**: Setting this configuration to 512 MB instructs Spark to pack at most 512 MB of source data into each input partition when reading the JSON files.
2. **Narrow transformations**: Operations such as `select`, `filter`, or `map` are narrow transformations. They preserve the existing partitioning of the DataFrame because they do not require data to be moved across the network.
3. **Deterministic output**: When writing the DataFrame to Parquet, Spark executes one task per partition. Since each partition is already sized at ~512 MB by the initial configuration and no shuffle has occurred, the resulting Parquet files naturally align with the target size.

**Why the other options are less ideal:**

* **Repartitioning (Option A)** and **sorting (Option C)** are wide transformations. They trigger a full data shuffle across the cluster, which is computationally expensive and slow for a 1 TB dataset.
* **Coalesce (Option D)** reduces the partition count without a full shuffle, but it does not offer the same granular control over the initial data split as `maxPartitionBytes`. Additionally, `spark.sql.adaptive.advisoryPartitionSizeInBytes` is an AQE setting used to optimize shuffle partitions, not the initial scan.
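A minimal PySpark sketch of this approach, assuming a generic JSON-to-Parquet job (the bucket paths and the `payload` column are hypothetical; the config key `spark.sql.files.maxPartitionBytes` is real and takes a byte count):

```python
# Sketch only: requires pyspark at runtime; paths/columns are placeholders.
MAX_PARTITION_BYTES = 512 * 1024 * 1024  # 512 MB expressed in bytes


def convert_json_to_parquet(src: str, dst: str) -> None:
    from pyspark.sql import SparkSession  # lazy import: needs a Spark install

    spark = (
        SparkSession.builder
        .appName("json-to-parquet")
        # Pack at most 512 MB of source data into each input partition.
        .config("spark.sql.files.maxPartitionBytes", str(MAX_PARTITION_BYTES))
        .getOrCreate()
    )

    df = spark.read.json(src)
    # Narrow transformations (filter/select) preserve the ~512 MB partitioning.
    cleaned = df.filter(df["payload"].isNotNull())
    # One write task per partition -> part-files of roughly the target size.
    cleaned.write.mode("overwrite").parquet(dst)
```

Note that no `repartition`, `coalesce`, or `orderBy` appears anywhere: the file sizing falls out of the read-side configuration alone.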
Author: LeetQuiz Editorial Team
A data engineering team is tasked with converting a 1 TB JSON dataset into Parquet format. The goal is to produce part-files that are approximately 512 MB each. Given that built-in Databricks features like Auto-Optimize and Auto-Compaction are not available for this workload, which strategy provides the most efficient performance by ensuring the target file size is met without triggering a data shuffle?
**A.** Ingest the data, perform the necessary narrow transformations, and then use `df.repartition(2048)` to create 2,048 partitions (calculated as 1 TB / 512 MB) before writing to Parquet.

**B.** Configure `spark.sql.files.maxPartitionBytes` to 512 MB, ingest the JSON data, perform narrow transformations, and then write the resulting DataFrame to Parquet.

**C.** Set `spark.sql.shuffle.partitions` to 2,048, ingest the data, apply narrow transformations, and perform an `orderBy` operation to ensure data is sorted and repartitioned before writing to Parquet.

**D.** Set `spark.sql.adaptive.advisoryPartitionSizeInBytes` to 512 MB, ingest the data, apply narrow transformations, and then use `df.coalesce(2048)` to reduce the partition count before writing to Parquet.
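The 2,048 figure that recurs in these options is simply the dataset size divided by the target file size, which can be checked directly (binary units assumed):

```python
# Sanity check of the 2,048 partition/file count used in the options.
DATASET_BYTES = 1024 ** 4        # 1 TB
TARGET_BYTES = 512 * 1024 ** 2   # 512 MB
NUM_FILES = DATASET_BYTES // TARGET_BYTES
print(NUM_FILES)  # 2048
```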