Databricks Certified Data Engineer - Professional

What is the optimal strategy for writing a 1 TB JSON dataset to Parquet with a target file size of 512 MB per partition, while avoiding data shuffling, when Delta Lake's built-in file-sizing features like Auto-Optimize and Auto-Compaction are unavailable?




Explanation:

The goal is to write a 1 TB JSON dataset to Parquet files of roughly 512 MB each without shuffling. Shuffling is expensive and should be avoided, so the key is to control the number of partitions during data ingestion and processing rather than redistributing the data afterwards.

  • Option A: Setting spark.sql.files.maxPartitionBytes to 512 MB ensures that Spark splits the JSON input into read partitions of at most 512 MB each (about 2048 partitions for 1 TB). Narrow transformations (e.g., filter, select) preserve the partition count, and writing to Parquet produces one file per partition, achieving the target file size without shuffling (see the sketch after this explanation).

  • Option B: Sorting triggers a shuffle, violating the no-shuffling requirement.

  • Option C: spark.sql.adaptive.advisoryPartitionSizeInBytes only takes effect when Adaptive Query Execution coalesces shuffle partitions, and there is no shuffle in this job, so the setting is ignored. Coalescing to 2048 partitions can also only reduce the partition count, and without a shuffle there is no guarantee that more than 2048 partitions exist to begin with.

  • Option D: repartition(2048) causes a full shuffle, which is explicitly avoided for performance.

Thus, Option A is correct as it directly controls input partitioning, avoids shuffling, and achieves the target file size.
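
The Option A strategy can be illustrated with a minimal PySpark sketch. The paths, column names, and app name below are hypothetical placeholders, not part of the original question:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("json-to-parquet-512mb").getOrCreate()

  # Cap each input split at 512 MB so the 1 TB JSON source is read as roughly
  # 2048 partitions (1 TB / 512 MB).
  spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

  # Read the JSON dataset; the partition count is driven by maxPartitionBytes.
  df = spark.read.json("/mnt/raw/events/")  # hypothetical source path

  # Only narrow transformations are applied, so the read-time partition count
  # is preserved and no exchange (shuffle) is introduced into the plan.
  cleaned = (
      df.filter("event_id IS NOT NULL")
        .select("event_id", "event_type", "event_ts")
  )

  # The writer emits one Parquet file per partition, each sized by its input split.
  cleaned.write.mode("overwrite").parquet("/mnt/curated/events/")  # hypothetical target path

Because every transformation here is narrow, the read partitions flow straight through to the writer and no exchange appears in the physical plan, which is exactly what distinguishes Option A from the repartition- and sort-based alternatives.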