
Databricks Certified Data Engineer - Professional
What is the optimal strategy for writing a 1 TB JSON dataset to Parquet with a target file size of 512 MB per partition, while avoiding data shuffling, when Delta Lake's built-in file-sizing features like Auto-Optimize and Auto-Compaction are unavailable?
Explanation:
The goal is to write a 1 TB JSON dataset into Parquet files of ~512 MB each without shuffling. Shuffling is expensive and should be avoided; the key is to control the number of partitions during data ingestion and processing.
- Option A: Setting spark.sql.files.maxPartitionBytes to 512 MB ensures that when Spark reads the JSON data, it creates input partitions of up to 512 MB. Narrow transformations (e.g., filter, select) preserve the partition count, so writing to Parquet produces one file per partition, achieving the target size without shuffling.
- Option B: Sorting triggers a shuffle, violating the no-shuffling requirement.
- Option C: spark.sql.adaptive.advisoryPartitionSizeInBytes applies to adaptive query execution during shuffles, which are absent here. Coalescing to 2048 partitions also requires the existing partition count to be higher, which is not guaranteed without a shuffle.
- Option D: repartition(2048) causes a full shuffle, which is explicitly avoided for performance.
Thus, Option A is correct as it directly controls input partitioning, avoids shuffling, and achieves the target file size.
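The following is a minimal PySpark sketch of the Option A approach. The source path, output path, and column names (event_id, event_type, payload) are illustrative assumptions, not values from the question; only the config key and the read-then-write pattern come from the explanation above.

# Sketch only: paths, columns, and the session builder are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-parquet")
    # Cap each input partition at 512 MB so the 1 TB JSON source is split
    # into roughly 2048 read partitions.
    .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.json("/mnt/raw/events/")  # hypothetical source path

# Narrow transformations (filter, select) preserve the partition count,
# so no shuffle is introduced before the write.
cleaned = (
    df.filter(df["event_type"].isNotNull())
      .select("event_id", "event_type", "payload")
)

# One Parquet file is written per partition, so file sizes track the
# 512 MB input partitions without any repartition or sort.
cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")

On Databricks the SparkSession already exists, so the same setting is more commonly applied with spark.conf.set("spark.sql.files.maxPartitionBytes", ...) before the read; note that columnar compression typically makes the resulting Parquet files somewhat smaller than the 512 MB input partitions.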