
What is the optimal strategy for writing a 1 TB JSON dataset to Parquet with a target file size of 512 MB per partition, while avoiding data shuffling, when Delta Lake's built-in file-sizing features like Auto-Optimize and Auto-Compaction are unavailable?
A
Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to Parquet.
B
Set spark.sql.shuffle.partitions to 2,048 partitions (1 TB × 1024 × 1024 / 512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to Parquet.
C
Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1 TB × 1024 × 1024 / 512), and then write to Parquet.
D
Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1 TB × 1024 × 1024 / 512), and then write to Parquet.
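
For reference, a minimal PySpark sketch of how the configurations named in the options above would be expressed. The SparkSession setup, the input/output paths, and the sample narrow transformation are illustrative assumptions, not part of the original question.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-to-parquet-sizing").getOrCreate()

# Target partition count referenced in options B, C, and D:
# 1 TB expressed in MB (1 * 1024 * 1024) divided by the 512 MB target size.
target_partitions = (1 * 1024 * 1024) // 512  # = 2,048

# Option A: cap the size of each read split; no shuffle is introduced.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

# Option B: set the shuffle partition count (only takes effect if a wide
# operation, such as the sort, actually shuffles the data).
spark.conf.set("spark.sql.shuffle.partitions", str(target_partitions))

# Option C: advisory size (in bytes) used by Adaptive Query Execution when
# it coalesces shuffle partitions.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(512 * 1024 * 1024))

df = spark.read.json("/data/events_json")             # ingest (path assumed)
df = df.withColumn("ingest_date", F.current_date())   # example narrow transformation
df.write.mode("overwrite").parquet("/data/events_parquet")

# Option D, for contrast: an explicit repartition forces a full shuffle of
# the dataset before the write.
# df.repartition(target_partitions).write.mode("overwrite").parquet("/data/events_parquet")
```

Note that repartition() always triggers a full shuffle, spark.sql.shuffle.partitions and the AQE advisory size only influence partitioning once a shuffle actually occurs, while spark.sql.files.maxPartitionBytes acts at read time on the input splits themselves.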