
A data engineering team needs to convert a 1 TB JSON dataset to Parquet, producing part-files of approximately 512 MB each. Given that built-in Databricks features such as Auto-Optimize and Auto-Compaction are not available for this workload, which strategy meets the target file size most efficiently without triggering a data shuffle?
A
Ingest the data, perform the necessary narrow transformations, and then use df.repartition(2048) to create 2,048 partitions (calculated as 1 TB / 512 MB) before writing to Parquet.
B
Configure spark.sql.files.maxPartitionBytes to 512 MB, ingest the JSON data, perform narrow transformations, and then write the resulting DataFrame to Parquet.
C
Set spark.sql.shuffle.partitions to 2,048, ingest the data, apply narrow transformations, and perform an orderBy operation to ensure data is sorted and repartitioned before writing to Parquet.
D
Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, apply narrow transformations, and then use df.coalesce(2048) to reduce the partition count before writing to Parquet.
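For reference, the settings and DataFrame calls named in options A through D can be sketched in PySpark as follows. This is a minimal sketch only: the input and output paths, the SparkSession setup, and the column name used with orderBy are hypothetical placeholders, not part of the question.

```python
# Hypothetical PySpark sketch of the mechanisms referenced in options A-D.
# Paths, session setup, and "some_column" are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Option B: cap the size of each input split at read time.
# This setting applies when files are scanned and does not add a shuffle.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
df = spark.read.json("/data/source_json")                    # placeholder input path
df.write.mode("overwrite").parquet("/data/target_parquet")   # placeholder output path

# Option A: repartition(2048) forces a full shuffle into 2,048 partitions.
# df.repartition(2048).write.parquet("/data/target_parquet")

# Option C: spark.sql.shuffle.partitions only affects shuffle stages, and
# orderBy itself introduces a shuffle (range repartition plus sort).
# spark.conf.set("spark.sql.shuffle.partitions", "2048")
# df.orderBy("some_column").write.parquet("/data/target_parquet")

# Option D: the AQE advisory size tunes how shuffle partitions are coalesced,
# while coalesce(2048) merges existing partitions without a shuffle.
# spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(512 * 1024 * 1024))
# df.coalesce(2048).write.parquet("/data/target_parquet")
```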