
A data engineering team needs to ingest a 1 TB JSON dataset and convert it into Parquet format with a target part-file size of approximately 512 MB. Given that Delta Lake features like Auto-Optimize are unavailable, how can they achieve this target size with optimal performance while strictly avoiding any data shuffling?
A. Set spark.sql.shuffle.partitions to 2,048 before ingestion, perform narrow transformations, and apply a sort operation to organize the data before writing to Parquet.
B. Configure spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, perform only narrow transformations, and then write the result to Parquet.
C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, perform narrow transformations, and use coalesce to reach 2,048 partitions before writing.
D. Ingest the data, perform narrow transformations, and use repartition(2048) to set the number of output files based on the target size before writing to Parquet.
E. Configure spark.sql.shuffle.partitions to 512 before ingestion, perform narrow transformations, and write the result directly to the destination.
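For context, the sketch below is a minimal PySpark example (assumed for illustration) of how the settings named in these options are applied on a SparkSession and where a shuffle would or would not be introduced. The bucket paths, application name, and event_type column are hypothetical; the snippet demonstrates the mechanics rather than endorsing any particular option.

```python
# Minimal PySpark sketch of how the Spark settings named in the options are
# applied in practice. Paths, the app name, and the event_type column are
# hypothetical; values mirror the question (512 MB target, 2,048 partitions).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-parquet")
    # Input-split size: caps how many bytes are packed into one read partition.
    .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
    # Shuffle partition count: only takes effect when a wide transformation runs.
    .config("spark.sql.shuffle.partitions", "2048")
    # AQE advisory size: guides adaptive coalescing of shuffle partitions.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(512 * 1024 * 1024))
    .getOrCreate()
)

# Ingest the 1 TB JSON dataset (hypothetical location).
df = spark.read.json("s3://example-bucket/raw/events/")

# Narrow transformations preserve the existing partitioning and trigger no shuffle.
cleaned = df.filter(df["event_type"].isNotNull())

# repartition(2048), as in option D, would force a full shuffle before the write:
# shuffled = cleaned.repartition(2048)

cleaned.write.mode("overwrite").parquet("s3://example-bucket/curated/events/")
```

Note that spark.sql.shuffle.partitions and the AQE advisory partition size only influence stages that actually shuffle, whereas spark.sql.files.maxPartitionBytes governs the size of the read splits created when scanning the source files.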