
You are tasked with optimizing the storage of a PySpark DataFrame on disk. Discuss how you would control the size of individual part-files when writing the DataFrame to disk. Explain the importance of this control and how it affects query performance.
A
By setting the maxRecordsPerFile option when writing the DataFrame, which controls the maximum number of records per file, thus influencing the size of part-files.
B
By using coalesce to reduce the number of partitions before writing, which directly controls the number of part-files and their sizes.
C
By repartitioning the DataFrame based on a specific column, which can help in creating balanced part-files but does not directly control their size.
D
By increasing the number of shuffle partitions (spark.sql.shuffle.partitions), which indirectly affects the part-file sizes by distributing data more evenly across partitions.
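
To make the mechanics behind these options concrete, here is a minimal sketch in plain Python plus one PySpark write call. The helper names estimate_part_files and write_capped are hypothetical and exist only for illustration. The estimator assumes one write task per partition, no file for an empty partition, and a new file each time a task reaches the maxRecordsPerFile limit; real output can differ by format and Spark version.

```python
import math


def estimate_part_files(partition_row_counts, max_records_per_file=0):
    """Estimate how many part-files a write produces.

    Assumptions (illustrative, not a Spark guarantee): one write task per
    partition, no file for an empty partition, and a new file each time a
    task reaches max_records_per_file rows. 0 or less means no cap.
    """
    files = 0
    for rows in partition_row_counts:
        if rows == 0:
            continue  # assumed: empty partitions write nothing
        if max_records_per_file > 0:
            files += math.ceil(rows / max_records_per_file)
        else:
            files += 1
    return files


def write_capped(df, path, max_records_per_file):
    """Write a DataFrame as Parquet with a per-file record cap (hypothetical helper)."""
    (df.write
       .option("maxRecordsPerFile", max_records_per_file)
       .mode("overwrite")
       .parquet(path))


# Row counts per partition for a skewed DataFrame (made-up numbers).
skewed = [2_500_000, 100_000, 0, 400_000]
print(estimate_part_files(skewed))             # no cap -> 3
print(estimate_part_files(skewed, 1_000_000))  # cap of 1M rows per file -> 5
```

With a running SparkSession, a call such as write_capped(df.repartition(8), "/tmp/out", 1_000_000) would combine a partition count with a record cap (the path is a placeholder). Note that maxRecordsPerFile limits rows, not bytes, so the resulting file size still depends on row width and compression. The cap only splits large partitions; it does not merge small ones, so many tiny partitions still produce many tiny files, which adds file listing and open overhead at query time.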