Databricks Certified Data Engineer - Professional


You are tasked with optimizing the storage of a PySpark DataFrame on disk. Discuss how you would control the size of individual part-files when writing the DataFrame to disk. Explain the importance of this control and how it affects query performance.

Simulated

Last updated: December 25, 2025 at 14:03

By setting the maxRecordsPerFile option when writing the DataFrame, which caps the number of records written to each file and thereby bounds the size of the part-files.

58.0%


By using coalesce to reduce the number of partitions before writing, which directly controls the number of part-files and their sizes.

By repartitioning the DataFrame based on a specific column, which can help in creating balanced part-files but does not directly control their size.

13.7%

By increasing the number of shuffle partitions, which indirectly affects the part-file sizes by distributing data more evenly across partitions.

10.2%
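
To make the maxRecordsPerFile option concrete, here is a minimal sketch of how a record cap might be derived and applied. File size matters because many tiny part-files add listing and open overhead for readers, while a few very large files limit how much of a scan can run in parallel. The helper names (estimate_max_records, write_capped), the 128 MiB target, and the average row size are illustrative assumptions, not values from the question; df is assumed to be a pyspark.sql.DataFrame.

```python
def estimate_max_records(target_file_bytes: int, avg_row_bytes: int) -> int:
    """Rough record cap that keeps a part-file near target_file_bytes.

    Both inputs are assumptions the caller supplies; the on-disk size also
    depends on the file format and compression, so treat this as an estimate.
    """
    if target_file_bytes <= 0 or avg_row_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, target_file_bytes // avg_row_bytes)


def write_capped(df, path: str, max_records: int) -> None:
    """Write df as Parquet, starting a new part-file after max_records rows.

    df is assumed to be a pyspark.sql.DataFrame. The cap applies within each
    write task, so it limits the largest file but does not merge small ones.
    """
    (
        df.write
        .option("maxRecordsPerFile", max_records)
        .mode("overwrite")
        .parquet(path)
    )


# Hypothetical sizing: aim for ~128 MiB files with ~512-byte rows.
cap = estimate_max_records(128 * 1024 * 1024, 512)
```

With a real DataFrame this would be called as write_capped(df, "/tmp/out", cap), where the path is a placeholder. The same cap can also be set for the whole session through the spark.sql.files.maxRecordsPerFile configuration. Because the option only sets an upper bound, partitions that are already small still produce small files, which is why it is often paired with coalesce or repartition when the goal is fewer, larger files.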