
You are writing a large PySpark DataFrame to disk and must ensure optimal read performance for downstream processing. The DataFrame contains sensitive data that must be encrypted at rest, and organizational policy caps individual part-files at 128MB to facilitate efficient data processing. Which of the following approaches BEST meets these requirements? Choose the single best option. (A PySpark sketch of the relevant write mechanics follows the options.)
A. Use the repartition() method to increase the number of partitions, ensuring the data is evenly distributed, and then write the DataFrame to disk with encryption enabled.
B. Use the coalesce() method to reduce the number of partitions, minimizing the overhead of small files, and then write the DataFrame to disk with encryption enabled.
C. Use the write.partitionBy() method to organize the data by specific columns for efficient querying, and then write the DataFrame to disk with encryption enabled.
D. Use write.option('spark.sql.files.maxPartitionBytes', '134217728') to limit the size of individual part-files to 128MB, ensuring compliance with organizational policies, and then write the DataFrame to disk with encryption enabled.
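For context, here is a minimal PySpark sketch of the sizing mechanics the options allude to. The input/output paths, partition count, and record threshold are illustrative assumptions, not values from the question. Note that spark.sql.files.maxPartitionBytes (option D) is a read-side setting that controls how input files are split into partitions; bounding output file sizes is done differently, as shown below.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-sizing-sketch").getOrCreate()

# Hypothetical input path, for illustration only.
df = spark.read.parquet("/data/input")

# spark.sql.files.maxPartitionBytes is a read-time configuration: it
# governs how input files are packed into partitions when Spark *reads*
# them. It does not cap the size of files Spark writes out.

# To bound output file sizes, the DataFrameWriter supports the
# maxRecordsPerFile option. Translating a 128MB cap into a record count
# requires estimating the average serialized row size, so the threshold
# below is an assumption to be tuned against real data.
(
    df.repartition(200)                      # spread rows evenly first
      .write
      .option("maxRecordsPerFile", 500_000)  # assumed to yield ~128MB files
      .mode("overwrite")
      .parquet("/data/output")               # hypothetical output path
)

# Encryption at rest is typically delegated to the storage layer (e.g.,
# HDFS transparent encryption or S3 server-side encryption) rather than
# a generic DataFrameWriter flag.
```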