
Answer-first summary for fast verification
Answer: Use `write.option('spark.sql.files.maxPartitionBytes', '134217728')` to cap individual part-files at 128 MB, ensuring compliance with organizational policies, and then write the DataFrame to disk with encryption enabled.
Option D is the correct choice because it is the only option that directly targets the 128 MB part-file limit: the value 134217728 bytes equals 128 × 1024 × 1024, i.e. exactly 128 MB, and the write is performed with encryption enabled, satisfying the at-rest security requirement. Options A and B adjust the number of partitions with repartition() and coalesce(), which influences how many output files are written but gives no guarantee about the size of any individual file. Option C's partitionBy() organizes output directories by column values for efficient querying, but likewise says nothing about part-file size. Option D is therefore the only choice that addresses both stated requirements. (Note that in Spark's configuration reference, 'spark.sql.files.maxPartitionBytes' is documented as a read-side setting that caps the bytes packed into each input partition; a common write-side counterpart for limiting output file size is the 'maxRecordsPerFile' option.)
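The byte value quoted in the option can be sanity-checked with simple arithmetic; a minimal sketch:

```python
# 128 MB expressed in bytes: 128 * 1024 * 1024
MAX_PART_FILE_BYTES = 128 * 1024 * 1024
print(MAX_PART_FILE_BYTES)  # → 134217728
```

This confirms that the string '134217728' passed to the option corresponds exactly to the 128 MB policy limit.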
Author: LeetQuiz Editorial Team
In the context of optimizing a PySpark DataFrame write operation to disk for subsequent read performance, consider the following scenario: You are tasked with writing a large DataFrame to disk in a manner that ensures optimal read performance for downstream processing. The DataFrame contains sensitive data that must be encrypted at rest, and the solution must comply with organizational policies that limit the maximum size of individual part-files to 128MB to facilitate efficient data processing. Which of the following approaches BEST meets these requirements? Choose the single best option.
A
Use the repartition() method to increase the number of partitions, ensuring data is evenly distributed, and then write the DataFrame to disk with encryption enabled.
B
Use the coalesce() method to reduce the number of partitions, minimizing the overhead of small files, and then write the DataFrame to disk with encryption enabled.
C
Use the write.partitionBy() method to organize the data by specific columns for efficient querying, and then write the DataFrame to disk with encryption enabled.
D
Use write.option('spark.sql.files.maxPartitionBytes', '134217728') to limit the size of individual part-files to 128MB, ensuring compliance with organizational policies, and then write the DataFrame to disk with encryption enabled.
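Option D's approach can be sketched as follows. This is a hedged illustration, not a verbatim recipe: `df` and the output path are assumed placeholders, and encryption at rest is assumed to be enforced by the storage layer (e.g. encrypted volumes or bucket-level encryption), since `DataFrameWriter` itself exposes no encryption flag.

```python
# The 128 MB limit named in the question, as a string option value.
MAX_PART_FILE_BYTES = "134217728"  # 128 * 1024 * 1024 bytes

def write_with_size_limit(df, path):
    """Sketch of option D: pass the 128 MB size setting from the question
    and write Parquet output; encryption at rest is assumed to be handled
    by the underlying storage layer."""
    (df.write
       .option("spark.sql.files.maxPartitionBytes", MAX_PART_FILE_BYTES)
       .mode("overwrite")
       .parquet(path))
```

A typical call would be `write_with_size_limit(df, "/data/out")` against an active SparkSession; both the DataFrame and the path are hypothetical here.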