
You are developing a PySpark application to process a large dataset stored in a CSV file. The dataset contains millions of records, and you need to perform several transformations. Your goal is to optimize the job's performance while considering cost, compliance, and scalability. Which of the following strategies would you choose, and why? (Choose one option. A code sketch of the API calls involved follows the options.)
A. Read the CSV file as a single partition using the spark.read.csv() method to minimize the number of tasks and reduce overhead.
B. Read the CSV file with multiple partitions using the spark.read.csv() method and then apply the repartition() function to evenly distribute the data across the cluster.
C. Read the CSV file as a DataFrame and enable the inferSchema option to automatically infer the schema, ensuring flexibility with varying data types.
D. Read the CSV file as a DataFrame and specify the schema manually to avoid the computational overhead of schema inference, especially for large datasets.
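For reference, the API calls named in the options look roughly like this. This is a minimal sketch, not part of the question itself: the file path, column names, column types, and partition count are hypothetical placeholders chosen for illustration.

# Sketch of the reading strategies referenced in options B, C, and D.
# "large_dataset.csv" and the schema below are assumed placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType
)

spark = SparkSession.builder.appName("csv-read-strategies").getOrCreate()

# Option C: let Spark infer the schema. Inference requires an extra pass
# over the file, which is costly when it holds millions of records.
df_inferred = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

# Option D: declare the schema up front so Spark can skip the inference pass.
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])
df_explicit = spark.read.csv("large_dataset.csv", header=True, schema=schema)

# Option B: redistribute the data across the cluster after reading.
# Note that repartition() triggers a full shuffle, which has its own cost.
df_balanced = df_explicit.repartition(200)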