
Answer-first summary for fast verification
Answer: D. Read the CSV file as a DataFrame and specify the schema manually to avoid the computational overhead of schema inference, especially for large datasets.
Option D is the most efficient and scalable approach for processing large datasets in PySpark. Supplying the schema manually eliminates schema inference, which forces Spark to make an extra pass over the data (or a sample of it) just to determine column types, an expensive step when the file holds millions of records. An explicit schema also makes the data types consistent and predictable, which is important for compliance and data quality. Option A forces the entire file into a single partition, eliminating parallelism and creating a performance bottleneck. Option B improves data distribution, but repartition() triggers a full shuffle and does nothing about the inference overhead. Option C is convenient, but inferSchema adds that extra read over a large file, making it inefficient at this scale.
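For concreteness, a minimal sketch of option D in PySpark is shown below. The file path, column names, and data types are illustrative assumptions, not part of the original question.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("csv-with-explicit-schema").getOrCreate()

# Declaring the schema up front avoids the extra pass over the file
# that inferSchema=True would trigger.
schema = StructType([
    StructField("order_id", StringType(), nullable=False),      # hypothetical column
    StructField("customer_id", StringType(), nullable=True),    # hypothetical column
    StructField("amount", DoubleType(), nullable=True),         # hypothetical column
    StructField("created_at", TimestampType(), nullable=True),  # hypothetical column
])

df = (
    spark.read
    .schema(schema)            # explicit schema: no inference pass
    .option("header", "true")  # skip the header row and use the declared names
    .csv("s3://example-bucket/path/orders.csv")  # hypothetical path
)

df.printSchema()

Because the types are declared rather than guessed, the read is a single pass and the resulting DataFrame has a stable, predictable schema across runs.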
Author: LeetQuiz Editorial Team
You are developing a PySpark application to process a large dataset stored in a CSV file. The dataset contains millions of records, and you need to perform several transformations. Your goal is to optimize the job's performance while considering cost, compliance, and scalability. Which of the following strategies would you choose and why? (Choose one option.)
A. Read the CSV file as a single partition using the spark.read.csv() method to minimize the number of tasks and reduce overhead.
B. Read the CSV file with multiple partitions using the spark.read.csv() method and then apply the repartition() function to evenly distribute the data across the cluster.
C. Read the CSV file as a DataFrame and enable the inferSchema option to automatically infer the schema, ensuring flexibility with varying data types.
D. Read the CSV file as a DataFrame and specify the schema manually to avoid the computational overhead of schema inference, especially for large datasets.