
Answer-first summary for fast verification
Answer: D. Read the CSV file as a DataFrame and specify the schema manually to avoid the computational overhead of schema inference, especially for large datasets.
Option D is the most efficient and scalable approach for processing large datasets in PySpark. Supplying the schema manually eliminates schema inference, which forces Spark to make an extra pass over the data (or a sample of it) just to determine column types, an expensive step when the file holds millions of records. An explicit schema also makes the data types consistent and predictable, which is important for compliance and data quality. Option A forces the entire file into a single partition, eliminating parallelism and creating a performance bottleneck. Option B improves data distribution, but repartition() triggers a full shuffle and does nothing about the inference overhead. Option C is convenient, but inferSchema adds that extra read over a large file, making it inefficient at this scale.
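For concreteness, a minimal sketch of option D in PySpark is shown below. The file path, column names, and data types are illustrative assumptions, not part of the original question.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("csv-with-explicit-schema").getOrCreate()

# Declaring the schema up front avoids the extra pass over the file
# that inferSchema=True would trigger.
schema = StructType([
    StructField("order_id", StringType(), nullable=False),      # hypothetical column
    StructField("customer_id", StringType(), nullable=True),    # hypothetical column
    StructField("amount", DoubleType(), nullable=True),         # hypothetical column
    StructField("created_at", TimestampType(), nullable=True),  # hypothetical column
])

df = (
    spark.read
    .schema(schema)            # explicit schema: no inference pass
    .option("header", "true")  # skip the header row and use the declared names
    .csv("s3://example-bucket/path/orders.csv")  # hypothetical path
)

df.printSchema()

Because the types are declared rather than guessed, the read is a single pass and the resulting DataFrame has a stable, predictable schema across runs.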
Author: LeetQuiz Editorial Team
You are developing a PySpark application to process a large dataset stored in a CSV file. The dataset contains millions of records, and you need to perform several transformations. Your goal is to optimize the job's performance while considering cost, compliance, and scalability. Which of the following strategies would you choose and why? (Choose one option.)
A. Read the CSV file as a single partition using the spark.read.csv() method to minimize the number of tasks and reduce overhead.
B. Read the CSV file with multiple partitions using the spark.read.csv() method and then apply the repartition() function to evenly distribute the data across the cluster.
C. Read the CSV file as a DataFrame and enable the inferSchema option to automatically infer the schema, ensuring flexibility with varying data types.
D. Read the CSV file as a DataFrame and specify the schema manually to avoid the computational overhead of schema inference, especially for large datasets.