Databricks Certified Associate Developer for Apache Spark


Which of the following operations can be used to create a new DataFrame from DataFrame storesDF without causing a shuffle?

  • A. intersect()
  • B. repartition(1)
  • C. union()
  • D. coalesce(1)
  • E. rdd.getNumPartitions()





Explanation:

The correct operations are union() and coalesce(1).

  • C. union(): Combines two DataFrames by appending rows without shuffling data. It is a narrow transformation as it simply stacks partitions.
  • D. coalesce(1): Reduces the number of partitions by merging existing ones without a full shuffle. coalesce() is a narrow transformation: each output partition is built from whole input partitions. Collapsing to a single partition can still move data between executors, but unlike repartition() it does not add a shuffle stage.
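
The difference between stacking or merging whole partitions and redistributing individual rows can be sketched with a toy model in plain Python. This is not Spark's implementation: the function names (toy_union, toy_coalesce, toy_repartition) are hypothetical, partitions are modeled as lists of lists, and the round-robin dealing is a simplification chosen for illustration.

```python
from typing import List

Partition = List[int]


def toy_union(a: List[Partition], b: List[Partition]) -> List[Partition]:
    """Stack the partitions of both inputs; no row changes partition."""
    return a + b


def toy_coalesce(parts: List[Partition], n: int) -> List[Partition]:
    """Merge whole parent partitions into at most n groups (narrow dependency)."""
    n = min(n, len(parts))  # like coalesce, this model never increases the count
    groups: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        # Every row of parent partition i lands in the same output group.
        groups[i * n // len(parts)].extend(part)
    return groups


def toy_repartition(parts: List[Partition], n: int) -> List[Partition]:
    """Deal rows out one at a time, so one parent feeds many outputs (wide dependency)."""
    out: List[Partition] = [[] for _ in range(n)]
    k = 0
    for part in parts:
        for row in part:
            out[k % n].append(row)
            k += 1
    return out


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(toy_coalesce(parts, 2))     # parent partitions stay intact inside each group
print(toy_repartition(parts, 2))  # rows from each parent are split across outputs
```

In the toy model, the rows of a parent partition always stay together under toy_coalesce, whereas toy_repartition scatters them across outputs. That scattering is the property that forces a shuffle in a real cluster.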

Other options:

  • A. intersect(): Must match rows across both DataFrames and return only the distinct common rows, which generally requires a shuffle.
  • B. repartition(1): Explicitly redistributes data via a shuffle.
  • E. rdd.getNumPartitions(): Returns an integer (number of partitions), not a DataFrame.
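
One way to check these claims yourself is to inspect the physical plan of each candidate, where a shuffle normally appears as an Exchange node. The following is a minimal PySpark sketch under stated assumptions: the helper names (plan_text, has_shuffle, demo), the column name storeId, and the generated sample data are all hypothetical, and the regex is a heuristic that deliberately ignores BroadcastExchange nodes. The exact plan text may vary between Spark versions.

```python
import contextlib
import io
import re


def plan_text(df) -> str:
    """Capture the physical plan that df.explain() prints to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def has_shuffle(plan: str) -> bool:
    """Heuristic: look for a standalone 'Exchange' operator in the plan text."""
    return re.search(r"(?<![A-Za-z])Exchange\b", plan) is not None


def demo() -> None:
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[4]").appName("shuffle-check").getOrCreate()
    )
    # Hypothetical stand-ins for storesDF and a second DataFrame.
    stores_df = spark.range(0, 1000, numPartitions=8).withColumnRenamed("id", "storeId")
    other_df = spark.range(500, 1500, numPartitions=8).withColumnRenamed("id", "storeId")

    candidates = {
        "intersect()": stores_df.intersect(other_df),
        "repartition(1)": stores_df.repartition(1),
        "union()": stores_df.union(other_df),
        "coalesce(1)": stores_df.coalesce(1),
    }
    for name, df in candidates.items():
        print(f"{name:15s} shuffle in plan: {has_shuffle(plan_text(df))}")

    spark.stop()
```

Calling demo() on a machine with PySpark installed prints one line per operation. If the explanation above holds, only intersect() and repartition(1) should report a shuffle.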