
Answer-first summary for fast verification
Answer: The approx_count_distinct() operation cannot determine an exact number of distinct values in a column.
The code block uses `approx_count_distinct()`, which is intended for approximate counts, not exact counts. The correct function for an exact count of distinct values is `countDistinct()`. Option E accurately states that `approx_count_distinct()` cannot provide an exact count. The other options are incorrect because: the `alias()` operation is valid (B is wrong), exact counts are achievable with `countDistinct()` (C is wrong), and `approx_count_distinct()` can be used as a standalone function (D is wrong). Option A is misleading because adjusting the `rsd` parameter still results in an approximation, not an exact count.
Author: LeetQuiz Editorial Team
Identify the error in the following code block intended to return the exact number of distinct values in the division column of DataFrame storesDF:
Code block:
storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))
A
The approx_count_distinct() operation needs a second argument to set the rsd parameter to ensure it returns the exact number of distinct values.
B
There is no alias() operation for the approx_count_distinct() operation's output.
C
There is no way to return an exact distinct count in Spark because the data is distributed across partitions.
D
The approx_count_distinct() operation is not a standalone function; it should be used as a method on a Column object.
E
The approx_count_distinct() operation cannot determine an exact number of distinct values in a column.