
Explanation:
The code block calls approx_count_distinct(), which returns an estimate of the number of distinct values, not an exact count. For an exact count, use countDistinct() instead. Option E is therefore correct: approx_count_distinct() cannot guarantee an exact number of distinct values. The other options are incorrect: alias() is a valid operation on the Column returned by approx_count_distinct() (B is wrong); exact distinct counts are achievable with countDistinct(), even though the data is distributed across partitions (C is wrong); and approx_count_distinct() is a standalone function, not a Column method (D is wrong). Option A is misleading because the rsd parameter is optional and only sets the maximum allowed relative standard deviation of the estimate, so the result is still an approximation.
Identify the error in the following code block intended to return the exact number of distinct values in the division column of DataFrame storesDF:
Code block:
storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))
A
The approx_count_distinct() operation needs a second argument to set the rsd parameter to ensure it returns the exact number of distinct values.
B
There is no alias() operation for the approx_count_distinct() operation's output.
C
There is no way to return an exact distinct number in Spark because the data is distributed across partitions.
D
The approx_count_distinct() operation is not a standalone function; it should be used as a method of a Column object.
E
The approx_count_distinct() operation cannot determine an exact number of distinct values in a column.