
Databricks Certified Data Engineer - Associate
Get started today
Ultimate access to all questions.
Suppose you have a DataFrame df
with columns user_id
and subscription_type
. How would you ensure that each user_id
is associated with just one unique subscription_type
? Provide the Spark code to validate this. (Choose Two)
Suppose you have a DataFrame df
with columns user_id
and subscription_type
. How would you ensure that each user_id
is associated with just one unique subscription_type
? Provide the Spark code to validate this. (Choose Two)
Explanation:
B and D are both correct ways to identify user_ids associated with more than one subscription_type (i.e., where uniqueness is violated).
B uses countDistinct and filters for those not equal to 1.
D uses collect_set and checks the size of the set.
A only finds users with more than one subscription type, but misses those with zero (if possible).
C checks for users with multiple rows after deduplication, which could be misleading if there are duplicate rows.
E counts all rows per user, not distinct subscription types, so it fails if a user has multiple identical subscriptions.