Databricks Certified Data Engineer - Associate

Databricks Certified Data Engineer - Associate

Get started today

Ultimate access to all questions.


Suppose you have a DataFrame df with columns user_id and subscription_type. How would you ensure that each user_id is associated with just one unique subscription_type? Provide the Spark code to validate this. (Choose Two)





Explanation:

B and D are both correct ways to identify user_ids associated with more than one subscription_type (i.e., where uniqueness is violated).

B uses countDistinct and filters for those not equal to 1.

D uses collect_set and checks the size of the set.

A only finds users with more than one subscription type, but misses those with zero (if possible).

C checks for users with multiple rows after deduplication, which could be misleading if there are duplicate rows.

E counts all rows per user, not distinct subscription types, so it fails if a user has multiple identical subscriptions.