
Answer-first summary for fast verification
Answer: df.groupBy('user_id').agg(countDistinct('subscription_type').alias('distinct_count')).filter('distinct_count != 1'), df.groupBy('user_id').agg(collect_set('subscription_type').alias('subs')).filter(size('subs') != 1)
B and D are correct because both flag every user_id whose number of distinct non-null subscription_type values is not exactly 1. B counts them with countDistinct and keeps groups where the count differs from 1; D collects them with collect_set and keeps groups where the set size differs from 1. Both functions ignore nulls, so a user_id whose subscription_type is always null yields 0 and is flagged along with users that have two or more types. A flags only users with more than one distinct type and misses the zero case. C deduplicates (user_id, subscription_type) pairs and then counts rows per user, so it also flags only counts above 1, and because dropDuplicates keeps a null as its own row it counts null as a type. E counts non-null rows rather than distinct types, so it wrongly flags a user who has the same subscription_type on several rows.
Author: LeetQuiz Editorial Team
Suppose you have a DataFrame df with columns user_id and subscription_type. How would you ensure that each user_id is associated with just one unique subscription_type? Provide the Spark code to validate this. (Choose Two)
A
df.groupBy('user_id').agg(countDistinct('subscription_type').alias('distinct_count')).filter('distinct_count > 1')
B
df.groupBy('user_id').agg(countDistinct('subscription_type').alias('distinct_count')).filter('distinct_count != 1')
C
df.dropDuplicates(['user_id', 'subscription_type']).groupBy('user_id').count().filter('count > 1')
D
df.groupBy('user_id').agg(collect_set('subscription_type').alias('subs')).filter(size('subs') != 1)
E
df.groupBy('user_id').agg(count('subscription_type').alias('cnt')).filter('cnt > 1')
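The difference between the five options is easiest to see on a small dataset. The sketch below is a plain-Python model of the aggregation semantics, not Spark code: it assumes that countDistinct and collect_set skip nulls, that count on a column counts non-null rows, and that dropDuplicates keeps a null as an ordinary value. The sample rows and the helper names (group, flagged) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (user_id, subscription_type); None models a SQL NULL.
rows = [
    ("u1", "basic"),                   # exactly one type: valid
    ("u2", "basic"), ("u2", "pro"),    # two distinct types: violation
    ("u3", "pro"), ("u3", "pro"),      # same type repeated: still valid
    ("u4", None),                      # no non-null type at all
]

def group(pairs):
    """Group subscription values by user_id, modelling groupBy('user_id')."""
    groups = defaultdict(list)
    for user_id, sub in pairs:
        groups[user_id].append(sub)
    return groups

def flagged(pairs, metric, predicate):
    """Return the sorted user_ids whose per-group metric satisfies the predicate."""
    return sorted(u for u, subs in group(pairs).items() if predicate(metric(subs)))

def distinct_non_null(subs):
    # Models countDistinct(...) and size(collect_set(...)): nulls are skipped.
    return len({s for s in subs if s is not None})

def non_null_rows(subs):
    # Models count('subscription_type'): non-null rows, duplicates included.
    return len([s for s in subs if s is not None])

# Models dropDuplicates(['user_id', 'subscription_type']); a null row survives.
deduped = list(dict.fromkeys(rows))

results = {
    "A": flagged(rows, distinct_non_null, lambda n: n > 1),
    "B": flagged(rows, distinct_non_null, lambda n: n != 1),
    "C": flagged(deduped, len, lambda n: n > 1),  # groupBy().count() counts all rows
    "D": flagged(rows, distinct_non_null, lambda n: n != 1),
    "E": flagged(rows, non_null_rows, lambda n: n > 1),
}

for option, users in results.items():
    print(option, users)
```

Under these assumptions, A and C flag only u2, B and D flag u2 and u4, and E flags u2 and u3. Only B and D catch the user with no non-null subscription_type, and only E mistakes repeated identical rows for a violation.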