Databricks Certified Data Engineer - Associate

Get started today

Ultimate access to all questions.

Deep dive into the quiz with AI chat providers.

We prepare a focused prompt with your quiz and certificate details so each AI can offer a more tailored, in-depth explanation.

A data engineer is curating data in the silver layer of a hospital management data warehouse system. The data engineer is trying to aggregate hospital billing data from a table patient_billing to generate a daily revenue fact table daily_revenue.

Assume this as a sample of dataframe billing_df:

billing_id	patient_id	department	billing_date	amount_billed	quantity
401	p001	Cardiology	2024-03-01	1500	1
402	p002	Radiology	2024-03-02	3000	1
403	p001	Cardiology	2024-03-01	6500	1
404	p003	Radiology	2024-03-03	500	1

Which code snippet aggregates the amount billed per day with the unique invoices from a Dataframe billing_df? (Options were provided separately.)

Other Sources

Community

DDucse

Last updated: December 7, 2025 at 14:03

daily_revenue_df = billing_df.groupBy("billing_date").agg( sum("amount_billed").alias("total_revenue"), sum("billing_id").alias("total_invoices") )

daily_revenue_df = billing_df.groupBy("billing_date").agg( col("amount_billed").alias("total_revenue"), count("billing_id").alias("total_invoices") )

daily_revenue_df = billing_df.groupBy("billing_date").agg( sum("amount_billed").alias("total_revenue"), count_distinct("patient_id").alias("total_invoices") )

daily_revenue_df = billing_df.groupBy("billing_date").agg( sum("amount_billed").alias("total_revenue"), count_distinct("billing_id").alias("total_invoices") )

Explanation:

Correct answer: D.

Reasoning:

Goal: calculate total daily revenue (sum of amount_billed per billing_date) and the number of unique invoices per day (unique billing_id count).
Option A wrongly uses sum("billing_id") to compute invoices — summing invoice ids is meaningless for counting invoices.
Option B uses count("billing_id") which counts rows; that can be incorrect in the presence of duplicate invoice IDs (if the same invoice appears multiple rows). The question explicitly asks for unique invoices, so we need a distinct count.
Option C counts distinct patient_id which gives the number of unique patients per day, not unique invoices.
Option D correctly sums amount_billed and uses count_distinct("billing_id") (in PySpark use F.countDistinct("billing_id") or F.countDistinct(col("billing_id"))) to get the unique invoice count.

Example expected output from the sample data:

2024-03-01: total_revenue = 1500 + 6500 = 8000, total_invoices = 2 (billing_id 401 and 403)
2024-03-02: total_revenue = 3000, total_invoices = 1
2024-03-03: total_revenue = 500, total_invoices = 1

Recommended PySpark code: from pyspark.sql import functions as F

daily_revenue_df = billing_df.groupBy("billing_date").agg( F.sum("amount_billed").alias("total_revenue"), F.countDistinct("billing_id").alias("total_invoices") )

Section: ELT With Spark SQL and Python

Powered ByGemini-3 Flash

Comments

Loading comments...