
A data engineer is curating data in the silver layer of a hospital management data warehouse. The engineer wants to aggregate hospital billing data from the table patient_billing to generate a daily revenue fact table, daily_revenue.
Assume the following sample of the DataFrame billing_df:
| billing_id | patient_id | department | billing_date | amount_billed | quantity |
|---|---|---|---|---|---|
| 401 | p001 | Cardiology | 2024-03-01 | 1500 | 1 |
| 402 | p002 | Radiology | 2024-03-02 | 3000 | 1 |
| 403 | p001 | Cardiology | 2024-03-01 | 6500 | 1 |
| 404 | p003 | Radiology | 2024-03-03 | 500 | 1 |
Which code snippet aggregates the amount billed per day and counts the unique invoices in the DataFrame billing_df? (Options were provided separately.)
B
daily_revenue_df = billing_df.groupBy("billing_date").agg(
    col("amount_billed").alias("total_revenue"),
    count("billing_id").alias("total_invoices")
)
C
daily_revenue_df = billing_df.groupBy("billing_date").agg(
    sum("amount_billed").alias("total_revenue"),
    count_distinct("patient_id").alias("total_invoices")
)
D
daily_revenue_df = billing_df.groupBy("billing_date").agg(
    sum("amount_billed").alias("total_revenue"),
    count_distinct("billing_id").alias("total_invoices")
)
Explanation:
Correct answer: D.
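Reasoning: Daily revenue is the sum of amount_billed within each billing_date, and each invoice is identified by its billing_id, so the invoice count is the number of distinct billing_id values per day. Option D does both. Option B passes a bare col("amount_billed") to agg(), which is not an aggregate expression and so cannot produce a daily total. Option C counts distinct patient_id values, which counts patients rather than invoices: on 2024-03-01, patient p001 has two invoices (401 and 403) but would be counted once.
Example expected output from the sample data (row order may vary):
| billing_date | total_revenue | total_invoices |
|---|---|---|
| 2024-03-01 | 8000 | 2 |
| 2024-03-02 | 3000 | 1 |
| 2024-03-03 | 500 | 1 |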
Recommended PySpark code:
from pyspark.sql import functions as F

daily_revenue_df = billing_df.groupBy("billing_date").agg(
    F.sum("amount_billed").alias("total_revenue"),
    F.countDistinct("billing_id").alias("total_invoices"),
)
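To check the aggregation logic without a Spark session, the same grouping can be sketched in plain Python over the four sample rows. This is only an illustration of what the sum and distinct count compute, not Spark code; the helper name `daily_revenue` and its `id_index` parameter are hypothetical and chosen for this example.

```python
from collections import defaultdict

# Sample rows copied from the billing_df table in the question:
# (billing_id, patient_id, department, billing_date, amount_billed, quantity)
rows = [
    (401, "p001", "Cardiology", "2024-03-01", 1500, 1),
    (402, "p002", "Radiology", "2024-03-02", 3000, 1),
    (403, "p001", "Cardiology", "2024-03-01", 6500, 1),
    (404, "p003", "Radiology", "2024-03-03", 500, 1),
]


def daily_revenue(rows, id_index=0):
    """Group by billing_date; sum amount_billed and count distinct ids.

    id_index selects which column is counted distinctly
    (0 = billing_id, 1 = patient_id). Hypothetical helper.
    """
    totals = defaultdict(int)
    ids = defaultdict(set)
    for row in rows:
        billing_date, amount = row[3], row[4]
        totals[billing_date] += amount
        ids[billing_date].add(row[id_index])
    # Map each date to (total_revenue, distinct id count), sorted by date.
    return {d: (totals[d], len(ids[d])) for d in sorted(totals)}


by_invoice = daily_revenue(rows, id_index=0)  # distinct billing_id, as in option D
by_patient = daily_revenue(rows, id_index=1)  # distinct patient_id, as in option C

for date, (revenue, count) in by_invoice.items():
    print(date, revenue, count)
```

Comparing `by_invoice` with `by_patient` for 2024-03-01 shows the difference between options D and C: the first counts two invoices, the second counts one patient.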
Section: ELT With Spark SQL and Python