Databricks Certified Data Engineer - Professional

The data engineering team maintains the following code:

accountDF = spark.table("accounts")
orderDF = spark.table("orders")
itemDF = spark.table("items")
orderWithItemDF = (orderDF.join(
    itemDF,
    orderDF.itemID == itemDF.itemID)
    .select(
        orderDF.accountID, 
        orderDF.itemID,
        itemDF.itemName))
finalDF = (accountDF.join(
    orderWithItemDF,
    accountDF.accountID == orderWithItemDF.accountID)
    .select(
        orderWithItemDF["*"],
        accountDF.city))
(finalDF.write
    .mode("overwrite")
    .table("enriched_itemized_orders_by_account"))

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?




Explanation:

The code performs a batch operation: it joins data from three tables ('accounts', 'orders', and 'items') and writes the result to the table 'enriched_itemized_orders_by_account' using 'overwrite' mode.

With 'overwrite', every execution replaces the entire contents of the target table with the result of the query. It does not perform partial updates, append rows, or use primary keys to decide which rows to change; the previous contents of the target table are discarded. Therefore, the correct statement is that the 'enriched_itemized_orders_by_account' table will be overwritten using the current valid version of the data in each of the three tables referenced in the join logic.
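The distinction between a full overwrite and an incremental, key-based update can be sketched without a Spark cluster. The dict-based "table" and the helper names below are hypothetical stand-ins, not Spark APIs; they only illustrate the semantics that `mode("overwrite")` replaces the whole target rather than merging on a key.

```python
def write_overwrite(target, new_rows):
    """Replace the target's contents entirely, like mode('overwrite')."""
    target.clear()
    target.update({row["accountID"]: row for row in new_rows})

def write_upsert(target, new_rows, key="accountID"):
    """Merge on a key -- this is what 'overwrite' does NOT do."""
    for row in new_rows:
        target[row[key]] = row

# A previous run produced two rows in the target "table".
table = {}
write_overwrite(table, [
    {"accountID": 1, "itemName": "widget", "city": "Austin"},
    {"accountID": 2, "itemName": "gadget", "city": "Boston"},
])

# Today's source tables yield a different set of rows.
todays_rows = [{"accountID": 3, "itemName": "gizmo", "city": "Chicago"}]

# Re-running the job replaces everything; prior rows are gone, not merged.
write_overwrite(table, todays_rows)
print(sorted(table))  # [3]
```

Had the job merged on `accountID` instead (the `write_upsert` path), accounts 1 and 2 would have survived the second run; with overwrite mode they do not.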