
Answer-first summary for fast verification
Answer: return spark.readStream.table("bronze")
The function `new_records()` must return an object exposing only the records appended to the bronze table since the last run. The correct approach is an incremental streaming read, `spark.readStream.table("bronze")`, which is option A: Structured Streaming on a Delta table records its progress in a checkpoint, so each micro-batch contains only newly appended records. Options B and D, which enable the Change Data Feed (CDF) reader, are not applicable because nothing in the scenario indicates CDF is enabled on the table. Options C and E are batch reads that try to isolate the latest batch by filtering: C compares `ingest_time` against `current_timestamp()` evaluated at query time, which will never match the earlier ingestion timestamp, and E references year/month/day variables that are not in scope inside `new_records()`.
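A minimal sketch of the completed function, assuming the ambient `spark` session that Databricks notebooks provide as a global:

```python
def new_records():
    # Streaming read of the Delta table "bronze": Structured Streaming
    # tracks consumed data in the query checkpoint, so downstream
    # processing sees only records appended since the last run.
    return spark.readStream.table("bronze")
```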
Author: LeetQuiz Editorial Team
A nightly job ingests data into a Delta Lake table using the following code:
from pyspark.sql.functions import current_timestamp, input_file_name, col
from pyspark.sql.column import Column

def ingest_daily_batch(time_col: Column, year: int, month: int, day: int):
    (spark.read
        .format("parquet")
        .load(f"/mnt/daily_batch/{year}/{month}/{day}")
        .select("*",
                time_col.alias("ingest_time"),
                input_file_name().alias("source_file"))
        .write
        .mode("append")
        .saveAsTable("bronze"))
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():
A
return spark.readStream.table("bronze")
B
return spark.read.option("readChangeFeed", "true").table("bronze")
C
return (spark.read.table("bronze").filter(col("ingest_time") == current_timestamp()))
D
return spark.read.option("readChangeFeed","true").table("bronze")
E
return (spark.read.table("bronze").filter(col("source_file") == f"/mnt/daily_batch/{year}/{month}/{day}"))
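A sketch of how option A is used downstream. The `spark` session is passed explicitly here for self-containment (on Databricks it is a predefined global and the quiz's signature takes no arguments); the "silver" table name and checkpoint path are illustrative assumptions, not part of the question:

```python
def new_records(spark):
    # Option A: incremental streaming read of the Delta table "bronze".
    # The stream's checkpoint records what has been consumed, so each
    # micro-batch contains only rows appended since the last run.
    return spark.readStream.table("bronze")

def promote_new_records(spark):
    # Hypothetical downstream step: write only the unprocessed records
    # to the next table. trigger(availableNow=True) drains everything
    # currently pending and then stops, matching a nightly batch cadence.
    return (new_records(spark)
            .writeStream
            .option("checkpointLocation", "/mnt/checkpoints/silver")  # illustrative path
            .trigger(availableNow=True)
            .toTable("silver"))
```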