
A data analyst has provided a data engineering team with the following Spark SQL query:
SELECT district,
avg(sales)
FROM store_sales_20220101
GROUP BY district;
The data analyst would like the data engineering team to run this query every day. The date at the end of the table name (20220101) should automatically be replaced with the current date each time the query is run.
Which of the following approaches could be used by the data engineering team to efficiently automate this process?
A. They could wrap the query using PySpark and use Python's string variable system to automatically update the table name.
B. They could manually replace the date within the table name with the current day's date.
C. They could request that the data analyst rewrites the query to be run less frequently.
D. They could replace the string-formatted date in the table name with a timestamp-formatted date.
E. They could pass the table into PySpark and develop a robustly tested module on the existing query.
Explanation:
Correct Answer: A
Option A is the correct approach because wrapping the query in PySpark allows Python's string formatting (for example, an f-string) to build the table name from the current date at run time. The resulting script can then be scheduled to run daily, fully automating the process with no manual edits.
Why other options are incorrect:
B. Manual replacement: This is inefficient and error-prone, requiring daily manual work, which defeats the purpose of automation.
C. Run less frequently: This doesn't solve the automation problem; it just reduces the frequency of a manual task.
D. Timestamp-formatted date: This doesn't address the automation requirement. Changing the date format doesn't make the table name update automatically.
E. Develop robustly tested module: While testing is important, this option is vague and doesn't specifically address the dynamic table name requirement. Option A is more specific and directly solves the problem.
Best Practice Implementation:
from pyspark.sql import SparkSession
from datetime import datetime
spark = SparkSession.builder.appName("DailySalesQuery").getOrCreate()
# Get current date in YYYYMMDD format
current_date = datetime.now().strftime("%Y%m%d")
# Construct dynamic table name
table_name = f"store_sales_{current_date}"
# Execute query with dynamic table name
query = f"""
SELECT district,
avg(sales)
FROM {table_name}
GROUP BY district
"""
result_df = spark.sql(query)
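As an optional extension, the aggregated result can be persisted so downstream consumers do not need to re-run the query. The sink table name daily_district_sales below is a hypothetical placeholder:
# Hypothetical sink table; adjust the name and write mode to your environment.
result_df.write.mode("overwrite").saveAsTable("daily_district_sales")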
This approach can be scheduled using Databricks Jobs or other orchestration tools to run daily, fully automating the process.
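For illustration, here is one way the daily schedule could be set up with the Databricks Jobs API (2.1). This is a minimal sketch, assuming the script above has been saved as dbfs:/scripts/daily_sales_query.py; the workspace URL, access token, and cluster ID are placeholders to be filled in:
import requests

# Placeholder values -- replace with your workspace details.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "daily-district-sales",
    "schedule": {
        # Quartz cron expression: run every day at 06:00 UTC
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_daily_query",
            # Assumes the PySpark script above was uploaded to this path
            "spark_python_task": {"python_file": "dbfs:/scripts/daily_sales_query.py"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

# Create the scheduled job; the response includes the new job_id.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())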