
A data analyst has provided a data engineering team with the following Spark SQL query:
SELECT district,
avg(sales)
FROM store_sales_20220101
GROUP BY district;
The data analyst would like the data engineering team to run this query every day. The date at the end of the table name (20220101) should automatically be replaced with the current date each time the query is run.
Which of the following approaches could be used by the data engineering team to efficiently automate this process?
A. They could wrap the query using PySpark and use Python's string variable system to automatically update the table name.
B. They could manually replace the date within the table name with the current day's date.
C. They could request that the data analyst rewrites the query to be run less frequently.
D. They could replace the string-formatted date in the table name with a timestamp-formatted date.
E. They could pass the table into PySpark and develop a robustly tested module on the existing query.
Explanation:
Correct Answer: A
Option A is the correct approach because wrapping the query in PySpark allows Python's string formatting (for example, an f-string) to build the table name from the current date at run time. The resulting script can then be scheduled to run daily, fully automating the process with no manual edits.
Why other options are incorrect:
B. Manual replacement: This is inefficient and error-prone, requiring daily manual work, which defeats the purpose of automation.
C. Run less frequently: This doesn't solve the automation problem; it just reduces the frequency of a manual task.
D. Timestamp-formatted date: This doesn't address the automation requirement. Changing the date format doesn't make the table name update automatically.
E. Develop robustly tested module: While testing is important, this option is vague and doesn't specifically address the dynamic table name requirement. Option A is more specific and directly solves the problem.
Best Practice Implementation:
from pyspark.sql import SparkSession
from datetime import datetime
spark = SparkSession.builder.appName("DailySalesQuery").getOrCreate()
# Get current date in YYYYMMDD format
current_date = datetime.now().strftime("%Y%m%d")
# Construct dynamic table name
table_name = f"store_sales_{current_date}"
# Execute query with dynamic table name
query = f"""
SELECT district,
avg(sales)
FROM {table_name}
GROUP BY district
"""
result_df = spark.sql(query)
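As an optional extension, the aggregated result can be persisted so downstream consumers do not need to re-run the query. The sink table name daily_district_sales below is a hypothetical placeholder:
# Hypothetical sink table; adjust the name and write mode to your environment.
result_df.write.mode("overwrite").saveAsTable("daily_district_sales")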
This approach can be scheduled using Databricks Jobs or other orchestration tools to run daily, fully automating the process.
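For illustration, here is one way the daily schedule could be set up with the Databricks Jobs API (2.1). This is a minimal sketch, assuming the script above has been saved as dbfs:/scripts/daily_sales_query.py; the workspace URL, access token, and cluster ID are placeholders to be filled in:
import requests

# Placeholder values -- replace with your workspace details.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "daily-district-sales",
    "schedule": {
        # Quartz cron expression: run every day at 06:00 UTC
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_daily_query",
            # Assumes the PySpark script above was uploaded to this path
            "spark_python_task": {"python_file": "dbfs:/scripts/daily_sales_query.py"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

# Create the scheduled job; the response includes the new job_id.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())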