
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and work with the results in PySpark?
A
SELECT * FROM sales
B
spark.delta.table
C
spark.sql
D
There is no way to share data between PySpark and SQL.
E
spark.table
Explanation:
In PySpark, there are multiple ways to execute SQL queries and work with the results:
Correct options:
C. spark.sql() executes a SQL query string and returns the results as a DataFrame, so the team can run the analyst's query as-is and test the output in Python.
E. spark.table() loads a registered table or view as a DataFrame; the team can use it if the analyst's query is saved as a view (spark.read.format("delta").table() works the same way for Delta tables).
Incorrect options:
A. SELECT * FROM sales is a SQL statement, not a PySpark operation.
B. spark.delta.table is not a valid PySpark API.
D. Incorrect — SQL and PySpark share the same SparkSession, so data moves between them freely.
How to implement the solution:
# Method 1: Using spark.sql()
query = "SELECT * FROM sales WHERE amount > 1000"
df = spark.sql(query)
# Now you can run tests on the DataFrame
# Example test: check for null values
null_count = df.filter(df.amount.isNull()).count()
assert null_count == 0, f"Found {null_count} null values in amount column"
# Method 2: Using spark.table() if the query result is saved as a view
spark.sql("CREATE OR REPLACE TEMP VIEW clean_sales AS SELECT * FROM sales WHERE amount > 1000")
df = spark.table("clean_sales")
Both spark.sql() and spark.table() allow the data engineering team to work with SQL query results in PySpark for implementing data quality tests.
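Once the query results are in a DataFrame, the same assertion pattern shown above can be applied to rows collected with df.collect(), which returns Row objects supporting dict-style access. A minimal sketch of that test logic, using plain Python dicts to stand in for collected rows so no Spark session is required (the column names and values here are illustrative assumptions, not from the original question):

```python
# Sample rows as plain dicts, standing in for the Row objects
# that df.collect() would return for the cleaned sales query.
rows = [
    {"order_id": 1, "amount": 1500.0},
    {"order_id": 2, "amount": 2300.0},
]

def check_no_nulls(rows, column):
    """Count rows where `column` is None (should be 0 for clean data)."""
    return sum(1 for r in rows if r[column] is None)

def check_min_value(rows, column, minimum):
    """Return rows whose `column` value falls below `minimum`."""
    return [r for r in rows if r[column] is not None and r[column] < minimum]

# Data-quality assertions, mirroring the tests in the PySpark examples
assert check_no_nulls(rows, "amount") == 0, "found null amounts"
assert check_min_value(rows, "amount", 1000) == [], "found amounts below threshold"
```

In practice the team would express these checks directly on the DataFrame (as in the examples above) so the work stays distributed, reserving collect() for small result sets.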