
Answer-first summary for fast verification
Answer: C (spark.sql)
## Explanation

**Correct Answer: C (`spark.sql`)**

The `spark.sql()` method is the primary way to execute SQL queries in PySpark and work with the results as DataFrames. This allows the data engineering team to:

1. Run the SQL query developed by the data analyst
2. Get the results as a PySpark DataFrame
3. Perform data quality tests and validations using Python/PySpark

**Why the other options are incorrect or less suitable:**

- **A (`SELECT * FROM sales`)**: This is just a SQL query string, not a PySpark operation. It needs to be wrapped in `spark.sql()` to execute.
- **B (`spark.delta.table`)**: This is not a valid PySpark method. The correct ways to load a registered Delta table as a DataFrame are `spark.read.table()` or `spark.table()`.
- **D (There is no way to share data between PySpark and SQL)**: This is incorrect. PySpark and SQL are fully integrated in Databricks: you can execute SQL queries from PySpark and vice versa.
- **E (`spark.table`)**: While `spark.table()` can load a table as a DataFrame, it cannot execute arbitrary SQL queries. It only works with registered tables and cannot run complex SQL involving joins, aggregations, or other transformations.

**Key Points:**

- `spark.sql("your SQL query here")` returns a DataFrame that can be used for further processing in PySpark
- This enables seamless collaboration between SQL analysts and Python/PySpark data engineers
- The resulting DataFrame supports all PySpark operations for data validation and testing
Author: Keng Suppaseth
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL. Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
A. `SELECT * FROM sales`
B. `spark.delta.table`
C. `spark.sql`
D. There is no way to share data between PySpark and SQL.
E. `spark.table`