
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL. Which of the following operations could the data engineering team use to run the query and work with the results in PySpark?
A
SELECT * FROM sales
B
spark.delta.table
C
spark.sql
D
There is no way to share data between PySpark and SQL.
E
spark.table
Explanation:
Correct Answer: C (spark.sql)
The spark.sql() method is the primary way to execute SQL queries in PySpark and work with the results as DataFrames. This allows the data engineering team to run the analyst's query unchanged and apply their Python-based tests to the resulting DataFrame.
Why other options are incorrect or less suitable:
A (SELECT * FROM sales): This is just a SQL query string, not a PySpark operation. It needs to be wrapped in spark.sql() to execute.
B (spark.delta.table): This is not a valid PySpark method. Delta tables are loaded with spark.read.format("delta").load() for path-based tables, or with spark.table() / spark.read.table() for registered tables (see the sketch after this list).
D (There is no way to share data between PySpark and SQL): This is incorrect. PySpark and SQL are fully integrated in Databricks; you can execute SQL queries from PySpark and vice versa.
E (spark.table): While spark.table() can load a registered table as a DataFrame, it cannot execute arbitrary SQL. It only loads whole tables, so it cannot express the joins, aggregations, or other transformations the analyst's query might contain.
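For contrast, here is a minimal PySpark sketch of the loading APIs mentioned above versus spark.sql(); the Delta path and column names are illustrative placeholders, not part of the original question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.table() loads a registered table as a DataFrame but accepts no SQL.
sales_df = spark.table("sales")

# A path-based Delta table is loaded through the DataFrameReader instead
# (the path below is a placeholder).
sales_from_path_df = spark.read.format("delta").load("/mnt/data/sales")

# Arbitrary SQL (filters, joins, aggregations) requires spark.sql()
# (region and amount are illustrative column names).
sales_by_region_df = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")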
Key Points:
spark.sql("your SQL query here") returns a DataFrame that can be used for further processing in PySpark.
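A short sketch of how the data engineering team might use spark.sql() to run the analyst's query and test the result in Python; the specific checks and the customer_id column are assumptions for illustration:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Execute the analyst's SQL query as-is and get the result back as a DataFrame.
result_df = spark.sql("SELECT * FROM sales")

# The DataFrame can now be validated with ordinary Python code.
assert result_df.count() > 0, "Query returned no rows"

# Example data-quality check: no nulls in a key column
# (customer_id is an illustrative column name, not from the question).
null_rows = result_df.filter(F.col("customer_id").isNull()).count()
assert null_rows == 0, f"Found {null_rows} rows with a null customer_id"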