
A data analyst has created a Delta table named sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python rather than SQL for its tests.
Which of the following commands could the data engineering team use to access sales in PySpark?
A
SELECT * FROM sales
B
There is no way to share data between PySpark and SQL.
C
spark.sql("sales")
D
spark.delta.table("sales")
E
spark.table("sales")
Explanation:
In Databricks, Delta tables created in SQL are accessible from PySpark through the SparkSession. The correct way to access a Delta table in PySpark is using spark.table("sales"). This method returns a DataFrame that can be used for data processing and testing in Python.
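As a quick illustration, here is a minimal sketch of that access pattern, assuming an active SparkSession named spark (as provided automatically in Databricks notebooks); the column name amount is hypothetical and stands in for whatever the real sales schema contains:

df = spark.table("sales")  # returns a DataFrame backed by the Delta table

df.printSchema()                      # inspect the schema
df.filter(df["amount"] > 0).show(5)   # "amount" is a hypothetical column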
Let's analyze each option:
A. SELECT * FROM sales - This is SQL syntax, not PySpark. While you could use spark.sql("SELECT * FROM sales"), the option as written is not valid PySpark code.
B. There is no way to share data between PySpark and SQL. - This is incorrect. Databricks is built on Apache Spark, which allows seamless data sharing between SQL and Python/PySpark through the SparkSession and catalog.
C. spark.sql("sales") - This is incorrect syntax. spark.sql() expects a SQL query string, not just a table name. The correct usage would be spark.sql("SELECT * FROM sales").
D. spark.delta.table("sales") - This is not a valid method. While you can read a registered table with spark.read.table("sales"), a spark.delta.table() method doesn't exist in the standard PySpark API.
E. spark.table("sales") - CORRECT. This is the standard way to access a table in PySpark. It returns a DataFrame that can be used for data processing, transformations, and testing in Python.
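For reference, the valid forms touched on above can be sketched side by side; all three return the same DataFrame for a registered table:

df1 = spark.table("sales")              # option E, the idiomatic form
df2 = spark.sql("SELECT * FROM sales")  # SQL string wrapped in spark.sql(), per the corrected options A and C
df3 = spark.read.table("sales")         # DataFrameReader equivalent noted under option D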
Key Points:
spark.table() is the standard PySpark method for accessing registered tables.
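To tie this back to the scenario, the data engineering team could build its Python tests directly on the returned DataFrame. The sketch below is one possible approach and assumes hypothetical columns order_id and amount; the real checks would depend on the actual sales schema:

from pyspark.sql import functions as F

sales = spark.table("sales")

# Test 1: order IDs should be unique
assert sales.count() == sales.select("order_id").distinct().count()

# Test 2: amounts should be present and non-negative
bad_rows = sales.filter(F.col("amount").isNull() | (F.col("amount") < 0)).count()
assert bad_rows == 0, f"{bad_rows} rows failed the amount check"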