
Answer-first summary for fast verification
Answer: `import pyspark.pandas as ps`
The correct answer is **C. `import pyspark.pandas as ps`**.

**Explanation:** The pandas API on Spark, previously known as Koalas, lets users write pandas-like syntax while leveraging the distributed computing capabilities of Apache Spark. This API is integrated into PySpark under the `pyspark.pandas` namespace.

- **Option A:** `import pandas as ps` imports the standard pandas library, which lacks Spark's distributed computing features.
- **Options B, D, and E:** These import statements are invalid in the context of the pandas API on Spark. Modules such as `databricks.pandas`, `pandas.spark`, or `databricks.pyspark` do not exist in standard distributions.

By using `import pyspark.pandas as ps`, the data scientist can refactor their pandas DataFrame code for large-scale data processing, combining the ease of pandas syntax with Spark's scalability.
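One way to check the explanation against a given environment is to ask Python which of the candidate modules it can locate. The sketch below is illustrative only: the `CANDIDATES` list and the `is_importable` helper are names introduced here, not part of the question, and the output depends on what is installed (for example, `pyspark.pandas` is only found where PySpark 3.2 or later is available).

```python
import importlib.util

# Module names taken from the five answer options (A through E).
CANDIDATES = [
    "pandas",
    "databricks.pandas",
    "pyspark.pandas",
    "pandas.spark",
    "databricks.pyspark",
]


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the module in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False


# Hypothetical helper output: maps each candidate to whether it was found.
results = {name: is_importable(name) for name in CANDIDATES}

for name, found in results.items():
    print(f"{name:20} {'found' if found else 'not found'}")
```

Note that `pandas` being found does not make option A correct: it resolves to plain pandas, which runs on a single machine, so only `pyspark.pandas` provides the pandas API on Spark.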
Author: LeetQuiz Editorial Team
A data scientist is transitioning their pandas DataFrame code to utilize the pandas API on Spark. They are working with the following incomplete code snippet:
________BLANK_________
df = ps.read_parquet(path)
df["category"].value_counts()
Which line of code should they use to successfully complete the refactoring with the pandas API on Spark?
A. `import pandas as ps`
B. `import databricks.pandas as ps`
C. `import pyspark.pandas as ps`
D. `import pandas.spark as ps`
E. `import databricks.pyspark as ps`