A data scientist is transitioning their pandas DataFrame code to utilize the pandas API on Spark. They are working with the following incomplete code snippet:
________BLANK_________
df = ps.read_parquet(path)
df["category"].value_counts()
Which line of code should they use to successfully complete the refactoring with the pandas API on Spark?
Explanation:
The correct answer is C: import pyspark.pandas as ps.
The pandas API on Spark, previously known as Koalas, lets users write pandas-like syntax while leveraging the distributed computing capabilities of Apache Spark. Since Apache Spark 3.2, this API has shipped with PySpark under the pyspark.pandas namespace.
import pandas as ps imports the standard pandas library, which lacks Spark's distributed computing features.
The modules databricks.pandas, pandas.spark, and databricks.pyspark do not exist in standard distributions.
By using import pyspark.pandas as ps, the data scientist can refactor their pandas DataFrame code for large-scale data processing, combining the ease of pandas syntax with Spark's scalability.
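The explanation above can be checked with a short sketch. It asks Python's import machinery whether each candidate module can be located, and wraps the completed snippet in a function. The names module_available, category_counts, and CANDIDATES are hypothetical helpers introduced here for illustration, not part of any library. What the check prints depends on which packages are installed locally, and category_counts assumes PySpark 3.2 or later plus a Parquet file with a "category" column.

```python
import importlib.util

# Module names mentioned in the explanation above.
CANDIDATES = [
    "pyspark.pandas",
    "databricks.pandas",
    "pandas.spark",
    "databricks.pyspark",
]


def module_available(name: str) -> bool:
    """Return True if the import system can locate `name`.

    For dotted names, find_spec imports the parent package as a side effect.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # The parent package is missing or is not a package,
        # so the submodule cannot be imported either.
        return False


def category_counts(path: str):
    """Completed snippet from the question (assumes PySpark 3.2+ is installed)."""
    import pyspark.pandas as ps  # the line that fills the blank

    df = ps.read_parquet(path)
    return df["category"].value_counts()


if __name__ == "__main__":
    for name in CANDIDATES:
        status = "found" if module_available(name) else "not found"
        print(f"{name}: {status}")
```

On a machine with PySpark 3.2 or later installed, only pyspark.pandas would be expected to report as found. The import sits inside category_counts so that the availability check still runs where PySpark is absent.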