
Explanation:
The pandas API on Spark (option E) is the best fit for a data scientist whose pandas code no longer scales. It keeps the familiar pandas syntax while running on Apache Spark's distributed engine, so large datasets can be processed without extensive refactoring. The Feature Store stores and serves features and does not address processing scalability. The PySpark DataFrame API, Spark SQL, and the Scala Dataset API do scale, but each would require significant changes to the existing pandas codebase.
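As a rough illustration of the "minimal refactoring" point, here is a minimal sketch. The function `add_features` and the column names `price` and `quantity` are hypothetical, invented for this example. The code runs locally with plain pandas; the commented lines show the swap one would assume on a cluster with Spark 3.2 or later, where `pyspark.pandas` is available.

```python
import pandas as pd
# Assumed swap on a Spark cluster (Spark 3.2+):
#   import pyspark.pandas as ps


def add_features(df):
    """Hypothetical feature engineering step using only pandas-style calls."""
    out = df.copy()
    # Derived columns built from existing columns of the same DataFrame.
    out["total"] = out["price"] * out["quantity"]
    out["is_large_order"] = out["quantity"] > 2
    return out


# Small in-memory example with plain pandas.
pdf = pd.DataFrame({"price": [10.0, 20.0, 5.0], "quantity": [1, 3, 2]})
features = add_features(pdf)
print(features)

# With the pandas API on Spark, the same function body would be reused:
#   psdf = ps.from_pandas(pdf)
#   features = add_features(psdf)
```

The feature logic itself is unchanged in both cases. Only the import and the way the DataFrame is created differ, which is why this option needs less rewriting than moving to the PySpark DataFrame API or Spark SQL.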
A data scientist has developed a feature engineering notebook using the pandas library. As the data volume increases, the notebook's runtime grows significantly, and processing slows in proportion to the data size. Which tool should the data scientist use to scale the notebook to big data with minimal refactoring?
A. Feature Store
B. PySpark DataFrame API
C. Spark SQL
D. Scala Dataset API
E. pandas API on Spark