
Answer-first summary for fast verification
Answer: pandas API on Spark
The **pandas API on Spark** is the best choice for a data scientist hitting scalability limits with pandas. It keeps the familiar pandas syntax while executing on Apache Spark's distributed engine, so large datasets can be processed efficiently with minimal refactoring. The other options fall short: the **Feature Store** manages feature storage and reuse rather than computation at scale, while the **PySpark DataFrame API**, **Spark SQL**, and **Scala Dataset API** could all scale the workload but would require rewriting the existing pandas code.
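The "minimal refactoring" point can be illustrated with a short sketch. The pandas API on Spark ships as the `pyspark.pandas` module, so existing pandas feature-engineering code typically needs little more than an import change; the column names and values below are illustrative, not from the question.

```python
# Existing pandas feature-engineering code (runs as-is):
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [2, 2, 2]})
df["revenue"] = df["price"] * df["qty"]   # derived feature
mean_revenue = df["revenue"].mean()       # scalar summary of the new feature

# To scale the same logic to big data, only the import changes
# (requires a Spark runtime with pyspark installed):
# import pyspark.pandas as ps
# df = ps.DataFrame({"price": [...], "qty": [...]})
# df["revenue"] = df["price"] * df["qty"] # identical syntax, executed on Spark
```

Because the API mirrors pandas, the rest of the notebook (column arithmetic, `groupby`, `merge`, and so on) is largely unchanged, whereas a port to the PySpark DataFrame API or Spark SQL would mean rewriting each transformation.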
Author: LeetQuiz Editorial Team
A data scientist has developed a feature engineering notebook using the pandas library. However, as the data volume increases, the notebook's runtime escalates significantly, and processing speed decreases proportionally with the data size. Which tool should the data scientist consider to efficiently scale their notebook for big data with minimal refactoring?
A
Feature Store
B
PySpark DataFrame API
C
Spark SQL
D
Scala Dataset API
E
pandas API on Spark