You have a data pipeline that currently uses Pandas for data manipulation, and you need to scale it to handle larger datasets. How would you approach refactoring the pipeline to use the Pandas API on Spark?
A
Replace all Pandas operations with their equivalent native Spark operations and rewrite the entire pipeline.
B
Use Pandas API on Spark as a drop-in replacement for Pandas, without any changes to the existing code.
C
Refactor the pipeline to use Pandas API on Spark selectively, identifying and replacing only the operations that benefit from distributed computing.
D
Keep using Pandas for data manipulation and parallelize the pipeline using a multi-threading approach.
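The selective approach in option C can be sketched as follows. This is a minimal illustration using a hypothetical `region`/`sales` table: the pipeline step itself is plain Pandas and runs as-is, while the comments mark where `pyspark.pandas` (the Pandas API on Spark) would be swapped in for the one step that benefits from distributed execution.

```python
import pandas as pd
# Selective refactor: only the heavy aggregation step moves to the
# Pandas API on Spark. There, the swap would be:
#   import pyspark.pandas as ps
#   df = ps.read_parquet("large_fact_table")   # instead of pd.read_parquet
# Because the Pandas API on Spark mirrors the Pandas DataFrame API,
# the function below works unchanged on either kind of DataFrame.

def summarize_sales(df):
    # The groupby/aggregation is the operation that benefits from
    # distributed computing on large inputs; small pre/post steps
    # can stay in plain Pandas.
    return df.groupby("region", as_index=False)["sales"].sum()

# Small sample data (stands in for the large dataset) to show the call.
df = pd.DataFrame({"region": ["east", "west", "east"],
                   "sales": [10, 20, 5]})
result = summarize_sales(df)
```

Keeping the transformation logic in API-compatible functions like `summarize_sales` is what makes the selective migration cheap: only the DataFrame construction at the pipeline's edges changes.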