
Consider a scenario where you have a data pipeline that performs various data manipulation tasks using Pandas. You are now required to refactor the pipeline to use Pandas API on Spark to leverage distributed computing. What are the key steps you would follow in the refactoring process?
A
Identify the Pandas operations in the pipeline, replace them with their equivalent native Spark operations, and rewrite the entire pipeline.
B
Use Pandas API on Spark as a drop-in replacement for Pandas, without any changes to the existing code.
C
Identify the Pandas operations in the pipeline, replace them with their equivalent Pandas API on Spark operations, and test the refactored pipeline for correctness and performance.
D
Keep using Pandas for data manipulation and parallelize the pipeline using a multi-threading approach.
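The process in option C — identify the Pandas operations, swap them for their Pandas API on Spark equivalents, then verify correctness and performance — can be sketched as follows. This is a minimal illustration with a hypothetical groupby-and-sum pipeline; the `build_pipeline` helper and the sample data are invented for the example. Because the Pandas API on Spark (`pyspark.pandas`, available in PySpark 3.2+) mirrors the Pandas API, the same pipeline body can often run against either library, which also makes it easy to compare outputs during testing.

```python
import pandas as pd
# After refactoring, the import would change to the Pandas API on Spark:
# import pyspark.pandas as ps   # requires pyspark >= 3.2 and a Spark session

def build_pipeline(frame_lib):
    """Run the same pipeline logic with either pandas or pyspark.pandas.

    `frame_lib` is the module to use (pd or ps) — a hypothetical
    parameterization used here to show that the call sites are identical.
    """
    df = frame_lib.DataFrame({
        "region": ["east", "west", "east", "west"],
        "sales": [100, 200, 150, 250],
    })
    # Step 1: identify the Pandas operations (here, a groupby/sum).
    # Step 2: with pyspark.pandas the identical groupby/sum call works,
    #         but executes as distributed Spark jobs under the hood.
    result = df.groupby("region")["sales"].sum()
    # Step 3: test the refactored pipeline for correctness (compare the
    #         two libraries' outputs) and benchmark performance.
    return result.sort_index()

print(build_pipeline(pd).to_dict())  # {'east': 250, 'west': 450}
```

Note that not every Pandas operation has a Spark equivalent with identical semantics (e.g. operations that depend on row order), which is why the explicit correctness-and-performance testing step in option C matters rather than treating the API as a blind drop-in replacement (option B).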