
Consider a scenario where you need to transform large volumes of data using Apache Spark in Azure Databricks. The data includes customer purchase histories and needs to be aggregated by region and month. Which of the following approaches would be most efficient for this task, considering the need for parallel processing and scalability?
A. Use a for-loop to iterate through the data and aggregate it sequentially.
B. Leverage Spark's DataFrame API to perform groupBy operations on region and month, followed by aggregation functions like sum and count.
C. Export the data to a CSV file and use a local Python script to perform the aggregation.
D. Use a single SQL query to perform the aggregation directly on the source database.
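For reference, here is a minimal PySpark sketch of the approach described in option B. The column names (`region`, `purchase_date`, `amount`), the input path, and the helper DataFrame name are all hypothetical, chosen only to illustrate the groupBy-and-aggregate pattern:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession is usually provided as `spark`;
# building one here keeps the sketch self-contained.
spark = SparkSession.builder.appName("purchase-aggregation").getOrCreate()

# Hypothetical source: purchase records with columns
# `region`, `purchase_date` (timestamp), and `amount`.
purchases = spark.read.parquet("/mnt/data/purchases")

monthly_by_region = (
    purchases
    # Truncate each purchase timestamp to the first day of its month.
    .withColumn("month", F.date_trunc("month", F.col("purchase_date")))
    # Group by region and month, then aggregate in parallel across executors.
    .groupBy("region", "month")
    .agg(
        F.sum("amount").alias("total_sales"),
        F.count("*").alias("purchase_count"),
    )
)

monthly_by_region.show()
```

Because `groupBy` triggers a distributed shuffle, Spark partitions the aggregation work across the cluster's executors, which is what gives option B its parallelism and scalability advantage over the sequential loop (A), the single-machine script (C), or pushing the full workload onto the source database (D).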