
Answer-first summary for fast verification
Answer: Leverage Spark's DataFrame API to perform groupBy operations on region and month, followed by aggregation functions like sum and count.
The most efficient approach is Spark's DataFrame API, which distributes work across the cluster for parallel processing and scalability. A groupBy on region and month, followed by aggregation functions such as sum and count, executes as a distributed shuffle-and-aggregate, so it handles large datasets well. The alternatives do not scale: a sequential for-loop, a local script over an exported CSV, and a single query against the source database all funnel the entire workload through one machine or one system.
Author: LeetQuiz Editorial Team
Consider a scenario where you need to transform large volumes of data using Apache Spark in Azure Databricks. The data includes customer purchase histories and needs to be aggregated by region and month. Which of the following approaches would be most efficient for this task, considering the need for parallel processing and scalability?
A
Use a for-loop to iterate through the data and aggregate it sequentially.
B
Leverage Spark's DataFrame API to perform groupBy operations on region and month, followed by aggregation functions like sum and count.
C
Export the data to a CSV file and use a local Python script to perform the aggregation.
D
Use a single SQL query to perform the aggregation directly on the source database.