
Microsoft Fabric Analytics Engineer Associate DP-600
You are working on analyzing customer purchases within a Fabric notebook using PySpark. The analysis involves two primary DataFrames described as follows:

transactions: This DataFrame contains transaction data with 10 million rows and five columns: transaction_id, customer_id, product_id, amount, and date. Each row corresponds to a single transaction.
customers: This DataFrame holds customer details with 1,000 rows and three columns: customer_id, name, and country.

Your task is to join these DataFrames on the customer_id column. It is crucial to minimize data shuffling during this process. You start by writing the following code:

from pyspark.sql import functions as F
results =

What code should you complete to populate the results DataFrame and achieve the goal of minimal data shuffling?
Explanation:
In Apache Spark, a broadcast join is an optimization for joins in which one DataFrame is much smaller than the other. Spark sends a full copy of the smaller DataFrame to every executor, so each partition of the larger DataFrame can be joined locally and the larger DataFrame does not have to be shuffled across the network by join key. This usually reduces the execution time of the join considerably. In this case, customers (1,000 rows) is far smaller than transactions (10 million rows), so marking customers for broadcast with F.broadcast() in the join minimizes data shuffling.