You are analyzing customer purchases within a Fabric notebook using PySpark. The analysis involves two primary DataFrames:

transactions: contains transaction data with 10 million rows and five columns: transaction_id, customer_id, product_id, amount, and date. Each row corresponds to a single transaction.

customers: holds customer details with 1,000 rows and three columns: customer_id, name, and country.

Your task is to join these DataFrames on the customer_id column while minimizing data shuffling. You start by writing the following code:
from pyspark.sql import functions as F
results =
What code should you complete to populate the results DataFrame while minimizing data shuffling?