
You are working on analyzing customer purchases within a Fabric notebook using PySpark. The analysis involves two primary DataFrames described as follows:
transactions: This DataFrame contains transaction data with 10 million rows and five columns: transaction_id, customer_id, product_id, amount, and date. Each row corresponds to a single transaction.

customers: This DataFrame holds customer details with 1,000 rows and three columns: customer_id, name, and country.

Your task is to join these DataFrames on the customer_id column. It is crucial to minimize data shuffling during this process. You start by writing the following code:
from pyspark.sql import functions as F
results =
What code should you use to complete the assignment to the results DataFrame while minimizing data shuffling?