
You are working on analyzing customer purchases within a Fabric notebook using PySpark. The analysis involves two primary DataFrames described as follows:
transactions: This DataFrame contains transaction data with 10 million rows and five columns: transaction_id, customer_id, product_id, amount, and date. Each row corresponds to a single transaction.

customers: This DataFrame holds customer details with 1,000 rows and three columns: customer_id, name, and country.

Your task is to join these DataFrames on the customer_id column. It is crucial to minimize data shuffling during this process. You start by writing the following code:
from pyspark.sql import functions as F
results =
Which code should you use to populate the results DataFrame and achieve the goal of minimal data shuffling?
A
transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id)
B
transactions.join(customers, transactions.customer_id == customers.customer_id).distinct()
C
transactions.join(customers, transactions.customer_id == customers.customer_id)
D
transactions.crossJoin(customers).where(transactions.customer_id == customers.customer_id)