
Answer-first summary for fast verification
Answer: Yes, No, Yes
1. Yes — the code selects specific columns ('SalesOrderNumber', 'OrderDate', 'CustomerName', 'UnitPrice') from the DataFrame using the select method.
2. No — removing the partitionBy line would change performance. Partitioning adds write-time overhead, because Spark must organize the rows into separate folders and files based on the values of the partitioning column; removing it eliminates that overhead (at the cost of losing partition pruning on later reads).
3. Yes — adding inferSchema=True adds execution time. Schema inference forces Spark to read the file twice: once to infer the column types and once to load the data.
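The extra pass from inferSchema can be avoided by supplying an explicit schema. The sketch below assumes illustrative column types (only the four column names come from the question; a real Sales_raw.csv would need all 12 columns declared):

```python
# Assumed DDL schema — column types are illustrative, and a complete
# schema would declare all 12 columns of Sales_raw.csv.
SALES_SCHEMA = ("SalesOrderNumber STRING, OrderDate DATE, "
                "CustomerName STRING, UnitPrice DECIMAL(10,2)")

def read_sales_csv(spark, path='/path/to/Sales_raw.csv'):
    # With an explicit schema, Spark makes a single pass over the CSV
    # instead of the two passes that inferSchema=True requires.
    return spark.read.schema(SALES_SCHEMA).csv(path, header=True)
```

DataFrameReader.schema accepts either a StructType or a DDL-formatted string like the one above.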
Author: LeetQuiz Editorial Team
In the context of a Fabric workspace using the default Spark starter pool and runtime version 1.2, you aim to read a CSV file named Sales_raw.csv located in a lakehouse. You intend to select specific columns and save the filtered data as a Delta table in the managed area of the lakehouse. The CSV file Sales_raw.csv comprises 12 columns. You have the following code snippet:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.read.csv('/path/to/Sales_raw.csv', header=True)
selected_df = df.select('SalesOrderNumber', 'OrderDate', 'CustomerName', 'UnitPrice')
selected_df.write.format('delta').mode('overwrite').partitionBy('OrderDate').save('/path/to/output')
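Note that the snippet writes to an explicit path, which creates an unmanaged (external) table. To land the Delta table in the managed area of the lakehouse, as the question intends, saveAsTable would be used instead. A minimal sketch, assuming a table name of "Sales":

```python
def save_as_managed_table(selected_df, table_name="Sales"):
    # saveAsTable registers the table in the metastore and stores its
    # files in the lakehouse's managed area, unlike save('/path/to/output'),
    # which writes an unmanaged table at an explicit path.
    (selected_df.write
        .format('delta')
        .mode('overwrite')
        .partitionBy('OrderDate')
        .saveAsTable(table_name))
```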
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
A: Yes
B: No
C: Yes
D: No