
Databricks Certified Associate Developer for Apache Spark
Which of the following code blocks correctly splits the storeCategory column from DataFrame storesDF at the underscore character, creating two new columns named storeValueCategory and storeSizeCategory?

A sample of DataFrame storesDF is shown below:

storeId  open   openDate    storeCategory
0        true   1100746394  VALUE_MEDIUM
1        true   944572255   MAINSTREAM_SMALL
2        false  925495628   PREMIUM_LARGE
3        true   1397353092  VALUE_MEDIUM
4        true   986505057   VALUE_LARGE
5        true   955988614   PREMIUM_LARGE
Explanation:

The question requires splitting the storeCategory column into two new columns using the underscore (_) as the delimiter. The correct approach uses the split function from PySpark's pyspark.sql.functions module, which returns an array column. Because Spark arrays are zero-indexed, element [0] is the part before the underscore and element [1] is the part after it. Option C correctly uses split(col('storeCategory'), '_')[0] and split(col('storeCategory'), '_')[1] to extract the two parts.

Option D is also correct: split('storeCategory', '_') is equivalent to split(col('storeCategory'), '_'). In PySpark, passing a string as the first argument to split treats it as a column name, so the syntax in D produces the same result as C.

Options A and E incorrectly use indices 1 and 2; index 2 is out of bounds for the two-element array, so Spark returns null for that element rather than the expected value. Option B uses col.split(), which is invalid — Column objects have no split method; splitting is done with the split function, not a method on the column.