Databricks Certified Associate Developer for Apache Spark


Which of the following code blocks correctly splits the storeCategory column from DataFrame storesDF at the underscore character, creating two new columns named storeValueCategory and storeSizeCategory?

A sample of DataFrame storesDF is shown below:

storeId  open    openDate    storeCategory
0        true    1100746394  VALUE_MEDIUM
1        true    944572255   MAINSTREAM_SMALL
2        false   925495628   PREMIUM_LARGE
3        true    1397353092  VALUE_MEDIUM
4        true    986505057   VALUE_LARGE
5        true    955988614   PREMIUM_LARGE





Explanation:

The question requires splitting the storeCategory column into two new columns using the underscore (_) as the delimiter. The correct approach uses the split function from pyspark.sql.functions, which returns an array column: element 0 holds the part before the underscore and element 1 the part after it. Option C correctly uses split(col("storeCategory"), "_")[0] and split(col("storeCategory"), "_")[1] to extract the two parts.

Option D is also correct: split("storeCategory", "_") is equivalent to split(col("storeCategory"), "_"), because when split receives a string it interprets it as a column name, so option D produces the same result as C.

Options A and E use indices 1 and 2. Since Spark array indexing is zero-based and an out-of-range index yields null rather than raising an error, these would place the second part of the category in storeValueCategory and null in storeSizeCategory. Option B calls col.split(), which is not a valid method on a PySpark Column.