
Answer-first summary for fast verification
Answer: C and D
C: (storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0]).withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
D: (storesDF.withColumn("storeValueCategory", split("storeCategory", "_")[0]).withColumn("storeSizeCategory", split("storeCategory", "_")[1]))
The question asks for the `storeCategory` column to be split at the underscore `_` into two new columns. The `split` function from `pyspark.sql.functions` does this: it returns an array column, and because indexing that array with `[]` is zero-based, index 0 is the part before the underscore and index 1 is the part after it.

Option C is correct: it uses `split(col("storeCategory"), "_")[0]` and `[1]` to extract the two parts. Option D is also correct, because `split` accepts either a `Column` or a column name given as a string, so `split("storeCategory", "_")` is equivalent to `split(col("storeCategory"), "_")` and produces the same result.

Option A uses indices 1 and 2. Index 1 holds the size part rather than the value part, and index 2 is past the end of the two-element array, so it yields null or an out-of-bounds error depending on Spark's ANSI settings. Options B and E call `col("storeCategory").split(...)`, but a PySpark `Column` has no `split` method, so both fail. Option E additionally uses the shifted indices 1 and 2.
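The indexing argument can be checked on the sample values without a Spark session. The sketch below is a plain-Python model of the per-row behavior, not PySpark itself: `get_item` and `split_row` are hypothetical helper names, and returning `None` for an out-of-range index is an assumption that mirrors Spark's null result when ANSI mode is off.

```python
import re

# Sample storeCategory values taken from the storesDF sample in the question.
categories = ["VALUE_MEDIUM", "MAINSTREAM_SMALL", "PREMIUM_LARGE"]


def get_item(parts, index):
    """Hypothetical helper: zero-based lookup that returns None when the index
    is past the end (assumed to model Spark's null result in non-ANSI mode)."""
    return parts[index] if index < len(parts) else None


def split_row(value, first, second):
    """Hypothetical helper: split one value at "_" and pick two elements."""
    # Spark's split treats the delimiter as a regular expression; "_" has no
    # special meaning in a regex, so it matches a literal underscore.
    parts = re.split("_", value)
    return get_item(parts, first), get_item(parts, second)


correct = [split_row(v, 0, 1) for v in categories]  # indices used by C and D
shifted = [split_row(v, 1, 2) for v in categories]  # indices used by option A

print(correct)
print(shifted)
```

With indices 0 and 1 every row yields its value part and its size part. With indices 1 and 2 the first result is the size part and the second is always missing, which is why option A does not produce the requested columns.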
Author: LeetQuiz Editorial Team
Which of the following code blocks correctly splits the storeCategory column from DataFrame storesDF at the underscore character, creating two new columns named storeValueCategory and storeSizeCategory?
A sample of DataFrame storesDF is shown below:
storeId  open   openDate    storeCategory
0        true   1100746394  VALUE_MEDIUM
1        true   944572255   MAINSTREAM_SMALL
2        false  925495628   PREMIUM_LARGE
3        true   1397353092  VALUE_MEDIUM
4        true   986505057   VALUE_LARGE
5        true   955988614   PREMIUM_LARGE
A
(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[1]).withColumn("storeSizeCategory", split(col("storeCategory"), "_")[2]))
B
(storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[0]).withColumn("storeSizeCategory", col("storeCategory").split("_")[1]))
C
(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0]).withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
D
(storesDF.withColumn("storeValueCategory", split("storeCategory", "_")[0]).withColumn("storeSizeCategory", split("storeCategory", "_")[1]))
E
(storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[1]).withColumn("storeSizeCategory", col("storeCategory").split("_")[2]))