
Answer-first summary for fast verification
Answer:
from pyspark.sql.functions import explode, split
result = df.withColumn('words', explode(split(df.text, ' ')))
The correct approach to tokenize a string column into individual words is to first use the 'split' function to turn the text into an array of words, and then use the 'explode' function to expand that array into one row per word. Option D does this correctly, adding a 'words' column while keeping the rest of the DataFrame. Option A is incorrect because it only splits the text into an array and does not create separate rows for each word. Option B does split before exploding, but it selects only the exploded words, discarding every other column instead of adding a 'words' column to 'df'. Option C is incorrect because it creates a new column holding an array of words but never explodes that array into separate rows.
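To see what 'split' and 'explode' each contribute, the two steps can be mimicked outside Spark with plain Python lists. This is only a sketch of the semantics, not Spark code; the sample row values are made up for illustration:

```python
# Stand-in for a Spark DataFrame with a string column 'text'
# (hypothetical sample data).
rows = [{"text": "hello spark world"}, {"text": "tokenize me"}]

# Step 1 - split: each row's text becomes an array of words,
# analogous to split(df.text, ' ') in PySpark.
split_rows = [{"words": r["text"].split(" ")} for r in rows]

# Step 2 - explode: each element of the array becomes its own row,
# analogous to explode(...) in PySpark.
exploded = [{"word": w} for r in split_rows for w in r["words"]]

print(exploded)
```

After step 1 there are still two rows, each holding an array; after step 2 there are five rows, one per word, which is exactly the transformation option D performs on the DataFrame.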
Author: LeetQuiz Editorial Team
You are given a Spark DataFrame 'df' with a string column 'text'. Write a code snippet that tokenizes the 'text' column into individual words using the 'split' function, and explain the steps involved.
A
from pyspark.sql.functions import split
result = df.select(split('text', ' '))
B
from pyspark.sql.functions import explode, split
result = df.select(explode(split('text', ' ')))
C
from pyspark.sql.functions import split
result = df.withColumn('words', split(df.text, ' '))
D
from pyspark.sql.functions import explode, split
result = df.withColumn('words', explode(split(df.text, ' ')))