
Explanation:
The goal is to remove the substring " End" only when it occurs at the end of each string in the storeReview column of a PySpark DataFrame. The regex pattern " End$" does exactly that:
" End" → literal substring
$ → end of string anchor
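The effect of the anchor can be checked outside Spark. The following is a minimal sketch using Python's built-in re module rather than Spark's Java-based regex engine; for a pattern this simple the two should behave the same. The helper name strip_end_suffix and the sample strings are made up for illustration.

```python
import re

# " End" is the literal text; "$" anchors the match to the end of the string.
END_SUFFIX = re.compile(r" End$")


def strip_end_suffix(review: str) -> str:
    """Remove a trailing " End"; occurrences elsewhere in the string are kept."""
    return END_SUFFIX.sub("", review)


print(strip_end_suffix("sem eleifend diam End"))  # sem eleifend diam
print(strip_end_suffix("The End is near End"))    # The End is near
print(strip_end_suffix("The End is near"))        # The End is near
```

Without the $ anchor, the second example would also lose the " End" in the middle of the string.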
We use the regexp_replace function from pyspark.sql.functions:
regexp_replace(str: ColumnOrName, pattern: str, replacement: str) -> Column
First parameter can be either:
a Column object (e.g., col("storeReview")) OR
a string column name (e.g., "storeReview")
Correct Options
✅ Option B
storesDF.withColumn( "storeReview", regexp_replace(col("storeReview"), " End$", "") )
Works: passes a Column object using col().
Correct pattern and replacement.
✅ Option D
storesDF.withColumn( "storeReview", regexp_replace("storeReview", " End$", "") )
Works: passes column name as a string, which is also valid.
Why Others Are Wrong
A ❌
col("storeReview").regexp_replace(" End$", "")
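Fails in PySpark: Column has no .regexp_replace() method. In most PySpark versions the attribute lookup is treated as a nested-field access and returns another Column, so the call typically fails with TypeError: 'Column' object is not callable.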
C ❌ Missing the replacement argument — regexp_replace needs 3 arguments.
E ❌ Uses regexp_extract, which extracts the matched text instead of replacing it; its third argument is a capture-group index, not a replacement string.
Final Answer for PySpark:
B and D ✅
💡 Real-World Tip: When writing PySpark transformations, remember:
Functions like regexp_replace, concat, lower, etc., live in pyspark.sql.functions.
Column expressions do not expose them as methods the way pandas exposes .str.replace() on a Series. The same holds in Scala Spark, where regexp_replace lives in org.apache.spark.sql.functions, so Option A is not valid in either API.
Which of the following code blocks correctly returns a new DataFrame with a modified storeReview column where the suffix "End" has been removed from each string in the storeReview column of DataFrame storesDF?
A sample of DataFrame storesDF is shown below:
storeId storeReview
0 sem eleifend diam End
1 ...vitae odio egesta End
2 ...amet curabitur en End
3 ...tristique loborti End
4 ..condimentum facil End
A
storesDF.withColumn("storeReview", col("storeReview").regexp_replace(" End$", ""))
B
storesDF.withColumn("storeReview", regexp_replace(col("storeReview"), " End$", ""))
C
storesDF.withColumn("storeReview", regexp_replace(col("storeReview"), " End$"))
D
storesDF.withColumn("storeReview", regexp_replace("storeReview", " End$", ""))
E
storesDF.withColumn("storeReview", regexp_extract(col("storeReview"), " End$", ""))