
Answer-first summary for fast verification
Answer: SELECT SUBSTRING_INDEX(product_description, ' ', 1) as product_name FROM dataset
The correct answer is **C: `SELECT SUBSTRING_INDEX(product_description, ' ', 1) as product_name FROM dataset`**.

### Why C is the best option

- **SUBSTRING_INDEX(str, delim, count)** is a built-in Spark SQL function (supported in Azure Databricks / Spark) designed for exactly this use case.
- With `count = 1` and `delim = ' '`, it returns everything **before the first occurrence** of the space, which is precisely the first word.
- It is **scalar** (it operates directly on the string without creating an intermediate collection), making it lightweight and performant on large datasets.
- No extra array handling or indexing is required.

**Example behavior**:
- `'Wireless Headphones with Noise Cancellation'` → `'Wireless'`
- `'Laptop'` (no space) → `'Laptop'` (if the delimiter never occurs, the whole string is returned)
- Caution: a *leading* space makes the first "word" empty (`' Laptop'` → `''`), which is why the recommended query below wraps the column in `TRIM`.

### Why not the other options?

- **A: `SELECT SPLIT(product_description, ' ')[0] as product_name FROM dataset`**
  This also works: `SPLIT` returns an array, and `[0]` grabs the first element. It is a common and valid pattern in Spark SQL. However, it materializes a full array for every row even though only the first element is needed, which makes it slightly less efficient than `SUBSTRING_INDEX` on very large datasets. For most cases the difference is negligible, but the question emphasizes "**efficiency and optimal performance**" on a **large dataset**, so C edges it out.
- **B: `SELECT product_description FROM dataset WHERE product_description LIKE '%product_name%'`**
  Completely incorrect.
  - It doesn't extract anything; it just filters rows.
  - The literal string `'product_name'` in the LIKE clause has no relation to the actual product name.
  - It would return the full description (or nothing), not the extracted first word.
- **D: `SELECT product_description as product_name FROM dataset LIMIT 1`**
  Wrong on multiple levels.
  - It aliases the entire description column (no extraction).
  - `LIMIT 1` returns only one arbitrary row instead of processing the whole dataset.
  - Useless for transforming every row.

### Performance & Best Practices (for the exam)

- For simple "first word" extraction on large data, prefer **SUBSTRING_INDEX** when available (it stops at the first delimiter).
- For more advanced parsing (multiple spaces, punctuation, etc.), consider `regexp_extract(product_description, '^(\\S+)', 1)` or a combination with `TRIM`.
- Both SPLIT and SUBSTRING_INDEX benefit from Spark's Catalyst optimizer and whole-stage code generation, but SUBSTRING_INDEX avoids the array-creation overhead.
- Always test with realistic data volumes in Databricks (use the Spark UI to compare query plans and shuffle/CPU metrics if performance is critical).

**Recommended final query (clean & robust)**:

```sql
SELECT SUBSTRING_INDEX(TRIM(product_description), ' ', 1) AS product_name
FROM dataset;
```

This trims any leading/trailing whitespace first, so the first word is never empty.
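To make the edge cases concrete without spinning up a cluster, here is a minimal Python sketch that mirrors the semantics of Spark SQL's `SUBSTRING_INDEX(str, delim, count)` (for positive counts) and `SPLIT(str, delim)[0]`. This is an illustrative emulation, not Spark's actual implementation, and the helper names are invented for this sketch:

```python
def substring_index(s: str, delim: str, count: int = 1) -> str:
    """Mirror Spark SQL SUBSTRING_INDEX for positive counts:
    everything before the count-th occurrence of delim, or the
    whole string if delim occurs fewer than count times."""
    parts = s.split(delim)
    if len(parts) <= count:
        return s  # delimiter absent (or too few occurrences): return input unchanged
    return delim.join(parts[:count])

def split_first(s: str, delim: str) -> str:
    """Mirror SPLIT(s, delim)[0]: build the full array, take element 0."""
    return s.split(delim)[0]

examples = [
    "Wireless Headphones with Noise Cancellation",  # normal case -> 'Wireless'
    "Laptop",                                       # no delimiter -> whole string
    " Laptop",                                      # leading space -> empty first 'word'!
]
for desc in examples:
    print(repr(substring_index(desc, " ")), repr(split_first(desc, " ")))
```

Note how the third example yields an empty string under both approaches, which is the behavior that `TRIM(product_description)` in the recommended query guards against.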
Author: LeetQuiz Editorial Team
In a scenario where you are working with a large dataset in Azure Databricks that contains a 'product_description' column with text descriptions of products, you are tasked with extracting the product name from each description. The product name is always the first word in the description. Considering the need for efficiency and accuracy in processing large datasets, which of the following Spark SQL queries would you use to achieve this task? Choose the best option that correctly extracts the product name while ensuring optimal performance.
A
SELECT SPLIT(product_description, ' ')[0] as product_name FROM dataset
B
SELECT product_description FROM dataset WHERE product_description LIKE '%product_name%'
C
SELECT SUBSTRING_INDEX(product_description, ' ', 1) as product_name FROM dataset
D
SELECT product_description as product_name FROM dataset LIMIT 1