
Answer-first summary for fast verification
Answer: Flatten the dataframe to one chunk per row, create a unique identifier for each row, and enable change feed on the output Delta table.
The correct answer is B because it covers all three requirements for Databricks Vector Search ingestion. First, the DataFrame must be flattened so that each text chunk occupies its own row, since a Vector Search index embeds one row per chunk. Second, each row needs a unique identifier, which Vector Search uses as the primary key for tracking and updating individual chunks. Most critically, Change Data Feed (CDF) must be enabled on the output Delta table (a point of unanimous consensus in the community discussion), because a Delta Sync Vector Search index relies on CDF to detect and ingest new or updated rows automatically. Option D comes close but omits the change feed requirement. Option A is incorrect: Auto Loader with a UDF that formats chunks as JSON is not the standard preparation path for Vector Search. Option C is insufficient because it neither flattens the array of chunks nor enables change feed; the original filename also cannot uniquely identify individual chunks.
Author: LeetQuiz Editorial Team
A Generative AI Engineer has developed scalable PySpark code to process unstructured PDF documents and split them into chunks for storage in a Databricks Vector Search index. The resulting DataFrame contains two columns: the original filename as a string and an array of text chunks from that document.
What steps must the Generative AI Engineer take to prepare and store these chunks for ingestion into Databricks Vector Search?
A
Use PySpark’s autoloader to apply a UDF across all chunks, formatting them in a JSON structure for Vector Search ingestion.
B
Flatten the dataframe to one chunk per row, create a unique identifier for each row, and enable change feed on the output Delta table.
C
Utilize the original filename as the unique identifier and save the dataframe as is.
D
Create a unique identifier for each document, flatten the dataframe to one chunk per row and save to an output Delta table.