
Answer-first summary for fast verification
Answer: Flatten the dataframe to one chunk per row, create a unique identifier for each row, and save to a Delta table
Option B is the most performant approach because it flattens the DataFrame to one row per text chunk and gives each row a unique identifier. This structure suits Vector Search indexing: each chunk can be embedded, searched, and retrieved individually, and the unique identifier gives every chunk a key to look it up by. The flattened format also removes the need to unpack nested arrays during search operations, which reduces computational overhead. Option A adds a train/test split, which is a model-training step and does not improve search performance. Option C keeps the nested array structure, so a document's chunks cannot be indexed or retrieved individually. Option D stores JSON files in a Unity Catalog Volume, which introduces file I/O overhead and is less efficient than Delta table storage for vector search scenarios.
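The flattening step described above can be sketched in plain Python. This is a minimal illustration, not the quiz author's implementation: the function name `flatten_chunks`, the field names (`chunk_id`, `file_name`, `chunk_index`, `chunk_text`), and the hash-based ID scheme are all assumptions made for the example. In Spark, the same reshaping is commonly done with `explode` or `posexplode` before writing the result to a Delta table.

```python
import hashlib


def flatten_chunks(documents):
    """Turn (file_name, [chunk, ...]) pairs into one record per chunk.

    All field names here are hypothetical; pick whatever your table schema uses.
    """
    rows = []
    for file_name, chunks in documents:
        for position, text in enumerate(chunks):
            # Assumed ID scheme: a deterministic hash of file name and chunk
            # position, so re-running the job produces the same identifiers.
            key = f"{file_name}:{position}".encode("utf-8")
            chunk_id = hashlib.sha256(key).hexdigest()
            rows.append(
                {
                    "chunk_id": chunk_id,
                    "file_name": file_name,
                    "chunk_index": position,
                    "chunk_text": text,
                }
            )
    return rows


# Two documents with nested chunk arrays, mirroring the two-column DataFrame.
docs = [
    ("a.pdf", ["intro", "methods"]),
    ("b.pdf", ["summary"]),
]
rows = flatten_chunks(docs)
```

Each resulting record holds exactly one chunk plus its own identifier, which is the shape a row-per-chunk Delta table would have. A deterministic ID is one possible design choice; a generated UUID per row would also satisfy the uniqueness requirement.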
Author: LeetQuiz Editorial Team
What is the most performant way to store a DataFrame in a Vector Search index when the DataFrame has two columns: (1) the original document file name and (2) an array of text chunks for each document?
A. Split the data into train and test set, create a unique identifier for each document, then save to a Delta table
B. Flatten the dataframe to one chunk per row, create a unique identifier for each row, and save to a Delta table
C. First create a unique identifier for each document, then save to a Delta table
D. Store each chunk as an independent JSON file in Unity Catalog Volume. For each JSON file, the key is the document section name and the value is the array of text chunks for that section