
Explanation:
Why Option B is correct:
Hierarchical chunking is specifically designed to preserve semantic context across related paragraphs and sections of documents. This approach creates parent-child relationships between chunks: small child chunks are matched during retrieval, and their larger parent chunks are returned to supply the surrounding context. Two aspects of the scenario make it the right fit:
Scientific papers have complex structure: The problem mentions queries spanning methodology, results, and discussion sections. Hierarchical chunking maintains the relationships between these sections by preserving the document hierarchy.
Scale requirements: With 25 million scientific papers, hierarchical chunking can efficiently handle large corpora while maintaining semantic relationships across the entire dataset.
Why other options are incorrect:
Option A (Fixed-size chunking): Arbitrary token-based splitting can break semantic units, especially at section boundaries. A 300-token chunk might cut through the middle of a methodology section, losing context.
Option C (Semantic chunking): While semantic chunking considers content meaning, it doesn't explicitly preserve hierarchical relationships between sections. The buffer size of 1 and 85% threshold may not adequately handle the complex structure of scientific papers.
Option D (No chunking, manual splitting): Not scalable for 25 million documents. Manual splitting is impractical and doesn't guarantee preservation of semantic context across related paragraphs.
Key takeaway: Hierarchical chunking is ideal for documents with complex structures (like scientific papers) where preserving relationships between sections (methodology→results→discussion) is crucial for accurate RAG performance.
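As a rough sketch, Option B's settings map to the chunking configuration accepted by the Bedrock CreateDataSource API. The field names below follow the bedrock-agent API's hierarchical chunking configuration, but treat them as an assumption and verify against the current AWS documentation before use:

```python
# Sketch of the chunking portion of a Bedrock knowledge base data source
# configuration for Option B (field names assumed from the bedrock-agent API).
def hierarchical_chunking_config(parent_tokens=1000, child_tokens=200,
                                 overlap_tokens=50):
    """Build a hierarchical chunking configuration dict."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # First level = parent chunks, second level = child chunks.
                "levelConfigurations": [
                    {"maxTokens": parent_tokens},
                    {"maxTokens": child_tokens},
                ],
                "overlapTokens": overlap_tokens,
            },
        }
    }

config = hierarchical_chunking_config()
print(config["chunkingConfiguration"]["chunkingStrategy"])  # HIERARCHICAL
```

In practice this dict would be nested under `vectorIngestionConfiguration` when calling `create_data_source` on the boto3 `bedrock-agent` client; the surrounding call is omitted here since it requires AWS credentials and an existing knowledge base.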
A pharmaceutical company is developing a Retrieval Augmented Generation (RAG) application that uses an Amazon Bedrock knowledge base. The knowledge base uses Amazon OpenSearch Service as a data source for more than 25 million scientific papers. Users report that the application produces inconsistent answers that cite irrelevant sections of papers when queries span methodology, results, and discussion sections of the papers.
The company needs to improve the knowledge base to preserve semantic context across related paragraphs on the scale of the entire corpus of data.
Which solution will meet these requirements?
A
Configure the knowledge base to use fixed-size chunking. Set a 300-token maximum chunk size and a 10% overlap between chunks. Use an appropriate Amazon Bedrock embedding model.
B
Configure the knowledge base to use hierarchical chunking. Use parent chunks that contain 1,000 tokens and child chunks that contain 200 tokens. Set a 50-token overlap between chunks.
C
Configure the knowledge base to use semantic chunking. Use a buffer size of 1 and a breakpoint percentile threshold of 85% to determine chunk boundaries based on content meaning.
D
Configure the knowledge base not to use chunking. Manually split each document into separate files before ingestion. Apply post-processing reranking during retrieval.