Databricks Certified Generative AI Engineer - Associate

Explanation:

The question asks which Python package extracts text from HTML documents with the fewest lines of code. BeautifulSoup (option D) is designed for parsing HTML and XML, so it is the most direct choice: a single get_text() call returns the text content of a parsed document. By contrast, pytesseract (A) performs OCR on images, numpy (B) is a numerical computing library, and PyPDF2 (C) processes PDF files; none of them parses HTML. The community discussion shows 100% consensus on D, which supports it as the choice that needs the least code.
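To make the "fewest lines of code" point concrete, here is a minimal sketch of HTML text extraction with BeautifulSoup. It assumes the beautifulsoup4 package is installed (imported as bs4) and uses Python's built-in "html.parser" backend. The sample HTML and the helper name extract_text are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for one source document.
html_doc = """
<html>
  <head><title>Release notes</title></head>
  <body>
    <h1>Version 2.0</h1>
    <p>Adds <b>vector search</b> support.</p>
  </body>
</html>
"""


def extract_text(html: str) -> str:
    """Return the text of an HTML string with all tags removed."""
    # Parse with the standard-library backend, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Join text fragments with single spaces and drop surrounding whitespace.
    return soup.get_text(separator=" ", strip=True)


text = extract_text(html_doc)
print(text)
```

The extraction itself is two lines: build the BeautifulSoup object, then call get_text(). In a RAG pipeline the returned string would typically be chunked and embedded next, but those steps are outside what the question asks.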

A Generative AI Engineer is developing a RAG application that retrieves context from source documents in HTML format. They want to implement a solution with the fewest lines of code.

Which Python package should be used to extract the text from the HTML source documents?

Last updated: May 24, 2026 at 14:02

A. pytesseract (13.6%)

B. numpy (2.5%)

C. pypdf2 (7.8%)

D. beautifulsoup (76.1%)
