
A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location. Which of the following data entities should the data engineer create?
A. Database
B. Function
Explanation:
Correct Answer: A (Database)
Why Database is the correct choice:
Persistent Storage: A database in Databricks is a logical collection of tables that is registered in a metastore (Unity Catalog or the Hive metastore) and backed by a physical storage location. It persists across sessions and can be accessed by multiple users.
Cross-session Accessibility: Databases are catalog objects that can be accessed by other data engineers in different sessions, as they are registered in the metastore.
Physical Storage: Databases have associated physical storage locations where table data is actually stored (in cloud storage like S3, ADLS, etc.).
Table Aggregation: A database can contain multiple tables, which aligns with the requirement of creating a data entity "from a couple of tables."
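The points above can be sketched in code. This is a minimal illustration, not a definitive setup: the database name `sales_db`, the storage path, and the table and column names are all hypothetical. The helpers only build SQL strings, so the sketch runs without a cluster; in a Databricks notebook the statements would be passed to `spark.sql`. The `LOCATION` clause shown here is the Hive metastore form; Unity Catalog uses `MANAGED LOCATION` instead.

```python
def create_database_sql(name: str, location: str) -> str:
    """Build a CREATE DATABASE statement with an explicit storage location."""
    return f"CREATE DATABASE IF NOT EXISTS {name} LOCATION '{location}'"


def create_table_sql(database: str, table: str, source_query: str) -> str:
    """Build a CREATE TABLE ... AS SELECT statement inside the given database."""
    return f"CREATE TABLE IF NOT EXISTS {database}.{table} AS {source_query}"


# Hypothetical names and path, for illustration only.
db_stmt = create_database_sql("sales_db", "s3://example-bucket/sales_db")
table_stmt = create_table_sql(
    "sales_db",
    "orders_enriched",
    "SELECT o.*, c.region FROM orders o JOIN customers c ON o.customer_id = c.id",
)

print(db_stmt)
print(table_stmt)

# In a Databricks notebook these would be executed with:
#   spark.sql(db_stmt)
#   spark.sql(table_stmt)
```

Because both objects are registered in the metastore, another engineer in a different session could then query `sales_db.orders_enriched` by name.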
Why Function is incorrect:
Not a Data Entity: Functions (UDFs - User Defined Functions) are code objects, not data entities. They are used to transform data, not to store or organize data.
No Physical Storage: Functions don't have physical storage locations for data; they are stored as code definitions in the metastore.
Different Purpose: Functions are used for data processing and transformation, not for organizing and persisting data structures.
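For contrast, here is a hedged sketch of what registering a function looks like, assuming the Databricks SQL UDF syntax (`CREATE FUNCTION ... RETURNS ... RETURN ...`). The function name `to_usd`, its parameters, and the `sales_db` database are hypothetical. Note that the statement carries only a code definition and has no storage location for data.

```python
def create_function_sql(database: str, name: str, params: str,
                        returns: str, body: str) -> str:
    """Build a SQL UDF definition; it stores logic, not data."""
    return (
        f"CREATE OR REPLACE FUNCTION {database}.{name}({params}) "
        f"RETURNS {returns} RETURN {body}"
    )


# Hypothetical function, for illustration only.
fn_stmt = create_function_sql(
    "sales_db", "to_usd", "amount DOUBLE, rate DOUBLE", "DOUBLE", "amount * rate"
)

print(fn_stmt)

# In a Databricks notebook this would be executed with:
#   spark.sql(fn_stmt)
```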
Additional Context: In Databricks, databases (also called schemas) are the primary way to organize tables logically.
The data engineer should create a database to organize the tables, which will then be accessible to other engineers across different sessions and will have physical storage backing.