
Answer-first summary for fast verification
Answer: The record count is derived from the `numRecords` statistics stored within the Delta transaction log.
Delta Lake maintains a transaction log (the `_delta_log` directory) whose JSON commit files contain an `add` action (an `AddFile` entry) for every data file in the table. Each entry carries per-file statistics: column-level min/max values and null counts, plus the total `numRecords` for that file. When a `SELECT COUNT(*)` query is executed, Databricks SQL uses these pre-computed statistics and sums the `numRecords` values directly from the transaction log. This lets the engine return the total row count without opening or scanning any data files, so the query stays fast even on very large tables.

**Why other options are incorrect:**

* **Scanning data files:** This is unnecessary and would be far too slow for large tables.
* **Parquet metadata footers:** Parquet files do store row counts in their footers, but reading every footer still requires one I/O operation per file. Delta Lake avoids this by consolidating those statistics into the log.
* **Hive Metastore:** The metastore tracks table location and schema, but it does not maintain the accurate, versioned row counts required for Delta Lake tables.
* **Result Cache:** Caching can speed up identical subsequent queries, but the underlying mechanism for computing the count on a live or updated table is the transaction log metadata.
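The log-based count described above can be illustrated with a minimal Python sketch. This is not how Databricks SQL is implemented. It only replays the JSON commit files of a `_delta_log` directory and sums `numRecords` over the files that are still live. It assumes the table has no Parquet checkpoints and that every `add` action carries statistics. The function name `count_rows_from_log` is hypothetical.

```python
import json
from pathlib import Path


def count_rows_from_log(log_dir):
    """Sum numRecords over the data files that are live after replaying the JSON commits.

    Simplifying assumptions: checkpoints are ignored, and every `add`
    action is expected to carry a `stats` string with `numRecords`.
    """
    live = {}  # data file path -> numRecords (None when stats are missing)
    commits = sorted(
        p for p in Path(log_dir).glob("*.json") if p.stem.isdigit()
    )  # commit names are zero-padded, so lexical order is version order
    for commit in commits:
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                # `stats` is itself a JSON-encoded string inside the add action
                stats = json.loads(add["stats"]) if add.get("stats") else {}
                live[add["path"]] = stats.get("numRecords")
            elif "remove" in action:
                # a removed file no longer contributes to the count
                live.pop(action["remove"]["path"], None)
    if any(n is None for n in live.values()):
        raise ValueError("some files lack numRecords; a metadata-only count is not possible")
    return sum(live.values())
```

Calling `count_rows_from_log("/path/to/table/_delta_log")` never opens a Parquet data file. The sketch raises an error when statistics are missing, because in that case the count cannot be answered from metadata alone.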
Author: LeetQuiz Editorial Team
A Databricks SQL dashboard monitors the total record count of a Delta Lake table using the query `SELECT COUNT(*) FROM table_name`. How are the results efficiently generated when the dashboard is refreshed?
A. The record count is derived from the `numRecords` statistics stored within the Delta transaction log.
B. The row count is computed by performing a full data scan of all underlying Parquet files.
C. The record count is calculated by reading the metadata footers of every Parquet file in the table directory.
D. The record count is determined by querying statistics maintained in the Hive Metastore.
E. The results are exclusively returned from the Databricks SQL result cache, regardless of underlying data changes.