
Answer-first summary for fast verification
Answer: Implement **Liquid Clustering** on the columns `(event_date, event_type, country_code)` and disable all manual partitioning and Z-Ordering, relying on automatic data layout management and file compaction.
**A: Incorrect (suboptimal for this scenario)** Increasing the target file size to ~1 GB via compaction reduces small-file problems and metadata overhead, and is sound table hygiene. However, without clustering (Z-Order or Liquid), file skipping remains limited to partition pruning (if partitioned) or basic min/max statistics. Ad-hoc filters on combinations of medium- and high-cardinality columns (especially user_id ranges) will still scan large amounts of data, so the gains fall well short of the "significant" improvement the workload requires.

**B: Incorrect (high risk of over-partitioning and maintenance burden)** Partitioning by `event_date` enables partition pruning for time filters, and Z-Ordering on the three columns can co-locate data for better skipping on mixed predicates. However, ~3 years of daily data means roughly 1,100 partitions; streaming writes into that many partitions produce many small files, defeating the purpose unless aggressive auto-compaction plus frequent OPTIMIZE runs are scheduled. Z-Ordering on three columns including the high-cardinality user_id also dilutes effectiveness (Z-Order works best on 1–4 columns of low-to-medium cardinality). The maintenance cost and risk of skewed, small partitions make this less attractive on modern runtimes.

**C: Correct (best modern approach for this scenario)** **Liquid Clustering** (introduced in DBR 13.3 LTS, generally available in later runtimes, and recommended for new Delta tables) automatically clusters data on up to four clustering keys (`event_date, event_type, country_code`) using a multi-dimensional layout that improves on the classic Z-Order curve. It provides strong data skipping for ad-hoc combinations of these columns without manual partitioning, avoiding over-partitioning pitfalls.

It compacts files automatically in the background, reduces the maintenance burden (no frequent manual OPTIMIZE needed for layout), preserves all Delta features (ACID, time travel, schema evolution, Unity Catalog security), and scales well to TB-scale tables with mixed query patterns. In a cost-sensitive environment running ad-hoc workloads on serverless SQL warehouses, this delivers the best balance of performance uplift and low operational cost.

**D: Incorrect (incomplete, with risky trade-offs)** Auto-Optimize and Optimized Writes are excellent hygiene and should be enabled, but partitioning only by the low-cardinality `country_code` misses pruning for the dominant `event_date` filter. Bloom filter indexes help equality lookups on high-cardinality columns (e.g., `user_id = ...`), but not the range or multi-column filters common in ad-hoc analysis. Avoiding `OPTIMIZE` entirely risks persistent small-file buildup from streaming ingestion, degrading performance over time. This is partial optimization at best.
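A minimal DDL sketch of the recommended approach, using the scenario's column names (the table name `clickstream_events` is illustrative, and the `ALTER TABLE ... CLUSTER BY` path for existing tables depends on runtime version support):

```sql
-- New table: declare clustering keys instead of PARTITIONED BY / ZORDER BY
CREATE TABLE clickstream_events (
  user_id      BIGINT,
  session_id   STRING,
  event_type   STRING,
  country_code STRING,
  event_date   DATE
)
CLUSTER BY (event_date, event_type, country_code);

-- Existing (unpartitioned) table: switch it to Liquid Clustering
ALTER TABLE clickstream_events
  CLUSTER BY (event_date, event_type, country_code);

-- Incremental clustering and compaction; note there is no ZORDER BY clause
OPTIMIZE clickstream_events;
```

Note that `CLUSTER BY` replaces both `PARTITIONED BY` and `ZORDER BY`; the two are mutually exclusive with Liquid Clustering.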
Author: LeetQuiz Editorial Team
You are optimizing a large Delta Lake table (~8 TB) that supports frequent ad-hoc exploratory queries from analysts in a cost-sensitive environment. The table contains clickstream data with high-cardinality filtering columns (user_id, session_id), medium-cardinality columns (event_type, country_code), and a commonly used time-based column (event_date — ~3 years of daily data). Queries often filter on combinations of event_date + event_type or event_date + country_code, with occasional filters on user_id ranges or session patterns, but almost never equality filters on high-cardinality IDs alone.
The table currently has thousands of small files per partition due to streaming ingestion, and analysts frequently complain about slow response times on Databricks SQL warehouses (serverless, Photon-enabled). You must significantly improve query performance and reduce DBU consumption while preserving full ACID compliance, time travel, schema evolution, and governance controls (including row-level security via Unity Catalog).
Which of the following approaches represents the most effective single action (or primary strategy) under these constraints?
A
Run OPTIMIZE regularly without Z-Ordering, targeting file sizes of ~1 GB via spark.databricks.delta.optimize.maxFileSize, accepting longer OPTIMIZE runtimes to minimize small-file problems and reduce metadata overhead.
B
Partition the table by event_date (using PARTITIONED BY (event_date)) and implement Z-Ordering on (event_type, country_code, user_id) in the same OPTIMIZE ... ZORDER BY command, then schedule frequent OPTIMIZE jobs.
C
Implement Liquid Clustering on the columns (event_date, event_type, country_code) and disable all manual partitioning and Z-Ordering, relying on automatic data layout management and file compaction.
D
Enable Auto-Optimize + Optimized Writes, partition by country_code (low cardinality), add bloom filter indexes on user_id and session_id, and avoid running OPTIMIZE to minimize compute costs during maintenance.
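For concreteness, the four options correspond roughly to the following commands (a sketch only: the table name `clickstream_events` is illustrative, and option B's `PARTITIONED BY` clause is fixed at table creation so only its OPTIMIZE step is shown):

```sql
-- A: compaction only, with a ~1 GB target file size
SET spark.databricks.delta.optimize.maxFileSize = 1073741824;
OPTIMIZE clickstream_events;

-- B: date-partitioned table plus multi-column Z-Ordering
OPTIMIZE clickstream_events ZORDER BY (event_type, country_code, user_id);

-- C: Liquid Clustering on the three common filter columns
ALTER TABLE clickstream_events
  CLUSTER BY (event_date, event_type, country_code);

-- D: write-time tuning plus bloom filter indexes, no OPTIMIZE
ALTER TABLE clickstream_events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);
CREATE BLOOMFILTER INDEX ON TABLE clickstream_events
  FOR COLUMNS (user_id, session_id);
```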