
Answer-first summary for fast verification
Answer: hash-distributed on PurchaseKey
## Analysis of Distribution Options

### Understanding the Query Pattern

The provided query filters on `DateKey` (a 31-day range) and groups by `SupplierKey` and `StockItemKey`. The table will grow to approximately 1.1 billion rows (1 million per day × 3 years), so the distribution strategy is critical for performance.

### Evaluation of Distribution Options

**A: Replicated**

- **Not suitable** for this scenario. Replicated tables work well for small dimension tables (<2GB) but are inefficient for large fact tables. With 1.1 billion rows, keeping a full copy of the table on every compute node would consume excessive storage and memory.

**B: Hash-distributed on PurchaseKey**

- **Optimal choice**. `PurchaseKey` is likely a unique identifier (surrogate key) with high cardinality, so hashing on it spreads rows evenly across all 60 distributions. This prevents data skew and lets every distribution work in parallel. Although `PurchaseKey` is not used in the query, distributing on a high-cardinality column keeps the workload balanced.

**C: Round-robin**

- **Suboptimal**. Round-robin also spreads rows evenly, but it places them without regard to any column, so it cannot be tuned to a query pattern. For large fact tables used in analytical queries, hash distribution is generally preferred.

**D: Hash-distributed on DateKey**

- **Poor choice** despite the query filtering on `DateKey`. All rows for a given date hash to the same distribution, and three years of data contain only about 1,095 distinct dates. A 31-day filter therefore touches at most 31 of the 60 distributions, so a subset of distributions handles the whole query while the rest sit idle. These "hot spots" defeat the purpose of the MPP architecture.
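
For reference, the four options map onto the `DISTRIBUTION` setting of `CREATE TABLE` in a dedicated SQL pool. The following is a minimal sketch of option B. The question gives only the column names, so the `dbo` schema, the column data types, and the clustered columnstore index are assumptions made for illustration.

```sql
-- Sketch only: schema name, data types, and index choice are assumptions;
-- the question specifies nothing beyond the column names.
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      BIGINT         NOT NULL,
    SupplierKey      INT            NOT NULL,
    StockItemKey     INT            NOT NULL,
    DateKey          INT            NOT NULL,  -- integer date such as 20210101
    PurchaseQuantity INT            NOT NULL,
    PurchaseAmount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),  -- option B
    -- The other options, for comparison:
    --   DISTRIBUTION = REPLICATE        (option A)
    --   DISTRIBUTION = ROUND_ROBIN      (option C)
    --   DISTRIBUTION = HASH(DateKey)    (option D)
    CLUSTERED COLUMNSTORE INDEX
);
```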
### Key Distribution Principles Applied

- **High cardinality distribution**: `PurchaseKey` ensures even data spread across all distributions
- **Avoid date column distribution**: Prevents data skew and uneven workload distribution
- **Large fact table optimization**: Hash distribution is preferred over round-robin for analytical workloads
- **MPP efficiency**: Even distribution allows all 60 distributions to participate in query processing

### Performance Impact

Hash distribution on `PurchaseKey` ensures that when the query scans the 31-day date range, the work is evenly distributed across all compute nodes. Each node processes a proportional share of the data, maximizing parallel processing efficiency and minimizing query execution time.
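
Once the table is loaded, one way to check that rows really are spread evenly is to inspect the per-distribution row counts. The sketch below assumes the table lives in the `dbo` schema, which the question does not state.

```sql
-- Returns one row per distribution, including its row count and space used.
-- Roughly equal row counts across the 60 distributions indicate little skew.
-- The schema name dbo is an assumption.
DBCC PDW_SHOWSPACEUSED('dbo.FactPurchase');
```

If a small number of distributions hold far more rows than the others, the distribution column is a poor fit for the data.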
Author: LeetQuiz Editorial Team
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool for a retail store. The table will contain purchase data from suppliers with the following columns:
- PurchaseKey
- SupplierKey
- StockItemKey
- DateKey
- PurchaseQuantity
- PurchaseAmount

The table will have 1 million rows of data added daily and will contain three years of data. Daily Transact-SQL queries similar to the following will be executed:
SELECT
    SupplierKey,
    StockItemKey,
    COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
    AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey
Which table distribution type will minimize query times?

A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on DateKey