
Answer-first summary for fast verification
Answer: hash-distributed on PurchaseKey
## Analysis of Distribution Options

For a large fact table in Azure Synapse Analytics with over 1 billion rows (1 million daily × 3 years), the optimal distribution strategy must consider data volume, query patterns, and performance characteristics.

### **Option B: Hash-distributed on PurchaseKey (Recommended)**

**Why this is optimal:**

- **High Cardinality**: PurchaseKey is likely the primary key with unique values, ensuring even data distribution across all 60 distributions without data skew
- **Efficient Data Movement**: Hash distribution on a high-cardinality column minimizes data shuffling during query execution
- **Scalability**: With over 1 billion rows, hash distribution provides the best performance for large fact tables
- **Best Practice**: Microsoft recommends hash distribution for large fact tables, using columns with many unique values and minimal NULLs

**Performance Benefits:**

- Even distribution prevents compute node bottlenecks
- Optimizes parallel processing for aggregation operations
- Better overall system utilization compared to other methods

### **Why Other Options Are Less Suitable:**

**Option A: Replicated**

- **Not scalable** for large fact tables (1+ billion rows)
- **High storage overhead** - data copied to every compute node
- **Performance degradation** during data loading and updates
- **Memory constraints** on compute nodes

**Option C: Round-robin**

- **Random distribution** doesn't optimize for query patterns
- **Potential data movement** during GROUP BY operations
- **Less predictable performance** compared to hash distribution
- **Not ideal** for aggregation-heavy workloads

**Option D: Hash-distributed on IsOrderFinalized**

- **Low cardinality** - only 2 possible values (Yes/No)
- **Severe data skew** - would use only 2 out of 60 distributions
- **Poor resource utilization** - most compute nodes remain idle
- **Performance bottleneck** - defeats the purpose of distributed processing

### **Query Pattern Considerations:**

While the query doesn't directly use PurchaseKey in GROUP BY or WHERE clauses, hash distribution on a high-cardinality column like PurchaseKey provides the foundation for efficient parallel processing. The system can still leverage the even data distribution to optimize the GROUP BY operations on SupplierKey, StockItemKey, and IsOrderFinalized.

**Conclusion:** Hash distribution on PurchaseKey provides the most balanced and scalable approach for this large fact table scenario, aligning with Azure Synapse Analytics best practices for data warehousing workloads.
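As a rough illustration of the recommended option, the table could be declared along the following lines. This is a minimal sketch, not the definitive schema: the question names only the columns, so the `dbo` schema, the column data types, the `NOT NULL` constraints, and the clustered columnstore index are all assumptions made for the example.

```sql
-- Sketch only: schema name, data types, and nullability are assumed,
-- since the question lists column names without definitions.
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      BIGINT NOT NULL,  -- assumed unique per purchase row
    SupplierKey      INT    NOT NULL,
    StockItemKey     INT    NOT NULL,
    DateKey          INT    NOT NULL,  -- yyyymmdd integer, as in the sample query
    IsOrderFinalized BIT    NOT NULL   -- two values, so a poor distribution key
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),  -- option B: high-cardinality hash key
    CLUSTERED COLUMNSTORE INDEX        -- assumed; a common choice for large fact tables
);

-- After loading data, report rows and space per distribution to check for skew.
DBCC PDW_SHOWSPACEUSED('dbo.FactPurchase');
```

If the row counts reported per distribution are roughly equal, the hash key is spreading data evenly. Assuming 365-day years, three years of data is about 1.095 billion rows, or roughly 18 million rows in each of the 60 distributions.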
Author: LeetQuiz Editorial Team
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool for a retail store. The table will contain purchase data from suppliers with the following columns:
- PurchaseKey
- SupplierKey
- StockItemKey
- DateKey
- IsOrderFinalized

The table will have 1 million rows added daily and will store three years of data. Daily queries will be executed that are similar to the following:
    SELECT
        SupplierKey,
        StockItemKey,
        IsOrderFinalized,
        COUNT(*)
    FROM FactPurchase
    WHERE DateKey >= 20210101
      AND DateKey <= 20210131
    GROUP BY
        SupplierKey,
        StockItemKey,
        IsOrderFinalized
Which table distribution type will minimize query times?

A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on IsOrderFinalized