
Answer-first summary for fast verification
Answer: REPLICATE
In Azure Synapse Analytics dedicated SQL pools, the **REPLICATE** distribution type is optimal for date dimension tables that are used by all fact tables to minimize data movement during queries. Here's why:

- **REPLICATE Distribution**: This creates a full copy of the table on each compute node. Since date dimension tables are typically small (containing dates, holidays, fiscal periods, etc.), the storage overhead is minimal. When joined with fact tables distributed using HASH or ROUND_ROBIN, the replicated dimension is locally available on every node, eliminating the need for data movement (shuffling) during joins. This significantly improves query performance.
- **Why Not HASH**: HASH distribution spreads data across nodes based on a distribution key. For a date dimension, this would require aligning the distribution key with fact tables (e.g., using `DateKey`). However, fact tables may use different distribution keys (e.g., `ProductKey` or `CustomerKey`), leading to data movement during joins if the keys don't match. This defeats the goal of minimizing movement.
- **Why Not ROUND_ROBIN**: ROUND_ROBIN distributes rows evenly but randomly across nodes. This would cause data movement in almost every join scenario, as there's no logical alignment with fact table distribution, resulting in poor performance for dimension-table joins.

Best practices for Azure Synapse Analytics recommend using REPLICATE for small dimension tables (typically under 2 GB) to leverage local joins and avoid shuffling. Since date dimensions are compact and universally used, replication ensures efficient query execution across all fact tables.
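The comparison above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how the Synapse engine actually plans or executes joins: the node count (`NUM_NODES`), the modulo-based placement, the sample keys, and all function names are hypothetical, and the fact table is assumed to be hash-distributed on `ProductKey`. It counts how many fact rows would find their matching date row on a different node under each distribution choice for the date dimension.

```python
NUM_NODES = 4  # hypothetical node count for this toy model


def place_hash(date_keys):
    """HASH: each date row lives on exactly one node, chosen from its key."""
    return {k: {k % NUM_NODES} for k in date_keys}


def place_round_robin(date_keys):
    """ROUND_ROBIN: rows are dealt out in turn, ignoring the key value."""
    return {k: {i % NUM_NODES} for i, k in enumerate(date_keys)}


def place_replicate(date_keys):
    """REPLICATE: every node holds a full copy of the table."""
    all_nodes = set(range(NUM_NODES))
    return {k: all_nodes for k in date_keys}


def non_local_joins(fact_rows, dim_placement):
    """Count fact rows whose matching date row is not on the same node."""
    misses = 0
    for date_key, product_key in fact_rows:
        # Assumption: the fact table is hash-distributed on ProductKey.
        fact_node = product_key % NUM_NODES
        if fact_node not in dim_placement[date_key]:
            misses += 1
    return misses


# Hypothetical sample data: 30 date keys, 8 products per date.
date_keys = list(range(20240101, 20240131))
fact_rows = [(d, p) for d in date_keys for p in range(8)]

results = {
    "HASH": non_local_joins(fact_rows, place_hash(date_keys)),
    "ROUND_ROBIN": non_local_joins(fact_rows, place_round_robin(date_keys)),
    "REPLICATE": non_local_joins(fact_rows, place_replicate(date_keys)),
}

for name, misses in results.items():
    print(f"{name}: {misses} of {len(fact_rows)} fact rows need a remote date row")
```

In this model only the replicated placement leaves every join local, whichever key the fact table is distributed on. In a dedicated SQL pool, the corresponding choice is made with the `DISTRIBUTION = REPLICATE` option in the `WITH` clause of `CREATE TABLE`.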
Author: LeetQuiz Editorial Team
You are designing a date dimension table in an Azure Synapse Analytics dedicated SQL pool that will be used by all fact tables. Which distribution type should you use to minimize data movement during queries?
A. HASH
B. REPLICATE
C. ROUND_ROBIN