Microsoft Azure Data Engineer Associate - DP-203

Explanation:

Analysis of Distribution Options

Understanding the Query Pattern

The query filters on DateKey (a 31-day range) and groups by SupplierKey and StockItemKey. The table will grow to approximately 1.1 billion rows (1 million rows per day × 365 days × 3 years ≈ 1.095 billion), making the distribution strategy critical for performance.
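As a quick check on the sizing, the arithmetic can be sketched in T-SQL. The figures come from the scenario (1 million rows per day, three years, 60 distributions); using 365-day years is an assumption made here for simplicity.

```sql
-- Estimated table size and the per-distribution share if rows spread evenly.
-- Assumes 365-day years; CAST to BIGINT keeps the arithmetic safely in range.
SELECT
    CAST(1000000 AS BIGINT) * 365 * 3      AS EstimatedRows,           -- 1,095,000,000
    CAST(1000000 AS BIGINT) * 365 * 3 / 60 AS RowsPerDistributionEven; -- 18,250,000
```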

Evaluation of Distribution Options

A: Replicated

Not suitable for this scenario. Replicated tables work well for small dimension tables (under 2 GB compressed) but are inefficient for large fact tables. A replicated table keeps a full copy on every compute node, so with roughly 1.1 billion rows replication would consume excessive storage and memory.

B: Hash-distributed on PurchaseKey

Optimal choice. PurchaseKey is likely a unique identifier (surrogate key) with high cardinality, ensuring even data distribution across all 60 distributions. This prevents data skew and allows parallel processing. While PurchaseKey isn't directly used in the query, hash distribution on a high-cardinality column ensures balanced workload across all compute nodes.
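A minimal sketch of what this option could look like as a table definition in a dedicated SQL pool. The `dbo` schema, the column data types, and the clustered columnstore index are assumptions for illustration; the scenario only gives the column names.

```sql
-- Hypothetical DDL: data types and schema are assumed, not given in the question.
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      BIGINT         NOT NULL,
    SupplierKey      INT            NOT NULL,
    StockItemKey     INT            NOT NULL,
    DateKey          INT            NOT NULL,  -- yyyymmdd integer, as in the sample query
    PurchaseQuantity INT            NOT NULL,
    PurchaseAmount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),  -- rows are assigned to one of 60 distributions by hash
    CLUSTERED COLUMNSTORE INDEX        -- a common storage choice for large fact tables
);
```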

C: Round-robin

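Suboptimal. Round-robin spreads rows evenly but assigns them to distributions arbitrarily, so it does nothing to optimize for a specific query pattern and queries generally need more data movement. It is usually recommended for staging tables; for a large fact table serving analytical queries, hash distribution typically performs better.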

D: Hash-distributed on DateKey

Poor choice despite the query filtering on DateKey. Hash distribution places every row with the same DateKey in the same distribution, so each day's 1 million rows land in a single distribution. A query that filters on a 31-day range can therefore find its data in at most 31 of the 60 distributions, leaving the rest idle. Date columns also have relatively low cardinality (only ~1,095 unique values for 3 years). This creates "hot spots" where a subset of distributions handles all of the workload, defeating the purpose of the MPP architecture.
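Skew can be measured rather than assumed. One way to inspect it after loading data is the statement below, which reports row counts and space used per distribution; the table name `dbo.FactPurchase` assumes the default `dbo` schema.

```sql
-- Returns one result row per distribution (60 in total) with its row count
-- and space used. Widely differing row counts indicate data skew.
DBCC PDW_SHOWSPACEUSED('dbo.FactPurchase');
```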

Key Distribution Principles Applied

High cardinality distribution: PurchaseKey ensures even data spread across all distributions
Avoid date column distribution: Prevents data skew and uneven workload distribution
Large fact table optimization: Hash distribution is preferred over round-robin for analytical workloads
MPP efficiency: Even distribution allows all 60 distributions to participate in query processing

Performance Impact

Hash distribution on PurchaseKey ensures that when the query scans the 31-day date range, the work is evenly distributed across all compute nodes. Each node processes a proportional share of the data, maximizing parallel processing efficiency and minimizing query execution time.
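To see how the pool would actually execute the query, the distributed plan can be inspected with `EXPLAIN`. This is a sketch using the query from the scenario. Because the table is distributed on PurchaseKey rather than on the GROUP BY columns, the plan will likely include a data movement step for the aggregation.

```sql
-- Returns the distributed query plan as XML without running the query.
EXPLAIN
SELECT
    SupplierKey,
    StockItemKey,
    COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
    AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey;
```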

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool for a retail store. The table will contain purchase data from suppliers with the following columns:

PurchaseKey
SupplierKey
StockItemKey
DateKey
PurchaseQuantity
PurchaseAmount

The table will have 1 million rows of data added daily and will contain three years of data. Daily Transact-SQL queries similar to the following will be executed:

SELECT
    SupplierKey,
    StockItemKey,
    COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
    AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey

Which table distribution type will minimize query times?

Last updated: July 15, 2026 at 14:06

replicated

hash-distributed on PurchaseKey

round-robin