Analysis of Partition Strategy for Azure Synapse Analytics
Understanding the Architecture
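Azure Synapse Analytics dedicated SQL pool uses a Massively Parallel Processing (MPP) architecture in which every table is spread across a fixed set of 60 distributions, regardless of the service level. When data is hash-distributed on ProductID, rows with the same ProductID always land in the same distribution; with 20,000 distinct products and no heavily skewed values, the rows should spread roughly evenly across the 60 distributions.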
Key Performance Considerations
For optimal clustered columnstore index (CCI) performance in Azure Synapse Analytics:
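- Each columnstore rowgroup holds up to 1,048,576 rows, so each partition in each distribution should contain at least about 1 million rows for efficient compression and query performance
- Each of the 60 distributions stores and compresses its rows independently, so the row count that matters is per partition per distribution, not per partition across the whole table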
- Too many partitions can lead to small row groups that reduce compression efficiency
- Too few partitions can limit partition elimination benefits
Calculation Breakdown
- Total Records: 2.4 billion rows
- Distributions: 60 (fixed in dedicated SQL pool)
- Rows per Distribution: 2.4 billion ÷ 60 = 40 million rows per distribution
- Target Rows per Partition per Distribution: 1 million rows (for efficient columnstore compression)
- Maximum Number of Partitions: 40 million ÷ 1 million = 40 partitions for the table
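The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, assuming rows spread evenly across distributions and partitions; the function and constant names are hypothetical and not part of any Azure SDK.

```python
# Illustrative sketch of the partition-count calculation (names are hypothetical).
TOTAL_ROWS = 2_400_000_000   # total records in the table
DISTRIBUTIONS = 60           # fixed number of distributions in a dedicated SQL pool
TARGET_ROWS = 1_000_000      # target rows per partition per distribution


def max_partitions(total_rows: int,
                   distributions: int = DISTRIBUTIONS,
                   target_rows: int = TARGET_ROWS) -> int:
    """Largest partition count that keeps every partition at or above
    target_rows within each distribution, assuming an even spread."""
    rows_per_distribution = total_rows // distributions
    return rows_per_distribution // target_rows


print(max_partitions(TOTAL_ROWS))  # 40
```

Real tables are rarely perfectly even, so a result like this is best read as an upper bound rather than an exact target.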
Why 40 Partitions Is Optimal
- With 40 partitions, each partition within each distribution contains approximately 1 million rows, assuming the data is spread evenly
- This aligns with Microsoft's guidance of having at least 1 million rows per partition and distribution for clustered columnstore tables
- Provides good balance between partition elimination benefits and compression efficiency
- Avoids the overhead of too many small partitions while maintaining query performance
Analysis of Other Options
- B: 240 partitions - Each partition would hold 10 million rows across the entire table, but only ~167,000 rows per distribution (10 million ÷ 60), which is well below the 1 million target
- C: 400 partitions - Results in even smaller partitions (~100,000 rows per distribution), further reducing compression efficiency
- D: 2,400 partitions - Creates extremely small partitions (~16,667 rows per distribution), which would severely impact columnstore compression and query performance
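The per-option figures can be checked with a short Python sketch. It assumes an even spread of rows and takes the candidate partition counts (40, 240, 400, 2,400) from the analysis above; the helper name is hypothetical.

```python
# Illustrative check of rows per partition per distribution for each candidate.
TOTAL_ROWS = 2_400_000_000
DISTRIBUTIONS = 60
TARGET_ROWS = 1_000_000


def rows_per_partition_per_distribution(total_rows: int,
                                        partitions: int,
                                        distributions: int = DISTRIBUTIONS) -> float:
    """Every partition is split across all distributions, so the table is
    divided into partitions * distributions physical pieces."""
    return total_rows / (partitions * distributions)


for partitions in (40, 240, 400, 2400):
    rows = rows_per_partition_per_distribution(TOTAL_ROWS, partitions)
    verdict = "meets target" if rows >= TARGET_ROWS else "below target"
    print(f"{partitions:>5} partitions: {rows:>12,.0f} rows per piece ({verdict})")
```

Only the 40-partition case reaches the 1 million row target; each of the larger counts falls short by a factor of at least six.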
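The key insight is that every partition is itself split across all 60 distributions, so a table with N partitions is physically divided into 60 × N pieces. The number of partitions should therefore be chosen so that each partition within each distribution still contains approximately 1 million rows, which for 2.4 billion rows means no more than 40 partitions.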