Analysis of Data Skew Detection in Azure Synapse Analytics
Understanding the Scenario
- You have a dedicated SQL pool named Pool1 containing a fact table named Table1
- You need to identify data skew - the uneven distribution of data across distributions in a distributed table
- The question specifically asks what to run in Synapse Studio
Evaluation of Options
✅ Option A: Connect to Pool1 and run DBCC PDW_SHOWSPACEUSED
- Optimal Choice: This is the correct approach for dedicated SQL pools
- DBCC PDW_SHOWSPACEUSED specifically shows space usage across all 60 distributions in Azure Synapse Analytics
- It displays row counts and space usage per distribution, allowing you to identify uneven data distribution
- Must be executed against the dedicated pool (Pool1) where the table resides
- This command provides the most direct view of data distribution across distributions
❌ Option B: Connect to the built-in pool and run DBCC PDW_SHOWSPACEUSED
- Incorrect: The built-in pool refers to the serverless SQL pool, not the dedicated pool
- DBCC PDW_SHOWSPACEUSED is not available in serverless pools
- Cannot access tables in dedicated pools from serverless pools
❌ Option C: Connect to the built-in pool and run DBCC CHECKALLOC
- Incorrect: DBCC CHECKALLOC is for checking page allocation consistency, not data skew
- This command verifies page allocation structures, not data distribution across nodes
- Again, built-in pool cannot access dedicated pool tables
❌ Option D: Connect to the built-in pool and query sys.dm_pdw_sys_info
- Incorrect: sys.dm_pdw_sys_info provides system-level information about the SQL pool, not table-specific data distribution
- This DMV shows appliance-level counters and system configuration, not data skew within specific tables
- Built-in pool cannot query dedicated pool system views
Key Technical Considerations
- Dedicated vs Built-in Pools: Dedicated SQL pools (formerly SQL DW) use MPP architecture with data distributed across 60 distributions. Built-in pools are serverless and cannot access dedicated pool objects.
- Data Skew Detection: The primary method for identifying data skew in dedicated pools is analyzing distribution-level statistics, which DBCC PDW_SHOWSPACEUSED provides.
- Best Practice: For comprehensive analysis, you might later use sys.dm_pdw_nodes_db_partition_stats, but DBCC PDW_SHOWSPACEUSED is the quickest way to get an initial view of data distribution.
Conclusion
Option A is the only technically correct approach that connects to the appropriate pool and uses the proper command for identifying data skew in dedicated SQL pool tables.