
Answer-first summary for fast verification
**Answer:** B. Hash distributed table with clustered Columnstore index.
## Analysis of Distribution Strategy for a Large Fact Table

### Key Requirements
- **Table size**: 10 TB (very large fact table)
- **Query pattern**: queries primarily use the `SaleKey` column to retrieve data
- **Performance goal**: optimize performance across distributed nodes

### Why Hash Distribution with a Clustered Columnstore Index (Option B) Is Optimal

#### 1. Hash Distribution Benefits
- **Data skew prevention**: hashing on `SaleKey` spreads rows evenly across all compute nodes, preventing the data skew that creates performance bottlenecks.
- **Query performance**: because queries filter on `SaleKey`, an equality predicate on the distribution column can be resolved within a single distribution, and joins or aggregations on `SaleKey` require no cross-node data movement.
- **Large-table guidance**: hash distribution is specifically recommended for large fact tables (typically larger than 2 GB) in star schema designs, which clearly applies to this 10 TB scenario.

#### 2. Clustered Columnstore Index Benefits
- **Compression efficiency**: columnstore indexes provide exceptional compression (often around 10x), which is crucial for managing 10 TB of data efficiently.
- **Analytical query performance**: the columnstore architecture is optimized for analytical workloads, with high scan throughput and batch-mode processing.
- **Petabyte-scale readiness**: clustered columnstore is the recommended storage format for large-scale data warehousing scenarios.

#### 3. Combined Advantages
- Hash distribution on `SaleKey` ensures queries targeting specific sale keys are processed locally on the relevant node.
- Clustered columnstore provides the storage efficiency and scan performance needed for large-scale analytical queries.

### Why the Other Options Are Less Suitable

**Option A (hash distributed table with clustered index)**:
- A rowstore clustered index suits OLTP-style point lookups, not analytical queries over large datasets.
- It lacks the compression and batch-mode processing benefits of columnstore for 10 TB of data.

**Option C (round robin distributed table with clustered index)**:
- Round-robin distribution does not leverage the query pattern on `SaleKey`, leading to unnecessary data movement.
- It has the same rowstore clustered index limitations as Option A.

**Option D (round robin distributed table with clustered Columnstore index)**:
- Columnstore is the right storage format, but round-robin distribution is inefficient when queries filter on a specific column (`SaleKey`).
- Most queries would require data shuffling across nodes.

**Option E (heap table with distribution replicate)**:
- Replicating 10 TB of data to every compute node is impractical because of per-node storage limits.
- Replicated tables work best for small dimension tables, not large fact tables.

### Best-Practice Alignment
This solution follows Azure Synapse Analytics best practices:
- Use hash distribution when you have a clear distribution key that aligns with query patterns.
- Use a clustered columnstore index on large fact tables to maximize compression and query performance.
- Avoid data movement by distributing on commonly filtered columns.

Combining hash distribution on the frequently queried `SaleKey` column with the analytical efficiency of a clustered columnstore index provides the optimal performance architecture for this 10 TB fact table scenario.
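The recommended design can be expressed directly in T-SQL. The sketch below is illustrative: the table name `FactSales` and the non-key columns are assumptions, not part of the original question; only the `SaleKey` distribution column comes from the scenario.

```sql
-- Illustrative sketch for a Synapse dedicated SQL pool.
-- Table name (FactSales) and non-key columns are assumed for the example.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    SaleAmount  DECIMAL(18, 2) NOT NULL,
    SaleDate    DATE           NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleKey),  -- rows with the same SaleKey land on the same distribution
    CLUSTERED COLUMNSTORE INDEX    -- compressed, scan-optimized storage for the 10 TB fact table
);

-- After loading, check that the hash key distributes rows evenly (no skew):
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");
```

`DBCC PDW_SHOWSPACEUSED` reports row counts and space per distribution, so a heavily skewed `SaleKey` would show up as one distribution holding far more rows than the others.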
Author: LeetQuiz Editorial Team
You have a large 10 TB fact table in Azure Synapse Analytics. Queries primarily use the SaleKey column to retrieve data, as shown in the provided table.
To distribute this large fact table across multiple nodes for optimal performance, which distribution technology should you use?

A. hash distributed table with clustered index
B. hash distributed table with clustered Columnstore index
C. round robin distributed table with clustered index
D. round robin distributed table with clustered Columnstore index
E. heap table with distribution replicate