
Answer-first summary for fast verification
Answer: In the tables use a hash distribution of ArrivalAirportID and AirportID.
## Analysis of Distribution Strategy for Query Performance Optimization ### Context We have two fact tables (Flight and Weather) that will be frequently joined on airport-related columns. In Azure Synapse Analytics (formerly SQL Data Warehouse), the distribution strategy is critical for query performance, especially for join operations. ### Optimal Solution: Hash Distribution on Join Columns **Option B is the correct recommendation** because: - **Hash distribution on join columns (ArrivalAirportID and AirportID)** ensures that rows with matching join keys are co-located on the same compute node - This enables **local joins** without data movement between nodes, which is the most efficient join strategy in distributed systems - When both tables are distributed on the same join column, the query engine can perform the join locally on each distribution, eliminating the need for data shuffling - Airport IDs are good candidates for hash distribution as they typically have sufficient cardinality to avoid data skew ### Why Other Options Are Less Suitable **Option A (Hash distribution on datetime columns):** - DateTime columns often have poor distribution characteristics due to natural data skew (more flights during certain hours/days) - Distributing on datetime columns would not optimize the join performance since the actual join condition is on airport IDs - This would require data shuffling during join operations, significantly impacting performance **Option C (IDENTITY columns):** - IDENTITY columns are sequential and typically result in highly skewed data distribution - Sequential distribution leads to hot spots where some nodes handle disproportionate amounts of data - This approach doesn't optimize join performance and may actually degrade it due to data movement requirements **Option D (Composite columns):** - Creating composite columns adds complexity without clear performance benefits - Composite distribution keys are rarely optimal unless specifically required by the query patterns - This approach doesn't directly address the join optimization requirement and may introduce unnecessary overhead ### Best Practice Justification In Azure Synapse Analytics, the fundamental principle for optimizing join performance is to distribute both tables on the same join column using hash distribution. This strategy minimizes data movement across the distributed architecture, which is the primary performance bottleneck in large-scale data processing systems.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
You have two fact tables named Flight and Weather. Queries will join these tables on the following columns:
Flight Table | Weather Table
------------------|-----------------
DepartureDate | Date
DepartureAirport | AirportCode
Flight Table | Weather Table
------------------|-----------------
DepartureDate | Date
DepartureAirport | AirportCode
You need to recommend a solution that maximizes query performance. What should you include in the recommendation?

A
In the tables use a hash distribution of ArrivalDateTime and ReportDateTime.
B
In the tables use a hash distribution of ArrivalAirportID and AirportID.
C
In each table, create an IDENTITY column.
D
In each table, create a column as a composite of the other two columns in the table.
No comments yet.