
Ultimate access to all questions.
You are in the process of developing a real-time data warehouse for your company using Google BigQuery streaming inserts. Given the nature of the data stream, there isn't a guarantee that the data will be sent only once. However, each row of data does include a unique ID and an event timestamp. To ensure that duplicates are excluded during interactive query sessions, which type of query should you implement?
A
Include ORDER BY DESK on timestamp column and LIMIT to 1.
B
Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
C
Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
D
Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1._
Explanation:
The correct answer is D. The ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1 will provide a mechanism to ensure data deduplication. This method assigns a unique number to each row within a partition based on the unique ID. By selecting only rows with a row number of 1, you ensure that duplicates are excluded, returning only the first instance of each unique ID. Options A, B, and C fail to effectively remove duplicates based on the unique ID and event timestamp combination._