Google Professional Data Engineer

Get started today

Ultimate access to all questions.

Deep dive into the quiz with AI chat providers.

We prepare a focused prompt with your quiz and certificate details so each AI can offer a more tailored, in-depth explanation.

You are in the process of developing a real-time data warehouse for your company using Google BigQuery streaming inserts. Given the nature of the data stream, there isn't a guarantee that the data will be sent only once. However, each row of data does include a unique ID and an event timestamp. To ensure that duplicates are excluded during interactive query sessions, which type of query should you implement?

Exam-Like

Include ORDER BY DESK on timestamp column and LIMIT to 1.

Use GROUP BY on the unique ID column and timestamp column and SUM on the values.

Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.

Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

Explanation:

The correct answer is D. The ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1 will provide a mechanism to ensure data deduplication. This method assigns a unique number to each row within a partition based on the unique ID. By selecting only rows with a row number of 1, you ensure that duplicates are excluded, returning only the first instance of each unique ID. Options A, B, and C fail to effectively remove duplicates based on the unique ID and event timestamp combination.

Powered ByGPT-5.2

Comments

Loading comments...