
Answer-first summary for fast verification
Answer: Utilize a Dataproc cluster with the BigQuery Hadoop connector to read data from each table, sort the data, and calculate a hash from non-timestamp columns for comparison.
Option C is the most reliable method because it compares the full contents of both tables rather than a subset. A Dataproc job uses the BigQuery Hadoop connector to read each table, sorts the rows into a deterministic order, and calculates a hash from the non-timestamp columns. If the two hashes match, the tables hold the same data, even without a primary key to join on. Timestamp columns are left out of the hash, typically because values such as load times differ between runs even when the underlying data is identical. Options A and B rely on random samples, which check only a subset of the data. Without a key, independent samples are also unlikely to contain the same rows from both tables, so the comparison can report mismatches for identical tables and can miss real differences. Option D has the same weakness: stratified samples built with OVER() still cover only part of the data, and without a key there is no reliable way to pair up "equivalent" samples across the two tables.
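The sort-then-hash idea can be sketched in plain Python. This is only an in-memory illustration under stated assumptions, not the Dataproc or BigQuery Hadoop connector job itself: the rows are assumed to be already loaded as dictionaries, and the function name `table_fingerprint`, the column `loaded_at`, and the sample data are all hypothetical.

```python
import hashlib
import json


def table_fingerprint(rows, ignore_columns=()):
    """Return a hex digest of all rows that does not depend on row order."""
    ignored = set(ignore_columns)
    row_digests = []
    for row in rows:
        # Drop ignored columns (for example load timestamps), then serialize
        # with sorted keys so column order cannot change the result.
        kept = {k: v for k, v in row.items() if k not in ignored}
        payload = json.dumps(kept, sort_keys=True, default=str).encode("utf-8")
        row_digests.append(hashlib.sha256(payload).hexdigest())

    # Sorting the per-row digests gives a deterministic order without a
    # primary key. Duplicate rows are kept, so row counts still matter.
    row_digests.sort()

    combined = hashlib.sha256()
    for digest in row_digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()


# Hypothetical sample data: same content, different row order and load times.
original = [
    {"customer": "a", "amount": 10, "loaded_at": "2024-01-01T00:00:00"},
    {"customer": "b", "amount": 20, "loaded_at": "2024-01-01T00:00:00"},
]
migrated = [
    {"customer": "b", "amount": 20, "loaded_at": "2024-02-01T09:30:00"},
    {"customer": "a", "amount": 10, "loaded_at": "2024-02-01T09:30:00"},
]

tables_match = (
    table_fingerprint(original, ignore_columns=["loaded_at"])
    == table_fingerprint(migrated, ignore_columns=["loaded_at"])
)
```

Here `tables_match` is true because only row order and the ignored timestamp column differ. Hashing each row before sorting avoids having to define a sort order across mixed column types; at real table sizes the same steps would run as a distributed job rather than in one process.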
Author: LeetQuiz Editorial Team
After migrating an ETL job to BigQuery, how can you compare its output with the original job's output when the tables lack a primary key for joining, but you have the original job's output table?
A. Use the RAND() function to select random samples from each table and compare these samples.
B. Employ the HASH() function to select random samples from the tables and compare these samples.
C. Utilize a Dataproc cluster with the BigQuery Hadoop connector to read data from each table, sort the data, and calculate a hash from non-timestamp columns for comparison.
D. Create stratified random samples using the OVER() function and compare equivalent samples from each table.