
Explanation:
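Option A is the most reliable method because it compares the full contents of both tables: it sorts the rows and hashes the non-timestamp columns, so the result depends on neither row order nor a primary key. Options B and C are less reliable because they compare only samples. Differences outside the sample go unnoticed, and samples drawn independently from each table are unlikely to contain the same rows, so even identical tables can appear to differ. Option D is unclear: OVER() is a window clause rather than a sampling function, and without a key there is no dependable way to pick equivalent samples from each table.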
After migrating an ETL job to BigQuery, how can you verify that its output matches the original job's output? You still have the original job's output table, but the tables have no primary key to join on.
A
Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
B
Select random samples from the tables using the RAND() function and compare the samples.
C
Select random samples from the tables using the HASH() function and compare the samples.
D
Create stratified random samples using the OVER() function and compare equivalent samples from each table.
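The idea behind option A can be sketched in a few lines of Python. This is a minimal local illustration, not the Dataproc job itself: in practice the rows would be read from each table through the BigQuery connector, and the function name `table_fingerprint`, the column names, and the sample rows below are all hypothetical.

```python
import hashlib
import json


def table_fingerprint(rows, exclude=()):
    """Return a SHA-256 digest of the rows that ignores row order and excluded columns."""
    excluded = set(exclude)
    # Serialize each row with sorted keys so column order does not matter.
    serialized = [
        json.dumps(
            {k: v for k, v in row.items() if k not in excluded},
            sort_keys=True,
            default=str,
        )
        for row in rows
    ]
    # Sort the serialized rows so row order does not matter; duplicates are kept.
    serialized.sort()
    digest = hashlib.sha256()
    for line in serialized:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # row delimiter; JSON output never contains a raw newline
    return digest.hexdigest()


# Hypothetical output of the original job (no primary key, duplicate rows allowed).
original = [
    {"customer": "a", "amount": 10, "loaded_at": "2024-01-01T00:00:00"},
    {"customer": "b", "amount": 20, "loaded_at": "2024-01-01T00:00:01"},
    {"customer": "a", "amount": 10, "loaded_at": "2024-01-01T00:00:02"},
]

# Hypothetical output of the migrated job: same data, different order and load times.
migrated = [
    {"customer": "b", "amount": 20, "loaded_at": "2024-06-01T09:00:00"},
    {"customer": "a", "amount": 10, "loaded_at": "2024-06-01T09:00:01"},
    {"customer": "a", "amount": 10, "loaded_at": "2024-06-01T09:00:02"},
]

tables_match = (
    table_fingerprint(original, exclude=["loaded_at"])
    == table_fingerprint(migrated, exclude=["loaded_at"])
)
print(tables_match)
```

Because every row contributes to the digest, a single changed value or a missing duplicate row changes the fingerprint, which is what a sample-based comparison cannot guarantee.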