
NO.3 Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour. The data scientists have written the following code to read the data for the new key features in the logs:

BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")

You want to improve the performance of this data read. What should you do?
A. Specify the TableReference object in the code.
B. Use the .fromQuery operation to read specific fields from the table.
C. Use both the Google BigQuery TableSchema and TableFieldSchema classes.
D. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
Explanation:
The correct answer is D because:
- Returning the rows as TableRow objects directly in the PCollection gives better performance than the other approaches.
- A PCollection of TableRow objects lets Dataflow process the rows efficiently in parallel.
- BigQueryIO.Read.named("ReadLogData").from("clouddataflow-readonly:samples.log_data") already reads the entire table; representing the result as TableRow objects optimizes how that data is processed in the pipeline.

Why the other options are less optimal:

- A: A TableReference object is simply another way of identifying the table; it does not change how the data is read.
- B: .fromQuery first runs a BigQuery query job and then reads its results, adding an extra step compared with reading the table directly.
- C: TableSchema and TableFieldSchema only describe the table's structure; they do not affect read performance.
The key insight is that in Dataflow pipelines, representing BigQuery data as TableRow objects in PCollections provides the most efficient processing model for large-scale data operations.
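For readers who want to see answer D in context, here is a minimal sketch using the current Apache Beam Java SDK (the question's snippet uses the older Dataflow 1.x style; BigQueryIO.readTableRows() is the modern way to read a table into TableRow objects). The class name ReadLogData, the ExtractFeature step, and the column name "log_field" are placeholders for illustration only, not part of the original question.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReadLogData {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read the table so that each element of the PCollection is one TableRow,
    // i.e. one row of the table; downstream transforms can fan out across workers.
    PCollection<TableRow> rows =
        p.apply("ReadLogData",
            BigQueryIO.readTableRows().from("clouddataflow-readonly:samples.log_data"));

    // Process the rows in parallel; "log_field" is a hypothetical column name.
    rows.apply("ExtractFeature", ParDo.of(new DoFn<TableRow, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Object value = c.element().get("log_field");
        if (value != null) {
          c.output(value.toString());
        }
      }
    }));

    p.run();
  }
}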