
NO.44 Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?
Explanation:
This question addresses performance optimization in Google Cloud Bigtable when dealing with large-scale data ingestion and processing.
Analysis of Options:
Option A (Correct): Redefining the schema to evenly distribute reads and writes across the row space is the optimal solution. In Bigtable, performance is heavily dependent on row key design. When reads and writes are concentrated on a small number of rows ("hotspotting"), it creates bottlenecks. By distributing the workload evenly across the row space, you can leverage Bigtable's automatic sharding and achieve better parallelism.
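One common way to distribute keys across the row space is to prefix each key with a short hash of a high-cardinality field. The field names and key format below are illustrative assumptions, not part of the question:

```python
import hashlib

def build_row_key(user_id: str, timestamp_ms: int) -> bytes:
    """Build a Bigtable-style row key that spreads writes across the row space.

    A short hash of the user ID is used as a prefix, so concurrent writes
    for different users sort far apart and land on different tablets
    instead of concentrating on one "hot" key range.
    """
    prefix = hashlib.md5(user_id.encode()).hexdigest()[:4]
    return f"{prefix}#{user_id}#{timestamp_ms}".encode()

# Keys for different users get different prefixes, even when written
# at the same instant:
k1 = build_row_key("user-1001", 1700000000000)
k2 = build_row_key("user-1002", 1700000000000)
```

Because the full user ID and timestamp remain in the key, point lookups for a known user are still possible; only unkeyed range scans become harder.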
Option B (Incorrect): Simply waiting for the cluster size to increase won't resolve the fundamental schema design issue. Hotspotting will persist regardless of cluster size, and this approach would increase costs without solving the performance problem.
Option C (Incorrect): Using a single row key for frequently updated values would actually worsen the hotspotting problem. This concentrates all updates on one row, creating a severe bottleneck.
Option D (Incorrect): Using sequentially increasing numeric IDs can create hotspotting because new data always goes to the "end" of the table, concentrating writes on a small number of tablets.
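The effect described in Option D is easy to see by sorting keys the way Bigtable stores them, lexicographically. Zero-padded sequential IDs form one contiguous, ever-growing range, while hashed versions of the same IDs scatter across the key space. A minimal sketch:

```python
import hashlib

# Sequential, zero-padded numeric IDs: every new key sorts after all
# existing keys, so all new writes target the last tablet in the table.
seq_keys = [f"{i:08d}".encode() for i in range(1000, 1100)]
assert seq_keys == sorted(seq_keys)  # newest key is always the lexicographic max

# Hashing the same IDs spreads them over the key space, so writes are
# distributed across many tablets rather than one.
hashed_keys = [hashlib.md5(k).hexdigest().encode() for k in seq_keys]
```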
Best Practice: The optimal approach is to design row keys that distribute the workload evenly, such as using hash prefixes, salting, or other techniques that spread operations across the entire row space.
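Salting, one of the techniques mentioned above, can be sketched as follows. The bucket count and key format here are illustrative assumptions, not a fixed Bigtable API:

```python
import hashlib

NUM_BUCKETS = 8  # illustrative; often sized relative to the node count

def salted_key(base_key: str) -> bytes:
    """Prefix the key with a deterministic salt bucket so writes fan out."""
    bucket = int(hashlib.md5(base_key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket}#{base_key}".encode()

def scan_prefixes(base_prefix: str) -> list[bytes]:
    """Range reads must fan out across every salt bucket for a prefix."""
    return [f"{b}#{base_prefix}".encode() for b in range(NUM_BUCKETS)]
```

The trade-off is visible in `scan_prefixes`: salting evens out write load, but any range scan now has to issue one read per bucket and merge the results, so it suits write-heavy workloads better than scan-heavy ones.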