
You are working on a project that involves training a TensorFlow model on a structured dataset containing 100 billion records, currently stored across multiple CSV files. The dataset is expected to grow over time, and you are tasked with optimizing input/output (I/O) performance to ensure efficient model training. Given the scale of the data and the need for cost-effectiveness and scalability, which of the following approaches would you recommend? (Choose two options)
A
Load the data into BigQuery, and read the data from BigQuery. This approach leverages BigQuery's serverless architecture for scalable analytics but may not be the most cost-effective for iterative model training.
B
Convert the CSV files into shards of TFRecords, and store the data in the Hadoop Distributed File System (HDFS). This method provides distributed storage but adds cluster management overhead and lacks the seamless integration with TensorFlow and managed features that Cloud Storage offers.
C
Load the data into Cloud Bigtable, and read the data from Bigtable. Cloud Bigtable is designed for large-scale, low-latency random reads and writes, but it is not optimized for the sequential, high-throughput scans typical of model-training input pipelines and is a comparatively expensive way to store bulk training data.
D
Convert the CSV files into shards of TFRecords, and store the data in Cloud Storage. TFRecord shards are the format TensorFlow's tf.data input pipeline reads most efficiently, offering high throughput and scalability, with additional Cloud Storage benefits such as lifecycle management and encryption (see the first sketch after the options).
E
Use a combination of loading the data into BigQuery for initial exploratory analysis and then converting the data into TFRecords stored in Cloud Storage for model training. This hybrid approach leverages the strengths of both services but requires additional steps and management (see the second sketch after the options).
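To make option D concrete, here is a minimal sketch of both halves of that approach: converting CSV rows into sharded TFRecord files on Cloud Storage, then reading them back with a parallel tf.data pipeline during training. The bucket path, shard count, column names, and feature width below are hypothetical placeholders, and the sketch assumes all feature columns are numeric; at the 100-billion-record scale in the question, the conversion step would normally run as a distributed job (for example, Apache Beam on Dataflow) rather than this single-process loop.

```python
import tensorflow as tf

CSV_PATTERN = "data/records-*.csv"           # hypothetical CSV shards
TFRECORD_DIR = "gs://my-bucket/tfrecords"    # hypothetical Cloud Storage prefix
NUM_SHARDS = 64                              # aim for shards of roughly 100-200 MB
NUM_FEATURES = 10                            # must match the number of feature columns


def row_to_example(feature_values, label):
    """Serialize one row of numeric features plus a label as a tf.train.Example."""
    feature = {
        "features": tf.train.Feature(
            float_list=tf.train.FloatList(value=feature_values)),
        "label": tf.train.Feature(
            float_list=tf.train.FloatList(value=[label])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()


def convert_csv_to_tfrecord_shards():
    """Round-robin CSV rows into NUM_SHARDS TFRecord files on Cloud Storage."""
    writers = [
        tf.io.TFRecordWriter(f"{TFRECORD_DIR}/part-{i:05d}.tfrecord")
        for i in range(NUM_SHARDS)
    ]
    rows = tf.data.experimental.make_csv_dataset(
        CSV_PATTERN, batch_size=1, label_name="label",
        num_epochs=1, shuffle=False)
    for i, (features, label) in enumerate(rows.as_numpy_iterator()):
        values = [float(v[0]) for v in features.values()]
        writers[i % NUM_SHARDS].write(row_to_example(values, float(label[0])))
    for writer in writers:
        writer.close()


def make_training_dataset(batch_size=512):
    """Build a high-throughput tf.data pipeline over the TFRecord shards."""
    feature_spec = {
        "features": tf.io.FixedLenFeature([NUM_FEATURES], tf.float32),
        "label": tf.io.FixedLenFeature([1], tf.float32),
    }

    def parse(serialized_batch):
        parsed = tf.io.parse_example(serialized_batch, feature_spec)
        return parsed["features"], parsed["label"]

    files = tf.data.Dataset.list_files(f"{TFRECORD_DIR}/part-*.tfrecord")
    return (files
            .interleave(tf.data.TFRecordDataset,              # read shards in parallel
                        cycle_length=tf.data.AUTOTUNE,
                        num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(10_000)
            .batch(batch_size)
            .map(parse, num_parallel_calls=tf.data.AUTOTUNE)  # vectorized parsing
            .prefetch(tf.data.AUTOTUNE))                      # overlap I/O with training
```

Sharding matters because tf.data can interleave reads across many files concurrently, which is what drives the I/O throughput and scalability gains option D refers to.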
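For option E, a hedged sketch of the hybrid flow follows; the project, dataset, table, and bucket names are placeholders. Exploration runs as SQL in BigQuery, the table is then exported to sharded CSVs in Cloud Storage, and the conversion from the previous sketch turns those exports into TFRecord shards for training.

```python
from google.cloud import bigquery

# Hypothetical project, dataset, table, and bucket names.
client = bigquery.Client(project="my-project")

# 1) Exploratory analysis stays in BigQuery (serverless, SQL-based).
summary = client.query(
    "SELECT label, COUNT(*) AS n "
    "FROM `my-project.my_dataset.training_data` "
    "GROUP BY label").to_dataframe()
print(summary)

# 2) Export the table to sharded CSV files in Cloud Storage; BigQuery splits
#    large tables across the wildcard automatically.
extract_job = client.extract_table(
    "my-project.my_dataset.training_data",
    "gs://my-bucket/exports/records-*.csv")
extract_job.result()  # wait for the export job to finish

# 3) Convert the exported CSVs into TFRecord shards for training,
#    e.g., with convert_csv_to_tfrecord_shards() from the previous sketch.
```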