You are working on optimizing the training performance of a TensorFlow model that processes a large dataset stored as a single 5 terabyte CSV file on Cloud Storage. The current input data pipeline is inefficient, leading to prolonged training times. Given the need for scalability, cost-effectiveness, and alignment with data processing best practices, which initial steps should you take to improve the pipeline's performance? (Choose two correct options)
Explanation:
The most effective initial steps to optimize the input pipeline for a 5 terabyte CSV file on Cloud Storage are converting the input CSV data into the TFRecord file format (Option B) and dividing the dataset into multiple CSV files for parallel processing (Option D). These steps are recommended because:

- Option B: TFRecord is TensorFlow's binary serialization format. Serialized tf.train.Example records parse far faster than text CSV rows and stream efficiently from Cloud Storage through the tf.data API (see the conversion sketch below).
- Option D: a single 5 terabyte file can only be consumed as one sequential stream, which makes I/O the bottleneck. Sharding the data into many files lets tf.data interleave reads across shards, parallelizing input and keeping the accelerators fed.
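A minimal conversion sketch is shown below, combining both recommendations by writing sharded TFRecord files. It assumes the CSV holds two numeric feature columns and an integer label; the bucket paths, column names, and shard count are hypothetical placeholders, not values from the question.

```python
# Minimal sketch: convert a CSV on Cloud Storage into sharded TFRecord files.
# The gs://my-bucket paths, column names, and NUM_SHARDS are hypothetical.
import csv

import tensorflow as tf

NUM_SHARDS = 256  # assumption: tune to dataset size and reader parallelism

def row_to_example(row):
    """Serialize one CSV row (a dict of strings) as a tf.train.Example."""
    return tf.train.Example(features=tf.train.Features(feature={
        "feature_a": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row["feature_a"])])),
        "feature_b": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row["feature_b"])])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(row["label"])])),
    }))

# Round-robin rows across shards so the output files are roughly equal in size.
writers = [
    tf.io.TFRecordWriter(f"gs://my-bucket/train-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
    for i in range(NUM_SHARDS)
]
with tf.io.gfile.GFile("gs://my-bucket/train.csv", "r") as f:
    for n, row in enumerate(csv.DictReader(f)):
        writers[n % NUM_SHARDS].write(row_to_example(row).SerializeToString())
for w in writers:
    w.close()
```

In practice, a single-process loop over 5 terabytes would be slow; at this scale the same conversion is usually run as a distributed job, for example with Apache Beam on Dataflow, which writes sharded TFRecords natively.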
While the other options may offer some benefit, they either fail to address the root cause of the inefficiency (Option A), risk underfitting by shrinking the dataset (Option C), or are less comprehensive than combining TFRecord conversion with parallel reads (Option E). Options B and D are therefore the most effective initial steps for improving input pipeline efficiency on large CSV data; a sketch of the resulting parallel read pipeline follows.
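A minimal tf.data sketch of the parallel read path, assuming the sharded TFRecord layout and feature names from the conversion example above (all hypothetical):

```python
# Minimal sketch: read sharded TFRecord files in parallel with tf.data.
# The file pattern and feature spec match the hypothetical conversion above.
import tensorflow as tf

FEATURE_SPEC = {
    "feature_a": tf.io.FixedLenFeature([1], tf.float32),
    "feature_b": tf.io.FixedLenFeature([1], tf.float32),
    "label": tf.io.FixedLenFeature([1], tf.int64),
}

def parse_batch(serialized):
    """Vectorized parse of a batch of serialized tf.train.Example records."""
    parsed = tf.io.parse_example(serialized, FEATURE_SPEC)
    label = parsed.pop("label")
    return parsed, label

files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
dataset = (
    files.interleave(                       # read many shards concurrently
        tf.data.TFRecordDataset,
        num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)                        # buffer size is an assumption
    .batch(1024)                            # batch before map for vectorized parsing
    .map(parse_batch, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)             # overlap input I/O with training steps
)
# dataset now yields (features_dict, label) batches suitable for model.fit().
```

Interleaving across shards is exactly what Option D enables: with a single monolithic file, interleave has nothing to parallelize over.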