
Answer-first summary for fast verification
Answer: Split into multiple CSV files and use a parallel interleave transformation.
The correct answer is C: split the data into multiple CSV files and use a parallel interleave transformation. Sharding the dataset lets multiple workers read in parallel, which can substantially improve input-pipeline throughput. Converting the 5-terabyte CSV to a TFRecord file (Option A) improves read efficiency but is a time-consuming preprocessing step and, done naively, still leaves a single large file that only one reader can consume at a time. Training on a random 10-gigabyte subset (Option B) discards most of the data, and setting reshuffle_each_iteration to true (Option D) changes shuffle behavior rather than read throughput. Splitting the CSV into shards lets you parallelize data loading and preprocessing, addressing the input-pipeline bottleneck directly.
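A minimal sketch of the recommended pattern using `tf.data.Dataset.interleave`. The shard files here are small local stand-ins created by the script itself (in practice they would be the split files in GCS, e.g. a pattern like `gs://bucket/data/part-*.csv`); the file names and contents are illustrative assumptions.

```python
import os
import tempfile
import tensorflow as tf

# Create a few small CSV shards locally to stand in for the split files
# (in production these would live in GCS, read via a "gs://..." pattern).
shard_dir = tempfile.mkdtemp()
for i in range(4):
    with open(os.path.join(shard_dir, f"part-{i:03d}.csv"), "w") as f:
        f.write("feature,label\n")        # header row
        f.write(f"{i * 1.0},{i % 2}\n")   # one data row per shard

# List the shards, then interleave reads so several files are consumed
# concurrently; AUTOTUNE lets tf.data pick the parallelism level.
files = tf.data.Dataset.list_files(os.path.join(shard_dir, "part-*.csv"))
dataset = files.interleave(
    lambda path: tf.data.TextLineDataset(path).skip(1),  # skip each header
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)

rows = [line.numpy().decode() for line in dataset]
print(sorted(rows))  # one row from each shard
```

With `num_parallel_calls=tf.data.AUTOTUNE`, the interleave reads from multiple shards at once, which is exactly what a single monolithic CSV file cannot offer.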
Author: LeetQuiz Editorial Team
You are training a machine learning model using TensorFlow, and you have a large dataset stored as a single 5-terabyte CSV file in Google Cloud Storage. During the profiling of the model's training time, you identify performance issues related to inefficiencies in the input data pipeline. To optimize the input pipeline performance and accelerate the training process, which action should you try first?
A
Preprocess the input CSV file into a TFRecord file.
B
Randomly select a 10 gigabyte subset of the data to train your model.
C
Split into multiple CSV files and use a parallel interleave transformation.
D
Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.