When dealing with iterative algorithms in Spark that require frequent access to a large dataset, which caching strategy best optimizes performance while effectively managing memory usage?
A
Cache the dataset using MEMORY_ONLY storage level, relying on Spark's eviction policy to manage memory.
B
Avoid caching and reload the dataset from source at each iteration to guarantee data consistency.
C
Use persist(StorageLevel.MEMORY_AND_DISK_SER) for the dataset to ensure it remains in memory as much as possible, spilling to disk when necessary.
D
Persist the dataset in serialized form using MEMORY_ONLY_SER to reduce memory footprint at the cost of CPU overhead for serialization/deserialization.
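A minimal Scala sketch of the approach described in option C, assuming a hypothetical Parquet source at `/path/to/large/dataset` with a numeric `value` column (both names are placeholders, not from the question). The persisted dataset is reused across iterations, kept in memory in serialized form where it fits, and spilled to disk otherwise rather than being recomputed from source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object IterativeCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeCachingSketch")
      .getOrCreate()

    // Hypothetical large dataset that every iteration needs to read
    val data = spark.read.parquet("/path/to/large/dataset")

    // Option C: serialized in-memory storage with disk spill as the fallback,
    // so partitions that do not fit in memory are written to disk instead of
    // being recomputed from the source on each iteration
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)

    var total = 0L
    for (i <- 1 to 10) {
      // Each iteration reuses the persisted copy rather than re-reading Parquet
      total += data.filter(s"value > $i").count()
    }
    println(s"Total across iterations: $total")

    data.unpersist() // release executor memory/disk once the loop finishes
    spark.stop()
  }
}
```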