
Answer-first summary for fast verification
Answer: Modify `compute.isin_limit` to better handle large lists.
The correct answer is **B. `compute.isin_limit`**. This pandas-on-Spark option sets the maximum list length for which a filter is executed as `Column.isin(list)`. A list at or below the limit is embedded in the query as a plain `isin` predicate. When the list is longer than the limit, pandas-on-Spark uses a broadcast join instead: the values are broadcast to the executors and joined against the data, which generally performs better than evaluating a very long `isin` expression. Adjusting `compute.isin_limit` therefore moves the point at which that switch happens, so it can noticeably change the cost of filtering large datasets by long lists. Because a broadcast copy is held in memory on every executor, balance the setting against available memory. The other options do not affect `isin` filtering: `compute.default_index_type` selects how the default index is generated, `compute.ordered_head` controls whether `head` preserves natural ordering, and `compute.default_index_cache` sets the storage level used to cache the default index.
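The threshold rule can be sketched in a few lines. This is a minimal illustration, not pandas-on-Spark's internal code: `planned_strategy` and `with_isin_limit` are hypothetical helper names, and the default of 80 is taken from the pandas-on-Spark options documentation. Only `ps.option_context` is a real library call, and which operations consult the option may vary by version.

```python
DEFAULT_ISIN_LIMIT = 80  # documented default of compute.isin_limit


def planned_strategy(num_values: int, isin_limit: int = DEFAULT_ISIN_LIMIT) -> str:
    """Hypothetical helper: mirror the documented rule for the option."""
    # Lists longer than the limit are handled with a broadcast join;
    # lists at or below it stay a plain Column.isin(list) predicate.
    if num_values > isin_limit:
        return "broadcast join"
    return "Column.isin(list)"


def with_isin_limit(limit: int, action):
    """Hypothetical helper: run action() with the option temporarily changed."""
    # Imported lazily so planned_strategy works without a Spark installation.
    import pyspark.pandas as ps

    # option_context restores the previous value when the block exits.
    with ps.option_context("compute.isin_limit", limit):
        return action()


print(planned_strategy(50))    # at or below the default limit
print(planned_strategy(5000))  # above the default limit
```

With a Spark session available, `with_isin_limit(500, lambda: psdf.drop(index=labels))` would run a single operation under a different limit, where `psdf` and `labels` are placeholders for your own DataFrame and list of index labels.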
Author: LeetQuiz Editorial Team
When working with a large dataset, how can you enhance the efficiency of filtering using `Column.isin(list)`?
A. Adjust `compute.default_index_type` to optimize the operation.
B. Modify `compute.isin_limit` to better handle large lists.
C. Use `compute.ordered_head` for improved performance.
D. Change `compute.default_index_cache` settings.