
Answer-first summary for fast verification
Answer: compute.default_index_type
The correct answer is **C. compute.default_index_type**.

**Explanation:** Pandas API on Spark supports three index types, selected via the `compute.default_index_type` option:

- **sequence:** A continuous integer index computed one row at a time. It is the simplest and matches pandas semantics exactly, but it can become a bottleneck on large datasets because it forces computation through a single node.
- **distributed-sequence:** A continuous integer index computed in a distributed manner; a good balance between pandas compatibility and performance (and the default in recent PySpark releases).
- **distributed:** Monotonically increasing but non-continuous integers, generated fully in parallel; the fastest option when continuous index values are not required.

The `compute.default_index_type` option sets which of these is used for new DataFrames. Choosing `"distributed"` can substantially improve performance for index-related operations on large datasets.

**Why not the others?**

- **A. compute.default_index_cache:** Controls caching of the default index, not which index type is used.
- **B. compute.ops_on_diff_frames:** Controls whether operations between DataFrames with different indexes are allowed; unrelated to the default index type.
- **D. compute.shortcut_limit:** Caps the number of rows collected to the driver for certain shortcut computations; not related to index type.

**Example Usage:**

```python
import pyspark.pandas as ps

# Set the default index type to "distributed"
ps.set_option("compute.default_index_type", "distributed")

# New DataFrames now get a distributed (non-continuous) index
df = ps.DataFrame({"A": [1, 2, 3, 4, 5], "B": [5, 4, 3, 2, 1]})

# Index operations avoid the single-node bottleneck of a sequence index
df.sort_index()
```

**Key Takeaways:**

- Prefer a `distributed` or `distributed-sequence` index for large datasets or frequent index operations.
- Index type choice can significantly affect performance.
- A distributed index is not continuous, so avoid it when your code depends on consecutive integer labels; experiment to find the optimal setup for your workload.
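When the index type should change only for part of a job, `pyspark.pandas.option_context` scopes the setting to a block and restores the previous value on exit. A minimal configuration sketch (assumes a running PySpark environment):

```python
import pyspark.pandas as ps

# Temporarily switch the default index type inside this block only;
# the previously configured value is restored when the block exits.
with ps.option_context("compute.default_index_type", "distributed"):
    df = ps.range(10)  # this DataFrame gets a distributed index

# Outside the block, the prior setting applies again.
print(ps.get_option("compute.default_index_type"))
```

This is useful when one expensive pipeline stage benefits from a distributed index but the rest of the application expects continuous index labels.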
Author: LeetQuiz Editorial Team
You're working with large datasets in Pandas API on Spark and notice performance issues with operations involving the default index. Which configuration option should you adjust to specify the default index type for better performance?
A
compute.default_index_cache
B
compute.ops_on_diff_frames
C
compute.default_index_type
D
compute.shortcut_limit