
You are preparing a large, high-dimensional dataset for machine learning and need to evaluate how outliers affect its distribution. The goal is to quantify the effect of outliers, not just to identify them. The method must be computationally efficient and must provide a statistical measure that directly assesses the spread of data points around the mean. The features also have varying scales, so the method should be scale-invariant. Which of the following methods is the best choice?
A
Principal component analysis (PCA), as it reduces dimensionality and can indirectly highlight outliers by showing which components contribute most to the variance.
B
Creating scatter plot visualizations for each pair of features to visually inspect for outliers, despite the method's inability to quantify their impact.
C
Standard deviation analysis, which provides a statistical measure of how spread out the data points are around the mean, useful for identifying outliers as those beyond 3 standard deviations.
D
K-means clustering, which groups data into clusters based on similarity but is not designed for outlier detection or impact assessment.
E
Both standard deviation analysis and scatter plot visualization, to first quantify the impact of outliers and then visually confirm their presence.
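For reference, the 3-standard-deviation rule mentioned in option C can be sketched in a few lines. This is a minimal illustration, not a prescribed solution: the function name `zscore_outlier_report`, the sample `feature` values, and the choice of the population standard deviation are all assumptions made for the example. Each value is converted to a z-score, which is unitless, so the same threshold applies regardless of a feature's scale. Comparing the standard deviation with and without the flagged values gives a rough number for how much the outliers inflate the spread.

```python
import statistics


def zscore_outlier_report(column, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean
    and compare the spread with and without them (hypothetical helper)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumed choice)
    if std == 0:
        # All values are identical, so nothing can be flagged.
        return {"outliers": [], "std_all": 0.0, "std_trimmed": 0.0}

    # z-scores are unitless, so the threshold does not depend on feature scale.
    z_scores = [(x - mean) / std for x in column]
    outliers = [x for x, z in zip(column, z_scores) if abs(z) > threshold]
    kept = [x for x, z in zip(column, z_scores) if abs(z) <= threshold]

    return {
        "outliers": outliers,
        "std_all": std,
        "std_trimmed": statistics.pstdev(kept) if kept else 0.0,
    }


# Hypothetical feature: twenty typical values plus one extreme value.
feature = [10.0] * 10 + [12.0] * 10 + [100.0]
report = zscore_outlier_report(feature)

# The same feature expressed in different units flags the same observation.
scaled_report = zscore_outlier_report([x * 1000 for x in feature])

print(report["outliers"], report["std_all"], report["std_trimmed"])
```

In a high-dimensional dataset the same function could be applied column by column. The cost is linear in the number of values, since each column needs only a few passes.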