
You are working on a project that requires feature selection for a machine learning model. The dataset has a large number of features, and you need to identify the most relevant features that contribute to the model's performance. Explain how you would use Apache Spark to perform feature selection at scale.
A
To perform feature selection at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would use Spark's built-in feature selection algorithms, such as the chi-squared test or mutual information, to evaluate the relevance of each feature. Finally, you would select the top-k features with the highest scores and use them to train the machine learning model.
B
To perform feature selection at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would use Spark's built-in feature selection algorithms, such as the chi-squared test or mutual information, to evaluate the relevance of each feature. However, you would not select the top-k features with the highest scores for training the machine learning model.
C
To perform feature selection at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would manually inspect each feature and select the top-k features that you believe are most relevant for training the machine learning model.
D
To perform feature selection at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would use a single machine to perform feature selection with traditional algorithms, such as the chi-squared test or mutual information, without leveraging the distributed computing power of Spark.
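The approach described in option A can be illustrated with a minimal PySpark sketch using Spark MLlib's ChiSqSelector. The column names, sample rows, and the choice of numTopFeatures=2 are hypothetical placeholders, not from the question itself:

```python
# Minimal sketch of chi-squared feature selection in Spark (assumes PySpark is installed).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

spark = SparkSession.builder.appName("feature-selection-sketch").getOrCreate()

# Hypothetical labeled dataset; feature values must be non-negative for the chi-squared test.
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0, 3.0),
     (1.0, 0.0, 2.0, 1.0),
     (1.0, 1.0, 2.0, 0.0)],
    ["label", "f1", "f2", "f3"],
)

# Assemble the raw columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
assembled = assembler.transform(df)

# Score each feature against the label with the chi-squared test and keep the top-k.
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         labelCol="label", outputCol="selectedFeatures")
model = selector.fit(assembled)
selected = model.transform(assembled)

print("Indices of selected features:", model.selectedFeatures)
selected.select("label", "selectedFeatures").show()

spark.stop()
```

On Spark 3.1 and later, the same top-k selection can also be expressed with UnivariateFeatureSelector, which generalizes ChiSqSelector to other scoring functions such as the ANOVA F-test.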