
You are working on a project that requires text analytics on a large collection of documents. The dataset is too large to fit in memory, and you need to use distributed computing to process the data. Explain how you would use Apache Spark to perform text analytics at scale.
A
To perform text analytics at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would use Spark's built-in text processing functions, such as tokenization, stop word removal, and stemming, to preprocess the text data. Finally, you would apply text analytics algorithms, such as topic modeling or sentiment analysis, to the preprocessed data to extract insights and patterns (see the code sketch after the options).
B
To perform text analytics at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would use Spark's built-in text processing functions, such as tokenization, stop word removal, and stemming, to preprocess the text data. However, you would not apply any text analytics algorithms to the preprocessed data.
C
To perform text analytics at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would manually inspect each document and perform text analytics without using any built-in functions or algorithms.
D
To perform text analytics at scale using Apache Spark, you would first represent the dataset as an RDD (Resilient Distributed Dataset) or a DataFrame in Spark. Then, you would use a single machine to perform text analytics without leveraging the distributed computing power of Spark.
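Below is a minimal PySpark sketch of the workflow described in option A, for illustration only. The input path ("docs/"), the number of topics, and the vocabulary size are assumed values; each input line is treated as one document; and stemming is left out because Spark ML does not ship a stemmer out of the box (it would typically come from a third-party library).

from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("text-analytics-at-scale").getOrCreate()

# 1. Represent the corpus as a DataFrame: one row per document.
#    "docs/" is a placeholder path (local, HDFS, or S3); here each
#    text line is treated as a separate document for simplicity.
docs = spark.read.text("docs/").withColumnRenamed("value", "text")

# 2. Preprocess: split on non-word characters and drop common stop words.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
tokens = remover.transform(tokenizer.transform(docs))

# 3. Vectorize the tokens and fit a topic model (LDA) across the cluster.
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=10000)
features = vectorizer.fit(tokens).transform(tokens)

lda = LDA(k=10, maxIter=20, featuresCol="features")
model = lda.fit(features)

# Inspect the top words for each discovered topic.
model.describeTopics(5).show(truncate=False)

spark.stop()

Because the documents live in a DataFrame, every stage above (tokenization, stop word removal, vectorization, and LDA training) runs in parallel across the cluster's executors, which is what distinguishes option A from the single-machine or manual approaches in options C and D.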