
Databricks Certified Machine Learning - Associate
A data scientist is using Spark SQL to import data into a machine learning pipeline. After importing, the data is collected into a pandas DataFrame, and machine learning tasks are performed using scikit-learn. Which Databricks cluster mode is most suitable for this scenario? Choose the ONE best answer.
Explanation:
The workflow uses Spark SQL only to import the data and then hands it to pandas and scikit-learn, both of which run on a single machine. Once the data is collected into a pandas DataFrame it sits in the driver's memory, and scikit-learn trains there, so any worker nodes would be idle for the rest of the pipeline. A Single Node cluster is therefore the most appropriate choice: it has a driver and no workers, and Spark runs locally on that driver, so Spark SQL can still import the data without allocating distributed resources that the later steps cannot use. The other options fit less well. A SQL Endpoint is built for running SQL queries rather than Python machine learning code, Standard and High Concurrency clusters provision workers that the scikit-learn tasks cannot use, and drawing the cluster from a pool does not change how much of it the workflow can use. This approach assumes the collected dataset fits in the driver's memory.
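The workflow described in the question can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the table `my_schema.training_data`, its columns, the helper names `collect_and_fit` and `run_on_databricks`, and the choice of `LinearRegression` are all assumptions made for the example. The `spark` argument stands for the SparkSession that Databricks notebooks provide.

```python
def collect_and_fit(spark_df, estimator, feature_cols, label_col):
    """Collect a Spark DataFrame to the driver and fit a scikit-learn style estimator.

    Hypothetical helper: everything after toPandas() runs on one machine,
    which is why worker nodes add nothing to this part of the pipeline.
    """
    # toPandas() pulls every row into the driver's memory as a pandas DataFrame.
    pdf = spark_df.toPandas()
    # scikit-learn estimators train in-process on the driver.
    estimator.fit(pdf[feature_cols], pdf[label_col])
    return estimator, pdf


def run_on_databricks(spark):
    """Example wiring; table and column names are hypothetical."""
    # Imported here so the helper above has no hard dependency on scikit-learn.
    from sklearn.linear_model import LinearRegression

    # Step 1: import the data with Spark SQL.
    spark_df = spark.sql(
        "SELECT feature_a, feature_b, label FROM my_schema.training_data"
    )
    # Steps 2 and 3: collect to pandas, then train with scikit-learn.
    model, _ = collect_and_fit(
        spark_df, LinearRegression(), ["feature_a", "feature_b"], "label"
    )
    return model
```

In a notebook attached to a Single Node cluster, calling `run_on_databricks(spark)` would run all three steps on the driver. The same code also runs on a multi-node cluster, but only the `spark.sql` step could make use of the workers.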