
Answer-first summary for fast verification
Answer: Use the `persist` method of the Spark DataFrame API with the `MEMORY_AND_DISK` storage level to cache the dataset, and then train the model using Spark ML.
The correct approach for handling a large dataset and optimizing model training in Spark ML is to use the `persist` method of the Spark DataFrame API with the `MEMORY_AND_DISK` storage level. Partitions that fit in memory are kept there, and the rest are spilled to disk, so Spark avoids recomputing the DataFrame from its lineage on every pass; this matters because most Spark ML algorithms are iterative and scan the data many times during training. Option A is incorrect because caching the entire dataset in memory can cause out-of-memory errors or eviction when the dataset is too large to fit. Option B is incorrect because increasing the number of partitions does not always improve performance; the right partition count depends on the cluster's resources and the characteristics of the data. Option D is incorrect because decreasing the number of partitions reduces parallelism, leaving fewer concurrent tasks and potentially slowing training.
Author: LeetQuiz Editorial Team
Consider a scenario where you have a large dataset with millions of rows and you want to train a machine learning model using Spark ML. Explain the steps involved in handling the large dataset and optimizing the performance of the model. Provide a code snippet demonstrating the optimization techniques used in Spark ML.
A. Use the cache method of the Spark DataFrame API to cache the entire dataset in memory, and then train the model using Spark ML.
B. Use the repartition method of the Spark DataFrame API to increase the number of partitions, and then train the model using Spark ML.
C. Use the persist method of the Spark DataFrame API with the MEMORY_AND_DISK storage level to cache the dataset, and then train the model using Spark ML.
D. Use the coalesce method of the Spark DataFrame API to decrease the number of partitions, and then train the model using Spark ML.