
Answer-first summary for fast verification
Answer: %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
The code uses %sh to execute shell commands, which run solely on the driver node. This approach does not distribute the workload across worker nodes or leverage Spark's distributed processing. The Python script (run.py) most likely processes the ~1 GB of data sequentially on the driver, which explains the long runtime. Option E correctly identifies that the code uses neither the worker nodes nor Databricks-optimized Spark. The other options are incorrect: A (%sh does not trigger a cluster restart), B (pip install does not parallelize execution; the code runs a cloned script, not an installed package), C (replacing mv with %fs would not distribute the work, since both run a single file operation), and D (Python vs. Scala is irrelevant when Spark is not used at all).
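As a rough sketch of the distributed alternative the explanation alludes to, the extract-and-load step could be expressed as Spark DataFrame operations so it runs across the worker nodes. This is a hypothetical rewrite, not the original run.py logic: the source path, format, and schema options below are assumptions, and in a Databricks notebook the `spark` session is already predefined.

```python
# Hypothetical distributed rewrite of the legacy workload: read and write
# the data with Spark so the work is spread across all worker nodes,
# instead of running run.py sequentially on the driver.
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; getOrCreate() keeps
# this sketch self-contained outside that environment.
spark = SparkSession.builder.getOrCreate()

# Read the source data in parallel (path and CSV format are assumptions).
df = spark.read.option("header", "true").csv("dbfs:/mnt/raw_data")

# Any per-record logic from run.py would be expressed as DataFrame
# transformations here, so it executes on the workers rather than the driver.

# Write the result directly to the mount, replacing the `mv` step.
df.write.mode("overwrite").parquet("dbfs:/mnt/new_data")
```

Note the design difference: the shell-based version is bounded by a single node's CPU and disk, while the Spark version scales with the cluster size and avoids the extra copy from the driver's local filesystem into DBFS.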
Author: LeetQuiz Editorial Team
The following code has been migrated to a Databricks notebook from a legacy workload:
%sh
git clone https://github.com/foo/data_loader;
python ./data_loader/run.py;
mv ~/output /dbfs/mnt/new_data
The code executes successfully and produces logically correct results, but it takes over 20 minutes to extract and load approximately 1 GB of data.
Which statement could explain this performance issue?
A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.