
Explanation:
Partitioning by time (using reasonably fine-grained buckets like hours or minutes) enables horizontal scaling of writes. By using partitionBy('event_time'), each Spark task can write into its own partition directory in parallel. This significantly reduces file-system contention and metadata-service throttling, which are common bottlenecks when hundreds of high-velocity pipelines ingest data simultaneously.
Reference: Databricks documentation on performance efficiency recommends keeping data clustered and isolatable during ingestion, often by maintaining a natural time sort order.
Ultimate access to all questions.
No comments yet.
A large organization needs to implement a near real-time ingestion solution to support hundreds of pipelines delivering high-volume, high-velocity updates across multiple tables. Which of the following strategies is most suitable for maximizing write throughput and reducing storage-level contention?
A
Deploy Databricks High Concurrency clusters to leverage optimized cloud storage connections for maximum ingestion throughput.
B
Consolidate all tables into a single database to allow the Databricks Catalyst Metastore to logically balance overall throughput.
C
Isolate each Delta Lake table in its own dedicated storage container to mitigate API rate limits imposed by cloud vendors.
D
Partition ingestion tables by a small time duration (e.g., hourly) to facilitate parallel writes of numerous data files across different directories.
E
Configure Databricks clusters to store all data on attached SSD volumes instead of cloud object storage to enhance file I/O performance.