
### Answer-first summary for fast verification
Answer: Partition ingestion tables by a small time duration (e.g., hourly) to facilitate parallel writes of numerous data files across different directories.
### Why this is correct

**Partitioning by time** (using reasonably fine-grained buckets such as hours or minutes) enables horizontal scaling of writes. By partitioning on a coarse time column derived from the event timestamp (e.g., `partitionBy("event_date", "event_hour")`, rather than the raw timestamp, which would create one directory per unique value), each Spark task can write into its own partition directory in parallel. This significantly reduces file-system contention and metadata-service throttling, which are common bottlenecks when hundreds of high-velocity pipelines ingest data simultaneously.

### Why the other options are incorrect

* **High Concurrency clusters:** These are designed to optimize concurrent SQL queries for BI workloads; they do not address object-store write contention or API limits during the ingestion phase.
* **Centralizing in one database:** Logical organization within a database or Unity Catalog does not affect the physical layout of data or how writes are parallelized at the storage layer.
* **Dedicated storage containers:** While separating "hot" tables can help with broad API limits, it increases operational complexity (managing hundreds of mounts) and fails to solve contention within the table itself.
* **Attached SSD volumes:** Delta Lake is architected to reside on cloud object storage (S3/ADLS/GCS). Persisting production Delta tables on ephemeral SSDs is not a supported or scalable pattern.

**Reference:** Databricks documentation on performance efficiency recommends keeping ingested data clustered and isolatable, typically by maintaining a natural time sort order.
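To make the partitioning idea concrete, here is a minimal, self-contained sketch of the Hive-style directory layout that hourly partitioning produces. The table path and the column names `event_date`/`event_hour` are illustrative, not prescribed by the question:

```python
from datetime import datetime, timezone

def hourly_partition_path(table_root: str, event_time: datetime) -> str:
    """Return the Hive-style partition directory for an event, bucketed by hour.

    Deriving a coarse key (date + hour) from the timestamp keeps the partition
    count bounded while still spreading concurrent writes across many
    directories.
    """
    return (
        f"{table_root}/"
        f"event_date={event_time:%Y-%m-%d}/"
        f"event_hour={event_time:%H}"
    )

# Events in the same hour land in the same directory; events an hour apart
# land in different directories, so concurrent writers do not contend.
t1 = datetime(2024, 5, 1, 13, 5, tzinfo=timezone.utc)
t2 = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)
print(hourly_partition_path("/mnt/ingest/events", t1))
# /mnt/ingest/events/event_date=2024-05-01/event_hour=13
print(hourly_partition_path("/mnt/ingest/events", t2))
# /mnt/ingest/events/event_date=2024-05-01/event_hour=14
```

In PySpark the equivalent pattern is to derive the bucket columns and pass them to the writer, e.g. `df.withColumn("event_date", F.to_date("event_time")).withColumn("event_hour", F.hour("event_time")).write.partitionBy("event_date", "event_hour")`, so each task writes files only under its own partition directory.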
Author: LeetQuiz Editorial Team
A large organization needs to implement a near real-time ingestion solution to support hundreds of pipelines delivering high-volume, high-velocity updates across multiple tables. Which of the following strategies is most suitable for maximizing write throughput and reducing storage-level contention?
A. Deploy Databricks High Concurrency clusters to leverage optimized cloud storage connections for maximum ingestion throughput.
B. Consolidate all tables into a single database to allow the Databricks Catalyst Metastore to logically balance overall throughput.
C. Isolate each Delta Lake table in its own dedicated storage container to mitigate API rate limits imposed by cloud vendors.
D. Partition ingestion tables by a small time duration (e.g., hourly) to facilitate parallel writes of numerous data files across different directories.
E. Configure Databricks clusters to store all data on attached SSD volumes instead of cloud object storage to enhance file I/O performance.