
Explanation:
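Option B is the most efficient and scalable approach. Scheduled pipeline runs keep ingestion regular without manual effort, transformations and validations catch malformed records before they reach the lakehouse, and partitioning by date lets queries read only the relevant slices as the dataset grows. Each of the other options omits at least one of these: A and C skip transformations, C and D rely on unscheduled or manual loads, and D stores everything in a single large file.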
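The validate, transform, and partition-by-date steps named in the explanation can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the record fields (`id`, `ts`, `value`), the `event_date=` directory layout, and the function names are all assumptions made for the example, and JSON Lines is used only to stay within the standard library. A real lakehouse would more likely write a columnar format, and the scheduling itself would be handled by the pipeline's trigger rather than by this code.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path


def validate(record):
    """Return True when the record has the fields this sketch assumes."""
    if not isinstance(record.get("id"), int):
        return False
    if not isinstance(record.get("value"), (int, float)):
        return False
    try:
        # Assumes ISO 8601 timestamps with an explicit UTC offset.
        parsed = datetime.fromisoformat(record.get("ts", ""))
    except (TypeError, ValueError):
        return False
    return parsed.tzinfo is not None


def transform(record):
    """Normalize the timestamp to UTC and derive the partition date."""
    ts = datetime.fromisoformat(record["ts"]).astimezone(timezone.utc)
    return {
        "id": record["id"],
        "ts": ts.isoformat(),
        "value": float(record["value"]),
        "event_date": ts.date().isoformat(),
    }


def ingest(records, root):
    """Validate, transform, and append records to date partitions under root.

    Returns a (written, rejected) tuple of counts.
    """
    partitions = defaultdict(list)
    rejected = 0
    for record in records:
        if not validate(record):
            rejected += 1  # invalid rows are counted, not written
            continue
        row = transform(record)
        partitions[row["event_date"]].append(row)

    written = 0
    for event_date, rows in partitions.items():
        # One directory per day, e.g. <root>/event_date=2024-01-01/
        part_dir = Path(root) / f"event_date={event_date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0000.jsonl", "a", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
                written += 1
    return written, rejected
```

A scheduled run would call `ingest(batch, root)` once per batch fetched from the API. Because each day lands in its own directory, a query for one date only needs to read that directory instead of scanning the whole dataset.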
You are tasked with ingesting a large dataset from an external API into your lakehouse. The dataset is expected to grow significantly over time. Describe the steps you would take to ensure efficient data ingestion using a data pipeline. Include considerations for data validation, transformation, and storage optimization.
A
Use a simple ETL process without transformations and store the data in raw format.
B
Create a data pipeline with scheduled runs, apply necessary transformations and validations, partition data by date, and store in a structured format.
C
Manually download and upload the data periodically and perform no transformations.
D
Ingest the data without scheduling and store it in a single large file.