Explanation
Option A is the correct answer because maintaining data quality rules separately from the pipeline follows the best practice of separation of concerns and enables reusability across multiple tables and pipelines.
Why Option A is correct:
- Reusability: Maintaining data quality rules separately lets the same set of rules be applied to multiple tables without duplicating code (see the sketch after this list)
- Maintainability: Changes to data quality rules can be made in one place and automatically apply to all tables using those rules
- Separation of Concerns: Keeps data quality logic separate from data transformation logic, making both easier to manage
- CI/CD Integration: Separate data quality rules can be version-controlled and deployed independently
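A minimal sketch of this pattern using Delta Live Tables expectations, assuming the rules live in their own Delta table. The table name shared.data_quality_rules, its (name, constraint, tag) schema, and the get_rules helper are illustrative, and spark is the SparkSession the pipeline runtime provides:

```python
import dlt
from pyspark.sql.functions import col


def get_rules(tag):
    """Load all rules with the given tag from the shared rules table.

    Each row is assumed to hold a rule name and a SQL constraint
    expression, e.g. ("valid_order_id", "order_id IS NOT NULL", "validity").
    """
    df = spark.read.table("shared.data_quality_rules").filter(col("tag") == tag)
    return {row["name"]: row["constraint"] for row in df.collect()}


# The same rule set can decorate any number of tables; editing a row in
# shared.data_quality_rules updates every table that loads that tag.
@dlt.table
@dlt.expect_all_or_drop(get_rules("validity"))
def orders_clean():
    return spark.read.table("raw.orders")
```

Because the rules are just rows in a table (or entries in a config file), they can be reviewed, version-controlled, and promoted through environments independently of the pipeline code.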
Why other options are incorrect:
- Option B: Running a separate pipeline concurrently doesn't guarantee the data quality rules are applied before downstream tasks consume the data, so it can introduce race conditions and timing issues
- Option C: Tagging datasets only attaches metadata; it neither applies nor enforces data quality rules
- Option D: While creating a task dependency is better than concurrent execution, it still embeds the rules within a single workflow rather than maintaining them separately where they can be reused
This approach aligns with Databricks best practices for data quality management in CI/CD workflows.