
Correct answer: C

Explanation:
AWS Glue DataBrew is built for visual, no-code data profiling and transformation: a DataBrew project lets the team explore the data, a recipe captures the point-and-click cleanup steps (filling missing fields, splitting concatenated name-and-address values, dropping unneeded columns), and a scheduled DataBrew job reruns that recipe automatically. The other options all require writing code: loading into Redshift means cleaning with SQL, Lambda means custom Python parsing logic, and EMR adds Spark development plus cluster management overhead.
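Although the scenario calls for a console-only workflow, the same 6-hour cadence can be expressed programmatically through the DataBrew `CreateSchedule` API, which takes an EventBridge-style cron expression. A minimal boto3 sketch follows; the job and schedule names are hypothetical placeholders, and the recipe job is assumed to exist already.

```python
# Hedged sketch: attach a recurring 6-hour schedule to an existing
# DataBrew recipe job. Names below are illustrative, not from the question.

# DataBrew uses EventBridge-style cron: minute hour day-of-month month day-of-week year
EVERY_6_HOURS = "cron(0 */6 * * ? *)"

def create_6_hour_schedule(job_name: str, schedule_name: str) -> dict:
    """Create a DataBrew schedule that runs job_name every 6 hours."""
    # boto3 is imported inside the function so the constants above can be
    # inspected without the AWS SDK installed.
    import boto3

    client = boto3.client("databrew")
    return client.create_schedule(
        Name=schedule_name,
        JobNames=[job_name],          # one schedule can trigger multiple jobs
        CronExpression=EVERY_6_HOURS,
    )

# Example (requires AWS credentials and an existing job):
# create_6_hour_schedule("orders-cleanup-job", "orders-every-6h")
```

The cron expression uses `?` for day-of-week because EventBridge-style cron requires `?` in either the day-of-month or day-of-week field.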
A retail analytics team at Riverstone Gear plans to prepare a customer orders dataset for machine learning using AWS Glue DataBrew. The raw files in Amazon S3 contain missing fields, concatenated name-and-address values, and columns that are no longer needed. The team wants to profile and fix the data with point-and-click steps and have the same workflow run automatically every 6 hours without writing code. Which approach should they take?
A. Load the data into Amazon Redshift, clean it with SQL statements, and trigger recurring runs using Amazon EventBridge
B. Create AWS Lambda functions in Python to parse and cleanse the files, and schedule invocations with Amazon EventBridge
C. Build a DataBrew project to explore the data, author a recipe, and create a scheduled DataBrew job to apply it on a 6-hour cadence
D. Provision an Amazon EMR cluster to run Apache Spark data cleaning jobs and orchestrate the schedule with AWS Step Functions