
Answer-first summary for fast verification
Answer: Edit the job to use job bookmarks.
## Explanation

**Correct Answer: A - Edit the job to use job bookmarks.**

**Why this is correct:**

1. **AWS Glue job bookmarks** are designed to track data that previous job runs have already processed.
2. When job bookmarks are enabled, AWS Glue persists state information from each run, so later runs skip data that has already been processed.
3. This is the recommended AWS approach for incremental processing when new data is added to an S3 bucket every day.
4. For Amazon S3 sources, job bookmarks use each object's last-modified time to identify new or changed files, and only those files are processed in the next run.

**Why other options are incorrect:**

**B - Edit the job to delete data after the data is processed:**

- Permanently deleting source data after processing is poor practice.
- The requirement is to prevent reprocessing, not to remove the source data.
- Once the source data is gone, it can no longer be reprocessed on purpose or audited.

**C - Edit the job by setting the NumberOfWorkers field to 1:**

- NumberOfWorkers sets how many workers of the configured worker type are allocated to the job. It controls compute capacity, not which data the job reads.
- With a single worker, the job would still read all the data on every run, only with less parallelism.
- This would likely slow the job down without addressing incremental processing at all.

**D - Use a FindMatches machine learning (ML) transform:**

- FindMatches is an AWS Glue ML transform for deduplication and for finding matching records within a dataset.
- It addresses data quality and record matching, not the reprocessing of entire datasets.
- It does not track what data previous runs have processed.
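To make the bookmark idea behind option A concrete, here is a minimal sketch of incremental selection by last-modified time. It is a toy model, not AWS Glue's actual implementation: `S3Object`, `ToyBookmark`, and `run_job` are hypothetical names introduced for illustration, and real bookmarks keep richer state than a single timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for one object in a bucket listing."""
    key: str
    last_modified: datetime


class ToyBookmark:
    """Simplified bookmark: remembers the newest timestamp processed so far."""

    def __init__(self) -> None:
        self.high_water_mark: Optional[datetime] = None

    def select_new(self, objects: Iterable[S3Object]) -> List[S3Object]:
        # First run: nothing has been processed yet, so take everything.
        if self.high_water_mark is None:
            return list(objects)
        # Later runs: keep only objects modified after the saved mark.
        return [o for o in objects if o.last_modified > self.high_water_mark]

    def commit(self, processed: List[S3Object]) -> None:
        # Advance the mark only after a batch was handled successfully.
        if not processed:
            return
        newest = max(o.last_modified for o in processed)
        if self.high_water_mark is None or newest > self.high_water_mark:
            self.high_water_mark = newest


def run_job(bookmark: ToyBookmark, bucket_listing: List[S3Object]) -> List[str]:
    """One simulated job run: pick new objects, 'process' them, save state."""
    batch = bookmark.select_new(bucket_listing)
    # The real transform step would go here.
    bookmark.commit(batch)
    return [o.key for o in batch]
```

Running `run_job` twice against the same listing returns every key the first time and an empty list the second time, which is the behavior the question is asking for. Without the saved state, every run would behave like the first one.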
**Key AWS Glue Concepts:**

- **Job Bookmarks**: Track processed data to enable incremental processing
- **Incremental Processing**: Only process new or modified data since the last job run
- **State Management**: AWS Glue persists bookmark state for each job between runs

**Best Practice:** Enable job bookmarks for recurring ETL jobs that ingest new data incrementally, so that each run pays only for processing data it has not seen before.
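Job bookmarks are switched on through the special job parameter `--job-bookmark-option`, which accepts `job-bookmark-enable`, `job-bookmark-disable`, or `job-bookmark-pause`. The sketch below only builds the argument dictionary. The helper name `with_bookmarks` and the example `--TempDir` value are assumptions for illustration, and the code makes no AWS calls.

```python
from typing import Dict

# Special job parameter that AWS Glue reads to control bookmark behavior.
BOOKMARK_KEY = "--job-bookmark-option"
VALID_OPTIONS = {
    "job-bookmark-enable",
    "job-bookmark-disable",
    "job-bookmark-pause",
}


def with_bookmarks(
    default_arguments: Dict[str, str],
    option: str = "job-bookmark-enable",
) -> Dict[str, str]:
    """Return a copy of a job's default arguments with the bookmark option set.

    Hypothetical helper: the caller would pass the result to whatever tool
    updates the job definition (console, CLI, SDK, or infrastructure as code).
    """
    if option not in VALID_OPTIONS:
        raise ValueError(f"unknown bookmark option: {option!r}")
    merged = dict(default_arguments)  # leave the caller's dict untouched
    merged[BOOKMARK_KEY] = option
    return merged


# Example: existing arguments for a job (values are placeholders).
existing = {"--TempDir": "s3://example-bucket/tmp/"}
updated = with_bookmarks(existing)
```

Setting the parameter is not sufficient by itself. In a Glue ETL script, the job must also call `job.init(...)` at the start and `job.commit()` at the end, and each source read needs a `transformation_ctx` value so that Glue can key the saved state to it.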
Author: LeetQuiz Editorial Team
A company has an AWS Glue extract, transform, and load (ETL) job that runs every day at the same time. The job processes XML data that is in an Amazon S3 bucket. New data is added to the S3 bucket every day. A solutions architect notices that AWS Glue is processing all the data during each run.
What should the solutions architect do to prevent AWS Glue from reprocessing old data?
- A. Edit the job to use job bookmarks.
- B. Edit the job to delete data after the data is processed.
- C. Edit the job by setting the NumberOfWorkers field to 1.
- D. Use a FindMatches machine learning (ML) transform.