
Answer-first summary for fast verification
Answer (Option B): Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to perform data deduplication.
AWS Glue provides built-in machine learning transforms such as FindMatches, which is specifically designed to identify duplicate records across datasets even when the records don't match exactly. Using a built-in AWS Glue ML transform eliminates the need to write custom deduplication logic or manage third-party libraries, so it carries the least operational overhead.
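To see why exact-match deduplication (e.g. `DataFrame.drop_duplicates()` in options A) falls short on legacy data, here is a minimal, runnable sketch using Python's standard-library `difflib` to flag near-duplicate records. The records and the 0.8 similarity threshold are illustrative assumptions, not from the question; FindMatches itself uses a trained ML model rather than a fixed string-similarity cutoff.

```python
from difflib import SequenceMatcher

# Hypothetical legacy records: the first two describe the same company,
# but the strings differ, so an exact comparison such as
# DataFrame.drop_duplicates() would keep both rows.
records = [
    "ACME Corp, 12 Main St, Springfield",
    "Acme Corporation, 12 Main Street, Springfield",
    "Globex LLC, 99 Ocean Ave, Shelbyville",
]

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two record strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs above an assumed 0.8 similarity threshold as likely duplicates.
likely_duplicates = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, len(records))
    if similarity(records[i], records[j]) > 0.8
]
print(likely_duplicates)  # the near-duplicate pair (0, 1) is flagged
```

FindMatches does this kind of fuzzy matching at scale inside a Glue ETL job, with the matching model trained from labeled examples instead of a hand-tuned threshold.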
Author: Ritesh Yadav
Question 19

A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information. The data engineer must identify and remove duplicate information from the legacy application data.

Which solution will meet these requirements with the LEAST operational overhead?
A
Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame.drop_duplicates() function by importing the Pandas library to perform data deduplication.
B
Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
C
Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
D
Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.