
Answer-first summary for fast verification
Answer: Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.
Option B is CORRECT because S3 Select runs a SQL expression directly against an object in S3 and returns only the data the query asks for, so a single column can be read from a Parquet object without setting up any additional service. For a one-time task this carries the least operational overhead: there is no Lambda function to deploy (A), no Glue DataBrew project to prepare (C), and no Glue crawler or Athena table to configure (D).
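To make the S3 Select approach concrete, here is a minimal sketch of how such a query might look using boto3's `select_object_content` call. The helper names `build_select_request` and `select_column`, along with the bucket, key, and column names in the usage comment, are hypothetical and chosen only for illustration. The sketch assumes one Parquet object per request and CSV output.

```python
def build_select_request(bucket, key, column):
    """Build the keyword arguments for an S3 Select call on one Parquet object.

    The column name is double-quoted so it is treated as a case-sensitive
    identifier in the S3 Select SQL dialect.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f'SELECT s."{column}" FROM S3Object s',
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"CSV": {}},
    }


def select_column(s3_client, bucket, key, column):
    """Run the query and return the selected column as CSV text."""
    response = s3_client.select_object_content(
        **build_select_request(bucket, key, column)
    )
    chunks = []
    # The response payload is an event stream; only Records events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join the raw bytes first so multi-byte characters split across
    # chunks are decoded correctly.
    return b"".join(chunks).decode("utf-8")


# Usage (hypothetical names, requires boto3 and AWS credentials):
#   import boto3
#   csv_text = select_column(
#       boto3.client("s3"), "example-bucket", "data/part-0000.parquet", "customer_id"
#   )
```

Because the client is passed in as an argument, the same function can be exercised with a stub client before it is pointed at a real bucket.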
Author: Ritesh Yadav
Question 20/60
A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data.
Which solution will meet these requirements with the LEAST operational overhead?
A. Configure an AWS Lambda function to load data from the S3 bucket into a pandas dataframe. Write a SQL SELECT statement on the dataframe to query the required column.
B. Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.
C. Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.
D. Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.