
AWS Certified Solutions Architect - Professional
A company is collecting a significant volume of data from a fleet of IoT devices. This data is stored in Optimized Row Columnar (ORC) format within the Hadoop Distributed File System (HDFS) on a persistent Amazon EMR cluster. The company's data analytics team utilizes SQL queries through Apache Presto, which is deployed on the same EMR cluster. These queries involve scanning extensive datasets, have a runtime of less than 15 minutes, and are executed exclusively between 5 PM and 10 PM. Given the concern over the high costs associated with this current setup, a solutions architect is tasked with identifying the most cost-effective solution for facilitating SQL data queries. Which of the following solutions would meet these requirements?
Explanation:
The most cost-effective solution is to store the data in Amazon S3 through the EMR File System (EMRFS) and query it with Presto on Amazon EMR. EMRFS is the HDFS implementation that Amazon EMR clusters use to read and write data directly in Amazon S3. Because the data then lives in S3 rather than in HDFS on the cluster, storage is cheaper and no longer tied to the cluster's lifetime, and the analytics team can keep using Presto for its queries. The cluster can therefore be shut down when it is not in use instead of running persistently, which reduces costs further and matches the requirement that queries run only between 5 PM and 10 PM. Options A, B, and D either cost more or are a poorer fit for the described use case.
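The effect of running the cluster only during the query window can be illustrated with some rough arithmetic. The sketch below is a simplified assumption, not a pricing model: the hourly rate is a hypothetical placeholder, and it ignores S3 storage and request charges, cluster startup time, and any difference in instance sizing.

```python
# Rough comparison of compute hours: persistent cluster vs. a cluster
# that runs only during the 5 PM to 10 PM query window.

HOURS_PER_DAY = 24
QUERY_WINDOW_HOURS = 22 - 17  # 5 PM (17:00) to 10 PM (22:00) = 5 hours


def daily_cluster_cost(hourly_rate: float, hours_running: float) -> float:
    """Return the daily compute cost for a cluster billed per hour."""
    return hourly_rate * hours_running


# Hypothetical hourly cost of the whole cluster; not a real AWS price.
hypothetical_hourly_rate = 10.0

persistent_cost = daily_cluster_cost(hypothetical_hourly_rate, HOURS_PER_DAY)
windowed_cost = daily_cluster_cost(hypothetical_hourly_rate, QUERY_WINDOW_HOURS)

# Fraction of daily compute cost avoided by stopping the cluster
# outside the query window (independent of the hourly rate).
savings_fraction = 1 - windowed_cost / persistent_cost

print(f"Persistent cluster: {persistent_cost:.2f} per day")
print(f"Query-window only:  {windowed_cost:.2f} per day")
print(f"Compute cost avoided: {savings_fraction:.1%}")
```

Under these assumptions the cluster runs 5 of 24 hours, so the saving on compute does not depend on the rate chosen. It is only achievable because the data sits in S3 and survives cluster shutdown.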