
Answer-first summary for fast verification
Answer: the PySpark library in a Fabric notebook
The PySpark library in a Fabric notebook is the best choice for this scenario. PySpark distributes work across the Spark cluster that backs a Fabric notebook, which is essential for processing a dataset of one billion items in parallel. It can read the JSON files directly from OneLake, which minimizes data duplication and reduces loading time, and it handles both the data transformations and the visualizations needed for time series analysis and anomaly detection. Pandas, by contrast, runs on a single node and would have to load the entire dataset into memory, so it cannot match PySpark's parallel processing at this scale. A Microsoft Power BI report with core visuals is well suited to sharing insights with business users, but it is not designed for the initial large-scale transformation and anomaly detection work.
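To make the anomaly-detection step concrete, here is a minimal sketch of z-score outlier flagging in plain Python. In a real Fabric notebook the same logic would run in parallel over a PySpark DataFrame (e.g. computing the mean and standard deviation with aggregate functions and filtering rows whose z-score exceeds a cutoff); the function name, sample data, and the threshold of 2.0 below are illustrative assumptions, not part of the question.

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean.

    A hypothetical single-node illustration of the logic PySpark would
    distribute across the cluster for a billion-item time series.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing to flag
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Toy time series: a steady signal with one obvious spike.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 55.0, 10.1, 10.0]
print(zscore_anomalies(readings))  # → [55.0]
```

Note that a single large outlier inflates the sample standard deviation, which is why the sketch uses a relatively low threshold; at production scale, robust statistics (e.g. median-based scores) are a common refinement.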
Author: LeetQuiz Editorial Team
As a Fabric Analytics Engineer, you are managing a Fabric tenant that hosts JSON files in OneLake, each containing a billion items. Your task is to conduct a time series analysis on these items. The objectives include transforming the data, visualizing it to extract insights, performing anomaly detection, and sharing these insights with business users. It is crucial that the solution adheres to the following requirements: utilize parallel processing, minimize data duplication, and reduce data loading times. Which tool or method should you employ to transform and visualize the data?
A
the PySpark library in a Fabric notebook
B
the pandas library in a Fabric notebook
C
a Microsoft Power BI report that uses core visuals