
A data engineer runs a statement every day to copy the previous day's sales into the table transactions. Each day's sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:
COPY INTO transactions
FROM "/transactions/raw"
FILEFORMAT = PARQUET;
After running the command today, the data engineer notices that the number of records in table transactions has not changed.
What explains why the statement might not have copied any new records into the table?
A
The format of the files to be copied was not included with the FORMAT_OPTIONS keyword.
B
The COPY INTO statement requires the table to be refreshed to view the copied rows.
C
The previous day's file has already been copied into the table.
D
The PARQUET file format does not support COPY INTO.
Explanation:
The correct answer is C. The previous day's file has already been copied into the table.
COPY INTO's default behavior: In Databricks, the COPY INTO command has built-in idempotency. It tracks which files have already been loaded into a table and skips them on subsequent runs.
Automatic deduplication: When you run COPY INTO on a directory, it checks its internal tracking system to see which files have already been processed. If a file has already been loaded into the target table, it will be skipped automatically.
File-level tracking: The command maintains metadata about which source files have been successfully loaded, preventing duplicate data ingestion.
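The skipping behavior described above can be illustrated with a minimal sketch. It assumes the same table and source path as the question; the `force` copy option in the second statement is not part of the original question and is shown only to contrast with the default.

```sql
-- Default behavior: files already loaded from this location are skipped,
-- so rerunning the statement when no new file has arrived adds no rows.
COPY INTO transactions
FROM '/transactions/raw'
FILEFORMAT = PARQUET;

-- Compare the row count before and after to confirm nothing was added.
SELECT COUNT(*) FROM transactions;

-- Setting the 'force' copy option to 'true' disables the idempotency check
-- and reloads every file in the location, including ones loaded before.
-- On a table that already holds those files, this duplicates rows.
COPY INTO transactions
FROM '/transactions/raw'
FILEFORMAT = PARQUET
COPY_OPTIONS ('force' = 'true');
```

In the scenario from the question, the first statement leaving the count unchanged suggests that no new file was present in the directory, or that the only file there had been loaded by an earlier run.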
Option A is incorrect: the FORMAT_OPTIONS keyword is not required for basic Parquet file loading. The FILEFORMAT = PARQUET specification is sufficient.
Option B is incorrect: the COPY INTO statement does not require a table refresh. Changes are immediately visible after successful execution.
Option D is incorrect: Parquet is a supported source file format for COPY INTO.
The COPY INTO command is designed for incremental data loading that skips files it has already loaded, making it ideal for daily ETL/ELT workflows where you want to load only new files without writing complex logic to track what has already been loaded.