
Answer-first summary for fast verification
Answer: Forward fill method involves filling in the missing values with the last observed value in the dataset, while backward fill method involves filling in the missing values with the first observed value in the dataset. These approaches can be useful for time series data or when there is a clear trend in the data, but may not work well for datasets with random missingness or when the missing values are not related to the observed values.
**Correct Answer: A**

### Explanation of Forward Fill and Backward Fill Methods

**Forward Fill (ffill or pad):** This method fills each missing value with the **last observed (previous) non-missing value**, carrying it forward in the order of the data.

**Backward Fill (bfill or backfill):** This method fills each missing value with the **next observed (following) non-missing value**, carrying it backward to the gap.

Option A's phrase "first observed value" for backward fill is best read as the first observed value *after* the gap, not the first value in the whole dataset.

In pandas (commonly used on Databricks alongside Spark or the pandas API on Spark, formerly Koalas), you would apply them like this:

- `df['House Size'].ffill()`
- `df['House Size'].bfill()`

Older code often uses `df['House Size'].fillna(method='ffill')` and `df['House Size'].fillna(method='bfill')`; the `method` argument is deprecated in recent pandas releases in favour of `ffill()` and `bfill()`.

Spark's own `DataFrame.fillna()` only substitutes constant values. In Spark DataFrames, forward and backward fill are usually built from window functions (for example, `last(..., ignorenulls=True)` over an ordered window) or from pandas UDFs in Databricks.

### Benefits and Limitations (as described in Option A)

**Benefits:**

- Simple and computationally cheap.
- Useful for **time series data** or other sequentially ordered data (e.g., house sizes recorded over time or along a logical sequence) where values tend to stay similar or follow a trend.
- Can preserve local structure better than a global statistic (mean/median) when neighbouring values are similar.
- No need to calculate additional statistics.

**Limitations:**

- Assumes that the neighbouring observed value (before the gap for forward fill, after it for backward fill) is a good proxy for the missing one.
- Can **propagate errors** if there is a sudden change or outlier right before or after the gap.
- Performs poorly with **random missingness** (MCAR, Missing Completely At Random) or when missing values have no sequential relationship to observed ones.
- Not suitable if the data has no natural order (e.g., rows are shuffled or represent independent observations).
- Can introduce **bias** in non-time-series datasets, especially for a feature like 'House Size' if houses are not listed in any meaningful order (price, location, date, etc.).
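The forward and backward fill definitions above can be illustrated with a minimal pandas sketch. The column name comes from the question, but the values are made up for illustration, and the names `forward` and `backward` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the column name comes from the question.
df = pd.DataFrame(
    {"House Size": [np.nan, 1200.0, np.nan, np.nan, 1500.0, np.nan]}
)

# Forward fill: each gap takes the previous observed value.
forward = df["House Size"].ffill()

# Backward fill: each gap takes the next observed value.
backward = df["House Size"].bfill()

print(forward.tolist())   # [nan, 1200.0, 1200.0, 1200.0, 1500.0, 1500.0]
print(backward.tolist())  # [1200.0, 1200.0, 1500.0, 1500.0, 1500.0, nan]
```

Note that forward fill leaves the leading gap unfilled and backward fill leaves the trailing gap unfilled, because there is no earlier or later observation to copy from.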
### Why the Other Options Are Incorrect

- **Option B**: This has the definitions **reversed**. Forward fill uses the *previous* value, not the next. Backward fill uses the *next* value, not the previous. The benefits it mentions are partially true, but the core definitions are wrong.
- **Option C**: Completely incorrect. Forward and backward fill do **not** use fixed values (mean/median) or random values. Those are different imputation techniques (simple imputation or stochastic imputation).
- **Option D**: False. Forward and backward fill work perfectly well on numerical features. They are commonly used on both numerical and categorical data when there is a logical order (especially time series). They are **not** limited to categorical features.

### Databricks ML Associate Exam Tips

- Understand when to use forward fill (`ffill()`, or `fillna(method='ffill')` in older pandas code) versus backward fill (`bfill()`, or `fillna(method='bfill')`) in pandas/Spark workflows.
- Know that these methods are **order-dependent**: always ensure your DataFrame is properly sorted if the order matters.
- In production ML pipelines on Databricks, consider combining them (e.g., ffill then bfill for edge cases) or using more advanced imputation (e.g., via MLlib Imputer, regression imputation, or KNN).
- These methods are cheap in pandas. In Spark they rely on an ordered window, so performance depends on how the data is partitioned. In either case, use them cautiously on non-sequential data.

**Key Takeaway for the Exam:**

- Forward fill = last observed (previous) value
- Backward fill = next observed value

Option **A** correctly describes the process and realistically discusses the benefits and limitations.
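The tips on ordering and on combining the two fills can be sketched as follows. This is a hedged illustration, not a prescribed pipeline: the `listed_on` column, the `listings` and `ordered` names, and all values are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical listings; 'listed_on' is an assumed ordering column.
listings = pd.DataFrame(
    {
        "listed_on": pd.to_datetime(
            ["2024-01-03", "2024-01-01", "2024-01-04", "2024-01-02"]
        ),
        "House Size": [np.nan, np.nan, 1800.0, 1400.0],
    }
)

# Sort first so that "previous" and "next" refer to the intended order.
ordered = listings.sort_values("listed_on").reset_index(drop=True)

# Forward fill, then backward fill to cover the leading gap
# that forward fill alone cannot reach.
ordered["House Size"] = ordered["House Size"].ffill().bfill()

print(ordered["House Size"].tolist())  # [1400.0, 1400.0, 1400.0, 1800.0]
```

Filling the unsorted frame would produce different values, which is why the sort comes before the fill.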
Author: LeetQuiz Editorial Team
In a dataset with a numerical feature 'House Size', you have noticed that some values are missing. You have decided to use forward fill or backward fill methods to fill in the missing values. Explain the process of forward fill and backward fill methods and discuss the potential benefits and limitations of each approach.
A
Forward fill method involves filling in the missing values with the last observed value in the dataset, while backward fill method involves filling in the missing values with the first observed value in the dataset. These approaches can be useful for time series data or when there is a clear trend in the data, but may not work well for datasets with random missingness or when the missing values are not related to the observed values.
B
Forward fill method involves filling in the missing values with the next observed value in the dataset, while backward fill method involves filling in the missing values with the previous observed value in the dataset. These approaches can capture the local relationships between 'House Size' values, but may introduce bias if the missing values are not related to the observed values.
C
Forward fill method involves filling in the missing values with a fixed value, such as the mean or median of 'House Size', while backward fill method involves filling in the missing values with a random value. These approaches can provide simple imputation methods, but may not capture the relationships between 'House Size' and other features.
D
Forward fill and backward fill methods are not suitable for numerical features like 'House Size', as they are designed for categorical features only.