
Explanation:
Correct Answer: A
Forward Fill (ffill or pad):
This method fills each missing value with the last observed (previous) non-missing value, carrying that value forward in row order.
Backward Fill (bfill or backfill):
This method fills each missing value with the next observed (following) non-missing value, carrying that value backward in row order.
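In pandas (commonly used in Databricks alongside Spark and the pandas API on Spark, formerly Koalas), you would apply them like this:
Forward fill: df['House Size'].ffill() or df['House Size'].fillna(method='ffill')
Backward fill: df['House Size'].bfill() or df['House Size'].fillna(method='bfill')
The fillna(method=...) form is deprecated as of pandas 2.1, so ffill() and bfill() are the safer choice in new code.
In native Spark DataFrames, fillna() only accepts constant replacement values. A forward or backward fill there is usually built with a window function, such as last() or first() with ignorenulls=True over an ordered window, or by using pandas UDFs or the pandas API on Spark in Databricks.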
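To make the two definitions concrete, here is a minimal pandas sketch. The sample values are invented purely for illustration; only the column name 'House Size' comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the values are illustrative, not from any real dataset.
df = pd.DataFrame({"House Size": [1200.0, np.nan, np.nan, 1500.0, np.nan, 1800.0]})

# Forward fill: each gap takes the previous non-missing value.
forward = df["House Size"].ffill()

# Backward fill: each gap takes the next non-missing value.
backward = df["House Size"].bfill()

print(forward.tolist())   # [1200.0, 1200.0, 1200.0, 1500.0, 1500.0, 1800.0]
print(backward.tolist())  # [1200.0, 1500.0, 1500.0, 1500.0, 1800.0, 1800.0]
```

The same gap is filled with 1200.0 by forward fill and 1500.0 by backward fill, which is exactly the distinction the question tests.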
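Benefits:
Both methods are simple, need no model fitting, and suit time series or other ordered data where neighbouring values are likely to be similar.
Limitations:
The result depends entirely on row order, so both methods can mislead when missingness is random or the missing values are unrelated to nearby observations. Forward fill cannot fill a gap at the start of the data, and backward fill cannot fill a gap at the end.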
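One limitation is easy to miss: forward fill has nothing to copy into a gap at the very start of the data, and backward fill has nothing to copy into a gap at the very end. The sketch below, again on invented values, shows this, along with the limit parameter, which caps how far a value is carried across a long gap.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, a three-row gap, and a trailing gap.
sizes = pd.Series(
    [np.nan, 1200.0, np.nan, np.nan, np.nan, 1500.0, np.nan],
    name="House Size",
)

ffilled = sizes.ffill()        # the leading NaN has no previous value, so it stays
bfilled = sizes.bfill()        # the trailing NaN has no next value, so it stays
both = sizes.ffill().bfill()   # chaining the two fills every gap here
capped = sizes.ffill(limit=1)  # carry each value forward by at most one row

print(int(ffilled.isna().sum()))  # 1
print(int(bfilled.isna().sum()))  # 1
print(int(both.isna().sum()))     # 0
print(int(capped.isna().sum()))   # 3
```

Chaining ffill() and bfill() removes every NaN, but it also means some rows are filled from values far away in the data, so whether that is acceptable depends on the dataset.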
Option B: This has the definitions reversed. Forward fill uses the previous value, not the next. Backward fill uses the next value, not the previous. The benefit part is partially true but the core definitions are wrong.
Option C: Completely incorrect. Forward and backward fill do not use fixed values (mean/median) or random values. Those are different imputation techniques (simple imputation or stochastic imputation).
Option D: False. Forward and backward fill work perfectly fine on numerical features. They are commonly used on both numerical and categorical data when there is a logical order (especially time series). They are not limited to categorical features.
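As a quick check on the point about Option D, the following sketch applies forward fill to a numerical and a categorical column at once. The 'listed_on' and 'House Type' columns and all values are hypothetical, added only to give the rows a logical order and a categorical feature.

```python
import numpy as np
import pandas as pd

# Hypothetical listings, deliberately out of date order.
listings = pd.DataFrame({
    "listed_on": pd.to_datetime(["2024-03-01", "2024-01-01", "2024-02-01", "2024-04-01"]),
    "House Size": [np.nan, 1200.0, np.nan, 1500.0],
    "House Type": [None, "detached", None, "terraced"],
})

# Fill order follows row order, so sort by the ordering column first.
ordered = listings.sort_values("listed_on").reset_index(drop=True)

# ffill() treats numerical and categorical (object) columns the same way.
filled = ordered.ffill()

print(filled["House Size"].tolist())  # [1200.0, 1200.0, 1200.0, 1500.0]
print(filled["House Type"].tolist())  # ['detached', 'detached', 'detached', 'terraced']
```

Without the sort, the first row would have no earlier value to copy and would stay missing, which illustrates why these methods assume a meaningful ordering.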
fillna(method='ffill') vs fillna(method='bfill') in pandas/Spark workflows.
Key Takeaway for the Exam:
Forward fill = last observed (previous) value
Backward fill = next observed value
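Option A is the best available answer: it describes forward fill correctly and discusses the benefits and limitations realistically. Its wording for backward fill ("the first observed value") is loose and should be read as the first observed value after the gap, that is, the next observed value.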
In a dataset with a numerical feature 'House Size', you have noticed that some values are missing. You have decided to use forward fill or backward fill methods to fill in the missing values. Explain the process of forward fill and backward fill methods and discuss the potential benefits and limitations of each approach.
A
Forward fill method involves filling in the missing values with the last observed value in the dataset, while backward fill method involves filling in the missing values with the first observed value in the dataset. These approaches can be useful for time series data or when there is a clear trend in the data, but may not work well for datasets with random missingness or when the missing values are not related to the observed values.
B
Forward fill method involves filling in the missing values with the next observed value in the dataset, while backward fill method involves filling in the missing values with the previous observed value in the dataset. These approaches can capture the local relationships between 'House Size' values, but may introduce bias if the missing values are not related to the observed values.
C
Forward fill method involves filling in the missing values with a fixed value, such as the mean or median of 'House Size', while backward fill method involves filling in the missing values with a random value. These approaches can provide simple imputation methods, but may not capture the relationships between 'House Size' and other features.
D
Forward fill and backward fill methods are not suitable for numerical features like 'House Size', as they are designed for categorical features only.