1. Introduction
Missing data is a pervasive issue in empirical research across various disciplines, including the social sciences, medical research, and data science [4]. It occurs when no value is available for a variable in an observation, potentially leading to incomplete datasets that compromise the validity and reliability of statistical analyses [5] [6]. Missing data can arise for various reasons, such as participant non-response in surveys, equipment malfunction in experimental settings, or data entry errors [1] [2]. If not properly addressed, missing data can significantly distort statistical analyses and lead to incorrect conclusions [3].
Missing values can significantly impact an analysis: they can introduce bias if not handled properly; many machine learning algorithms cannot handle missing values out of the box; they can lead to the loss of important information if instances with missing values are simply discarded; and improperly handled missing values can lead to incorrect conclusions or predictions.
Missing values can enter a dataset for a variety of reasons. Common causes include: 1) Data entry errors: a value may be forgotten or accidentally deleted. 2) Sensor malfunctions: in IoT or scientific experiments, a faulty sensor might fail to record data at certain times [7]. 3) Survey non-response: respondents might skip questions they are uncomfortable answering or do not understand. 4) Merged datasets: when combining data from multiple sources, some entries might not have corresponding values in all datasets. 5) Data corruption: during data transfer or storage, some values might become corrupted and unreadable. 6) Intentional omissions: some data might be deliberately left out due to privacy concerns or irrelevance. 7) Sampling issues: the data collection method might systematically miss certain types of data. 8) Time-sensitive data: in time series, values might be missing for periods when data was not collected (e.g., weekends, holidays) [8].
Researchers typically encounter three main types of missing data mechanisms, as defined by Rubin [4]: 1) Missing Completely at Random (MCAR): The probability of missing data is unrelated to both observed and unobserved variables. 2) Missing at Random (MAR): The probability of missing data depends on observed variables but not on unobserved variables. 3) Missing Not at Random (MNAR): The probability of missing data depends on unobserved variables, including the missing data itself.
Addressing missing data is of paramount importance in research for several reasons: 1) Bias reduction: Ignoring missing data or using simplistic methods like complete case analysis can lead to biased estimates and incorrect inferences [9]. Proper handling of missing data helps minimize this bias and improve the accuracy of research findings. 2) Statistical power: Missing data reduces the effective sample size, leading to decreased statistical power. Appropriate imputation techniques can help maintain or even increase power by utilizing all available information [10]. 3) Generalizability: Incomplete datasets may not be representative of the population of interest, potentially limiting the generalizability of research findings. Addressing missing data can help improve the external validity of results [11]. 4) Ethical considerations: In clinical trials and other human subjects research, ignoring missing data may lead to the waste of valuable data that participants have provided, raising ethical concerns. 5) Regulatory compliance: In some fields, such as clinical trials, regulatory bodies require proper handling and reporting of missing data [12]. 6) Improved decision-making: In applied settings, such as business analytics or public policy, accurate and complete data is essential for informed decision-making [13].
While imputation methods offer valuable tools for handling missing data, several challenges and considerations must be addressed to ensure effective and reliable results: 1) Handling different data types: Imputation methods must be able to handle various data types, including continuous, categorical, and mixed data [14]. 2) Dealing with high-dimensional data: As the number of variables increases, imputation becomes more challenging due to the curse of dimensionality [15]. 3) Computational efficiency: Some advanced imputation methods can be computationally intensive, necessitating efficient implementations for large-scale applications [16]. 4) Preserving data distributions and relationships: Imputation methods should maintain the statistical properties of the original data, including distributions and relationships between variables [17].
Missing data is a pervasive issue in research across various disciplines, often leading to biased or inefficient analyses [4]. Addressing missing data is crucial for maintaining the integrity and reliability of research findings [18]. This review aims to provide a comprehensive overview of missing data imputation techniques, their applications, and current challenges in the field.
Table 1 summarizes recent papers on imputation methods published between 2017 and 2024, focusing on the year, model used, columns imputed, and key results.
Given the critical nature of missing data in research, this comprehensive review aims to achieve the following objectives: 1) Provide an up-to-date synthesis of current missing data imputation techniques, including traditional methods and advanced machine learning approaches. 2) Critically evaluate the strengths and limitations of various imputation methods across different research contexts and data types. 3) Examine the impact of different missing data mechanisms on the performance of imputation techniques. 4) Explore emerging trends and future directions in missing data imputation, including the application of deep learning and artificial intelligence techniques.
5) Offer practical guidelines to help researchers select appropriate imputation methods based on their specific research context, data characteristics, and analysis goals. 6) Identify gaps in the current literature and propose areas for future research in missing data imputation.
To meet these objectives, the review is organized to: 1) provide an overview of traditional and advanced imputation methods; 2) discuss evaluation metrics for imputation techniques; 3) examine challenges and considerations in missing data imputation; 4) explore case studies and applications across various domains; and 5) identify future directions and open problems in the field of data imputation.
By addressing these objectives, this review aims to provide researchers, statisticians, and data scientists with a comprehensive understanding of missing data imputation techniques, their applications, and their implications for research integrity and validity.
Table 1. Summary of imputation methods in published papers (2017-2024).
| Paper title | Year | Model used | Columns imputed | Key results |
| --- | --- | --- | --- | --- |
| Doreswamy et al. [19] | 2017 | Kernel ridge, linear regression, random forest, SVM, and KNN | Multiple variables of NCDC weather dataset | Accounted for temporal dependencies in longitudinal data, leading to more accurate parameter estimates and improved model performance. |
| Hosahalli et al. [20] | 2018 | Machine learning models | NCDC weather datasets | Improved accuracy of predictive models for the NCDC weather dataset compared to single imputation methods. |
| Khanani [21] | 2021 | Predictive mean matching (PMM) | Education data | Demonstrated effectiveness in imputing missing values in educational data from a public school with both numerical and categorical features. |
| Thakur et al. [22] | 2021 | Machine learning | Time series data | Provided a comprehensive overview of multiple imputation techniques and their applications in machine learning, highlighting their advantages and limitations. |
| Psychogyios et al. [23] | 2023 | KNN-MICE-GAN | Age, gender, diagnosis codes, lab results | Improved accuracy of predictive models for hospital readmission compared to single imputation methods. |
| Omar et al. [24] | 2023 | Random forest, decision tree, neural network, and support vector machine | Dropout in higher education | The Random Forest algorithm obtained the best performance, with an AUC of 0.9623 in the prediction of college dropout. |
| Nida et al. [25] | 2023 | Mean imputation, KNN, PMM | Rainfall data | KNN achieved high imputation accuracy for missing rainfall values, improving the analysis of complex weather datasets. |
| Psychogyios et al. [23] | 2023 | GAN | Electronic health records | GAN achieved high imputation accuracy and outperformed the standard baselines. |
| Teegavarapu et al. [26] | 2024 | Spatial and temporal interpolation methods | Hydrometeorological data | Provided a comprehensive overview of multiple imputation techniques for precipitation, temperature, and streamflow, highlighting their advantages and limitations. |
| Almeida et al. [27] | 2024 | Focalize K-NN method | Time series data | Demonstrated the effectiveness of Focalize K-NN for imputing missing values in time series data. |
| Kowsar et al. [28] | 2024 | Self-attention imputation method | Electronic health records | The proposed imputation method demonstrates superior performance across a range of missing data proportions (10% to 50%) under the assumption of missing completely at random (MCAR). |
2. Missing Data Types
Understanding the mechanisms of missingness is essential for selecting appropriate imputation methods and accurately interpreting results. Rubin’s classification system for missing data, which remains fundamental to the field, identifies three main types of missing data [18].
2.1. Missing Completely at Random (MCAR)
Missingness is considered completely random when it does not depend on any other variables. This condition, known as Missing Completely at Random (MCAR), occurs when the probability of missing data is unrelated to both observed and unobserved variables [29]-[31]. In this scenario, the missingness is purely due to chance, with no influence from any characteristics of the data [4]. MCAR is the most stringent assumption and rarely occurs in practice. However, when data are MCAR, analyses using only complete cases will be unbiased, although potentially inefficient [1].
Let $R$ denote the missingness indicator and $Y = (Y_{\text{obs}}, Y_{\text{mis}})$ the complete data, partitioned into observed and missing components. MCAR can be defined as:
$$P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R).$$
For example, in a survey, some participants accidentally skip questions regardless of their characteristics or responses to other items.
2.2. Missing at Random (MAR)
Under Missing at Random (MAR), the probability of missing data depends on other observed variables but not on the missing values themselves [18]. MAR is a less stringent assumption than MCAR and is often more realistic in practice. It allows for relationships between observed variables and the probability of missingness, and many modern imputation methods, such as multiple imputation, assume MAR [17].
Let $Y_{\text{obs}}$ be the observed data and $Y_{\text{mis}}$ the missing data. MAR can be defined as:
$$P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}).$$
For example, in a longitudinal study on income, older participants are more likely to withhold information about their earnings, but this likelihood is unrelated to the actual income amount once age is accounted for. Similarly, men might be less likely to answer questions about emotions in a survey.
2.3. Missing Not at Random (MNAR)
Missingness is categorized as Missing Not at Random (MNAR) when the probability of missing data depends on unobserved variables or on the values of the missing data themselves [4]. MNAR is the most challenging type of missing data to handle: standard imputation methods can yield biased results, necessitating specialized techniques or sensitivity analyses to address the issue effectively [32]. Formally, MNAR occurs when:
$$P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(R \mid Y_{\text{obs}}).$$
For example, in a mental health survey, participants with severe depression may be less likely to complete questions about their symptoms, with this likelihood directly related to the severity of their undisclosed condition. Similarly, individuals with high incomes might be less inclined to report their income in a survey.
In practice, it is often impossible to definitively determine whether data are MAR or MNAR based solely on observed data [33]. Therefore, researchers often must rely on subject-matter knowledge and conduct sensitivity analyses to assess the robustness of their findings under different missing data assumptions [17].
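To make the three mechanisms concrete, the following minimal sketch simulates MCAR, MAR, and MNAR missingness on a small synthetic age–income dataset. The variable names, sample size, and logistic missingness models are illustrative assumptions, not taken from the cited studies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Synthetic data: income loosely depends on age (illustrative only).
age = rng.normal(45, 12, n)
income = 20000 + 800 * age + rng.normal(0, 5000, n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: each income value is dropped with the same probability,
# independent of age and of income itself.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: the probability of a missing income depends on the observed age
# (older respondents are more likely to skip the question).
mar = df.copy()
p_mar = 1 / (1 + np.exp(-(age - 55) / 5))
mar.loc[rng.random(n) < p_mar, "income"] = np.nan

# MNAR: the probability of a missing income depends on the (unobserved)
# income value itself (high earners are more likely to withhold it).
mnar = df.copy()
p_mnar = 1 / (1 + np.exp(-(income - income.mean()) / income.std()))
mnar.loc[rng.random(n) < p_mnar, "income"] = np.nan

print(mcar["income"].isna().mean(), mar["income"].isna().mean(), mnar["income"].isna().mean())
```

Because the observed data look similar under MAR and MNAR, simulations of this kind are mainly useful for benchmarking imputation methods under known mechanisms, not for diagnosing the mechanism in real data.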
3. Missing Value Imputation
Figure 1. The flowchart of imputation of missing values.
The problem of missing data imputation arises when a dataset contains unobserved values for certain features. Let $X = (X_1, \ldots, X_d)$ represent a random vector, where each $X_j$ corresponds to a feature and follows a distribution $P(X_j)$. A binary mask vector $M = (M_1, \ldots, M_d)$ indicates the presence or absence of observations, with $M_j = 1$ denoting an observed value and $M_j = 0$ a missing value. Given a dataset of $n$ instances $\{(\tilde{x}^{(i)}, m^{(i)})\}_{i=1}^{n}$, an imputed dataset is constructed by replacing missing values (where $m_j^{(i)} = 0$) in the observed data $\tilde{x}^{(i)}$ with pre-imputed values, potentially random noise, resulting in $\bar{x}^{(i)}$. The objective is to develop an imputation model, IMP, that generates an imputed dataset $\{\hat{x}^{(i)}\}_{i=1}^{n}$ such that each imputed sample $\hat{x}^{(i)}$ is drawn from the conditional distribution $P(X \mid X_{\text{obs}} = \tilde{x}^{(i)}_{\text{obs}})$, thereby preserving the original data's distributional properties. This yields a complete dataset in which
$$\hat{x}^{(i)} = m^{(i)} \odot \tilde{x}^{(i)} + \left(1 - m^{(i)}\right) \odot \mathrm{IMP}\left(\bar{x}^{(i)}, m^{(i)}\right)$$
for each sample $i$.
Following imputation of missing values, predictive modeling was performed as shown in Figure 1. Application of $K$ distinct imputation algorithms, $\mathrm{IMP}_1, \ldots, \mathrm{IMP}_K$, to the original dataset $D$ yielded $K$ completed datasets, denoted $\hat{D}_1, \ldots, \hat{D}_K$. A standard predictive model, $f$, was then applied to each of these completed datasets to predict the outcome.
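To make the mask-and-compose notation above concrete, the following minimal NumPy sketch builds $\tilde{x}$, $\bar{x}$, and the composed $\hat{x}$. A column-mean imputer stands in for the model IMP, and the array sizes and placeholder values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))                     # complete data (unknown in practice)
m = (rng.random((5, 3)) > 0.3).astype(float)    # mask: 1 = observed, 0 = missing

x_tilde = np.where(m == 1, x, np.nan)           # observed data with NaNs for missing entries
x_bar = np.where(m == 1, x, 0.0)                # pre-imputed data (zeros as placeholders)

# A column-mean imputer stands in for the imputation model IMP here.
col_means = np.nanmean(x_tilde, axis=0)
imp = np.where(np.isnan(x_tilde), col_means, x_tilde)

# Compose the complete dataset: observed entries are kept, missing ones are imputed.
x_hat = m * x_bar + (1 - m) * imp
print(x_hat)
```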
4. Methods
This study evaluated several imputation methods for handling missing data, ranging from simple statistical techniques to more sophisticated deep learning approaches. The specific methods are described in the following subsections.
4.1. Traditional Imputation Methods
Traditional methods for handling missing data have been widely used due to their simplicity and ease of implementation [1]. These methods aim to replace missing values with plausible estimates based on the available data. The most common traditional imputation techniques are discussed below.
4.1.1. Mean/Median Imputation
This method replaces missing values with the mean or median of the observed values for that variable [4]; for categorical data, the mode (most frequent value) is used instead. In this study, missing numerical values were replaced with the mean of the corresponding feature and missing categorical values with the mode. Although simple, this approach can result in biased estimates and an underestimation of standard errors.
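As a concrete illustration, the sketch below applies scikit-learn's SimpleImputer for mean imputation of numerical columns and mode imputation of a categorical column. The toy data frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 37, 41, np.nan],
    "income": [42000, 55000, np.nan, 61000, 48000],
    "city": ["Cairo", "Giza", np.nan, "Cairo", "Luxor"],
})

# Numerical columns: replace missing values with the column mean (or median).
num_imputer = SimpleImputer(strategy="mean")            # or strategy="median"
df[["age", "income"]] = num_imputer.fit_transform(df[["age", "income"]])

# Categorical column: replace missing values with the mode (most frequent value).
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)
```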
4.1.2. MissForest
This iterative approach uses mean/mode imputation to initialize the dataset [16]. Then, a random forest is trained to predict the missing values in each feature, iteratively refining the imputed values until convergence (defined by a lack of improvement in the imputed matrix). Convergence criteria included a maximum of 20 iterations and 100 trees. The difference between successive imputed matrices ($X^{\text{new}}$ and $X^{\text{old}}$) for numerical ($\mathcal{N}$) and categorical ($\mathcal{F}$) features was measured as:
$$\Delta_{\mathcal{N}} = \frac{\sum_{j \in \mathcal{N}} \left(X^{\text{new}}_{j} - X^{\text{old}}_{j}\right)^{2}}{\sum_{j \in \mathcal{N}} \left(X^{\text{new}}_{j}\right)^{2}} \tag{1}$$
$$\Delta_{\mathcal{F}} = \frac{\sum_{j \in \mathcal{F}} \mathbb{I}\left(X^{\text{new}}_{j} \neq X^{\text{old}}_{j}\right)}{\#\mathrm{NA}} \tag{2}$$
where $\#\mathrm{NA}$ represents the number of missing values in categorical variables.
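A MissForest-style imputation can be sketched with scikit-learn's IterativeImputer using a random forest estimator, as below. This is a simplified, numerical-only approximation of the algorithm described above (packages such as missingpy provide a dedicated MissForest implementation that also handles categorical features), and the dataset, missingness rate, and seeds are synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan        # inject ~20% missing values

# Iterative, random-forest-based imputation in the spirit of MissForest
# (100 trees per forest, up to 20 refinement rounds).
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=20,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print(np.isnan(X_imputed).sum())              # 0 remaining missing values
```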
4.1.3. LOCF and NOCB
Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB) are both methods used for handling missing data in longitudinal studies. LOCF fills in missing data points by carrying forward the last observed value. For example, if a participant’s value is missing at a follow-up, the last recorded value is used to fill in that gap [34].
The advantages of the LOCF method are: 1) Simplicity: Easy to implement and understand. 2) Preservation of Sample Size: Retains all participants in the analysis, which can be important in clinical trials. On the other hand, the disadvantages of the LOCF method are: 1) Assumption of Stability: Implies that the last observation is a good estimate for future values, which may not hold true. 2) Potential Bias: Can introduce bias if the last observed value is not representative of the participant's state at the time of the missing data. 3) Underestimation of Variability: Fails to account for natural fluctuations in the data, potentially leading to misleading conclusions.
NOCB fills in missing data points by carrying the next observed value backward. For example, if a participant’s value is missing before a subsequent observation, the next recorded value is used to fill in the gap.
The advantages of the NOCB method are: 1) Preservation of Trends: Can better reflect changes over time if later observations are more representative of the participant's condition. 2) Potentially Reduces Bias: Addresses some issues associated with LOCF by using future data, which may be more accurate. On the other hand, the disadvantages of the NOCB method are: 1) Assumption of Continuity: Assumes that the value observed in the future can be reliably applied to the past, which may not always be valid. 2) Temporal Distortion: Can introduce bias if there are systematic changes between the missing data point and the next observation. 3) More Complex: Generally considered less intuitive and harder to justify in some contexts than LOCF.
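In pandas, LOCF and NOCB correspond to forward and backward filling. The sketch below uses an invented daily series to show both and how a remaining leading or trailing gap can be handled.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=7, freq="D")
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0, np.nan], index=idx)

locf = s.ffill()   # Last Observation Carried Forward
nocb = s.bfill()   # Next Observation Carried Backward

# A trailing (or leading) gap has no donor value and stays NaN,
# so LOCF and NOCB are often combined, e.g. s.ffill().bfill().
print(pd.DataFrame({"original": s, "LOCF": locf, "NOCB": nocb}))
```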
4.1.4. Hot Deck Imputation
Hot deck imputation involves replacing missing values with observed values from similar respondents or cases [35]. This method can help preserve the distribution of the data but may be challenging to implement for large datasets. This method replaces missing values with values from a similar donor record (a record with non-missing values) in the dataset.
The main steps of Hot Deck Imputation are: 1) Define a set of matching criteria (e.g., age, gender, income) based on the variables with available data. 2) For each missing value, find a donor record that matches the criteria. 3) Replace the missing value with the corresponding value from the donor record. Advantages: Simple, can be effective for handling missing values in categorical variables.
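A minimal random hot-deck sketch is shown below: records are grouped by the matching criteria and each missing value receives a randomly drawn donor value from its group. The data frame, matching keys, and the helper function hot_deck are hypothetical illustrations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_group": ["18-30", "18-30", "31-50", "31-50", "31-50", "51+"],
    "gender":    ["F", "F", "M", "M", "M", "F"],
    "income":    [32000, np.nan, 45000, 47000, np.nan, 52000],
})

def hot_deck(frame, target, keys, rng):
    """Replace missing target values with a random donor value drawn
    from records that share the same matching keys."""
    out = frame.copy()
    for _, group in out.groupby(keys):
        donors = group[target].dropna()
        missing_idx = group.index[group[target].isna()]
        if len(donors) and len(missing_idx):
            out.loc[missing_idx, target] = rng.choice(donors.values, size=len(missing_idx))
    return out

print(hot_deck(df, "income", ["age_group", "gender"], rng))
```

Groups without any observed donor keep their missing values, which in practice is handled by coarsening the matching criteria or falling back to another method.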
4.1.5. Multivariate Imputation by Chained Equations (MICE)
This method generates $m$ imputed datasets [36]. Parameter estimates and standard errors are calculated for each dataset and then pooled to obtain an overall estimate ($\bar{Q}$) and within-imputation variance ($\bar{U}$):
$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i, \qquad \bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i.$$
Between-dataset variability ($B$) is also calculated:
$$B = \frac{1}{m-1} \sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2,$$
and the total variance is $T = \bar{U} + \left(1 + \frac{1}{m}\right) B$.
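The chained-equations procedure can be sketched with scikit-learn's IterativeImputer: drawing imputations with sample_posterior=True under different random seeds yields $m$ completed datasets, as assumed below on synthetic data (the default Bayesian ridge estimator stands in for the per-variable models).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], np.eye(3) + 0.5, size=300)
X[rng.random(X.shape) < 0.15] = np.nan

m = 5  # number of imputed datasets
imputed_datasets = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i, max_iter=10)
    imputed_datasets.append(imputer.fit_transform(X))

# Each completed dataset is then analysed separately and the results pooled
# with Rubin's rules (see the pooling sketch in Section 4.1.8).
print(len(imputed_datasets), imputed_datasets[0].shape)
```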
4.1.6. Neighborhood Aware Autoencoder (NAA)
This approach uses a denoising autoencoder, pre-imputed with kNN, to learn feature relationships and impute missing values [37]. The encoder and decoder are defined by:
$$h = \sigma\left(W_e \bar{x} + b_e\right), \qquad \hat{x} = \sigma\left(W_d h + b_d\right),$$
where $h$ and $\hat{x}$ are the hidden and output vectors, respectively; $W_e$ and $b_e$ are the encoder weights and bias; and $W_d$ and $b_d$ are the decoder weights and bias. Training minimizes the reconstruction error between $\hat{x}$ and $x$.
4.1.7. Improved Neighborhood Aware Autoencoder (I-NAA)
This enhanced version uses an undercomplete autoencoder architecture. To avoid overfitting to the initial kNN imputation, the kNN imputation is updated every 10 epochs, varying the $k$ value within a predefined range. Furthermore, the missing values to be imputed are randomly selected at the start of each epoch. A custom loss function combines mean squared error (MSE) for numerical features and binary cross-entropy (BCE) for categorical features:
$$\mathcal{L} = \mathcal{L}_{\text{MSE}}\left(x_{\text{num}}, \hat{x}_{\text{num}}\right) + \mathcal{L}_{\text{BCE}}\left(x_{\text{cat}}, \hat{x}_{\text{cat}}\right).$$
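The following PyTorch sketch illustrates the general idea behind NAA/I-NAA for numerical features only: an undercomplete autoencoder is trained on pre-imputed data with the reconstruction loss restricted to observed entries. The layer sizes, optimizer, number of epochs, and the use of column-mean (rather than kNN) pre-imputation are simplifying assumptions, not the authors' exact configuration.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)).astype(np.float32)
mask = (rng.random(X.shape) > 0.2).astype(np.float32)      # 1 = observed, 0 = missing
X_obs = np.where(mask == 1, X, np.nan)

# Pre-impute with column means (the NAA paper uses kNN pre-imputation instead).
col_means = np.nanmean(X_obs, axis=0)
X_pre = np.where(np.isnan(X_obs), col_means, X_obs)

x = torch.tensor(X_pre, dtype=torch.float32)
m = torch.tensor(mask, dtype=torch.float32)

model = nn.Sequential(                     # undercomplete autoencoder
    nn.Linear(8, 4), nn.ReLU(),            # encoder: h = sigma(We x + be)
    nn.Linear(4, 8),                       # decoder: x_hat = Wd h + bd
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(200):
    opt.zero_grad()
    x_hat = model(x)
    # Reconstruction error on observed entries only (MSE here;
    # I-NAA adds a BCE term for categorical features).
    loss = (((x_hat - x) ** 2) * m).sum() / m.sum()
    loss.backward()
    opt.step()

with torch.no_grad():
    x_imputed = m * x + (1 - m) * model(x)   # keep observed, fill missing
print(x_imputed.shape)
```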
4.1.8. Multiple Imputation (MI)
Multiple imputation (MI) is a powerful technique for handling missing data that addresses the limitations of single imputation methods. Unlike single imputation, which replaces missing values with a single estimate, MI generates multiple complete datasets by imputing missing values multiple times, each time using different plausible values [38]. This approach accounts for the uncertainty introduced by missing data, leading to more accurate and robust analyses.
The main steps of the MI method are: 1) Imputation: Multiple complete datasets are created by imputing the missing values using a statistical model that accounts for the relationships between variables. The model is typically based on the observed data and assumes a specific distribution for the missing values. Each imputed dataset is generated using different random draws from the conditional distribution of the missing values, reflecting the uncertainty associated with the missing data. 2) Analysis: Each of the imputed datasets is analyzed separately using the chosen statistical methods. This results in multiple sets of estimates for the parameters of interest. 3) Pooling: The results from each imputed dataset are combined using appropriate methods to obtain a single set of estimates and standard errors that reflect the uncertainty introduced by missing data. The most common pooling methods include averaging the estimates and variances across the imputed datasets, as illustrated in the sketch below.
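The pooling step can be expressed in a few lines. The sketch below combines a single scalar estimate (e.g., a regression coefficient) across $m$ imputed datasets with Rubin's rules, matching the formulas in Section 4.1.5; the numerical estimates are invented for illustration.

```python
import numpy as np

# Point estimates and squared standard errors (within-imputation variances)
# of the same parameter from m = 5 imputed datasets (illustrative numbers).
q_hat = np.array([1.02, 0.97, 1.05, 0.99, 1.01])
u_hat = np.array([0.040, 0.042, 0.038, 0.041, 0.039])
m = len(q_hat)

q_bar = q_hat.mean()                # pooled estimate
u_bar = u_hat.mean()                # within-imputation variance
b = q_hat.var(ddof=1)               # between-imputation variance
t = u_bar + (1 + 1 / m) * b         # total variance (Rubin's rules)

print(f"pooled estimate = {q_bar:.3f}, total variance = {t:.4f}, SE = {np.sqrt(t):.3f}")
```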
The advantages of Multiple Imputation are: 1) Accounts for Uncertainty: MI explicitly acknowledges the uncertainty associated with missing values by generating multiple plausible estimates. This results in more realistic confidence intervals and p-values [39]. 2) Reduces Bias: By generating multiple imputed datasets, MI reduces the bias introduced by single imputation methods, especially when the missing data is not missing at random. 3) More Accurate Estimates: MI generally produces more accurate estimates of parameters and statistical tests than single imputation methods. 4) Provides Insights into Missing Data: The variability of estimates across imputed datasets can provide insights into the sensitivity of the analysis to the missing data.
The challenges of Multiple Imputation are: 1) Computational Complexity: MI can be computationally intensive, especially for large datasets and complex models. 2) Model Selection: Choosing the appropriate imputation model is crucial. The model should accurately reflect the relationships between variables and the distribution of the missing data. 3) Software Requirements: Specialized software is often required to perform multiple imputation, as it involves generating and analyzing multiple datasets.
The choice of imputation method depends on the specific characteristics of the data, the nature of the missing data, and the goals of the analysis. It is important to consider the potential biases and limitations of each method before applying it to the data. Table 2 summarizes common imputation methods, their use cases, advantages, disadvantages, Python packages, and suitability for classification or regression problems.
Table 2. Summary of imputation methods.
| Method | Use cases | Advantages | Disadvantages | Python package | Problem type |
| --- | --- | --- | --- | --- | --- |
| Mean/Median Imputation | Simple missing value replacement, suitable for numerical data with a clear central tendency. | Simple, computationally inexpensive. | Can introduce bias, especially for non-normally distributed data; does not account for relationships between variables. | "SimpleImputer" (scikit-learn) | Regression |
| K-Nearest Neighbors (KNN) | Handles both numerical and categorical data, accounts for relationships between variables. | Accounts for relationships between variables, effective for both numerical and categorical data. | Can be computationally expensive for large datasets, sensitive to the choice of k. | "KNNImputer" (scikit-learn) | Regression/Classification |
| Last Observation Carried Forward (LOCF) | Primarily for time series data, replaces missing values with the last observed value. | Simple, can be effective for time series data with a strong trend. | Can introduce bias if the data is not trending, can propagate errors if there are consecutive missing values. | "fillna(method='ffill')" (pandas) | Time Series |
| Multiple Imputation (MI) | Handles complex missing data patterns, accounts for uncertainty in imputation. | Accounts for uncertainty in imputation, can provide more accurate estimates than single imputation methods. | Can be computationally expensive, requires specialized software. | "IterativeImputer" (scikit-learn), "fancyimpute" | Regression/Classification |
| Hot-Deck Imputation | Primarily for categorical data, replaces missing values with values from a similar donor record. | Simple, can be effective for handling missing values in categorical variables. | Can introduce bias if the donor records are not truly similar to the record with the missing value. | "KNNImputer" (scikit-learn) can be adapted | Classification |
4.2. Advanced Imputation Techniques
Regression imputation uses the relationship between variables to predict missing values based on observed data [9]. This method can account for relationships between variables but may overestimate the strength of these relationships.
4.2.1. K-Nearest Neighbors (KNN) Imputation
This method identifies the k-nearest neighbors to the missing value based on the similarity of other features [40]. The missing value is then replaced with the average (for numerical data) or the most frequent value (for categorical data) among those neighbors.
The main steps of KNN Imputation are: 1) Calculate the distance between the data point with the missing value and all other points in the dataset. 2) Identify the k-nearest neighbors based on these distances. 3) For numerical data, compute the average of the corresponding values in the k-nearest neighbors and use it as the imputed value. For categorical data, select the most frequent value among the neighbors.
Missing values were imputed using the average (numerical features) or mode (categorical features) of the $k$ nearest neighbors in feature space, using Euclidean distance. For example, with $k = 4$, for a sample $x_i$ with four nearest neighbors $x^{(1)}, \ldots, x^{(4)}$, the imputed value $\hat{x}_{ij}$ for feature $j$ is calculated as:
$$\hat{x}_{ij} = \frac{1}{4} \sum_{r=1}^{4} x^{(r)}_{j} \ \text{(numerical)}, \qquad \hat{x}_{ij} = \arg\max_{c} \sum_{r=1}^{4} \mathbb{I}\left(x^{(r)}_{j} = c\right) \ \text{(categorical)},$$
where $\mathbb{I}(\cdot)$ is an indicator function. The Euclidean distance between points $x_a$ and $x_b$ is defined as:
$$d\left(x_a, x_b\right) = \sqrt{\sum_{j=1}^{p} \left(x_{aj} - x_{bj}\right)^{2}},$$
where $p$ is the number of features.
KNN imputation identifies the $k$ most similar cases to those with missing data and uses their values for imputation [41] [42]. The method leverages similarities between observations and can capture complex relationships in the data, but it may be computationally expensive for large datasets. Its main advantages and limitations are outlined below.
The advantages of KNN Imputation are: 1) Flexibility: KNN can be applied to both numerical and categorical data, making it versatile. 2) Local Information: By considering the closest observations, KNN can capture local data patterns, potentially leading to more accurate imputations. 3) Non-parametric: KNN does not assume a specific data distribution, which can be advantageous in real-world datasets.
The limitations of KNN Imputation are: 1) Computationally Intensive: KNN can be slow, especially with large datasets, since it requires distance calculations for each observation. 2) Curse of Dimensionality: As the number of features increases, the concept of “closeness” can become less meaningful, making it harder to identify true neighbors. 3) Sensitive to Outliers: The presence of outliers can skew distance calculations, leading to poor imputation results.
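In practice, scikit-learn's KNNImputer implements this scheme for numerical data with a NaN-aware Euclidean distance; a minimal sketch on a toy array is shown below (k = 2 is an arbitrary choice).

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest neighbours, measured with a NaN-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```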
4.2.2. Decision Trees and Random Forests
These methods use tree-based models to predict missing values based on other variables [16]. They can handle both categorical and continuous variables and capture non-linear relationships. Decision Trees and Random Forests are powerful machine learning techniques that can also be used to impute missing values in datasets. Here is a breakdown of how these methods work for imputation, along with their advantages and disadvantages.
The advantages of Decision Tree Imputation are: 1) Captures Non-linear Relationships: Decision trees can model complex relationships between features, potentially leading to more accurate imputations. 2) Interpretable: The model is relatively easy to interpret, as you can visualize how decisions are made. The disadvantages of Decision Tree Imputation are: 1) Overfitting: Decision trees can easily overfit the training data, especially if not properly pruned. 2) Sensitivity to Noise: Outliers can affect the structure of the tree, impacting the imputation results.
The advantages of Random Forest Imputation are: 1) Improved Accuracy: Random forests generally provide better accuracy than single decision trees due to their ensemble nature, reducing overfitting and variance. 2) Robust to Outliers: The averaging mechanism makes random forests less sensitive to outliers compared to individual trees. 3) Handles Large Datasets: Random forests can effectively manage large datasets with high dimensionality. The disadvantages of Random Forest Imputation are: 1) Complexity: The model is less interpretable than a single decision tree, as it is harder to visualize how predictions are made. 2) Computationally Intensive: Training multiple trees can be resource-intensive, especially for large datasets.
4.2.3. Support Vector Machines (SVM)
Support Vector Machines (SVM) are primarily known for classification and regression tasks. However, they can also be utilized to impute missing data. SVM-based imputation methods use support vector regression to predict missing values [43] [44]. These methods can be effective for high-dimensional data but may require careful tuning of hyperparameters.
The advantages of SVM-Based Imputation are: 1) Effective for Non-linear Relationships: SVM can capture complex, non-linear relationships in the data by using different kernel functions (e.g., polynomial, radial basis function). 2) Robustness to Overfitting: SVM includes regularization parameters that help prevent overfitting, making it suitable for high-dimensional datasets. 3) Flexibility: SVM can be applied to both classification (categorical variables) and regression (continuous variables) tasks.
The limitations of SVM-Based Imputation are: 1) Computational Complexity: SVM can be computationally intensive, particularly for large datasets, due to the optimization required for finding the best hyperplane. 2) Parameter Sensitivity: The performance of SVM can be sensitive to the choice of kernel and hyperparameters (e.g., C and gamma), requiring careful tuning. 3) Requires Sufficient Data: SVM models generally require a substantial amount of complete data to build an accurate model, which may not always be available.
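A simple SVM-based imputation sketch is shown below: a support vector regressor is trained on complete cases to predict one feature with missing entries from the remaining features. The synthetic data, the single-feature setting, and the RBF-kernel hyperparameters are illustrative assumptions; in practice the procedure is repeated for each incomplete feature and the hyperparameters are tuned.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = 0.5 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 300)   # feature to be imputed
X[rng.random(300) < 0.2, 3] = np.nan                          # ~20% missing in column 3

observed = ~np.isnan(X[:, 3])
svr = SVR(kernel="rbf", C=1.0, gamma="scale")
svr.fit(X[observed, :3], X[observed, 3])          # train on complete cases

X_imputed = X.copy()
X_imputed[~observed, 3] = svr.predict(X[~observed, :3])   # fill the missing entries
print(np.isnan(X_imputed).sum())
```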
4.3. Deep Learning Approaches
Autoencoders are neural networks that can learn compressed representations of data and have been applied to missing data imputation [45]-[47]. They can capture complex patterns in the data but may require large amounts of training data.
Generative Adversarial Networks (GANs)
GANs have been adapted for missing data imputation by learning to generate realistic imputed values [48] [49]. This approach can produce high-quality imputations but may be challenging to train and tune. The Generative Adversarial Imputation Network (GAIN) [50] employs a generative adversarial network (GAN) architecture. Unlike standard GANs, the discriminator in GAIN does not classify the entire generated output as real or fake; instead, it classifies each individual variable as either imputed or observed. Convergence is achieved when the generator produces imputations indistinguishable from the true data distribution. A “hint” mechanism augments the discriminator’s input with partial information about the missing values (M), represented by the hint vector H. This hint is typically a proportion of M (e.g., 90% identical). The generator then learns to impute the remaining values. The original work demonstrates that insufficient hints lead to multiple optimal generator outputs.
Formally, given a random vector $Z$, the generator produces an imputed dataset $\bar{X}$, from which the completed dataset $\hat{X}$ is derived as described in Section 3. The discriminator loss function, $\mathcal{L}_D$, is defined as:
$$\mathcal{L}_D\left(m, \hat{m}, b\right) = -\sum_{j:\, b_j = 0} \left[ m_j \log \hat{m}_j + \left(1 - m_j\right) \log\left(1 - \hat{m}_j\right) \right],$$
where $m$ is the true mask vector, $\hat{m} = D(\hat{x}, h)$ is the generated (predicted) mask vector, and $h$ is the hint vector. The summation is restricted to indices where $b_j = 0$, i.e., components not revealed by the hint, to prevent overfitting to the hint. The discriminator is trained to minimize this loss:
$$\min_{D} \; \mathcal{L}_D\left(m, \hat{m}, b\right).$$
The generator loss function comprises two terms: $\mathcal{L}_G$, which measures the generator's ability to deceive the discriminator, and $\mathcal{L}_M$, which quantifies the accuracy of the imputation for observed values:
$$\mathcal{L}_G\left(m, \hat{m}, b\right) = -\sum_{j:\, b_j = 0} \left(1 - m_j\right) \log \hat{m}_j, \qquad \mathcal{L}_M\left(x, \hat{x}\right) = \sum_{j=1}^{d} m_j \, L_M\left(x_j, \hat{x}_j\right),$$
where $d$ is the data dimensionality and $L_M$ is defined as:
$$L_M\left(x_j, \hat{x}_j\right) =
\begin{cases}
\left(\hat{x}_j - x_j\right)^2, & \text{if } x_j \text{ is continuous}, \\
- x_j \log \hat{x}_j, & \text{if } x_j \text{ is binary}.
\end{cases}$$
The generator is trained to minimize the combined loss:
$$\min_{G} \; \mathcal{L}_G + \alpha \mathcal{L}_M,$$
where $\alpha$ is a scaling parameter.
4.4. Time Series-Specific Methods
ARIMA Models
Autoregressive Integrated Moving Average (ARIMA) models are a class of statistical models used for analyzing and forecasting time series data. ARIMA models can be employed to impute missing values in time series data by leveraging temporal dependencies [51] [52]. They are particularly useful when the data exhibits trends and seasonality. An ARIMA model is denoted as ARIMA (p, d, q), where “p” represents the order of the autoregressive (AR) component, “d” represents the degree of difference required to make the time series stationary, and “q” represents the order of the moving average (MA) component. The AR component models the relationship between the current observation and previous observations; the MA component models the relationship between the current observation and past forecast errors, and differencing (d) removes trends and makes the series stationary. A general ARIMA (p, d, q) model can be represented by the following equation:
$$\phi(B)\left(1 - B\right)^{d} y_t = \theta(B)\,\varepsilon_t,$$
where $y_t$ is the time series at time $t$, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi(B)$ is the autoregressive polynomial of order $p$, $\theta(B)$ is the moving average polynomial of order $q$, and $\varepsilon_t$ is white noise. The choice of $p$, $d$, and $q$ values is crucial for model fitting and depends on the characteristics of the specific time series being analyzed. Techniques like the autocorrelation and partial autocorrelation functions (ACF and PACF) are often used to identify suitable model orders. A short imputation sketch is given below, and Table 3 summarizes the advanced imputation techniques discussed in this section.
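As a sketch of ARIMA-based imputation, statsmodels' state-space ARIMA (SARIMAX) handles NaN observations through the Kalman filter, and its in-sample predictions can be used to fill the gaps. The series, the gap positions, and the order (1, 1, 1) below are illustrative assumptions; in practice the order would be chosen with ACF/PACF diagnostics.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
y = pd.Series(np.cumsum(rng.normal(0.2, 1.0, 120)), index=idx)   # trending series
y_missing = y.copy()
y_missing.iloc[[20, 21, 22, 60, 95]] = np.nan                    # gaps to impute

# State-space ARIMA handles NaNs internally via the Kalman filter;
# the order (1, 1, 1) is illustrative.
model = sm.tsa.statespace.SARIMAX(y_missing, order=(1, 1, 1))
res = model.fit(disp=False)

# Fill the gaps with the model's in-sample predictions.
predictions = res.predict(start=y_missing.index[0], end=y_missing.index[-1])
y_imputed = y_missing.fillna(predictions)
print(y_imputed.isna().sum())
```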
Table 3. Summary of advanced imputation techniques.
| Method | Advantages | Disadvantages | Python package | Problem type |
| --- | --- | --- | --- | --- |
| Multiple Imputation by Chained Equations (MICE) | Handles complex relationships between variables, accounts for uncertainty in imputation. | Can be computationally expensive, requires careful model selection. | "IterativeImputer" (scikit-learn), "fancyimpute" | Regression/Classification |
| Random Forest Imputation | Handles mixed-type data (numerical and categorical), robust to outliers. | Can be computationally expensive, may overfit if the data is highly correlated. | "MissForest" | Regression/Classification |
| Generative Adversarial Networks (GANs) | Can generate realistic synthetic data, handles complex data distributions. | Requires significant computational resources, can be challenging to train. | "Tensorflow", "Pytorch" | Regression/Classification |
| Deep Learning Imputation | Can capture complex non-linear relationships in the data, handles high-dimensional datasets. | Requires large amounts of data, can be computationally expensive, may overfit. | "Tensorflow", "Pytorch" | Regression/Classification |
| Bayesian Imputation | Accounts for prior knowledge and uncertainty, provides probabilistic estimates. | Can be computationally intensive, requires careful model specification. | "PyMC3", "PyStan" | Regression/Classification |
5. Evaluation Metrics for Imputation Methods
Assessing the performance of imputation methods is crucial both for selecting appropriate techniques and for ensuring that the imputed data maintain the integrity and reliability of the original data [53]. Several metrics are commonly used to assess the effectiveness of imputation techniques, and the choice of evaluation metric depends on the specific objective of the imputation and the nature of the data. These metrics can be broadly categorized into two groups, as shown in Table 4.
Table 4. Regression and classification metrics.
| Regression metrics | Formula | Classification metrics | Formula |
| --- | --- | --- | --- |
| Mean Squared Error (MSE) | $\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$ | Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| Root Mean Squared Error (RMSE) | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$ | Precision | $\frac{TP}{TP + FP}$ |
| Mean Absolute Error (MAE) | $\frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$ | Recall (Sensitivity) | $\frac{TP}{TP + FN}$ |
| Mean Absolute Percentage Error (MAPE) | $\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$ | F1-Score | $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ |
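A common way to apply these metrics to imputation is to artificially mask entries whose true values are known, impute them, and score the imputed values against the ground truth. The sketch below compares mean and KNN imputation by RMSE and MAE on synthetic data; the masking rate and the choice of imputers are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
X_true = rng.normal(size=(500, 5))

# Artificially mask 15% of known entries so imputations can be scored.
mask = rng.random(X_true.shape) < 0.15
X_missing = X_true.copy()
X_missing[mask] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X_missing)
    rmse = np.sqrt(mean_squared_error(X_true[mask], X_imp[mask]))
    mae = mean_absolute_error(X_true[mask], X_imp[mask])
    print(f"{name}: RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```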
6. Challenges and Considerations
While imputation methods offer valuable tools for handling missing data, several challenges and considerations must be addressed when implementing missing data imputation to ensure effective and reliable results [54].
Bias Introduction: Imputation methods can introduce bias into the data, particularly if the missing values are not missing at random (i.e., the data are MNAR), meaning the missingness is related to the value of the missing variable itself or to other variables in the dataset. Example: If missing income values are more likely to occur for individuals with lower incomes, simply replacing them with the mean of the observed incomes will overestimate the true average income.
Data Distribution: Imputation methods often assume that the data follows a specific distribution (e.g., normal distribution). If the data deviates significantly from this assumption, the imputed values may not be representative. Example: Using mean imputation on a skewed distribution will result in imputed values that are biased towards the tail of the distribution.
Missing Value Patterns: The pattern of missing values can significantly impact the effectiveness of imputation methods. If missing values are clustered or follow a specific pattern, simple methods like mean imputation may not be appropriate. Example: If consecutive values are missing in a time series, LOCF or NOCB may introduce significant bias.
Computational Complexity: Some imputation methods, like multiple imputation or KNN imputation, can be computationally expensive, especially for large datasets.
Beyond these challenges, several considerations guide the choice of method: 1) Domain Knowledge: Incorporating domain knowledge into the imputation process can significantly improve the accuracy and relevance of the imputed values. Example: In medical data, understanding the relationships between different variables and the potential causes of missing values can guide the choice of imputation method. 2) Model Selection: Choosing the appropriate imputation model is crucial; the choice should be based on the characteristics of the data, the nature of the missing values, and the goals of the analysis. 3) Interpretability: The interpretability of the imputed values is important for understanding the results of the analysis.
7. Case Studies and Applications
This section illustrates the practical implementation of missing data imputation techniques in a range of fields; representative applications are summarized below.
Healthcare: Examining approaches for resolving incomplete entries in electronic health records, with a focus on maintaining data integrity and improving diagnostic accuracy [55]-[57].
Finance: Analysis of strategies for managing incomplete datasets in financial forecasting, specifically focusing on stock market predictions and their implications for investment decision-making [58] [59].
Social Sciences: Investigation of techniques to mitigate the impact of non-response in survey data, exploring methods to preserve statistical validity and minimize bias in population-level inferences [60] [61].
8. Future Directions and Open Problems
Several areas for future research and development in missing data imputation include the following [62] [63]:
1) Emerging techniques and research areas: Federated learning for privacy-preserving imputation [64] [65]. Reinforcement learning for adaptive imputation strategies [66]. Transfer learning for imputation in low-resource settings [67] [68].
2) Integration with big data and real-time systems: Developing scalable and efficient imputation methods for streaming data and large-scale datasets [69]-[71] as follows: Distributed Algorithms: Use scalable imputation algorithms that can handle large datasets efficiently. Techniques like mini-batch processing or parallel computing can be useful. Big Data Frameworks: Leverage tools like Apache Spark or Hadoop, which can process large volumes of data quickly and support machine learning libraries for imputation.
3) Ethical considerations in data imputation: Addressing potential biases and fairness issues in imputation methods, especially in sensitive applications like healthcare and criminal justice [72]-[74]. Key ethical considerations include: a) Transparency: Researchers should clearly communicate how missing data will be handled, including the imputation methods used. This transparency builds trust and allows for reproducibility. b) Bias and Misrepresentation: Imputation can introduce bias if not done carefully. Researchers must consider whether the imputed data accurately reflects the underlying population or if it skews results. c) Informed Consent: Participants should be informed about how their data, including any imputed values, will be used in research. This includes potential implications for privacy and the integrity of their responses. d) Appropriateness of Methods: Different imputation methods (mean, median, predictive modeling, etc.) have different assumptions. Choosing the right method is crucial to avoid distorting the data and the conclusions drawn from it. e) Impact on Decision-Making: The results derived from imputed data can influence policy or clinical decisions. Researchers should ensure that their imputation practices do not lead to harmful outcomes. f) Equity: Consider whether the imputation methods used could disproportionately affect certain groups. Ensuring that imputation methods do not reinforce existing inequalities is vital. g) Ethical Oversight: It is beneficial to have an ethical review process in place to assess the imputation strategies and their potential implications for participants and broader societal contexts. h) Data Integrity: Strive to maintain the integrity of the original dataset. Imputation should not compromise the authenticity of the data, and researchers should be mindful of the limitations that come with imputed values. i) Training and Expertise: Ensure that those involved in the imputation process have the necessary training and understanding of the ethical implications of their work.
9. Conclusion
As data collection and analysis continue to grow in importance across various domains, the field of missing data imputation is likely to see further advancements and innovations. This review has provided a comprehensive overview of missing data imputation techniques, from traditional statistical methods to advanced machine learning approaches. Key observations include: 1) the critical role of understanding missing data mechanisms; 2) the trade-offs between simple and complex imputation methods; 3) the need for careful evaluation and selection of imputation techniques; and 4) the potential of machine learning and deep learning approaches for handling complex missing data patterns. Future research should focus on developing more robust, efficient, and adaptable imputation methods that can handle the increasing complexity and scale of modern datasets while addressing ethical concerns and preserving data integrity.