Imputation Techniques for Missing Data: Overview

Imputation handles missing data by estimating substitute values for the missing entries, so that most of the dataset can be retained. The aim is a complete dataset that preserves the relationships between variables and keeps the sample size intact. Simply deleting the records that contain missing values is usually not advisable, since it can discard a large share of the data and introduce bias when the values are not missing completely at random.

The appropriate technique depends on the type of data. For example, the mean or median is commonly used for continuous numerical data, the mode for categorical data, and linear interpolation for time series data.

Imputation techniques can be broadly classified into statistical methods and model-based methods.

Statistical Imputation Methods

Statistical imputation methods rely on statistical characteristics of the observed data, such as central tendency, frequency, or the trend between neighbouring points. They are fast and often effective for simple cases. Patterns found in the observed data are used to calculate plausible values for the missing entries. These methods usually rest on assumptions about the data (for example, that values are missing at random or that a trend is linear), so the imputed values carry some uncertainty.

Some of these methods include:

• Mean/Median imputation
• Linear interpolation
• Mode imputation
• Hot deck and cold deck imputation
• Stochastic regression imputation

Mean/Median Imputation:

With this method, the null values in numerical or continuous data are replaced with the mean (average) or the median (the middle value) of the observed values. The mean is not recommended when the data contains outliers, because it is pulled towards them; the median is the better choice in that case. This method preserves the central tendency of the variable, but because every missing entry receives the same value, it reduces the variance and can weaken correlations with other variables.
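As a rough illustration, the sketch below fills one missing value with the mean and with the median using pandas. The column name and the numbers are made up for this example; the large last value plays the role of an outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 250000 acts as an outlier.
df = pd.DataFrame({"income": [32000.0, 35000.0, np.nan, 38000.0, 250000.0]})

# Both statistics skip NaN by default.
mean_value = df["income"].mean()
median_value = df["income"].median()

# Fill the same missing entry in two ways to compare the effect.
df["income_mean"] = df["income"].fillna(mean_value)
df["income_median"] = df["income"].fillna(median_value)

print(f"mean = {mean_value}, median = {median_value}")
print(df)
```

In this toy data the mean is pulled far above the typical values by the single outlier, while the median stays close to them.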

Mode imputation:

For categorical data, the missing values are filled in with the mode (the most frequently occurring value). The reasoning is that, without further information, the most common value is the most probable one for a missing entry. This method keeps the most common category unchanged, but it inflates that category's share, so it can distort the distribution of the variable when many values are missing.
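A minimal sketch of mode imputation with pandas is shown below. The column and its values are invented for illustration. Printing the category shares before and after the fill makes it easy to see how the imputation affects the distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame(
    {"colour": ["red", "blue", "red", np.nan, "green", "red", np.nan]}
)

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = df["colour"].mode().iloc[0]
df["colour_filled"] = df["colour"].fillna(most_frequent)

# Category shares among observed values, then after imputation.
print(df["colour"].value_counts(normalize=True))
print(df["colour_filled"].value_counts(normalize=True))
```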

Linear interpolation:

Linear interpolation estimates each null value by assuming a linear relationship between the adjacent observed values, so the filled values lie on a straight line between their neighbours. It is used when the data is ordered and has a natural progression, as in time series data, and may not be suitable for datasets with non-linear patterns or relations.
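The idea can be sketched with the pandas interpolate method on a small, made-up daily series. With method="linear", pandas treats the observations as equally spaced and places each missing value on the straight line between its observed neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps.
index = pd.date_range("2024-01-01", periods=6, freq="D")
temperature = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0], index=index)

# Each gap is filled along the line joining the surrounding observed values.
filled = temperature.interpolate(method="linear")
print(filled)
```

Here the single gap between 10 and 14 becomes 12, and the two gaps between 14 and 20 become 16 and 18.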

Hot deck and Cold deck imputation:

Hot deck imputation fills each missing value with a value taken from a randomly selected case in a 'donor pool', a set of cases from the same dataset that are similar to the case with the null value but have complete data. Here, 'hot deck' refers to the pool of currently available cases that serves as the source for imputing. There are many variations of this method, such as stratified, nearest neighbour, and time series hot deck.

The cold deck method fills the missing values from a complete external dataset, such as a previous survey. Because both methods use actually observed values, they tend to maintain the characteristics of the observed data.
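The following is a minimal sketch of one hot deck variant, a random hot deck within strata, written with pandas and NumPy. The data, the column names, and the helper function hot_deck_impute are hypothetical and only serve the illustration: the donor pool for a missing salary is assumed to be the observed salaries in the same region.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical data: "region" defines the strata (donor pools).
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "salary": [50.0, np.nan, 54.0, 70.0, 72.0, np.nan],
})

def hot_deck_impute(group: pd.Series) -> pd.Series:
    """Fill NaNs in one stratum with values drawn at random from its donors."""
    donors = group.dropna().to_numpy()
    missing = group.isna()
    if donors.size == 0 or not missing.any():
        return group  # nothing to fill, or no donors available
    filled = group.copy()
    filled[missing] = rng.choice(donors, size=int(missing.sum()), replace=True)
    return filled

# Apply the imputation separately within each region.
df["salary_imputed"] = df.groupby("region")["salary"].transform(hot_deck_impute)
print(df)
```

Because the donor is picked at random, the imputed values can differ between runs unless the random seed is fixed, as it is here.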

Model-based Imputation Methods

Model-based imputation techniques are useful when there are complex patterns and relationships among the variables that the traditional imputation methods cannot capture. They can give good results when the relations between the variables are non-linear and intricate, when the dataset is large with many variables and observations, or when there are hierarchical dependencies.

These methods generally require larger sample sizes and more computational resources. Some of them are listed below.

• KNN imputation
• Random forest imputation
• GAN based imputation
• Bayesian networks

KNN imputation:

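This technique uses the KNN (k-nearest neighbours) algorithm to fill each null value with the mean (for numerical data) or the mode (for categorical data) of that feature among the 'k' nearest neighbours of the incomplete instance. With a suitable distance measure it can be applied to discrete, ordinal, and categorical data as well as continuous values. The neighbours are chosen by measuring the distance from the incomplete instance to the other instances, using the features that are observed.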

Random Forest imputation:

In this method, a random forest model is trained on the observed part of the dataset, with the variable that has missing values as the target and the other variables as inputs; the trained model then predicts the missing values. A random forest consists of multiple decision trees, and the missing value is estimated by averaging the estimates of these trees (or by majority vote for a categorical target). Unlike KNN, which is sensitive to outliers, this method can handle non-linearity and outliers in the data. It can be used with both numerical and categorical data.
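One possible way to sketch this for a single numerical column uses RandomForestRegressor from scikit-learn. The synthetic data, the column names, and the choice of 100 trees are assumptions made for this example, not a prescribed setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data with a non-linear relation between the inputs and "target".
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
target = np.sin(x1) + 0.5 * x2 + rng.normal(0, 0.1, n)
df = pd.DataFrame({"x1": x1, "x2": x2, "target": target})

# Knock out 30 target values to simulate missing data.
missing_rows = rng.choice(n, size=30, replace=False)
df.loc[missing_rows, "target"] = np.nan

# Train on rows where the target is observed.
observed = df[df["target"].notna()]
to_fill = df[df["target"].isna()]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(observed[["x1", "x2"]], observed["target"])

# Predict the missing targets and write them back.
predictions = model.predict(to_fill[["x1", "x2"]])
df_imputed = df.copy()
df_imputed.loc[to_fill.index, "target"] = predictions

# The forest prediction is the average of the individual tree predictions.
X_missing = to_fill[["x1", "x2"]].to_numpy()
per_tree = np.stack([tree.predict(X_missing) for tree in model.estimators_])
print(np.allclose(per_tree.mean(axis=0), predictions))
```

When several columns have missing values, the same step would have to be repeated for each of them, which is the idea behind iterative approaches.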

GAN based imputation:

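GAN stands for Generative Adversarial Network. A GAN is a generative model made up of two neural networks, a generator and a discriminator, that are trained against each other so that the generator learns to produce synthetic data resembling the training data.

GAIN (Generative Adversarial Imputation Nets) is a GAN-based imputation method. The generator takes the observed data, together with a mask marking the missing entries and random noise, and fills in the null values to produce a complete vector. This vector is then passed to the discriminator, which tries to tell which components of the vector were actually observed and which were generated.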

Bayesian networks imputation:

A Bayesian network is a type of probabilistic graphical model: it represents the variables as nodes in a directed graph whose edges encode the probabilistic dependencies among them. For imputing missing values, the network structure and its probabilities are learned from the data, and the missing values are then estimated from the learned dependencies, given the values that are observed. Complex dependencies and uncertainties can also be addressed using these networks.

Imputation Methods using sklearn:

Simple Imputer:

SimpleImputer fills the missing values in each column either with a specified constant or with a statistic of that column: the mean, the median, or the most frequent value (mode). It also allows for different missing value encodings and supports sparse matrices.
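A short sketch of SimpleImputer with three of its strategies follows. The arrays are made up, and using -1 as the missing value code in the second case is only an assumption for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Mean imputation, with NaN marking the missing entries.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 30.0]])
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)
print(mean_imputer.statistics_)  # per-column means used for filling

# Constant imputation with a different missing value encoding (-1 here).
X_coded = np.array([[1, -1], [2, 4], [-1, 6]])
constant_imputer = SimpleImputer(missing_values=-1, strategy="constant", fill_value=0)
X_const = constant_imputer.fit_transform(X_coded)
print(X_const)

# Most frequent value (mode) for categorical columns.
df_cat = pd.DataFrame(
    [["a", "x"], [np.nan, "y"], ["a", np.nan], ["b", "y"]], dtype="category"
)
mode_imputer = SimpleImputer(strategy="most_frequent")
X_cat = mode_imputer.fit_transform(df_cat)
print(X_cat)
```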

Iterative Imputer:

IterativeImputer predicts the missing values of each feature from the values of the other features, based on the relations between them. It works iteratively, in a round-robin fashion: at each step, one column with missing values is taken as the output (y) and the remaining columns as the input (X). A regressor is trained on the rows where y is observed and is then used to predict the missing values of y from X. This is done for each column with missing values in turn, and the whole round is repeated until the specified maximum number of iterations is reached or the imputed values stop changing by more than a tolerance. The values from the final round are returned for the missing entries.
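The procedure can be tried out with a small sketch like the one below. Note that, at the time of writing, IterativeImputer is marked experimental in scikit-learn and has to be enabled with an explicit import first. The toy matrix, in which each column is roughly a multiple of the first, is an assumption for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: column 1 is about 2x column 0, column 2 about 3x column 0.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, np.nan, 9.0],
    [4.0, 8.0, np.nan],
    [5.0, 10.0, 15.0],
])

# Each column with missing values is regressed on the others, in turn,
# for at most max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The regressor can be swapped through the estimator parameter; the default is a Bayesian ridge regression.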

Missing Indicator:

MissingIndicator transforms the dataset into a binary (boolean) matrix that marks where values are missing in the original dataset, with True indicating a missing entry. By default the matrix contains only the columns that had missing values when the indicator was fitted; it has the same shape as the input dataset when all features are requested. The indicator does not fill in any values itself. It is typically used alongside a simple imputer or an iterative imputer, so that the information about which values were previously missing is preserved after imputation. Here missingness is seen as a feature in itself.
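The sketch below shows the indicator on a small made-up matrix, first on its own and then combined with an imputer. Passing features="all" is what gives a mask with the same shape as the input; the add_indicator option of SimpleImputer is one way to append the indicator columns to the imputed data.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([
    [1.0, np.nan, 7.0],
    [2.0, 5.0, 8.0],
    [np.nan, 6.0, 9.0],
])

# Boolean mask for every feature: True marks a missing entry.
indicator = MissingIndicator(features="all")
mask = indicator.fit_transform(X)
print(mask)

# Default behaviour: only columns that contained missing values are kept.
indicator_default = MissingIndicator()
mask_default = indicator_default.fit_transform(X)
print(indicator_default.features_, mask_default.shape)

# Impute and keep the missingness information as extra columns.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_augmented = imputer.fit_transform(X)
print(X_augmented.shape)  # 3 imputed columns + 2 indicator columns
```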

KNN Imputer:

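KNNImputer uses the k-nearest neighbours approach for imputing the missing values. It is non-parametric and operates on numerical data, so categorical features have to be encoded as numbers before it is applied. The imputer computes distances between samples with a Euclidean metric that skips missing coordinates. Then, for each sample with missing values, the k nearest neighbours that have a value for the feature are found and their values are averaged. The averaging can be done in two ways: uniform, where every neighbour counts equally, or weighted by distance, where closer neighbours count more. It can be useful when the missing values are spread across many different features.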
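A brief sketch of KNNImputer on a small numerical matrix is given below, once with uniform averaging and once with distance weighting. The data and the choice of two neighbours are assumptions for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Uniform averaging: the two nearest neighbours count equally.
uniform_imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_uniform = uniform_imputer.fit_transform(X)
print(X_uniform)

# Distance weighting: closer neighbours contribute more to the average.
weighted_imputer = KNNImputer(n_neighbors=2, weights="distance")
X_weighted = weighted_imputer.fit_transform(X)
print(X_weighted)
```

With uniform weights, the missing value in the first row becomes 4.0 (the average of 3.0 and 5.0 from its two nearest rows), and the missing value in the third row becomes 5.5 (the average of 3.0 and 8.0).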

In this article, we have seen various imputation methods for dealing with missing data, both numerical and categorical.
