Evolutionary k-Nearest Neighbor Imputation Algorithm for Gene Expression Data

Hiroshi Madushani de Silva, Amal Shehan Perera


DNA microarray technology produces large gene expression data sets, and these data sets commonly contain missing expression values. In this paper, we present a genetic-algorithm-optimized k-nearest neighbor algorithm (Evolutionary kNNImputation) for missing data imputation. In contrast to common imputation methods, this paper examines the effectiveness of supervised learning algorithms for missing data imputation. Missing data imputation approaches fall into four main categories; our focus is on the local approach, to which the proposed Evolutionary k-Nearest Neighbor Imputation Algorithm belongs. The algorithm extends the common k-Nearest Neighbor Imputation Algorithm by using a genetic algorithm to optimize some of its parameters. The optimization problem consists of selecting the similarity metric and the parameter value k. We compared the proposed Evolutionary k-Nearest Neighbor Imputation algorithm with the k-Nearest Neighbor Imputation algorithm and the mean imputation method. The three algorithms were tested on gene expression datasets: certain percentages of values were randomly deleted from the datasets, and the missing values were then recovered using the algorithms under comparison. Results show that Evolutionary kNNImputation outperforms kNNImputation and mean imputation, and they demonstrate the importance of using a supervised learning algorithm for missing data estimation. Although mean imputation showed a low mean error at very few missing rates, the supervised learning algorithms became effective at certain missing rates in the datasets.
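The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The candidate metrics (Euclidean and Manhattan), the unweighted neighbor average, the column-mean baseline, the RMSE fitness on artificially hidden entries, and the truncation-selection genetic algorithm are all assumptions made for illustration, as are the function names.

```python
import math
import random

# Candidate similarity metrics (assumed; the abstract does not list them).
METRICS = {
    "euclidean": lambda d: math.sqrt(sum(x * x for x in d) / len(d)),
    "manhattan": lambda d: sum(abs(x) for x in d) / len(d),
}

def distance(a, b, metric):
    # Compare only the columns observed in both rows.
    diffs = [x - y for x, y in zip(a, b) if x is not None and y is not None]
    return METRICS[metric](diffs) if diffs else math.inf

def column_mean(data, j):
    vals = [row[j] for row in data if row[j] is not None]
    return sum(vals) / len(vals) if vals else 0.0

def mean_impute(data):
    # Baseline: replace each missing value (None) with its column mean.
    return [[column_mean(data, j) if v is None else v for j, v in enumerate(row)]
            for row in data]

def knn_impute(data, k, metric):
    # Fill each missing cell with the average of that column over the
    # k nearest rows in which the column is observed.
    out = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, v in enumerate(row):
            if v is not None:
                continue
            cands = sorted((distance(row, other, metric), other[j])
                           for m, other in enumerate(data)
                           if m != i and other[j] is not None)
            near = [val for d, val in cands[:k] if d < math.inf]
            out[i][j] = sum(near) / len(near) if near else column_mean(data, j)
    return out

def rmse(truth, imputed, cells):
    return math.sqrt(sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in cells)
                     / len(cells))

def mask_cells(data, rate, rng):
    # Randomly hide a fraction of the observed entries.
    observed = [(i, j) for i, row in enumerate(data)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(1, int(rate * len(observed))))
    masked = [row[:] for row in data]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden

def evolve(data, rng, pop_size=8, generations=10, max_k=10, rate=0.1):
    # Chromosome = (k, metric). Fitness = RMSE on hidden known entries.
    masked, hidden = mask_cells(data, rate, rng)
    cache = {}

    def fitness(ch):
        if ch not in cache:
            cache[ch] = rmse(data, knn_impute(masked, ch[0], ch[1]), hidden)
        return cache[ch]

    names = list(METRICS)
    pop = [(rng.randint(1, max_k), rng.choice(names)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            k, metric = a[0], b[1]  # crossover: k from one parent, metric from the other
            if rng.random() < 0.3:  # mutation of k by one step
                k = min(max_k, max(1, k + rng.choice([-1, 1])))
            if rng.random() < 0.1:  # mutation of the metric
                metric = rng.choice(names)
            children.append((k, metric))
        pop = parents + children
    return min(pop, key=fitness)

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for an expression matrix (rows = genes).
    data = [[g + c + rng.gauss(0, 0.1) for c in range(6)] for g in range(20)]
    data[3][2] = None
    k, metric = evolve(data, rng)
    print("selected k and metric:", k, metric)
    print("imputed value:", knn_impute(data, k, metric)[3][2])
```

Here `evolve` searches over `(k, metric)` pairs and returns the pair with the lowest error on the hidden entries. That pair can then be passed to `knn_impute` on the full data set, with `mean_impute` serving as the baseline for comparison.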


Missing data imputation; kNNImputation; EvlkNNImputation; Genetic algorithm optimization; Supervised learning algorithm; Big data; Similarity metric; Gene expression data; Evolutionary algorithms

Printing Sponsor
University of Colombo
School of Computing

Creative Commons License
This journal is published under a Creative Commons Attribution 4.0 International License.