Evolutionary k-Nearest Neighbor Imputation Algorithm for Gene Expression Data

Hiroshi Madushani de Silva; Amal Shehan Perera

Published Jun 12, 2017

Hiroshi Madushani de Silva

University of Moratuwa

Amal Shehan Perera

University of Moratuwa

Abstract

DNA microarray technology produces large gene expression data sets, and these data sets commonly contain missing expression values. In this paper, we present a genetic-algorithm-optimized k-Nearest Neighbor algorithm (Evolutionary kNNImputation) for missing data imputation. Beyond the common imputation methods, this paper examines the effectiveness of supervised learning algorithms for missing data imputation. Missing data imputation approaches fall into four main categories; our focus is on the local approach, to which the proposed Evolutionary k-Nearest Neighbor Imputation Algorithm belongs. The algorithm extends the common k-Nearest Neighbor Imputation Algorithm by using a genetic algorithm to optimize some of its parameters: the selection of the similarity matrix and of the parameter value k is treated as the optimization problem. We compared the proposed algorithm with the k-Nearest Neighbor Imputation algorithm and the mean imputation method, testing all three on gene expression datasets. Certain percentages of values were randomly deleted from the datasets, and the missing values were then recovered using the algorithms. The results show that Evolutionary kNNImputation outperforms kNNImputation and mean imputation, demonstrating the importance of using a supervised learning algorithm in missing data estimation. Although mean imputation showed a low mean error at a few missing rates, the supervised learning algorithms were effective at certain missing rates in the datasets.
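The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Everything below is an assumption made for illustration: the function names (`knn_impute`, `mean_impute`, `evolve`), the reading of the similarity choice as a pick between two distance measures, the column-mean baseline, RMSE as the error measure, a fitness computed on a small held-out set of observed values, and the particular selection, crossover, and mutation settings.

```python
import numpy as np

METRICS = ("euclidean", "manhattan")  # hypothetical candidate similarity measures


def knn_impute(data, k, metric="euclidean"):
    """Fill each NaN from the k most similar rows (genes) that observe that column."""
    filled = data.copy()
    col_means = np.nanmean(data, axis=0)  # fallback when no usable neighbor exists
    for i in np.where(np.isnan(data).any(axis=1))[0]:
        diff = data - data[i]  # NaN wherever either row is missing a value
        n_shared = (~np.isnan(diff)).sum(axis=1)
        if metric == "euclidean":
            dist = np.sqrt(np.nansum(diff ** 2, axis=1) / np.maximum(n_shared, 1))
        else:
            dist = np.nansum(np.abs(diff), axis=1) / np.maximum(n_shared, 1)
        dist[n_shared == 0] = np.inf  # nothing in common, not comparable
        dist[i] = np.inf              # a row is never its own neighbor
        order = np.argsort(dist)
        for j in np.where(np.isnan(data[i]))[0]:
            donors = [r for r in order
                      if np.isfinite(dist[r]) and not np.isnan(data[r, j])][:k]
            filled[i, j] = data[donors, j].mean() if donors else col_means[j]
    return filled


def mean_impute(data):
    """Baseline: replace each NaN with the mean of its column."""
    return np.where(np.isnan(data), np.nanmean(data, axis=0), data)


def rmse(truth, estimate, mask):
    """Root mean squared error over the masked (deleted) entries only."""
    return float(np.sqrt(np.mean((truth[mask] - estimate[mask]) ** 2)))


def evolve(data, rng, generations=10, pop_size=8, max_k=15, holdout=0.05):
    """Small genetic search over (k, metric); fitness is RMSE on held-out known values."""
    val_mask = ~np.isnan(data) & (rng.random(data.shape) < holdout)
    train = np.where(val_mask, np.nan, data)
    cache = {}

    def fitness(ind):
        if ind not in cache:
            cache[ind] = rmse(data, knn_impute(train, ind[0], ind[1]), val_mask)
        return cache[ind]

    def random_metric():
        return METRICS[int(rng.integers(len(METRICS)))]

    pop = [(int(rng.integers(1, max_k + 1)), random_metric()) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]  # truncation selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), 2, replace=False)
            k, metric = parents[a][0], parents[b][1]  # crossover: k from a, metric from b
            if rng.random() < 0.3:  # mutation: nudge k, clamped to [1, max_k]
                k = min(max_k, max(1, k + int(rng.integers(-2, 3))))
            if rng.random() < 0.1:  # mutation: redraw the metric
                metric = random_metric()
            children.append((k, metric))
        pop = parents + children
    return min(pop, key=fitness)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for an expression matrix (rows: genes, columns: samples).
    full = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 12))
    deleted = rng.random(full.shape) < 0.10  # randomly delete about 10% of values
    holed = np.where(deleted, np.nan, full)
    best_k, best_metric = evolve(holed, rng)
    print("mean imputation RMSE:", rmse(full, mean_impute(holed), deleted))
    print("kNN (k=10) RMSE:", rmse(full, knn_impute(holed, 10), deleted))
    print("evolved kNN RMSE:", rmse(full, knn_impute(holed, best_k, best_metric), deleted))
```

The usage section mirrors the evaluation protocol described above: values are deleted at random from a complete matrix, each method recovers them, and the error is measured only on the deleted entries. The synthetic matrix stands in for a real gene expression dataset, so the printed errors say nothing about the paper's results.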

Issue

Vol 10 No 1 (2017)

Author Biographies

Hiroshi Madushani de Silva, University of Moratuwa

Instructor, Department of Computer Science and Engineering

Amal Shehan Perera, University of Moratuwa

Senior Lecturer, Department of Computer Science and Engineering
