Matrix completion and feature imputation algorithms
|Build Status| |Coverage Status| |DOI|
A variety of matrix completion and imputation algorithms implemented in Python.
.. code:: python
from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replace with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).complete(X_incomplete)
# matrix completion using convex optimization to find low-rank solution
# that still matches observed values. Slow!
X_filled_nnm = NuclearNormMinimization().complete(X_incomplete)
# Instead of solving the nuclear norm objective directly, instead
# induce sparsity using singular value thresholding
X_filled_softimpute = SoftImpute().complete(X_incomplete_normalized)
# print mean squared error for the three imputation methods above
nnm_mse = ((X_filled_nnm[missing_mask] - X[missing_mask]) ** 2).mean()
print("Nuclear norm minimization MSE: %f" % nnm_mse)
softImpute_mse = ((X_filled_softimpute[missing_mask] - X[missing_mask]) ** 2).mean()
print("SoftImpute MSE: %f" % softImpute_mse)
knn_mse = ((X_filled_knn[missing_mask] - X[missing_mask]) ** 2).mean()
print("knnImpute MSE: %f" % knn_mse)
SimpleFill
: Replaces missing entries with the mean or median of
each column.
KNN
: Nearest neighbor imputations which weights samples using the
mean squared difference on features for which two rows both have
observed data.
SoftImpute
: Matrix completion by iterative soft thresholding of
SVD decompositions. Inspired by the
softImpute <https://web.stanford.edu/~hastie/swData/softImpute/vignette.html>
_
package for R, which is based on Spectral Regularization Algorithms
for Learning Large Incomplete
Matrices <http://web.stanford.edu/~hastie/Papers/mazumder10a.pdf>
_
by Mazumder et. al.
IterativeSVD
: Matrix completion by iterative low-rank SVD
decomposition. Should be similar to SVDimpute from Missing value
estimation methods for DNA
microarrays <http://www.ncbi.nlm.nih.gov/pubmed/11395428>
__ by
Troyanskaya et. al.
MICE
: Reimplementation of Multiple Imputation by Chained
Equations <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/>
__.
MatrixFactorization
: Direct factorization of the incomplete
matrix into low-rank U
and V
, with an L1 sparsity penalty on
the elements of U
and an L2 penalty on the elements of V
.
Solved by gradient descent.
NuclearNormMinimization
: Simple implementation of Exact Matrix
Completion via Convex
Optimization <http://statweb.stanford.edu/~candes/papers/MatrixCompletion.pdf>
_
by Emmanuel Candes and Benjamin Recht using
cvxpy <http://www.cvxpy.org/en/latest/>
_. Too slow for large
matrices.
BiScaler
: Iterative estimation of row/column means and standard
deviations to get doubly normalized matrix. Not guaranteed to
converge but works well in practice. Taken from Matrix Completion
and Low-Rank SVD via Fast Alternating Least
Squares <http://arxiv.org/abs/1410.2596>
__.
.. |Build Status| image:: https://travis-ci.org/hammerlab/fancyimpute.svg?branch=master :target: https://travis-ci.org/hammerlab/fancyimpute .. |Coverage Status| image:: https://coveralls.io/repos/github/hammerlab/fancyimpute/badge.svg?branch=master :target: https://coveralls.io/github/hammerlab/fancyimpute?branch=master .. |DOI| image:: https://zenodo.org/badge/doi/10.5281/zenodo.51773.svg :target: http://dx.doi.org/10.5281/zenodo.51773