Marginalized Stacked Denoising Autoencoder


From http://fastml.com/very-fast-denoising-autoencoder-with-a-robot-arm/

Source code: http://www.cse.wustl.edu/~mchen/


Once upon a time we were browsing machine learning papers and software. We were interested in autoencoders and found a rather unusual one. It was called marginalized Stacked Denoising Autoencoder (mSDA for short) and the authors claimed that it preserves the strong feature learning capacity of Stacked Denoising Autoencoders while being orders of magnitude faster. We like all things fast, so we were hooked.

About autoencoders

Wikipedia says that an autoencoder is an artificial neural network whose aim is to learn a compressed representation for a set of data, which means it is used for dimensionality reduction. In other words, an autoencoder is a neural network meant to replicate its input. That would be trivial with a large enough hidden layer: the network would just learn an identity mapping. Hence the dimensionality reduction: the hidden layer is typically smaller than the input layer.
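For concreteness, here is a bare-bones sketch of such a network in plain NumPy, trained with gradient descent on the squared reconstruction error. It has nothing to do with mSDA itself, and the hidden layer size, learning rate and epoch count are arbitrary choices of ours.

```python
import numpy as np

def tiny_autoencoder(X, hidden=8, lr=0.01, epochs=200, seed=0):
    """One-hidden-layer autoencoder trained to reproduce its input.

    X is an n x d matrix; the hidden layer is smaller than d, so the network
    has to learn a compressed representation instead of an identity map.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden))   # encoder weights
    W2 = rng.normal(scale=0.1, size=(hidden, d))   # decoder weights
    for _ in range(epochs):
        H = np.tanh(X @ W1)          # encode
        R = H @ W2                   # decode (reconstruct)
        E = R - X                    # reconstruction error
        # back-propagate the mean squared error
        gW2 = H.T @ E / n
        gH = (E @ W2.T) * (1 - H ** 2)
        gW1 = X.T @ gH / n
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2
```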

mSDA is a curious specimen: it is not a neural network and it doesn't reduce the dimensionality of the input. Why use it, then? For denoising. mSDA is a stack of mDAs, which are linear denoisers. An mDA takes a matrix of observations, corrupts it with noise and finds the optimal weights of a linear transformation that reconstructs the original values. A closed-form solution to that linear problem is the secret of the speed, and the reason why the data dimensionality must stay the same.
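Here is a minimal NumPy sketch of a single mDA layer, following the closed-form solution described in [Chen]. The function name, the argument names and the small ridge term added for numerical stability are our own.

```python
import numpy as np

def mda(X, p, reg=1e-5):
    """One marginalized denoising autoencoder (mDA) layer.

    X   -- d x n data matrix (features in rows, observations in columns)
    p   -- probability of zeroing out each feature
    reg -- small ridge term to keep the linear system well conditioned

    Returns the d x (d+1) mapping W and the layer output tanh(W [X; 1]).
    """
    d, n = X.shape
    # append a constant row for the bias; the bias feature is never corrupted
    Xb = np.vstack([X, np.ones((1, n))])
    q = np.full((d + 1, 1), 1.0 - p)
    q[-1] = 1.0

    S = Xb @ Xb.T                     # scatter matrix of the clean data
    Q = S * (q @ q.T)                 # expected scatter of the corrupted data
    np.fill_diagonal(Q, q.ravel() * np.diag(S))
    P = S[:d, :] * q.T                # expected clean-vs-corrupted cross terms

    # the expected least-squares reconstruction problem is solved in closed form
    W = P @ np.linalg.inv(Q + reg * np.eye(d + 1))
    return W, np.tanh(W @ Xb)
```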

Apparently, by stacking a few of those denoisers you can get better results. To achieve a non-linear encoding, the output of each layer is passed through a tanh function before being fed to the next one. This is much faster than optimizing weights that sit behind a non-linearity, which is what back-propagation does (in case you don't know, back-propagation tends to be slow, especially in multi-layer architectures). That slowness is the Achilles' heel of traditional stacked autoencoders.
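Stacking is then just a loop: each new mDA is trained on the tanh-squashed output of the previous one. A sketch, reusing the `mda` function above; how you combine the layers' outputs into a final representation is up to you.

```python
def msda(X, p, num_layers):
    """Train a stack of mDA layers, each on the previous layer's output."""
    Ws, hs = [], [X]
    for _ in range(num_layers):
        W, h = mda(hs[-1], p)
        Ws.append(W)
        hs.append(h)
    return Ws, hs

# toy usage: 20-dimensional data, 1000 observations, 50% corruption, 3 layers
X = np.random.randn(20, 1000)
Ws, hs = msda(X, p=0.5, num_layers=3)
```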

The main trick of mSDA is marginalizing the noise: noise is never actually added to the data. Instead, by marginalizing, the algorithm effectively uses infinitely many copies of noisy data to compute the denoising transformation [Chen]. Sounds good, but does it actually work? We don't know. We'll see.
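For the record, here is what marginalizing actually computes (bias term omitted, roughly in the notation of [Chen]): the mDA weights minimize the expected squared reconstruction error over the corruption, and both expectations have simple closed forms because each feature is zeroed out independently with probability p.

$$
W = \arg\min_{W}\ \mathbb{E}\left[\|X - W\tilde{X}\|_F^2\right]
  = \mathbb{E}[P]\,\mathbb{E}[Q]^{-1},
\qquad P = X\tilde{X}^{\top},\quad Q = \tilde{X}\tilde{X}^{\top},
$$

where $\tilde{X}$ is a corrupted copy of the data $X$.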

All in all, we don’t care that much for theory, we care for results. If it works, great. If it doesn’t, who cares for theory? So let’s find a noisy dataset and see what we can do.

The data
