Background Research in Attempts to Follow a Talk on “Quantitative Evaluation of Deep Generative Models” by Salakhutdinov

These are my notes from quickly researching some topics (sort of an ad-hoc glossary of some machine learning terms) in an attempt to follow a talk by Russ Salakhutdinov at CMU – the talk slides can be found here: http://www.cs.cmu.edu/~rsalakhu/talk_Eval.pdf. I found him while doing some background research on Apple.

Unsupervised Learning

Infer a function to describe a hidden structure from “unlabeled” data – a classification not included in the observation.  There is no objective way to evaluate the accuracy of the structure that is output by the algorithm, which is not the case with supervised learning.

Sparse coding

Class of unsupervised methods for learning sets of over-complete bases to represent data efficiently.  You find a set of basis vectors so you can represent an input vector as a linear combo of the basis vectors.  An over-complete basis can better capture patterns in the input data.

Sparsity means you have few non-zero components.  Sparsity works well with things like sensory data and natural images, which can be described with a small number of atomic elements, like surfaces or edges.
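
As a concrete toy of what “finding a sparse code” can look like, here is a minimal numpy sketch using iterative soft-thresholding (ISTA). Nothing here is from the talk: the dictionary D, the penalty lam, and the iteration count are all invented illustration values.

```python
import numpy as np

def soft_threshold(a, t):
    """Elementwise shrinkage toward zero: the proximal operator of the L1 norm."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_code(x, D, lam=0.1, n_iters=100):
    """Minimize 0.5 * ||x - D @ a||^2 + lam * ||a||_1 over coefficients a
    by ISTA, for a fixed (possibly over-complete) dictionary D."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iters):
        grad = D.T @ (D @ a - x)         # gradient of the reconstruction term
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy usage: 8-dimensional inputs, 16 basis vectors (over-complete).
rng = np.random.default_rng(0)
D = rng.normal(size=(8, 16))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = D[:, 3] + 0.5 * D[:, 11]             # input built from two atoms
print(np.nonzero(np.abs(sparse_code(x, D)) > 1e-3))  # mostly the two "true" atoms
```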

Over-Complete

Linear algebra concept.  A set of basis vectors is complete if you can represent any vector as a linear combo of them; it’s over-complete if you can remove one of the basis vectors and still be complete.  Over-completeness can sometimes yield a more stable or compact representation.

Autoencoder

An autoencoder is an artificial neural net, used in unsupervised learning.  Use it to learn a representation of a data set, so you can dimensionally reduce it.  This reduction means reducing the number of random variables being considered, to obtain a set of principal variables.  This involves feature selection and feature extraction.

The simplest form of autoencoder is a multilayer perceptron.
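
A minimal sketch of that idea in PyTorch; the layer sizes, data, and loss below are illustrative assumptions, not anything prescribed by the talk.

```python
import torch
import torch.nn as nn

# A multilayer-perceptron autoencoder: compress 784-dim inputs (e.g. flattened
# MNIST images) to a 32-dim code, then reconstruct. Sizes are illustrative.
class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(),
                                     nn.Linear(128, n_code))
        self.decoder = nn.Sequential(nn.Linear(n_code, 128), nn.ReLU(),
                                     nn.Linear(128, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                      # stand-in for a batch of images
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
loss.backward()
opt.step()
```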

Fully Observed System

In a fully observable environment, the state of the system is known at all times.   With a chess board, all information required to make an optimal decision is available.   Past state isn’t relevant.

Fully Observed Belief Nets

A probabilistic graphical model.  Represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG).

Ex: Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

Nodes represent random vars, edges represent conditional dependencies, and unconnected nodes are independent.  Each node has a probability function that takes as input a set of values and outputs the probability of the variable represented by the node.

An example of a belief network might be observing a patient’s symptoms of coughing, wheezing, fever, and smoking habits.  You might want to predict the cause of the symptoms – is it emphysema?
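
As a toy illustration of that kind of query, here is a hypothetical two-node network (one disease, one symptom, with invented probabilities) answered via Bayes’ rule in Python:

```python
# Hypothetical two-node belief net: Disease -> Cough. All numbers invented.
p_disease = 0.01
p_cough_given = {True: 0.80, False: 0.05}   # P(cough | disease present/absent)

# Bayes' rule: P(disease | cough) = P(cough | disease) P(disease) / P(cough)
p_cough = (p_cough_given[True] * p_disease +
           p_cough_given[False] * (1 - p_disease))
print(p_cough_given[True] * p_disease / p_cough)   # ~0.139
```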

NADE (Neural Autoregressive Density Estimator)

A neural network architecture applied to the problem of unsupervised distribution and density estimation. It leverages the probability product rule and a weight-sharing scheme inspired by restricted Boltzmann machines to yield an estimator that is both tractable and has good generalization performance.
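
A simplified numpy sketch of the NADE-style recurrence for binary data, showing the product rule p(x) = Π p(x_i | x_<i) and the shared weight matrix; the shapes and random parameters are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_prob(x, W, V, b, c):
    """Log-probability of binary x as a product of conditionals
    p(x_i | x_<i), with hidden activations that reuse the same weight
    matrix W across conditionals (the RBM-inspired weight sharing).
    Shapes: W (H, D), V (D, H), b (D,), c (H,)."""
    a, log_p = c.copy(), 0.0                  # running pre-activation for x_<i
    for i in range(len(x)):
        h = sigmoid(a)                        # hidden units given x_<i
        p_i = sigmoid(b[i] + V[i] @ h)        # p(x_i = 1 | x_<i)
        log_p += x[i] * np.log(p_i) + (1 - x[i]) * np.log(1 - p_i)
        a += W[:, i] * x[i]                   # fold x_i into the shared state
    return log_p

rng = np.random.default_rng(0)
D, H = 6, 4
W, V = 0.1 * rng.normal(size=(H, D)), 0.1 * rng.normal(size=(D, H))
print(nade_log_prob(rng.integers(0, 2, D), W, V, np.zeros(D), np.zeros(H)))
```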

PIXELRNN (Pixel Recurrent Neural Networks)

Used in modeling the distribution of natural images with unsupervised learning.  The model needs to be expressive, tractable, and scalable.  It involves sequentially predicting pixels in an image, modeling the discrete probability of the raw pixel values and encoding the complete set of dependencies in the image.

Can be used in image compression and in forms of reconstruction such as deblurring.

Tractable Models

Computational tractability means that one can solve a problem in polynomial time; it concerns the relationship between the size of the inputs and the time it takes to solve the problem as a function of that size.

Stochastic recurrent neural network

A RNN is a type of artificial neural net where connections between units form a directed cycle (a path of edges and vertices wherein a vertex is reachable from itself).  It exhibits dynamic temporal behaviour and can use internal memory to process arbitrary sequences of inputs.  Used in handwriting recognition and speech recognition.

A stochastic RNN is built by introducing random variations into the network, either by giving the neurons stochastic transfer functions or by giving them stochastic weights.  This makes stochastic networks useful tools for optimization problems, since the random fluctuations help them escape from local minima.  Used in risk management, for example.

Boltzmann Machines

A type of stochastic recurrent neural network and Markov random field.  Able to represent and (given sufficient time) solve difficult combinatoric problems.

They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.  A Boltzmann machine, like a Hopfield network, is a network of units with an “energy” defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic.

The network is run by repeatedly choosing a unit and setting its state according to a formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state’s energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is “at thermal equilibrium“, meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we may converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
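
A toy numpy sketch of that unit-update rule together with a cooling schedule; the network, temperatures, and step counts are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def anneal(W, b, s, temps, steps_per_temp=200, seed=0):
    """Simulated annealing on binary units s with symmetric weights W and
    biases b: at temperature T a randomly chosen unit turns on with
    probability sigmoid(gap / T), where gap is its energy difference."""
    rng = np.random.default_rng(seed)
    for T in temps:                          # gradually lower the temperature
        for _ in range(steps_per_temp):
            i = rng.integers(len(s))         # pick a unit at random
            gap = W[i] @ s + b[i]            # energy gap for s_i = 1 vs 0
            s[i] = float(rng.random() < sigmoid(gap / T))
    return s

# Toy 5-unit network with random symmetric weights and zero diagonal.
rng = np.random.default_rng(1)
W = rng.normal(size=(5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
s = rng.integers(0, 2, size=5).astype(float)
print(anneal(W, np.zeros(5), s, temps=[4.0, 2.0, 1.0, 0.5]))
```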

If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies. This is done with a training procedure.

The Boltzmann machine would theoretically be a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that it seems to stop learning correctly when the machine is scaled up to anything larger than a trivial machine.

Variational Autoencoders

Found in unsupervised machine learning; involves probabilistic Bayesian inference and deep learning.  Autoencoders sequentially deconstruct input data into hidden representations, then use these representations to sequentially reconstruct outputs that resemble the original.  This is representation learning, and it works on unlabeled data – without labels like “this person has emphysema” or “this person has bronchitis”.

See http://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html.

You must provide some constraints to make the hidden representation useful, or the network will just learn to copy its input.

Variational Autoencoders (VAEs) incorporate regularization by explicitly learning the joint distribution over data and a set of latent variables that is most compatible with observed datapoints and some designated prior distribution over latent space.

VAEs are an excellent tool for manifold learning—recovering the “true” manifold in lower-dimensional space along which the observed data lives with high probability mass—and generative modeling of complex datasets like images, text, and audio—conjuring up brand new examples, consistent with the observed training set, that do not exist in nature.

An end-to-end autoencoder (input to reconstructed input) can be split into two complementary networks: an encoder and a decoder. The encoder maps input x to a latent representation, or so-called hidden code, z. The decoder maps the hidden code to a reconstructed input value x̃ .
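
A minimal PyTorch sketch of that encoder/decoder pair plus the VAE loss (reconstruction term plus KL regularizer); the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Encoder outputs mean and log-variance of q(z|x); the decoder maps a
    sampled code z back to x-space. Sizes are illustrative."""
    def __init__(self, n_in=784, n_z=20):
        super().__init__()
        self.enc = nn.Linear(n_in, 256)
        self.mu, self.logvar = nn.Linear(256, n_z), nn.Linear(256, n_z)
        self.dec = nn.Sequential(nn.Linear(n_z, 256), nn.ReLU(),
                                 nn.Linear(256, n_in))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(logits, x, mu, logvar):
    # Reconstruction error plus KL(q(z|x) || N(0, I)) -- the regularizer that
    # ties the hidden code to the designated prior over latent space.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

x = torch.rand(64, 784)                     # stand-in batch
logits, mu, logvar = VAE()(x)
print(vae_loss(logits, x, mu, logvar))
```
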
Helmholtz Machines

Another type of artificial NN.  Can account for the hidden structure of a set of data by being trained to create a generative model of the original set of data.  The hope is that by learning economical representations of the data, the underlying structure of the generative model should reasonably approximate the hidden structure of the data set.

A Helmholtz machine contains two networks, a bottom-up recognition network that takes the data as input and produces a distribution over hidden variables, and a top-down “generative” network that generates values of the hidden variables and the data itself.

Helmholtz machines are usually trained using an unsupervised learning algorithm, but can be used in supervised learning to do things like character recognition.

Generative Adversarial Networks

Generative adversarial networks are a branch of unsupervised machine learning, implemented by a system of two neural networks competing against each other in a zero-sum game framework.

Can be used to make images that a human might find convincing.  You could create a picture of a duck in a video game, for example.

One network is generative and one is discriminative.[1] Typically, the generative network is taught to map from a latent space to a particular data distribution of interest, and the discriminative network is simultaneously taught to discriminate between instances from the true data distribution and synthesized instances produced by the generator. The generative network’s training objective is to increase the error rate of the discriminative network (fooling it).

In practice, a particular dataset serves as the training data for the discriminator. Training the discriminator involves presenting the discriminator with samples from the dataset and samples synthesized by the generator, and backpropagating from a binary classification loss. In order to produce a sample, typically the generator is seeded with a randomized input that is sampled from a predefined latent space (e.g., a multivariate normal distribution). Training the generator involves back-propagating the negation of the binary classification loss of the discriminator. The generator adjusts its parameters so that the training data and generated data cannot be distinguished by the discriminator model. The goal is to find a setting of parameters that makes generated data look like the training data to the discriminator network.[4]

In practice, the generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network.
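
A compact sketch of one training step under these objectives. For brevity this uses plain linear layers rather than the conv/deconv pair just mentioned, and every size and hyperparameter is invented.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))   # generator
D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))    # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)        # stand-in for a batch of real data
z = torch.randn(32, 16)           # noise from the predefined latent space

# Discriminator step: label real as 1, generated as 0 (binary classification).
d_loss = (bce(D(real), torch.ones(32, 1)) +
          bce(D(G(z).detach()), torch.zeros(32, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: push D to label generated samples as real (fool it).
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```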

Moment Matching Networks

Addresses the problem of learning deep generative models from data. Generates an independent sample via a single feedforward pass through a multilayer perceptron.

Training a generative adversarial network requires careful optimization of a difficult minimax program. Instead, this approach utilizes a technique from statistical hypothesis testing known as maximum mean discrepancy (MMD), which leads to a simple objective that can be interpreted as matching all orders of statistics between a dataset and samples from the model, and can be trained by backpropagation.
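
A small numpy sketch of the squared-MMD estimate under a Gaussian kernel (the biased V-statistic form; the bandwidth and toy data are my own choices):

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between sample sets
    X and Y (rows are samples) under a Gaussian kernel of width sigma."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2(rng.normal(size=(200, 2)), rng.normal(2.0, 1.0, size=(200, 2)))
print(same, diff)   # near zero for matched distributions, larger otherwise
```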

To further boost the performance of this approach, the authors combine the generative network with an auto-encoder network, using MMD to learn to generate codes that can then be decoded to produce samples.  When measured using the Toronto Face Database, the researchers found excellent generative models compared to baseline approaches.

See http://jmlr.org/proceedings/papers/v37/li15.pdf.

Directed or Undirected Models

Bayesian networks are directed.  Undirected graphical models are useful when natural directionality doesn’t exist between variables; this is the case for Markov networks or MRFs.

One can combine directed and undirected graphs.

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables.

For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a factor graph.

See http://www.cedar.buffalo.edu/~srihari/CSE674/Chap4/4.1-UndirGr.pdf.

Deep Boltzmann Machine

A new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent expectations are estimated using a variational approximation that tends to focus on a single mode, and data-independent expectations are approximated using persistent Markov chains. The use of two quite different techniques for estimating the two types of expectation that enter into the gradient of the log-likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer “pre-training” phase that allows variational inference to be initialized with a single bottom-up pass.

See http://www.utstat.toronto.edu/~rsalakhu/papers/dbm.pdf.

Markov Random Fields

In the domain of physics and probability, a Markov random field (often abbreviated as MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph.

Markov property refers to the memoryless property of a stochastic process – it has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A process with this property is called a Markov process.
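
In symbols (the standard statement of the property, not specific to the talk):

$$P(X_{n+1} = x \mid X_n = x_n, X_{n-1} = x_{n-1}, \dots, X_1 = x_1) = P(X_{n+1} = x \mid X_n = x_n)$$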

A Markov network or MRF is similar to a Bayesian network in its representation of dependencies; the differences being that Bayesian networks are directed and acyclic, whereas Markov networks are undirected and may be cyclic. Thus, a Markov network can represent certain dependencies that a Bayesian network cannot (such as cyclic dependencies); on the other hand, it can’t represent certain dependencies that a Bayesian network can (such as induced dependencies). The underlying graph of a Markov random field may be finite or infinite.

Partition Functions

In physics, a partition function describes the statistical properties of a system in thermodynamic equilibrium.

Partition functions are functions of the thermodynamic state variables, such as the temperature and volume. Most of the aggregate thermodynamic variables of the system, such as the total energy, free energy, entropy, and pressure, can be expressed in terms of the partition function or its derivatives.

Each partition function is constructed to represent a particular statistical ensemble (which, in turn, corresponds to a particular free energy). The most common statistical ensembles have named partition functions.

The canonical partition function applies to a canonical ensemble, in which the system is allowed to exchange heat with the environment at fixed temperature, volume, and number of particles. The grand canonical partition function applies to a grand canonical ensemble, in which the system can exchange both heat and particles with the environment, at fixed temperature, volume, and chemical potential. Other types of partition functions can be defined for different circumstances; see partition function (mathematics) for generalizations. The partition function has many physical meanings.
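
For a discrete system with microstates $i$ of energy $E_i$, the canonical partition function and the resulting Boltzmann probabilities are

$$Z(T) = \sum_i e^{-E_i / (k_B T)}, \qquad p_i = \frac{e^{-E_i / (k_B T)}}{Z(T)},$$

where $k_B$ is the Boltzmann constant and $T$ the temperature.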

Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. It can be trained in either supervised or unsupervised ways, depending on the task.

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the “visible” and “hidden” units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, “unrestricted” Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.[7]

Can also be used in deep learning networks – deep belief networks can be formed by “stacking” RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.

See https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine

MRF model selection (Markov random field model selection)

Hidden Markov random fields appear naturally in problems such as image segmentation, where an unknown class assignment has to be estimated from the observations at each pixel. Choosing the probabilistic model that best accounts for the observations is an important first step for the quality of the subsequent estimation and analysis.

A commonly used selection criterion is the Bayesian Information Criterion (BIC) of Schwarz (1978), but for hidden Markov random fields, its exact computation is not tractable due to the dependence structure induced by the Markov model.

Forbes and Peyrard proposed approximations of BIC based on the mean field principle of statistical physics. The mean field theory provides approximations of Markov random fields by systems of independent variables, leading to tractable computations. Using this principle, they derived a class of criteria by approximating the Markov distribution in the usual BIC expression as a penalized likelihood, then rewriting BIC in terms of normalizing constants, also called partition functions, instead of Markov distributions.

See http://ieeexplore.ieee.org/document/1227985/?reload=true.

Generative Model

In probability and statistics, a generative model is a model for randomly generating observable data values, typically given some hidden parameters. It specifies a joint probability distribution over observation and label sequences. Generative models are used in machine learning for either modeling data directly (i.e., modeling observations drawn from a probability density function), or as an intermediate step to forming a conditional probability density function.

A conditional distribution can be formed from a generative model through Bayes’ rule.

Generative models contrast with discriminative models, in that a generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the observed variables. Thus a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model, whereas a discriminative model allows only sampling of the target variables conditional on the observed quantities. Despite the fact that discriminative models do not need to model the distribution of the observed variables, they cannot generally express more complex relationships between the observed and target variables. They don’t necessarily perform better than generative models at classification and regression tasks. In modern applications the two classes are seen as complementary or as different views of the same procedure.

Simple Importance Sampling

One of the advantages of Bayesian statistics is the flexibility to design the analysis to fit the problem, instead of trying to force the problem into an approach you have available.

Searching the parameter space to fit a model is more taxing in the Bayesian paradigm, because the whole posterior is of interest, while in contrast most classical analyses split the task into two (fairly) simple steps: finding the maximum of the likelihood function (say, using Newton–Raphson) and then plugging the estimate into the Fisher information to derive an asymptotic confidence interval.

Several approaches are widely used to search the parameter space, evaluating the posterior along the way – grid search and Monte Carlo.

The problem with Monte Carlo sampling in its most basic form is that to use it you need to (i) know the specific distribution of the posterior of the parameters and (ii) be able to simulate this distribution. More usually, however, you know the distribution only in the sense that you can calculate it for any given values of the parameters (perhaps up to a constant of proportionality) but cannot simulate from it, so rather than knowing that $p_v \mid X_v, N_v \sim \mathrm{Be}(52, 8147)$, say, all you know is that $p(p_v \mid X_v, N_v) \propto p_v^{51}(1 - p_v)^{8146}$. This lack of knowledge means you do not know how to select values for the parameters, as you could for basic Monte Carlo sampling.

The idea behind importance sampling is that instead of sampling from the posterior itself, whose form you don’t know, you simulate from a different distribution that you hope is close to the actual posterior, and “correct” it to account for the fact that it is not the actual posterior. The correction is done by weighting the sampled points by the ratio of the posterior to the proposal density, both of which you should be able to evaluate numerically (if you cannot evaluate the proposal density, choose one you can).
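
A small numpy/scipy sketch of exactly that recipe, using the Beta-posterior example above. The proposal Beta(40, 6000) is an arbitrary choice made for illustration; any density you can both simulate and evaluate would do.

```python
import numpy as np
from scipy import stats

# Target known only up to proportionality, as in the text:
# p(p | X, N) ∝ p^51 (1 - p)^8146   (a Be(52, 8147) in disguise).
def log_target(p):
    return 51 * np.log(p) + 8146 * np.log1p(-p)

a, b = 40, 6000                      # proposal we *can* simulate from
rng = np.random.default_rng(0)
p = rng.beta(a, b, size=100_000)

# Self-normalized importance weights: target / proposal (up to a constant).
log_w = log_target(p) - stats.beta.logpdf(p, a, b)
w = np.exp(log_w - log_w.max())      # stabilize before normalizing
w /= w.sum()

print((w * p).sum())                 # posterior mean, ~ 52 / 8199 ≈ 0.00634
```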

See http://blog.nus.edu.sg/alexcook/files/2012/12/chapter2-2bv9gf5.pdf.

Monte Carlo Approximation

Using statistical methods one may face integrals of a certain form which sometimes cannot be evaluated analytically.  This may be the case with Bayesian methods.  One can use numeric integration methods in that case, though they may lose tractability in practical cases with many dimensions.

Monte Carlo approximation allows calculation of an estimate for the value of the integral by transforming the integration problem into a procedure of sampling values from a tractable probability distribution and calculating the average of those samples.
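
A minimal numpy sketch of the idea, with the target chosen so the true answer is known:

```python
import numpy as np

# Estimate E[X^2] for X ~ N(0, 1); the true value is 1. The integral
# of x^2 against the normal density is replaced by an average over samples.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
print(np.mean(x ** 2))   # ≈ 1.0, with error shrinking like 1/sqrt(n)
```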

See https://theclevermachine.wordpress.com/2012/09/22/monte-carlo-approximations/

Gibbs Sampling

Gibbs sampling  is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult.

This sequence can be used to approximate the joint distribution (e.g., to generate a histogram of the distribution); to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired (typically by thinning the resulting chain of samples by only taking every nth value, e.g. every 100th value). In addition, samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution.
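
A toy Gibbs sampler for a bivariate normal with correlation rho, including the burn-in and thinning just described (all the constants are arbitrary):

```python
import numpy as np

rho, n = 0.8, 10_000
rng = np.random.default_rng(0)
x = y = 0.0
samples = []
for _ in range(n):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # draw x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # draw y | x
    samples.append((x, y))

chain = np.array(samples[1000:][::10])   # drop burn-in, keep every 10th
print(np.corrcoef(chain.T)[0, 1])        # ≈ rho
```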

See https://en.wikipedia.org/wiki/Gibbs_sampling

Forward or Reverse Markov Chains

Consider Markov random processes in which there is a list of possible states.

NOTE: the excerpt below is from http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2012/notes/Week2.pdf…

We introduce three mathematical ideas: a σ-algebra to represent a state of partial information, measurability of a function with respect to a discrete σ-algebra, and a filtration that represents gaining information over time. Filtrations are a convenient way to describe the Markov property and to give the general definition of a martingale, the latter a few weeks from now.

Associated with Markov chains are the backward and forward equations that describe the evolution of probabilities and expectation values over time. Forward and backward equations are one of the main “calculate things” methods of stochastic calculus.

A stochastic process in discrete time is a sequence $(X_1, X_2, \dots)$, where $X_n$ is the state of the system at time $n$. The path up to time $T$ is $X_{[1:T]} = (X_1, X_2, \dots, X_T)$. The state $X_n$ must be in the state space, $S$. Last week $S$ was $\mathbb{R}^d$, for a linear Gaussian process. This week, $S$ is either a finite set $S = \{x_1, x_2, \dots, x_m\}$, or an infinite countable set of the form $S = \{x_1, x_2, \dots\}$. A set such as $S$ is discrete if it is finite or countable. The set of all real numbers, like the Gaussian state space $\mathbb{R}^d$, is not discrete because the real numbers are not countable (a famous theorem of Georg Cantor). Spaces that are not discrete may be called continuous. If we define $X_t$ for all times $t$, then $X_t$ is a continuous time process. If $X_n$ is defined only for integers $n$ (or another discrete set of times), then we have a discrete time process. This week is about discrete time, discrete state space stochastic processes.

We will be interested in discrete Markov chains partly for their own sake and partly because they are a setting where the general definitions are easy to give without much mathematical subtlety. The concepts of measurability and filtration for continuous time or continuous state space are more technical and subtle than we have time for in this class. The same is true of backward and forward equations. They are rigorous this week, but heuristic when we go to continuous time and state space. That is the “∆X → 0 and ∆t → 0” aspect of stochastic calculus.

The main examples of Markov chains will be the random walk and the mean-reverting random walk.  These are discrete versions of Brownian motion and the Ornstein–Uhlenbeck process respectively.

See http://www.sciencedirect.com/science/article/pii/S030441490600189X

RBM Sampling

From Yam Peleg, answering a Quora question at https://www.quora.com/Why-does-Gibbs-sampling-in-an-RBM-in-contrastive-divergence-allow-us-to-approximate-the-log-of-the-partition-function

Direct sampling of a stochastic process is sometimes difficult and even impossible – due to lack of knowledge of the underlying mechanics that create that stochastic process over time.

Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm (itself a very important algorithm) for obtaining a sequence of observations when direct sampling is difficult.

Contrastive divergence is a way of training some machine learning algorithms that rely on the use of Markov chains for data sampling.
CD works like this: we initialize the Markov chain with a training example (from a distribution that is expected to be close to the desired one).
The trick is that CD does not wait for the chain to converge; samples are obtained after only k steps of Gibbs sampling.

This is why Gibbs sampling in an RBM works.
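
A rough numpy sketch of CD-k for a small binary RBM. The shapes, learning rate, and initialization are invented; this is the textbook recipe, not production code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k(v0, W, b, c, k=1, rng=np.random.default_rng(0)):
    """One CD-k gradient estimate for an RBM with weights W, visible biases
    b, hidden biases c. The chain starts at a training example v0 and runs
    only k >= 1 steps of block Gibbs sampling instead of waiting for
    convergence."""
    h0 = sigmoid(v0 @ W + c)                       # p(h = 1 | v0)
    h = (rng.random(h0.shape) < h0).astype(float)
    for _ in range(k):
        pv = sigmoid(h @ W.T + b)                  # p(v = 1 | h)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + c)                    # p(h = 1 | v)
        h = (rng.random(ph.shape) < ph).astype(float)
    # Positive phase (data) minus negative phase (k-step reconstruction).
    return np.outer(v0, h0) - np.outer(v, ph)

rng = np.random.default_rng(1)
W = rng.normal(scale=0.01, size=(6, 3))            # 6 visible, 3 hidden units
b, c = np.zeros(6), np.zeros(3)
v0 = rng.integers(0, 2, size=6).astype(float)
W += 0.1 * cd_k(v0, W, b, c, k=1)                  # one learning step
```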

Omniglot Data Set

A dataset called “Omniglot” – while similar in spirit to MNIST, rather than having 10 characters with 6,000 examples each, it has over 1,600 characters with 20 examples each – making it more like the “transpose” of MNIST. These characters were selected from 50 different alphabets on http://www.omniglot.com, which includes scripts from natural languages (e.g., Hebrew, Korean, Greek) and artificial scripts (e.g., Futurama and ULOG) invented for purposes like TV shows or video games. Since it was produced on Amazon’s Mechanical Turk, each image is paired with a movie ([x, y, time] coordinates) showing how that drawing was produced.

Data Manifold

High-dimensional data, meaning data that requires more than two or three dimensions to represent, can be difficult to interpret. One approach to simplification is to assume that the data of interest lie on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimension, the data can be visualised in the low-dimensional space.

Many of these non-linear dimensionality reduction methods are related to linear methods. Non-linear methods can be broadly classified into two groups: those that provide a mapping (either from the high-dimensional space to the low-dimensional embedding or vice versa), and those that just give a visualisation. In the context of machine learning, mapping methods may be viewed as a preliminary feature extraction step, after which pattern recognition algorithms are applied. Typically those that just give a visualisation are based on proximity data – that is, distance measurements.

Generative Process

The sampling procedure defined by a generative model; see the Generative Model entry above.

Kernel Density Estimator

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.[1][2]
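
A hand-rolled numpy sketch of a Gaussian KDE (the bandwidth h is an arbitrary choice; in practice something like scipy.stats.gaussian_kde would pick it for you):

```python
import numpy as np

def kde(grid, data, h=0.3):
    """Gaussian kernel density estimate: average a normal bump of
    bandwidth h centered on each data point."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])
print(kde(np.linspace(-4, 4, 9), data))   # smooth estimate of a bimodal density
```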
