Summary of AlphaGo Approach, Pulling Directly from the Paper

If you can handle it, read the paper on AlphaGo here: https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf.  I tried to pull a summary of the technique and results from the paper into less than three pages, so I can reread the highlights over time and see if it sinks in…

What is Hard About Go

Enormous search space, difficult to assess board positions.

Approach

Introduction of Value Networks and Policy Networks – deep neural networks

Novel combination of supervised learning and reinforcement learning

This thing plays itself to improve

New search algorithm that combines Monte-Carlo simulation with the value & policy networks

Results

99.8% winning rate against other Go programs, and defeated the European Go champion 5 games to 0.

Note – this was thought to be a decade away

Perfect Information and Outcomes

In games of perfect information, there is an optimal value function, determining the outcome of the game from every board position, under perfect play.

Recursive Computation of Optimal Value Function

You can solve the game by recursively computing the optimal value function in a search tree containing possible sequences of moves, depending on the game’s breadth (number of legal moves per position) and its depth (game length).

In Go, exhaustive search is not feasible.
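
To make the recursion concrete, here is a minimal Python sketch (mine, not from the paper) of computing the optimal value function by exhaustive search on a toy subtraction game (take 1 or 2 counters; whoever takes the last counter wins). In Go this same recursion is exactly what blows up, since the breadth and depth are enormous.

```python
# Toy stand-in for Go: recursively compute the optimal value under perfect play.

def optimal_value(counters):
    """Value of the position for the player to move: +1 = win, -1 = loss."""
    if counters == 0:
        return -1  # the previous player took the last counter, so we have lost
    # The value is the best outcome over all legal moves, assuming the
    # opponent then plays optimally (hence the sign flip on the recursion).
    return max(-optimal_value(counters - take)
               for take in (1, 2) if take <= counters)

print([optimal_value(n) for n in range(1, 10)])
# positions that are multiples of 3 are losses for the player to move
```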

Reducing the Search Space

1) Position evaluation: truncating the search tree at state s and replacing the subtree below s with an approximate value function that predicts the outcome from state s.

2) Sampling actions from a policy that is a probability distribution over possible moves in position s.

Monte-Carlo Tree Search (MCTS)

Uses Monte-Carlo rollouts to estimate the value of each state in a search tree.

As more simulations are executed, the search tree grows larger and the relevant values become more accurate.

The policy used to select actions during search is also improved over time, by selecting children with higher values.

These policies are used to narrow the search to a beam of high probability actions, and to sample actions during rollouts.
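
Here is a minimal sketch (again mine, not AlphaGo's) of the rollout idea on the same toy subtraction game: estimate a state's value by playing many random games from it and averaging the outcomes. The estimate sharpens as the number of rollouts grows, but under a weak (random) rollout policy it can differ a lot from the true minimax value.

```python
import random

def rollout_value(counters, n_rollouts=5000):
    """Monte-Carlo estimate of the value of a position for the player to move."""
    total = 0
    for _ in range(n_rollouts):
        c, sign = counters, +1              # sign tracks whose turn it is
        while c > 0:
            c -= random.choice([t for t in (1, 2) if t <= c])
            sign = -sign
        # the game ended: `sign` now points at the player to move, who has lost
        total += -sign
    return total / n_rollouts

print(rollout_value(3))   # about -0.5 under random play, vs -1 under perfect play
```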

Deep Convolutional Neural Networks

Deep convolutional neural networks use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localised representations of an image.

Pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position, then use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

Training with Pipeline

Train the neural networks using a pipeline consisting of several stages of machine learning.

First, train a supervised learning (SL) policy network directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients.

Also train a fast policy that can rapidly sample actions during rollouts.

Next, train a reinforcement learning (RL) policy network that improves the SL policy network by optimising the final outcome of games of self-play.

This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, train a value network that predicts the winner of games played by the RL policy network against itself.

AlphaGo efficiently combines the policy and value networks with MCTS.

Supervised Learning of Policy Networks

The SL policy network pσ(a|s) alternates between convolutional layers with weights σ, and rectifier non-linearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state.

The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s.
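
A minimal numpy sketch of that update (a toy stand-in of mine, not the paper's convolutional network): a linear-softmax "policy" over a tiny move set, with one stochastic-gradient-ascent step on the log-likelihood of the human move.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_moves = 8, 5            # stand-ins for the board encoding / 361 moves
W = rng.normal(scale=0.1, size=(n_moves, n_features))

def policy(s):
    logits = W @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()                # p_sigma(. | s)

s = rng.normal(size=n_features)       # a fake board encoding
a = 2                                 # index of the human expert's move
lr = 0.1

p = policy(s)
# gradient of log p(a|s) w.r.t. W for a softmax-linear model: (onehot(a) - p) s^T
grad = (np.eye(n_moves)[a] - p)[:, None] * s[None, :]
W += lr * grad                        # ascent step: the human move becomes more likely
print(policy(s)[a] > p[a])            # True
```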

Reinforcement Learning of Policy Networks

This stage aims at improving the policy network by policy gradient reinforcement learning. The RL policy network pρ is identical in structure to the SL policy network, and its weights ρ are initialised to the same values, ρ = σ.

They play games between the current policy network pρ and a randomly selected previous iteration of the policy network. Randomising from a pool of opponents stabilises training by preventing overfitting to the current policy.

They use a reward function r(s) that is zero for all non-terminal time-steps t < T. The outcome zt = ±r(sT) is the terminal reward at the end of the game from the perspective of the current player at time-step t: +1 for winning and −1 for losing. Weights are then updated at each time-step t by stochastic gradient ascent in the direction that maximizes the expected outcome.
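
A minimal sketch (assumptions mine) of that REINFORCE-style update: the same log-likelihood gradient as in the SL stage, but scaled by the game outcome z, so moves from won games are reinforced and moves from lost games are discouraged. Here log_policy_grad is a caller-supplied function standing in for the network's gradient.

```python
def reinforce_update(rho, trajectory, z, log_policy_grad, lr=0.01):
    """trajectory: (state, action) pairs played by the current player;
    z: +1 if that player won the game, -1 if it lost;
    log_policy_grad(rho, s, a): gradient of log p_rho(a|s) w.r.t. rho."""
    for s, a in trajectory:
        rho = rho + lr * z * log_policy_grad(rho, s, a)   # ascent on z * log-likelihood
    return rho
```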

Reinforcement Learning of Value Networks

The last stage of the training pipeline focuses on position evaluation: estimating a value function that predicts the outcome from position s of games played by using policy p for both players.

Ideally, we would like to know the optimal value function under perfect play; in practice, we instead estimate the value function for the strongest policy, the RL policy network pρ.

Approximate the value function using a value network vθ(s) with weights θ: vθ(s) ≈ vpρ(s) ≈ v*(s). This neural network has a similar architecture to the policy network, but outputs a single prediction instead of a probability distribution.

Train the weights of the value network by regression on state-outcome pairs (s, z), using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value vθ(s), and the corresponding outcome z.
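
A minimal numpy sketch of the value regression (a toy linear-tanh stand-in of mine, not the paper's network): gradient descent on the squared error between the predicted value and the game outcome z.

```python
import numpy as np

n_features = 8
theta = np.zeros(n_features)
s = np.linspace(-1.0, 1.0, n_features)   # a fake encoding of position s
z = 1.0                                  # outcome of the self-play game from s
lr = 0.2

def v(s_vec):
    return np.tanh(theta @ s_vec)        # prediction in (-1, 1), like a win/loss value

for _ in range(200):
    err = v(s) - z
    # gradient of 0.5*(v(s)-z)^2 for v = tanh(theta.s): (v-z)*(1-v^2)*s
    theta -= lr * err * (1 - v(s) ** 2) * s

print(v(s))                              # close to the outcome z = 1
```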

Overfitting by Predicting Outcomes from Complete Games

The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting.

The problem is that successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game.

When trained on the KGS dataset in this way, the value network memorised the game outcomes rather than generalising to new positions.

To mitigate this problem, they generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated.

Searching with Policy and Value Networks

AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search.

Each edge (s, a) of the search tree stores an action value Q(s, a), a visit count N(s, a), and a prior probability P(s, a).

The tree is traversed by simulation starting from the root state. At each time-step t of each simulation, an action at is selected from state st so as to maximize the action value plus a bonus u(s, a) ∝ P(s, a) / (1 + N(s, a)), which is proportional to the prior probability but decays with repeated visits to encourage exploration.
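
A minimal sketch of that selection rule (assumptions mine): pick the child maximizing Q(s, a) plus a bonus proportional to the prior and damped by the visit count. The exploration constant c is mine for illustration; the paper's full formula also scales the bonus with the parent's total visit count.

```python
def select_action(edges, c=1.0):
    """edges: dict mapping action -> {'Q': value, 'N': visits, 'P': prior}
    for the edges out of one tree node."""
    def score(e):
        u = c * e['P'] / (1 + e['N'])   # bonus: prior, decaying with visits
        return e['Q'] + u               # exploit Q, explore via u
    return max(edges, key=lambda a: score(edges[a]))

# toy usage: a rarely-visited move with a high prior can beat a well-explored one
edges = {'a': {'Q': 0.5, 'N': 20, 'P': 0.2},
         'b': {'Q': 0.3, 'N': 1,  'P': 0.6}}
print(select_action(edges))             # 'b'
```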

When the traversal reaches a leaf node at step L, the leaf node may be expanded. The leaf position is processed just once by the SL policy network. The output probabilities are stored as prior probabilities P for each legal action.

The leaf node is evaluated in two very different ways: first, by the value network, and second, by the outcome of a random rollout played out until terminal step T using the fast rollout policy pπ; these evaluations are combined, using a mixing parameter λ, into a leaf evaluation V(sL).

At the end of simulation n, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge.

Once the search is complete, the algorithm chooses the most visited move from the root position.

Computational Resources

To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.

 

Background Research in Attempts to Follow a Talk on “Quantitative Evaluation of Deep Generative Models” by Salakhutdinov

These are my notes from quickly researching some topics (sort of an ad-hoc glossary of some machine learning terms) in an attempt to follow a talk by Russ Salakhutdinov at CMU – the talk slides can be found here: http://www.cs.cmu.edu/~rsalakhu/talk_Eval.pdf. I found him while doing some background research on Apple.

Unsupervised Learning

Infer a function to describe hidden structure from “unlabeled” data – data where a classification is not included in the observation.  There is no objective way to evaluate the accuracy of the structure output by the algorithm, which is not the case with supervised learning.

Sparse coding

Class of unsupervised methods for learning sets of over-complete bases to represent data efficiently.  You find a set of basis vectors so you can represent an input vector as a linear combination of the basis vectors.  An over-complete basis can better capture patterns in the input data.

Sparsity means you have few non-zero components.  Sparsity works well with things like sensory data and natural images, which can be described with a small number of atomic elements, like surfaces or edges.
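
A minimal sketch (mine) of the sparse-coding objective: reconstruct x as a linear combination of basis vectors (the columns of a dictionary D), while an L1 penalty pushes most coefficients to zero.

```python
import numpy as np

def sparse_coding_objective(x, D, a, lam=0.1):
    """x: input vector, D: dictionary (basis vectors as columns), a: coefficients."""
    reconstruction_error = np.sum((x - D @ a) ** 2)
    sparsity_penalty = lam * np.sum(np.abs(a))   # encourages few non-zero a_i
    return reconstruction_error + sparsity_penalty
```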

Over-Complete

Linear algebra concept.  A set of basis vectors is complete if any vector can be represented as a linear combination of them.  It is over-complete if you can remove one of the basis vectors and still be complete.  Over-completeness can sometimes yield a more stable or compact representation.

Autoencoder

An autoencoder is an artificial neural net used in unsupervised learning.  Use it to learn a representation of a data set so you can dimensionally reduce it.  This reduction means reducing the number of random variables being considered to obtain a set of principal variables, and involves feature selection and feature extraction.

The simplest form of autoencoder is a multilayer perceptron.
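
A minimal numpy sketch (mine) of that simplest autoencoder shape: an MLP that squeezes the input through a low-dimensional code and reconstructs it. The weights here are random; training would minimize the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 64, 8                    # e.g. an 8x8 image squeezed to 8 numbers
W_enc = rng.normal(scale=0.1, size=(n_code, n_in))
W_dec = rng.normal(scale=0.1, size=(n_in, n_code))

def autoencode(x):
    code = np.tanh(W_enc @ x)           # encoder: dimensionality reduction
    return W_dec @ code                 # decoder: reconstruction of x

x = rng.normal(size=n_in)
x_hat = autoencode(x)
print(x.shape, x_hat.shape)             # (64,) (64,)
```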

Fully Observed System

In a fully observable environment, the state of the system is known at all times.   With a chess board, all information required to make an optimal decision is available.   Past state isn’t relevant.

Fully Observed Belief Nets

A probabilistic graphical model.  Represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG).

Ex: Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

Nodes represent random variables, edges represent conditional dependencies, and unconnected nodes are independent.  Each node has a probability function that takes as input a set of values for its parent variables and outputs the probability of the variable represented by the node.

An example of a belief network might be observing a patient’s symptoms of coughing, wheezing, and fever, plus their smoking habits.  You might want to predict the cause of the symptoms – is it emphysema?

NADE (Neural Autoregressive Density Estimator)

A neural network architecture applied to the problem of unsupervised distribution and density estimation. They leverage the probability product rule and a weight-sharing scheme inspired by restricted Boltzmann machines, to yield an estimator that is both tractable and has good generalization performance.

PIXELRNN (Pixel Recurrent Neural Networks)

Used in modeling the distribution of natural images with unsupervised learning.  Needs to be expressive, tractable, and scalable.  Involves sequentially predicting pixels in an image, modeling the discrete probability of the raw pixel values and encoding the complete set of dependencies in the image.

Can be used in image compression, and in forms of reconstruction and deblurring.

Tractable Models

Computational tractability means that one can solve a problem in polynomial time; it concerns how the time to solve a problem grows as a function of the size of its inputs.

Stochastic recurrent neural network

An RNN is a type of artificial neural net where connections between units form a directed cycle (a path of edges and vertices wherein a vertex is reachable from itself).  It exhibits dynamic temporal behaviour and can use internal memory to process arbitrary sequences of inputs.  Used in handwriting recognition and speech recognition.

 

A stochastic RNN is built by introducing random variations into the network, either by giving the neurons stochastic transfer functions or by giving them stochastic weights.  This makes them useful tools for optimization problems, since the random fluctuations help them escape local minima.  Used in risk management, for example.

Boltzmann Machines

A type of stochastic recurrent neural network and Markov random field.  Able to represent and (given sufficient time) solve difficult combinatoric problems.

They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.  A Boltzmann machine, like a Hopfield network, is a network of units with an “energy” defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic.

The network is run by repeatedly choosing a unit and setting its state according to a formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state’s energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is “at thermal equilibrium“, meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we may converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.

If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies. This is done with a training procedure.

The Boltzmann machine would theoretically be a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that it seems to stop learning correctly when the machine is scaled up to anything larger than a trivial machine.

Variational Autoencoders

Found in unsupervised machine learning; involves probabilistic Bayesian inference and deep learning.  Autoencoders sequentially deconstruct input data into hidden representations, then use these representations to sequentially reconstruct outputs that resemble the original.  This is representation learning, and it works on unlabeled data – without labels like “this person has emphysema” or “this person has bronchitis”.

See http://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html.

You must provide some constraints to make the hidden representation useful, or you will just copy the image.

Variational Autoencoders (VAEs) incorporate regularization by explicitly learning the joint distribution over data and a set of latent variables that is most compatible with observed datapoints and some designated prior distribution over latent space.

VAEs are an excellent tool for manifold learning—recovering the “true” manifold in lower-dimensional space along which the observed data lives with high probability mass—and generative modeling of complex datasets like images, text, and audio—conjuring up brand new examples, consistent with the observed training set, that do not exist in nature.

An end-to-end autoencoder (input to reconstructed input) can be split into two complementary networks: an encoder and a decoder. The encoder maps input x to a latent representation, or so-called hidden code, z. The decoder maps the hidden code to a reconstructed input value x̃ .
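
A minimal sketch (mine, not from the linked post) of how the pieces fit together: the encoder produces the parameters of a distribution over the code z, a z is sampled via the reparameterization trick, the decoder reconstructs x, and the loss is reconstruction error plus a KL penalty keeping the code distribution close to a standard-normal prior. The encoder and decoder here are hypothetical callables.

```python
import numpy as np

def vae_loss(x, encoder, decoder, rng=np.random.default_rng(0)):
    mu, log_var = encoder(x)                       # parameters of q(z|x)
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)   # reparameterized sample
    x_hat = decoder(z)                             # reconstruction
    reconstruction = np.sum((x - x_hat) ** 2)
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return reconstruction + kl
```
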
Helmholtz Machines

Another type of artificial NN.  Can account for the hidden structure of a set of data by being trained to create a generative model of the original set of data.  The hope is that by learning economical representations of the data, the underlying structure of the generative model should reasonably approximate the hidden structure of the data set.

A Helmholtz machine contains two networks, a bottom-up recognition network that takes the data as input and produces a distribution over hidden variables, and a top-down “generative” network that generates values of the hidden variables and the data itself.

Helmholtz machines are usually trained using an unsupervised learning algorithm, but can be used in supervised learning to do things like character recognition.

Generative Adversarial Networks

Generative adversarial networks are a branch of unsupervised machine learning, implemented by a system of two neural networks competing against each other in a zero-sum game framework.

Can be used to make images that a human might find convincing.  You could create a picture of a duck in a video game, for example.

One network is generative and one is discriminative.[1] Typically, the generative network is taught to map from a latent space to a particular data distribution of interest, and the discriminative network is simultaneously taught to discriminate between instances from the true data distribution and synthesized instances produced by the generator. The generative network’s training objective is to increase the error rate of the discriminative network (fooling it).

In practice, a particular dataset serves as the training data for the discriminator. Training the discriminator involves presenting the discriminator with samples from the dataset and samples synthesized by the generator, and backpropagating from a binary classification loss. In order to produce a sample, typically the generator is seeded with a randomized input that is sampled from a predefined latent space (e.g., a multivariate normal distribution). Training the generator involves back-propagating the negation of the binary classification loss of the discriminator. The generator adjusts its parameters so that the training data and generated data cannot be distinguished by the discriminator model. The goal is to find a setting of parameters that makes generated data look like the training data to the discriminator network.[4]

In practice, the generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network.
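
A minimal sketch (assumptions mine) of one alternating training step as described above. The generator, update_discriminator, and update_generator arguments are hypothetical stand-ins for whatever models and optimizers are actually used.

```python
import numpy as np

def gan_training_step(real_batch, generator, update_discriminator, update_generator,
                      latent_dim=100, rng=np.random.default_rng(0)):
    # 1. sample noise from the latent prior and synthesize fake samples
    z = rng.normal(size=(len(real_batch), latent_dim))
    fake_batch = generator(z)
    # 2. train D on a binary classification loss: real labelled 1, generated labelled 0
    update_discriminator(real_batch, fake_batch)
    # 3. train G to increase D's error (backpropagate the negated D loss through G)
    update_generator(z)
```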

Moment Matching Networks

Addressing learning deep generative models from data. Generates an independent sample via a single feedforward pass through a multilayer perceptron.

Training a generative adversarial network requires careful optimization of a difficult minimax program. Instead, this approach utilizes a technique from statistical hypothesis testing known as maximum mean discrepancy (MMD), which leads to a simple objective that can be interpreted as matching all orders of statistics between a dataset and samples from the model, and can be trained by backpropagation.

To further boost performance, this approach combines the generative network with an auto-encoder network, using MMD to learn to generate codes that can then be decoded to produce samples.  Measured on the Toronto Face Database, the researchers found excellent generative models compared to baseline approaches.

See http://jmlr.org/proceedings/papers/v37/li15.pdf.
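
A minimal numpy sketch (mine, not the paper's code) of the MMD statistic itself with a Gaussian kernel: it is small when two samples come from the same distribution and grows as their statistics diverge, which is what makes it usable as a training objective.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution -> MMD near zero
Z = rng.normal(2.0, 1.0, size=(200, 2))   # shifted distribution -> clearly larger MMD
print(mmd2(X, Y), mmd2(X, Z))
```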

Directed or Undirected Models

Bayesian networks are directed.  Undirected graphical models are useful when a natural directionality doesn’t exist between variables; this is the case for Markov networks or MRFs.

One can combine directed and undirected graphs.

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited at expressing soft constraints between random variables.

For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a factor graph.

See http://www.cedar.buffalo.edu/~srihari/CSE674/Chap4/4.1-UndirGr.pdf.

Deep Boltzmann Machine

A new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent expectations are estimated using a variational approximation that tends to focus on a single mode, and data-independent expectations are approximated using persistent Markov chains. The use of two quite different techniques for estimating the two types of expectation that enter into the gradient of the log-likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer “pre-training” phase that allows variational inference to be initialized with a single bottom-up pass.

See http://www.utstat.toronto.edu/~rsalakhu/papers/dbm.pdf.

Markov Random Fields

In the domain of physics and probability, a Markov random field (often abbreviated as MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph.

Markov property refers to the memoryless property of a stochastic process – it has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A process with this property is called a Markov process.

A Markov network or MRF is similar to a Bayesian network in its representation of dependencies; the differences being that Bayesian networks are directed and acyclic, whereas Markov networks are undirected and may be cyclic. Thus, a Markov network can represent certain dependencies that a Bayesian network cannot (such as cyclic dependencies); on the other hand, it can’t represent certain dependencies that a Bayesian network can (such as induced dependencies). The underlying graph of a Markov random field may be finite or infinite.

Partition Functions

In physics, a partition function describes the statistical properties of a system in thermodynamic equilibrium.

Partition functions are functions of the thermodynamic state variables, such as the temperature and volume. Most of the aggregate thermodynamic variables of the system, such as the total energy, free energy, entropy, and pressure, can be expressed in terms of the partition function or its derivatives.

Each partition function is constructed to represent a particular statistical ensemble (which, in turn, corresponds to a particular free energy). The most common statistical ensembles have named partition functions.

The canonical partition function applies to a canonical ensemble, in which the system is allowed to exchange heat with the environment at fixed temperature, volume, and number of particles. The grand canonical partition function applies to a grand canonical ensemble, in which the system can exchange both heat and particles with the environment, at fixed temperature, volume, and chemical potential. Other types of partition functions can be defined for different circumstances; see partition function (mathematics) for generalizations. The partition function has many physical meanings.

Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. It can be trained in either supervised or unsupervised ways, depending on the task.

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the “visible” and “hidden” units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, “unrestricted” Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.[7]

Can also be used in deep learning networks – deep belief networks can be formed by “stacking” RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.

See https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine

MRF model selection (Markov random field model selection)

Hidden Markov random fields appear naturally in problems such as image segmentation, where an unknown class assignment has to be estimated from the observations at each pixel. Choosing the probabilistic model that best accounts for the observations is an important first step for the quality of the subsequent estimation and analysis.

A commonly used selection criterion is the Bayesian Information Criterion (BIC) of Schwarz (1978), but for hidden Markov random fields, its exact computation is not tractable due to the dependence structure induced by the Markov model.

Forbes and Peyrard proposed approximations of BIC based on the mean field principle of statistical physics. The mean field theory provides approximations of Markov random fields by systems of independent variables, leading to tractable computations. Using this principle, they derived a class of criteria by approximating the Markov distribution in the usual BIC expression as a penalized likelihood, then rewriting BIC in terms of normalizing constants, also called partition functions, instead of Markov distributions.

 

See http://ieeexplore.ieee.org/document/1227985/?reload=true.

Generative Model

In probability and statistics, a generative model is a model for randomly generating observable data values, typically given some hidden parameters. It specifies a joint probability distribution over observation and label sequences. Generative models are used in machine learning for either modeling data directly (i.e., modeling observations drawn from a probability density function), or as an intermediate step to forming a conditional probability density function.

A conditional distribution can be formed from a generative model through Bayes’ rule.

Generative models contrast with discriminative models, in that a generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the observed variables. Thus a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model, whereas a discriminative model allows only sampling of the target variables conditional on the observed quantities. Despite the fact that discriminative models do not need to model the distribution of the observed variables, they cannot generally express more complex relationships between the observed and target variables. They don’t necessarily perform better than generative models at classification and regression tasks. In modern applications the two classes are seen as complementary or as different views of the same procedure.

Simple Importance Sampling

One of the advantages of Bayesian statistics is the flexibility to design the analysis to fit the problem, instead of trying to force the problem into an approach you have available.

Searching the parameter space to fit a model is more taxing in the Bayesian paradigm, because the whole posterior is of interest, while in contrast most classical analyses split the task into two (fairly) simple steps: finding the maximum of the likelihood function (say, using Newton–Raphson) and then plugging the estimate in to the Fisher information to derive an asymptotic confidence interval.

Several approaches are widely used to search the parameter space, evaluating the posterior along the way: grid search and Monte Carlo.

The problem with Monte Carlo sampling in its most basic form is that to use it you need to (i) know the specific distribution of the posterior of the parameters and (ii) be able to simulate this distribution. More usually, however, you know the distribution only in the sense that you can calculate it for any given values of the parameters (perhaps up to a constant of proportionality) but cannot simulate from it, so rather than knowing that p_v | X_v, N_v ∼ Be(52, 8147), say, all you know is that p(p_v | X_v, N_v) ∝ p_v^51 (1 − p_v)^8146. This lack of knowledge means you do not know how to select values for the parameters, as you could for basic Monte Carlo sampling.

The idea behind importance sampling is that instead of sampling from the posterior itself, whose form you don’t know, you simulate from a different distribution that you hope is close to the actual posterior, and “correct” for the fact that it is not the actual posterior. The correction is done by weighting the sampled points by the ratio of the posterior to the proposal density, both of which you should be able to evaluate numerically (if you cannot evaluate the proposal density, choose one you can).

See http://blog.nus.edu.sg/alexcook/files/2012/12/chapter2-2bv9gf5.pdf.
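
A minimal runnable sketch (mine, not from the linked notes) of simple importance sampling for the Beta-like posterior mentioned above: sample from a proposal we can simulate, weight by the unnormalized posterior over the proposal density, and self-normalize.

```python
import numpy as np

rng = np.random.default_rng(0)

def unnorm_target(p):
    # unnormalized Beta(52, 8147) posterior, matching the shape in the text
    return p ** 51 * (1 - p) ** 8146

n = 100_000
proposal_samples = rng.uniform(0.0, 0.05, size=n)    # proposal: Uniform(0, 0.05)
proposal_density = 1.0 / 0.05
weights = unnorm_target(proposal_samples) / proposal_density
# self-normalized importance-sampling estimate of the posterior mean of p
posterior_mean = np.sum(weights * proposal_samples) / np.sum(weights)
print(posterior_mean)   # close to 52 / (52 + 8147) ≈ 0.0063
```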

Monte Carlo Approximation

Using statistical methods one may face integrals of a certain form, which sometimes cannot be evaluated analytically.  This may be the case with Bayesian methods.  One can use numeric integration methods in that case, though they may lose tractability in practical cases with many dimensions.

Monte Carlo approximation allows calculation of an estimate for the value of such an integral by transforming the integration problem into a procedure of sampling values from a tractable probability distribution and calculating the average of those samples.

See https://theclevermachine.wordpress.com/2012/09/22/monte-carlo-approximations/
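
A minimal sketch (mine) of the idea: an integral written as an expectation under a tractable distribution is estimated by the sample average of draws from that distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)          # X ~ N(0, 1)
estimate = np.mean(samples ** 2)            # Monte Carlo estimate of E[X^2]
print(estimate)                             # close to the exact value, 1
```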

Gibbs Sampling

Gibbs sampling  is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult.

This sequence can be used to approximate the joint distribution (e.g., to generate a histogram of the distribution); to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired (typically by thinning the resulting chain of samples by only taking every nth value, e.g. every 100th value). In addition, samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution.

See https://en.wikipedia.org/wiki/Gibbs_sampling
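
A minimal runnable sketch (mine) of Gibbs sampling for a bivariate normal with correlation rho, including the burn-in and thinning mentioned above: each variable is resampled in turn from its conditional given the current value of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
x, y = 0.0, 0.0
samples = []
for i in range(20_000):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))  # sample x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))  # sample y | x
    if i >= 1_000 and i % 10 == 0:                  # drop burn-in, then thin
        samples.append((x, y))
samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])                 # close to rho = 0.8
```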

Forward or Reverse Markov Chains

Consider Markov random processes in which there is a list of possible states.

NOTE: the excerpt below is from http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2012/notes/Week2.pdf.

We introduce three mathematical ideas: a σ-algebra to represent a state of partial information, measurability of a function with respect to a discrete σ-algebra, and a filtration that represents gaining information over time. Filtrations are a convenient way to describe the Markov property and to give the general definition of a martingale, the latter a few weeks from now.

Associated with Markov chains are the backward and forward equations that describe the evolution of probabilities and expectation values over time. Forward and backward equations are one of the main “calculate things” methods of stochastic calculus.

A stochastic process in discrete time is a sequence (X1, X2, . . .), where Xn is the state of the system at time n. The path up to time T is X[1:T] = (X1, X2, …, XT). The state Xn must be in the state space, S. Last week S was Rd, for a linear Gaussian process. This week, S is either a finite set S = {x1, x2, … xm}, or an infinite countable set of the form S = {x1, x2, …}. A set such as S is discrete if it is finite or countable. The set of all real numbers, like the Gaussian state space Rd, is not discrete because the real numbers are not countable (a famous theorem of Georg Cantor). Spaces that are not discrete may be called continuous. If we define Xt for all times t, then Xt is a continuous time process. If Xn is defined only for integers n (or another discrete set of times), then we have a discrete time process. This week is about discrete time, discrete state space stochastic processes.

We will be interested in discrete Markov chains partly for their own sake and partly because they are a setting where the general definitions are easy to give without much mathematical subtlety. The concepts of measurability and filtration for continuous time or continuous state space are more technical and subtle than we have time for in this class. The same is true of backward and forward equations. They are rigorous this week, but heuristic when we go to continuous time and state space. That is the “∆X → 0 and ∆t → 0” aspect of stochastic calculus.

The main examples of Markov chains will be the random walk and the mean-reverting random walk.  These are discrete versions of Brownian motion and the Ornstein–Uhlenbeck process respectively.

See http://www.sciencedirect.com/science/article/pii/S030441490600189X

RBM Sampling

From Yam Peleg, answering a Quora question at https://www.quora.com/Why-does-Gibbs-sampling-in-an-RBM-in-contrastive-divergence-allow-us-to-approximate-the-log-of-the-partition-function

Direct sampling of a stochastic process is sometimes difficult and even impossible – due to lack of knowledge of the underlying mechanics that create that stochastic process over time.

Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm (that by itself is a very important algorithm) for obtaining a sequence of observations when direct sampling is difficult.

Contrastive divergence is a way of training some machine learning algorithms that rely on the use of Markov chains for data sampling.
CD works like this: we initialize the Markov chain with a training example (from a distribution that is expected to be close to the desired one).
The trick is that CD does not wait for the chain to converge; samples are obtained after only k steps of Gibbs sampling.

This is why Gibbs sampling in an RBM works.
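
A minimal numpy sketch (mine, ignoring bias terms) of a CD-k weight update for a binary RBM: start the chain at a training example, run only k steps of alternating Gibbs sampling between visible and hidden units, and update the weights with the difference between the data-driven and model-driven statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, v0, k=1, lr=0.1):
    """W: (n_visible, n_hidden) weights; v0: a binary training vector."""
    h0_prob = sigmoid(v0 @ W)
    v, h_prob = v0, h0_prob
    for _ in range(k):                          # k Gibbs steps, no waiting for convergence
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_prob = sigmoid(h @ W.T)
        v = (rng.random(v_prob.shape) < v_prob).astype(float)
        h_prob = sigmoid(v @ W)
    positive = np.outer(v0, h0_prob)            # data-driven statistics
    negative = np.outer(v, h_prob)              # model-driven statistics after k steps
    return W + lr * (positive - negative)
```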

Omniglot Data Set

A dataset called “Omniglot” was collected – while similar in spirit to MNIST, rather than having 10 characters with 6,000 examples each, it has over 1,600 characters with 20 examples each – making it more like the “transpose” of MNIST. These characters were selected from 50 different alphabets on http://www.omniglot.com, which includes scripts from natural languages (e.g., Hebrew, Korean, Greek) and artificial scripts (e.g., Futurama and ULOG) invented for purposes like TV shows or video games. Since it was produced on Amazon’s Mechanical Turk, each image is paired with a movie ([x, y, time] coordinates) showing how that drawing was produced.

Data Manifold

High-dimensional data, meaning data that requires more than two or three dimensions to represent, can be difficult to interpret. One approach to simplification is to assume that the data of interest lie on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimension, the data can be visualised in the low-dimensional space.

Many of these non-linear dimensionality reduction methods are related to linear methods. Non-linear methods can be broadly classified into two groups: those that provide a mapping (either from the high-dimensional space to the low-dimensional embedding or vice versa), and those that just give a visualisation. In the context of machine learning, mapping methods may be viewed as a preliminary feature extraction step, after which pattern recognition algorithms are applied. Typically those that just give a visualisation are based on proximity data – that is, distance measurements.

Generative Process

In probability and statistics, a generative model is a model for randomly generating observable data values, typically given some hidden parameters. It specifies a joint probability distribution over observation and label sequences. Generative models are used in machine learning for either modeling data directly (i.e., modeling observations drawn from a probability density function), or as an intermediate step to forming a conditional probability density function. A conditional distribution can be formed from a generative model through Bayes’ rule.

Generative models contrast with discriminative models, in that a generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the observed variables. Thus a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model, whereas a discriminative model allows only sampling of the target variables conditional on the observed quantities. Despite the fact that discriminative models do not need to model the distribution of the observed variables, they cannot generally express more complex relationships between the observed and target variables. They don’t necessarily perform better than generative models at classification and regression tasks. In modern applications the two classes are seen as complementary or as different views of the same procedure.[1]

Kernel Density Estimator

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.[1][2]
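
A minimal numpy sketch (mine) of a Gaussian KDE: the estimated density at a point is the average of kernels centered at each data point, with the bandwidth h controlling the smoothing.

```python
import numpy as np

def kde(x, data, h=0.2):
    """Gaussian kernel density estimate at x from a 1-D sample `data`."""
    kernels = np.exp(-0.5 * ((x - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()

rng = np.random.default_rng(0)
data = rng.normal(size=1_000)
print(kde(0.0, data))     # close to the true N(0,1) density at 0, which is about 0.399
```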

What Do You Do After AlphaGo

A friend and I talked about AlphaGo, and what problem would be next for that technology.  He thought it would be no-limit Texas hold’em: http://www.theverge.com/2017/1/10/14220578/ai-deepstack-beats-poker-pros-no-limit-texas-hold-em.  Possibly people are close to doing this.

I am sure that would be progress.  Though I would like to see an AI produce an empathic sentiment in a conversation with a human.  Or – show an AI a video somebody takes walking through a garden, then have the AI paint a picture that would be a simulation of what Van Gogh would have produced if he had taken that walk.

 

Google Translate, Functionality Emerging During Use, Evolution, Containment, Fear

Google Translate uses deep learning (http://machinelearningmastery.com/what-is-deep-learning/) – which roughly means using very large neural networks trained on huge amounts of data, scaling up gracefully towards increasing performance.

Translate started with a few languages and then scaled to 100, translating 100 billion+ words per day (https://www.weforum.org/agenda/2017/02/googles-ai-translation-tool-seems-to-have-invented-its-own-language/).  They started translating to and from English.  Then Translate became able to translate Japanese to Korean WITHOUT using English as a bridge.

The Translate algorithms may have developed an abstract representation of meaning, independent of any human language, which can then be reasonably easily converted into a human language, like English or Italian.

Something noteworthy about this is the EMERGENCE of functionality in a software system, without it being DESIGNED by a human product manager.  This could become a common characteristic of AI systems, that they effectively design their new features as they operate without being told to do so.  The system in a sense evolves, as the “features” that work will remain as neural networks are trained.

Is evolution just the minimization of a bunch of mathematical functions?  Reminds me of the “principle of least action” in physics, where the dynamics minimize (or make stationary) the action.

This process instinctively gives me fear: if systems can develop features that achieve an end without human direction, what happens when humans can’t even tell that those features are in play and only see the result?  What if they realize it later, when it may be too late to reverse?