Mathematics of Hierarchical Temporal Memory

  • Nov 28 / 2014
  • 0
Clortex (HTM in Clojure), Cortical Learning Algorithm, NuPIC

Mathematics of Hierarchical Temporal Memory

This article describes some of the mathematics underlying the theory and implementations of Jeff Hawkins’ Hierarchical Temporal Memory (HTM), which seeks to explain how the neocortex processes information and forms models of the world.

Note: Part II: Transition Memory is now available.

The HTM Model Neuron – Pattern Memory (aka Spatial Pooling)

We’ll illustrate the mathematics of HTM by describing the simplest operation in HTM’s Cortical Learning Algorithm: Pattern Memory, also known as Spatial Pooling, forms a Sparse Distributed Representation from a binary input vector. We begin with a layer (a 1- or 2-dimensional array) of single neurons, which will form a pattern of activity aimed at efficiently representing the input vectors.

Feedforward Processing on Proximal Dendrites

The HTM model neuron has a single proximal dendrite, which is used to process and recognise feedforward or afferent inputs to the neuron. We model the entire feedforward input to a cortical layer as a bit vector \({\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\), where \(n_{\textrm{ff}}\) is the width of the input.

The dendrite is composed of \(n_s\) synapses which each act as a binary gate for a single bit in the input vector.  Each synapse has a permanence \(p_i\in{[0,1]}\) which represents the size and efficiency of the dendritic spine and synaptic junction. The synapse will transmit a 1-bit (or on-bit) if the permanence exceeds a threshold \(\theta_i\) (often a global constant \(\theta_i = \theta = 0.2\)). When this is true, we say the synapse is connected.

Each neuron samples \(n_s\) bits from the \(n_{\textrm{ff}}\) feedforward inputs, and so there are \({n_{\textrm{ff}}}\choose{n_{s}}\) possible choices of input for a single neuron. A single proximal dendrite represents a projection \(\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\rightarrow\lbrace{0,1}\rbrace^{n_s}\), so a population of neurons corresponds to a set of subspaces of the sensory space. Each dendrite has an input vector \({\mathbf x}_j=\pi_j({\mathbf x}_{\textrm{ff}})\) which is the projection of the entire input into this neuron’s subspace.

A synapse is connected if its permanence \(p_i\) exceeds its threshold \(\theta_i\). If we subtract \({\mathbf p}-{\vec\theta}\), take the elementwise sign of the result, and map to \(\lbrace{0,1}\rbrace\), we derive the binary connection vector \({\mathbf c}_j\) for the dendrite. Thus:

$$c_i=(1 + sgn(p_i-\theta_i))/2$$

The dot product \(o_j({\mathbf x})={\mathbf c}_j\cdot{\mathbf x}_j\) now represents the feedforward overlap of the neuron with the input, ie the number of connected synapses which have an incoming activation potential. Later, we’ll see how this number is used in the neuron’s processing.

The elementwise product \({\mathbf o}_j={\mathbf c}_j\odot{\mathbf x}_j\) is the vector in the neuron’s subspace which represents the input vector \({\mathbf x}_{\textrm{ff}}\) as “seen” by this neuron. This is known as the overlap vector. The length \(o_j = \lVert{\mathbf o}_j\rVert_{\ell_1}\) of this vector corresponds to the extent to which the neuron recognises the input, and the direction (in the neuron’s subspace) is that vector which has on-bits shared by both the connection vector and the input.

If we project this vector back into the input space, the result \(\mathbf{\hat{x}}_j =\pi^{-1}({\mathbf o}_j)\) is this neuron’s approximation of the part of the input vector which this neuron matches. If we add a set of such vectors, we will form an increasingly close approximation to the original input vector as we choose more and more neurons to collectively represent it.

Sparse Distributed Representations (SDRs)

We now show how a layer of neurons transforms an input vector into a sparse representation. From the above description, every neuron is producing an estimate \(\mathbf{\hat{x}}_j \) of the input \({\mathbf x}_{\textrm{ff}}\), with length \(o_j\ll n_{\textrm{ff}}\) reflecting how well the neuron represents or recognises the input. We form a sparse representation of the input by choosing a set \(Y_{\textrm{SDR}}\) of the top \(n_{\textrm{SDR}}=sN\) neurons, where \(N\) is the number of neurons in the layer, and \(s\) is the chosen sparsity we wish to impose (typically \(s=0.02=2\%\)).

The algorithm for choosing the top \(n_{\textrm{SDR}}\) neurons may vary. In neocortex, this is achieved using a mechanism involving cascading inhibition: a cell firing quickly (because it depolarises quickly due to its input) activates nearby inhibitory cells, which shut down neighbouring excitatory cells, and also nearby inhibitory cells, which spread the inhibition outwards. This type of local inhibition can also be used in software simulations, but it is expensive and is only used where the design involves spatial topology (ie where the semantics of the data is to be reflected in the position of the neurons). A more efficient global inhibition algorithm – simply choosing the top \(n_{\textrm{SDR}}\) neurons by their depolarisation values – is often used in practise.

If we form a bit vector \({\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N\textrm{ where } y_j = 1 \Leftrightarrow j \in Y_{\textrm{SDR}}\), we have a function which maps an input \({\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\) to a sparse output \({\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N\), where the length of each output vector is \(\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N\).

The reverse mapping or estimate of the input vector by the set \(Y_{\textrm{SDR}}\) of neurons in the SDR is given by the sum:

$$\mathbf{\hat{x}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf o}_j)} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j\odot{\mathbf x}_j)}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j \odot \pi_j({\mathbf x}_{\textrm{ff}}))}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j) \odot {\mathbf x}_{\textrm{ff}}} $$

Matrix Form

The above can be represented straightforwardly in matrix form. The projection \(\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}} \rightarrow\lbrace{0,1}\rbrace^{n_s} \) can be represented as a matrix \(\Pi_j \in \lbrace{0,1}\rbrace^{{n_s} \times\ n_{\textrm{ff}}} \).

Alternatively, we can stay in the input space \(\mathbb{B}^{n_{\textrm{ff}}}\), and model \(\pi_j\) as a vector \(\vec\pi_j =\pi_j^{-1}(\mathbf 1_{n_s})\), ie where \(\pi_{j,i} = 1 \Leftrightarrow (\pi_j^{-1}(\mathbf 1_{n_s}))_i = 1\).

The elementwise product \(\vec{x_j} =\pi_j^{-1}(\mathbf x_{j}) = \vec{\pi_j} \odot {\mathbf x_{\textrm{ff}}}\) represents the neuron’s view of the input vector \(x_{\textrm{ff}}\).

We can similarly project the connection vector for the dendrite by elementwise multiplication: \(\vec{c_j} =\pi_j^{-1}(\mathbf c_{j}) \), and thus \(\vec{o_j}(\mathbf x_{\textrm{ff}}) = \vec{c_j} \odot \mathbf{x}_{\textrm{ff}}\) is the overlap vector projected back into \(\mathbb{B}^{n_{\textrm{ff}}}\), and the dot product \(o_j(\mathbf x_{\textrm{ff}}) = \vec{c_j} \cdot \mathbf{x}_{\textrm{ff}}\) gives the same overlap score for the neuron given \(\mathbf x_{\textrm{ff}}\) as input. Note that \(\vec{o_j}(\mathbf x_{\textrm{ff}}) =\mathbf{\hat{x}}_j \), the partial estimate of the input produced by neuron \(j\).

We can reconstruct the estimate of the input by an SDR of neurons \(Y_{\textrm{SDR}}\):

$$\mathbf{\hat{x}}_{\textrm{SDR}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\vec o}_j = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\vec c}_j\odot{\mathbf x_{\textrm{ff}}}} = {\mathbf C}_{\textrm{SDR}}{\mathbf x_{\textrm{ff}}}$$

where \({\mathbf C}_{\textrm{SDR}}\) is a matrix formed from the \({\vec c}_j\) for \(j \in Y_{\textrm{SDR}}\).

Optimisation Problem

We can now measure the distance between the input vector \(\mathbf x_{\textrm{ff}}\) and the reconstructed estimate \(\mathbf{\hat{x}}_{\textrm{SDR}}\) by taking a norm of the difference. Using this, we can frame learning in HTM as an optimisation problem. We wish to minimise the estimation error over all inputs to the layer. Given a set of (usually random) projection vectors \(\vec\pi_j\) for the N neurons, the parameters of the model are the permanence vectors \(\vec{p}_j\), which we adjust using a simple Hebbian update model.

The update model for the permanence of a synapse \(p_i\) on neuron \(j\) is:

$$ p_i^{(t+1)} =
(1+\delta_{inc})p_i^{(t)} & \text {if $j \in Y_{\textrm{SDR}}$, $(\mathbf x_j)_i=1$, and $p_i^{(t)} \ge \theta_i$} \\
(1-\delta_{dec})p_i^{(t)} & \text {if $j \in Y_{\textrm{SDR}}$, and ($(\mathbf x_j)_i=0$ or $p_i^{(t)} \lt \theta_i$)} \\
p_i^{(t)} & \text{otherwise} \\
\end{cases} $$

This update rule increases the permanence of active synapses, those that were connected to an active input when the cell became active, and decreases those which were either disconnected or received a zero when the cell fired. In addition to this rule, an external process gently boosts synapses on cells which either have a lower than target rate of activation, or a lower than target average overlap score.

I do not yet have the proof that this optimisation problem converges, or whether it can be represented as a convex optimisation problem. I am confident such a proof can be easily found. Perhaps a kind reader who is more familiar with a problem framed like this would be able to confirm this. I’ll update this post with more functions from HTM in coming weeks.

Note: Part II: Transition Memory is now available.



Leave a comment