## Mathematics of HTM Part II – Transition Memory

This article is part of a series describing the mathematics of Hierarchical Temporal Memory (HTM), a theory of cortical information processing developed by Jeff Hawkins. In Part One, we saw how a layer of neurons learns to form a Sparse Distributed Representation (SDR) of an input pattern. In this section, we’ll describe the process of learning temporal sequences.

We showed in Part One that the HTM model neuron learns to recognise subpatterns of feedforward input on its proximal dendrites. This is somewhat similar to the manner by which a Restricted Boltzmann Machine can learn to represent its input in an unsupervised learning process. One distinguishing feature of HTM is that the evolution of the world over time is a critical aspect of what, and how, the system learns. The premise for this is that objects and processes in the world persist over time, and may only display a portion of their structure at any given moment. By learning to model this evolving revelation of structure, the neocortex can more efficiently recognise and remember objects and concepts in the world.

## Distal Dendrites and Prediction

In addition to its one proximal dendrite, an HTM model neuron has a collection of *distal* (far) dendrites, which gather information from sources other than the feedforward inputs to the layer. In some layers of neocortex, these dendrites combine signals from neurons in the same layer as well as from other layers in the same region, and even receive indirect inputs from neurons in higher regions of cortex. We will describe the structure and function of each of these.

The simplest case involves distal dendrites which gather signals from neurons within the *same* layer.

In Part One, we showed that a layer of \(N\) neurons converted an input vector \(\mathbf x \in \mathbb{B}^{n_{\textrm{ff}}}\) into an SDR \(\mathbf{y}_{\textrm{SDR}} \in \mathbb{B}^{N}\), with length \(\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N\), where the sparsity \(s\) is usually of the order of 2% (\(N\) is typically 2048, so the SDR \(\mathbf{y}_{\textrm{SDR}}\) will have about 40 active neurons).

The layer of HTM neurons can now be extended to treat its own activation pattern as a separate and complementary input for the next timestep. This is done using a collection of distal dendrite segments, which each receive as input the signals from other neurons in the layer itself. Unlike the proximal dendrite, which transmits signals directly to the neuron, each distal dendrite acts as an active *coincidence detector*, firing only when it receives enough signals to exceed its individual threshold.

We proceed with the analysis in a manner analogous to the earlier discussion. The input to the distal dendrite segment \(k\) at time \(t\) is a sample of the bit vector \(\mathbf{y}_{\textrm{SDR}}^{(t-1)}\). We have \(n_{ds}\) distal synapses per segment, a permanence vector \(\mathbf{p}_k \in [0,1]^{n_{ds}}\) and a synapse threshold vector \(\vec{\theta}_k \in [0,1]^{n_{ds}}\), where typically \(\theta_i = \theta = 0.2\) for all synapses.

Following the process for proximal dendrites, we get the distal segment’s connection vector \(\mathbf{c}_k\):

$$c_{k,i}=\frac{1 + \operatorname{sgn}(p_{k,i}-\theta_{k,i})}{2}$$
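As a minimal sketch of this thresholding step, the connection vector can be computed in a few lines of plain Python. The function name `connection_vector` and the sample permanences are illustrative assumptions. Note also that the formula as written gives 1/2 when a permanence exactly equals its threshold; the sketch assumes such a synapse counts as not connected.

```python
def connection_vector(permanences, thresholds):
    """Return c_k: 1 where the permanence exceeds its synapse threshold, else 0.

    Assumption: a permanence exactly equal to its threshold is treated as
    not connected (the sgn formula would give 1/2 in that boundary case).
    """
    if len(permanences) != len(thresholds):
        raise ValueError("permanences and thresholds must have equal length")
    return [1 if p > th else 0 for p, th in zip(permanences, thresholds)]


# Hypothetical segment with n_ds = 5 synapses and the typical theta = 0.2.
p_k = [0.05, 0.25, 0.19, 0.60, 0.21]
theta_k = [0.2] * len(p_k)
c_k = connection_vector(p_k, theta_k)
print(c_k)  # [0, 1, 0, 1, 1]
```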

The input for segment \(k\) is the vector \(\mathbf{y}_k^{(t-1)} = \phi_k(\mathbf{y}_{\textrm{SDR}}^{(t-1)})\) formed by the projection \(\phi_k:\lbrace{0,1}\rbrace^{N-1}\rightarrow\lbrace{0,1}\rbrace^{n_{ds}}\) from the SDR to the subspace of the segment. There are \(\binom{N-1}{n_{ds}}\) such projections (there are no connections from a neuron to itself, so there are \(N-1\) to choose from).

The overlap of the segment for a given \(\mathbf{y}_{\textrm{SDR}}^{(t-1)}\) is the dot product \(o_k^t = \mathbf{c}_k\cdot\mathbf{y}_k^{(t-1)}\). If this overlap reaches the *threshold* \(\lambda_k\) of the segment (that is, \(o_k^t \ge \lambda_k\)), the segment is *active* and sends a *dendritic spike* of size \(s_k\) to the neuron’s cell body.
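The projection, overlap and activity test can be sketched as follows. This is a toy illustration under stated assumptions: the helper names, the 8-neuron layer, the presynaptic indices and the threshold value are all hypothetical, and the boundary case is taken as \(o_k^t \ge \lambda_k\).

```python
def segment_input(sdr_prev, presyn_indices):
    """Projection phi_k: pick out the SDR bits this segment's synapses sample."""
    return [sdr_prev[i] for i in presyn_indices]


def segment_overlap(connections, seg_input):
    """Dot product c_k . y_k: number of connected synapses with active input."""
    return sum(c * y for c, y in zip(connections, seg_input))


def segment_is_active(overlap, threshold):
    """Assumption: the segment fires when the overlap reaches the threshold."""
    return overlap >= threshold


# Toy layer of N = 8 neurons; the segment belongs to neuron 1, so index 1
# is never sampled (no connections from a neuron to itself).
sdr_prev = [1, 0, 0, 1, 0, 1, 0, 0]   # y_SDR at time t-1
presyn_indices = [0, 2, 3, 5, 7]      # n_ds = 5 sampled neurons
c_k = [1, 1, 0, 1, 1]                 # connection vector of the segment

y_k = segment_input(sdr_prev, presyn_indices)
overlap = segment_overlap(c_k, y_k)
print(y_k, overlap, segment_is_active(overlap, threshold=2))
```

Here neuron 3 is active at \(t-1\) but its synapse is not connected, so it does not contribute to the overlap.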

This process takes place *before* the processing of the feedforward input, which allows the layer to combine contextual knowledge of recent activity with recognition of the incoming feedforward signals. In order to facilitate this, we will change the algorithm for Pattern Memory as follows.

Each neuron begins a timestep \(t\) by performing the above processing on its \({n_{\textrm{dd}}}\) distal dendrites. This results in some number \(0\ldots{n_{\textrm{dd}}}\) of segments becoming active and sending spikes to the neuron. The total *predictive activation potential* is given by:

$$o_{\textrm{pred},j}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$

The predictive potential is combined with the overlap score from the feedforward overlap coming from the proximal dendrite to give the *total activation potential*:

$$a_j^t=\alpha_j o_{\textrm{ff},j} + \beta_j o_{\textrm{pred},j}$$

and these \(a_j\) potentials are used to choose the top neurons, forming the SDR \(\mathbf{y}_{\textrm{SDR}}\) at time \(t\). The mixing factors \(\alpha_j\) and \(\beta_j\) are design parameters of the simulation.
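To make the effect of the predictive term concrete, here is a minimal sketch of the combination and selection steps. It assumes a simple global top-\(k\) choice with ties broken by neuron index, and uniform mixing factors; the function names, the six-neuron layer and all numeric values are hypothetical.

```python
def predictive_potential(segments):
    """Sum spike sizes s_k over segments whose overlap reaches lambda_k.

    `segments` is a list of (overlap, threshold, spike_size) tuples.
    """
    return sum(s for o, lam, s in segments if o >= lam)


def total_activation(o_ff, o_pred, alpha=1.0, beta=1.0):
    """a_j = alpha_j * o_ff_j + beta_j * o_pred_j for one neuron."""
    return alpha * o_ff + beta * o_pred


def top_k_sdr(potentials, k):
    """Assumed selection rule: the k largest potentials win, ties by index."""
    ranked = sorted(range(len(potentials)), key=lambda j: (-potentials[j], j))
    winners = set(ranked[:k])
    return [1 if j in winners else 0 for j in range(len(potentials))]


# Feedforward overlaps for a toy layer of six neurons.
o_ff = [3, 5, 5, 1, 4, 2]
# Distal segments per neuron as (overlap, lambda_k, s_k) tuples.
segments = [[], [(1, 3, 2.0)], [(4, 3, 2.0), (1, 3, 2.0)], [], [(3, 3, 2.0)], []]

o_pred = [predictive_potential(segs) for segs in segments]
a = [total_activation(ff, pr) for ff, pr in zip(o_ff, o_pred)]

print(top_k_sdr(o_ff, 2))  # feedforward only: [0, 1, 1, 0, 0, 0]
print(top_k_sdr(a, 2))     # with prediction:  [0, 0, 1, 0, 1, 0]
```

In this toy case the predicted neuron 4 displaces neuron 1, even though neuron 1 has the larger feedforward overlap.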

## Learning Predictions

We use a very similar learning rule for distal dendrite segments as we did for the feedforward inputs:

$$ p_i^{(t+1)} =
\begin{cases}
(1+\sigma_{inc})p_i^{(t)} & \text{if cell $j$ active, segment $k$ active, synapse $i$ active} \\
(1-\sigma_{dec})p_i^{(t)} & \text{if cell $j$ active, segment $k$ active, synapse $i$ not active} \\
p_i^{(t)} & \text{otherwise}
\end{cases} $$

Again, this reinforces synapses which contribute to activity of the cell, and decreases the contribution of synapses which don’t. A boosting rule, similar to that for proximal synapses, allows poorly performing distal connections to improve until they are good enough to use the main rule.
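The main update rule can be sketched as below. The function name, the values of \(\sigma_{inc}\) and \(\sigma_{dec}\), and the sample permanences are illustrative assumptions. Clipping the result to \([0,1]\) is also an assumption, added so that permanences stay in their stated range; the boosting rule is not shown.

```python
def update_segment(permanences, synapse_active, cell_active, segment_active,
                   sigma_inc=0.1, sigma_dec=0.05):
    """Apply the multiplicative permanence rule to one distal segment.

    Permanences change only when both the cell and the segment are active.
    Assumption: results are clipped to [0, 1].
    """
    if not (cell_active and segment_active):
        return list(permanences)
    updated = []
    for p, active in zip(permanences, synapse_active):
        factor = (1 + sigma_inc) if active else (1 - sigma_dec)
        updated.append(min(1.0, max(0.0, factor * p)))
    return updated


p_k = [0.5, 0.2, 0.95]
inputs = [1, 0, 1]  # which synapses received an active input at t-1

# Cell and segment both active: active synapses grow, inactive ones shrink.
print(update_segment(p_k, inputs, cell_active=True, segment_active=True))
# Segment inactive: permanences are left unchanged.
print(update_segment(p_k, inputs, cell_active=True, segment_active=False))
```

Because the rule is multiplicative, a permanence of exactly zero would never recover under this rule alone, which is one reason a separate boosting rule is useful.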

## Interpretation

We can now view the layer of neurons as forming a number of representations at each timestep. The field of predictive potentials \(o_{\textrm{pred},j}\) can be viewed as a map of the layer’s confidence in its prediction of the next input. The field of feedforward potentials can be viewed as a map of the layer’s recognition of current reality. Combined, these maps allow for *prediction-assisted recognition*, which, in the presence of temporal correlations between sensory inputs, will improve the recognition and representation significantly.

We can quantify the properties of the predictions formed by such a layer in terms of the *mutual information* between the SDRs at time \(t\) and \(t+1\). I intend to provide this analysis as soon as possible, and I’d appreciate the kind reader’s assistance if she could point me to papers which might be of help.

A layer of neurons connected as described here is a *Transition Memory*, and is a kind of *first-order memory* of temporally correlated transitions between sensory patterns. This kind of memory can only learn one-step transitions, because the SDR is formed only by combining potentials one timestep in the past with current inputs.

Since the neocortex clearly learns to identify and model much longer sequences, we need to modify our layer significantly in order to construct a system which can learn *high-order sequences*. This is the subject of the next part of this series.

**Note:** For brevity, I’ve omitted the matrix treatment of the above. See Part One for how this is done for Pattern Memory; the extension to Transition Memory is simple but somewhat arduous.