## Self-Stabilisation in Hierarchical Temporal Memory

This post was written in response to Jeff Hawkins’ comments on last week’s article on a new Multilayer Model of Neocortex in Hierarchical Temporal Memory (HTM). Jeff expressed concerns about the clarity or correctness of my claim that sublayers in a cortical region act to self-stabilise in the face of unpredicted changes in the world (including changes in top-down feedback from higher regions). This discussion is a companion to an earlier description of the Efficiency of Predicted Sparseness, but goes into much more detail when describing how a non-sparse output from one sublayer is absorbed and processed by downstream sublayers.

In the earlier posts, we described how each sublayer in a region combines context inputs with feedforward inputs to form a sparse, predicted representation of the world in context. When this succeeds perfectly, each column in the sublayer has only a single active cell, and that cell represents the best combination of prediction from context and recognition of the feedforward input. The single-cell-per-column representation occurs when the single cell is sufficiently depolarised by distal (predictive/context) inputs to beat its columnar inhibitory sheath and fire first. If this does not happen, then the sheath fires first, allowing some number of contained pyramidal cells to fire before vertical inhibition reduces the column’s activity to just the one, best-predicted cell.

In order to understand the stabilising effect, we need to zoom in temporally and watch how the potentials evolve in extreme “slow-motion” in which the time steps correspond to individual synaptic events. At this framerate, we can observe the individual neurons’ potentials rising towards firing and the effect of inhibition both vertically and horizontally on the patterns of activation. This level of granularity also allows us to characterise the opportunities for synapses to adapt, which turns out to be crucial for understanding the model.

Synapses grow when there is a temporal correlation between their pre-synaptic inputs and the action potentials of the post-synaptic cell. The more often the cell fires within a short (c. 10ms) window of time after the synapse receives an action potential, the bigger and more receptive the synapse grows. In HTM, we model this with a scalar value we call *permanence*, which varies between 0.0 and 1.0, and we say that the synapse is *connected* when its permanence is above a threshold (usually 0.2), otherwise it is *disconnected*.
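As a rough illustration of the permanence model just described, here is a minimal sketch in Python. The 0.2 threshold and the 10ms window come from the text above; the increment and decrement sizes (`PERM_INC`, `PERM_DEC`), the decay of uncorrelated synapses, and the function names are assumptions made for this example, not values from any particular HTM implementation.

```python
CONNECTED_THRESHOLD = 0.2   # from the text: connected when permanence is above this
PERM_INC = 0.05             # hypothetical growth step
PERM_DEC = 0.01             # hypothetical decay step (an assumption, not from the text)


def is_connected(permanence, threshold=CONNECTED_THRESHOLD):
    """A synapse counts as connected when its permanence is above the threshold."""
    return permanence > threshold


def update_permanence(permanence, pre_spike_ms, post_spike_ms, window_ms=10.0):
    """Grow the synapse if the cell fired shortly after the input arrived.

    Otherwise shrink it slightly. The result is clamped to [0.0, 1.0].
    """
    delay = post_spike_ms - pre_spike_ms
    if 0.0 <= delay <= window_ms:
        permanence += PERM_INC
    else:
        permanence -= PERM_DEC
    return min(1.0, max(0.0, permanence))


p = 0.18
print(is_connected(p))                      # False
p = update_permanence(p, pre_spike_ms=0.0, post_spike_ms=5.0)
print(is_connected(p))                      # True
```

A synapse just below threshold becomes connected after a single well-timed pairing in this sketch; with smaller steps it would take several repetitions.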

The current “official” Cortical Learning Algorithm (or CLA, the detailed computational model in HTM) separates feedforward and predictive stages of processing. A modification of this model (which I call *prediction-assisted recognition* or paCLA) combines these into a single step involving competition between highly predictive pyramidal cells and their surrounding columnar inhibitory sheaths. Though this has been described in summary form before, I’ll go through it in detail here.

Neural network models generally model a neuron as somehow “combining” a set of inputs to produce an output. This is based on the idea that input signals cause ion currents to flow into the neuron’s cell body, which raises its voltage (depolarises), until it reaches a threshold level and fires (outputs a signal). paCLA also models this idea, with the added complication that there are two separate pathways (proximal and distal) for input signals to be converted into effects on the voltage of the cell. In addition, paCLA treats the effect of the inputs as a *rate of change* of potential, rather than as a final potential level as found in standard CLA.

## Slow-motion Timeline of paCLA

[Note: this section relates to Mathematics of HTM Part I and Part II – see those posts for a full treatment].

Consider a single column of pyramidal cells in a layer of cortex. Along with the set of pyramidal cells \(\{P_1,P_2 .. P_n\}\), we also model a *columnar sheath of inhibitory cells* as a single cell \(I\). All the \(P_i\) and \(I\) are provided with the same feedforward input vector \(\mathbf{x}_t\), and they also have similar (but not necessarily identical) synaptic connection vectors \(\mathbf{c}_{P_i}\) and \(\mathbf{c}_{I}\) to those inputs (the bits of \(\mathbf{x}_t\) are the incoming sensory activation potentials, while bit \(j\) of a connection vector \(\mathbf{c}\) is 1 if synapse \(j\) is connected). The *feedforward overlap* \(o^{\textrm{ff}}_{P_i}(\mathbf{x}_t) = \mathbf{x}_t \cdot \mathbf{c}_{P_i}\) is the output of the proximal dendrite of cell \({P_i}\) (and similarly for cell \(I\)).
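The feedforward overlap is just a dot product of two binary vectors. The following sketch computes it for one pyramidal cell and for the sheath; the example vectors and the helper name `feedforward_overlap` are made up for illustration.

```python
def feedforward_overlap(x, c):
    """Dot product of a binary input vector x and a binary connection vector c."""
    if len(x) != len(c):
        raise ValueError("input and connection vectors must have the same length")
    return sum(xi * ci for xi, ci in zip(x, c))


# Hypothetical six-bit input and connection vectors.
x_t = [1, 0, 1, 1, 0, 1]
c_P1 = [1, 1, 0, 1, 0, 1]   # connections of pyramidal cell P_1
c_I = [1, 0, 1, 1, 0, 1]    # connections of the inhibitory sheath I

print(feedforward_overlap(x_t, c_P1))  # 3
print(feedforward_overlap(x_t, c_I))   # 4
```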

In addition, each pyramidal cell (but not the inhibitory sheath) receives signals on its distal dendrites. Each dendrite segment acts separately on its own inputs \(\mathbf{y}_k^{t-1}\), which come from other neurons in the same layer as well as other sublayers in the region (and from other regions in some cases). When a dendrite segment \(k\) has a sufficient *distal overlap* \(o_k^{t}\), reaching or exceeding a threshold \(\lambda_k\), the segment emits a dendritic spike of size \(s_k\). The output of the distal dendrites is then given by:

$$o^{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$
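In code, this sum could be sketched as follows. Representing each segment as an `(overlap, threshold, spike_size)` tuple is an assumption of this example, as are the numbers used.

```python
def distal_output(segments):
    """Sum the spike sizes s_k of all segments whose overlap reaches lambda_k.

    Each segment is a tuple (overlap, threshold, spike_size).
    """
    return sum(spike for overlap, threshold, spike in segments if overlap >= threshold)


# Three hypothetical segments: the first and third reach threshold, the second does not.
segments = [(12, 10, 1.0), (4, 10, 1.0), (10, 10, 0.5)]
print(distal_output(segments))  # 1.5
```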

The predictive potential is combined with the feedforward overlap from the proximal dendrite to give the *total depolarisation rate*:

$$d_j = \frac{\partial V_j}{\partial t} = \alpha_j o^{\textrm{ff}}_{P_j} + \beta_j o^{\textrm{pred}}_{P_j}$$

where \(\alpha_j\) and \(\beta_j\) are parameters which transform the proximal and distal contributions into a rate of change of potential (and also control the relative effects of feedforward and predictive inputs). For the inhibitory sheath \(I\), there is only the feedforward component \(\alpha_I o^{\textrm{ff}}_I\), but we assume this is larger than any of the feedforward contributions \(\alpha_j o^{\textrm{ff}}_{P_j}\) for the pyramidal cells [cite evidence].

Now, the time a neuron takes to reach firing threshold is inversely proportional to its depolarisation rate. This imposes an ordering of the set \(\{P_1..P_n,I\}\) according to their (prospective) firing times \(\tau_{P_j} = \gamma_P \frac{1}{d_j}\) (and \(\tau_I = \gamma_I \frac{1}{d_I}\)).
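To make the ordering concrete, here is a small sketch which computes depolarisation rates and prospective firing times for two pyramidal cells and their sheath. All parameter values (\(\alpha\), \(\beta\), \(\gamma\) and the overlaps) are invented for illustration; in particular, the sheath is given a larger \(\alpha\) simply to reflect the assumption stated above.

```python
def depolarisation_rate(o_ff, o_pred, alpha, beta):
    """d = alpha * o_ff + beta * o_pred (rate of change of potential)."""
    return alpha * o_ff + beta * o_pred


def firing_time(rate, gamma=1.0):
    """tau = gamma / d; a cell with no depolarisation never fires."""
    return float("inf") if rate <= 0 else gamma / rate


# Hypothetical column: P1 is strongly predicted, P2 is not, I has no distal input.
rates = {
    "P1": depolarisation_rate(o_ff=3, o_pred=2.0, alpha=1.0, beta=2.0),   # 7.0
    "P2": depolarisation_rate(o_ff=3, o_pred=0.0, alpha=1.0, beta=2.0),   # 3.0
    "I": depolarisation_rate(o_ff=4, o_pred=0.0, alpha=1.25, beta=0.0),   # 5.0
}

order = sorted(rates, key=lambda name: firing_time(rates[name]))
print(order)  # ['P1', 'I', 'P2']
```

Here the predicted cell beats the sheath, while the unpredicted cell would only fire after it.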

## Formation of the Sparse Distributed Representation (SDR)

Zooming out from the single column to a neighbourhood (or sublayer) \(L_1\) of columns \(C_m\), we see that there is a local sequence \(\mathbb{S}\) in which all the pyramidal cells (and the inhibitory sheaths) would fire if inhibition didn’t take place. The actual sequence of cells which do fire can now be established by taking into account the effects of inhibition.

Let’s partition the sequence as follows:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

where:

- \(\mathbb{P}^{\textrm{pred}}\) is the (possibly empty) sequence of pyramidal cells in a highly predictive state, which fire before their inhibitory sheaths (ie \(\mathbb{P}^{\textrm{pred}} = \{P~|~\tau_P < \tau_{I_m}, P \in C_m\}\));
- \(\mathbb{I}^{\textrm{pred}}\) is the sequence of inhibitory sheaths which fire due to triggering by their contained predictively firing neurons in \(\mathbb{P}^{\textrm{pred}}\) – these cells fire in advance of their feedforward times due to inputs from \(\mathbb{P}^{\textrm{pred}}\);
- \(\mathbb{I}^{\textrm{ff}}\) is the sequence of inhibitory sheaths which fire as a result of feedforward input alone;
- \(\mathbb{P}^{\textrm{burst}}\) is the sequence of cells in columns where the inhibitory sheaths have just fired but their vertical inhibition has not had a chance to reach these cells (this is known as *bursting*) – ie \(\mathbb{P}^{\textrm{burst}} =\{P~|~\tau_P < \tau_{I_m} + \Delta\tau_{\textrm{vert}}, P \in C_m\}\);
- Finally, \(\mathbb{I}^{\textrm{spread}}\) is the sequence of all the other inhibitory sheaths which are triggered by earlier-firing neighbours, which spreads a wave of inhibition imposing sparsity in the neighbourhood.

Note that there may be some overlap in these sequences, depending on the exact sequence of firing and the distances between active columns.

The output of a sublayer is the SDR composed of the pyramidal cells from \(\mathbb{P}^{\textrm{pred}} \parallel \mathbb{P}^{\textrm{burst}}\) in that order. We say that the sublayer has *predicted perfectly* if \(\mathbb{P}^{\textrm{burst}} = \emptyset\) and that the sublayer is *bursting* otherwise.
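The following sketch applies the definitions of \(\mathbb{P}^{\textrm{pred}}\) and \(\mathbb{P}^{\textrm{burst}}\) to a single column. It is a simplification under stated assumptions: it ignores horizontal inhibition between columns, and it assumes that a column with any predictive cell does not burst (its sheath is triggered early and suppresses the rest). The function name and the firing times are hypothetical.

```python
def column_output(tau_cells, tau_sheath, delta_vert):
    """Return (pred, burst): cell indices in firing order for one column.

    pred  : cells with tau_P < tau_I (they beat the sheath).
    burst : cells with tau_P < tau_I + delta_vert, used only when nothing
            in the column was predictive (a simplifying assumption).
    """
    by_time = sorted(range(len(tau_cells)), key=lambda i: tau_cells[i])
    pred = [i for i in by_time if tau_cells[i] < tau_sheath]
    if pred:
        return pred, []
    burst = [i for i in by_time if tau_cells[i] < tau_sheath + delta_vert]
    return pred, burst


# A well-predicted column: cell 0 fires before the sheath.
print(column_output([0.14, 0.5], tau_sheath=0.2, delta_vert=0.15))
# ([0], [])

# A bursting column: nobody beats the sheath, two cells fire before inhibition lands.
print(column_output([0.5, 0.9, 0.3, 0.21], tau_sheath=0.2, delta_vert=0.15))
# ([], [3, 2])
```

Under these assumptions, the sublayer has predicted perfectly when every active column returns an empty `burst` list.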

The cardinality of the SDR is minimal under perfect prediction, with some columns having a sequence of extra, bursting cells otherwise. The bursting columns represent feedforward inputs which were well recognised (causing their inhibitory sheaths to fire quickly) but less well predicted (no cell was predictive enough to beat the sheath), and the number of cells firing indicates the uncertainty of which prediction corresponds to reality. The actual cells which get to burst are representative of the most plausible contexts for the unexpected input.

## Transmission and Reception of SDRs

A sublayer \(L_2\) which receives this \(L_1\) SDR as input will first see the minimal SDR \(\mathbb{P}^{\textrm{pred}}\) representing the perfect match of input and prediction, followed by the bursting SDR elements \(\mathbb{P}^{\textrm{burst}}\) in decreasing order of prediction-reality match.

This favours cells in \(L_2\) which have learned to respond to this SDR, and even more so for the subset which are also predictive due to their own contextual inputs (this biasing happens regardless of whether the receiving cells are proximally or distally innervated). The more sparse (well-predicted) the incoming SDR, the more sparse the activation of \(L_2\).

When there is a bursting component in the SDR, this will tend to add significant (or overwhelming) extra signal to the minimal SDR, leading to high probability of a change in the SDR formed by \(L_2\), because several cells in \(L_2\) will have a stronger feedforward response to the extra inputs than those which respond to the small number of signals in the minimal SDR.

For example, in software we typically use layers containing 2,048 columns of 32 pyramidal neurons (64K cells), with a minimal column SDR of 40 columns (c. 2%). At perfect prediction, the SDR has 40 cells (0.06%), while total bursting would create an SDR of 1280 cells. In between, the effect is quite uneven, since each bursting column produces several signals, while all non-bursting columns stay at one. Assuming some locality of the mapping between \(L_1\) and \(L_2\), this will have dramatic local effects where there is bursting.
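The arithmetic behind these figures can be checked with a few lines of Python. The helper `sdr_size` assumes that every cell in a bursting column fires, which is an upper bound used only for illustration; in the model described here, just some of the cells in a bursting column get to fire.

```python
COLUMNS = 2048
CELLS_PER_COLUMN = 32
ACTIVE_COLUMNS = 40

total_cells = COLUMNS * CELLS_PER_COLUMN
print(total_cells)                                        # 65536
print(round(100 * ACTIVE_COLUMNS / COLUMNS, 2))           # 1.95 (% of columns)
print(round(100 * ACTIVE_COLUMNS / total_cells, 2))       # 0.06 (% of cells)


def sdr_size(bursting_columns):
    """Upper bound on active cells when some of the active columns burst fully."""
    predicted_columns = ACTIVE_COLUMNS - bursting_columns
    return predicted_columns + bursting_columns * CELLS_PER_COLUMN


print(sdr_size(0))    # 40   (perfect prediction)
print(sdr_size(10))   # 350  (a quarter of the active columns bursting)
print(sdr_size(40))   # 1280 (total bursting)
```

Even a quarter of the columns bursting multiplies the number of signals almost ninefold in this upper-bound case, which is why the local effect on \(L_2\) is so uneven.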

The response in \(L_2\) to bursting in its input will not only be a change in the columnar representation, but may also cause bursting in \(L_2\) itself if the new state was not well predicted using \(L_2\)’s context. This will cause bursting to propagate downstream, from sublayer to sublayer (including cycles in feedback loops), until some sublayer can stop the cascade either by predicting its input or by causing a change in its external world which indirectly restores predictability.

Since we typically do not see reverberating, self-reinforcing cycles of bursting in neocortex, we must assume that the brain has learned to halt these cascades using some combination of eventual predictive resolution and remediating output from regions. Note that each sublayer has its own version of “output” in this sense – it’s not just the obvious motor output of L5 which can “change the world”. For example, L6 can output a new SDR which it transmits down to lower regions, changing the high-level context imposed on those regions and thus the environment in which they are trying (and failing somewhat) to predict their own inputs. L6 can also respond by altering its influence over thalamic connections, thus mediating or eliminating the source of disturbance. L2/3 and L5 both send SDRs up to higher regions, which may be able to better handle their deviations from predictability. And of course L5 can cause real changes in the world by acting on motor circuits.

## How is Self-Stabilisation Learned?

When time is slowed down to the extent we’ve seen in this discussion, it is relatively easy to see how neurons can learn to contribute to self-stabilisation of sparse activation patterns in cortex. Recall the general principle of Hebbian learning in synapses – the more often a synapse receives an input within a short time before its cell fires, the more it grows to respond to that input.

Consider again the sequence of firing neurons in a sublayer:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

This sequence does not include the very many cells in a sublayer which do not fire at all. These cells are either in columns which become active but are themselves not fast enough to burst, or (more commonly) in columns inhibited by a spreading wave from active columns. Let’s call this set \(\mathbb{P}^{\textrm{inactive}}\).

A particular neuron will, at any moment, be a member of one of these sets. How often the cell fires depends on the average amount of time it spends in each set, and on the characteristic firing frequency of cells in that set. Clearly, the highly predictive cells in \(\mathbb{P}^{\textrm{pred}}\) will have a higher typical firing frequency than those in \(\mathbb{P}^{\textrm{burst}}\), while those in \(\mathbb{P}^{\textrm{inactive}}\) have zero frequency when in that set.

Note that the numbers used earlier (65536 cells, 40 cells active in perfect prediction, 1280 in total bursting) mean that the percentage of the time cells are firing on average is massively increased if they are in the predictive population. Bursting cells only fire once following a failure of prediction, with the most predictive of them effectively “winning” and firing if the same input persists.

Some cells will simply be “lucky enough” to find themselves in the most predictive set and will strengthen the synapses which will keep them there. Because of their much higher frequency of firing, these cells will be increasingly hard to dislodge and demote from the predictive state.

Some cells will spend much of their time only bursting. This unstable status will cause a bifurcation among this population. A portion of these cells will simply strengthen the right connections and join the ranks of the sparsely predictive cells (which will eliminate their column from bursting on the current inputs). Others will weaken the optimal connections in favour of some other combination of context and inputs (which will drop them from bursting to inactive on current inputs). The remainder, lacking the ability to improve to predictive and the attraction of an alternative set of inputs, will continue to form part of the short-lived bursting behaviour. In order to compete with inactive cells in the same column, these “metastable” cells will have to have an output which tends to feed back into the same state which led to them bursting in the first place.

Cells which get to fire (either predictively or by bursting) have a further advantage – they can specialise their sensitivity to feedforward inputs given the contexts which caused them to fire, and this will give them an ever-improving chance of beating the inhibitory sheath (which has no context to help it learn). This is another mechanism which will allow cells to graduate from bursting to predictive on a given set of inputs (and context).

Since only active cells have any effect in neocortex, we see that there is an emergent “drive” towards stability and sparsity in a sublayer. Cells, given the opportunity, will graduate up the ladder from inactive to bursting to predictive when presented with the right inputs. Cells which fail to improve will be overtaken by their neighbours in the same column, and demoted back down towards inactive. A cell which has recently started to burst (having been inactive on the same inputs) will be reinforced in that status if its firing gives rise to a transient change in the world which causes its inputs to recur. With enough repetition, a cell will graduate to predictive on its favoured inputs, and will participate in a sparse, stable predictive pattern of activity in the sublayer and its region. The effect of its output will correspondingly change from a transient “restorative” effect to a self-sustaining, self-reinforcing effect.