Better Living through Thoughtful Technology

## Blog

• May 21 / 2015
Cortical Learning Algorithm

## The Brain is a Universal Dynamical Systems Computer – Hierarchical Temporal Memory

[Note: This post is a sketch of a paper in progress, due to be completed in May-June 2015.]
I believe we have now discovered the key function of neocortex: it is a machine which uses sensorimotor information from complex systems in the world to build and utilise running simulacra of those systems. The Cortical Learning Algorithm in HTM provides a self-organising structure which can automatically emulate a very large class of real-world phenomena. The design of the neocortex is specifically suited to the task of maintaining a model of the world in the face of nonstationarity in the complex system.

### Nonlinear Dynamics – an Introduction

OK, that’s a lot of jargon, so I’ll illustrate this with an everyday example. Riding a real bicycle on a real road is an extraordinarily difficult task for a classical computer program. If you try to do this the 1950s way, you’d begin by identifying a big system of partial differential equations, and then find a way to solve them numerically in order to control the robot. This turns out to be near impossible in practice, and results in a system which is very brittle and inflexible. There is another approach, however. One very popular method used in robotics and control systems today is PID (proportional-integral-derivative) control, which combines three feedback terms computed from the error between what is sensed and what is wanted.
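To make the PID idea concrete, here is a minimal sketch in Python. It is not the controller used by any particular robot: the class name, the gains and the toy "plant" (a simple first-order system standing in for the bicycle) are all assumptions chosen for illustration.

```python
class PID:
    """Textbook discrete PID controller (gains are illustrative, not tuned for a bicycle)."""

    def __init__(self, kp, ki, kd, setpoint=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        # Integral term: accumulated error over time.
        self.integral += error * dt
        # Derivative term: rate of change of the error (zero on the first call).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps=5000, dt=0.01):
    """Drive a hypothetical plant dx/dt = -x + u towards a setpoint of 1.0."""
    pid = PID(kp=2.0, ki=1.0, kd=0.05, setpoint=1.0)
    x = 0.0
    for _ in range(steps):
        u = pid.update(x, dt)      # sense, then react
        x += (-x + u) * dt         # forward-Euler step of the toy plant
    return x
```

Note that the controller never solves the equations of the plant. It only reacts to the measured error, which is the point of the paragraph above.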

Here’s a cute video of such a system:

What’s happening here is simple. The robot is using its sensors to detect how things are going, and just reacting to the changing sensory data in order to maintain stability.

The robot-controller-bicycle-floor system is an example of a nonlinear dynamical system. The real world we live in is full of such systems, but physics over the past several centuries has tended to avoid them in favour of pretending the world is linear. Much of the physics and applied math we learned in school and college approximates reality with much simpler linear systems. Only in the last century or so (and increasingly since the advent of computer simulations) have we begun to examine nonlinear dynamical systems in any detail.

The most famous recent result from Dynamical Systems Science was the discovery of Chaos, which involves the evolution of apparently unpredictable behaviour in simple, nonlinear, deterministic systems. Apart from vaguely being aware of the idea of chaos, most well-educated people have no real knowledge of how nonlinear systems work, what can be known about them, and how different systems are related. In fact, this has become perhaps the primary field of study in applied mathematics over the past 40 years, and some very clever people have made big progress in understanding these complex, non-intuitive phenomena. We’ll get back to this shortly.

### Dynamical Systems and the Brain

Of course, one of the most interesting systems of this type is to be found in our brains. Often described as “the most complex thing in the known universe,” the brain is indeed a daunting thing to study. Many people have examined neural structures as dynamical systems, and proposed that nonlinear dynamics are key to working out how the brain works. Indeed, a number of researchers have demonstrated that simplified model neural networks can exhibit some of the same kinds of computational properties found in the brain (for example, see Hoerzer et al).

In fact, it appears that the brain looks like a whole bunch of interacting dynamical systems, everywhere you look, and at all scales. Surely this is only going to make things harder to understand? Well, yes and no. Yes, we’re going to have to leave the comfort of our training in seeing everything as linear, and venture into a world of oddness and unpredictability. And no, we actually can – once we take the leap – understand how nonlinear dynamics reveals the true nature of animal intelligence.

### Dynamical Systems and Information

Nonlinear dynamical systems are weird. They can be entirely deterministic (rather than random), but practically unpredictable. They are often critically sensitive to initial (or measured) conditions, so in practice they might never repeat exactly the same sequences again. They may contain huge numbers of “internal variables” (billions, trillions or more), leaving us with no hope of using analytic methods in order to model them.

Yet incredibly, many dynamical systems have a miracle property. They “export” information which we can collect, and this information is often sufficient for us to build a model with the same kinds of dynamics as the original. This discovery was made in the 1970s, the “golden decade” of dynamical systems, and it has been applied again and again in a hugely diverse range of areas.

Here’s a (very old, so murky and scratchy) video by Steve Strogatz and Kevin Cuomo:

So, what’s going on here? Well, the sending circuit is an analog dynamical system which is executing one of the most famous sets of equations in Dynamical Systems – the Lorenz Equations. The details are not important (for this discussion), but essentially the system has three “internal variables” which are coupled together with quite simple differential equations. Here’s an animation of a Lorenz system:

It’s quite beautiful. You can see how there is an elegant kind of structure to the trajectories traced out by the point, and a strange kind of symmetry in the spiralling and twisting of the butterfly-like space it lives in. In fact, this system is infinitely complex and has become the “Hello World” of dynamical systems science.
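For readers who want to generate such a trajectory themselves, here is a minimal sketch of the Lorenz equations integrated with a fixed-step fourth-order Runge-Kutta scheme. The parameter values (sigma = 10, rho = 28, beta = 8/3) are the commonly used chaotic ones; the step size, starting point and function names are my own assumptions.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations for state = (x, y, z)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(f, state, dt):
    """One classical Runge-Kutta (RK4) step for dstate/dt = f(state)."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def trajectory(state=(1.0, 1.0, 1.0), dt=0.01, steps=5000):
    """Return the list of (x, y, z) points visited, including the start."""
    points = [state]
    for _ in range(steps):
        state = rk4_step(lorenz, state, dt)
        points.append(state)
    return points
```

Plotting the `x` and `z` columns of `trajectory()` against each other should give the familiar butterfly shape.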

OK, so the sending system is behaving like a Lorenz System, with certain voltages in the circuit acting like the $$x$$, $$y$$ and $$z$$ coordinates in the animation. The receiving circuit is also a Lorenz emulator, with almost exactly the same setup as the sender (they’re real electronic devices, so they can’t be identical). Now, the trick is to take just one output from the sending circuit (say $$x$$), and use it as the $$x^\prime$$ voltage in the receiving circuit. As Strogatz says in his book, Sync, it’s as if the $$x^\prime$$ has been “taken over” by the signal from the sender. Normally, $$x^\prime$$, $$y^\prime$$ and $$z^\prime$$ work together to produce the elegant trajectory we see in the animation, but now $$x^\prime$$ is simply ignoring its dance partners, who appear to have no choice but to synchronise themselves with the interloper from afar.
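The "taking over" trick can be sketched numerically. This is a simulation under my own assumptions, not a model of the actual circuit: the receiver keeps only its own $$y^\prime$$ and $$z^\prime$$ (called `yr` and `zr` below) and uses the sender's `x` wherever its own $$x^\prime$$ would appear. The starting values and step size are arbitrary.

```python
def coupled(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Sender (x, y, z) plus a receiver (yr, zr) whose x is replaced by the sender's x."""
    x, y, z, yr, zr = state
    return (
        sigma * (y - x),        # sender x
        x * (rho - z) - y,      # sender y
        x * y - beta * z,       # sender z
        x * (rho - zr) - yr,    # receiver y', driven by the sender's x
        x * yr - beta * zr,     # receiver z', driven by the sender's x
    )


def rk4_step(f, state, dt):
    """One classical Runge-Kutta (RK4) step for dstate/dt = f(state)."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def sync_errors(steps=3000, dt=0.01):
    """Track |y - yr| + |z - zr| as the receiver is dragged onto the sender."""
    # The receiver deliberately starts far away from the sender.
    state = (1.0, 1.0, 1.0, -7.0, 30.0)
    errors = []
    for _ in range(steps):
        state = rk4_step(coupled, state, dt)
        _, y, z, yr, zr = state
        errors.append(abs(y - yr) + abs(z - zr))
    return errors
```

In this sketch the error shrinks steadily even though the sender itself is chaotic: the receiver's "dance partners" have no choice but to fall into step with the incoming signal.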

This eerie effect is much more general than you might think. It turns out that, just using a single stream of measurements, you can reconstruct the dynamics of a huge range of systems, without needing any knowledge of the “internal variables” or their equations. This result is based on Takens’ Theorem, which proves this for certain well-behaved systems (such as Lorenz’).
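The reconstruction used with Takens’ Theorem is usually done by delay embedding: you build points out of a measurement and lagged copies of itself. Here is a minimal sketch; the function name and the default embedding dimension and lag are my own assumptions, and choosing good values for a real signal is a topic in itself.

```python
def delay_embed(series, dim=3, tau=8):
    """Turn a scalar time series into points (s[i], s[i + tau], ..., s[i + (dim - 1) * tau])."""
    n = len(series) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this dim and tau")
    return [tuple(series[i + k * tau] for k in range(dim)) for i in range(n)]
```

For example, `delay_embed(xs, dim=3, tau=8)` applied to a recorded stream `xs` of just the $$x$$ variable of a Lorenz system yields a set of three-dimensional points which can be plotted and compared with the original trajectory.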

Here’s a video (with three parts) which explains how this works:

Part One introduces Lorenz’ system. Part Two illustrates Takens’ Theorem, and the final part shows how it can be applied to test for causal connections between time series.

### The Brain as a Universal Dynamical Computer

This phenomenon is the key to what the neocortex is doing. It’s exploiting the information in time series sensory data to build replicas of the dynamics of the world, and using them for identification, forecasting, modelling, communication, and behaviour. Well, that’s nice to know, but it doesn’t explain how it does that. So, let’s do that.

I referred earlier to the work of Gregor Hoerzer, which uses recurrent neural networks (RNNs) to model a few kinds of chaotic computation. RNNs are similar to other kinds of Deep Learning artificial neural networks, which use extremely simple “point neurons”. They differ in that their outputs may end up (after a few hops) as part of their own inputs. This gives RNNs a lot more power than other ANNs, which explains why they’re currently such a hot topic in Machine Learning.

I believe they are so successful right now because they use the tricks we’ve seen and self-organise to represent a simulated dynamics and thus allow for some amount of modelling, prediction and generation. RNNs are powerful, but they lack structure, and they’re very hard for us to understand. Perhaps a more structured type of network would have even more power and (fingers crossed) might be easier to understand and reason about.

### Hierarchical Temporal Memory and the Cortical Learning Algorithm

In Jeff Hawkins’ HTM theory, the point neurons are replaced by far more realistic model neurons, which are much more complex and have significant computational power just on their own. Neurons are packed into columns, and the columns are arranged in layers. This structure is based on detailed study of real neocortex, and is a reasonable, first-order approximation of what you’d see in a real brain.

The key to HTM is that the layers are combined and connected just like in the brain. Each layer in a region (a small area of cortex) has different inputs and performs its own particular role in the computation. I’ve written in some depth about this before, so I’ll just briefly summarise this in the context of dynamical systems.

This rather intimidating diagram is a minimal sketch of the primary computational connections in my multilayer model. It shows the key information flows in a region of neocortex. The “primary” inputs to the region are the red and blue arrows coming in from the bottom and going to Layer 4 (and L6 as well). Here, subpopulations of cells in L4 learn to reconstruct the dynamics of the sensorimotor inputs, and forecast transitions in short timesteps. While L4 is able to predict the upcoming evolution, its representation is being pooled over time by cells in L2 and L3. These cells represent the current dynamical “regime” of the evolving dynamics in L4, which characterises the sensed system at a longer timescale than the fast-changing input to the region.

The output from L2/3 goes up the hierarchy to higher regions, which treat that as a dynamically evolving sensory input, and repeat the same process. In addition, this output goes to L5, which combines it with other inputs (from L1 and L6) and produces behaviour which has been learned to interact with the world in order to preserve or recover prediction in the entire region (see here for the mechanisms of self-stabilisation in this system).

The key thing here is that subpopulations of neurons are capable of learning to model the dynamics of the world at many timescales, and that changes of the characteristics of the real-world system cause changes in the choice of subpopulation, which is then picked up in downstream layers, leading to a new representation of the world by the region and also a motor or behavioural reaction to the new dynamics.

The other pathways in the diagram are crucial to both the learning of dynamical modelling and perception itself. The higher-level regions provide even slower-changing inputs to both L2/3 and L5, representing the more stable “state” they are working with, and assisting these cells to maintain a consistent picture of the world in the face of uncertainty and noise.

### References (to be completed)

Gregor M. Hoerzer, Robert Legenstein, and Wolfgang Maass. Emergence of Complex Computational Structures From Chaotic Neural Networks Through Reward-Modulated Hebbian Learning. Cerebral Cortex (2014) 24 (3): 677-690. First published online November 11, 2012. doi:10.1093/cercor/bhs348.

• Jan 02 / 2015

## Self-Stabilisation in Hierarchical Temporal Memory

This post was written in response to Jeff Hawkins’ comments on last week’s article on a new Multilayer Model of Neocortex in Hierarchical Temporal Memory (HTM). Jeff expressed concerns about the clarity or correctness of my claim that sublayers in a cortical region act to self-stabilise in the face of unpredicted changes in the world (including changes in top-down feedback from higher regions). This discussion is a companion to an earlier description of the Efficiency of Predicted Sparseness, but goes into much more detail when describing how a non-sparse output from one sublayer is absorbed and processed by downstream sublayers.

In the earlier posts, we described how each sublayer in a region combines context inputs with feedforward inputs to form a sparse, predicted representation of the world in context. When this succeeds perfectly, each column in the sublayer has only a single active cell, and that cell represents the best combination of prediction from context and recognition of the feedforward input. The single-cell-per-column representation occurs when the single cell is sufficiently depolarised by distal (predictive/context) inputs to beat its columnar inhibitory sheath and fire first. If this does not happen, then the sheath fires first, allowing some number of contained pyramidal cells to fire before vertical inhibition reduces the column’s activity to just the one, best-predicted cell.

In order to understand the stabilising effect, we need to zoom in temporally and watch how the potentials evolve in extreme “slow-motion” in which the time steps correspond to individual synaptic events. At this framerate, we can observe the individual neurons’ potentials rising towards firing and the effect of inhibition both vertically and horizontally on the patterns of activation. This level of granularity also allows us to characterise the opportunities for synapses to adapt, which turns out to be crucial for understanding the model.

Synapses grow when there is a temporal correlation between their pre-synaptic inputs and the action potentials of the post-synaptic cell. The more often the cell fires within a short (c. 10ms) window of time after the synapse receives an action potential, the bigger and more receptive the synapse grows. In HTM, we model this with a scalar value we call permanence, which varies between 0.0 and 1.0, and we say that the synapse is connected when its permanence is above a threshold (usually 0.2), otherwise it is disconnected.
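As a rough sketch of how permanence can be modelled in software: the 0.2 threshold and the 0.0 to 1.0 range come from the description above, while the increment and decrement sizes, the function names, and the rule of weakening a synapse whose input was silent when the cell fired are assumptions for illustration.

```python
CONNECTED_THRESHOLD = 0.2  # "usually 0.2", as stated in the text


def update_permanence(permanence, pre_active, post_fired, inc=0.05, dec=0.02):
    """Hebbian-style update of one synapse's permanence, clipped to [0.0, 1.0].

    inc and dec are hypothetical step sizes, not values taken from the post.
    """
    if not post_fired:
        return permanence  # no learning opportunity unless the cell fires
    delta = inc if pre_active else -dec
    return min(1.0, max(0.0, permanence + delta))


def is_connected(permanence):
    """A synapse counts as connected when its permanence is above the threshold."""
    return permanence > CONNECTED_THRESHOLD
```

A synapse sitting just below threshold (say 0.18) becomes connected after a single correlated firing in this sketch, while one which is repeatedly silent when its cell fires drifts back towards zero.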

The current “official” Cortical Learning Algorithm (or CLA, the detailed computational model in HTM) separates feedforward and predictive stages of processing. A modification of this model (which I call prediction-assisted recognition or paCLA) combines these into a single step involving competition between highly predictive pyramidal cells and their surrounding columnar inhibitory sheaths. Though this has been described in summary form before, I’ll go through it in detail here.

Neural network models generally model a neuron as somehow “combining” a set of inputs to produce an output. This is based on the idea that input signals cause ion currents to flow into the neuron’s cell body, which raises its voltage (depolarises), until it reaches a threshold level and fires (outputs a signal). paCLA also models this idea, with the added complication that there are two separate pathways (proximal and distal) for input signals to be converted into effects on the voltage of the cell. In addition, paCLA treats the effect of the inputs as a rate of change of potential, rather than as a final potential level as found in standard CLA.

## Slow-motion Timeline of paCLA

[Note: this section relates to Mathematics of HTM Part I and Part II – see those posts for a full treatment].

Consider a single column of pyramidal cells in a layer of cortex. Along with the set of pyramidal cells $$\{P_1,P_2 .. P_n\}$$, we also model a columnar sheath of inhibitory cells as a single cell $$I$$. All the $$P_i$$ and $$I$$ are provided with the same feedforward input vector $$\mathbf{x}_t$$, and they also have similar (but not necessarily identical) synaptic connection vectors $$\mathbf{c}_{P_i}$$ and $$\mathbf{c}_{I}$$ to those inputs (the bits of $$\mathbf{x}_t$$ are the incoming sensory activation potentials, while bit $$j$$ of a connection vector $$\mathbf{c}$$ is 1 if synapse $$j$$ is connected). The feedforward overlap $$o^{\textrm{ff}}_{P_i}(\mathbf{x}_t) = \mathbf{x}_t \cdot \mathbf{c}_{P_i}$$ is the output of the proximal dendrite of cell $${P_i}$$ (and similarly for cell $$I$$).

In addition, each pyramidal cell (but not the inhibitory sheath) receives signals on its distal dendrites. Each dendrite segment acts separately on its own inputs $$\mathbf{y}_k^{t-1}$$, which come from other neurons in the same layer as well as other sublayers in the region (and from other regions in some cases). When a dendrite segment $$k$$ has a sufficient distal overlap, exceeding a threshold $$\lambda_k$$, the segment emits a dendritic spike of size $$s_k$$. The output of the distal dendrites is then given by:

$$o^{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$
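In code, both kinds of dendrite reduce to a few lines. This is a sketch under the assumption that inputs and connection vectors are lists of 0/1 integers; the function names are hypothetical.

```python
def overlap(inputs, connections):
    """Dot product of a binary input vector with a binary connection vector."""
    return sum(x & c for x, c in zip(inputs, connections))


def distal_output(segment_overlaps, thresholds, spike_sizes):
    """Sum the dendritic spike sizes s_k of segments whose overlap reaches lambda_k."""
    return sum(
        s
        for o, lam, s in zip(segment_overlaps, thresholds, spike_sizes)
        if o >= lam
    )
```

The same `overlap` function serves for the proximal dendrite (applied to $$\mathbf{x}_t$$ and $$\mathbf{c}_{P_i}$$) and for each distal segment (applied to its own inputs and connections), after which `distal_output` applies the thresholds.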

The predictive output is combined with the feedforward overlap from the proximal dendrite to give the total depolarisation rate:

$$d_j = \frac{\partial V_j}{\partial t} = \alpha_j o^{\textrm{ff}}_{P_j} + \beta_j o^{\textrm{pred}}_{P_j}$$

where $$\alpha_j$$ and $$\beta_j$$ are parameters which transform the proximal and distal contributions into a rate of change of potential (and also control the relative effects of feedforward and predictive inputs). For the inhibitory sheath $$I$$, there is only the feedforward component $$\alpha_I o^{\textrm{ff}}_I$$, but we assume this is larger than any of the feedforward contributions $$\alpha_j o^{\textrm{ff}}_{P_j}$$ for the pyramidal cells [cite evidence].

Now, the time a neuron takes to reach firing threshold is inversely proportional to its depolarisation rate. This imposes an ordering of the set $$\{P_1..P_n,I\}$$ according to their (prospective) firing times $$\tau_{P_j} = \gamma_P \frac{1}{d_j}$$ (and $$\tau_I = \gamma_I \frac{1}{d_I}$$).
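The race within a column can be sketched as follows. The values of `alpha`, `beta` and `gamma` here are placeholders (the text does not give numbers), and the helper names are my own.

```python
def depolarisation_rate(o_ff, o_pred=0.0, alpha=1.0, beta=0.5):
    """d = alpha * o_ff + beta * o_pred (the sheath has no predictive term)."""
    return alpha * o_ff + beta * o_pred


def firing_time(rate, gamma=1.0):
    """tau = gamma / d; a cell with no depolarisation never reaches threshold."""
    return float("inf") if rate <= 0 else gamma / rate


def firing_order(rates, gamma=1.0):
    """Return cell names sorted by prospective firing time, earliest first."""
    return sorted(rates, key=lambda name: firing_time(rates[name], gamma))
```

For instance, with a sheath rate of 12, an unpredicted cell at 10 and a cell whose feedforward rate of 10 is boosted by a predictive contribution of 4, `firing_order` puts the predicted cell first, then the sheath, then the unpredicted cell.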

## Formation of the Sparse Distributed Representation (SDR)

Zooming out from the single column to a neighbourhood (or sublayer) $$L_1$$ of columns $$C_m$$, we see that there is a local sequence $$\mathbb{S}$$ in which all the pyramidal cells (and the inhibitory sheaths) would fire if inhibition didn’t take place. The actual sequence of cells which do fire can now be established by taking into account the effects of inhibition.

Let’s partition the sequence as follows:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

where:

1. $$\mathbb{P}^{\textrm{pred}}$$ is the (possibly empty) sequence of pyramidal cells in a highly predictive state, which fire before their inhibitory sheaths (ie $$\mathbb{P}^{\textrm{pred}} = \{P~|~\tau_P < \tau_{I_m}, P \in C_m\}$$);
2. $$\mathbb{I}^{\textrm{pred}}$$ is the sequence of inhibitory sheaths which fire due to triggering by their contained predictively firing neurons in $$\mathbb{P}^{\textrm{pred}}$$ – these cells fire in advance of their feedforward times due to inputs from $$\mathbb{P}^{\textrm{pred}}$$;
3. $$\mathbb{I}^{\textrm{ff}}$$ is the sequence of inhibitory sheaths which fire as a result of feedforward input alone;
4. $$\mathbb{P}^{\textrm{burst}}$$ is the sequence of cells in columns where the inhibitory sheaths have just fired but their vertical inhibition has not had a chance to reach these cells (this is known as bursting) – ie $$\mathbb{P}^{\textrm{burst}} =\{P~|~\tau_P < \tau_{I_m} + \Delta\tau_{\textrm{vert}}, P \in C_m\}$$;
5. Finally, $$\mathbb{I}^{\textrm{spread}}$$ is the sequence of all the other inhibitory sheaths which are triggered by earlier-firing neighbours, which spreads a wave of inhibition imposing sparsity in the neighbourhood.

Note that there may be some overlap in these sequences, depending on the exact sequence of firing and the distances between active columns.

The output of a sublayer is the SDR composed of the pyramidal cells from $$\mathbb{P}^{\textrm{pred}} \parallel \mathbb{P}^{\textrm{burst}}$$ in that order. We say that the sublayer has predicted perfectly if $$\mathbb{P}^{\textrm{burst}} = \emptyset$$ and that the sublayer is bursting otherwise.

The cardinality of the SDR is minimal under perfect prediction, with some columns having a sequence of extra, bursting cells otherwise. The bursting columns represent feedforward inputs which were well recognised (causing their inhibitory sheaths to fire quickly) but less well predicted (no cell was predictive enough to beat the sheath), and the number of cells firing indicates the uncertainty of which prediction corresponds to reality. The actual cells which get to burst are representative of the most plausible contexts for the unexpected input.
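Here is a sketch of how a single column's contribution to the SDR could be computed from the firing times. It simplifies the description above in one respect, which I label as an assumption: a column with any predictive cell is treated as not bursting at all, since its sheath is triggered early. Horizontal (spreading) inhibition between columns is left out entirely.

```python
def column_output(tau_cells, tau_sheath, delta_vert):
    """Return (predictive, bursting) cell names for one column, in firing order.

    tau_cells  : dict mapping cell name to prospective firing time
    tau_sheath : feedforward firing time of the column's inhibitory sheath
    delta_vert : delay before vertical inhibition reaches the cells
    """
    # Cells which beat the sheath fire predictively.
    pred = sorted((t, name) for name, t in tau_cells.items() if t < tau_sheath)
    if pred:
        return [name for _, name in pred], []
    # Otherwise, cells inside the window after the sheath fires get to burst.
    burst = sorted(
        (t, name)
        for name, t in tau_cells.items()
        if tau_sheath <= t < tau_sheath + delta_vert
    )
    return [], [name for _, name in burst]
```

The bursting list comes out in decreasing order of predictive support, because the better-predicted cells have the shorter firing times.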

## Transmission and Reception of SDRs

A sublayer $$L_2$$ which receives this $$L_1$$ SDR as input will first see the minimal SDR $$\mathbb{P}^{\textrm{pred}}$$ representing the perfect match of input and prediction, followed by the bursting SDR elements $$\mathbb{P}^{\textrm{burst}}$$ in decreasing order of prediction-reality match.

This favours cells in $$L_2$$ which have learned to respond to this SDR, and even more so for the subset which are also predictive due to their own contextual inputs (this biasing happens regardless of whether the receiving cells are proximally or distally innervated). The more sparse (well-predicted) the incoming SDR, the more sparse the activation of $$L_2$$.

When there is a bursting component in the SDR, this will tend to add significant (or overwhelming) extra signal to the minimal SDR, leading to high probability of a change in the SDR formed by $$L_2$$, because several cells in $$L_2$$ will have a stronger feedforward response to the extra inputs than those which respond to the small number of signals in the minimal SDR.

For example, in software we typically use layers containing 2,048 columns of 32 pyramidal neurons (64K cells), with a minimal column SDR of 40 columns (c. 2%). At perfect prediction, the SDR has 40 cells (0.06%), while total bursting would create an SDR of 1280 cells. In between, the effect is quite uneven, since each bursting column produces several signals, while all non-bursting columns stay at one. Assuming some locality of the mapping between $$L_1$$ and $$L_2$$, this will have dramatic local effects where there is bursting.
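The arithmetic behind these figures is easy to check. In the sketch below, the layer dimensions come from the text; the `sdr_size` helper and its assumption that a bursting column fires a fixed number of cells (all 32 by default) are mine.

```python
COLUMNS = 2048
CELLS_PER_COLUMN = 32
ACTIVE_COLUMNS = 40

TOTAL_CELLS = COLUMNS * CELLS_PER_COLUMN            # 65536, i.e. 64K
COLUMN_SPARSITY = ACTIVE_COLUMNS / COLUMNS          # roughly 2%
MINIMAL_CELL_SPARSITY = ACTIVE_COLUMNS / TOTAL_CELLS  # roughly 0.06%


def sdr_size(bursting_columns, cells_per_burst=CELLS_PER_COLUMN):
    """Number of active cells when some of the active columns are bursting."""
    if not 0 <= bursting_columns <= ACTIVE_COLUMNS:
        raise ValueError("bursting_columns must be between 0 and ACTIVE_COLUMNS")
    predicted = ACTIVE_COLUMNS - bursting_columns   # one cell each
    return predicted + bursting_columns * cells_per_burst
```

Even a single fully bursting column takes the SDR from 40 cells to 71 in this sketch, which illustrates how uneven the effect is between the two extremes.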

The response in $$L_2$$ to bursting in its input will not only be a change in the columnar representation, but may also cause bursting in $$L_2$$ itself if the new state was not well predicted using $$L_2$$’s context. This will cause bursting to propagate downstream, from sublayer to sublayer (including cycles in feedback loops), until some sublayer can stop the cascade either by predicting its input or by causing a change in its external world which indirectly restores predictability.

Since we typically do not see reverberating, self-reinforcing cycles of bursting in neocortex, we must assume that the brain has learned to halt these cascades using some combination of eventual predictive resolution and remediating output from regions. Note that each sublayer has its own version of “output” in this sense – it’s not just the obvious motor output of L5 which can “change the world”. For example, L6 can output a new SDR which it transmits down to lower regions, changing the high-level context imposed on those regions and thus the environment in which they are trying (and failing somewhat) to predict their own inputs. L6 can also respond by altering its influence over thalamic connections, thus mediating or eliminating the source of disturbance. L2/3 and L5 both send SDRs up to higher regions, which may be able to better handle their deviations from predictability. And of course L5 can cause real changes in the world by acting on motor circuits.

## How is Self-Stabilisation Learned?

When time is slowed down to the extent we’ve seen in this discussion, it is relatively easy to see how neurons can learn to contribute to self-stabilisation of sparse activation patterns in cortex. Recall the general principle of Hebbian learning in synapses – the more often a synapse receives an input within a short time before its cell fires, the more it grows to respond to that input.

Consider again the sequence of firing neurons in a sublayer:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

This sequence does not include the very many cells in a sublayer which do not fire at all, because they are contained either in columns which become active, but are not fast enough to burst, or more commonly they are in columns inhibited by a spreading wave from active columns. Let’s call this set $$\mathbb{P}^{\textrm{inactive}}$$.

A particular neuron will, at any moment, be a member of one of these sets. How often the cell fires depends on the average amount of time it spends in each set, and how often a cell fires characteristically for each set. Clearly, the highly predictive cells in $$\mathbb{P}^{\textrm{pred}}$$ will have a higher typical firing frequency than those in $$\mathbb{P}^{\textrm{burst}}$$, while those in $$\mathbb{P}^{\textrm{inactive}}$$ have zero frequency when in that set.

Note that the numbers used earlier (65536 cells, 40 cells active in perfect prediction, 1280 in total bursting) mean that the percentage of the time cells are firing on average is massively increased if they are in the predictive population. Bursting cells only fire once following a failure of prediction, with the most predictive of them effectively “winning” and firing if the same input persists.

Some cells will simply be “lucky enough” to find themselves in the most predictive set and will strengthen the synapses which will keep them there. Because of their much higher frequency of firing, these cells will be increasingly hard to dislodge and demote from the predictive state.

Some cells will spend much of their time only bursting. This unstable status will cause a bifurcation among this population. A portion of these cells will simply strengthen the right connections and join the ranks of the sparsely predictive cells (which will eliminate their column from bursting on the current inputs). Others will weaken the optimal connections in favour of some other combination of context and inputs (which will drop them from bursting to inactive on current inputs). The remainder, lacking the ability to improve to predictive and the attraction of an alternative set of inputs, will continue to form part of the short-lived bursting behaviour. In order to compete with inactive cells in the same column, these “metastable” cells will have to have an output which tends to feed back into the same state which led to them bursting in the first place.

Cells which get to fire (either predictively or by bursting) have a further advantage – they can specialise their sensitivity to feedforward inputs given the contexts which caused them to fire, and this will give them an ever-improving chance of beating the inhibitory sheath (which has no context to help it learn). This is another mechanism which will allow cells to graduate from bursting to predictive on a given set of inputs (and context).

Since only active cells have any effect in neocortex, we see that there is an emergent “drive” towards stability and sparsity in a sublayer. Cells, given the opportunity, will graduate up the ladder from inactive to bursting to predictive when presented with the right inputs. Cells which fail to improve will be overtaken by their neighbours in the same column, and demoted back down towards inactive. A cell which has recently started to burst (having been inactive on the same inputs) will be reinforced in that status if its firing gives rise to a transient change in the world which causes its inputs to recur. With enough repetition, a cell will graduate to predictive on its favoured inputs, and will participate in a sparse, stable predictive pattern of activity in the sublayer and its region. The effect of its output will correspondingly change from a transient “restorative” effect to a self-sustaining, self-reinforcing effect.

• Dec 17 / 2014
Cortical Learning Algorithm

## Multilayer Model for Hierarchical Temporal Memory

This post sketches a simple model for multilayer processing in Hierarchical Temporal Memory (HTM). It is based on a combination of Jeff Hawkins’ and Numenta’s current work on sensorimotor extensions to HTM, my previous ideas on efficiency of predicted sparseness as well as evidence from neuroscience.

HTM has entered a new phase of development in the past year. Hawkins and his colleagues are currently extending HTM from a single-layer sensory model (assumed to represent high-order memory in Layer 2/3 of cortex) to a sensorimotor model which involves Transition Memory of combined sensory and motor inputs in L4, which is Temporally Pooled in L2/3. Once this is successfully modelled, the plan is to examine the role of L5 and L6 in motor behaviour and feedback.

Recent research in neuroscience has significantly improved our understanding of the various pathways in cortical circuits. [Douglas & Martin, 2004] proposed a so-called canonical pathway in which thalamic inputs arrive in L4, which projects to L2/3 (which sends its output to higher regions), then to L5 (which outputs motor signals) and from there to L6 (which outputs feedback to lower layers and thalamus). Teams led by Randy Bruno [deKock et al, 2007], [Constantinople & Bruno, 2013] have found that there is also a parallel circuit thalamus-L5-[L6 and L4] as well as an L3-L4 feedback pathway.

Figure 1, which is from [deKock et al, 2007], shows the calculated temporal pattern of activity in a piece of rat barrel cortex (called D2) consisting of about 9000 neurons. Barrel cortex is so named because the neurons responsive to a single Primary Whisker (PW) form a barrel-like columnar structure in this part of rat cortex. The paper estimates the layer populations in this “column” to be 3200 L2/3, 2050 L4, 1100 L5A, 1050 L5B and 1200 L6 excitatory cells.

Figure 1. Evolution of Action Potential (AP) rates in rat barrel cortex when experimenters stimulate the associated whisker. VPM is the thalamic region which projects to this part of cortex. From [deKock et al, 2007].

We’ll examine this data from the point of view of HTM. Firstly, we see that the spontaneous activity in all layers is very sparse (0.3% in L2/3, 0.6% in L4, 1.1% in L5A, 3% in L5B and 0.5% in L6), and that activity rises and falls dramatically and differently in each layer over the 150ms following stimulation.

Looking at the first 10ms and only in L4-L2/3, we see the expected sparse activations in L4 and L3, which are followed by a dramatic increase (x17 in L4, x10 in L2/3) representing bursting in both layers, likely because the input was unpredicted. Over the next 20ms, activity in L2/3 drops sharply back to 2x the baseline, but that in L4, after 10ms of dropping, rises again to practically match the original activation. This is matched in the next 10ms by a rise in L2/3 activation, after which both levels drop gradually towards the baseline over more than 100ms. We see another, somewhat different “wavelike” response pattern in the L5/6 complex.

So, can we build a model using HTM principles which explains this data (and, even better, predicts other unseen data)? I believe there must be such a model, because we see this kind of processing everywhere we look in cortex.

Before we get to that, let’s identify some important principles which arise from our current understanding of cortical function.

### I: A Race to Represent

The first principle is that a population of neurons which share a common set of inputs is driven to “best represent” its inputs using a competitive inhibition process. Each neuron is accumulating depolarising input current from a unique set of contextual and immediate sources, and the first to fire will inhibit its neighbours and form part of the representation.

Each neuron can thus be seen as analogous to a “microtheory” of its world, and it will accumulate evidence from past context, current sensory inputs, and behaviour to compete in a race for its theory to be “most true”.

### II: Different Sources of Evidence

The purpose of the layered structure of neocortex is to allow each population to combine its own individual evidence sources and learn to represent the “theory” of that evidence. The various populations (or sublayers) form a cyclic graph structure of evidence flow, and they cooperate to form a stable, predictable, sensorimotor model of the current world.

### III: Efficiency of Predictive Sparseness

Each neuron combines contextual or predictive inputs (on distal synapses) with evidence from immediate sources (on proximal synapses). In addition, the columnar inhibitory sheath is also racing to recognise its inputs, which come largely from the same feedforward sources as its contained pyramidal cells. The sheath has an advantage as it is a better responder [cite] to the feedforward evidence alone than any of its contained cells, so there is also a race between predictive assisted recognition and simple spatial recognition of reality.

The result of the race depends on which wins – if a single pyramidal cell wins due to high predictive depolarisation (lots of contextual evidence), then it alone will fire. Otherwise, there is a short window of time which allows some number of the most predictive cells in the column to fire in turn, before they are inhibited by a vertical process. This “bursting” encodes the difference between the reality (as signalled by this column’s inhibitory sheath firing) and the population’s prediction (as would have been signalled by a highly predictive cell in some losing nearby column).
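
The two outcomes of this race can be sketched for a single column as follows. This is a simplification under stated assumptions: predictive depolarisation is a number between 0 and 1 per cell, and the cut-off `strong_prediction` and the window length `burst_size` are hypothetical parameters standing in for the timing of the inhibitory sheath and the vertical inhibition.

```python
def column_response(predictive_depolarisation, strong_prediction=0.5, burst_size=3):
    """Decide which cells in one column fire once its feedforward input arrives.

    predictive_depolarisation: one value per cell (hypothetical units, 0..1).
    Returns (active_cell_indices, bursting_flag).
    """
    # Rank cells from most to least predictive (ties keep their original order).
    ranked = sorted(
        range(len(predictive_depolarisation)),
        key=lambda i: predictive_depolarisation[i],
        reverse=True,
    )
    best = ranked[0]
    if predictive_depolarisation[best] >= strong_prediction:
        # A well-predicted cell beats the inhibitory sheath and fires alone.
        return [best], False
    # No cell was predictive enough: the most predictive cells fire in turn
    # until vertical inhibition closes the window (a burst).
    return ranked[:burst_size], True

print(column_response([0.1, 0.8, 0.2, 0.0]))   # predicted input
print(column_response([0.1, 0.3, 0.2, 0.0]))   # unpredicted input
```

The first call yields a single active cell; the second yields a burst of the three most predictive cells, which is the signal that reality differed from the population's prediction.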

### IV: Self-stabilisation through Sparse Patterns

If we consider a cortical region in its “steady state”, we see highly sparse (non-bursting) representations everywhere, and the behavioural output (from Layer 5) will be a sequence of highly sparse patterns which result in very fine motor adjustments (or none at all). This corresponds to the region perfectly modelling the sensorimotor world it experiences and making optimal predictions with minimal corrective behaviour.

A deviation from this state (failure of prediction) leads to a partial change in representation (because reality differs from prediction) and some amount of redundant predictive representation (when several cells burst in new columns). This departure from maximal sparseness is transmitted to the downstream sublayers, causing their “view of the world” and thus their own state to change. Depending on how well each sublayer can predict these changes, the cascade may halt, or instead continue to roll around the cyclical graph of sublayers, causing behavioural side-effects as it goes.

### V: A Team of Rivals – “Explaining Change” by Witnessing or Acting

Within each sublayer, some cells will have inputs which correspond to “observing” the world as it evolves on its own (by predicting from context), while others will respond better when the organism is taking certain actions, and will have learned to associate certain changes with those behaviours. The representation in each sublayer will be some mixture of these, and, in the case of motor output cells in L5, the “decisions” of the region will be those which restore the predictability of things.

The reason is simple. While the activity in the region is sparse, all the active cells are predicting their activity, and the outputs of the region reflect this happy condition. These include motor output, which by definition is acting to prolong the current status of the region (if it were acting to depart from that status, these motor cells would no longer be firing).

When something changes, new neurons become temporarily active throughout the various sublayers, but they will all be cells which have learned to respond better to the new state of the world than the previously active cells. These cells will have learned to associate their own activity with the new situation, by being more right about predicting their own activity in this new context. And this, in turn, will be true only if they are the long-term winners in the establishment of a new, stable cycle of sparse activity, or alternatively if they have regularly participated in the transition to a new stable state. Either way, the system is self-stabilising, acting to right itself and improve the prediction.

## A Multilayer Cortical Model

I claim that the above principles are enough to construct a simple model of how the sublayers in a region of cortex interact and co-operate.

I use the word “sublayers” because each layer (L1-6) may contain more than one population or class of neurons. We’ll pretend these are each in their own sublayer, while recognising that there are local connections between cells in sublayers which are important to how things work.

So as not to confuse, I’ll not use the common notation for sublayers found in the literature (e.g. L5A); instead, I’ll use labels such as L5.1, L5.2 and so on. The “minor number” will usually indicate sublayers successively “far away” from the sensorimotor inputs, both in terms of time and the number of neurons in the path to reach them. I’ll also use the deKock diagram above to anchor the place and time of each part of the response to a large sensory stimulus.

I’ll also assume the idea that when a neuron projects an axon, it does so in order to connect proximally with its target. Thus, L4 projections to L2/3 are proximal on L2/3 cells, likewise with L6 to L4, while the L2/3->L4 feedback pathway uses distal dendrites.

### Layer 4.1 – Sensorimotor Transition Prediction (0-20ms)

Layer 4 is said [cite] to receive inputs from L6 (65%), elsewhere in L4 (25%), and directly from thalamus (5%). In addition, some cells in L4 have distal dendrites in L2/3. We’ll split L4 into two sublayers, depending on whether their cells receive inputs from L2/3 (L4.1 no, L4.2 yes). Some researchers [cite] divide L4 into two populations – stellate cells and pyramidal cells, and it may be that the split is along these lines.

My hypothesis is that L4.1 cells are making predictions of sensorimotor transitions, using thalamic sensorimotor input as (primarily) feedforward, and a combination of local predictive context (L4) and information about the region’s current sensorimotor output (from L6). I say “primarily” because a single feedforward axon could synapse with a cell both on its proximal and distal dendrites, and this would be even more important for the stellate dendritic branches of L4.1 cells.

Note that the L4 inputs to L4.1 include evidence of the output of L2/3 (a more stable “sensory” representation) via L4.2. The L6-sourced inputs also include evidence of the stable feedback pattern being sent to lower regions, which are themselves indirectly influenced by L5’s use of L2/3 (see later).

So, L4.1 is receiving fast-changing sensorimotor inputs, along with slower-changing context from within L4, and both sensory and motor outputs of the region. It uses whatever best evidence it has to predict any transitions in the thalamic input.

Successful prediction in L4.1 results in it outputting a highly sparse pattern on each transition. Failures in prediction are encoded as a union of “nearly predicted” cell activations in the columns best recognising the unpredicted thalamic input.

This might not seem sensible when thalamic inputs are only 5% of what L4.1 is receiving, but remember that the other inputs are usually highly sparse (1-2%) and change much more slowly, so thalamic feedforward input to L4.1 acts as a tiebreaker among predictions. This pattern is repeated throughout cortex because bursting cells cause a similar disruptive, temporary tiebreaking signal in downstream sublayers.

### Layers 3.1 and 2.1 – Temporal Pooling (10-20ms)

Layers 2 and 3 are usually treated as one. Both receive most of their feedforward input from L4 and have distal inputs both from within L2/3 and from L1 (which gets feedback input from L6 in higher regions).

I’ll split the two by saying that L2 gets more input from L1 than L3 does. In other words, L2 is more primed or biased by higher-level context, while L3 is less likely to be dominated by feedback. There is evidence [cite] of this differentiation, so let’s assume it’s useful.

Now, L2.1/L3.1 are receiving feedforward inputs from L4.1. If those inputs are sparse, then only those cells in L2/3 which have many active inputs will be part of the SDR in this layer (it’s one layer in a column sense, just the L2 “end” has a higher L1 input mix). In addition, they’ll need good intralayer and/or top-down predictive input to maintain stable activity.

The stability in L2.1/3.1 comes from the combination of stable predictive inputs from within the layer and from above. This prebiases predictive cells to recognise the successive sparse inputs from L4.1 and continue to remain active. The active cells in L2/3 have learned to use a combination of sequence memory (intralayer) and top-down feedback to associate with each fast-changing SDR in L4.1. This mechanism is reinforced by the fast L4.1-L2/3.1-L4.2-L4.1 feedback loop, along with the much longer feedback loops.

This is where the L2/3 difference is important. The more superficial cells in L2/3 are more strongly biased by top-down feedback from L1. We have evidence [cite] that L2 projects more strongly to the deep part of L5, while L3 projects more to superficial L5. Thus, the choices of active cells in L2/3 encode how much sequence memory and how much top-down are involved in the representation.

### L6.1 – Comparing Reality with Expectations from Behaviour (0-10ms)

[Constantinople and Bruno], among others, show that direct thalamic inputs arrive simultaneously at L4 and L5/L6, suggesting that L5/6 and L4/L2/3 are performing parallel operations on sensorimotor inputs. While the L4-L2/3 system is relatively simple (at least at first order approximation), the L5/6 system is much more complex, involving a larger number of functional populations with diverse purposes. I’ll describe a minimum of these for now.

Layer 6.1 cells are the first in L5/6 to respond to thalamic inputs, suggesting a role analogous to L4.1. Unlike L4 cells, however, these cells have immediate access to both the recent L6 output to lower regions (representing the current steady state of the region) and the current motor output of the region (from L5). This much richer set of evidence sources allows L6.1 to make finer-grained predictions of the expected thalamic inputs, and its response when prediction fails is the primary driver for changes in L5 motor output and signals to higher regions.

### L5.1 – Responding to Change by Acting (0-20ms)

I speculate that the thick-tufted L5B cells correspond to L5.1 in my model. These cells also receive direct thalamic inputs, as well as inputs from L6, L2/3 (primarily the L2 “end”) and top-down feedback via L1. L5.1’s purpose is to act quickly if necessary, in response to a significant change in its world. Any dramatic change in either sensorimotor patterns or context will cause L5 to output a large, non-sparse signal which it has learned is appropriate to that change.

In the steady state, with all inputs sparse, L5.1 generates a minimal, sparse signal which corresponds to energetically efficient, smooth behaviour in the organism. Sudden (unpredictable) changes in either sensorimotor inputs (thalamic), correspondence between behaviour and outcomes (L6), sequence memory predictions (L2/3) or top-down “instructions” (L1) will cause a dramatic rise in output (from 3% to over 10% active cells) which results in new corrective motor behaviour as well as an alarm signal to higher layers.

### L6.2 – Co-ordination of Responses (10-30ms)

In Layer 6, a second population of cells is responsible for integrating any rising activity in L5.1 with context, signalling L4 of the new situation, and affecting the L6 feedback output. The better L6.2 can predict/recognise the output of L5, the sparser its signal to L4 and the smaller the effect on L6 feedback output. Thus, L6.2 acts either to help L4 make good predictions of transitions (by sending sparse signals), or to disrupt steady-state prediction in L4 (and later L2/3) into a new sensorimotor regime.

### L4.2 and L2/3: Stabilising Prediction (30-50ms)

After 30ms or so, pyramidal cells in L4 are sampling the “sensory” response of L2/3 along with signals from L6 about the motor response. L4.2 can now generate a signal for L2/3 which is more sparse than the initial L4.1 response, but still well above baseline. Over the next 20-50ms, L4.2 and L2/3 use this feedback loop (along with the L5/6 motor loop) to reduce their activity and settle into a steady predictive state.

I propose that it is these L4.2 cells which participate in the steady-state activity of L4, along with the L5.2 cells (next section). L4.1 and L5.1 are representative of large transitions between steady, predictive sparse states.

### L5.2 and L6 – Stabilising Behaviour (40-50ms)

L5.2 corresponds to thick-tufted cells in L5A (in deKock’s diagram). This sublayer combines the context inputs (from L6, L1 and L5) with the lagging, stabilising output from L2/3 (which is being stabilised by the L4.2 feedback loop) and produces a second motor response (and a second signal to higher layers). With more information about how L2/3 responded to the initial signal, L5.2 can learn to produce a more nuanced behaviour than the “knee-jerk” response of L5.1, or perhaps counteract it to resume stability.

L6 is again used to provide feedback of behaviour to L4 and aid its prediction.

Figure 2: Schematic showing main connections in the multilayer model. Each “neuron” represents a large number of neurons in each sublayer.

Figure 3: Schematic showing main axonal (arrows) and dendritic (tufts) links in the multilayer model.

## Summary

We can see how this model allows a region of cortex to go from a highly sparse, quiescent steady state, absorb a large sensory stimulus, and respond, initially with dramatic changes in activity, then with decreasing waves of disturbance and motor response, in order to restore a new steady state which is self-sustaining.

The fast-responding L4.1 and L5.1 cells react first to a drastic change, causing representations in L2/3 and L6 to update, and then the second population, using L4.2 to stabilise perception and L5.2 to stabilise behaviour, takes over and settles into a new steady state.

## Examples

Apart from the rat barrel cortex example used here, we can see how this model can be applied in other well-studied cortical systems.

### Microsaccades Stabilise Vision in V1

In V1, the primary thalamic input is from retinal ganglion cells which detect on-centre or off-centre patterns in the retinal image. L4 cells are understood [cite] mostly to contain so-called “simple cells” which respond to short oriented “bars” formed by a small number of neighbouring ganglion cells. L2/3, by the same token, contains many more “complex” cells which respond to overlapping or moving bars corresponding to longer edges or a sequence of edge movements. L4 also contains a smaller number of cells with these response properties.

I propose that the simple cells are L4.1, while the L2/3 complex cells are temporally pooling over these cells, and the second population of L4 complex cells are actually L4.2, responding to the activity in L2/3. L5 in steady state is causing the eye to microsaccade in order to stabilise the “image” formed in L2/3 of the edges in the scene as tiny movements of organism and objects cause the exact patterns in L4.1 to change predictably.

Deviations beyond the microsaccade scale will cause bursting in L4.1, and the SDR shown by L2/3 will change to a new one representing the new sensory input. If L2/3 can use L1 and its own predictive input to correctly expect this new state, it will remain sparse and cause minimal reaction in L5 (in the second phase). If not, L2/3 will burst, L5 will generate a large signal, and thus V1 will pass the buck up to a region which can deal with changes of scene.

This process will be repeated at higher levels, at higher temporal and spatial scales.

### Speech Generation

In speech generation, the sensory input is from the ears, and the motor output is to the vocal system. The region responsible for generating speech is controlled (via L1) by higher regions expressing a high-level representation of sounds to be produced. Layer 2/3 uses this input to bias itself to represent all sequences of sounds which match the L1 signal. Layer 5 receives both these signals and is thus highly predictive of representing the motor actions for these sequences. Since all the sublayers are at non-zero sparseness, activity will propagate and be amplified at each stage by the predictive states until a “most probable” starting sound is generated. The region will continue to generate the correct motor activity, using prediction to correct for differences between the expected and perceived sounds.

## Citations (to be completed)

Constantinople, Christine M. and Bruno, Randy M.: Deep Cortical Layers Are Activated Directly by Thalamus. Science 28 June 2013: Vol. 340 no. 6140 pp. 1591-1594 doi:10.1126/science.1236425

Douglas, Rodney J. and Martin, Kevan A.C.: Neuronal Circuits of the Neocortex. Annu. Rev. Neurosci. 2004. 27:419–51 doi:10.1146/annurev.neuro.27.070203.144152

• Dec 08 / 2014
Cortical Learning Algorithm

## Response to Yann LeCun’s Questions on the Brain

Yann LeCun recently posed some questions on Facebook about the brain. I’d like to address these really great questions in the context of Hierarchical Temporal Memory (HTM). I’ll intersperse the questions and answers in order.

A list of challenges related to how neuroscience can help computer science:

– The brain appears to be a kind of prediction engine. How do we translate the principle of prediction into a practical learning paradigm?

HTM is based on seeing the brain as a prediction system. The Cortical Learning Algorithm uses intra-layer connections to distal dendrites to learn transitions between feedforward sensory inputs. Individual neurons use inputs from neighbouring, recently active neurons to learn to predict their own activity in context. The layer as a whole chooses as sparse a set of best predictor-recognisers as it can to represent the current situation.

– Good ML paradigms are built around the minimization of an objective function. Does the brain minimize an objective function? What is this function?

The answer is different at each level of the system, but the common theme is efficiency of activity. Synapses/dendritic spines form, grow and shrink in response to incoming signals, in order to maximise the correlation between an incoming signal and the neuron’s activity. Neurons adjust their internal thresholds and other parameters in order to maximise their probability of firing given a combined feedforward/context input pattern. Columns (represented using a simplified sheath of inhibitory neurons) again adjust their synapses in order to maximise their contained cells’ probability of becoming active given the inputs. The objective metric of a layer of neurons is the sparsity of representation, with errors in prediction-recognition being measured as lower sparsity (bursting in columns). A region of cortex produces motor output which minimises deviations from stable predicted representations of the combined sensory, motor, contextual and top-down inputs.

– Good ML systems estimate the gradient of their objective function in order to minimize it. Assuming the brain minimizes an objective function, does it estimate its gradient? How does it do it?

Each component in HTM uses only local information to adapt and learn. The optimisation emerges from each component’s responses as it learns, and from competition between columns and neurons to represent the inputs.

– Assuming that the brain computes some sort of gradient, how does it use it to optimize the objective?

There is no evidence of a mechanism in the brain which operates in this way. HTM does without such a mechanism.

– What are the principles behind unsupervised learning? Much of learning in the brain is unsupervised (or predictive). We have lots of unsupervised/predictive learning paradigms, but none of them seems as efficient as what the brain uses. How do we find one that is as efficient and general as biological learning?

CLA is a highly efficient and completely general unsupervised learning mechanism, which automatically learns the combined spatial and temporal structure of the inputs.

– Short term memory: the cortex seems to have a very short term memory with a span of about 20 seconds. Remembering things for more than 20 seconds seems to require the hippocampus. And learning new skills seems to take place in the cortex with help from the hippocampus. How do we build learning machines with short-term memory? There have been proposals to augment recurrent neural nets with a separate associative short-term memory module (e.g. LSTM, Facebook’s “Memory Networks”, Deep Mind’s “Neural Turing Machine”). This is a model by which the “processor” (e.g. a recurrent net) is separate from the “RAM” (e.g. a hippocampus-like associative memory). Could we get inspiration from neuroscience about how to do this?

Hierarchy in HTM provides short-term memory, with higher-level regions seeking to form a stable representation of the current situation in terms of sequence-sets of lower-level representations of the state of the world. Each region uses prediction-assisted recognition to represent its input, predict future inputs, and execute behaviours which maintain the predicted future.

– Resource allocation in short-term memory: if we have a separate module for short-term memory, how are resources allocated within it? When we enter a room, our position in the room, the geometry of the room, and the landmarks and obstacles in it are stored in our hippocampus. Presumably, the neural circuits used for this are recycled and reused for future tasks. How?

There’s no evidence of a separate short-term memory module in the brain. The entire neocortex is the memory, with the ephemeral activity in each region representing the current content. Active hierarchical communication between regions leads to the evolution of perception, decisions and behaviour. At the “top” of the hierarchy, the hippocampus is used to store and recycle longer-term memories.

– How does the brain perform planning, language production, motor control sequences, and long chains of reasoning? Planning complex tasks (which includes communicating with people, writing programs, and solving math problems) seems like an important part of AI system.

Because of the multiple feedforward and feedback pathways in neocortex, the entire system is constantly acting as a cyclic graph of information flow. In each region, memories of sequences are used in recognition, prediction, visualisation, execution of behaviour, imagination and so on. Depending on the task, the representations can be sensory, sensorimotor, pseudosensory (diagrammatic) or linguistic.

– resource allocation in the cortex: how does the brain “recruit” pieces of cortex when it learns a new task. In monkeys that have lost a finger, the corresponding sensory area gets recruited by other fingers when the monkey is trained to perform a task that involves touch.

There is always a horizontal “leakage” level of connections in any area of neocortex. When an area is deprived of input, neurons at the boundary respond to activity in nearby regions by increasing their response to that activity. This is enhanced by the “housekeeping” glial cells embedded in cortex, which actively bring axons and dendrites together to knit new connections.

– The brain uses spikes. Do spikes play a fundamental role in AI and learning, or are they just made necessary by biological hardware?

Spikes are very important in the real brain, but they are not directly needed for the core processing of information, so HTM doesn’t model them per se. We do use an analogue to Spike Timing Dependent Plasticity in the core Hebbian learning of predictive connections, but this is simplified to a timestep-based model rather than individual spikes.

We have elements of answers and avenues for research for many of these points, but no definite/perfect solutions.

HTM’s solutions are also neither perfect nor definitive, but they are our best attempt to address your questions in a simple, coherent and effective system, which directly depends on data from neuroscience.

Thanks to Yann for asking such pertinent questions about how the brain might work. It’s a recognition that the brain has a lot to teach us about intelligence and learning.

• Nov 29 / 2014

## Mathematics of HTM Part II – Transition Memory

This article is part of a series describing the mathematics of Hierarchical Temporal Memory (HTM), a theory of cortical information processing developed by Jeff Hawkins. In Part One, we saw how a layer of neurons learns to form a Sparse Distributed Representation (SDR) of an input pattern. In this section, we’ll describe the process of learning temporal sequences.

We showed in part one that the HTM model neuron learns to recognise subpatterns of feedforward input on its proximal dendrites. This is somewhat similar to the manner by which a Restricted Boltzmann Machine can learn to represent its input in an unsupervised learning process. One distinguishing feature of HTM is that the evolution of the world over time is a critical aspect of what, and how, the system learns. The premise for this is that objects and processes in the world persist over time, and may only display a portion of their structure at any given moment. By learning to model this evolving revelation of structure, the neocortex can more efficiently recognise and remember objects and concepts in the world.

## Distal Dendrites and Prediction

In addition to its one proximal dendrite, a HTM model neuron has a collection of distal (far) dendrites, which gather information from sources other than the feedforward inputs to the layer. In some layers of neocortex, these dendrites combine signals from neurons in the same layer as well as from other layers in the same region, and even receive indirect inputs from neurons in higher regions of cortex. We will describe the structure and function of each of these.

The simplest case involves distal dendrites which gather signals from neurons within the same layer.

In Part One, we showed that a layer of $$N$$ neurons converted an input vector $$\mathbf x \in \mathbb{B}^{n_{\textrm{ff}}}$$ into a SDR $$\mathbf{y}_{\textrm{SDR}} \in \mathbb{B}^{N}$$, with length $$\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N$$, where the sparsity $$s$$ is usually of the order of 2% ($$N$$ is typically 2048, so the SDR $$\mathbf{y}_{\textrm{SDR}}$$ will have about 40 active neurons).

The layer of HTM neurons can now be extended to treat its own activation pattern as a separate and complementary input for the next timestep. This is done using a collection of distal dendrite segments, which each receive as input the signals from other neurons in the layer itself. Unlike the proximal dendrite, which transmits signals directly to the neuron, each distal dendrite acts as an active coincidence detector, firing only when it receives enough signals to exceed its individual threshold.

We proceed with the analysis in a manner analogous to the earlier discussion. The input to the distal dendrite segment $$k$$ at time $$t$$ is a sample of the bit vector $$\mathbf{y}_{\textrm{SDR}}^{(t-1)}$$. We have $$n_{ds}$$ distal synapses per segment, a permanence vector $$\mathbf{p}_k \in [0,1]^{n_{ds}}$$ and a synapse threshold vector $$\vec{\theta}_k \in [0,1]^{n_{ds}}$$, where typically $$\theta_i = \theta = 0.2$$ for all synapses.

Following the process for proximal dendrites, we get the distal segment’s connection vector $$\mathbf{c}_k$$:

$$c_{k,i}=(1 + sgn(p_{k,i}-\theta_{k,i}))/2$$

The input for segment $$k$$ is the vector $$\mathbf{y}_k^{(t-1)} = \phi_k(\mathbf{y}_{\textrm{SDR}}^{(t-1)})$$ formed by the projection $$\phi_k:\lbrace{0,1}\rbrace^{N-1}\rightarrow\lbrace{0,1}\rbrace^{n_{ds}}$$ from the SDR to the subspace of the segment. There are $$\binom{N-1}{n_{ds}}$$ such projections (there are no connections from a neuron to itself, so there are $$N-1$$ to choose from).

The overlap of the segment for a given $$\mathbf{y}_{\textrm{SDR}}^{(t-1)}$$ is the dot product $$o_k^t = \mathbf{c}_k\cdot\mathbf{y}_k^{(t-1)}$$. If this overlap exceeds the threshold $$\lambda_k$$ of the segment, the segment is active and sends a dendritic spike of size $$s_k$$ to the neuron’s cell body.
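
As a concrete illustration, here is a minimal NumPy sketch of one distal segment computing its connection vector, its overlap with the previous SDR, and its dendritic spike. The values of `n_ds`, `lambda_k` and `s_k` are illustrative choices of mine rather than anything prescribed by the theory, and I've used `>=` for the threshold test to agree with the sum defining $$o_{\textrm{pred}}$$.

```python
import numpy as np

N = 2048        # neurons in the layer
n_ds = 20       # distal synapses on this segment (illustrative value)
theta = 0.2     # synapse permanence threshold
lambda_k = 8    # segment activation threshold (illustrative value)
s_k = 1.0       # size of the dendritic spike (illustrative value)

def connection_vector(p_k, theta):
    """c_{k,i} = (1 + sgn(p_{k,i} - theta)) / 2; gives 0.5 if p equals theta exactly."""
    return (1 + np.sign(p_k - theta)) / 2

def segment_spike(y_prev, sources, p_k):
    """Return (overlap, spike) for one distal segment at time t."""
    y_k = y_prev[sources]                  # projection phi_k of the previous SDR
    c_k = connection_vector(p_k, theta)
    overlap = float(c_k @ y_k)             # o_k^t = c_k . y_k^(t-1)
    spike = s_k if overlap >= lambda_k else 0.0
    return overlap, spike

# Previous SDR: 40 of the 2048 neurons were active at t-1.
y_prev = np.zeros(N, dtype=int)
y_prev[100:140] = 1

# This segment samples 20 other neurons, all of which happened to be active.
sources = np.arange(100, 120)

# Ten of its synapses are connected (permanence above theta), ten are not.
p_k = np.full(n_ds, 0.1)
p_k[:10] = 0.3

overlap, spike = segment_spike(y_prev, sources, p_k)
print(overlap, spike)
```

Only the ten connected synapses contribute, so the overlap is 10, which clears the threshold of 8 and produces a spike.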

This process takes place before the processing of the feedforward input, which allows the layer to combine contextual knowledge of recent activity with recognition of the incoming feedforward signals. In order to facilitate this, we will change the algorithm for Pattern Memory as follows.

Each neuron begins a timestep $$t$$ by performing the above processing on its $${n_{\textrm{dd}}}$$ distal dendrites. This results in some number $$0\ldots{n_{\textrm{dd}}}$$ of segments becoming active and sending spikes to the neuron. The total predictive activation potential is given by:

$$o_{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$

The predictive potential is combined with the overlap score from the feedforward overlap coming from the proximal dendrite to give the total activation potential:

$$a_j^t=\alpha_j o_{\textrm{ff},j} + \beta_j o_{\textrm{pred},j}$$

and these $$a_j$$ potentials are used to choose the top neurons, forming the SDR $$\mathbf{y}_{\textrm{SDR}}$$ at time $$t$$. The mixing factors $$\alpha_j$$ and $$\beta_j$$ are design parameters of the simulation.
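
The following sketch shows how the combined potential can tip the choice of winners. It assumes a single global $$\alpha$$ and $$\beta$$ (the text allows one pair per neuron) and a simple "top $$sN$$" selection; the function name `choose_sdr` and the numbers in the example are my own.

```python
import numpy as np

def choose_sdr(o_ff, o_pred, alpha=1.0, beta=0.5, sparsity=0.02):
    """Form the SDR at time t from a_j = alpha * o_ff_j + beta * o_pred_j."""
    o_ff = np.asarray(o_ff, dtype=float)
    o_pred = np.asarray(o_pred, dtype=float)
    a = alpha * o_ff + beta * o_pred
    n_active = max(1, int(sparsity * a.size))
    # Indices of the n_active largest potentials (ties go to the lower index).
    winners = np.argsort(-a, kind="stable")[:n_active]
    y = np.zeros(a.size, dtype=int)
    y[winners] = 1
    return y, a

N = 2048
o_ff = np.zeros(N)
o_ff[:50] = 10.0          # 50 neurons recognise the input equally well
o_pred = np.zeros(N)
o_pred[30:50] = 2.0       # 20 of them were also predicted from context

y, a = choose_sdr(o_ff, o_pred)
print(int(y.sum()), int(y[30:50].sum()))
```

Fifty neurons tie on feedforward evidence alone, but only 40 can be active; the 20 predicted neurons all make it into the SDR, and the remaining places go to unpredicted ones.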

## Learning Predictions

We use a very similar learning rule for distal dendrite segments as we did for the feedforward inputs:

$$p_i^{(t+1)} = \begin{cases} (1+\sigma_{inc})p_i^{(t)} & \text {if cell j active, segment k active, synapse i active} \\ (1-\sigma_{dec})p_i^{(t)} & \text {if cell j active, segment k active, synapse i not active} \\ p_i^{(t)} & \text{otherwise} \\ \end{cases}$$

Again, this reinforces synapses which contribute to activity of the cell, and decreases the contribution of synapses which don’t. A boosting rule, similar to that for proximal synapses, allows poorly performing distal connections to improve until they are good enough to use the main rule.
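
A minimal sketch of the main rule for a single segment might look like this. The increment and decrement rates are illustrative, the clipping to $$[0,1]$$ is my own assumption (the rule as written is multiplicative and could otherwise push a permanence above 1), and the boosting rule is not included.

```python
import numpy as np

def update_permanences(p, synapse_active, cell_active, segment_active,
                       sigma_inc=0.1, sigma_dec=0.05):
    """Apply the distal learning rule to one segment's permanence vector."""
    p = np.asarray(p, dtype=float)
    synapse_active = np.asarray(synapse_active, dtype=bool)
    if not (cell_active and segment_active):
        return p.copy()                       # "otherwise": no change
    updated = np.where(synapse_active,
                       (1 + sigma_inc) * p,   # reinforce contributing synapses
                       (1 - sigma_dec) * p)   # weaken the others
    # Assumption: clip so permanences stay in [0, 1].
    return np.clip(updated, 0.0, 1.0)

p = [0.2, 0.5, 0.95]
new_p = update_permanences(p, [True, False, True],
                           cell_active=True, segment_active=True)
print(new_p)
```

Here the first and third synapses are strengthened (the third is capped at 1), while the inactive second synapse decays slightly.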

## Interpretation

We can now view the layer of neurons as forming a number of representations at each timestep. The field of predictive potentials $$o_{\textrm{pred},j}$$ can be viewed as a map of the layer’s confidence in its prediction of the next input. The field of feedforward potentials can be viewed as a map of the layer’s recognition of current reality. Combined, these maps allow for prediction-assisted recognition, which, in the presence of temporal correlations between sensory inputs, will improve the recognition and representation significantly.

We can quantify the properties of the predictions formed by such a layer in terms of the mutual information between the SDRs at time $$t$$ and $$t+1$$. I intend to provide this analysis as soon as possible, and I’d appreciate the kind reader’s assistance if she could point me to papers which might be of help.

A layer of neurons connected as described here is a Transition Memory, and is a kind of first-order memory of temporally correlated transitions between sensory patterns. This kind of memory may only learn one-step transitions, because the SDR is formed only by combining potentials one timestep in the past with current inputs.
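
The one-step limitation is easy to see in a toy example. Here I abstract away the SDRs entirely and treat each pattern as a symbol, which is a big simplification, but it shows why looking only one timestep back cannot distinguish two sequences which share a middle element.

```python
from collections import defaultdict

def learn_transitions(sequences):
    """Record which pattern has followed which, looking one step back only."""
    transitions = defaultdict(set)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev].add(nxt)
    return transitions

memory = learn_transitions(["ABC", "XBD"])
print(sorted(memory["B"]))   # both C and D are predicted after B
```

After seeing B, this memory predicts both C and D, regardless of whether B was preceded by A or by X.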

Since the neocortex clearly learns to identify and model much longer sequences, we need to modify our layer significantly in order to construct a system which can learn high-order sequences. This is the subject of the next part of this series.

Note: For brevity, I’ve omitted the matrix treatment of the above. See Part One for how this is done for Pattern Memory; the extension to Transition Memory is simple but somewhat arduous.

• Nov 28 / 2014

## Mathematics of Hierarchical Temporal Memory

This article describes some of the mathematics underlying the theory and implementations of Jeff Hawkins’ Hierarchical Temporal Memory (HTM), which seeks to explain how the neocortex processes information and forms models of the world.

Note: Part II: Transition Memory is now available.

## The HTM Model Neuron – Pattern Memory (aka Spatial Pooling)

We’ll illustrate the mathematics of HTM by describing the simplest operation in HTM’s Cortical Learning Algorithm: Pattern Memory, also known as Spatial Pooling, which forms a Sparse Distributed Representation from a binary input vector. We begin with a layer (a 1- or 2-dimensional array) of single neurons, which will form a pattern of activity aimed at efficiently representing the input vectors.

### Feedforward Processing on Proximal Dendrites

The HTM model neuron has a single proximal dendrite, which is used to process and recognise feedforward or afferent inputs to the neuron. We model the entire feedforward input to a cortical layer as a bit vector $${\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}$$, where $$n_{\textrm{ff}}$$ is the width of the input.

The dendrite is composed of $$n_s$$ synapses which each act as a binary gate for a single bit in the input vector.  Each synapse has a permanence $$p_i\in{[0,1]}$$ which represents the size and efficiency of the dendritic spine and synaptic junction. The synapse will transmit a 1-bit (or on-bit) if the permanence exceeds a threshold $$\theta_i$$ (often a global constant $$\theta_i = \theta = 0.2$$). When this is true, we say the synapse is connected.

Each neuron samples $$n_s$$ bits from the $$n_{\textrm{ff}}$$ feedforward inputs, and so there are $${n_{\textrm{ff}}}\choose{n_{s}}$$ possible choices of input for a single neuron. A single proximal dendrite represents a projection $$\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\rightarrow\lbrace{0,1}\rbrace^{n_s}$$, so a population of neurons corresponds to a set of subspaces of the sensory space. Each dendrite has an input vector $${\mathbf x}_j=\pi_j({\mathbf x}_{\textrm{ff}})$$ which is the projection of the entire input into this neuron’s subspace.
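
To make the projection concrete, here is a small Python sketch. It assumes the projection $$\pi_j$$ is stored as a sorted list of sampled input indices. The sizes, the seed and the helper name `project` are hypothetical choices for illustration.

```python
import math
import random

n_ff, n_s = 16, 4          # toy sizes, far smaller than a real layer
rng = random.Random(42)    # fixed seed so the sketch is repeatable

# pi_j is represented by the input indices this neuron samples (an assumption).
pi_j = sorted(rng.sample(range(n_ff), n_s))


def project(x_ff, indices):
    """x_j = pi_j(x_ff): keep only the bits this dendrite samples."""
    return [x_ff[i] for i in indices]


x_ff = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
x_j = project(x_ff, pi_j)

# Number of distinct ways to choose n_s inputs from n_ff.
n_choices = math.comb(n_ff, n_s)   # 1820 for these toy sizes
```

Even at this toy scale there are 1820 possible projections, so with realistic values of $$n_{\textrm{ff}}$$ and $$n_s$$ two randomly wired neurons are very unlikely to share the same subspace.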

A synapse is connected if its permanence $$p_i$$ exceeds its threshold $$\theta_i$$. If we subtract $${\mathbf p}-{\vec\theta}$$, take the elementwise sign of the result, and map to $$\lbrace{0,1}\rbrace$$, we derive the binary connection vector $${\mathbf c}_j$$ for the dendrite. Thus:

$$c_i=(1 + \operatorname{sgn}(p_i-\theta_i))/2$$

The dot product $$o_j({\mathbf x})={\mathbf c}_j\cdot{\mathbf x}_j$$ now represents the feedforward overlap of the neuron with the input, ie the number of connected synapses which have an incoming activation potential. Later, we’ll see how this number is used in the neuron’s processing.
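
A minimal Python sketch of the connection vector and overlap score follows. The function names and the sample permanences are made up for illustration. I treat a synapse as connected only when the permanence strictly exceeds the threshold, which is an assumption about how to handle the tie case $$p_i = \theta_i$$.

```python
def connection_vector(perms, theta=0.2):
    """c_i = 1 if the permanence exceeds the threshold, else 0."""
    return [1 if p > theta else 0 for p in perms]


def overlap(c_j, x_j):
    """o_j = c_j . x_j: connected synapses that see an on-bit."""
    return sum(c * x for c, x in zip(c_j, x_j))


perms = [0.35, 0.10, 0.25, 0.05, 0.60]   # illustrative permanences
x_j = [1, 1, 0, 1, 1]                    # input as seen by this dendrite

c_j = connection_vector(perms)   # [1, 0, 1, 0, 1]
o_j = overlap(c_j, x_j)          # 2
```

Only the first and last synapses are both connected and receiving an on-bit, so the overlap is 2 even though four input bits are active.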

The elementwise product $${\mathbf o}_j={\mathbf c}_j\odot{\mathbf x}_j$$ is the vector in the neuron’s subspace which represents the input vector $${\mathbf x}_{\textrm{ff}}$$ as “seen” by this neuron. This is known as the overlap vector. The length $$o_j = \lVert{\mathbf o}_j\rVert_{\ell_1}$$ of this vector corresponds to the extent to which the neuron recognises the input, and the direction (in the neuron’s subspace) is that vector which has on-bits shared by both the connection vector and the input.

If we project this vector back into the input space, the result $$\mathbf{\hat{x}}_j =\pi_j^{-1}({\mathbf o}_j)$$ is this neuron’s approximation of the part of the input vector which this neuron matches. If we add a set of such vectors, we will form an increasingly close approximation to the original input vector as we choose more and more neurons to collectively represent it.
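
The back-projection can be sketched in a few lines of Python. This assumes the projection is stored as a list of input indices. The helper name `back_project` and the example values are hypothetical.

```python
def back_project(o_j, indices, n_ff):
    """Place each bit of the overlap vector back at its input position."""
    x_hat = [0] * n_ff
    for bit, i in zip(o_j, indices):
        x_hat[i] = bit
    return x_hat


x_ff = [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
indices = [1, 4, 6, 9]                        # inputs sampled by neuron j
c_j = [1, 0, 1, 1]                            # connection vector

x_j = [x_ff[i] for i in indices]              # [1, 1, 0, 1]
o_j = [c * x for c, x in zip(c_j, x_j)]       # [1, 0, 0, 1]
x_hat_j = back_project(o_j, indices, len(x_ff))
# x_hat_j == [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```

The estimate recovers two of the four on-bits in the input: the neuron has no synapse on bit 7, and its synapse on bit 4 is not connected.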

## Sparse Distributed Representations (SDRs)

We now show how a layer of neurons transforms an input vector into a sparse representation. From the above description, every neuron is producing an estimate $$\mathbf{\hat{x}}_j$$ of the input $${\mathbf x}_{\textrm{ff}}$$, with length $$o_j\ll n_{\textrm{ff}}$$ reflecting how well the neuron represents or recognises the input. We form a sparse representation of the input by choosing a set $$Y_{\textrm{SDR}}$$ of the top $$n_{\textrm{SDR}}=sN$$ neurons, where $$N$$ is the number of neurons in the layer, and $$s$$ is the chosen sparsity we wish to impose (typically $$s=0.02=2\%$$).

The algorithm for choosing the top $$n_{\textrm{SDR}}$$ neurons may vary. In neocortex, this is achieved using a mechanism involving cascading inhibition: a cell firing quickly (because it depolarises quickly due to its input) activates nearby inhibitory cells, which shut down neighbouring excitatory cells, and also nearby inhibitory cells, which spread the inhibition outwards. This type of local inhibition can also be used in software simulations, but it is expensive and is only used where the design involves spatial topology (ie where the semantics of the data is to be reflected in the position of the neurons). A more efficient global inhibition algorithm – simply choosing the top $$n_{\textrm{SDR}}$$ neurons by their depolarisation values – is often used in practice.

If we form a bit vector $${\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N\textrm{ where } y_j = 1 \Leftrightarrow j \in Y_{\textrm{SDR}}$$, we have a function which maps an input $${\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}$$ to a sparse output $${\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N$$, where the length of each output vector is $$\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N$$.
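
Global inhibition is easy to sketch in Python. The function name `global_inhibition`, the rounding of $$sN$$, and the rule of breaking ties in favour of the lower neuron index are assumptions of this sketch. Real implementations may resolve ties differently.

```python
def global_inhibition(overlaps, sparsity=0.02):
    """Pick the top sN neurons by overlap and return (Y_SDR, y_SDR)."""
    n = len(overlaps)
    n_sdr = max(1, round(sparsity * n))
    # Sort by descending overlap; ties go to the lower index (assumed rule).
    ranked = sorted(range(n), key=lambda j: (-overlaps[j], j))
    winners = set(ranked[:n_sdr])
    y = [1 if j in winners else 0 for j in range(n)]
    return winners, y


overlaps = [3, 7, 1, 7, 0, 5, 2, 9, 4, 6]    # one overlap score per neuron
Y_sdr, y_sdr = global_inhibition(overlaps, sparsity=0.2)
# Y_sdr == {1, 7}
# y_sdr == [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
```

A sparsity of 20% is used here only so that a ten-neuron layer has more than one winner. With the typical 2% you would want a layer of at least a few hundred neurons.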

The reverse mapping or estimate of the input vector by the set $$Y_{\textrm{SDR}}$$ of neurons in the SDR is given by the sum:

$$\mathbf{\hat{x}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf o}_j)} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j\odot{\mathbf x}_j)}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j \odot \pi_j({\mathbf x}_{\textrm{ff}}))}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j) \odot {\mathbf x}_{\textrm{ff}}}$$

## Matrix Form

The above can be represented straightforwardly in matrix form. The projection $$\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}} \rightarrow\lbrace{0,1}\rbrace^{n_s}$$ can be represented as a matrix $$\Pi_j \in \lbrace{0,1}\rbrace^{{n_s} \times\ n_{\textrm{ff}}}$$.

Alternatively, we can stay in the input space $$\mathbb{B}^{n_{\textrm{ff}}}$$, and model $$\pi_j$$ as a vector $$\vec\pi_j =\pi_j^{-1}(\mathbf 1_{n_s})$$, ie where $$\pi_{j,i} = 1 \Leftrightarrow (\pi_j^{-1}(\mathbf 1_{n_s}))_i = 1$$.

The elementwise product $$\vec{x_j} =\pi_j^{-1}(\mathbf x_{j}) = \vec{\pi_j} \odot {\mathbf x_{\textrm{ff}}}$$ represents the neuron’s view of the input vector $$\mathbf x_{\textrm{ff}}$$.

We can similarly project the connection vector for the dendrite by elementwise multiplication: $$\vec{c_j} =\pi_j^{-1}(\mathbf c_{j})$$, and thus $$\vec{o_j}(\mathbf x_{\textrm{ff}}) = \vec{c_j} \odot \mathbf{x}_{\textrm{ff}}$$ is the overlap vector projected back into $$\mathbb{B}^{n_{\textrm{ff}}}$$, and the dot product $$o_j(\mathbf x_{\textrm{ff}}) = \vec{c_j} \cdot \mathbf{x}_{\textrm{ff}}$$ gives the same overlap score for the neuron given $$\mathbf x_{\textrm{ff}}$$ as input. Note that $$\vec{o_j}(\mathbf x_{\textrm{ff}}) =\mathbf{\hat{x}}_j$$, the partial estimate of the input produced by neuron $$j$$.

We can reconstruct the estimate of the input by an SDR of neurons $$Y_{\textrm{SDR}}$$:

$$\mathbf{\hat{x}}_{\textrm{SDR}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\vec o}_j = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\vec c}_j\odot{\mathbf x_{\textrm{ff}}}} = {\mathbf C}_{\textrm{SDR}}{\mathbf x_{\textrm{ff}}}$$

where $${\mathbf C}_{\textrm{SDR}}$$ is a matrix formed from the $${\vec c}_j$$ for $$j \in Y_{\textrm{SDR}}$$.
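
The sum $$\sum_{j \in Y_{\textrm{SDR}}}{\vec c}_j\odot{\mathbf x_{\textrm{ff}}}$$ can be computed directly in the input space. In the Python sketch below, the dictionary `c_vecs` of back-projected connection vectors and the function name `reconstruct` are hypothetical.

```python
def reconstruct(c_vecs, Y_sdr, x_ff):
    """Sum the partial estimates c_j (.) x_ff over the neurons in the SDR."""
    x_hat = [0] * len(x_ff)
    for j in Y_sdr:
        for i, (c, x) in enumerate(zip(c_vecs[j], x_ff)):
            x_hat[i] += c * x
    return x_hat


x_ff = [1, 0, 1, 1, 0, 1]
c_vecs = {                      # connection vectors already in input space
    0: [1, 1, 0, 0, 0, 0],
    1: [0, 0, 1, 1, 0, 0],
    2: [1, 0, 0, 1, 1, 0],
}
x_hat = reconstruct(c_vecs, [0, 2], x_ff)   # [2, 0, 0, 1, 0, 0]
```

Each entry of the result counts how many of the chosen neurons have a connected synapse on that active input bit, so entries can exceed 1 where receptive fields overlap.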

## Optimisation Problem

We can now measure the distance between the input vector $$\mathbf x_{\textrm{ff}}$$ and the reconstructed estimate $$\mathbf{\hat{x}}_{\textrm{SDR}}$$ by taking a norm of the difference. Using this, we can frame learning in HTM as an optimisation problem. We wish to minimise the estimation error over all inputs to the layer. Given a set of (usually random) projection vectors $$\vec\pi_j$$ for the N neurons, the parameters of the model are the permanence vectors $$\vec{p}_j$$, which we adjust using a simple Hebbian update model.
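
As one possible way of measuring the estimation error, the sketch below uses the $$\ell_1$$ distance between the input and a clipped estimate. Both the choice of norm and the clipping of vote counts to $$\lbrace{0,1}\rbrace$$ are my assumptions for this example, and the function name `l1_error` is hypothetical.

```python
def l1_error(x_ff, x_hat):
    """l1 distance between the input and the estimate clipped to {0, 1}."""
    return sum(abs(x - min(1, xh)) for x, xh in zip(x_ff, x_hat))


x_ff = [1, 0, 1, 1, 0, 1]
x_hat = [2, 0, 0, 1, 0, 0]      # example estimate (vote counts per input bit)

error = l1_error(x_ff, x_hat)   # 2: two on-bits are not represented
```

An error of zero would mean every on-bit in the input is covered by at least one connected synapse on an active neuron.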

The update model for the permanence of a synapse $$p_i$$ on neuron $$j$$ is:

$$p_i^{(t+1)} = \begin{cases} (1+\delta_{inc})p_i^{(t)} & \text{if } j \in Y_{\textrm{SDR}},\ (\mathbf x_j)_i=1, \text{ and } p_i^{(t)} \ge \theta_i \\ (1-\delta_{dec})p_i^{(t)} & \text{if } j \in Y_{\textrm{SDR}}, \text{ and } ((\mathbf x_j)_i=0 \text{ or } p_i^{(t)} \lt \theta_i) \\ p_i^{(t)} & \text{otherwise} \\ \end{cases}$$

This update rule increases the permanence of active synapses, those that were connected to an active input when the cell became active, and decreases those which were either disconnected or received a zero when the cell fired. In addition to this rule, an external process gently boosts synapses on cells which either have a lower than target rate of activation, or a lower than target average overlap score.
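
The main update rule translates almost line for line into Python. In this sketch the function name `update_permanences`, the values of $$\delta_{inc}$$ and $$\delta_{dec}$$, and the clipping at 1.0 are assumptions. The external boosting process is not modelled.

```python
def update_permanences(perms, x_j, in_sdr, theta=0.2,
                       delta_inc=0.1, delta_dec=0.05):
    """One step of the multiplicative Hebbian rule for neuron j."""
    if not in_sdr:
        return list(perms)               # neuron not in Y_SDR: no change
    updated = []
    for p, x in zip(perms, x_j):
        if x == 1 and p >= theta:
            # Connected synapse with an active input: reinforce.
            updated.append(min(1.0, (1 + delta_inc) * p))  # clip is assumed
        else:
            # Disconnected synapse, or input bit was zero: weaken.
            updated.append((1 - delta_dec) * p)
    return updated


perms = [0.40, 0.10, 0.30]
x_j = [1, 1, 0]
new_perms = update_permanences(perms, x_j, in_sdr=True)
# Approximately [0.44, 0.095, 0.285]
```

Note that the second synapse is weakened even though its input was active, because it was below threshold and so did not contribute to the cell firing.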

I do not yet have a proof that this optimisation problem converges, nor do I know whether it can be represented as a convex optimisation problem. I am confident such a proof can be easily found. Perhaps a kind reader who is more familiar with a problem framed like this would be able to confirm this. I’ll update this post with more functions from HTM in coming weeks.


• Nov 13 / 2014

## Part 1 – Introduction and Description.

In any attempt to create a theoretical scientific framework, breakthroughs are often made when a single key “law” is found to underlie what previously appeared to be a number of observed lesser laws. An example from Physics is the key principle of Relativity: that the speed of light is a constant in all inertial frames of reference, which quickly leads to all sorts of unintuitive phenomena like time dilation, length contraction, and so on. This discussion aims to do this for HTM by proposing that its key underlying principle is the efficiency of predicted sparseness at all levels. I’ll attempt to show how this single principle not only explains several key features of HTM identified so far, but also explains in detail how to model any required structural component of the neocortex.

The neocortex is a tremendously expensive organ in mammals, and particularly in humans, so it seems certain that the benefits it provides are proportionately valuable to the genes of an animal. We can use this relationship between cost and benefit, with sparseness and prediction as mediating metrics, to derive detailed design rules for the neocortex at every level, down to individual synapses and their protein machinery.

“If you take one thing away from this talk, it should be that Sparse Distributed Representations are the key to Intelligence.” – Jeff Hawkins

Note: The next post in this series describes the Mathematics of Hierarchical Temporal Memory.

Sparse Distributed Representations are a key concept in HTM theory. In any functional piece of cortex, only a small fraction of a large population of neurons will be active at a given time; each active neuron encodes some component of the semantics of the representation; and small changes in the exact SDR correspond with small differences in the detailed object or concept being represented. Ahmad 2014 describes many important properties of SDRs.

SDRs are one efficient solution to the problem of representing something with sufficient accuracy at optimal cost in resources, and in the face of ambiguity and noise. My thesis is that in forming SDRs, neocortex is striving to optimise a lossy compression process by representing only those elements of the input which are structural and ignoring everything else.

Shannon proposed that any message has a concrete amount of information, measured in bits, which reflects the amount of surprise (i.e. something you couldn’t compute from the message so far, or by other means) contained in the message.
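
Shannon’s measure of surprise for an event of probability $$p$$ is $$-\log_2 p$$ bits. A tiny Python sketch (the function name `surprisal_bits` is just a label chosen for this example):

```python
import math


def surprisal_bits(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)


certain = surprisal_bits(1.0)      # 0 bits: a fully predicted event
coin_flip = surprisal_bits(0.5)    # 1 bit
rare = surprisal_bits(0.125)       # 3 bits
```

A perfectly predicted event carries no information at all, which is the sense in which the most efficient message is the one you don’t need to send.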

The most efficient message has zero length – it’s the message you don’t need to send. The next most efficient message contains only the information the receiver lacks to reconstruct everything the sender wishes her to know. Thus, by using memory and the right encoding to connect with it, a clever receiver (or memory system) can become very efficient indeed.

We will see that neocortex implements this idea literally, at all levels, as it attempts to represent, remember and predict events in the world as usefully as possible and at minimal cost.

The organising principle in cortical design is that components (from the whole organism down to a synapse) can do little about the amount of signal they receive, but they can – and do – adapt and learn to make best use of that signal to control what they do, only acting – sending a signal – when it’s the predicted optimal choice. This gives rise to sparseness in space and time everywhere, which directly reflects the degree of successful prediction present in any part of the system.

The success metric for a component in neocortex is the ratio of input data rate to output information rate, where the component has either a fixed minimum, or (for neurons and synapses) a fixed maximum, output level.

Deviations from the target indicate some failure to predict activity. This failure is either an opportunity to learn (and predict better next time), or, failing that, something which needs to be acted upon in some other way, by taking a different action or by passing new information up the hierarchy.

Note inputs in this context are any kind of signal coming in to the component under study. In the case of regions, layers and neurons, these include top-down feedback and lateral inputs as well as feedforward.

### Hierarchy

Neocortex is a hierarchy because it has finite space to store its model of the world, and a hierarchy is an optimal strategy when the world itself has hierarchical structure. Each region in the hierarchy is subjected (by design) to a necessarily overwhelming rate of input, so it will run at capacity to absorb its data stream, reallocating its finite resources to contain an optimal model of the world it perceives.

### Regions

The memory inside a region of cortex is driven towards an “ideal” state in which it always predicts its inputs and thus produces a “perfect”, minimal message – containing its learned SDR of its world’s current state – as output. Any failure to predict is indicated by a larger output, the deviation from “ideal” representing the exact surprise of the region to its current perception of the world.

A region has several output layers, each of which has a different (and usually more than one) purpose.

For each region, two layers send (different) signals up the hierarchy, therefore signalling both the current state of its world and the encoding of its unpredictability. The higher region now gets details of something it should hopefully have the capacity to handle – predict – or else it passes the problem up the chain.

Two layers send (again different) signals down to lower layers and (in the case of motor) to subcortical systems. The content of these outputs will relate to the content as well as the stability and confidence of the region’s model, and also actions which are appropriate in terms of that content and confidence level.

### Layers

A cortical layer which has fully predicted its inputs has a maximally sparse output pattern. A fully failing prediction pattern in a layer causes it to output a maximally bursting and minimally sparse pattern, at least for a short time. At any failure level in between, the exact evolution of firing in the bursting neurons encodes the precise pattern of prediction failure of the layer, and this is the information passed to other layers in the region, to other regions in cortex, or to targets outside the cortex.

The output of a cortical layer is thus a minimal message – it “starts” with the best match of its prediction and reality, followed (in a short period of time) by encodings of reality in the context of increasingly weak prediction.

### Columns

A layer’s output, in turn, is formed from the combination of its neurons, which are themselves arranged in columns. The columnar arrangement of cells in cortical columns is the key design leading to all the behaviour described previously.

Pyramidal cells, which represent both the SDR activity pattern and the “memory” in a layer, are all contained in columns. The sparse pattern of activity across a layer is dictated by how all the cells compete within this columnar array.

Columns are composed of pyramidal cells, which act independently, and a complex of inhibitory cells which act together to define how the column operates. All cells share a very similar feedforward receptive field, due to the fact that feedforward axons physically run up through the narrow column and abut the pyramidal bodies as they squeeze past.

#### Columnar Inhibition

The inhibitory cells have a broader and faster feedforward response compared with the pyramidal cells (Reference) so, in the absence of strong predictive inputs to any pyramidal cells, the entire assemblage of inhibitory neurons will be first to fire in a column. When this happens, these inhibitory cells excite those in adjacent columns, and a wave of inhibition spreads out from a successfully firing column.

The wave continues until it arrives at a column which has already been inhibited by a wave coming from elsewhere in the layer (from some recently active column). This gives rise to a pattern of inactivity around columns which are currently active.

#### Predictive Activation

Each cell in a column has its own set of feedforward and predictive inputs, so every cell has a different rate of depolarising as it is driven towards firing threshold.

Some cells may have received sufficient depolarising input from predictive lateral or top-down dendrites to reach firing threshold before the column’s sheath of inhibitory cells. In this case the pyramidal cell will fire first, trigger the column’s inhibitory sheath, and cause the wave of inhibition to spread out laterally in the layer.

#### Vertical Inhibition in Columns

When the inhibitory sheath fires, it also sends a wave of inhibitory signals vertically in the column. This wave will shut down any pyramidal cells which have not yet reached threshold, giving rise to a sparse activity pattern in the column.

The exact number of cells which get to fire before the sheath shuts them down depends mainly on how predictive each cell was and whether the sheath was triggered by a “winning cell” (previous section), by the sheath being first to fire, or as a result of neighbouring columns sending out signals.

If there is a wave of inhibition reaching a column, all cells are shut down and none (or no more) fire.

If there was a cell so predictive that it fired before the sheath, all other cells are very likely shut down and only one cell fires.

Finally, if the sheath was first to fire due to its feedforward input, the pyramidal cells are shut down quite quickly, but the most predictive may get the chance to fire just before being shut down.

This last process is called bursting, and gives rise to a short-lived pattern which encodes exactly how well the column as an ensemble has matched its predictions. Basically, the more cells which fire, the more “confused” the match between prediction and reality. This is because the inhibition happens quickly, so the gap between the first and last cell to burst must be small, reflecting similar levels of predictivity.

The bursting process may also be ended by an incoming wave of inhibition. The further away a competing column is, the longer that will take, allowing more cells to fire and extending the burst. Thus the amount of bursting also reflects the local area’s ability to respond to the inputs.

### Neurons

Neurons are machines which use patterns of input signals to produce a temporal pattern of output signal. The neuron wastes most resources if its potential rises but just fails to fire, so the processes of adaptation of the neuron are driven to a) maximise the response to inputs within a particular set, and b) minimise the response to inputs outside that set.

The set of excitatory inputs to one neuron are of two main types – feedforward and predictive; the number of each type of input varies from tens to tens of thousands; and the inputs arrive stochastically in combinations which contain mixtures of true structure and noise, so the “partitioning problem” a neuron faces is intractable. It simply learns to do the best it can.

Note that neurons are the biggest components in HTM which actually do anything! In fact, the regions, layers and columns are just organisational constructs, ways of looking at the sets of interacting neurons.

The neuron is the level in the system at which genetic control is exercised. The neuron’s shape, size, position in the neocortex, receptor selections, and many more things are decided per-neuron.

Importantly, many neurons have a genetically expressed “firing program” which broadly sets a target for the firing pattern, frequency and dependency setup.

Again, this gives the neuron an optimal pattern of output, and its job is to arrange its adaptations and learn to match that output.

### Dendrites

Distal dendrites have a similar but simpler and smaller scale problem of combining inputs and deciding whether to spike.

I don’t believe dendrites do much more than passively respond to global factors such as modulators and act as conduits for signals, both electrical and chemical, originating in synapses.

### Synapses

Synapses are now understood to be highly active processing components, capable of growing both in size and efficiency in a few seconds, actively managing their response to multiple inputs – presynaptic, modulatory and intracellular, and self-optimising to best correlate a stream of incoming signals with the activity of the entire neuron.

Part Two takes this idea further and details how a multilayer region uses the efficiency of predicted sparseness to learn a sensorimotor model and generate behaviour.

The next post in this series describes the Mathematics of Hierarchical Temporal Memory. This diversion is useful before proceeding with the main thread.

Blättler F, Hahnloser RHR. An Efficient Coding Hypothesis Links Sparsity and Selectivity of Neural Responses. Kiebel SJ, ed. PLoS ONE 2011;6(10):e25506. doi:10.1371/journal.pone.0025506. [Full Text]

• Sep 14 / 2014

## A Unifying View of Deep Networks and Hierarchical Temporal Memory

There’s been a somewhat less than convivial history between two of the theories of neurally-inspired computation systems over the last few years. When a leading protagonist of one school is asked a question about the other, the answer often varies from a kind of empty semi-praise to downright dismissal and the occasional snide remark. The objections of one side to the other’s approach are usually valid, and mostly admitted, but the whole thing leaves one with a feeling that it is not a very scientific way to proceed or behave. This post describes an idea which might go some way to resolving this slightly unpleasant impasse and suggests that the discrepancies may simply be the result of two groups using the same name for two quite different things.

In HTM, Jeff Hawkins’ plan is to identify the mechanisms which actually perform computation in real neocortex, abstracting them only far enough that the details of the brain’s bioengineering are simplified out, and hopefully leaving only the pure computational systems in a form which allows us to implement them in software and reason about them. On the other hand, Hinton and LeCun’s neural networks are each built “computation-first,” drawing some inspiration from and resembling the analogous (but in detail very different) computations in neocortex.

The results (ie the models produced), inevitably, are as different at all levels as their inventors’ approaches and goals. For example, one criterion for the Deep Network developer is that her model is susceptible to a set of mathematical tools and techniques, which allow other researchers to frame questions, examine and compare models, and so on, all in a similar mathematical framework. HTM, on the other hand, uses neuroscience as a standard test, and will not admit to a model any element which is known to be contradicted by observation of natural neocortex. The Deep Network people complain that the models of HTM cannot be analysed like theirs can (indeed it seems they cannot), while the HTM people complain that the neurons and network topologies in Deep Networks bear no relationship with any known brain structures, and are several simplifications too far.

Yann LeCun said recently on Reddit (with a great summary):

Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it’s the very idea of deep learning) and will use temporal prediction.

But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?).

I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice.

As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.

The topic of HTM and Jeff Hawkins was second out of all the major themes in the Q&A session, reflecting the fact that people in the field view this as an important issue, and (it seems to me) wish that the impressive progress made by Deep Learning researchers could be reconciled with the deeper explanatory power of HTM in describing how the neocortex works.

Of course, HTM people seldom refuse to play their own role in this spat, saying that a Deep Network sacrifices authenticity in favour of mathematical tractability and getting high scores on artificial “benchmarks”. We explain or excuse the fact that our models are several steps smaller in hierarchy and power, making the valid claim that there are shortcuts and simplifications we are not prepared to make,  and speculating that we will – like the tortoise – emerge alone at the finish with the prize of AGI in our hands.

The problem is, however, a little deeper and more important than an aesthetic argument (as it sometimes appears). This gap in acknowledging the valid accomplishments of the two models, coupled with a certain defensiveness, causes a “chilling effect” when an idea threatens to cross over into the other realm. This means that findings in one regime are very slow to be noticed or incorporated in the other. I’ve heard quite senior HTM people actually say things like “I don’t know anything about Deep Learning, just that it’s wrong” – and vice versa. This is really bad science.

From reading their comments, I’m pretty sure that no really senior Deep Learning proponent has any knowledge of the current HTM beyond what he’s read in the popular science press, and the reverse is nearly as true.

I consider a very good working knowledge of Deep Learning to be a critical part of any area of computational neuroscience or machine learning. Obviously I feel at least the same way about HTM, but recognise that the communication of our progress (or even the reporting of results) in HTM has not made it easy for “outsiders” to achieve the levels of understanding they feel they need to take part. There are historical reasons for much of this, but it’s never too late to start fixing a problem like this, and I see this post (and one of my roles) as a step in the right direction.

## The Neuron as the Unit of Computation

In both models, we have identified the neuron as the atomic unit of computation, and the connections between neurons as the location of the memory or functional adjustment which gives the network its computational power. This sounds fine, and clearly the brain uses neurons and connections in some way like this, but this is exactly where the two schools mistakenly diverge.

Jeff Hawkins rejects the NN integrate-and-fire model and builds a neuron with vastly higher complexity. Geoff Hinton admits that, while impossible to reason about mathematically, HTM’s neuron is far more realistic if your goal is to mimic neocortex. Deep Learning, using neurons like Lego bricks, can build vast hierarchies and huge networks, find cats in YouTube videos, and win prizes in competitions. HTM, on the other hand, struggles for years to fit together its “super-neurons” and builds a tiny, single-layer model which can find features and anomalies in low-dimensional streaming data.

Looking at this, you’d swear these people were talking about entirely different things. They’ve just been using the same names for them. And, it’s just dawned on me, therein lies both the problem and its solution. The answer’s been there all the time:

Each and every neuron in HTM is actually a Deep Network.

In a HTM neuron, there are two types of dendrite. One is the proximal dendrite, which contains synapses receiving inputs from the feedforward (mainly sensory) pathway. The other is a set of coincidence-detecting, largely independent, distal dendrite segments, which receive lateral and top-down predictive inputs from the same layer or higher layers and regions in neocortex.

My thesis here is that a single neuron can be seen as composed of many elements which have direct analogues in various types of Deep Learning networks, and that there are enough of these, with a sufficient structural complexity, that it’s best to view the neuron as a network of simple, Deep Learning-sized nodes, connected in a particular way. I’ll describe this network in some detail now, and hopefully it’ll become clear how this approach removes much of the dichotomy between the models.

Firstly, a synapse in HTM is very much like a single-input NN node, where HTM’s permanence value is akin to the bias in a NN node, and the weight on the input connection is fixed at 1.0. If the input is active, and the permanence exceeds the threshold, the synapse produces a 1. In HTM we call such a synapse connected, in that the gate is open and the signal is passed through.

The dendrite or dendrite segment is like the next layer of nodes in NN, in that it combines its inputs and passes the result up. The proximal dendrite effectively acts as a semi-rectifier, summing inputs and generating a scalar depolarisation value to the cell body. The distal segments, on the other hand, act like thresholded coincidence detectors and produce a depolarising spike only if the sum of the inputs exceeds a threshold.

These depolarising inputs (feedforward and recurrent) are combined in the cell body to produce an activation potential. This only potentially generates the output of the entire neuron, because a higher-level inhibition system is used to identify those neurons with highest potential, allow those to fire (producing a binary 1), and suppress the others to zero (a winner-takes-all step with multiple local winners in the layer).
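
The three levels described above (synapse, dendrite, cell body) can be sketched as nested Python functions. All names, the thresholds, and the single `predictive_weight` used to mix the two kinds of input are assumptions of this sketch. The layer-level inhibition step is not included.

```python
def synapse(input_bit, permanence, threshold=0.2):
    """Binary gate: passes a 1 only if the input is on and the synapse is connected."""
    return 1 if input_bit == 1 and permanence > threshold else 0


def proximal_dendrite(inputs, perms):
    """Sums gated feedforward inputs into a scalar depolarisation."""
    return sum(synapse(x, p) for x, p in zip(inputs, perms))


def distal_segment(inputs, perms, segment_threshold=2):
    """Coincidence detector: spikes only if enough gated inputs are active."""
    total = sum(synapse(x, p) for x, p in zip(inputs, perms))
    return 1 if total > segment_threshold else 0


def activation_potential(ff_inputs, ff_perms, segments, predictive_weight=1.0):
    """Combine feedforward and predictive depolarisation in the cell body."""
    feedforward = proximal_dendrite(ff_inputs, ff_perms)
    predictive = sum(distal_segment(x, p) for x, p in segments)
    return feedforward + predictive_weight * predictive


segments = [
    ([1, 1, 1], [0.40, 0.30, 0.25]),   # three coincident inputs: spikes
    ([1, 0, 0], [0.40, 0.40, 0.40]),   # one input only: stays silent
]
potential = activation_potential([1, 1, 0, 1], [0.5, 0.1, 0.9, 0.3], segments)
# potential == 3.0 (2 from the proximal dendrite, 1 from the first segment)
```

Whether this neuron actually fires then depends on how its potential compares with that of its neighbours in the inhibition step.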

So, a HTM layer is a network of networks, a hierarchy in which neuron-networks communicate with connections between their sub-parts. At the HTM layer level, each neuron has two types of input and one output, and we wire them together as such, but each neuron is really hiding an internal, network-like structure of its own.

• Aug 23 / 2014

## Suggested Naming in HTM Theory and White Paper

“There are only two hard things in Computer Science: cache invalidation and naming things.” – Phil Karlton

In the case of HTM, we also have the much bigger problem of explaining how neocortex may work, and how a non-obvious CLA operates to use cortical principles. Extra confusion caused by poor naming multiplies the difficulties.

A key component of the art of naming consists in identifying the scope of each name. We need to have names which are just specific enough to capture the underlying concept, but not so specific that they entangle non-essential details. Names also need to be memorable and comfortable, but not easy to misconstrue because they resemble or contain words which have other meanings.

I’d like to begin a reasoned discussion about key names in HTM and CLA. The goal of the discussion is to arrive at a set of names which everyone strongly believes captures the concepts for both theory and implementation.

As a famous Supreme Court judge once said of pornography, “we cannot define it but we know it when we see it.” We are looking for this kind of name, with the added advantage that HTM can actually precisely define the concept behind each name.

Until we arrive at a good name for something (ie one which magically gets everyone’s support), we should identify the key flaws in each candidate and agree that they invalidate that candidate. This is a healthy process which should not be regarded as a criticism of any proposer.

Please treat that as an open invitation to tell me how poor my proposed names are, but only for reasons you’d accept as rational if they were directed at yours!

I’m currently re-reading the 2011 White Paper with a view to updating and improving it. This document is a very rich source of information pertinent to this discussion, and in fact appears to answer a couple of the thorniest ones! I’d very strongly recommend re-reading it as preparation for taking part in this discussion.

I’d like to go through the main named concepts one by one, discuss the strengths and weaknesses of the current names, and propose a new name for each concept with some supporting motivations and argument. I don’t expect that my proposals will stick, but they should get us a noticeable step in the right direction, or at least throw light on the relevant issues.

### Sparse Distributed Representation.

I start with this one because, in my experience of learning, reasoning about, writing about, talking about, and explaining HTM, the term SDR is as close to perfect as I can imagine. It has the property of monotonically improving understanding the more you find out about each of the three concepts named.

It is also an easily testable name. We all remember when Francisco showed us the CEPT Retina SDRs. In fact, they were so SDRish that some of us thought they were too good to be true!

### Spatial Pooling.

There are several problems with this term. We understand that “spatial” was chosen to indicate that each presentation of the data has some properties and structure in the sensory domain (such as a shape, size or colour), and it’s called “spatial” as opposed to “temporal”.

A difficulty arises for newcomers who read too much into this use of the word. There is a strong temptation to rely on our commonsense ideas of space when Jeff is really talking about mathematical, vector spaces and the abstract “spaces” of SDRs.

HTM does not require the kind of retinotopic mapping found in V1. The only reason we have literal spatial layouts in just a few primary areas of sensory cortex is because it is a simpler evolutionary and developmental design, not because it is needed for the algorithm. The RDSE, the Geospatial Encoder and the CEPT retina are all superb examples of how “pseudorandom” representations are better than more pictorially understandable spatial representation regimes.

Lastly, we’ve already tripped over this when we started talking about the new sensorimotor theory. L4 cells are now dealing with motor inputs as well as “spatial”, and L3 cells are now expected to “see” a set of L4 outputs whose members are substituted over time. So the word “spatial” really needs to go.

The word “Pooling” has, for many, either no meaning at all (most cases), or worse, the wrong meanings in this context. If you are trying to capture the notion of a noise-tolerant, largely stable representation of closely related sensory input, “pooling” isn’t going to do that for most people.

I’m not sure there is a good word for this, so my suggestion drops this aspect. As mentioned several times in the 2011 White Paper, the concept of pooling (noise-tolerance, high-overlap) is already embedded as a property of the product of SP – the SDR.

I propose the term Pattern Memory for what we currently call Spatial Pooling. This captures the fact that patterns in the data are recognised-learned and that the CLA is developing a memory of patterns it has seen. By not being too specific about which patterns we mean, it also allows us to say that the CLA learns to recognise and remember patterns of input data, stores patterns of synaptic connections, and forms patterns of activation (SDRs) to represent its inputs.

This name is also robust to adopting the new theory. L4 cells can learn sensorimotor patterns, and L3 cells can learn to recognise patterns of membership in a sequence-set. We can run this in the top-down direction too, talking about patterns appearing in L1, motor patterns, patterns of depolarisation, and so on.

### (old) Temporal Pooling.

The problems with using this term in its old context have been well-rehearsed, and it’s now used for the much more appropriate concept of representing a stable(r) sequence-identifying SDR in Layer 3 when sensorimotor transitions from that sequence are occurring in Layer 4. Temporal Pooling, in that sense, is another great name.

I had previously offered the term “Transition Prediction” for the component of CLA involving lateral connections and predictive states. Jeff and Numenta are currently using “Temporal Memory”. I believe both are flawed.

My suggestion accurately captured the limited, 1-timestep scope of this component, and also the fact that prediction is the key to temporal learning. However, it sounds like we need to add words to the name, to reflect “something missing” from the two-word name.

Temporal Memory, on the other hand, is too high-ranking and valuable a name for this relatively basic component. It carries the risk that people will think HTM is just a hierarchy of TMs. Also, “temporal” is too general – the same word is currently used for single-timestep (old TP/TM) all the way up to entire sequences (new TP).

I propose Transition Memory for this second core component of CLA. This captures most literally what the algorithm is doing – learning single transitions. It is also the temporal equivalent of Pattern Memory, using distal dendrites to link to past SDRs just as PM uses proximal dendrites to link to feedforward patterns.

Importantly, the term Transition Memory is not trying to work too hard. We can explain that learned transitions are used to put cells into predictive states, and that these predictive patterns are used both in sensory (variable order) and sensorimotor (first order) temporal learning. They are used to match predicted and actual inputs, detect anomalies and create patterns which indicate continuing successful prediction or trigger a pattern of bursting columns. It seems impossible to me to have one name capture all these aspects, so I propose we stop trying and give the name a break!

In a variation on Pattern Memory (SP), depolarisation due to Transition Memory is combined with feedforward inputs to assist recognition and increase noise-tolerance. In Jeff’s new sensorimotor theory, combining distal with proximal inputs is likely to be key to the function.

### Old and New Versions of HTM/CLA Theory.

In previous posts, I used “old and new” or “2013 and 2014” to distinguish these two generations of the theory. In reworking the White Paper, I’ve recognised that these two theories are akin to the Newtonian versus Relativistic or Quantum views of mechanics. You need to quite deeply understand the simpler theory before you can begin to deal with the far more complex and realistic one. And for many purposes, the simpler theory is perfectly sufficient both for understanding how the neocortex works, and for useful application in software.

I thus propose that the older, simpler theory and model be called the “Sensory Cortical Learning Algorithm” or “Sensory CLA”, the newer being called the “Sensorimotor CLA”.

SCLA (or just CLA) and SMCLA are simple, distinguishable acronyms.

This also allows us to talk about HTM systems with SCLA single-layer regions (as NuPIC can/does), which just do feedforward, sensory hierarchy, or else fuller HTMs which incorporate behaviour, stable sequences, temporal pooling, and true bidirectional hierarchy using SMCLA in each region.

• Aug 14 / 2014

## Implications of the NuPIC Geospatial Encoder

Numenta’s Chetan Surpur recently demoed and explained the details of a new encoder for NuPIC which creates Sparse Distributed Representations (SDRs) from GPS data. Quite apart from the direct applications which this development immediately suggests, I believe that Chetan’s invention has a number of much more profound implications for NuPIC and even HTM in general. This post will explore a few of the most important of these. Both Chetan’s demo, in which he presents the encoder to Numenta people and discusses it with them, and another excellent hands-on tutorial by Matt Taylor are available on YouTube.

### Mechanism

I’ll begin by describing the encoder itself. The Geospatial Encoder takes as input a triple [Lat, Long, Speed] and returns a Sparse Distributed Representation (SDR) which uniquely identifies that position for the given speed. The speed is important because we want the “resolution” of the encoding to vary depending on how quickly the position is changing, and Chetan’s method does this very elegantly.

The algorithm is quite simple. First, a 2D space (Lat, Long) is divided up (virtually) into squares of a given scale (a parameter provided for each encoder), so each square has an x and y integer co-ordinate (the Lat-Long pair is projected using a given projection scheme for convenient display on mapping software). This co-ordinate pair can then be used as a seed for a pseudorandom number generator (Python and numpy use the cross-platform Mersenne Twister MT19937), which is used to produce a real-valued order between 0 and 1, and a bit position chosen from the n bits in the encoding. These can be generated on demand for each square in the grid, always yielding the same results.

To create the SDR for a given position and speed, the algorithm first converts the speed to a radius, forms a box of squares surrounding the position, and calculates the pair [order, bit] for each square in the box. The top w squares (those with the highest order) are chosen, and their bit values are used to choose the w active bits in the SDR.
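The steps above can be sketched in plain Python. This is a minimal illustration of the idea rather than the NuPIC implementation: the function names, the way each square is turned into a seed, and the defaults n=1024 and w=21 are assumptions made for the example, and the input is taken to be an already-projected integer coordinate plus a radius.

```python
import itertools
import random


def order_and_bit(square, n):
    """Return a reproducible (order, bit) pair for one integer grid square."""
    rng = random.Random(repr(tuple(square)))  # seed derived from the square's coordinates
    order = rng.random()                      # real-valued order in [0, 1)
    bit = rng.randrange(n)                    # bit position in [0, n)
    return order, bit


def encode(coordinate, radius, n=1024, w=21):
    """Return the sorted active bit indices for an integer coordinate and radius."""
    # Box of squares surrounding the position, in any number of dimensions.
    ranges = [range(c - radius, c + radius + 1) for c in coordinate]
    scored = [order_and_bit(square, n) for square in itertools.product(*ranges)]
    # Keep the w squares with the highest order and switch on their bits.
    top = sorted(scored, reverse=True)[:w]
    return sorted({bit for _, bit in top})


here = encode((100, 100), radius=5)
next_door = encode((101, 100), radius=5)
far_away = encode((500, 500), radius=5)
print(len(set(here) & set(next_door)), len(set(here) & set(far_away)))
```

Because two of the chosen squares can occasionally map to the same bit, this sketch may produce slightly fewer than w active bits. Neighbouring positions share most of their box, and therefore most of their top-w squares, which is where the overlap comes from.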

### Initial Interpretation

The first thing to say is that this encoder is an exemplar of transforming real-world data (location in the context of movement) into a very “SDR-like” SDR. It has the key properties we seek in an SDR encoder, in that semantically similar inputs will yield highly overlapping representations. It is robust to noise and measurement error in both space and time, and the representation is both unique (given a set scale parameter) and reproducible (given a choice of cross-platform random number generator), independently of the order of presentation of the data. The reason for this “SDR-style” character is that the entire space of squares forms an infinite field of “virtual neurons”, each of which has some activation value (its order) and position in the input bit vector (its bit). The algorithm first sparsifies this representation by restricting its sampling subspace to a box of squares around the position, and then enforces the exact sparseness by picking the w squares using a competitive analogue of local inhibition.

### Random Spatial Neuron Field (Spatial Retina)

This idea can be generalised to produce a “spatial retina” in n-dimensional space which provides a (statistically) unique SDR fingerprint for every point in the space. The SDRs specialise (or zoom in) when you reduce the radius factor, and generalise (or zoom out) when radius is increased. This provides a distance metric between two points which involves the interplay of spatial zoom and the fuzziness of overlap. Any two points will have identical SDRs (w bits of overlap) if you increase the radius sufficiently, and entirely disparate SDRs (0 bits overlap) if you zoom in sufficiently (down to the order of w*scale).

Since the Coordinate Encoder operates in a world of integer-indexed squares, we first need to transform each dimension using its own scale parameter (the Geospatial Encoder uses the same scale for each direction, but this is not necessary). We thus have a single, efficient, simple mechanism which allows HTM to navigate in any kind of spatial environment. This is, I believe, a really significant invention which has implications well beyond HTM and NuPIC.

As Jeff and others mentioned during Chetan’s talk, this may be the mechanism underlying some animals’ ability to navigate using the Earth’s magnetic field. It is possible to envisage a (finite, obviously) field of real neurons which each have a unique response to position in the magnetic field. Humans have a similar ability to navigate, using sensory input to provide an activation pattern which varies over space and identifies locations. We combine whichever modalities work best (blind people use sound and memories of movement to compensate for impaired vision), and as long as the pipeline produces SDRs of an appropriate character, we can now see how this just works.
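The per-dimension scaling step mentioned above is simple enough to show directly. The helper below is hypothetical (the name `to_grid` and its signature are mine, not NuPIC’s) and assumes each dimension is divided into squares by flooring the value over that dimension’s scale.

```python
import math


def to_grid(point, scales):
    """Map a real-valued point to integer square indices, one scale per dimension."""
    if len(point) != len(scales):
        raise ValueError("point and scales must have the same number of dimensions")
    return tuple(math.floor(x / s) for x, s in zip(point, scales))


# A 3D point with a coarse grid in the first two dimensions and a finer one in the third.
print(to_grid((100.0, 250.0, 7.5), (10.0, 10.0, 2.5)))
```

The resulting integer tuple is what a coordinate-style encoder would then take, together with a radius, to produce the SDR.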

### Comparison with Random Distributed Scalar Encoder (RDSE)

The Geospatial Encoder uses the more general Coordinate Encoder, which takes an n-dimensional integer vector and a radius, and produces the corresponding SDR. It is easy to see how a 1D spatial encoder with a fixed speed would produce an SDR for arbitrary scalars, given an initial scale which would decide the maximum resolution of the encoder. This encoder would be an improved replacement for the RDSE, with the following advantages:

• When encoding a value, the RDSE needs to encode all the values between existing encodings and the new value (so that the overlap guarantees are honoured). A 1D-Geo encoder can compute each value independently, saving significantly in time and memory footprint.
• In order to produce identical values for all inputs regardless of the order of presentation, the RDSE needs to “precompute” even more values in batches around a fixed “centre” (eg to compute f(23) starting at 0, we might have to compute [f(-30),…,f(30)]). Again, 1D-Geo scalar encoding computes each value uniquely and independently.
• Assuming scale (which decides the max resolution) is fixed, the 1D-Geo scalar encoding can compute encodings of variable resolution with semantic degradation by varying speed. The SDR for a value is exactly unique for the same speed, but changes gradually as speed is increased or decreased. The RDSE has no such property.
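To make the comparison concrete, here is a rough sketch of what such a 1D scalar encoder might look like. It is not an existing NuPIC encoder: the names, the seeding scheme, and the defaults (resolution, radius, n=400, w=21) are assumptions chosen for illustration. The point is that each value is encoded as a pure function of that value, with no stored state and no dependence on which values were seen before.

```python
import random


def square_order_and_bit(index, n):
    """Reproducible (order, bit) pair for one integer position on the number line."""
    rng = random.Random("scalar:%d" % index)  # seed depends only on the index
    return rng.random(), rng.randrange(n)


def encode_scalar(value, resolution=1.0, radius=40, n=400, w=21):
    """Encode a scalar as a set of active bits, independently of any other value."""
    centre = int(value // resolution)         # integer square containing the value
    window = range(centre - radius, centre + radius + 1)
    scored = [square_order_and_bit(i, n) for i in window]
    top = sorted(scored, reverse=True)[:w]    # the w highest-order squares win
    return {bit for _, bit in top}


a = encode_scalar(23.0)
b = encode_scalar(24.0)
c = encode_scalar(1000.0)
print(len(a & b), len(a & c))
```

Here `radius` plays the role that speed plays in the Geospatial Encoder: widening it makes nearby values share more of their window, so the encoding degrades gradually rather than abruptly.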

This would strongly suggest that we can replace the RDSE with a 1D coordinate spatial encoder in NuPIC, and get all the above benefits without any compromise.

### Combination with Spatially-varying Data

It is clear how you could combine this encoding scheme with data which varies by location, to create a richer idea of “order” in feeding the SDR generation algorithm. For example, you could combine random “order” with altitude or temperature data to choose the top w squares. Alternatively, the pure spatial bit signature of a location may be combined in parallel with the encoded values of scalar quantities found at the current location, so that a HTM system associatively learns the spatial structure of the given scalar field.

### Spatially Addressed Memory

The Geospatial Encoder computes a symbolic SDR address for a spatial location, effectively a “name” or “word” for each place. The elements or alphabet of this encoding are simply random order activation values of nearby squares, so any more “real” semantic SDR-like activation pattern will do an even better job in computing spatial addresses. We use memories of spatial cues (literally, landmarks), emotional memories, maps, memories of moving within the space, textual directions, and so on to encode and reinforce these representations. This model explains why memory experts often use Memory Palaces (aka the Method of Loci) to remember long sequences of data items. They associate each item (or an imagined, memorable visual proxy) with a location in a very familiar spatial environment. It also explains the existence of “place neurons” in rodent hippocampi – these neurons are each participating in generating a spatial encoding similar in character to the Geospatial Encoder.

### Zooming, Panning and Attention

This is a wonderful model for how we “zoom in” or “zoom out” and perceive a continuously and smoothly varying model of the world. It also models how we can perceive gracefully degrading levels of detail depending on how much time or attention we pay for a perception. In this case, the “encoder” detailed here would be a subcortical structure or a thalamus-gated (attention controlled) input or relay between regions.

If we could find a mechanism in the brain which controls the size and position of a “window” of signals (akin to our variable box of squares), we would have a candidate for our ability to use attention to control spatial resolution and centre of focus. Such a mechanism may automatically arise from preferentially gating neurons at the edges of a “patch”, by virtue of the inhibition mechanism’s ability to smoothly alter the representation as inputs are added or removed.

This mechanism would also explain boundary extension error, in which we “fill out” areas surrounding the physical boundaries of objects and images. As explained in detail in her talk at the Royal Institution, Eleanor Maguire believes that the hippocampus is crucial for both this phenomenon and our ability to navigate in real space. As one of the brain components at the “top” of the hierarchies, the hippocampus may be the place where we can perform the crucial “zooming and panning” operations and where we manipulate spatial SDRs as suggested by the current discovery.

### Implementation Details

The coordinate encoder has a deterministic, O(1), order-independent algorithm for computing both “order” and bit choice. One important issue is that the pseudorandom number generator is Python-specific, and so a Java encoder (which uses a different pseudorandom number generator) will produce completely different answers. The solution is to use the same RNG as Python (and numpy), the Mersenne Twister MT19937, which is also used by default in numerous other languages.

I believe it would be worth exploring using Perlin noise to generate the order and bit choice values. This would give you a) identical encodings across platforms, b) pseudorandom, uncorrelated values when the noise samples are far enough apart (e.g. when the inputs are integers as in this case), and c) smoothly changing values if you use very small step sizes.

Just one point about changing radius and its effect on the encoding. I’m very confident that the SDR is very robust to changes in radius, due to the sparsity of the SDRs. In other words, the overlap in an SDR at radius r with that at radius r’ (at the same GPS position) will be high, because you are only adding or removing an annulus around the same position (this will be similar to adding or removing a strip of squares when a small position change occurs).
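One further way to sidestep the platform-specific RNG problem, offered here purely as a sketch, is to derive both values from a standard hash of the coordinates. To be clear, this is neither what NuPIC does nor the Perlin noise idea above; it is a swapped-in technique, and the function name and the byte layout are my own assumptions. SHA-256 is available in essentially every language, so any port that follows the same layout would compute the same values.

```python
import hashlib


def order_and_bit(coordinate, n):
    """Derive a reproducible (order, bit) pair from a SHA-256 hash of the coordinates."""
    key = ",".join(str(int(c)) for c in coordinate).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Top 53 bits of the first 8 bytes give an exactly representable float in [0, 1).
    order = (int.from_bytes(digest[:8], "big") >> 11) / float(1 << 53)
    # The next 8 bytes, reduced modulo n, give the bit position.
    bit = int.from_bytes(digest[8:16], "big") % n
    return order, bit


print(order_and_bit((3, -4), 1024) == order_and_bit((3, -4), 1024))
```

Unlike Perlin noise, a hash gives no smoothness at small step sizes, so this only covers the cross-platform and uncorrelated-values requirements.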

### Links to the Demo and Encoder Code

Chetan’s demo code (which is really comprehensive) is at https://github.com/numenta/nupic.geospatial. The Geospatial Encoder code is at https://github.com/numenta/nupic/blob/master/nupic/encoders/geospatial_coordinate.py and the Coordinate Encoder is at https://github.com/numenta/nupic/blob/master/nupic/encoders/coordinate.py.
