• Dec 17 / 2014
  • 0
Cortical Learning Algorithm

Multilayer Model for Hierarchical Temporal Memory

This post sketches a simple model for multilayer processing in Hierarchical Temporal Memory (HTM). It is based on a combination of Jeff Hawkins’ and Numenta’s current work on sensorimotor extensions to HTM, my previous ideas on efficiency of predicted sparseness as well as evidence from neuroscience.

HTM has entered a new phase of development in the past year. Hawkins and his colleagues are currently extending HTM from a single-layer sensory model (assumed to represent high-order memory in Layer 2/3 of cortex) to a sensorimotor model which involves Transition Memory of combined sensory and motor inputs in L4, which is Temporally Pooled in L2/3. Once this is successfully modelled, the plan is to examine the role of L5 and L6 in motor behaviour and feedback.

Recent research in neuroscience has significantly improved our understanding of the various pathways in cortical circuits. [Douglas & Martin, 2004] proposed a so-called canonical pathway in which thalamic inputs arrive in L4, which projects to L2/3 (which sends its output to higher regions), then to L5 (which outputs motor signals) and from there to L6 (which outputs feedback to lower layers and thalamus). Teams led by Randy Bruno [deKock et al, 2007], [Constantinople & Bruno, 2013] have found that there is also a parallel circuit thalamus-L5-[L6 and L4] as well as an L3-L4 feedback pathway.

Figure 1, which is from [deKock et al, 2007], shows the calculated temporal pattern of activity in a piece of rat barrel cortex (called D2) consisting of about 9000 neurons. Barrel cortex is so named because the neurons responsive to a single Primary Whisker (PW) form a barrel-like columnar structure in this part of rat cortex. The paper estimates the layer populations in this “column” to be 3200 L2/3, 2050 L4, 1100 L5A, 1050 L5B and 1200 L6 excitatory cells.

Figure 1. Evolution of Action Potential (AP) rates in rat barrel cortex when experimenters stimulate the associated whisker. VPM is the thalamic region which projects to this part of cortex. From [deKock et al, 2007].

We’ll examine this data from the point of view of HTM. Firstly, we see that the spontaneous activity in all layers is very sparse (0.3% in L2/3, 0.6% in L4, 1.1% in L5A, 3% in L5B and 0.5% in L6), and that activity rises and falls dramatically and differently in each layer over the 150ms following stimulation.

Looking at the first 10ms and only in L4-L2/3, we see the expected sparse activations in L4 and L3, which is followed by a dramatic increase (x17 in L4, x10 in L2/3) representing bursting in both layers, likely because the input was unpredicted. Over the next 20ms, activity in L2/3 drops sharply back to 2x the baseline, but that in L4, after 10ms of dropping, rises again to practically match the original activation. This is matched in the next 10ms by a rise in L2/3 activation, after which both levels drop gradually towards the baseline over more than 100ms. We see another, somewhat different “wavelike” response pattern in the L5/6 complex.

So, can we build a model using HTM principles which explains this data (and, even better, predicts other unseen data)? I believe there must be such a model, because we see this kind of processing everywhere we look in cortex.

Before we get to that, let’s identify some important principles which arise from our current understanding of cortical function.

I: A Race to Represent

The first principle is that a population of neurons which share a common set of inputs is driven to “best represent” its inputs using a competitive inhibition process. Each neuron is accumulating depolarising input current from a unique set of contextual and immediate sources, and the first to fire will inhibit its neighbours and form part of the representation.

Each neuron can thus be seen as analogous to a “microtheory” of its world, and it will accumulate evidence from past context, current sensory inputs, and behaviour to compete in a race for its theory to be “most true”.

II: Different Sources of Evidence

The purpose of the layered structure of neocortex is to allow each population to combine its own individual evidence sources and learn to represent the “theory” of that evidence. The various populations (or sublayers) form a cyclic graph structure of evidence flow, and they cooperate to form a stable, predictable, sensorimotor model of the current world.

III: Efficiency of Predictive Sparseness

Each neuron combines contextual or predictive inputs (on distal synapses) with evidence from immediate sources (on proximal synapses). In addition, the columnar inhibitory sheath is also racing to recognise its inputs, which come largely from the same feedforward sources as its contained pyramidal cells. The sheath has an advantage as it is a better responder [cite] to the feedforward evidence alone than any of its contained cells, so there is also a race between predictive assisted recognition and simple spatial recognition of reality.

The result of the race depends on which wins – if a single pyramidal cell wins due to high predictive depolarisation (lots of contextual evidence), then it alone will fire. Otherwise, there is a short window of time which allows some number of the most predictive cells in the column to fire in turn, before they are inhibited by a vertical process. This “bursting” encodes the difference between the reality (as signalled by this column’s inhibitory sheath firing) and the population’s prediction (as would have been signalled by a highly predictive cell in some losing nearby column).

IV: Self-stabilisation through Sparse Patterns

If we consider a cortical region in its “steady state”, we see highly sparse (non-bursting) representations everywhere, and the behavioural output (from Layer 5) will be a sequence of highly sparse patterns which result in very fine motor adjustments (or none at all). This corresponds to the region perfectly modelling the sensorimotor world it experiences and making optimal predictions with minimal corrective behaviour.

A deviation from this state (failure of prediction) leads to a partial change in representation (because reality differs from prediction) and some amount of redundant predictive representation (when several cells burst in new columns). This departure from maximal sparseness is transmitted to the downstream sublayers, causing their “view of the world” and thus their own state to change. Depending on how well each sublayer can predict these changes, the cascade may halt, or instead continue to roll around the cyclical graph of sublayers, causing behavioural side-effects as it goes.

V: A Team of Rivals – “Explaining Change” by Witnessing or Acting

Within each sublayer, some cells will have inputs which correspond to “observing” the world as it evolves on its own (by predicting from context), while others will respond better when the organism is taking certain actions, and will have learned to associate certain changes with those behaviours. The representation in each sublayer will be some mixture of these, and, in the case of motor output cells in L5, the “decisions” of the region will be those which restore the predictability of things.

The reason is simple. While the activity in the region is sparse, all the active cells are predicting their activity, and the outputs of the region reflect the happy condition. These include motor output, which by definition is acting to prolong the current status of the region (if it was acting to depart from the status, these motor cells would not be still firing).

When something changes, and a set of new neurons becomes active, new neurons become temporarily active throughout the various sublayers, but they will all be cells which have learned to respond better to the new state of the world than the previously active cells. These cells will have learned to associate their own activity with the new situation, by being more right about predicting their own activity in this new context. And this, in turn, will be true only if they are the long-term winners in the establishment of a new, stable cycle of sparse activity, or alternatively if they have regularly participated in the transition to a new stable state. Either way, the system is self-stabilising, acting to right itself and improve the prediction.

A Multilayer Cortical Model

I claim that the above principles are enough to construct a simple model of how the sublayers in a region of cortex interact and co-operate.

I use the word “sublayers” because each layer (L1-6) may contain more than one population or class of neurons. We’ll pretend these are each in their own sublayer, but recognising that there are local connections between cells in sublayers which are important to how things work.

So as not to confuse, I’ll not use the common notation for sublayers found in the literature (eg L5A), instead I’ll use labels such as L5.1, L5.2 and so on. The “minor number” will usually indicate sublayers successively “far away” from the sensorimotor inputs, both in terms of time and the number of neurons in the path to reach them. I’ll also use the deKock diagram above to anchor the place and time of each part of the response to a large sensory stimulus.

I’ll also assume the idea that when a neuron projects an axon, it does so in order to connect proximally with its target. Thus, L4 projections to L2/3 are proximal on L2/3 cells, likewise with L6 to L4, while the L2/3->L4 feedback pathway uses distal dendrites.

Layer 4.1 – Sensorimotor Transition Prediction (0-20ms)

Layer 4 is said [cite] to receive inputs from L6 (65%), elsewhere in L4 (25%), and directly from thalamus (5%). In addition, some cells in L4 have distal dendrites in L2/3. We’ll split L4 into two sublayers, depending on whether they receive inputs from L2/3 (L4.1 no, L4.2 yes). Some researchers [cite] divide L4 into two populations – stellar cells and pyramidal cells, and it may be that the split is along these lines.

My hypothesis is that L4.1 cells are making predictions of sensorimotor transitions, using thalamic sensorimotor input as (primarily) feedforward, and a combination of local predictive context (L4) and information about the region’s current sensorimotor output (from L6). I say “primarily” because a single feedforward axon could synapse with a cell both on its proximal and distal dendrites, and this would be even more important for the stellar dendritic branches of L4.1 cells.

Note that the L4 inputs to L4.1 includes evidence of the output of L2/3 (a more stable “sensory” representation) via L4.2. The L6-sourced inputs also include evidence of the stable feedback pattern being sent to lower regions, which are themselves indirectly influenced by L5’s use of L2/3 (see later).

So, L4.1 is receiving fast-changing sensorimotor inputs, along with slower-changing context from within L4, and both sensory and motor outputs of the region. It uses whatever best evidence it has to predict any transitions in the thalamic input.

Successful prediction in L4.1 results in it outputting a highly sparse pattern on each transition. Failures in prediction are encoded as a union of “nearly predicted” cell activations in the columns best recognising the unpredicted thalamic input.

This might not seem sensible when thalamic inputs are only 5% of what L4.1 is receiving, but remember that the other inputs are usually highly sparse (1-2%) and change much more slowly, so thalamic feedforward input to L4.1 acts as a tiebreaker among predictions. This pattern is repeated throughout cortex because bursting cells cause a similar disruptive, temporary tiebreaking signal in downstream sublayers.

Layers 3.1 and 2.1 – Temporal Pooling (10-20ms)

Layers 2 and 3 are usually treated as one. Both receive most of their feedforward input from L4 and have distal inputs both from within L2/3 and from L1 (which gets feedback input from L6 in higher regions).

I’ll split the two by saying that L2 gets more input from L1 than L3 does. In other words, L2 is more primed or biased by higher-level context, while L3 is less likely to be dominated by feedback. There is evidence [cite] of this differentiation, so let’s assume it’s useful.

Now, L2.1/L3.1 are receiving feedforward inputs from L4.1. If those inputs are sparse, then only those cells in L2/3 which have many active inputs will be part of the SDR in this layer (it’s one layer in a column sense, just the L2 “end” has a higher L1 input mix). In addition, they’ll need good intralayer and/or top-down predictive input to maintain stable activity.

The stability in L2.1/3.1 comes from the combination of stable predictive inputs from within the layer and from above. This prebiases predictive cells to recognise the successive sparse inputs from L4.1 and continue to remain active. The active cells in L2/3 have learned to use a combination of sequence memory (intralayer) and top-down feedback to associate with each fast-changing SDR in L4.1. This mechanism is reinforced by the fast L4.1-L2/3.1-L4.2-L4.1 feedback loop, along with the much longer feedback loops.

This is where the L2/3 difference is important. The more superficial cells in L2/3 are more strongly biased by top-down feedback from L1. We have evidence [cite] that L2 projects more strongly to the deep part of L5, while L3 projects more to superficial L5. Thus, the choices of active cells in L2/3 encode how much sequence memory and how much top-down are involved in the representation.

L6.1 – Comparing Reality with Expectations from Behaviour (0-10ms)

[Constantinople and Bruno], among others, show that direct thalamic inputs arrive simultaneously at L4 and L5/L6, suggesting that L5/6 and L4/L2/3 are performing parallel operations on sensorimotor inputs. While the L4-L2/3 system is relatively simple (at least at first order approximation), the L5/6 system is much more complex, involving a larger number of functional populations with diverse purposes. I’ll describe a minimum of these for now.

Layer 6.1 cells are the first in L5/6 to respond to thalamic inputs, suggesting a role analogous to L4.1. Unlike L4 cells, however, these cells have immediate access to both the recent L6 output to lower regions (representing the current steady state of the region) and the current motor output of the region (from L5). This much richer set of evidence sources allows L6.1 to make finer-grained predictions of the expected thalamic inputs, and its response when prediction fails is the primary driver for changes in L5 motor output and signals to higher regions.

L5.1 – Responding to Change by Acting (0-20ms)

I speculate that the thick-tufted L5B cells correspond to L5.1 in my model. These cells also receive direct thalamic inputs, as well as inputs from L6, L2/3 (primarily the L2 “end”) and top-down feedback via L1. L5.1’s purpose is to act quickly if necessary, in response to a significant change in its world. Any dramatic change in either sensorimotor patterns or context will cause L5 to output a large, non-sparse signal which it has learned is appropriate to that change.

In the steady state, with all inputs sparse, L5.1 generates a minimal, sparse signal which corresponds to energetically efficient, smooth behaviour in the organism. Sudden (unpredictable) changes in either sensorimotor inputs (thalamic), correspondence between behaviour and outcomes (L6), sequence memory predictions (L2/3) or top-down “instructions” (L1) will cause a dramatic rise in output (from 3% to over 10% active cells) which results in new corrective motor behaviour as well as an alarm signal to higher layers.

L6.2 – Co-ordination of Responses (10-30ms)

In Layer 6, a second population of cells is responsible for integrating any rising activity in L5.1 with context, signalling L4 of the new situation, and affecting the L6 feedback output. The better L6.2 can predict/recognise the output of L5, the sparser its signal to L4 and the smaller the effect on L6 feedback output. Thus, L6.2 acts either to help L4 make good predictions of transitions (by sending sparse signals), or to disrupt steady-state prediction in L4 (and later L2/3) into a new sensorimotor regime.

L4.2 and L2/3: Stabilising Prediction (30-50ms)

After 30ms or so, pyramidal cells in L4 are sampling the “sensory” response of L2/3 along with signals from L6 about the motor response. L4.2 can now generate a signal for L2/3 which is more sparse than the initial L4.1 response, but still well above baseline. Over the next 20-50ms, L4.2 and L2/3 use this feedback loop (along with the L5/6 motor loop) to reduce their activity and settle into a steady predictive state.

I propose that it is these L4.2 cells which participate in the steady-state activity of L4, along with the L5.2 cells (next section). L4.1 and L5.1 are representative of large transitions between steady, predictive sparse states.

L5.2 and L6 – Stabilising Behaviour (40-50ms)

L5.2, which corresponds to thick-tufted cells in L5A (in deKock’s diagram). This sublayer combines the context inputs (from L6, L1 and L5) with the lagging, stabilising output from L2/3 (which is being stabilised by the L4.2 feedback loop) and produces a second motor response (and a second signal to higher layers). With more information about how L2/3 responded to the initial signal, L5.2 can learn to produce a more nuanced behaviour than the “knee-jerk” response of L5.1, or perhaps counteract it to resume stability.

L6 is again used to provide feedback of behaviour to L4 and aid its prediction.

Multilayer CLA

Figure 2: Schematic showing main connections in the multilayer model. Each “neuron” represents a large number of neurons in each sublayer.

ppMultilayer Flow Diagram

Figure 3: Schematic showing main axonal (arrows) and dendritic (tufts) links in the multilayer model.


We can see how this model allows a region of cortex to go from a highly sparse, quiescent steady state, absorb a large sensory stimulus, and respond, initially with dramatic changes in activity, then with decreasing waves of disturbance and motor response, in order to restore a new steady state which is self-sustaining.

The fast-responding L4.1 and L5.1 cells react first to a drastic change, causing representations in L2/3 and L6 to update, and then the second population, using L4.2 to stabilise perception and L5.2 to stabilise behaviour, takes over and settles into a new steady state.


Apart from the rat barrel cortex example used here, we can see how this model can be applied in other well-studied cortical systems.

Microsaccades Stabilise Vision in V1

In V1, the primary thalamic input is from retinal ganglion cells which detect on-centre or off-centre patterns in the retinal image. L4 cells are understood [cite] mostly to contain so-called “simple cells” which respond to short oriented “bars” formed by a small number of neighbouring ganglion cells. L2/3, by the same token, contains many more “complex” cells which respond to overlapping or moving bars corresponding to longer edges or a sequence of edge movements. L4 also contains a smaller number of cells with these response properties.

I propose that the simple cells are L4.1, while the L2/3 complex cells are temporally poling over these cells, and the second population of L4 complex cells are actually L4.2, responding to the activity in L2/3. L5 in steady state is causing the eye to microsaccade in order to stabilise the “image” formed in L2/3 of the edges in the scene as tiny movements of organism and objects cause the exact patterns in L4.1 to change predictably.

Deviations beyond the microsaccade scale will cause bursting in L4.1, and the SDR shown by L2/3 will change to a new one representing the new sensory input. If L2/3 can use L1 and its own predictive input to correctly expect this new state, it will remain sparse and cause minimal reaction in L5 (in the second phase). If not, L2/3 will burst, L5 will generate a large signal, and thus V1 will pass the buck up to a region which can deal with changes of scene.

This process will be repeated at higher levels, at higher temporal and spatial scales.

Speech Generation

In speech generation, the sensory input is from the ears, and the motor output is to the vocal system. The region responsible for generating speech is controlled (via L1) by higher regions expressing a high-level representation of sounds to be produced. Layer 2/3 uses this input to bias itself to represent all sequences of sounds which match the L1 signal. Layer 5 receives both these signals and is thus highly predictive of representing the motor actions for these sequences. Since all the sublayers are at non-zero sparseness, activity will propagate and be amplified at each stage by the predictive states until a “most probable” starting sound is generated. The region will continue to generate the correct motor activity, using prediction to correct for differences between the expected and perceived sounds.

Citations (to be completed)

Constantinople, Christine M. and Bruno, Randy M.: Deep Cortical Layers Are Activated Directly by Thalamus. Science 28 June 2013: Vol. 340 no. 6140 pp. 1591-1594 DOI: 10.1126/science.1236425 [Abstract Free]

Douglas, Rodney J. and Martin, Kevan A.C.: Neuronal Circuits of the Neocortex, Annu. Rev. Neurosci. 2004. 27:419–51 doi:10.1146/annurev.neuro.27.070203.144152 [Google Scholar]

[Abstract/Full Text]

  • Dec 08 / 2014
  • 0
Cortical Learning Algorithm

Response to Yann LeCun’s Questions on the Brain

Yann LeCun recently posed some questions on Facebook about the brain. I’d like to address these really great questions in the context of Hierarchical Temporal Memory (HTM). I’ll intersperse the questions and answers in order.

A list of challenges related to how neuroscience can help computer science:

– The brains appears to be a kind of prediction engine. How do we translate the principle of prediction into a practical learning paradigm?

HTM is based on seeing the brain as a prediction system. The Cortical Learning Algorithm uses intra-layer connections to distal dendrites to learn transitions between feedforward sensory inputs. Individual neurons use inputs from neighbouring, recently active neurons to learn to predict their own activity in context. The layer as a whole chooses as sparse a set of best predictor-recognisers to represent the current situation.

- Good ML paradigms are built around the minimization of an objective function. Does the brain minimize an objective function? What is this function?

The answer is different at each level of the system, but the common theme is efficiency of activity. Synapses/dendritic spines form, grow and shrink in response to incoming signals, in order to maximise the correlation between an incoming signal and the neuron’s activity. Neurons adjust their internal thresholds and other parameters in order to maximise their probability of firing given a combined feedforward/context input pattern. Columns (represented using a simplified sheath of inhibitory neurons) again adjust their synapses in order to maximise their contained cells’ probability of becoming active given the inputs. The objective metric of a layer of neurons is the sparsity of representation, with errors in prediction-recognition being measured as lower sparsity (bursting in columns). A region of cortex produces motor output which minimises deviations from stable predicted representations of the combined sensory, motor, contextual and top-down inputs.

- Good ML systems estimate the gradient of their objective function in order to minimize it. Assuming the brain minimizes an objective function, does it estimate its gradient? How does it do it?

Each component in HTM uses only local information to adapt and learn. The optimisation emerges from each components’ responses as it learns, and from competition between columns and neurons to represent the inputs.

- Assuming that the brain computes some sort of gradient, how does it use it to optimize the objective?

There is no evidence of a mechanism in the brain which operates in this way. HTM does without such a mechanism.

- What are the principles behind unsupervised learning? Much of learning in the brain is unsupervised (or predictive). We have lots of unsupervised/predictive learning paradigms, but none of them seems as efficient as what the brain uses. How do we find one that is as efficient and general as biological learning?

CLA is a highly efficient and completely general unsupervised learning mechanism, which automatically learns the combined spatial and temporal structure of the inputs.

- Short term memory: the cortex seems to have a very short term memory with a span of about 20 seconds. Remembering things for more than 20 seconds seems to require the hippocampus. And learning new skills seems to take place in the cortex with help from the hippocampus. How do we build learning machines with short-term memory? There have been proposals to augment recurrent neural nets with a separate associative short-term memory module (e.g LSTM, Facebook’s “Memory Networks”, Deep Mind’s “Neural Turing Machine”). This is a model by which the “processor” (e.g. a recurrent net) is separate from the “RAM” (e.g. a hippocampus-like assoicative memory). Could we get inspiration from neuroscience about how to do this?

Hierarchy in HTM provides short-term memory, with higher-level regions seeking to form a stable representation of the current situation in terms of sequence-sets of lower-level representations of the state of the world. Each region uses prediction-assisted recognition to represent its input, predict future inputs, and execute behaviours which maintain the predicted future.

- Resource allocation in short-term memory: if we have a separate module for short-term memory, how are resources allocated within it? When we enter a room, our position in the room, the geometry of the room, and the landmarks and obstacles in it are stored in our hippocampus. Presumably, the neural circuits used for this are recycled and reused for future tasks. How?

There’s no evidence of a separate short-term memory module in the brain. The entire neocortex is the memory, with the ephemeral activity in each region representing the current content. Active hierarchical communication between regions lead to the evolution of perception, decisions and behaviour. At the “top” of the hierarchy, the hippocampus is used to store and recycle longer-term memories.

- How does the brain perform planning, language production, motor control sequences, and long chains of reasoning? Planning complex tasks (which includes communicating with people, writing programs, and solving math problems) seems like an important part of AI system.

Because of the multiple feedforward and feedback pathways in neocortex, the entire system is constantly acting as a cyclic graph of information flow. In each region, memories of sequences are used in recognition, prediction, visualisation, execution of behaviour, imagination and so on. Depending on the task, the representations can be sensory, sensorimotor, pseudosensory (diagrammatic) or linguistic.

- resource allocation in the cortex: how does the brain “recruit” pieces of cortex when it learns a new task. In monkeys that have lost a finger, the corresponding sensory area gets recruited by other fingers when the monkey is trained to perform a task that involves touch.

There is always a horizontal “leakage” level of connections in any area of neocortex. When an area is deprived of input, neurons at the boundary respond to activity in nearby regions by increasing their response to that activity. This is enhanced by the “housekeeping” glial cells embedded in cortex, which actively bring axons and dendrites together to knit new connections.

- The brain uses spikes. Do spikes play a fundamental role in AI and learning, or are they just made necessary by biological hardware?

Spikes are very important in the real brain, but they are not directly needed for the core processing of information, so HTM doesn’t model them per se. We do use an analogue to Spike Timing Dependent Plasticity in the core Hebbian learning of predictive connections, but this is simplified to a timestep-based model rather than individual spikes.

We have elements of answers and avenues for research for many of these points, but no definite/perfect solutions.

HTM’s solutions are also neither perfect nor definitive, but they are our best attempt to address your questions in a simple, coherent and effective system, which directly depends on data from neuroscience.

Thanks to Yann for asking such pertinent questions about how the brain might work. It’s a recognition that the brain has a lot to teach us about intelligence and learning.

  • Nov 29 / 2014
  • 0
Clortex (HTM in Clojure), Cortical Learning Algorithm, NuPIC

Mathematics of HTM Part II – Transition Memory

This article is part of a series describing the mathematics of Hierarchical Temporal Memory (HTM), a theory of cortical information processing developed by Jeff Hawkins. In Part One, we saw how a layer of neurons learns to form a Sparse Distributed Representation (SDR) of an input pattern. In this section, we’ll describe the process of learning temporal sequences.

We showed in part one that the HTM model neuron learns to recognise subpatterns of feedforward input on its proximal dendrites. This is somewhat similar to the manner by which a Restricted Boltzmann Machine can learn to represent its input in an unsupervised learning process. One distinguishing feature of HTM is that the evolution of the world over time is a critical aspect of what, and how, the system learns. The premise for this is that objects and processes in the world persist over time, and may only display a portion of their structure at any given moment. By learning to model this evolving revelation of structure, the neocortex can more efficiently recognise and remember objects and concepts in the world.

Distal Dendrites and Prediction

In addition to its one proximal dendrite, a HTM model neuron has a collection of distal (far) neurons, which gather information from sources other than the feedforward inputs to the layer. In some layers of neocortex, these dendrites combine signals from neurons in the same layer as well as from other layers in the same region, and even receive indirect inputs from neurons in higher regions of cortex. We will describe the structure and function of each of these.

The simplest case involves distal dendrites which gather signals from neurons within the same layer.

In Part One, we showed that a layer of \(N\) neurons converted an input vector \(\mathbf x \in \mathbb{B}^{n_{\textrm{ff}}}\) into a SDR \(\mathbf{y}_{\textrm{SDR}} \in \mathbb{B}^{N}\), with length\(\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N\), where the sparsity \(s\) is usually of the order of 2% (\(N\) is typically 2048, so the SDR \(\mathbf{y}_{\textrm{SDR}}\) will have 40 active neurons).

The layer of HTM neurons can now be extended to treat its own activation pattern as a separate and complementary input for the next timestep. This is done using a collection of distal dendrite segments, which each receive as input the signals from other neurons in the layer itself. Unlike the proximal dendrite, which transmits signals directly to the neuron, each distal dendrite acts as an active coincidence detector, firing only when it receives enough signals to exceed its individual threshold.

We proceed with the analysis in a manner analogous to the earlier discussion. The input to the distal dendrite segment \(k\) at time \(t\) is a sample of the bit vector \(\mathbf{y}_{\textrm{SDR}}^{(t-1)}\). We have \(n_{ds}\) distal synapses per segment, a permanence vector \(\mathbf{p}_k \in [0,1]^{n_{ds}}\) and a synapse threshold vector \(\vec{\theta}_k \in [0,1]^{n_{ds}}\), where typically \(\theta_i = \theta = 0.2\) for all synapses.

Following the process for proximal dendrites, we get the distal segment’s connection vector \(\mathbf{c}_k\):

$$c_{k,i}=(1 + sgn(p_{k,i}-\theta_{k,i}))/2$$

The input for segment \(k\) is the vector \(\mathbf{y}_k^{(t-1)} = \phi_k(\mathbf{y}_{\textrm{SDR}}^{(t-1)})\) formed by the projection \(\phi_k:\lbrace{0,1}\rbrace^{N-1}\rightarrow\lbrace{0,1}\rbrace^{n_{ds}}\) from the SDR to the subspace of the segment. There are \({N-1}\choose{n_{ds}}\) such projections (there are no connections from a neuron to itself, so there are \(N-1\) to choose from).

The overlap of the segment for a given \(\mathbf{y}_{\textrm{SDR}}^{(t-1)}\) is the dot product \(o_k^t = \mathbf{c}_k\cdot\mathbf{y}_k^{(t-1)}\). If this overlap exceeds the threshold \(\lambda_k\) of the segment, the segment is active and sends a dendritic spike of size \(s_k\) to the neuron’s cell body.

This process takes place before the processing of the feedforward input, which allows the layer to combine contextual knowledge of recent activity with recognition of the incoming feedforward signals. In order to facilitate this, we will change the algorithm for Pattern Memory as follows.

Each neuron begins a timestep \(t\) by performing the above processing on its \({n_{\textrm{dd}}}\) distal dendrites. This results in some number \(0\ldots{n_{\textrm{dd}}}\) of segments becoming active and sending spikes to the neuron. The total predictive activation potential is given by:

$$o_{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$

The predictive potential is combined with the overlap score from the feedforward overlap coming from the proximal dendrite to give the total activation potential:

$$a_j^t=\alpha_j o_{\textrm{ff},j} + \beta_j o_{\textrm{pred},j}$$

and these \(a_j\) potentials are used to choose the top neurons, forming the SDR \(Y_{\textrm{SDR}}\) at time \(t\). The mixing factors \(\alpha_k\) and \(\beta_k\) are design parameters of the simulation.

Learning Predictions

We use a very similar learning rule for distal dendrite segments as we did for the feedforward inputs:

$$ p_i^{(t+1)} =
(1+\sigma_{inc})p_i^{(t)} & \text {if cell $j$ active, segment $k$ active, synapse $i$ active} \\
(1-\sigma_{dec})p_i^{(t)} & \text {if cell $j$ active, segment $k$ active, synapse $i$ not active} \\
p_i^{(t)} & \text{otherwise} \\
\end{cases} $$

Again, this reinforces synapses which contribute to activity of the cell, and decreases the contribution of synapses which don’t. A boosting rule, similar to that for proximal synapses, allows poorly performing distal connections to improve until they are good enough to use the main rule.


We can now view the layer of neurons as forming a number of representations at each timestep. The field of predictive potentials \(o_{\textrm{pred},j}\) can be viewed as a map of the layer’s confidence in its prediction of the next input. The field of feedforward potentials can be viewed as a map of the layer’s recognition of current reality. Combined, these maps allow for prediction-assisted recognition, which, in the presence of temporal correlations between sensory inputs, will improve the recognition and representation significantly.

We can quantify the properties of the predictions formed by such a layer in terms of the mutual information between the SDRs at time \(t\) and \(t+1\). I intend to provide this analysis as soon as possible, and I’d appreciate the kind reader’s assistance if she could point me to papers which might be of help.

A layer of neurons connected as described here is a Transition Memory, and is a kind of first-order memory of temporally correlated transitions between sensory patterns. This kind of memory may only learn one-step transitions, because the SDR is formed only by combining potentials one timestep in the past with current inputs.

Since the neocortex clearly learns to identify and model much longer sequences, we need to modify our layer significantly in order to construct a system which can learn high-order sequences. This is the subject of the next part of this series.

Note: For brevity, I’ve omitted the matrix treatment of the above. See Part One for how this is done for Pattern Memory; the extension to Transition Memory is simple but somewhat arduous.

  • Nov 28 / 2014
  • 0
Clortex (HTM in Clojure), Cortical Learning Algorithm, NuPIC

Mathematics of Hierarchical Temporal Memory

This article describes some of the mathematics underlying the theory and implementations of Jeff Hawkins’ Hierarchical Temporal Memory (HTM), which seeks to explain how the neocortex processes information and forms models of the world.

Note: Part II: Transition Memory is now available.

The HTM Model Neuron – Pattern Memory (aka Spatial Pooling)

We’ll illustrate the mathematics of HTM by describing the simplest operation in HTM’s Cortical Learning Algorithm: Pattern Memory, also known as Spatial Pooling, forms a Sparse Distributed Representation from a binary input vector. We begin with a layer (a 1- or 2-dimensional array) of single neurons, which will form a pattern of activity aimed at efficiently representing the input vectors.

Feedforward Processing on Proximal Dendrites

The HTM model neuron has a single proximal dendrite, which is used to process and recognise feedforward or afferent inputs to the neuron. We model the entire feedforward input to a cortical layer as a bit vector \({\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\), where \(n_{\textrm{ff}}\) is the width of the input.

The dendrite is composed of \(n_s\) synapses which each act as a binary gate for a single bit in the input vector.  Each synapse has a permanence \(p_i\in{[0,1]}\) which represents the size and efficiency of the dendritic spine and synaptic junction. The synapse will transmit a 1-bit (or on-bit) if the permanence exceeds a threshold \(\theta_i\) (often a global constant \(\theta_i = \theta = 0.2\)). When this is true, we say the synapse is connected.

Each neuron samples \(n_s\) bits from the \(n_{\textrm{ff}}\) feedforward inputs, and so there are \({n_{\textrm{ff}}}\choose{n_{s}}\) possible choices of input for a single neuron. A single proximal dendrite represents a projection \(\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\rightarrow\lbrace{0,1}\rbrace^{n_s}\), so a population of neurons corresponds to a set of subspaces of the sensory space. Each dendrite has an input vector \({\mathbf x}_j=\pi_j({\mathbf x}_{\textrm{ff}})\) which is the projection of the entire input into this neuron’s subspace.

A synapse is connected if its permanence \(p_i\) exceeds its threshold \(\theta_i\). If we subtract \({\mathbf p}-{\vec\theta}\), take the elementwise sign of the result, and map to \(\lbrace{0,1}\rbrace\), we derive the binary connection vector \({\mathbf c}_j\) for the dendrite. Thus:

$$c_i=(1 + sgn(p_i-\theta_i))/2$$

The dot product \(o_j({\mathbf x})={\mathbf c}_j\cdot{\mathbf x}_j\) now represents the feedforward overlap of the neuron with the input, ie the number of connected synapses which have an incoming activation potential. Later, we’ll see how this number is used in the neuron’s processing.

The elementwise product \({\mathbf o}_j={\mathbf c}_j\odot{\mathbf x}_j\) is the vector in the neuron’s subspace which represents the input vector \({\mathbf x}_{\textrm{ff}}\) as “seen” by this neuron. This is known as the overlap vector. The length \(o_j = \lVert{\mathbf o}_j\rVert_{\ell_1}\) of this vector corresponds to the extent to which the neuron recognises the input, and the direction (in the neuron’s subspace) is that vector which has on-bits shared by both the connection vector and the input.

If we project this vector back into the input space, the result \(\mathbf{\hat{x}}_j =\pi^{-1}({\mathbf o}_j)\) is this neuron’s approximation of the part of the input vector which this neuron matches. If we add a set of such vectors, we will form an increasingly close approximation to the original input vector as we choose more and more neurons to collectively represent it.

Sparse Distributed Representations (SDRs)

We now show how a layer of neurons transforms an input vector into a sparse representation. From the above description, every neuron is producing an estimate \(\mathbf{\hat{x}}_j \) of the input \({\mathbf x}_{\textrm{ff}}\), with length \(o_j\ll n_{\textrm{ff}}\) reflecting how well the neuron represents or recognises the input. We form a sparse representation of the input by choosing a set \(Y_{\textrm{SDR}}\) of the top \(n_{\textrm{SDR}}=sN\) neurons, where \(N\) is the number of neurons in the layer, and \(s\) is the chosen sparsity we wish to impose (typically \(s=0.02=2\%\)).

The algorithm for choosing the top \(n_{\textrm{SDR}}\) neurons may vary. In neocortex, this is achieved using a mechanism involving cascading inhibition: a cell firing quickly (because it depolarises quickly due to its input) activates nearby inhibitory cells, which shut down neighbouring excitatory cells, and also nearby inhibitory cells, which spread the inhibition outwards. This type of local inhibition can also be used in software simulations, but it is expensive and is only used where the design involves spatial topology (ie where the semantics of the data is to be reflected in the position of the neurons). A more efficient global inhibition algorithm – simply choosing the top \(n_{\textrm{SDR}}\) neurons by their depolarisation values – is often used in practise.

If we form a bit vector \({\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N\textrm{ where } y_j = 1 \Leftrightarrow j \in Y_{\textrm{SDR}}\), we have a function which maps an input \({\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\) to a sparse output \({\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N\), where the length of each output vector is \(\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N\).

The reverse mapping or estimate of the input vector by the set \(Y_{\textrm{SDR}}\) of neurons in the SDR is given by the sum:

$$\mathbf{\hat{x}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf o}_j)} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j\odot{\mathbf x}_j)}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j \odot \pi_j({\mathbf x}_{\textrm{ff}}))}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j) \odot {\mathbf x}_{\textrm{ff}}} $$

Matrix Form

The above can be represented straightforwardly in matrix form. The projection \(\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}} \rightarrow\lbrace{0,1}\rbrace^{n_s} \) can be represented as a matrix \(\Pi_j \in \lbrace{0,1}\rbrace^{{n_s} \times\ n_{\textrm{ff}}} \).

Alternatively, we can stay in the input space \(\mathbb{B}^{n_{\textrm{ff}}}\), and model \(\pi_j\) as a vector \(\vec\pi_j =\pi_j^{-1}(\mathbf 1_{n_s})\), ie where \(\pi_{j,i} = 1 \Leftrightarrow (\pi_j^{-1}(\mathbf 1_{n_s}))_i = 1\).

The elementwise product \(\vec{x_j} =\pi_j^{-1}(\mathbf x_{j}) = \vec{\pi_j} \odot {\mathbf x_{\textrm{ff}}}\) represents the neuron’s view of the input vector \(x_{\textrm{ff}}\).

We can similarly project the connection vector for the dendrite by elementwise multiplication: \(\vec{c_j} =\pi_j^{-1}(\mathbf c_{j}) \), and thus \(\vec{o_j}(\mathbf x_{\textrm{ff}}) = \vec{c_j} \odot \mathbf{x}_{\textrm{ff}}\) is the overlap vector projected back into \(\mathbb{B}^{n_{\textrm{ff}}}\), and the dot product \(o_j(\mathbf x_{\textrm{ff}}) = \vec{c_j} \cdot \mathbf{x}_{\textrm{ff}}\) gives the same overlap score for the neuron given \(\mathbf x_{\textrm{ff}}\) as input. Note that \(\vec{o_j}(\mathbf x_{\textrm{ff}}) =\mathbf{\hat{x}}_j \), the partial estimate of the input produced by neuron \(j\).

We can reconstruct the estimate of the input by an SDR of neurons \(Y_{\textrm{SDR}}\):

$$\mathbf{\hat{x}}_{\textrm{SDR}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\vec o}_j = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\vec c}_j\odot{\mathbf x_{\textrm{ff}}}} = {\mathbf C}_{\textrm{SDR}}{\mathbf x_{\textrm{ff}}}$$

where \({\mathbf C}_{\textrm{SDR}}\) is a matrix formed from the \({\vec c}_j\) for \(j \in Y_{\textrm{SDR}}\).

Optimisation Problem

We can now measure the distance between the input vector \(\mathbf x_{\textrm{ff}}\) and the reconstructed estimate \(\mathbf{\hat{x}}_{\textrm{SDR}}\) by taking a norm of the difference. Using this, we can frame learning in HTM as an optimisation problem. We wish to minimise the estimation error over all inputs to the layer. Given a set of (usually random) projection vectors \(\vec\pi_j\) for the N neurons, the parameters of the model are the permanence vectors \(\vec{p}_j\), which we adjust using a simple Hebbian update model.

The update model for the permanence of a synapse \(p_i\) on neuron \(j\) is:

$$ p_i^{(t+1)} =
(1+\delta_{inc})p_i^{(t)} & \text {if $j \in Y_{\textrm{SDR}}$, $(\mathbf x_j)_i=1$, and $p_i^{(t)} \ge \theta_i$} \\
(1-\delta_{dec})p_i^{(t)} & \text {if $j \in Y_{\textrm{SDR}}$, and ($(\mathbf x_j)_i=0$ or $p_i^{(t)} \lt \theta_i$)} \\
p_i^{(t)} & \text{otherwise} \\
\end{cases} $$

This update rule increases the permanence of active synapses, those that were connected to an active input when the cell became active, and decreases those which were either disconnected or received a zero when the cell fired. In addition to this rule, an external process gently boosts synapses on cells which either have a lower than target rate of activation, or a lower than target average overlap score.

I do not yet have the proof that this optimisation problem converges, or whether it can be represented as a convex optimisation problem. I am confident such a proof can be easily found. Perhaps a kind reader who is more familiar with a problem framed like this would be able to confirm this. I’ll update this post with more functions from HTM in coming weeks.

Note: Part II: Transition Memory is now available.

  • Nov 13 / 2014
  • 0
Cortical Learning Algorithm, NuPIC

Efficiency of Predicted Sparseness as a Motivating Model for Hierarchical Temporal Memory

Part 1 – Introduction and Description.

In any attempt to create a theoretical scientific framework, breakthroughs are often made when a single key “law” is found to underly what previously appeared to be a number of observed lesser laws. An example from Physics is the key principle of Relativity: that the speed of light is a constant in all inertial frames of reference, which quickly leads to all sorts of unintuitive phenomena like time dilation, length contraction, and so on. This discussion aims to do this for HTM by proposing that its key underlying principle is the efficiency of predicted sparseness at all levels. I’ll attempt to show how this single principle not only explains several key features of HTM identified so far, but also explains in detail how to model any required structural component of the neocortex.

The neocortex is a tremendously expensive organ in mammals, and particularly in humans, so it seems certain that the benefits it provides are proportionately valuable to the genes of an animal. We can use this relationship between cost and benefit, with sparseness and prediction as mediating metrics, to derive detailed design rules for the neocortex at every level, down to individual synapses and their protein machinery.

If you take one thing away from this talk, it should be that Sparse Distributed Representations are the key to Intelligence. Jeff Hawkins

Note: The next post in this series describes the Mathematics of Hierarchical Temporal Memory.

Sparse Distributed Representations are a key concept in HTM theory. In any functional piece of cortex, only a small fraction of a large population of neurons will be active at a given time; each active neuron encodes some component of the semantics of the representation; and small changes in the exact SDR correspond with small differences in the detailed object or concept being represented. Ahmad 2014 describes many important properties of SDRs.

SDRs are one efficient solution to the problem of representing something with sufficient accuracy at optimal cost in resources, and in the face of ambiguity and noise. My thesis is that in forming SDRs, neocortex is striving to optimise a lossy compression process by representing only those elements of the input which are structural and ignoring everything else.

Shannon proposed that any message has a concrete amount of information, measured in bits, which reflects the amount of surprise (i.e. something you couldn’t compute from the message so far, or by other means) contained in the message.

The most efficient message has zero length – it’s the message you don’t need to send. The next most efficient message contains only the information the receiver lacks to reconstruct everything the sender wishes her to know. Thus, by using memory and the right encoding to connect with it, a clever receiver (or memory system) can become very efficient indeed.

We will see that neocortex implements this idea literally, at all levels, as it attempts to represent, remember and predict events in the world as usefully as possible and at minimal cost.

The organising principle in cortical design is that components (from the whole organism down to a synapse) can do little about the amount of signal they receive, but they can – and do – adapt and learn to make best use of that signal to control what they do, only acting – sending a signal – when it’s the predicted optimal choice. This gives rise to sparseness in space and time everywhere, which directly reflects the degree of successful prediction present in any part of the system.

The success metric for a component in neocortex is the ratio of input data rate to output information rate, where the component has either a fixed minimum, or (for neurons and synapses) a fixed maximum, output level.

Deviations from the target indicate some failure to predict activity. This failure is either an opportunity to learn (and predict better next time), or, failing that, something which needs to be acted upon in some other way, by taking a different action or by passing new information up the hierarchy.

Note inputs in this context are any kind of signal coming in to the component under study. In the case of regions, layers and neurons, these include top-down feedback and lateral inputs as well as feedforward.


Neocortex is a hierarchy because it has finite space to store its model of the world, and a hierarchy is an optimal strategy when the world itself has hierarchical structure. Each region in the hierarchy is subjected (by design) to a necessarily overwhelming rate of input, it will run at capacity to absorb its data stream, reallocating its finite resources to contain an optimal model of the world it perceives.


The memory inside a region of cortex is driven towards an “ideal” state in which it always predicts its inputs and thus produces a “perfect”, minimal message – containing its learned SDR of its world’s current state – as output. Any failure to predict is indicated by a larger output, the deviation from “ideal” representing the exact surprise of the region to its current perception of the world.

A region has several output layers, each of which has a different (and usually more than one) purpose.

For each region, two layers send (different) signals up the hierarchy, therefore signalling both the current state of its world and the encoding of its unpredictability. The higher region now gets details of something it should hopefully have the capacity to handle – predict – or else it passes the problem up the chain.

Two layers send (again different) signals down to lower layers and (in the case of motor) to subcortical systems. The content of these outputs will relate to the content as well as the stability and confidence of the region’s model, and also actions which are appropriate in terms of that content and confidence level.


A cortical layer which has fully predicted its inputs has a maximally sparse output pattern. A fully failing prediction pattern in a layer causes it to output a maximally bursting and minimally sparse pattern, at least for a short time. At any failure level in between, the exact evolution of firing in the bursting neurons encodes the precise pattern of prediction failure of the layer, and this is the information passed to other layers in the region, to other regions in cortex, or to targets outside the cortex.

The output of a cortical layer is thus a minimal message – it “starts” with the best match of its prediction and reality, followed (in a short period of time) by encodings of reality in the context of increasingly weak prediction.


A layer’s output, in turn, is formed from the combination of its neurons, which are themselves arranged in columns. The columnar arrangement of cells in cortical columns is the key design leading to all the behaviour described previously.

Pyramidal cells, which represent both the SDR activity pattern and the “memory” in a layer, are all contained in columns. The sparse pattern of activity across a layer is dictated by how all the cells compete within this columnar array.

Columns are composed of pyramidal cells, which act independently, and a complex of inhibitory cells which act together to define how the column operates. All cells share a very similar feedforward receptive field, due to the fact that feedforward axons physically run up through the narrow column and abut the pyramidal bodies as they squeeze past.

Columnar Inhibition

The inhibitory cells have a broader and faster feedforward response compared with the pyramidal cells Reference so, in the absence of strong predictive inputs to any pyramidal cells, the entire assemblage of inhibitory neurons will be first to fire in a column. When this happens, these inhibitory cells excite those in adjacent columns, and a wave of inhibition spreads out from a successfully firing column.

The wave continues until it arrives at a column which has already been inhibited by a wave coming from elsewhere in the layer (from some recently active column). This gives rise to a pattern of inactivity around columns which are currently active.

Predictive Activation

Each cell in a column has its own set of feedforward and predictive inputs, so every cell has a different rate of depolarising as it is driven towards firing threshold.

Some cells may have received sufficient depolarising input from predictive lateral or top-down dendrites to reach firing threshold before the column’s sheath of inhibitory cells. In this case the pyramidal cell will fire first, trigger the column’s inhibitory sheath, and cause the wave of inhibition to spread out laterally in the layer.

Vertical Inhibition in Columns

When the inhibitory sheath fires, it also sends a wave of inhibitory signals vertically in the column. This wave will shut down any pyramidal cells which have not yet reached threshold, giving rise to a sparse activity pattern in the column.

The exact number of cells which get to fire before the sheath shuts them down depends mainly on how predictive each cell was and whether the sheath was triggered by a “winning cell” (previous section), by the sheath being first to fire, or as a result of neighbouring columns sending out signals.

If there is a wave of inhibition reaching a column, all cells are shut down and none (or no more) fire.

If there was a cell so predictive that it fired before the sheath, all other cells are very likely shut down and only one cell fires.

Finally, if the sheath was first to fire due to its feedforward input, the pyramidal cells are shut down quite quickly, but the most predictive may get the chance to fire just before being shut down.

This last process is called bursting, and gives rise to a short-lived pattern which encodes exactly how well the column as an ensemble has matched its predictions. Basically, the more cells which fire, the more “confused” the match between prediction and reality. This is because the inhibition happens quickly, so the gap between the first and last cell to burst must be small, reflecting similar levels of predictivity.

The bursting process may also be ended by an incoming wave of inhibition. The further away a competing column is, the longer that will take, allowing more cells to fire and extending the burst. Thus the amount of bursting also reflects the local area’s ability to respond to the inputs.


Neurons are machines which use patterns of input signals to produce a temporal pattern of output signal. The neuron wastes most resources if its potential rises but just fails to fire, so the processes of adaption of the neuron are driven to a) maximise the response to inputs within a particular set, and b) minimise the response to inputs outside that set.

The set of excitatory inputs to one neuron are of two main types – feedforward and predictive; the number of each type of input varies from 10’s to 10’s of thousand; and the inputs arrive stochastically in combinations which contain mixtures of true structure and noise, so the “partitioning problem” a neuron faces is intractable. It simply learns to do the best it can.

Note that neurons are the biggest components in HTM which actually do anything! In fact, the regions, layers and columns are just organisational constructs, ways of looking at the sets of interacting neurons.

The neuron is the level in the system at which genetic control is exercised. The neuron’s shape, size, position in the neocortex, receptor selections, and many more things are decided per-neuron.

Importantly, many neurons have a genetically expressed “firing program” which broadly sets a target for the firing pattern, frequency and dependency setup.

Again, this gives the neuron an optimal pattern of output, and its job is to arrange its adaptations and learn to match that output.


Distal dendrites have a similar but simpler and smaller scale problem of combining inputs and deciding whether to spike.

I don’t believe dendrites do much more than passively respond to global factors such as modulators and act as conduits for signals, both electrical and chemical, originating in synapses.


Synapses are now understood to be highly active processing components, capable of growing both in size and efficiency in a few seconds, actively managing their response to multiple inputs – presynaptic, modulatory and intracellular, and self-optimising to best correlate a stream of incoming signals with the activity of the entire neuron.

Part Two takes this idea further and details how a multilayer region uses the efficiency of predicted sparseness to learn a sensorimotor model and generate behaviour.

The next post in this series describes the Mathematics of Hierarchical Temporal Memory. This diversion is useful before proceeding with the main thread.

  • Sep 14 / 2014
  • 4
Cortical Learning Algorithm, NuPIC

A Unifying View of Deep Networks and Hierarchical Temporal Memory

There’s been a somewhat less than convivial history between two of the theories of neurally-inspired computation systems over the last few years. When a leading protagonist of one school is asked a question about the other, the answer often varies from a kind of empty semi-praise to downright dismissal and the occasional snide remark. The objections of one side to the others’ approach are usually valid, and mostly admitted, but the whole thing leaves one with a feeling that it is not a very scientific way to proceed or behave. This post describes an idea which might go some way to resolving this slightly unpleasant impasse and suggests that the discrepancies may simply be as a result of two groups using the same name for two quite different things.

In HTM, Jeff Hawkins’ plan is to identify the mechanisms which actually perform computation in real neocortex, abstracting them only far enough that the details of the brain’s bioengineering are simplified out, and hopefully leaving only the pure computational systems in a form which allows us to implement them in software and reason about them. On the other hand, Hinton and LeCun’s neural networks are each built “computation-first,” drawing some inspiration from and resembling the analogous (but in detail very different) computations in neocortex.

The results (ie the models produced), inevitably, are as different at all levels as their inventors’ approaches and goals. For example, one criterion for the Deep Network developer is that her model is susceptible to a set of mathematical tools and techniques, which allow other researchers to frame questions, examine and compare models, and so on, all in a similar mathematical framework. HTM, on the other hand, uses neuroscience as a standard test, and will not admit to a model any element which is known to be contradicted by observation of natural neocortex. The Deep Network people complain that the models of HTM cannot be analysed like theirs can (indeed it seems they cannot), while the HTM people complain that the neurons and network topologies in Deep Networks bear no relationship with any known brain structures, and are several simplifications too far.

Yann LeCun said recently on Reddit (with a great summary):

Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it’s the very idea of deep learning) and will use temporal prediction.

But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?).

I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice.

As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.

The topic of HTM and Jeff Hawkins was second out of all the major themes in the Q&A session, reflecting the fact that people in the field view this as an important issue, and (it seems to me) wish that the impressive progress made by Deep Learning researchers could be reconciled with the deeper explanatory power of HTM in describing how the neocortex works.

Of course, HTM people seldom refuse to play their own role in this spat, saying that a Deep Network sacrifices authenticity in favour of mathematical tractability and getting high scores on artificial “benchmarks”. We explain or excuse the fact that our models are several steps smaller in hierarchy and power, making the valid claim that there are shortcuts and simplifications we are not prepared to make,  and speculating that we will – like the tortoise – emerge alone at the finish with the prize of AGI in our hands.

The problem is, however, a little deeper and more important than an aesthetic argument (as it sometimes appears). This gap in acknowledging the valid accomplishments of the two models, coupled with a certain defensiveness, causes a “chilling effect” when an idea threatens to cross over into the other realm. This means that findings in one regime are very slow to be noticed or incorporated in the other. I’ve heard quite senior HTM people actually say things like “I don’t know anything about Deep Learning, just that it’s wrong” – and vice versa. This is really bad science.

From reading their comments, I’m pretty sure that no really senior Deep Learning proponent has any knowledge of the current HTM beyond what he’s read in the popular science press, and the reverse is nearly as true.

I consider a very good working knowledge of Deep Learning to be a critical part of any area of computational neuroscience or machine learning. Obviously I feel at least the same way about HTM, but recognise that the communication of our progress (or even the reporting of results) in HTM has not made it easy for “outsiders” to achieve the levels of understanding they feel they need to take part. There are historical reasons for much of this, but it’s never too late to start fixing a problem like this, and I see this post (and one of my roles) as a step in the right direction.

The Neuron as the Unit of Computation

In both models, we have identified the neuron as the atomic unit of computation, and the connections between neurons as the location of the memory or functional adjustment which gives the network its computational power. This sounds fine, and clearly the brain uses neurons and connections in some way like this, but this is exactly where the two schools mistakenly diverge.

Jeff Hawkins rejects the NN integrate-and-fire model and builds a neuron with vastly higher complexity. Geoff Hinton admits that, while impossible to reason about mathematically, HTM’s neuron is far more realistic if your goal is to mimic neocortex. Deep Learning, using neurons like Lego bricks, can build vast hierarchies and huge networks, find cats in Youtube videos, and win prizes in competitions. HTM, on the other hand, struggles for years to fit together its “super-neurons” and builds a tiny, single-layer model which can find features and anomalies in low-dimensional streaming data.

Looking at this, you’d swear these people were talking about entirely different things. They’ve just been using the same names for them. And, it’s just dawned on me, therein lies both the problem and its solution. The answer’s been there all the time:

Each and every neuron in HTM is actually a Deep Network.

In a HTM neuron, there are two types of dendrite. One is the proximal dendrite, which contains synapses receiving inputs from the feedforward (mainly sensory) pathway. The other is a set of coincidence-detecting, largely independent, distal dendrite segments, which receive lateral and top-down predictive inputs from the same layer or higher layers and regions in neocortex.

My thesis here is that a single neuron can be seen as composed of many elements which have direct analogues in various types of Deep Learning networks, and that there are enough of these, with a sufficient structural complexity, that it’s best to view the neuron as a network of simple, Deep Learning-sized nodes, connected in a particular way. I’ll describe this network in some detail now, and hopefully it’ll become clear how this approach removes much of the dichotomy between the models.

Firstly, a synapse in HTM is very much like a single-input NN node, where HTM’s permanence value is akin to the bias in a NN node, and the weight on the input connection is fixed at 1.0. If the input is active, and the permanence exceeds the threshold, the synapse produces a 1. In HTM we call such a synapse connected, in that the gate is open and the signal is passed through.

The dendrite or dendrite segment is like the next layer of nodes in NN, in that it combines its inputs and passes the result up. The proximal dendrite effectively acts as a semi-rectifier, summing inputs and generating a scalar depolarisation value to the cell body. The distal segments, on the other hand, act like thresholded coincidence detectors and produce a depolarising spike only if the sum of the inputs exceeds a threshold.

These depolarising inputs (feedforward and recurrent) are combined in the cell body to produce an activation potential. This only potentially generates the output of the entire neuron, because a higher-level inhibition system is used to identify those neurons with highest potential, allow those to fire (producing a binary 1), and suppress the others to zero (a winner-takes-all step with multiple local winners in the layer).

So, a HTM layer is a network of networks, a hierarchy in which neuron-networks communicate with connections between their sub-parts. At the HTM layer level, each neuron has two types of input and one output, and we wire them together at such, but each neuron is really hiding an internal, network-like structure of its own.




  • Aug 23 / 2014
  • 0
Cortical Learning Algorithm, NuPIC

Suggested Naming in HTM Theory and White Paper

“There are only two hard things in Computer Science: cache invalidation and naming things.” – Phil Karlton

In the case of HTM, we also have the much bigger problem of explaining how neocortex may work, and how a non-obvious CLA operates to use cortical principles. Extra confusion caused by poor naming multiplies the difficulties.

A key component of the art of naming consists in identifying the scope of each name. We need to have names which are just specific enough to capture the underlying concept, but not so specific that they entangle non-essential details. Names also need to be memorable and comfortable, while not being too easy to misconstrue, because they resemble or contain words which have other meanings.

I’d like to begin a reasoned discussion about key names in HTM and CLA. The goal of the discussion is to arrive at a set of names which everyone strongly believes captures the concepts for both theory and implementation.

As a famous Supreme Court judge once said of pornography, “we cannot define it but we know it when we see it.” We are looking for this kind of name, with the added advantage that HTM can actually precisely define the concept behind each name.

Until we arrive at a good name for something (ie one which magically gets everyone’s support), we should identify the key flaws in each candidate and agree that they invalidate that candidate. This is a healthy process which should not be regarded as a criticism of any proposer.

Please treat that as an open invitation to tell me how poor my proposed names are, but only for reasons you’d accept as rational if they were directed at yours!

I’m currently re-reading the 2011 White Paper with a view to updating and improving it. This document is a very rich source of information pertinent to this discussion, and in fact appears to answer a couple of the thorniest ones! I’d very strongly recommend re-reading it as preparation for taking part in this discussion.

I’d like to go through the main named concepts one by one, discuss the strengths and weaknesses of the current names, and propose a new name for each concept with some supporting motivations and argument. I don’t expect that my proposals will stick, but they should get us a noticeable step in the right direction, or at least throw light on the relevant issues.

Sparse Distributed Representation.

I start with this one because, in my experience of learning, reasoning about, writing about, talking about, and explaining HTM, the term SDR is as close to perfect as I can imagine. It has the property of monotonically improving understanding the more you find out about each of the three concepts named.

It is also an easily testable name. We all remember when Francisco showed us the CEPT Retina SDRs, in fact they were so SDRish, some of us thought they were too good to be true!

Spatial Pooling.

There are several problems with this term. We understand that “spatial” was chosen to indicate that each presentation of the data has some properties and structure in the sensory domain (such as a shape, size or colour), and it’s called “spatial” as opposed to “temporal”.

A difficulty arises for newcomers who read too much into this use of the word. There is a strong temptation to rely on our commonsense ideas of space when Jeff is really talking about mathematical, vector spaces and the abstract “spaces” of SDRs.

HTM does not require the kind of retinotopic mapping found in V1. The only reason we have literal spatial layouts in just a few primary areas of sensory cortex is because it is a simpler evolutionary and developmental design, not because it is needed for the algorithm. The RDSE, the Geospatial Encoder and the CEPT retina are all superb examples of how “pseudorandom” representations are better than more pictorially understandable spatial representation regimes.

Lastly, we’ve already tripped over this when we started talking about the new sensorimotor theory. L4 cells are now dealing with motor inputs as well as “spatial”, and L3 cells are now expected to “see” a set of L4 outputs whose members are substituted over time. So the word “spatial” really needs to go.

The word “Pooling” has, for many, either no meaning at all (most cases), or worse, the wrong meanings in this context. If you are trying to capture the notion of a noise-tolerant, largely stable representation of closely related sensory input, “pooling” isn’t going to do that for most people.

I’m not sure there is a good word for this, so my suggestion drops this aspect. As mentioned several times in the 2011 White Paper, the concept of pooling (noise-tolerance, high-overlap) is already embedded as a property of the product of SP – the SDR.

I propose the term Pattern Memory for what we currently call Spatial Pooling. This captures the fact that patterns in the data are recognised-learned and that the CLA is developing a memory of patterns it has seen. By not being too specific about which patterns we mean, it also allows us to say that the CLA learns to recognise and remember patterns of input data, stores patterns of synaptic connections, and forms patterns of activation (SDRs) to represent its inputs.

This name is also robust to adopting the new theory. L4 cells can learn sensorimotor patterns, and L3 cells can learn to recognise patterns of membership in a sequence-set.
We can run this in the top-down direction too, talking about patterns appearing in L1, motor patterns, patterns of depolarisation, and so on.

(old) Temporal Pooling.

The problems with using this term in its old context have been well-rehearsed, and it’s now used for the much more appropriate concept of representing a stable(r) sequence-identifying SDR in Layer 3 when sensorimotor transitions from that sequence are occurring in Layer 4. Temporal Pooling, in that sense, is another great name.

I had previously offered the term “Transition Prediction” for the component of CLA involving lateral connections and predictive states. Jeff and Numenta are currently using “Temporal Memory”. I believe both are flawed.

My suggestion accurately captured the limited, 1-timestep scope of this component, and also the fact that prediction is the key to temporal learning. However, it sounds like we need to add words to the name, to reflect “something missing” from the two word name.

Temporal Memory, on the other hand, is too high-ranking and valuable a name for this relatively basic component. It carries the risk that people will think HTM is just a hierarchy of TMs. Also, “temporal” is too general – the same word is currently used for single-timestep (old TP/TM) all the way up to entire sequences (new TP).

I propose Transition Memory for this second core component of CLA. This captures most literally what the algorithm is doing – learning single transitions. It is also the temporal equivalent of Pattern Memory, using distal dendrites to link to past SDRs just as PM uses proximal dendrites to link to feedforward patterns.

Importantly, the term Transition Memory is not trying to work too hard. We can explain that learned transitions are used to put cells into predictive states, and that these predictive patterns are used both in sensory (variable order) and sensorimotor (first order) temporal learning. They are used to match predicted and actual inputs, detect anomalies and create patterns which indicate continuing successful prediction or trigger a pattern of bursting columns. It seems impossible to me to have one name capture all these aspects, so I propose we stop trying and give the name a break!

In a variation on Pattern Memory (SP), depolarisation due to Transition Memory is combined with feedforward inputs to assist recognition and increase noise-tolerance. In Jeff’s new sensorimotor theory, combining distal with proximal inputs is likely to be key to the function.

Old and New Versions of HTM/CLA Theory.

In previous posts, I used “old and new” or “2013 and 2014″ to distinguish these two generations of the theory. In reworking the White Paper, I’ve recognised that these two theories are akin to the Newtonian versus Relativistic or Quantum views of mechanics. You need to quite deeply understand the simpler theory before you can begin to deal with the far more complex and realistic one. And for many purposes, the simpler theory is perfectly sufficient both for understanding how the neocortex works, and for useful application in software.

I thus propose that the older, simpler theory and model be called the “Sensory Cortical Learning Algorithm” or “Sensory CLA”, the newer being called the “Sensorimotor CLA”.

SCLA (or just CLA) and SMCLA are simple, distinguishable acronyms.

This also allows us to talk about HTM systems with SCLA single-layer regions (as NuPIC can/does), which just do feedforward, sensory hierarchy, or else fuller HTMs which incorporate behaviour, stable sequences, temporal pooling, and true bidirectional hierarchy using SMCLA in each region.

  • Aug 14 / 2014
  • 0
Cortical Learning Algorithm, NuPIC

Implications of the NuPIC Geospatial Encoder

Numenta’s Chetan Surpur recently demoed and explained the details of a new encoder for NuPIC which creates Sparse Distributed Representations (SDRs) from GPS data. Apart altogether from the direct applications which this development immediately suggests, I believe that Chetan’s invention has a number of much more profound implications for NuPIC and even HTM in general. This post will explore a few of the most important of these. Chetans’ demo and a tutorial by Matt Taylor are available on Youtube. First, here is Chetan presenting to, and discussing it with, Numenta people: And here’s Matt with another excellent hands-on tutorial:


I’ll begin by describing the encoder itself. The Geospatial Encoder takes as input a triple [Lat, Long, Speed] and returns a Sparse Distributed Representation (SDR) which uniquely identifies that position for the given speed. The speed is important because we want the “resolution” of the encoding to vary depending on how quickly the position is changing, and Chetan’s method does this very elegantly. The algorithm is quite simple. First, a 2D space (Lat, Long) is divided up (virtually) into squares of a given scale (a parameter provided for each encoder), so each square has an x and y integer co-ordinate (the Lat-Long pair is projected using a given projection scheme for convenient display on mapping software). This co-ordinate pair can then be used as a seed for a pseudorandom number generator (Python and numpy use the cross-platform Mersenne Twister MT19937), which is used to produce a real-valued order between 0 and 1, and a bit position chosen from the n bits in the encoding. These can be generated on demand for each square in the grid, always yielding the same results. To create the SDR for a given position and speed, the algorithm first converts the speed to a radius and forms a box of squares surrounding the position and calculates the pair [orderbit] for each square in the box. The top w squares (with the highest order) are chosen, and their bit values are used to choose the w active bits in the SDR.

Initial Interpretation

The first thing to say is that this encoder is an exemplar of transforming real-world data (location in the context of movement) into a very “SDR-like” SDR. It has the key properties we seek in an SDR encoder, in that semantically similar inputs will yield highly overlapping representations. It is robust to noise and measurement error in both space and time, and the representation is both unique (given a set scale parameter) and reproducible (given a choice of cross-platform random number generator), independently of the order of presentation of the data. The reason for this “SDR-style” character is that the entire space of squares forms an infinite field of “virtual neurons”, each of which has some activation value (its order) and position in the input bit vector (its bit). The algorithm first sparsifies this representation by restricting its sampling subspace to a box of squares around the position, and then enforces the exact sparseness by picking the w squares using a competitive analogue of local inhibition.

Random Spatial Neuron Field (Spatial Retina)

This idea can be generalised to produce a “spatial retina” in n-dimensional space which provides a (statistically) unique SDR fingerprint for every point in the space. The SDRs specialise (or zoom in) when you reduce the radius factor, and generalise (or zoom out) when radius is increased. This provides a distance metric between two points which involves the interplay of spatial zoom and the fuzziness of overlap. Any two points will have identical SDRs (w bits of overlap) if you increase the radius sufficiently, and entirely disparate SDRs (0 bits overlap) if you zoom in sufficiently (down to the order of w*scale). Since the Coordinate Encoder operates in a world of integer-indexed squares, we first need to transform each dimension using its own scale parameter (the Geospatial Encoder uses the same scale for each direction, but this is not necessary). We thus have a single, efficient, simple mechanism which allows HTM to navigate in any kind of spatial environment. This is, I believe a really significant invention which has implications well beyond HTM and NuPIC. As Jeff and others mentioned during Chetan’s talk, this may be the mechanism underlying some animals’ ability to navigate using the Earth’s magnetic field. It is possible to envisage a (finite, obviously) field of real neurons which each have a unique response to position in the magnetic field. Humans have a similar ability to navigate, using sensory input to provide an activation pattern which varies over space and identifies locations. We combine whichever modalities work best (blind people use sound and memories of movement to compensate for impaired vision), and as long as the pipeline produces SDRs of an appropriate character, we can now see how this just works.

Comparison with Random Distributed Scalar Encoder (RDSE)

The Geospatial Encoder uses the more general Coordinate Encoder, which takes a n-dimensional integer vector and a radius, and produces the corresponding SDR. It is easy to see how a 1D spatial encoder with a fixed speed would produce an SDR for arbitrary scalars, given an initial scale which would decide the maximum resolution of the encoder.  This encoder would be an improved replacement for the RDSE, with the following advantages:

  • When encoding a value, the RDSE needs to encode all the values between existing encodings and the new value (so that the overlap guarantees are honoured). A 1D-Geo encoder can compute each value independently, saving significantly in time and memory footprint.
  • In order to produce identical values for all inputs regardless of the order of presentation, the RDSE needs to “precompute” even more values in batches around a fixed “centre” (eg to compute f(23) starting at 0, we might have to compute [f(-30),…,f(30)]). Again, 1D-Geo scalar encoding computes each value uniquely and independently.
  • Assuming scale (which decides the max resolution) is fixed, the 1D-Geo scalar encoding can compute encodings of variable resolution with semantic degradation by varying speed. The SDR for a value is exactly unique for the same speed, but changes gradually as speed is increased or decreased. The RDSE has no such property.

This would strongly suggest that we can replace the RDSE with a 1D coordinate spatial encoder in NuPIC, and get all the above benefits without any compromise.

Combination with Spatially-varying Data

It is clear how you could combine this encoding scheme with data which varies by location, to create a richer idea of “order” in feeding the SDR generation algorithm. For example, you could combine random “order” with altitude or temperature data to choose the top w squares. Alternatively, the pure spatial bit signature of a location may be combined in parallel with the encoded values of scalar quantities found at the current location, so that a HTM system associatively learns the spatial structure of the given scalar field.

Spatially Addressed Memory

The Geospatial Encoder computes a symbolic SDR address for a spatial location, effectively a “name” or “word” for each place. The elements or alphabet of this encoding are simply random order activation values of nearby squares, so any more “real” semantic SDR-like activation pattern will do an even better job in computing spatial addresses. We use memories of spatial cues (literally, landmarks), emotional memories, maps, memories of moving within the space, textual directions, and so on to encode and reinforce these representations. This model explains why memory experts often use Memory Palaces (aka the Method of Loci) to remember long sequences of data items. They associate each item (or an imagined, memorable visual proxy) occupying a location in a very familiar spatial environment. It also explains the existence of “place neurons” in rodent hippocampi – these neurons are each participating in generating a spatial encoding similar in character to the Geospatial Encoder.

Zooming, Panning and Attention

This is a wonderful model for how we “zoom in” or “zoom out” and perceive a continuously but smoothly varying model of the world. It also models how we can perceive gracefully degrading levels of detail depending on how much time or attention we pay for a perception. In this case, the “encoder” detailed here would be a subcortical structure or a thalamus-gated (attention controlled) input or relay between regions. If we could find a mechanism in the brain which controls the size and position of a “window” of signals (akin to our variable box of squares), we would have a candidate for our ability to use attention to control spatial resolution and centre of focus. Such a mechanism may automatically arise from preferentially gating neurons at the edges of a “patch”, by virtue of the inhibition mechanism’s ability to smoothly alter the representation as inputs are added or removed. This mechanism would also explain boundary extension error, in which we “fill out” areas surrounding the physical boundaries of objects and images. As explained in detail in her talk at the Royal Institute, Eleanor Maguire believes that the hippocampus is crucial for both this phenomenon and our ability to navigate in real space. As one of the brain components at the “top” of the hierarchies, the hippocampus may be the place where we can perform the crucial “zooming and panning” operations and where we manipulate spatial SDRs as suggested by the current discovery.

Implementation Details

The coordinate encoder has a deterministic, O(1), order-independent algorithm for computing both “order” and bit choice. One important issue is that the pseudorandom number is Python-specific, and so a Java encoder (which uses a different pseudorandom number generator) will produce completely different answers. The solution is to use the Python (and numpy) RNG, which is the Mersenne Twister MT19937, also used by default in numerous other languages. I believe it would be worth exploring using Perlin noise to generate the order and bit choice values. This would give you a) identical encodings across platforms, b) pseudorandom, uncorrelated values when the noise samples are far enough apart (eg when the inputs are integers as in this case), and c) smoothly changing values if you use very small step sizes. Just one point about changing radius and its effect on the encoding. I’m very confident that the SDR is very robust to changes in radius, due to the sparsity of the SDRs. In other words, the overlap in an SDR at radius r with that at radius r’ (at the same GPS position) will be high, because you are only adding or removing an annulus around the same position (this will be similar to adding or removing a strip of squares when a small position change occurs).

Links to the Demo and Encoder Code

Chetan’s demo code (which is really comprehensive) is at The Geospatial Encoder code is at and the Coordinate Encoder is at

  • Jun 07 / 2014
  • 0
Real Life!

Ham Boiled and Roasted in Guinness, with Belfast Champ

This has to be one of my favourite dishes. It’s absolutely Irish, but has some twists which come from learning how to cook with Italians, Spaniards and Mexicans. It’s a bit of work (some in prep, mostly in the care you take checking up on your food as it cooks), but it results in a sublime thing which your dinner guests will never forget you introduced them to. Just remember to say you got the recipe from me (I got the inspiration for this from the Guinness site, which has a very decent recipe for the boiled version of this one).


Serves 6

For boiling:

  • 1.5Kg (3lbs) Prime Ham Fillet (get it from your butcher, ask for it on the bone if possible), skin on.
  • Some pork or bacon ribs or other pork bones (butcher will often give you these for free from their bin)
  • Half-dozen cloves
  • 3 Spanish (or any large) Onions, finely chopped
  • 1 Red Onion (optional)
  • 1 Red Bell Pepper, finely chopped
  • 3 medium sized carrots
  • 2 Spring Onions (Scallions)
  • 1 or 2 Bay (or Laurel) leaves
  • A sprig of fresh (or a 1tsp dried) Parsley
  • 1 can of Draught Guinness
  • 3 peeled tomatoed, chopped, or 1 can of chopped tomatoes
  • Grapeseed, Sunflower or Olive Oil
  • Salt and Pepper
  • A pinch of Spanish Pimenton (Paprika) optional

Equipment: 1 Large Saucepan or Pressure Cooker, 1 Sharp Chef’s Knife, Veg cutting board or food processor.

For Roasting:

  • 2-3 tsps of the best honey you can find (or Demerera or brown sugar)
  • More carrots, parsnips, root veg to your taste, roughly chopped.
  • (Optional, I don’t) another can of Guinness

Preheat Oven to 180C (400F).
Equipment: 1 Deep Baking Dish

To Serve with Belfast Champ:

  • 12 Medium “Floury” Potatoes ( a small bag)
  • A bunch of Spring Onions (Scallions)
  • 1/2 a bunch of Thyme (or 3 tsp dried)
  • 50g (2oz) Irish Butter
  • 125ml (1/4 pint) fresh whole milk
  • 1 organic free range egg

Equipment: Potato Ricer (makes the spuds delicious) or masher.

(Optional, not the Irish way!) a leaf salad made with lemon, olive oil and Rocket leaves.


In a large saucepan, sweat the onions, scallions, carrot and pepper for 5-10 mins over a gentle heat until the onion is translucent. Add a few pinches of salt and pepper. Remove from heat and place in a bowl.

Wash all the meat under cold running water before use. Score the ham skin and stick the cloves through the slits into the meat. Put a little oil in the saucepan and place the ham in the centre. Slip the pork bones between the ham and the sides of the saucepan. Add the can of Guinness, pouring over the meat. Add back the onion-carrot mix, and the tomatoes, parsley and pimenton. Slide a couple of bay leaves down the sides of the pan and cover. When it boils, lower the heat and gently simmer for as long as possible, but at least 90 minutes. Every 10-15 mins, return to the pan to ensure it’s gently simmering and taste the evolving flavours.

30 minutes before the ham is done, put on the oven at 190C/375F. Place the baking dish in the oven to heat up as well.

You’ve now cooked the ham perfectly well, so if you wish to give up now, work away and you could serve it as boiled ham and rule (prepare the Belfast Champ as below). But trust me, you really want to roast it very slowly to bring out the great flavours we’ve just infused into the meat. So, take the baking dish out of the oven and smear some oil over it before placing your ham in the centre. Pour the sauce from the saucepan over the meat and put the pork bones/ribs around the ham in the dish. Add any more carrots, parsnips, or other root veg as you like, scattered around the ham. At this point some people would pour another pint of Guinness over the meat, but I’d prefer to drink it! Now, take a couple of teaspoons of the best honey you can find (or brown sugar if you can’t) and smear the top of the ham with it. Sprinkle with freshly picked thyme and place in the oven at the high heat to crisp the fat for 15 minutes. Reduce the oven temperature to 150C/300F and allow to slow-roast for at least another 45 mins. Every 15-30 mins, take it out of the oven and baste the juices over the ham.

15 minutes before the ham is ready, have your potatoes peeled and chunk them into 2cm or 3/4in cubes. Boil gently in a large saucepan for 12-15 minutes (you can tell they’re done when you can stick a fork straight into them without resistance). Drain the spuds in a colander and place in a bowl. In the potato saucepan, heat up the milk and butter, add the spring onions and thyme, and gently bring to the boil. Reduce the heat, simmer for a couple of minutes to poach the onions, and add the butter and the raw egg. Whisk until smooth, remove from heat and add the potato using a potato ricer (or else gently mash the potatoes in the bowl using a potato masher or fork). Fold everything together until it has a glossy appearance. If you have too much potato (if it is too dry) add a little more milk at a time until it’s smooth. Keep warm but off the hob (it’ll burn).

When the ham is done. remove from the oven and allow to rest for 10-15 minutes (this distributes the juices in the ham throughout the meat and is essential for a good result). By all means baste the ham one last time using the juices from the dish. If you like thick gravy, you should remove the veg and bones from the dish and place it over a gentle heat on the hob and scrape it to release the burned-in flavours while condensing it. I personally prefer a lighter gravy so I don’t reduce the sauce.

Once rested, carve the ham and plate up a few slices of ham, lots of the champ, the roast veg, and salad if that’s your thing. Pour liberal quantities of gravy over both meat and spuds. Enjoy!


  • May 09 / 2014
  • 0
Cortical Learning Algorithm, General Interest

Proposed Mechanism for Layer 4 Sensorimotor Prediction

Jeff Hawkins has recently talked about a sensorimotor extension for his Cortical Learning Algorithm (CLA). This extension involves Layer 4 cells learning to predict near-future sensorimotor inputs based on the current sensory input and a copy of a related motor instruction. This article briefly describes an idea which can explain both the mechanism, and several useful properties, of this phenomenon. It is both a philosophical and a neuroscientific idea, which serves to explain our experience of cognition, and simultaneously explains an aspect of the functioning of the cortex.

In essence, Jeff’s new idea is based on the observation that Layer 4 cells in a region receive information about a part of the current sensory (afferent, feedforward) inputs to the region, along with a copy of related motor command activity. The idea is that Layer 4 combines these to form a prediction of the next set of sensory inputs, having previously learned the temporal coincidence of the sensory transition and the effect of executing the motor command.

One easily visualised example is that a face recognising region, currently perceiving a right eye, can learn to predict seeing a left eye when a saccade to the right is the motor command, and/or a nose when a saccade to the lower right is made, etc. Jeff proposes that this is used to form a stable representation of the face in Layer 3, which is receiving the output of these Layer 4 cells.

The current article claims that the “motor command” represents either a real motor command to be executed, which will cause the predicted change in sensory input, or else the analogous “change in the world” which would have the same transitional sensory effect. The latter would represent, in the above example, the person whose face is seen, moving her own head in the opposite direction, and presenting an eye or nose to the observer while the observer is passive.

In the case of speech recognition, the listener uses her memory of how to make the next sound to predict which sounds the speaker is likely to make next. At the same time, the speaker is using his memory of the sound he expects to make to perform fine control over his motor behaviour.

Another example is the experience of sitting on a stationary train when another train begins to move out of the station. The stationary observer often gets the feeling that she is in fact moving and that the other train is not (and a person in the other train may have the opposite perception – that he is stationary and the first person’s train is the one which is moving).

The colloquial term for this idea is the notion of a “mirror cell”. This article claims that so-called “mirror cells” are pervasive at all levels of cortex and serve to explain exactly why every region of cortex produces “motor commands” in the processing of what is usually considered pure sensory information.

In this way, the cortex is creating a truly integrated sensorimotor model, which not only contains and explains the temporal structure of the world, but also stores and provides the “means of construction” of that temporal structure in terms of how it can be generated (either by the action of the observer interacting with the world, or by the passive observation of the external action of some cause in the world).

This idea also provides an explanation for the learning power of the cortex. In learning to perceive the world, we need to provide – literally – a “motivation” for every observed event in the world, as either the result of our action or by the occurrence of a precisely mirrored action caused externally. At a higher cognitive level, this explains why the best way to learn anything is to “do it yourself” – whether it’s learning a language or proving a theorem. Only when we have constructed both an active and a passive sensorimotor model of something do we possess true understanding of it.

Finally, this idea explains why some notions are hard to “get” at times – this model requires a listener or learner not just to imagine the sensory perception or cognitive “snapshot” of an idea, but the events or actions which are involved in its construction or establishment in the world.