
Dec 17, 2014
Cortical Learning Algorithm

Multilayer Model for Hierarchical Temporal Memory

This post sketches a simple model for multilayer processing in Hierarchical Temporal Memory (HTM). It is based on a combination of Jeff Hawkins’ and Numenta’s current work on sensorimotor extensions to HTM, my previous ideas on efficiency of predicted sparseness as well as evidence from neuroscience.

HTM has entered a new phase of development in the past year. Hawkins and his colleagues are currently extending HTM from a single-layer sensory model (assumed to represent high-order memory in Layer 2/3 of cortex) to a sensorimotor model which involves Transition Memory of combined sensory and motor inputs in L4, which is Temporally Pooled in L2/3. Once this is successfully modelled, the plan is to examine the role of L5 and L6 in motor behaviour and feedback.

Recent research in neuroscience has significantly improved our understanding of the various pathways in cortical circuits. [Douglas & Martin, 2004] proposed a so-called canonical pathway in which thalamic inputs arrive in L4, which projects to L2/3 (which sends its output to higher regions), then to L5 (which outputs motor signals) and from there to L6 (which outputs feedback to lower layers and thalamus). Teams led by Randy Bruno [deKock et al, 2007], [Constantinople & Bruno, 2013] have found that there is also a parallel circuit thalamus-L5-[L6 and L4] as well as an L3-L4 feedback pathway.

Figure 1, which is from [deKock et al, 2007], shows the calculated temporal pattern of activity in a piece of rat barrel cortex (called D2) consisting of about 9000 neurons. Barrel cortex is so named because the neurons responsive to a single Primary Whisker (PW) form a barrel-like columnar structure in this part of rat cortex. The paper estimates the layer populations in this “column” to be 3200 L2/3, 2050 L4, 1100 L5A, 1050 L5B and 1200 L6 excitatory cells.


Figure 1. Evolution of Action Potential (AP) rates in rat barrel cortex when experimenters stimulate the associated whisker. VPM is the thalamic region which projects to this part of cortex. From [deKock et al, 2007].

We’ll examine this data from the point of view of HTM. Firstly, we see that the spontaneous activity in all layers is very sparse (0.3% in L2/3, 0.6% in L4, 1.1% in L5A, 3% in L5B and 0.5% in L6), and that activity rises and falls dramatically and differently in each layer over the 150ms following stimulation.

Looking at the first 10ms, and only at L4 and L2/3, we see the expected sparse activations in L4 and L3, followed by a dramatic increase (17× in L4, 10× in L2/3) representing bursting in both layers, likely because the input was unpredicted. Over the next 20ms, activity in L2/3 drops sharply back to 2× the baseline, but that in L4, after 10ms of dropping, rises again to practically match the original activation. This is matched in the next 10ms by a rise in L2/3 activation, after which both levels drop gradually towards the baseline over more than 100ms. We see another, somewhat different “wavelike” response pattern in the L5/6 complex.

So, can we build a model using HTM principles which explains this data (and, even better, predicts other unseen data)? I believe there must be such a model, because we see this kind of processing everywhere we look in cortex.

Before we get to that, let’s identify some important principles which arise from our current understanding of cortical function.

I: A Race to Represent

The first principle is that a population of neurons which share a common set of inputs is driven to “best represent” its inputs using a competitive inhibition process. Each neuron is accumulating depolarising input current from a unique set of contextual and immediate sources, and the first to fire will inhibit its neighbours and form part of the representation.

Each neuron can thus be seen as analogous to a “microtheory” of its world, and it will accumulate evidence from past context, current sensory inputs, and behaviour to compete in a race for its theory to be “most true”.
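To make this principle concrete, here is a minimal Python sketch of the race. The function and parameter names are mine, and a global top-k selection stands in for the local inhibition neighbourhoods a real implementation would use.

```python
import numpy as np

def race_to_represent(feedforward, context, weights_ff, weights_ctx, k=40):
    """Toy 'race' among a population of neurons sharing the same inputs.

    feedforward, context   : 0/1 vectors of currently active inputs.
    weights_ff, weights_ctx: (n_neurons, n_inputs) 0/1 connection matrices.
    k                      : size of the winning (sparse) representation.
    """
    # Each neuron accumulates evidence from immediate and contextual sources.
    evidence = (weights_ff.astype(float) @ feedforward.astype(float)
                + weights_ctx.astype(float) @ context.astype(float))
    # The first cells to reach threshold inhibit their neighbours; globally
    # this approximates "keep the k most depolarised cells".
    winners = np.argsort(evidence)[-k:]
    active = np.zeros(len(evidence), dtype=bool)
    active[winners] = True
    return active
```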

II: Different Sources of Evidence

The purpose of the layered structure of neocortex is to allow each population to combine its own individual evidence sources and learn to represent the “theory” of that evidence. The various populations (or sublayers) form a cyclic graph structure of evidence flow, and they cooperate to form a stable, predictable, sensorimotor model of the current world.

III: Efficiency of Predictive Sparseness

Each neuron combines contextual or predictive inputs (on distal synapses) with evidence from immediate sources (on proximal synapses). In addition, the columnar inhibitory sheath is also racing to recognise its inputs, which come largely from the same feedforward sources as its contained pyramidal cells. The sheath has an advantage as it is a better responder [cite] to the feedforward evidence alone than any of its contained cells, so there is also a race between predictive assisted recognition and simple spatial recognition of reality.

The outcome depends on which side wins the race – if a single pyramidal cell wins due to high predictive depolarisation (lots of contextual evidence), then it alone will fire. Otherwise, there is a short window of time in which some number of the most predictive cells in the column fire in turn, before they are inhibited by a vertical process. This “bursting” encodes the difference between reality (as signalled by this column’s inhibitory sheath firing) and the population’s prediction (as would have been signalled by a highly predictive cell in some losing nearby column).
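A hedged sketch of this column rule follows; the names and the cells-per-column figure are chosen for the example rather than taken from any particular implementation.

```python
import numpy as np

def activate_cells(active_columns, predictive_cells, cells_per_column=32):
    """Illustrative column-activation rule (not Numenta's exact code).

    active_columns  : boolean array of columns whose inhibitory sheath 'won'
                      the race on feedforward evidence alone.
    predictive_cells: boolean (n_columns, cells_per_column) array of cells
                      depolarised by distal (contextual) input.
    """
    n_columns = len(active_columns)
    active = np.zeros((n_columns, cells_per_column), dtype=bool)
    for col in np.flatnonzero(active_columns):
        predicted = predictive_cells[col]
        if predicted.any():
            # Prediction confirmed: only the depolarised cells fire (sparse).
            active[col] = predicted
        else:
            # Prediction failed: the whole column bursts, encoding the gap
            # between reality and the population's prediction.
            active[col] = True
    return active
```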

IV: Self-stabilisation through Sparse Patterns

If we consider a cortical region in its “steady state”, we see highly sparse (non-bursting) representations everywhere, and the behavioural output (from Layer 5) will be a sequence of highly sparse patterns which result in very fine motor adjustments (or none at all). This corresponds to the region perfectly modelling the sensorimotor world it experiences and making optimal predictions with minimal corrective behaviour.

A deviation from this state (failure of prediction) leads to a partial change in representation (because reality differs from prediction) and some amount of redundant predictive representation (when several cells burst in new columns). This departure from maximal sparseness is transmitted to the downstream sublayers, causing their “view of the world” and thus their own state to change. Depending on how well each sublayer can predict these changes, the cascade may halt, or instead continue to roll around the cyclical graph of sublayers, causing behavioural side-effects as it goes.
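As a small illustration, departure from the steady state can be read directly off the population sparseness; the baseline below is the spontaneous L2/3 rate from the de Kock data above, and the tolerance factor is an arbitrary choice of mine.

```python
def sparseness(active_cells):
    """Fraction of cells active in a population (numpy boolean array)."""
    return float(active_cells.sum()) / active_cells.size

def prediction_failed(active_cells, baseline=0.003, tolerance=3.0):
    """Flag a departure from the steady state (i.e. some columns burst)."""
    return sparseness(active_cells) > baseline * tolerance
```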

V: A Team of Rivals – “Explaining Change” by Witnessing or Acting

Within each sublayer, some cells will have inputs which correspond to “observing” the world as it evolves on its own (by predicting from context), while others will respond better when the organism is taking certain actions, and will have learned to associate certain changes with those behaviours. The representation in each sublayer will be some mixture of these, and, in the case of motor output cells in L5, the “decisions” of the region will be those which restore the predictability of things.

The reason is simple. While the activity in the region is sparse, all the active cells are predicting their own activity, and the outputs of the region reflect this settled condition. These include the motor output, which by definition is acting to prolong the current status of the region (if it were acting to depart from that status, these motor cells would not still be firing).

When something changes, new neurons become temporarily active throughout the various sublayers, but they will all be cells which have learned to respond better to the new state of the world than the previously active cells. These cells will have learned to associate their own activity with the new situation, by predicting their own activity more accurately in this new context. And this, in turn, will be true only if they are the long-term winners in the establishment of a new, stable cycle of sparse activity, or alternatively if they have regularly participated in the transition to a new stable state. Either way, the system is self-stabilising, acting to right itself and improve its predictions.

A Multilayer Cortical Model

I claim that the above principles are enough to construct a simple model of how the sublayers in a region of cortex interact and co-operate.

I use the word “sublayers” because each layer (L1-6) may contain more than one population or class of neurons. We’ll pretend these are each in their own sublayer, while recognising that there are local connections between cells in sublayers which are important to how things work.

To avoid confusion, I won’t use the common notation for sublayers found in the literature (e.g. L5A); instead I’ll use labels such as L5.1, L5.2 and so on. The “minor number” will usually indicate sublayers successively further away from the sensorimotor inputs, both in terms of time and the number of neurons in the path to reach them. I’ll also use the deKock diagram above to anchor the place and time of each part of the response to a large sensory stimulus.

I’ll also assume the idea that when a neuron projects an axon, it does so in order to connect proximally with its target. Thus, L4 projections to L2/3 are proximal on L2/3 cells, likewise with L6 to L4, while the L2/3->L4 feedback pathway uses distal dendrites.

Layer 4.1 – Sensorimotor Transition Prediction (0-20ms)

Layer 4 is said [cite] to receive inputs from L6 (65%), elsewhere in L4 (25%), and directly from thalamus (5%). In addition, some cells in L4 have distal dendrites in L2/3. We’ll split L4 into two sublayers, depending on whether they receive inputs from L2/3 (L4.1 no, L4.2 yes). Some researchers [cite] divide L4 into two populations – stellate cells and pyramidal cells – and it may be that the split is along these lines.

My hypothesis is that L4.1 cells are making predictions of sensorimotor transitions, using thalamic sensorimotor input as (primarily) feedforward evidence, along with a combination of local predictive context (L4) and information about the region’s current sensorimotor output (from L6). I say “primarily” because a single feedforward axon could synapse with a cell both on its proximal and distal dendrites, and this would be even more important for the stellate dendritic branches of L4.1 cells.

Note that the L4 inputs to L4.1 include evidence of the output of L2/3 (a more stable “sensory” representation) via L4.2. The L6-sourced inputs also include evidence of the stable feedback pattern being sent to lower regions, which is itself indirectly influenced by L5’s use of L2/3 (see later).

So, L4.1 is receiving fast-changing sensorimotor inputs, along with slower-changing context from within L4, and both sensory and motor outputs of the region. It uses whatever best evidence it has to predict any transitions in the thalamic input.

Successful prediction in L4.1 results in it outputting a highly sparse pattern on each transition. Failures in prediction are encoded as a union of “nearly predicted” cell activations in the columns best recognising the unpredicted thalamic input.

This might not seem sensible when thalamic inputs are only 5% of what L4.1 is receiving, but remember that the other inputs are usually highly sparse (1-2%) and change much more slowly, so thalamic feedforward input to L4.1 acts as a tiebreaker among predictions. This pattern is repeated throughout cortex because bursting cells cause a similar disruptive, temporary tiebreaking signal in downstream sublayers.
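Here is one way to picture the tiebreaking, as an illustrative sketch rather than a claim about the real circuit: the slow contextual evidence narrows the field to a set of roughly tied candidates, and the small, fast thalamic term decides among them.

```python
import numpy as np

def l4_1_select(contextual_evidence, thalamic_evidence, k=20, margin=1.0):
    """Hypothetical L4.1 selection with thalamic input as tiebreaker.

    contextual_evidence: per-cell support from the slow, sparse sources
                         (intra-L4 context and the L6 copy of motor/feedback).
    thalamic_evidence  : per-cell support from the fast feedforward input.
    """
    # Cells within `margin` of the best contextual score are treated as tied.
    tied = contextual_evidence >= contextual_evidence.max() - margin
    # Only the feedforward term decides among the tied candidates.
    scores = np.where(tied, thalamic_evidence, -np.inf)
    order = np.argsort(scores)[::-1]
    winners = [i for i in order if np.isfinite(scores[i])][:k]
    active = np.zeros(len(scores), dtype=bool)
    active[winners] = True
    return active
```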

Layers 3.1 and 2.1 – Temporal Pooling (10-20ms)

Layers 2 and 3 are usually treated as one. Both receive most of their feedforward input from L4 and have distal inputs both from within L2/3 and from L1 (which gets feedback input from L6 in higher regions).

I’ll split the two by saying that L2 gets more input from L1 than L3 does. In other words, L2 is more primed or biased by higher-level context, while L3 is less likely to be dominated by feedback. There is evidence [cite] of this differentiation, so let’s assume it’s useful.

Now, L2.1/L3.1 are receiving feedforward inputs from L4.1. If those inputs are sparse, then only those cells in L2/3 which have many active inputs will be part of the SDR in this layer (it’s one layer in a column sense, just the L2 “end” has a higher L1 input mix). In addition, they’ll need good intralayer and/or top-down predictive input to maintain stable activity.

The stability in L2.1/3.1 comes from the combination of stable predictive inputs from within the layer and from above. This prebiases predictive cells to recognise the successive sparse inputs from L4.1 and continue to remain active. The active cells in L2/3 have learned to use a combination of sequence memory (intralayer) and top-down feedback to associate with each fast-changing SDR in L4.1. This mechanism is reinforced by the fast L4.1-L2/3.1-L4.2-L4.1 feedback loop, along with the much longer feedback loops.
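A toy temporal-pooling step along these lines (thresholds and names are mine): a cell joins the L2/3 representation when strongly driven by L4.1, and an already-active cell that correctly predicted its own activity stays in on weaker input, which is what keeps L2/3 stable while L4.1 changes quickly underneath.

```python
import numpy as np

def temporal_pool(prev_pooled, ff_overlap, predicted,
                  fire_thresh=10, keep_thresh=5):
    """Toy temporal-pooling update for an L2/3-like population.

    prev_pooled: boolean array of cells active at the previous step.
    ff_overlap : per-cell overlap with the current (sparse) L4.1 output.
    predicted  : boolean array of cells depolarised by intra-layer and
                 top-down context before the L4.1 input arrived.
    """
    join = ff_overlap >= fire_thresh                      # strongly driven newcomers
    stay = prev_pooled & predicted & (ff_overlap >= keep_thresh)
    return join | stay
```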

This is where the L2/3 difference is important. The more superficial cells in L2/3 are more strongly biased by top-down feedback from L1. We have evidence [cite] that L2 projects more strongly to the deep part of L5, while L3 projects more to superficial L5. Thus, the choices of active cells in L2/3 encode how much sequence memory and how much top-down are involved in the representation.

L6.1 – Comparing Reality with Expectations from Behaviour (0-10ms)

[Constantinople and Bruno], among others, show that direct thalamic inputs arrive simultaneously at L4 and L5/L6, suggesting that L5/6 and L4/L2/3 are performing parallel operations on sensorimotor inputs. While the L4-L2/3 system is relatively simple (at least at first order approximation), the L5/6 system is much more complex, involving a larger number of functional populations with diverse purposes. I’ll describe a minimum of these for now.

Layer 6.1 cells are the first in L5/6 to respond to thalamic inputs, suggesting a role analogous to L4.1. Unlike L4 cells, however, these cells have immediate access to both the recent L6 output to lower regions (representing the current steady state of the region) and the current motor output of the region (from L5). This much richer set of evidence sources allows L6.1 to make finer-grained predictions of the expected thalamic inputs, and its response when prediction fails is the primary driver for changes in L5 motor output and signals to higher regions.

L5.1 – Responding to Change by Acting (0-20ms)

I speculate that the thick-tufted L5B cells correspond to L5.1 in my model. These cells also receive direct thalamic inputs, as well as inputs from L6, L2/3 (primarily the L2 “end”) and top-down feedback via L1. L5.1’s purpose is to act quickly if necessary, in response to a significant change in its world. Any dramatic change in either sensorimotor patterns or context will cause L5 to output a large, non-sparse signal which it has learned is appropriate to that change.

In the steady state, with all inputs sparse, L5.1 generates a minimal, sparse signal which corresponds to energetically efficient, smooth behaviour in the organism. Sudden (unpredictable) changes in either sensorimotor inputs (thalamic), correspondence between behaviour and outcomes (L6), sequence memory predictions (L2/3) or top-down “instructions” (L1) will cause a dramatic rise in output (from 3% to over 10% active cells) which results in new corrective motor behaviour as well as an alarm signal to higher layers.
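To illustrate the idea (the rates are the figures quoted above; the linear interpolation and the notion of a scalar “surprise” are simplifications of mine):

```python
import numpy as np

def l5_1_output(evidence, surprise, n_cells, base_rate=0.03, max_rate=0.12):
    """Hypothetical L5.1 response whose size grows with surprise.

    evidence : per-cell support for joining the motor representation.
    surprise : 0.0 (everything predicted) to 1.0 (nothing predicted), e.g.
               the fraction of bursting columns among this sublayer's inputs.
    """
    rate = base_rate + (max_rate - base_rate) * float(surprise)
    k = max(1, int(round(rate * n_cells)))
    winners = np.argsort(evidence)[-k:]
    active = np.zeros(n_cells, dtype=bool)
    active[winners] = True
    return active
```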

L6.2 – Co-ordination of Responses (10-30ms)

In Layer 6, a second population of cells is responsible for integrating any rising activity in L5.1 with context, signalling L4 of the new situation, and affecting the L6 feedback output. The better L6.2 can predict/recognise the output of L5, the sparser its signal to L4 and the smaller the effect on L6 feedback output. Thus, L6.2 acts either to help L4 make good predictions of transitions (by sending sparse signals), or to disrupt steady-state prediction in L4 (and later L2/3) into a new sensorimotor regime.

L4.2 and L2/3: Stabilising Prediction (30-50ms)

After 30ms or so, pyramidal cells in L4 are sampling the “sensory” response of L2/3 along with signals from L6 about the motor response. L4.2 can now generate a signal for L2/3 which is more sparse than the initial L4.1 response, but still well above baseline. Over the next 20-50ms, L4.2 and L2/3 use this feedback loop (along with the L5/6 motor loop) to reduce their activity and settle into a steady predictive state.

I propose that it is these L4.2 cells which participate in the steady-state activity of L4, along with the L5.2 cells (next section). L4.1 and L5.1 are representative of large transitions between steady, predictive sparse states.

L5.2 and L6 – Stabilising Behaviour (40-50ms)

L5.2 corresponds to thick-tufted cells in L5A (in deKock’s diagram). This sublayer combines the context inputs (from L6, L1 and L5) with the lagging, stabilising output from L2/3 (which is being stabilised by the L4.2 feedback loop) and produces a second motor response (and a second signal to higher layers). With more information about how L2/3 responded to the initial signal, L5.2 can learn to produce a more nuanced behaviour than the “knee-jerk” response of L5.1, or perhaps counteract it to restore stability.

L6 is again used to provide feedback of behaviour to L4 and aid its prediction.

Multilayer CLA

Figure 2: Schematic showing main connections in the multilayer model. Each “neuron” represents a large number of neurons in each sublayer.

Multilayer Flow Diagram

Figure 3: Schematic showing main axonal (arrows) and dendritic (tufts) links in the multilayer model.

Summary

We can see how this model allows a region of cortex to start from a highly sparse, quiescent steady state, absorb a large sensory stimulus, and respond, initially with dramatic changes in activity and then with decreasing waves of disturbance and motor response, until it settles into a new, self-sustaining steady state.

The fast-responding L4.1 and L5.1 cells react first to a drastic change, causing representations in L2/3 and L6 to update, and then the second population, using L4.2 to stabilise perception and L5.2 to stabilise behaviour, takes over and settles into a new steady state.

Examples

Apart from the rat barrel cortex example used here, we can see how this model can be applied in other well-studied cortical systems.

Microsaccades Stabilise Vision in V1

In V1, the primary thalamic input is from retinal ganglion cells which detect on-centre or off-centre patterns in the retinal image. L4 cells are understood [cite] mostly to contain so-called “simple cells” which respond to short oriented “bars” formed by a small number of neighbouring ganglion cells. L2/3, by the same token, contains many more “complex” cells which respond to overlapping or moving bars corresponding to longer edges or a sequence of edge movements. L4 also contains a smaller number of cells with these response properties.

I propose that the simple cells are L4.1, while the L2/3 complex cells are temporally pooling over these cells, and the second population of L4 complex cells is actually L4.2, responding to the activity in L2/3. L5 in steady state is causing the eye to microsaccade in order to stabilise the “image” formed in L2/3 of the edges in the scene, as tiny movements of organism and objects cause the exact patterns in L4.1 to change predictably.

Deviations beyond the microsaccade scale will cause bursting in L4.1, and the SDR shown by L2/3 will change to a new one representing the new sensory input. If L2/3 can use L1 and its own predictive input to correctly expect this new state, it will remain sparse and cause minimal reaction in L5 (in the second phase). If not, L2/3 will burst, L5 will generate a large signal, and thus V1 will pass the buck up to a region which can deal with changes of scene.

This process will be repeated at higher levels, at higher temporal and spatial scales.

Speech Generation

In speech generation, the sensory input is from the ears, and the motor output is to the vocal system. The region responsible for generating speech is controlled (via L1) by higher regions expressing a high-level representation of sounds to be produced. Layer 2/3 uses this input to bias itself to represent all sequences of sounds which match the L1 signal. Layer 5 receives both these signals and is thus highly predictive of representing the motor actions for these sequences. Since all the sublayers are at non-zero sparseness, activity will propagate and be amplified at each stage by the predictive states until a “most probable” starting sound is generated. The region will continue to generate the correct motor activity, using prediction to correct for differences between the expected and perceived sounds.

Citations (to be completed)

Constantinople, Christine M. and Bruno, Randy M.: Deep Cortical Layers Are Activated Directly by Thalamus. Science, 28 June 2013, Vol. 340, No. 6140, pp. 1591-1594. doi:10.1126/science.1236425

Douglas, Rodney J. and Martin, Kevan A.C.: Neuronal Circuits of the Neocortex. Annu. Rev. Neurosci. 2004, 27:419-451. doi:10.1146/annurev.neuro.27.070203.144152

Dec 08, 2014
Cortical Learning Algorithm

Response to Yann LeCun’s Questions on the Brain

Yann LeCun recently posed some questions on Facebook about the brain. I’d like to address these really great questions in the context of Hierarchical Temporal Memory (HTM). I’ll intersperse the questions and answers in order.

A list of challenges related to how neuroscience can help computer science:

– The brain appears to be a kind of prediction engine. How do we translate the principle of prediction into a practical learning paradigm?

HTM is based on seeing the brain as a prediction system. The Cortical Learning Algorithm uses intra-layer connections to distal dendrites to learn transitions between feedforward sensory inputs. Individual neurons use inputs from neighbouring, recently active neurons to learn to predict their own activity in context. The layer as a whole chooses as sparse a set of best predictor-recognisers to represent the current situation.
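A simplified sketch of that prediction step (my own illustration, not NuPIC’s API): a cell becomes predictive for the next timestep when any of its distal segments sees enough of the currently active cells.

```python
import numpy as np

def compute_predictive_cells(active_cells, distal_segments,
                             activation_threshold=13):
    """Simplified CLA-style prediction step.

    active_cells   : boolean array of cells active at this timestep.
    distal_segments: dict mapping a cell index to a list of segments, each
                     segment being an array of presynaptic cell indices
                     (synapses above the connected-permanence threshold).
    """
    predictive = np.zeros(len(active_cells), dtype=bool)
    for cell, segments in distal_segments.items():
        for segment in segments:
            if active_cells[segment].sum() >= activation_threshold:
                predictive[cell] = True   # enough context: cell predicts itself
                break
    return predictive
```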

– Good ML paradigms are built around the minimization of an objective function. Does the brain minimize an objective function? What is this function?

The answer is different at each level of the system, but the common theme is efficiency of activity. Synapses/dendritic spines form, grow and shrink in response to incoming signals, in order to maximise the correlation between an incoming signal and the neuron’s activity. Neurons adjust their internal thresholds and other parameters in order to maximise their probability of firing given a combined feedforward/context input pattern. Columns (represented using a simplified sheath of inhibitory neurons) again adjust their synapses in order to maximise their contained cells’ probability of becoming active given the inputs. The objective metric of a layer of neurons is the sparsity of representation, with errors in prediction-recognition being measured as lower sparsity (bursting in columns). A region of cortex produces motor output which minimises deviations from stable predicted representations of the combined sensory, motor, contextual and top-down inputs.
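At the synapse level, this local “objective” looks roughly like the Hebbian permanence rule sketched below (the constants are illustrative, not Numenta’s): connections that correlate with the cell’s firing grow, the rest shrink, and no global gradient is computed anywhere.

```python
import numpy as np

def update_permanences(permanences, presynaptic_active, postsynaptic_fired,
                       inc=0.05, dec=0.02, connected=0.5):
    """Illustrative local permanence update.

    permanences       : (n_post, n_pre) float array of synapse permanences.
    presynaptic_active: boolean array of currently active input cells.
    postsynaptic_fired: boolean array of cells that just fired.
    """
    post = postsynaptic_fired[:, None]
    pre = presynaptic_active[None, :]
    # Strengthen synapses from active inputs onto firing cells,
    # weaken the firing cells' synapses from inactive inputs.
    permanences += np.where(post & pre, inc, 0.0)
    permanences -= np.where(post & ~pre, dec, 0.0)
    np.clip(permanences, 0.0, 1.0, out=permanences)
    # Only permanences above `connected` count as functional connections.
    return permanences, permanences >= connected
```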

– Good ML systems estimate the gradient of their objective function in order to minimize it. Assuming the brain minimizes an objective function, does it estimate its gradient? How does it do it?

Each component in HTM uses only local information to adapt and learn. The optimisation emerges from each component’s responses as it learns, and from competition between columns and neurons to represent the inputs.

– Assuming that the brain computes some sort of gradient, how does it use it to optimize the objective?

There is no evidence of a mechanism in the brain which operates in this way. HTM does without such a mechanism.

– What are the principles behind unsupervised learning? Much of learning in the brain is unsupervised (or predictive). We have lots of unsupervised/predictive learning paradigms, but none of them seems as efficient as what the brain uses. How do we find one that is as efficient and general as biological learning?

CLA is a highly efficient and completely general unsupervised learning mechanism, which automatically learns the combined spatial and temporal structure of the inputs.

– Short term memory: the cortex seems to have a very short term memory with a span of about 20 seconds. Remembering things for more than 20 seconds seems to require the hippocampus. And learning new skills seems to take place in the cortex with help from the hippocampus. How do we build learning machines with short-term memory? There have been proposals to augment recurrent neural nets with a separate associative short-term memory module (e.g. LSTM, Facebook’s “Memory Networks”, Deep Mind’s “Neural Turing Machine”). This is a model by which the “processor” (e.g. a recurrent net) is separate from the “RAM” (e.g. a hippocampus-like associative memory). Could we get inspiration from neuroscience about how to do this?

Hierarchy in HTM provides short-term memory, with higher-level regions seeking to form a stable representation of the current situation in terms of sequence-sets of lower-level representations of the state of the world. Each region uses prediction-assisted recognition to represent its input, predict future inputs, and execute behaviours which maintain the predicted future.

– Resource allocation in short-term memory: if we have a separate module for short-term memory, how are resources allocated within it? When we enter a room, our position in the room, the geometry of the room, and the landmarks and obstacles in it are stored in our hippocampus. Presumably, the neural circuits used for this are recycled and reused for future tasks. How?

There’s no evidence of a separate short-term memory module in the brain. The entire neocortex is the memory, with the ephemeral activity in each region representing the current content. Active hierarchical communication between regions leads to the evolution of perception, decisions and behaviour. At the “top” of the hierarchy, the hippocampus is used to store and recycle longer-term memories.

– How does the brain perform planning, language production, motor control sequences, and long chains of reasoning? Planning complex tasks (which includes communicating with people, writing programs, and solving math problems) seems like an important part of an AI system.

Because of the multiple feedforward and feedback pathways in neocortex, the entire system is constantly acting as a cyclic graph of information flow. In each region, memories of sequences are used in recognition, prediction, visualisation, execution of behaviour, imagination and so on. Depending on the task, the representations can be sensory, sensorimotor, pseudosensory (diagrammatic) or linguistic.

– resource allocation in the cortex: how does the brain “recruit” pieces of cortex when it learns a new task. In monkeys that have lost a finger, the corresponding sensory area gets recruited by other fingers when the monkey is trained to perform a task that involves touch.

There is always a horizontal “leakage” level of connections in any area of neocortex. When an area is deprived of input, neurons at the boundary respond to activity in nearby regions by increasing their response to that activity. This is enhanced by the “housekeeping” glial cells embedded in cortex, which actively bring axons and dendrites together to knit new connections.

– The brain uses spikes. Do spikes play a fundamental role in AI and learning, or are they just made necessary by biological hardware?

Spikes are very important in the real brain, but they are not directly needed for the core processing of information, so HTM doesn’t model them per se. We do use an analogue to Spike Timing Dependent Plasticity in the core Hebbian learning of predictive connections, but this is simplified to a timestep-based model rather than individual spikes.
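A sketch of what that timestep-based analogue can look like (again my own illustration, with made-up constants): distal synapses from cells active at t-1 onto cells that were predicted and then became active at t are strengthened, mirroring the pre-before-post flavour of STDP at the resolution of whole timesteps.

```python
import numpy as np

def reinforce_transition(distal_perm, prev_active, now_active_predicted,
                         inc=0.1, dec=0.05):
    """Timestep-based stand-in for STDP on distal (cell-to-cell) synapses.

    distal_perm         : (n_cells, n_cells) permanence matrix.
    prev_active         : boolean array, cells active at timestep t-1 ('pre').
    now_active_predicted: boolean array, cells predicted and active at t ('post').
    """
    post = now_active_predicted[:, None]
    pre = prev_active[None, :]
    distal_perm += np.where(post & pre, inc, 0.0)    # pre-before-post: reward
    distal_perm -= np.where(post & ~pre, dec, 0.0)   # uncorrelated: decay
    np.clip(distal_perm, 0.0, 1.0, out=distal_perm)
    return distal_perm
```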

– We have elements of answers and avenues for research for many of these points, but no definite/perfect solutions.

HTM’s solutions are also neither perfect nor definitive, but they are our best attempt to address your questions in a simple, coherent and effective system, which directly depends on data from neuroscience.

Thanks to Yann for asking such pertinent questions about how the brain might work. It’s a recognition that the brain has a lot to teach us about intelligence and learning.