
Predictive Coding Cortical Learning Algorithms (PC-CLA)

Gideon Kowadlo and Dave Rawlinson posted on their excellent blog about this poster by Ryan McCall and Stan Franklin, detailing “Predictive Coding Cortical Learning Algorithms (PC-CLA)”. It’s a really interesting idea, and it is well described, but the paper seems to have some problems in its use of HTM/CLA and in its approach to cortical modelling.

Please note that I’m restricting my comments and analysis to the very specific use of HTM/CLA in this paper. I’m not qualified to comment on the other ideas in the paper, and in fact the basic scheme is very much along the lines of my own thinking on how HTM can fit into a larger picture of modelling the cognitive networks of the brain. The problem is that HTM (and CLA in particular) is a detailed model only at the level of layers and regions of cortex, and that is exactly the level addressed by this new combination of Predictive Coding and CLA.

I’ll start by summarising the basic concept: each region is like a standard CLA region (it performs Spatial Pooling and Temporal Memory on its inputs), but its input and output are (prediction) error vectors rather than the raw “sensory” vectors.

The key section of the paper is Figure 8 on page 161 (PDF page 13), with the accompanying set of six algorithmic steps (a rough code sketch of the full cycle follows the steps below).

Information flow in PC-CLA (see text for details). McCall and Franklin (2013).

Step 1. Compute the current bottom-up prediction error, \(\epsilon_\nu\), between the current bottom-up Boolean input, \(y\), and the previous cycle’s top-down prediction, \(y^{\textrm{TD}}\).

Step 2. Compute the active columns of the Cortical Region for cycle \(t\), \(\textrm{L1}\).

a) Perform process \(g\) taking the bottom-up prediction error, \(\epsilon_\nu\), and the columns’ proximal dendrites and associated proximal synapses, and outputting the columns’ overlap score.

b) Add each column’s (bottom-up) overlap score to its predicted column activation, a scalar measure of column activation from temporal predictions for the column for this cycle (computed in Step 5a of cycle \(t-1\)), to obtain the overall column activity.

c) For columns with overall column activity greater than a threshold, perform a local \(k\)-winners-take-all procedure to determine the active columns, \(\textrm{L1}\). The constraint, \(k\), limits the number of possible active columns within a given area ensuring that the active columns are distributed.

Step 3. Compute the active cells at cycle \(t\), \(\textrm{L2}\), the current cells predicted to be active at some future cycle, \(\textrm{PL2}_t\), and their union, \(\textrm{U}\).

a) Based on the active columns, \(\textrm{L1}\), and the currently predicted cells, \(\textrm{PL2}_{t-1}\) (computed in Step 3b of cycle \(t-1\)), compute the current active cells, \(\textrm{L2}\).

b) Based on the active cells, \(\textrm{L2}\), and the region’s distal dendrites and synapses, perform process \(f\), producing the region’s current predicted (for some future cycle) cells, \(\textrm{PL2}_t\). Based only on the cells predicted for the next cycle, \(t+1\), determine the columns, predicted this cycle, to be active next cycle, \(\textrm{PL1}_t\) (used later in Step 5).

c) Compute the union, \(\textrm{U}\), of the active cells, \(\textrm{L2}\), and the current predicted cells, \(\textrm{PL2}_t\).

Step 4. Process the current received top-down prediction, \(\textrm{U}^\textrm{TD}\).

a) Compute the error between \(\textrm{U}\) and the current received top-down prediction, \(\textrm{U}^\textrm{TD}\), and send the error to the next hierarchical level.

b) Update \(\textrm{PL2}_t\), the current cells predicted to be active at some future cycle, by adding in those cells predicted in \(\textrm{U}^\textrm{TD}\).

Step 5. Based on the columns predicted to be active next cycle, \(\textrm{PL1}_t\), (found in Step 3b):

a) Compute each column’s predicted column activation (used in 2b of next cycle).

b) Perform process \(g^{-1}\) to generate the region’s current top-down prediction, \(y^{\textrm{TD}}\).

Step 6. Perform the learning processes.

a) Perform spatial learning, updating the permanence of proximal synapses based on bottom-up prediction error. Also update each column’s boost attribute based on its activity history.

b) Perform temporal learning, updating the permanence of distal synapses, and possibly adding new distal synapses. Temporal learning is driven by both unpredicted columns and predicted columns that did not actually become active. We give more details of learning in the next section.
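To make the six steps a bit more concrete, here is a very rough, runnable sketch of one PC-CLA cycle in Python. This is my own paraphrase, not the authors’ code: the SP/TM internals are stubbed out with fixed random connection matrices, the local k-winners-take-all is simplified to a global top-k with no activity threshold, the prediction error is assumed to be an elementwise XOR, and the learning of Step 6 is omitted entirely.

```python
import numpy as np

class PCCLARegionSketch:
    """Schematic sketch of one PC-CLA cycle (Steps 1-5 above); Step 6 omitted.
    All synapse structures are fixed random placeholders, not learned."""

    def __init__(self, input_size=64, n_columns=32, cells_per_column=4,
                 k_active=6, seed=0):
        rng = np.random.default_rng(seed)
        self.n_columns = n_columns
        self.k_active = k_active
        n_cells = n_columns * cells_per_column
        # Stand-ins for proximal (column <- input) and distal (cell <- cell) synapses.
        self.proximal = (rng.random((n_columns, input_size)) < 0.2).astype(float)
        self.distal = (rng.random((n_cells, n_cells)) < 0.05).astype(float)
        # State carried between cycles.
        self.prev_top_down = np.zeros(input_size, dtype=bool)       # y_TD
        self.prev_predicted_cells = np.zeros(n_cells, dtype=bool)   # PL2_{t-1}
        self.predicted_col_activation = np.zeros(n_columns)         # from Step 5a

    def cycle(self, y, u_td=None):
        """One cycle for Boolean input y; u_td is the top-down prediction
        received from the level above, over this region's cells (or None)."""
        # Step 1: bottom-up prediction error (assumed here to be XOR).
        eps = np.logical_xor(y, self.prev_top_down)

        # Step 2a-b: overlap of the error with the proximal synapses, plus the
        # predicted column activation computed last cycle (Step 5a).
        activity = self.proximal @ eps.astype(float) + self.predicted_col_activation
        # Step 2c: k-winners-take-all, simplified to a global top-k.
        active_cols = np.zeros(self.n_columns, dtype=bool)
        active_cols[np.argsort(activity)[-self.k_active:]] = True

        # Step 3a: active cells = predicted cells in active columns, or every
        # cell in an active column that had no prediction ("bursting").
        cells = self.prev_predicted_cells.reshape(self.n_columns, -1).copy()
        burst = active_cols & ~cells.any(axis=1)
        cells[~active_cols] = False
        cells[burst] = True
        active_cells = cells.ravel()

        # Step 3b: process f -- cells predicted for a future cycle
        # (placeholder: thresholded distal projection of the active cells).
        predicted_cells = (self.distal @ active_cells.astype(float)) >= 2
        predicted_cols = predicted_cells.reshape(self.n_columns, -1).any(axis=1)  # PL1_t

        # Step 3c: union of active and predicted cells.
        union = active_cells | predicted_cells

        # Step 4: error against the received top-down prediction U_TD (this is
        # what gets sent up the hierarchy), then fold U_TD into the predictions.
        if u_td is not None:
            error_up = np.logical_xor(union, u_td)
            predicted_cells = predicted_cells | u_td
        else:
            error_up = union

        # Step 5a: predicted column activation used in Step 2b of the next cycle.
        self.predicted_col_activation = predicted_cols.astype(float)
        # Step 5b: process g^-1 (the unspecified "reconstruction"), faked here
        # by pushing the predicted columns back through the proximal matrix.
        self.prev_top_down = (self.proximal.T @ predicted_cols.astype(float)) >= 1

        self.prev_predicted_cells = predicted_cells
        return error_up

# Example: feed a few random Boolean inputs through one region.
rng = np.random.default_rng(1)
region = PCCLARegionSketch()
for _ in range(3):
    y = rng.random(64) < 0.1
    print("bits sent up:", int(region.cycle(y).sum()))
```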

This is an intriguing design, but there are several really important points which don’t seem to me to work:

I can understand the idea of comparing a predicted SDR with a “sensed” SDR, but this system actually requires a sensory region to generate a prediction of its SDR from top-down and lateral (predictive) information, then decode that into the input space, then receive the input and compare, and then use only the error to feed into the CLA component.

It also requires a higher region to be able to convert its predicted SDR into a top-down SDR for the lower region, having only received as input the last error encoding the difference between the previous SDR it predicted and the actual SDR produced in the lower region.

This chain of dependency has no top. Where do the top-level regions get their predictions?

This design reverses the whole idea of hierarchical dependency. As you go up the hierarchy, you need more and more detailed knowledge of the semantics and dynamics of the representations you’re receiving from below. Each higher region has a column count that is a multiple of the size of the lower region!

The design depends on magical, reversible functions (\(g\) and its inverse) which bidirectionally and deterministically transform a region’s input into the space of columnar SDRs (as plain old SP does), and more crucially, in the opposite direction. There is no mention of how this inverse translation \(g^{-1}\) (aka reconstruction) is performed, perhaps because it’s provably intractable (current HTM systems estimate this mapping statistically).

Further, the design describes \(g^{-1}\) as the inverse of \(g\), but in the diagram (and in the steps) \(g\) is the mapping from the error to the columns, while \(g^{-1}\) maps the columns to the predicted input, which is a completely different animal. I’m not sure if this is a genuine error or a convenient avoidance of a difficult problem.
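Coming back to the point that current HTM systems estimate this mapping statistically: here is a minimal sketch (mine, not anything from the paper or from NuPIC) of the sort of estimate I mean. You record, over many cycles, which input bits tend to be active when each column is active, and then “decode” a predicted column SDR back into the input space by averaging and thresholding those statistics. The function names and the threshold are illustrative only.

```python
import numpy as np

def fit_decoder(column_sdrs, inputs):
    """Estimate, for each column, how often each input bit was active while
    that column was active.  column_sdrs: (n_samples, n_columns) Boolean,
    inputs: (n_samples, n_inputs) Boolean.  Purely illustrative."""
    counts = column_sdrs.astype(float).T @ inputs.astype(float)
    activations = column_sdrs.sum(axis=0).reshape(-1, 1) + 1e-9
    return counts / activations          # ~ P(input bit | column active)

def decode(column_sdr, decoder, threshold=0.5):
    """Approximate g^-1: average the per-column bit statistics over the
    predicted columns and threshold to get a Boolean 'predicted input'."""
    active = np.asarray(column_sdr, dtype=bool)
    if not active.any():
        return np.zeros(decoder.shape[1], dtype=bool)
    return decoder[active].mean(axis=0) > threshold

# Usage sketch (variables are hypothetical): after collecting (input, column-SDR)
# pairs from a trained SP,
#   dec = fit_decoder(sdr_history, input_history)
#   predicted_input = decode(predicted_column_sdr, dec)
```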

The “Testing” section looks impressive, but unfortunately hides a few crucial details deep in the text (and omits several others).

The first section is just a plain old single-layer SP, which has nothing to do with anything described in the paper about Predictive Coding. So it tells us nothing at all that we don’t already know.

The second section, involving hierarchy, is actually also just using plain old SP/TM on direct inputs! Here’s what it says in the middle:

We controlled for the effects of the Cortical Regions processing prediction errors by having both regions process only their respective input. The effect of processing prediction errors is the subject of another test. [my emphasis]

The graphs are also (at best) confusing. They use an F-Score with a \(\beta\) of 10. F-Scores (I just found out from Wikipedia!) normally use a \(\beta\) between 0.5 and 2, depending on whether you consider recall or precision more important.

Using a \(\beta\) of 10 gives you a result which is effectively equal (roughly 99%) to recall (the true positive rate) and effectively ignores (roughly 1%) precision, because \(\beta\) is squared in the formula; precision only matters if it is very low (see the graph below for \(\beta = 10\) and compare with the graphs for \(\beta\) of 0.5, 1.0 and 2.0).
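For reference (this is the standard definition of the measure, not something from the paper), \(F_\beta\) is a weighted harmonic mean of precision \(P\) and recall \(R\):

\[
F_\beta = (1+\beta^2)\,\frac{P\,R}{\beta^2 P + R},
\qquad
\frac{1}{F_\beta} = \frac{1}{1+\beta^2}\cdot\frac{1}{P} + \frac{\beta^2}{1+\beta^2}\cdot\frac{1}{R}.
\]

With \(\beta = 10\) the weights are \(1/101 \approx 1\%\) on precision and \(100/101 \approx 99\%\) on recall. For example, with \(P = 0.5\) and \(R = 0.9\) we get \(F_{10} \approx 0.89\) (essentially just recall), compared with \(F_{1} \approx 0.64\) and \(F_{0.5} \approx 0.55\).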

F-Score with \(\beta\) set to 10, as used in the paper.

F-Score with \(\beta = 1\) (balanced precision/recall).

F-Score with a \(\beta\) of 0.5.

F-Score with \(\beta\) of 2.0.

Why would you choose a parameter which skews the measure so strongly in favour of one aspect of accuracy? Since there are only a couple of numbers involved, it would make more sense just to report the counts of true and false positives and negatives for the two tests (eight numbers in total).

The authors also do not explain what is meant by “top-down influence”, and their own results show that whatever it is only affects the higher region; they admit it has no effect whatsoever on the lower region (where the influence would surely be felt).

Finally, the graphs seem to show (it’s hard to know without the raw numbers) that the bottom region is doing a reasonable job (0.75) of SP-recognising the raw input, while the top region is getting as many SP columns wrong as the bottom is getting right! Then the “top-down influence” is added and the top region’s F-score goes to exactly 1, which is 100% perfect. This happens even in the temporal domain, when the bottom layer is only predicting with an F-score of 0.05 or so (i.e. getting about 1 column in 20 of the predicted SDR right).

[2] http://www.cogsys.org/papers/2013poster7.pdf