• Nov 13 / 2014
  • 0
Cortical Learning Algorithm, NuPIC

Efficiency of Predicted Sparseness as a Motivating Model for Hierarchical Temporal Memory

Part 1 – Introduction and Description.

In any attempt to create a theoretical scientific framework, breakthroughs are often made when a single key “law” is found to underly what previously appeared to be a number of observed lesser laws. An example from Physics is the key principle of Relativity: that the speed of light is a constant in all inertial frames of reference, which quickly leads to all sorts of unintuitive phenomena like time dilation, length contraction, and so on. This discussion aims to do this for HTM by proposing that its key underlying principle is the efficiency of predicted sparseness at all levels. I’ll attempt to show how this single principle not only explains several key features of HTM identified so far, but also explains in detail how to model any required structural component of the neocortex.

The neocortex is a tremendously expensive organ in mammals, and particularly in humans, so it seems certain that the benefits it provides are proportionately valuable to the genes of an animal. We can use this relationship between cost and benefit, with sparseness and prediction as mediating metrics, to derive detailed design rules for the neocortex at every level, down to individual synapses and their protein machinery.

If you take one thing away from this talk, it should be that Sparse Distributed Representations are the key to Intelligence. Jeff Hawkins

Note: The next post in this series describes the Mathematics of Hierarchical Temporal Memory.

Sparse Distributed Representations are a key concept in HTM theory. In any functional piece of cortex, only a small fraction of a large population of neurons will be active at a given time; each active neuron encodes some component of the semantics of the representation; and small changes in the exact SDR correspond with small differences in the detailed object or concept being represented. Ahmad 2014 describes many important properties of SDRs.

SDRs are one efficient solution to the problem of representing something with sufficient accuracy at optimal cost in resources, and in the face of ambiguity and noise. My thesis is that in forming SDRs, neocortex is striving to optimise a lossy compression process by representing only those elements of the input which are structural and ignoring everything else.

Shannon proposed that any message has a concrete amount of information, measured in bits, which reflects the amount of surprise (i.e. something you couldn’t compute from the message so far, or by other means) contained in the message.

The most efficient message has zero length – it’s the message you don’t need to send. The next most efficient message contains only the information the receiver lacks to reconstruct everything the sender wishes her to know. Thus, by using memory and the right encoding to connect with it, a clever receiver (or memory system) can become very efficient indeed.

We will see that neocortex implements this idea literally, at all levels, as it attempts to represent, remember and predict events in the world as usefully as possible and at minimal cost.

The organising principle in cortical design is that components (from the whole organism down to a synapse) can do little about the amount of signal they receive, but they can – and do – adapt and learn to make best use of that signal to control what they do, only acting – sending a signal – when it’s the predicted optimal choice. This gives rise to sparseness in space and time everywhere, which directly reflects the degree of successful prediction present in any part of the system.

The success metric for a component in neocortex is the ratio of input data rate to output information rate, where the component has either a fixed minimum, or (for neurons and synapses) a fixed maximum, output level.

Deviations from the target indicate some failure to predict activity. This failure is either an opportunity to learn (and predict better next time), or, failing that, something which needs to be acted upon in some other way, by taking a different action or by passing new information up the hierarchy.

Note inputs in this context are any kind of signal coming in to the component under study. In the case of regions, layers and neurons, these include top-down feedback and lateral inputs as well as feedforward.


Neocortex is a hierarchy because it has finite space to store its model of the world, and a hierarchy is an optimal strategy when the world itself has hierarchical structure. Each region in the hierarchy is subjected (by design) to a necessarily overwhelming rate of input, it will run at capacity to absorb its data stream, reallocating its finite resources to contain an optimal model of the world it perceives.


The memory inside a region of cortex is driven towards an “ideal” state in which it always predicts its inputs and thus produces a “perfect”, minimal message – containing its learned SDR of its world’s current state – as output. Any failure to predict is indicated by a larger output, the deviation from “ideal” representing the exact surprise of the region to its current perception of the world.

A region has several output layers, each of which has a different (and usually more than one) purpose.

For each region, two layers send (different) signals up the hierarchy, therefore signalling both the current state of its world and the encoding of its unpredictability. The higher region now gets details of something it should hopefully have the capacity to handle – predict – or else it passes the problem up the chain.

Two layers send (again different) signals down to lower layers and (in the case of motor) to subcortical systems. The content of these outputs will relate to the content as well as the stability and confidence of the region’s model, and also actions which are appropriate in terms of that content and confidence level.


A cortical layer which has fully predicted its inputs has a maximally sparse output pattern. A fully failing prediction pattern in a layer causes it to output a maximally bursting and minimally sparse pattern, at least for a short time. At any failure level in between, the exact evolution of firing in the bursting neurons encodes the precise pattern of prediction failure of the layer, and this is the information passed to other layers in the region, to other regions in cortex, or to targets outside the cortex.

The output of a cortical layer is thus a minimal message – it “starts” with the best match of its prediction and reality, followed (in a short period of time) by encodings of reality in the context of increasingly weak prediction.


A layer’s output, in turn, is formed from the combination of its neurons, which are themselves arranged in columns. The columnar arrangement of cells in cortical columns is the key design leading to all the behaviour described previously.

Pyramidal cells, which represent both the SDR activity pattern and the “memory” in a layer, are all contained in columns. The sparse pattern of activity across a layer is dictated by how all the cells compete within this columnar array.

Columns are composed of pyramidal cells, which act independently, and a complex of inhibitory cells which act together to define how the column operates. All cells share a very similar feedforward receptive field, due to the fact that feedforward axons physically run up through the narrow column and abut the pyramidal bodies as they squeeze past.

Columnar Inhibition

The inhibitory cells have a broader and faster feedforward response compared with the pyramidal cells Reference so, in the absence of strong predictive inputs to any pyramidal cells, the entire assemblage of inhibitory neurons will be first to fire in a column. When this happens, these inhibitory cells excite those in adjacent columns, and a wave of inhibition spreads out from a successfully firing column.

The wave continues until it arrives at a column which has already been inhibited by a wave coming from elsewhere in the layer (from some recently active column). This gives rise to a pattern of inactivity around columns which are currently active.

Predictive Activation

Each cell in a column has its own set of feedforward and predictive inputs, so every cell has a different rate of depolarising as it is driven towards firing threshold.

Some cells may have received sufficient depolarising input from predictive lateral or top-down dendrites to reach firing threshold before the column’s sheath of inhibitory cells. In this case the pyramidal cell will fire first, trigger the column’s inhibitory sheath, and cause the wave of inhibition to spread out laterally in the layer.

Vertical Inhibition in Columns

When the inhibitory sheath fires, it also sends a wave of inhibitory signals vertically in the column. This wave will shut down any pyramidal cells which have not yet reached threshold, giving rise to a sparse activity pattern in the column.

The exact number of cells which get to fire before the sheath shuts them down depends mainly on how predictive each cell was and whether the sheath was triggered by a “winning cell” (previous section), by the sheath being first to fire, or as a result of neighbouring columns sending out signals.

If there is a wave of inhibition reaching a column, all cells are shut down and none (or no more) fire.

If there was a cell so predictive that it fired before the sheath, all other cells are very likely shut down and only one cell fires.

Finally, if the sheath was first to fire due to its feedforward input, the pyramidal cells are shut down quite quickly, but the most predictive may get the chance to fire just before being shut down.

This last process is called bursting, and gives rise to a short-lived pattern which encodes exactly how well the column as an ensemble has matched its predictions. Basically, the more cells which fire, the more “confused” the match between prediction and reality. This is because the inhibition happens quickly, so the gap between the first and last cell to burst must be small, reflecting similar levels of predictivity.

The bursting process may also be ended by an incoming wave of inhibition. The further away a competing column is, the longer that will take, allowing more cells to fire and extending the burst. Thus the amount of bursting also reflects the local area’s ability to respond to the inputs.


Neurons are machines which use patterns of input signals to produce a temporal pattern of output signal. The neuron wastes most resources if its potential rises but just fails to fire, so the processes of adaption of the neuron are driven to a) maximise the response to inputs within a particular set, and b) minimise the response to inputs outside that set.

The set of excitatory inputs to one neuron are of two main types – feedforward and predictive; the number of each type of input varies from 10’s to 10’s of thousand; and the inputs arrive stochastically in combinations which contain mixtures of true structure and noise, so the “partitioning problem” a neuron faces is intractable. It simply learns to do the best it can.

Note that neurons are the biggest components in HTM which actually do anything! In fact, the regions, layers and columns are just organisational constructs, ways of looking at the sets of interacting neurons.

The neuron is the level in the system at which genetic control is exercised. The neuron’s shape, size, position in the neocortex, receptor selections, and many more things are decided per-neuron.

Importantly, many neurons have a genetically expressed “firing program” which broadly sets a target for the firing pattern, frequency and dependency setup.

Again, this gives the neuron an optimal pattern of output, and its job is to arrange its adaptations and learn to match that output.


Distal dendrites have a similar but simpler and smaller scale problem of combining inputs and deciding whether to spike.

I don’t believe dendrites do much more than passively respond to global factors such as modulators and act as conduits for signals, both electrical and chemical, originating in synapses.


Synapses are now understood to be highly active processing components, capable of growing both in size and efficiency in a few seconds, actively managing their response to multiple inputs – presynaptic, modulatory and intracellular, and self-optimising to best correlate a stream of incoming signals with the activity of the entire neuron.

Part Two takes this idea further and details how a multilayer region uses the efficiency of predicted sparseness to learn a sensorimotor model and generate behaviour.

The next post in this series describes the Mathematics of Hierarchical Temporal Memory. This diversion is useful before proceeding with the main thread.

Blättler F, Hahnloser RHR. An Efficient Coding Hypothesis Links Sparsity and Selectivity of Neural Responses. Kiebel SJ, ed. PLoS ONE 2011;6(10):e25506. doi:10.1371/journal.pone.0025506. [Full Text]

  • Sep 14 / 2014
  • 4
Cortical Learning Algorithm, NuPIC

A Unifying View of Deep Networks and Hierarchical Temporal Memory

There’s been a somewhat less than convivial history between two of the theories of neurally-inspired computation systems over the last few years. When a leading protagonist of one school is asked a question about the other, the answer often varies from a kind of empty semi-praise to downright dismissal and the occasional snide remark. The objections of one side to the others’ approach are usually valid, and mostly admitted, but the whole thing leaves one with a feeling that it is not a very scientific way to proceed or behave. This post describes an idea which might go some way to resolving this slightly unpleasant impasse and suggests that the discrepancies may simply be as a result of two groups using the same name for two quite different things.

In HTM, Jeff Hawkins’ plan is to identify the mechanisms which actually perform computation in real neocortex, abstracting them only far enough that the details of the brain’s bioengineering are simplified out, and hopefully leaving only the pure computational systems in a form which allows us to implement them in software and reason about them. On the other hand, Hinton and LeCun’s neural networks are each built “computation-first,” drawing some inspiration from and resembling the analogous (but in detail very different) computations in neocortex.

The results (ie the models produced), inevitably, are as different at all levels as their inventors’ approaches and goals. For example, one criterion for the Deep Network developer is that her model is susceptible to a set of mathematical tools and techniques, which allow other researchers to frame questions, examine and compare models, and so on, all in a similar mathematical framework. HTM, on the other hand, uses neuroscience as a standard test, and will not admit to a model any element which is known to be contradicted by observation of natural neocortex. The Deep Network people complain that the models of HTM cannot be analysed like theirs can (indeed it seems they cannot), while the HTM people complain that the neurons and network topologies in Deep Networks bear no relationship with any known brain structures, and are several simplifications too far.

Yann LeCun said recently on Reddit (with a great summary):

Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it’s the very idea of deep learning) and will use temporal prediction.

But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?).

I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice.

As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.

The topic of HTM and Jeff Hawkins was second out of all the major themes in the Q&A session, reflecting the fact that people in the field view this as an important issue, and (it seems to me) wish that the impressive progress made by Deep Learning researchers could be reconciled with the deeper explanatory power of HTM in describing how the neocortex works.

Of course, HTM people seldom refuse to play their own role in this spat, saying that a Deep Network sacrifices authenticity in favour of mathematical tractability and getting high scores on artificial “benchmarks”. We explain or excuse the fact that our models are several steps smaller in hierarchy and power, making the valid claim that there are shortcuts and simplifications we are not prepared to make,  and speculating that we will – like the tortoise – emerge alone at the finish with the prize of AGI in our hands.

The problem is, however, a little deeper and more important than an aesthetic argument (as it sometimes appears). This gap in acknowledging the valid accomplishments of the two models, coupled with a certain defensiveness, causes a “chilling effect” when an idea threatens to cross over into the other realm. This means that findings in one regime are very slow to be noticed or incorporated in the other. I’ve heard quite senior HTM people actually say things like “I don’t know anything about Deep Learning, just that it’s wrong” – and vice versa. This is really bad science.

From reading their comments, I’m pretty sure that no really senior Deep Learning proponent has any knowledge of the current HTM beyond what he’s read in the popular science press, and the reverse is nearly as true.

I consider a very good working knowledge of Deep Learning to be a critical part of any area of computational neuroscience or machine learning. Obviously I feel at least the same way about HTM, but recognise that the communication of our progress (or even the reporting of results) in HTM has not made it easy for “outsiders” to achieve the levels of understanding they feel they need to take part. There are historical reasons for much of this, but it’s never too late to start fixing a problem like this, and I see this post (and one of my roles) as a step in the right direction.

The Neuron as the Unit of Computation

In both models, we have identified the neuron as the atomic unit of computation, and the connections between neurons as the location of the memory or functional adjustment which gives the network its computational power. This sounds fine, and clearly the brain uses neurons and connections in some way like this, but this is exactly where the two schools mistakenly diverge.

Jeff Hawkins rejects the NN integrate-and-fire model and builds a neuron with vastly higher complexity. Geoff Hinton admits that, while impossible to reason about mathematically, HTM’s neuron is far more realistic if your goal is to mimic neocortex. Deep Learning, using neurons like Lego bricks, can build vast hierarchies and huge networks, find cats in Youtube videos, and win prizes in competitions. HTM, on the other hand, struggles for years to fit together its “super-neurons” and builds a tiny, single-layer model which can find features and anomalies in low-dimensional streaming data.

Looking at this, you’d swear these people were talking about entirely different things. They’ve just been using the same names for them. And, it’s just dawned on me, therein lies both the problem and its solution. The answer’s been there all the time:

Each and every neuron in HTM is actually a Deep Network.

In a HTM neuron, there are two types of dendrite. One is the proximal dendrite, which contains synapses receiving inputs from the feedforward (mainly sensory) pathway. The other is a set of coincidence-detecting, largely independent, distal dendrite segments, which receive lateral and top-down predictive inputs from the same layer or higher layers and regions in neocortex.

My thesis here is that a single neuron can be seen as composed of many elements which have direct analogues in various types of Deep Learning networks, and that there are enough of these, with a sufficient structural complexity, that it’s best to view the neuron as a network of simple, Deep Learning-sized nodes, connected in a particular way. I’ll describe this network in some detail now, and hopefully it’ll become clear how this approach removes much of the dichotomy between the models.

Firstly, a synapse in HTM is very much like a single-input NN node, where HTM’s permanence value is akin to the bias in a NN node, and the weight on the input connection is fixed at 1.0. If the input is active, and the permanence exceeds the threshold, the synapse produces a 1. In HTM we call such a synapse connected, in that the gate is open and the signal is passed through.

The dendrite or dendrite segment is like the next layer of nodes in NN, in that it combines its inputs and passes the result up. The proximal dendrite effectively acts as a semi-rectifier, summing inputs and generating a scalar depolarisation value to the cell body. The distal segments, on the other hand, act like thresholded coincidence detectors and produce a depolarising spike only if the sum of the inputs exceeds a threshold.

These depolarising inputs (feedforward and recurrent) are combined in the cell body to produce an activation potential. This only potentially generates the output of the entire neuron, because a higher-level inhibition system is used to identify those neurons with highest potential, allow those to fire (producing a binary 1), and suppress the others to zero (a winner-takes-all step with multiple local winners in the layer).

So, a HTM layer is a network of networks, a hierarchy in which neuron-networks communicate with connections between their sub-parts. At the HTM layer level, each neuron has two types of input and one output, and we wire them together at such, but each neuron is really hiding an internal, network-like structure of its own.




  • Aug 23 / 2014
  • 0
Cortical Learning Algorithm, NuPIC

Suggested Naming in HTM Theory and White Paper

“There are only two hard things in Computer Science: cache invalidation and naming things.” – Phil Karlton

In the case of HTM, we also have the much bigger problem of explaining how neocortex may work, and how a non-obvious CLA operates to use cortical principles. Extra confusion caused by poor naming multiplies the difficulties.

A key component of the art of naming consists in identifying the scope of each name. We need to have names which are just specific enough to capture the underlying concept, but not so specific that they entangle non-essential details. Names also need to be memorable and comfortable, while not being too easy to misconstrue, because they resemble or contain words which have other meanings.

I’d like to begin a reasoned discussion about key names in HTM and CLA. The goal of the discussion is to arrive at a set of names which everyone strongly believes captures the concepts for both theory and implementation.

As a famous Supreme Court judge once said of pornography, “we cannot define it but we know it when we see it.” We are looking for this kind of name, with the added advantage that HTM can actually precisely define the concept behind each name.

Until we arrive at a good name for something (ie one which magically gets everyone’s support), we should identify the key flaws in each candidate and agree that they invalidate that candidate. This is a healthy process which should not be regarded as a criticism of any proposer.

Please treat that as an open invitation to tell me how poor my proposed names are, but only for reasons you’d accept as rational if they were directed at yours!

I’m currently re-reading the 2011 White Paper with a view to updating and improving it. This document is a very rich source of information pertinent to this discussion, and in fact appears to answer a couple of the thorniest ones! I’d very strongly recommend re-reading it as preparation for taking part in this discussion.

I’d like to go through the main named concepts one by one, discuss the strengths and weaknesses of the current names, and propose a new name for each concept with some supporting motivations and argument. I don’t expect that my proposals will stick, but they should get us a noticeable step in the right direction, or at least throw light on the relevant issues.

Sparse Distributed Representation.

I start with this one because, in my experience of learning, reasoning about, writing about, talking about, and explaining HTM, the term SDR is as close to perfect as I can imagine. It has the property of monotonically improving understanding the more you find out about each of the three concepts named.

It is also an easily testable name. We all remember when Francisco showed us the CEPT Retina SDRs, in fact they were so SDRish, some of us thought they were too good to be true!

Spatial Pooling.

There are several problems with this term. We understand that “spatial” was chosen to indicate that each presentation of the data has some properties and structure in the sensory domain (such as a shape, size or colour), and it’s called “spatial” as opposed to “temporal”.

A difficulty arises for newcomers who read too much into this use of the word. There is a strong temptation to rely on our commonsense ideas of space when Jeff is really talking about mathematical, vector spaces and the abstract “spaces” of SDRs.

HTM does not require the kind of retinotopic mapping found in V1. The only reason we have literal spatial layouts in just a few primary areas of sensory cortex is because it is a simpler evolutionary and developmental design, not because it is needed for the algorithm. The RDSE, the Geospatial Encoder and the CEPT retina are all superb examples of how “pseudorandom” representations are better than more pictorially understandable spatial representation regimes.

Lastly, we’ve already tripped over this when we started talking about the new sensorimotor theory. L4 cells are now dealing with motor inputs as well as “spatial”, and L3 cells are now expected to “see” a set of L4 outputs whose members are substituted over time. So the word “spatial” really needs to go.

The word “Pooling” has, for many, either no meaning at all (most cases), or worse, the wrong meanings in this context. If you are trying to capture the notion of a noise-tolerant, largely stable representation of closely related sensory input, “pooling” isn’t going to do that for most people.

I’m not sure there is a good word for this, so my suggestion drops this aspect. As mentioned several times in the 2011 White Paper, the concept of pooling (noise-tolerance, high-overlap) is already embedded as a property of the product of SP – the SDR.

I propose the term Pattern Memory for what we currently call Spatial Pooling. This captures the fact that patterns in the data are recognised-learned and that the CLA is developing a memory of patterns it has seen. By not being too specific about which patterns we mean, it also allows us to say that the CLA learns to recognise and remember patterns of input data, stores patterns of synaptic connections, and forms patterns of activation (SDRs) to represent its inputs.

This name is also robust to adopting the new theory. L4 cells can learn sensorimotor patterns, and L3 cells can learn to recognise patterns of membership in a sequence-set.
We can run this in the top-down direction too, talking about patterns appearing in L1, motor patterns, patterns of depolarisation, and so on.

(old) Temporal Pooling.

The problems with using this term in its old context have been well-rehearsed, and it’s now used for the much more appropriate concept of representing a stable(r) sequence-identifying SDR in Layer 3 when sensorimotor transitions from that sequence are occurring in Layer 4. Temporal Pooling, in that sense, is another great name.

I had previously offered the term “Transition Prediction” for the component of CLA involving lateral connections and predictive states. Jeff and Numenta are currently using “Temporal Memory”. I believe both are flawed.

My suggestion accurately captured the limited, 1-timestep scope of this component, and also the fact that prediction is the key to temporal learning. However, it sounds like we need to add words to the name, to reflect “something missing” from the two word name.

Temporal Memory, on the other hand, is too high-ranking and valuable a name for this relatively basic component. It carries the risk that people will think HTM is just a hierarchy of TMs. Also, “temporal” is too general – the same word is currently used for single-timestep (old TP/TM) all the way up to entire sequences (new TP).

I propose Transition Memory for this second core component of CLA. This captures most literally what the algorithm is doing – learning single transitions. It is also the temporal equivalent of Pattern Memory, using distal dendrites to link to past SDRs just as PM uses proximal dendrites to link to feedforward patterns.

Importantly, the term Transition Memory is not trying to work too hard. We can explain that learned transitions are used to put cells into predictive states, and that these predictive patterns are used both in sensory (variable order) and sensorimotor (first order) temporal learning. They are used to match predicted and actual inputs, detect anomalies and create patterns which indicate continuing successful prediction or trigger a pattern of bursting columns. It seems impossible to me to have one name capture all these aspects, so I propose we stop trying and give the name a break!

In a variation on Pattern Memory (SP), depolarisation due to Transition Memory is combined with feedforward inputs to assist recognition and increase noise-tolerance. In Jeff’s new sensorimotor theory, combining distal with proximal inputs is likely to be key to the function.

Old and New Versions of HTM/CLA Theory.

In previous posts, I used “old and new” or “2013 and 2014” to distinguish these two generations of the theory. In reworking the White Paper, I’ve recognised that these two theories are akin to the Newtonian versus Relativistic or Quantum views of mechanics. You need to quite deeply understand the simpler theory before you can begin to deal with the far more complex and realistic one. And for many purposes, the simpler theory is perfectly sufficient both for understanding how the neocortex works, and for useful application in software.

I thus propose that the older, simpler theory and model be called the “Sensory Cortical Learning Algorithm” or “Sensory CLA”, the newer being called the “Sensorimotor CLA”.

SCLA (or just CLA) and SMCLA are simple, distinguishable acronyms.

This also allows us to talk about HTM systems with SCLA single-layer regions (as NuPIC can/does), which just do feedforward, sensory hierarchy, or else fuller HTMs which incorporate behaviour, stable sequences, temporal pooling, and true bidirectional hierarchy using SMCLA in each region.

  • Aug 14 / 2014
  • 0
Cortical Learning Algorithm, NuPIC

Implications of the NuPIC Geospatial Encoder

Numenta’s Chetan Surpur recently demoed and explained the details of a new encoder for NuPIC which creates Sparse Distributed Representations (SDRs) from GPS data. Apart altogether from the direct applications which this development immediately suggests, I believe that Chetan’s invention has a number of much more profound implications for NuPIC and even HTM in general. This post will explore a few of the most important of these. Chetans’ demo and a tutorial by Matt Taylor are available on Youtube. First, here is Chetan presenting to, and discussing it with, Numenta people: And here’s Matt with another excellent hands-on tutorial:


I’ll begin by describing the encoder itself. The Geospatial Encoder takes as input a triple [Lat, Long, Speed] and returns a Sparse Distributed Representation (SDR) which uniquely identifies that position for the given speed. The speed is important because we want the “resolution” of the encoding to vary depending on how quickly the position is changing, and Chetan’s method does this very elegantly. The algorithm is quite simple. First, a 2D space (Lat, Long) is divided up (virtually) into squares of a given scale (a parameter provided for each encoder), so each square has an x and y integer co-ordinate (the Lat-Long pair is projected using a given projection scheme for convenient display on mapping software). This co-ordinate pair can then be used as a seed for a pseudorandom number generator (Python and numpy use the cross-platform Mersenne Twister MT19937), which is used to produce a real-valued order between 0 and 1, and a bit position chosen from the n bits in the encoding. These can be generated on demand for each square in the grid, always yielding the same results. To create the SDR for a given position and speed, the algorithm first converts the speed to a radius and forms a box of squares surrounding the position and calculates the pair [orderbit] for each square in the box. The top w squares (with the highest order) are chosen, and their bit values are used to choose the w active bits in the SDR.

Initial Interpretation

The first thing to say is that this encoder is an exemplar of transforming real-world data (location in the context of movement) into a very “SDR-like” SDR. It has the key properties we seek in an SDR encoder, in that semantically similar inputs will yield highly overlapping representations. It is robust to noise and measurement error in both space and time, and the representation is both unique (given a set scale parameter) and reproducible (given a choice of cross-platform random number generator), independently of the order of presentation of the data. The reason for this “SDR-style” character is that the entire space of squares forms an infinite field of “virtual neurons”, each of which has some activation value (its order) and position in the input bit vector (its bit). The algorithm first sparsifies this representation by restricting its sampling subspace to a box of squares around the position, and then enforces the exact sparseness by picking the w squares using a competitive analogue of local inhibition.

Random Spatial Neuron Field (Spatial Retina)

This idea can be generalised to produce a “spatial retina” in n-dimensional space which provides a (statistically) unique SDR fingerprint for every point in the space. The SDRs specialise (or zoom in) when you reduce the radius factor, and generalise (or zoom out) when radius is increased. This provides a distance metric between two points which involves the interplay of spatial zoom and the fuzziness of overlap. Any two points will have identical SDRs (w bits of overlap) if you increase the radius sufficiently, and entirely disparate SDRs (0 bits overlap) if you zoom in sufficiently (down to the order of w*scale). Since the Coordinate Encoder operates in a world of integer-indexed squares, we first need to transform each dimension using its own scale parameter (the Geospatial Encoder uses the same scale for each direction, but this is not necessary). We thus have a single, efficient, simple mechanism which allows HTM to navigate in any kind of spatial environment. This is, I believe a really significant invention which has implications well beyond HTM and NuPIC. As Jeff and others mentioned during Chetan’s talk, this may be the mechanism underlying some animals’ ability to navigate using the Earth’s magnetic field. It is possible to envisage a (finite, obviously) field of real neurons which each have a unique response to position in the magnetic field. Humans have a similar ability to navigate, using sensory input to provide an activation pattern which varies over space and identifies locations. We combine whichever modalities work best (blind people use sound and memories of movement to compensate for impaired vision), and as long as the pipeline produces SDRs of an appropriate character, we can now see how this just works.

Comparison with Random Distributed Scalar Encoder (RDSE)

The Geospatial Encoder uses the more general Coordinate Encoder, which takes a n-dimensional integer vector and a radius, and produces the corresponding SDR. It is easy to see how a 1D spatial encoder with a fixed speed would produce an SDR for arbitrary scalars, given an initial scale which would decide the maximum resolution of the encoder.  This encoder would be an improved replacement for the RDSE, with the following advantages:

  • When encoding a value, the RDSE needs to encode all the values between existing encodings and the new value (so that the overlap guarantees are honoured). A 1D-Geo encoder can compute each value independently, saving significantly in time and memory footprint.
  • In order to produce identical values for all inputs regardless of the order of presentation, the RDSE needs to “precompute” even more values in batches around a fixed “centre” (eg to compute f(23) starting at 0, we might have to compute [f(-30),…,f(30)]). Again, 1D-Geo scalar encoding computes each value uniquely and independently.
  • Assuming scale (which decides the max resolution) is fixed, the 1D-Geo scalar encoding can compute encodings of variable resolution with semantic degradation by varying speed. The SDR for a value is exactly unique for the same speed, but changes gradually as speed is increased or decreased. The RDSE has no such property.

This would strongly suggest that we can replace the RDSE with a 1D coordinate spatial encoder in NuPIC, and get all the above benefits without any compromise.

Combination with Spatially-varying Data

It is clear how you could combine this encoding scheme with data which varies by location, to create a richer idea of “order” in feeding the SDR generation algorithm. For example, you could combine random “order” with altitude or temperature data to choose the top w squares. Alternatively, the pure spatial bit signature of a location may be combined in parallel with the encoded values of scalar quantities found at the current location, so that a HTM system associatively learns the spatial structure of the given scalar field.

Spatially Addressed Memory

The Geospatial Encoder computes a symbolic SDR address for a spatial location, effectively a “name” or “word” for each place. The elements or alphabet of this encoding are simply random order activation values of nearby squares, so any more “real” semantic SDR-like activation pattern will do an even better job in computing spatial addresses. We use memories of spatial cues (literally, landmarks), emotional memories, maps, memories of moving within the space, textual directions, and so on to encode and reinforce these representations. This model explains why memory experts often use Memory Palaces (aka the Method of Loci) to remember long sequences of data items. They associate each item (or an imagined, memorable visual proxy) occupying a location in a very familiar spatial environment. It also explains the existence of “place neurons” in rodent hippocampi – these neurons are each participating in generating a spatial encoding similar in character to the Geospatial Encoder.

Zooming, Panning and Attention

This is a wonderful model for how we “zoom in” or “zoom out” and perceive a continuously but smoothly varying model of the world. It also models how we can perceive gracefully degrading levels of detail depending on how much time or attention we pay for a perception. In this case, the “encoder” detailed here would be a subcortical structure or a thalamus-gated (attention controlled) input or relay between regions. If we could find a mechanism in the brain which controls the size and position of a “window” of signals (akin to our variable box of squares), we would have a candidate for our ability to use attention to control spatial resolution and centre of focus. Such a mechanism may automatically arise from preferentially gating neurons at the edges of a “patch”, by virtue of the inhibition mechanism’s ability to smoothly alter the representation as inputs are added or removed. This mechanism would also explain boundary extension error, in which we “fill out” areas surrounding the physical boundaries of objects and images. As explained in detail in her talk at the Royal Institute, Eleanor Maguire believes that the hippocampus is crucial for both this phenomenon and our ability to navigate in real space. As one of the brain components at the “top” of the hierarchies, the hippocampus may be the place where we can perform the crucial “zooming and panning” operations and where we manipulate spatial SDRs as suggested by the current discovery.

Implementation Details

The coordinate encoder has a deterministic, O(1), order-independent algorithm for computing both “order” and bit choice. One important issue is that the pseudorandom number is Python-specific, and so a Java encoder (which uses a different pseudorandom number generator) will produce completely different answers. The solution is to use the Python (and numpy) RNG, which is the Mersenne Twister MT19937, also used by default in numerous other languages. I believe it would be worth exploring using Perlin noise to generate the order and bit choice values. This would give you a) identical encodings across platforms, b) pseudorandom, uncorrelated values when the noise samples are far enough apart (eg when the inputs are integers as in this case), and c) smoothly changing values if you use very small step sizes. Just one point about changing radius and its effect on the encoding. I’m very confident that the SDR is very robust to changes in radius, due to the sparsity of the SDRs. In other words, the overlap in an SDR at radius r with that at radius r’ (at the same GPS position) will be high, because you are only adding or removing an annulus around the same position (this will be similar to adding or removing a strip of squares when a small position change occurs).

Links to the Demo and Encoder Code

Chetan’s demo code (which is really comprehensive) is at The Geospatial Encoder code is at and the Coordinate Encoder is at

  • Jun 07 / 2014
  • 0
Real Life!

Ham Boiled and Roasted in Guinness, with Belfast Champ

This has to be one of my favourite dishes. It’s absolutely Irish, but has some twists which come from learning how to cook with Italians, Spaniards and Mexicans. It’s a bit of work (some in prep, mostly in the care you take checking up on your food as it cooks), but it results in a sublime thing which your dinner guests will never forget you introduced them to. Just remember to say you got the recipe from me (I got the inspiration for this from the Guinness site, which has a very decent recipe for the boiled version of this one).


Serves 6

For boiling:

  • 1.5Kg (3lbs) Prime Ham Fillet (get it from your butcher, ask for it on the bone if possible), skin on.
  • Some pork or bacon ribs or other pork bones (butcher will often give you these for free from their bin)
  • Half-dozen cloves
  • 3 Spanish (or any large) Onions, finely chopped
  • 1 Red Onion (optional)
  • 1 Red Bell Pepper, finely chopped
  • 3 medium sized carrots
  • 2 Spring Onions (Scallions)
  • 1 or 2 Bay (or Laurel) leaves
  • A sprig of fresh (or a 1tsp dried) Parsley
  • 1 can of Draught Guinness
  • 3 peeled tomatoed, chopped, or 1 can of chopped tomatoes
  • Grapeseed, Sunflower or Olive Oil
  • Salt and Pepper
  • A pinch of Spanish Pimenton (Paprika) optional

Equipment: 1 Large Saucepan or Pressure Cooker, 1 Sharp Chef’s Knife, Veg cutting board or food processor.

For Roasting:

  • 2-3 tsps of the best honey you can find (or Demerera or brown sugar)
  • More carrots, parsnips, root veg to your taste, roughly chopped.
  • (Optional, I don’t) another can of Guinness

Preheat Oven to 180C (400F).
Equipment: 1 Deep Baking Dish

To Serve with Belfast Champ:

  • 12 Medium “Floury” Potatoes ( a small bag)
  • A bunch of Spring Onions (Scallions)
  • 1/2 a bunch of Thyme (or 3 tsp dried)
  • 50g (2oz) Irish Butter
  • 125ml (1/4 pint) fresh whole milk
  • 1 organic free range egg

Equipment: Potato Ricer (makes the spuds delicious) or masher.

(Optional, not the Irish way!) a leaf salad made with lemon, olive oil and Rocket leaves.


In a large saucepan, sweat the onions, scallions, carrot and pepper for 5-10 mins over a gentle heat until the onion is translucent. Add a few pinches of salt and pepper. Remove from heat and place in a bowl.

Wash all the meat under cold running water before use. Score the ham skin and stick the cloves through the slits into the meat. Put a little oil in the saucepan and place the ham in the centre. Slip the pork bones between the ham and the sides of the saucepan. Add the can of Guinness, pouring over the meat. Add back the onion-carrot mix, and the tomatoes, parsley and pimenton. Slide a couple of bay leaves down the sides of the pan and cover. When it boils, lower the heat and gently simmer for as long as possible, but at least 90 minutes. Every 10-15 mins, return to the pan to ensure it’s gently simmering and taste the evolving flavours.

30 minutes before the ham is done, put on the oven at 190C/375F. Place the baking dish in the oven to heat up as well.

You’ve now cooked the ham perfectly well, so if you wish to give up now, work away and you could serve it as boiled ham and rule (prepare the Belfast Champ as below). But trust me, you really want to roast it very slowly to bring out the great flavours we’ve just infused into the meat. So, take the baking dish out of the oven and smear some oil over it before placing your ham in the centre. Pour the sauce from the saucepan over the meat and put the pork bones/ribs around the ham in the dish. Add any more carrots, parsnips, or other root veg as you like, scattered around the ham. At this point some people would pour another pint of Guinness over the meat, but I’d prefer to drink it! Now, take a couple of teaspoons of the best honey you can find (or brown sugar if you can’t) and smear the top of the ham with it. Sprinkle with freshly picked thyme and place in the oven at the high heat to crisp the fat for 15 minutes. Reduce the oven temperature to 150C/300F and allow to slow-roast for at least another 45 mins. Every 15-30 mins, take it out of the oven and baste the juices over the ham.

15 minutes before the ham is ready, have your potatoes peeled and chunk them into 2cm or 3/4in cubes. Boil gently in a large saucepan for 12-15 minutes (you can tell they’re done when you can stick a fork straight into them without resistance). Drain the spuds in a colander and place in a bowl. In the potato saucepan, heat up the milk and butter, add the spring onions and thyme, and gently bring to the boil. Reduce the heat, simmer for a couple of minutes to poach the onions, and add the butter and the raw egg. Whisk until smooth, remove from heat and add the potato using a potato ricer (or else gently mash the potatoes in the bowl using a potato masher or fork). Fold everything together until it has a glossy appearance. If you have too much potato (if it is too dry) add a little more milk at a time until it’s smooth. Keep warm but off the hob (it’ll burn).

When the ham is done. remove from the oven and allow to rest for 10-15 minutes (this distributes the juices in the ham throughout the meat and is essential for a good result). By all means baste the ham one last time using the juices from the dish. If you like thick gravy, you should remove the veg and bones from the dish and place it over a gentle heat on the hob and scrape it to release the burned-in flavours while condensing it. I personally prefer a lighter gravy so I don’t reduce the sauce.

Once rested, carve the ham and plate up a few slices of ham, lots of the champ, the roast veg, and salad if that’s your thing. Pour liberal quantities of gravy over both meat and spuds. Enjoy!


  • May 09 / 2014
  • 0
Cortical Learning Algorithm, General Interest

Proposed Mechanism for Layer 4 Sensorimotor Prediction

Jeff Hawkins has recently talked about a sensorimotor extension for his Cortical Learning Algorithm (CLA). This extension involves Layer 4 cells learning to predict near-future sensorimotor inputs based on the current sensory input and a copy of a related motor instruction. This article briefly describes an idea which can explain both the mechanism, and several useful properties, of this phenomenon. It is both a philosophical and a neuroscientific idea, which serves to explain our experience of cognition, and simultaneously explains an aspect of the functioning of the cortex.

In essence, Jeff’s new idea is based on the observation that Layer 4 cells in a region receive information about a part of the current sensory (afferent, feedforward) inputs to the region, along with a copy of related motor command activity. The idea is that Layer 4 combines these to form a prediction of the next set of sensory inputs, having previously learned the temporal coincidence of the sensory transition and the effect of executing the motor command.

One easily visualised example is that a face recognising region, currently perceiving a right eye, can learn to predict seeing a left eye when a saccade to the right is the motor command, and/or a nose when a saccade to the lower right is made, etc. Jeff proposes that this is used to form a stable representation of the face in Layer 3, which is receiving the output of these Layer 4 cells.

The current article claims that the “motor command” represents either a real motor command to be executed, which will cause the predicted change in sensory input, or else the analogous “change in the world” which would have the same transitional sensory effect. The latter would represent, in the above example, the person whose face is seen, moving her own head in the opposite direction, and presenting an eye or nose to the observer while the observer is passive.

In the case of speech recognition, the listener uses her memory of how to make the next sound to predict which sounds the speaker is likely to make next. At the same time, the speaker is using his memory of the sound he expects to make to perform fine control over his motor behaviour.

Another example is the experience of sitting on a stationary train when another train begins to move out of the station. The stationary observer often gets the feeling that she is in fact moving and that the other train is not (and a person in the other train may have the opposite perception – that he is stationary and the first person’s train is the one which is moving).

The colloquial term for this idea is the notion of a “mirror cell”. This article claims that so-called “mirror cells” are pervasive at all levels of cortex and serve to explain exactly why every region of cortex produces “motor commands” in the processing of what is usually considered pure sensory information.

In this way, the cortex is creating a truly integrated sensorimotor model, which not only contains and explains the temporal structure of the world, but also stores and provides the “means of construction” of that temporal structure in terms of how it can be generated (either by the action of the observer interacting with the world, or by the passive observation of the external action of some cause in the world).

This idea also provides an explanation for the learning power of the cortex. In learning to perceive the world, we need to provide – literally – a “motivation” for every observed event in the world, as either the result of our action or by the occurrence of a precisely mirrored action caused externally. At a higher cognitive level, this explains why the best way to learn anything is to “do it yourself” – whether it’s learning a language or proving a theorem. Only when we have constructed both an active and a passive sensorimotor model of something do we possess true understanding of it.

Finally, this idea explains why some notions are hard to “get” at times – this model requires a listener or learner not just to imagine the sensory perception or cognitive “snapshot” of an idea, but the events or actions which are involved in its construction or establishment in the world.

  • May 01 / 2014
  • 0
CLA Layer
Clortex (HTM in Clojure)

Clortex Pre-Alpha Now Public

This is one of a series of posts on my experiences developing Clortex in Clojure, a new dialect of LISP which runs on the Java Virtual Machine. Clortex is a re-implementation of Numenta’s NuPIC, based on Jeff Hawkins’ theories of computational neuroscience. You can read my in-progress book by clicking on the links to the right.

Until today, I’ve been developing Clortex using a private repo on Github. While far from complete, I feel that Clortex is now at the stage where people can take a look at it, give feedback on the design, and help shape the completion of the first alpha release over the coming weeks.

I’ll be hacking on Clortex this weekend (May 3rd-4th) at the NuPIC Spring Hackathon in San José, please join us on the live feeds and stay in touch using the various Social Media tools.

WARNING: Clortex is not even at the alpha stage yet. I’ll post instructions over the next few days which will allow you to get some visualisations running.

You can find Clortex on Github at

A new kind of computing requires a new kind of software design.
Hierarchical Temporal Memory (HTM) and the Cortical Learning Algorithm (CLA) represent a new kind of computing, in which many, many millions of tiny, simple, unreliable components interact in a massively parallel, emergent choreography to produce what we would recognise as intelligence.
Jeff Hawkins and his company, Numenta, have built a system called NuPIC using the principles of the neocortex. Clortex is a reimagining of CLA, using modern software design ideas to unleash the potential of the theory.
Clortex’ design is all about turning constraints into synergies, using the expressive power and hygiene of Clojure and its immutable data structures, the unique characteristics of the Datomic database system, and the scaleability and portability characteristics of the Java Virtual Machine. Clortex will run on hosts as small as Raspberry Pi, a version will soon run in browsers and phones, yet it will scale layers and hierarchies across huge clusters to deliver real power and test the limits of HTM and CLA in production use.
How can you get involved?
Clortex is just part of a growing effort to realise the potential of Machine Intelligence based on the principles of the brain.
  • Visit the site for videos, white papers, details of the NuPIC mailing list, wikis, etc.
  • Have a look at (and optionally pre-purchase) my book: Real Machine Intelligence with Clortex and NuPIC.
  • Join the Clortex Google Group for discussion and updates.
  • We’ll be launching an Indiegogo campaign during May 2014 to fund completion of Clortex, please let us know if you’re interested in supporting us when we launch.
  • Apr 29 / 2014
  • 0
Clortex (HTM in Clojure), General Interest

Clortex: The Big Ideas

As you might know, I’ve been working away on “yet another implementation of Hierarchical Temporal Memory/Cortical Learning Algorithm” for the last few months. While that would have been nice as a hobby project, I see what I’m doing as something more than that. Working with NuPIC over the last year, I’ve gradually come to a realisation which seems kind of obvious, but remains uncomfortable:

A new kind of computing requires a new kind of software design.

HTM and CLA represent a new kind of computing, in which many, many millions of tiny, simple, unreliable components interact in a massively parallel, emergent choreography to produce what we would recognise as intelligence. This sounds like just another Neural Net theory, which is truly unfortunate, because that comparison has several consequences.

CLA – Not Another Neural Network

The first consequence relates to human comprehension. Most people who are interested in (or work in) Machine Learning and AI are familiar with the various flavours of Neural Nets which have been developed and studied over the decades. We now know that a big enough NN is the equivalent of a Turing Universal Computer, and we have a lot of what I would call “cultural knowledge” about NNs and how they work. Importantly, we have developed a set of mathematical and computational tools to work with NNs and build them inside traditional computing architectures.

This foreknowledge is perhaps the biggest obstacle facing HTM-CLA right now. It is very difficult to shake off the idea that what you’re looking at in CLA is some new twist on a familiar theme. There are neurons, with synapses connecting them; the synapses have settings which seem to resemble “weights” in NNs; the neurons combine their inputs and produce an output, and so on. The problem is not with CLA: it’s that the NN people got to use these names first and have hijacked (and overwritten) their original neuroscientific meanings.

CLA differs from NNs at every single level of granularity. And the differences are not subtle; they are fundamentally different operating concepts. Furthermore, the differences compound to create a whole idea which paradoxically continues to resemble a Neural Net but which becomes increasingly different in what it does as you scale out to layers, regions and hierarchies.

It’s best to start with a simplified idea of what a traditional NN neuron is. Essentially, it’s a pure function of its inputs. Each neuron has a local set of weights which are dot-producted with the inputs, run through some kind of “threshold filter” and produce a single scalar output. A NN is a graph of these neurons, usually organised into a layered structure with a notion of “feedforward” and “feedback” directions. Some NNs have horizontal and cyclical connections, giving rise to some “memory” or “temporal” features; such NNs are known as Recurrent Neural Networks.

In order to produce a useful NN, you must train it. This involves using a training set of input data and a learning algorithm which adjusts the weights so as to approach a situation where the collective effect of all the pure functions is the desired transformation from inputs to outputs.

In sharp contrast, CLA models each neuron as a state machines. The “output” of a neuron is not a pure function of its feedforward inputs, but incorporates the past history of its own and other neurons’ activity.  A neuron in CLA effectively combines a learned probability distribution of its feedforward inputs (as in a NN) with another learned probability distribution in the temporal domain (ie transitions between input states). Further, instead of being independent (as in NNs), the “outputs” of a set of neurons are dictated using an online competitive process. Finally, the output of a layer of CLA neurons is a Sparse Distributed Representation, which can only be interpreted as a whole, a collective, or a holographic encoding what the layer is representing.

These fundamental differences, along with the unfortunate duplicate use of names for the components, mean that your grandfather’s tools and techniques for reasoning about, working with, and mathematically modelling NNs do not apply at all to CLA.

In fact, Jeff’s work has been criticised because he’s allegedly “thrown away” the key ability to mathematically demonstrate some properties of NNs, which the consensus considers necessary if you want your theory to be admitted as valid (or your paper published). This view would have it that CLA sacrifices validity for a questionable (and in some opinions, vacuous) gain.

Jeff gave a talk a few weeks back to the London Deep Learning Meetup – it’s perhaps the best single overview of the current state of the art for CLA:

The second consequence relates to implementation in software. Numenta began work on what is now NuPIC in 2005, and most of the current theory – itself still in its early stages in 2014 – has only appeared in stages over the intervening years. Each innovation in the theory has had to be translated into and incorporated into a pre-existing system, with periodic redesigns of this or that component as understanding developed. It is unfair to expect the resulting software artefacts to magically transmute, Doctor Who-like, in a sequence of fully-formed designs, each perfectly appropriate for a new generation of the theory.

The fact that NuPIC is a highly functional, production-ready, and reasonably faithful implementation of much of CLA is a testament to the people at Numenta and their dedication to bring Jeff’s theories into an engineered reality. The details of how NuPIC works, and how it can be used, are a consequence of its history, co-evolving with the theory, and the software design and development techniques which were available over that history.

Everything involves tradeoffs, and NuPIC is no exception. I have huge respect for the decisions which have led to the NuPIC of 2014, and I would like to view Clortex as nothing other than NuPIC, metamorphosed for a new phase of Machine Intelligence based on HTM and CLA, with a different set of tradeoffs and the chance to stretch the boundaries yet again.

So, rather than harp on what might be limiting or difficult with NuPIC, I’ll now describe some of the key improvements which are possible when a “new kind of software” is created for HTM and CLA.

Architectural Simplicity and Antifragile Software – Russ Miles

It’s odd, but Clortex’ journey began when I followed a link to a talk Jeff gave last year [free registration required] at GOTO Aarhus 2013, and decided to watch one, then two, and finally all three talks given by Russ Miles at the same event. If you’re only able to watch one, the one to watch is Architectural Simplicity through Events. In that talk, Russ outlines his main Axioms for building adaptable software:

1. Your Software’s First Role is to be Useful

Clearly, NuPIC is already useful, but there is a huge opportunity for Clortex to be useful in several new ways:

a) As a Teaching Tool to help understand the CLA and its power. HTM and CLA are difficult to understand at a deep level, and they’re very different from traditional Neural Networks in every way. A new design is needed to transparently communicate an intuitive view of CLA to layman, machine learning expert, and neuroscientist alike. The resulting understanding should be as clear to an intelligent and interested viewer as it is to Jeff himself.

b) As a Research and Development platform for Machine Intelligence. Jeff has recently added – literally – a whole set of layers to his theory, involving a new kind of temporal pooling, sensorimotor modelling, multilayer regions, behaviour, subcortical connections, and hierarchy. This is all being done with thought experiments, whiteboards, pen and paper, and slides. We’ll see this in software sometime, no doubt, but that process has only begun. A new system which allows many of these ideas to be directly expressed in software and tested in real time will accelerate the development of the theory and allow many more people to work on it.

(Here’s Jeff talking about this in detail recently:)

c) As a Production Platform for new Use Cases. NuPIC is somewhat optimised for a certain class of use cases – producing predictions and detecting anomalies in streaming machine-generated numerical data. It’s also been able to demonstrate capabilities in other areas, but there is a huge opportunity for a new design to allow entirely new types of information to be handled by HTM and CLA techniques. These include vision, natural language, robotics and many other areas to which traditional AI and ML techniques have been applied with mixed results. A new design, which emphasises adaptability, flexibility, scaleability and composability, will allow CLA to be deployed at whatever scale (in terms of hierarchy, region size, input space etc as well as machine resources) is appropriate to the task.

2. The best software is that which is not needed at all

Well, we have our brains, and the whole point of this is to build software which uses the principles of the brain. On the other hand, we can minimise over-production by only building the components we need, once we understand how they work and how they contribute to the overall design. Clortex embraces this using a design centred around immutable data structures, surrounded by a growing set of transforming functions which work on that data.

3. Human Comprehension is King

This axiom is really important for every software project, but so much more so when the thing you’re modelling is so difficult to understand for many. The key with applying this axiom is to recognise that the machine is only the second most important audience for your code – the most important being other humans who will interact with your code as developers, researchers and users. Clortex has as its #1 requirement the need to directly map the domain – Jeff’s theory of the neocortex – and to maintain that mapping at all costs. This alone would justify building Clortex for me.

4. Machine Sympathy is Queen

This would seem to contradict Axiom 3, but the use of the word “Queen” is key. Any usable system must also address the machine environment in which it must run, and machine sympathy is how you do that. Clortex’ design is all about turning constraints into synergies, using the expressive power and hygiene of Clojure and its immutable data structures, the unique characteristics of the Datomic database system, and the scaleability and portability characteristics of the Java Virtual Machine. Clortex will run on Raspberry Pi, a version of will run in browsers and phones, yet it will scale layers and hierarchies across huge clusters to deliver real power and test the limits of HTM and CLA in production use.

5. Software is a Process of R&D

This is obviously the case when you’re building software based on an evolving theory of how the brain does it. Russ’ key point here is that our work always involves unknowns, and our software and processes must be designed in such a way as not to slow us down in our R&D work. Clortex is designed as a set of loosely coupled, interchangeable components around a group of core data structures, and communicating using simple, immutable data.

6. Software Development is an Extremely Challenging Intellectual Pursuit

Again, this is so true in this case, but the huge payoff you can derive if you can come up with a design which matches the potential of the CLA is hard to beat. I hope that Clortex can meet this most extreme of challenges.

Stay tuned for Pt II – The Clortex System Design..


  • Apr 11 / 2014
  • 0

Doc-driven Development Using lein-midje-doc

This is one of a series of posts on my experiences developing Clortex in Clojure, a new dialect of LISP which runs on the Java Virtual Machine. Clortex is a re-implementation of Numenta’s NuPIC, based on Jeff Hawkins’ theories of computational neuroscience. You can read my in-progress book by clicking on the links to the right. Clortex will become public this month.

One of the great things about Clojure is the vigour and quality of work being done to create the very best tools and libraries for developing in Clojure and harnessing its expressive power. Chris Zheng‘s lein-midje-doc is an excellent example. As its name suggests, it’s uses the comprehensive Midje testing library, but in a literate programming style which produces documentation or tutorials.

Doc-driven Development

Before we get to DDD, let’s review its antecedent, TDD.

Test-driven Development

Test-driven Development (TDD) has become practically a tradition, arising from the Agile development movement. In TDD, you develop your code based on creating small tests first (these specify what your code will have to do); the tests fail at first because you haven’t yet written code to make them pass. You then write some code which makes a test pass, and proceed until all tests pass. Keep writing new tests, failing and coding to pass, until the tests encompass the full spec for your new feature or functionality.

For example, to write some code which finds the sum of two numbers, you might first do the following:

(fact "adding two numbers returns their sum" ; "fact" is a Midje term for a property under test
    (my-sum 7 5) => 12 ; "=>" says the form on the left should return or match the form on the right

This will fail, firstly because there is no function my-sum. To fix this, write the following:

(defn my-sum [a b]

Note that this is the correct way to pass the test (it’s minimal). Midje will go green (all tests pass). Now we need to force the code to generalise:

(fact "adding two numbers returns their sum" 
    (my-sum 7 5) => 12 
    (my-sum 14 10) => 24

Which makes us actually write the code in place of 12:

(defn my-sum [a b]
  (+ a b)

The great advantage of TDD is that you don’t ever write code unless it is to pass a test you’ve created. As Rich Hickey says: “More code means… more bugs!” so you should strive to write the minimum code which solves your problem, as specified in your tests. The disadvantage of TDD is that it shifts the work into designing a series of tests which (you hope) defines your problem well. This is better than designing by coding, but another level seems to be required. Enter literate programming.

Literate Programming

This style of development was invented by the famous Donald Knuth back in the 70’s. Knuth’s argument is that software is a form of communication. The first is obvious: a human (the developer) is communicating instructions to the machine in the form of source code. The second is less obvious but perhaps more important: you are communicating your requirements, intentions and design decisions to other humans (including your future self). Knuth designed and built a system for literate programming, and this forms the basis for all similar systems today.

This post is an example of literate programming (although how ‘literate’ it is is left to the reader to decide), in that I am forming a narrative to explain a concrete software idea, using text, and interspersing it with code examples which could be executed by a system processing the document.

Doc-driven Development

DDD is essentially a combination of TDD and Literate Programming. Essentially, you write a document about some part of your software, a narrative describing what it should do and examples of how it should work. The examples are framed as “given this, the following should be expected to happen”, which you write as `facts` (a type of test which is easier for humans to read). The DDD system runs the examples and checks the return values to see if they match the expectations in your document.

This big advantage of this is that your documentation is a much higher-level product than a list of unit tests, in that your text provides the reader (including your future self returning to the code) with much more than a close inspection of test code would yield. In addition, your sample code and docs are guaranteed to stay in synch with your code, because they actually run your code every time it changes.


lein-midje-doc was developed by Chris Zheng as a plugin for the Clojure Leiningen project and build tool. It leverages Midje to convert documents written in the Literate Programming style into suites of tests which can verify the code described.

It’s simple to set up. You have to add dependencies to your project.clj file, and then you add an entry for each document you wish to include (instructions are in the, full docs on Chris’ site). then you use two shells to run things. In one, you run lein midje-doc, which repeatedly creates the readable documents as HTML files from your source files, and in the other you run midje.repl‘s (autotest) function to actually spin through your tests.

Here’s Chris demonstrating lein-midje-doc:

  • Mar 31 / 2014
  • 0
Clortex (HTM in Clojure), General Interest, NuPIC

Real Machine Intelligence now Available on

The first three chapters of my new book, Real Machine Intelligence with Clortex and NuPIC has just gone live on Lean Publishing is a new take on a very old idea – serial publishing – which goes back to Charles Dickens in the 19th century. The idea is to evolve the book based on reader feedback, and to give people a chance to read the whole book before committing to buy.

I’ve also set up a Google Group and Facebook Page. Prior to going live with Clortex, you can read a snapshot of the documentation on its page.

Read online or buy the in-progress book here.