Better Living through Thoughtful Technology

## Posts Categorized / NuPIC

• Jan 02 / 2015

## Self-Stabilisation in Hierarchical Temporal Memory

This post was written in response to Jeff Hawkins’ comments on last week’s article on a new Multilayer Model of Neocortex in Hierarchical Temporal Memory (HTM). Jeff expressed concerns about the clarity or correctness of my claim that sublayers in a cortical region act to self-stabilise in the face of unpredicted changes in the world (including changes in top-down feedback from higher regions). This discussion is a companion to an earlier description of the Efficiency of Predicted Sparseness, but goes into much more detail when describing how a non-sparse output from one sublayer is absorbed and processed by downstream sublayers.

In the earlier posts, we described how each sublayer in a region combines context inputs with feedforward inputs to form a sparse, predicted representation of the world in context. When this succeeds perfectly, each column in the sublayer has only a single active cell, and that cell represents the best combination of prediction from context and recognition of the feedforward input. The single-cell-per-column representation occurs when the single cell is sufficiently depolarised by distal (predictive/context) inputs to beat its columnar inhibitory sheath and fire first. If this does not happen, then the sheath fires first, allowing some number of contained pyramidal cells to fire before vertical inhibition reduces the column’s activity to just the one, best-predicted cell.

In order to understand the stabilising effect, we need to zoom in temporally and watch how the potentials evolve in extreme “slow-motion” in which the time steps correspond to individual synaptic events. At this framerate, we can observe the individual neurons’ potentials rising towards firing and the effect of inhibition both vertically and horizontally on the patterns of activation. This level of granularity also allows us to characterise the opportunities for synapses to adapt, which turns out to be crucial for understanding the model.

Synapses grow when there is a temporal correlation between their pre-synaptic inputs and the action potentials of the post-synaptic cell. The more often the cell fires within a short (c. 10ms) window of time after the synapse receives an action potential, the bigger and more receptive the synapse grows. In HTM, we model this with a scalar value we call permanence, which varies between 0.0 and 1.0, and we say that the synapse is connected when its permanence is above a threshold (usually 0.2), otherwise it is disconnected.

The current “official” Cortical Learning Algorithm (or CLA, the detailed computational model in HTM) separates feedforward and predictive stages of processing. A modification of this model (which I call prediction-assisted recognition or paCLA) combines these into a single step involving competition between highly predictive pyramidal cells and their surrounding columnar inhibitory sheaths. Though this has been described in summary form before, I’ll go through it in detail here.

Neural network models generally model a neuron as somehow “combining” a set of inputs to produce an output. This is based on the idea that input signals cause ion currents to flow into the neuron’s cell body, which raises its voltage (depolarises), until it reaches a threshold level and fires (outputs a signal). paCLA also models this idea, with the added complication that there are two separate pathways (proximal and distal) for input signals to be converted into effects on the voltage of the cell. In addition, paCLA treats the effect of the inputs as a rate of change of potential, rather than as a final potential level as found in standard CLA.

## Slow-motion Timeline of paCLA

[Note: this section relates to Mathematics of HTM Part I and Part II – see those posts for a full treatment.]

Consider a single column of pyramidal cells in a layer of cortex. Along with the set of pyramidal cells $$\{P_1,P_2 .. P_n\}$$, we also model a columnar sheath of inhibitory cells as a single cell $$I$$. All the $$P_i$$ and $$I$$ are provided with the same feedforward input vector $$\mathbf{x}_t$$, and they also have similar (but not necessarily identical) synaptic connection vectors $$\mathbf{c}_{P_i}$$ and $$\mathbf{c}_{I}$$ to those inputs (the bits of $$\mathbf{x}_t$$ are the incoming sensory activation potentials, while bit $$j$$ of a connection vector $$\mathbf{c}$$ is 1 if synapse $$j$$ is connected). The feedforward overlap $$o^{\textrm{ff}}_{P_i}(\mathbf{x}_t) = \mathbf{x}_t \cdot \mathbf{c}_{P_i}$$ is the output of the proximal dendrite of cell $${P_i}$$ (and similarly for cell $$I$$).

In addition, each pyramidal cell (but not the inhibitory sheath) receives signals on its distal dendrites. Each dendrite segment acts separately on its own inputs $$\mathbf{y}_k^{t-1}$$, which come from other neurons in the same layer as well as other sublayers in the region (and from other regions in some cases). When a dendrite segment $$k$$ has a sufficient distal overlap, exceeding a threshold $$\lambda_k$$, the segment emits a dendritic spike of size $$s_k$$. The output of the distal dendrites is then given by:

$$o^{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$

The predictive potential is combined with the overlap score from the feedforward overlap coming from the proximal dendrite to give the total depolarisation rate:

$$d_j = \frac{\partial V_j}{\partial t} = \alpha_j o^{\textrm{ff}}_{P_j} + \beta_j o^{\textrm{pred}}_{P_j}$$

where $$\alpha_j$$ and $$\beta_j$$ are parameters which transform the proximal and distal contributions into a rate of change of potential (and also control the relative effects of feedforward and predictive inputs). For the inhibitory sheath $$I$$, there is only the feedforward component $$\alpha_I o^{\textrm{ff}}_I$$, but we assume this is larger than any of the feedforward contributions $$\alpha_j o^{\textrm{ff}}_{P_j}$$ for the pyramidal cells [cite evidence].

Now, the time a neuron takes to reach firing threshold is inversely proportional to its depolarisation rate. This imposes an ordering of the set $$\{P_1..P_n,I\}$$ according to their (prospective) firing times $$\tau_{P_j} = \gamma_P \frac{1}{d_j}$$ (and $$\tau_I = \gamma_I \frac{1}{d_I}$$).
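As a rough illustration of the ordering described above, the following sketch computes depolarisation rates and prospective firing times for one column. The function names, the overlap values, and the choice of $$\alpha_I = 1.5$$ are assumptions made for the example; they are not taken from any HTM implementation.

```python
from typing import List, Tuple


def depolarisation_rate(o_ff: float, o_pred: float,
                        alpha: float, beta: float) -> float:
    """d = alpha * o_ff + beta * o_pred (a rate of change of potential)."""
    return alpha * o_ff + beta * o_pred


def firing_time(d: float, gamma: float) -> float:
    """tau = gamma / d; a cell with no drive never reaches threshold."""
    return gamma / d if d > 0 else float("inf")


def firing_order(cells: List[Tuple[str, float, float]], sheath_o_ff: float,
                 alpha: float = 1.0, beta: float = 1.0, alpha_i: float = 1.5,
                 gamma_p: float = 1.0, gamma_i: float = 1.0) -> List[str]:
    """Order the pyramidal cells and the sheath 'I' by prospective firing time.

    cells holds (name, o_ff, o_pred) triples. The sheath only has a
    feedforward term, scaled by the (assumed larger) factor alpha_i.
    """
    times = [(firing_time(depolarisation_rate(o_ff, o_pred, alpha, beta),
                          gamma_p), name)
             for name, o_ff, o_pred in cells]
    times.append((firing_time(alpha_i * sheath_o_ff, gamma_i), "I"))
    return [name for _, name in sorted(times)]


# Hypothetical column: P1 is strongly predicted, P2 not at all, P3 weakly.
order = firing_order([("P1", 10, 12), ("P2", 10, 0), ("P3", 9, 3)],
                     sheath_o_ff=10)
print(order)  # ['P1', 'I', 'P3', 'P2']
```

Here only P1 has enough predictive input to fire before the sheath; P3 and P2 would fire later if inhibition did not intervene.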

## Formation of the Sparse Distributed Representation (SDR)

Zooming out from the single column to a neighbourhood (or sublayer) $$L_1$$ of columns $$C_m$$, we see that there is a local sequence $$\mathbb{S}$$ in which all the pyramidal cells (and the inhibitory sheaths) would fire if inhibition didn’t take place. The actual sequence of cells which do fire can now be established by taking into account the effects of inhibition.

Let’s partition the sequence as follows:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

where:

1. $$\mathbb{P}^{\textrm{pred}}$$ is the (possibly empty) sequence of pyramidal cells in a highly predictive state, which fire before their inhibitory sheaths (ie $$\mathbb{P}^{\textrm{pred}} = \{P~|~\tau_P < \tau_{I_m}, P \in C_m\}$$);
2. $$\mathbb{I}^{\textrm{pred}}$$ is the sequence of inhibitory sheaths which fire due to triggering by their contained predictively firing neurons in $$\mathbb{P}^{\textrm{pred}}$$ – these cells fire in advance of their feedforward times due to inputs from $$\mathbb{P}^{\textrm{pred}}$$;
3. $$\mathbb{I}^{\textrm{ff}}$$ is the sequence of inhibitory sheaths which fire as a result of feedforward input alone;
4. $$\mathbb{P}^{\textrm{burst}}$$ is the sequence of cells in columns where the inhibitory sheaths have just fired but their vertical inhibition has not had a chance to reach these cells (this is known as bursting) – ie $$\mathbb{P}^{\textrm{burst}} =\{P~|~\tau_P < \tau_{I_m} + \Delta\tau_{\textrm{vert}}, P \in C_m\}$$;
5. Finally, $$\mathbb{I}^{\textrm{spread}}$$ is the sequence of all the other inhibitory sheaths which are triggered by earlier-firing neighbours, which spreads a wave of inhibition imposing sparsity in the neighbourhood.

Note that there may be some overlap in these sequences, depending on the exact sequence of firing and the distances between active columns.

The output of a sublayer is the SDR composed of the pyramidal cells from $$\mathbb{P}^{\textrm{pred}} \parallel \mathbb{P}^{\textrm{burst}}$$ in that order. We say that the sublayer has predicted perfectly if $$\mathbb{P}^{\textrm{burst}} = \emptyset$$ and that the sublayer is bursting otherwise.
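The partition above can be sketched for a single column. This is a simplified reading, assuming that lateral inhibition from neighbouring columns is ignored and that, once any cell beats the sheath, the early-triggered sheath suppresses all remaining cells in the column. The function name and the timing values are hypothetical.

```python
from typing import Dict, List, Tuple


def column_output(tau_cells: Dict[str, float], tau_sheath: float,
                  delta_vert: float) -> Tuple[List[str], List[str]]:
    """Return (predicted, bursting) cells of one column, in firing order.

    predicted: cells with tau_P < tau_I (they beat the sheath).
    bursting:  if nothing was predicted, the cells with
               tau_P < tau_I + delta_vert, which fire before vertical
               inhibition reaches them.
    """
    by_time = sorted(tau_cells, key=lambda c: tau_cells[c])
    predicted = [c for c in by_time if tau_cells[c] < tau_sheath]
    if predicted:
        # Assumption: the sheath is triggered early and silences the rest.
        return predicted, []
    bursting = [c for c in by_time if tau_cells[c] < tau_sheath + delta_vert]
    return [], bursting


# A well-predicted column: P1 fires before the sheath.
print(column_output({"P1": 0.04, "P2": 0.10, "P3": 0.08}, 0.0667, 0.02))
# A poorly predicted column: the sheath wins, P1 and P3 burst.
print(column_output({"P1": 0.07, "P2": 0.10, "P3": 0.08}, 0.0667, 0.02))
```

The first call yields a single predicted cell and no bursting; the second yields no predicted cell and two bursting cells, ordered by how close they came to beating the sheath.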

The cardinality of the SDR is minimal under perfect prediction, with some columns having a sequence of extra, bursting cells otherwise. The bursting columns represent feedforward inputs which were well recognised (causing their inhibitory sheaths to fire quickly) but less well predicted (no cell was predictive enough to beat the sheath), and the number of cells firing indicates the uncertainty of which prediction corresponds to reality. The actual cells which get to burst are representative of the most plausible contexts for the unexpected input.

## Transmission and Reception of SDRs

A sublayer $$L_2$$ which receives this $$L_1$$ SDR as input will first see the minimal SDR $$\mathbb{P}^{\textrm{pred}}$$ representing the perfect match of input and prediction, followed by the bursting SDR elements $$\mathbb{P}^{\textrm{burst}}$$ in decreasing order of prediction-reality match.

This favours cells in $$L_2$$ which have learned to respond to this SDR, and even more so for the subset which are also predictive due to their own contextual inputs (this biasing happens regardless of whether the receiving cells are proximally or distally innervated). The more sparse (well-predicted) the incoming SDR, the more sparse the activation of $$L_2$$.

When there is a bursting component in the SDR, this will tend to add significant (or overwhelming) extra signal to the minimal SDR, leading to high probability of a change in the SDR formed by $$L_2$$, because several cells in $$L_2$$ will have a stronger feedforward response to the extra inputs than those which respond to the small number of signals in the minimal SDR.

For example, in software we typically use layers containing 2,048 columns of 32 pyramidal neurons (64K cells), with a minimal column SDR of 40 columns (c. 2%). At perfect prediction, the SDR has 40 cells (0.06%), while total bursting would create an SDR of 1280 cells. In between, the effect is quite uneven, since each bursting column produces several signals, while all non-bursting columns stay at one. Assuming some locality of the mapping between $$L_1$$ and $$L_2$$, this will have dramatic local effects where there is bursting.
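The arithmetic behind these figures can be checked with a few lines. The helper `sdr_size` and its `cells_per_burst` parameter are illustrative assumptions: the text does not say how many cells fire in a partially bursting column, so the value 8 below is only an example.

```python
columns = 2048
cells_per_column = 32
active_columns = 40

total_cells = columns * cells_per_column              # 65536 cells ("64K")
column_sparsity = active_columns / columns            # just under 2%
predicted_sparsity = active_columns / total_cells     # about 0.06%
full_burst_cells = active_columns * cells_per_column  # 1280 cells


def sdr_size(bursting_columns: int, cells_per_burst: int) -> int:
    """Cells in the SDR when some of the 40 active columns burst.

    Non-bursting columns contribute one cell each; each bursting column
    contributes cells_per_burst cells (a hypothetical, uniform count).
    """
    return (active_columns - bursting_columns) + bursting_columns * cells_per_burst


print(total_cells, full_burst_cells)   # 65536 1280
print(f"{column_sparsity:.2%} {predicted_sparsity:.3%}")
print(sdr_size(10, 8))                 # 30 single cells + 80 bursting = 110
```

Even with only a quarter of the active columns bursting, the bursting cells in this example outnumber the well-predicted ones by more than two to one, which is why the local effect on $$L_2$$ is so uneven.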

The response in $$L_2$$ to bursting in its input will not only be a change in the columnar representation, but may also cause bursting in $$L_2$$ itself if the new state was not well predicted using $$L_2$$’s context. This will cause bursting to propagate downstream, from sublayer to sublayer (including cycles in feedback loops), until some sublayer can stop the cascade either by predicting its input or by causing a change in its external world which indirectly restores predictability.

Since we typically do not see reverberating, self-reinforcing cycles of bursting in neocortex, we must assume that the brain has learned to halt these cascades using some combination of eventual predictive resolution and remediating output from regions. Note that each sublayer has its own version of “output” in this sense – it’s not just the obvious motor output of L5 which can “change the world”. For example, L6 can output a new SDR which it transmits down to lower regions, changing the high-level context imposed on those regions and thus the environment in which they are trying (and failing somewhat) to predict their own inputs. L6 can also respond by altering its influence over thalamic connections, thus mediating or eliminating the source of disturbance. L2/3 and L5 both send SDRs up to higher regions, which may be able to better handle their deviations from predictability. And of course L5 can cause real changes in the world by acting on motor circuits.

## How is Self-Stabilisation Learned?

When time is slowed down to the extent we’ve seen in this discussion, it is relatively easy to see how neurons can learn to contribute to self-stabilisation of sparse activation patterns in cortex. Recall the general principle of Hebbian learning in synapses – the more often a synapse receives an input within a short time before its cell fires, the more it grows to respond to that input.

Consider again the sequence of firing neurons in a sublayer:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

This sequence does not include the very many cells in a sublayer which do not fire at all, because they are contained either in columns which become active, but are not fast enough to burst, or more commonly they are in columns inhibited by a spreading wave from active columns. Let’s call this set $$\mathbb{P}^{\textrm{inactive}}$$.

A particular neuron will, at any moment, be a member of one of these sets. How often the cell fires depends on the average amount of time it spends in each set, and on the characteristic firing rate of cells in that set. Clearly, the highly predictive cells in $$\mathbb{P}^{\textrm{pred}}$$ will have a higher typical firing frequency than those in $$\mathbb{P}^{\textrm{burst}}$$, while those in $$\mathbb{P}^{\textrm{inactive}}$$ have zero frequency when in that set.

Note that the numbers used earlier (65536 cells, 40 cells active in perfect prediction, 1280 in total bursting) mean that the percentage of the time cells are firing on average is massively increased if they are in the predictive population. Bursting cells only fire once following a failure of prediction, with the most predictive of them effectively “winning” and firing if the same input persists.

Some cells will simply be “lucky enough” to find themselves in the most predictive set and will strengthen the synapses which will keep them there. Because of their much higher frequency of firing, these cells will be increasingly hard to dislodge and demote from the predictive state.

Some cells will spend much of their time only bursting. This unstable status will cause a bifurcation among this population. A portion of these cells will simply strengthen the right connections and join the ranks of the sparsely predictive cells (which will eliminate their column from bursting on the current inputs). Others will weaken the optimal connections in favour of some other combination of context and inputs (which will drop them from bursting to inactive on current inputs). The remainder, lacking the ability to improve to predictive and the attraction of an alternative set of inputs, will continue to form part of the short-lived bursting behaviour. In order to compete with inactive cells in the same column, these “metastable” cells will have to have an output which tends to feed back into the same state which led to them bursting in the first place.

Cells which get to fire (either predictively or by bursting) have a further advantage – they can specialise their sensitivity to feedforward inputs given the contexts which caused them to fire, and this will give them an ever-improving chance of beating the inhibitory sheath (which has no context to help it learn). This is another mechanism which will allow cells to graduate from bursting to predictive on a given set of inputs (and context).

Since only active cells have any effect in neocortex, we see that there is an emergent “drive” towards stability and sparsity in a sublayer. Cells, given the opportunity, will graduate up the ladder from inactive to bursting to predictive when presented with the right inputs. Cells which fail to improve will be overtaken by their neighbours in the same column, and demoted back down towards inactive. A cell which has recently started to burst (having been inactive on the same inputs) will be reinforced in that status if its firing gives rise to a transient change in the world which causes its inputs to recur. With enough repetition, a cell will graduate to predictive on its favoured inputs, and will participate in a sparse, stable predictive pattern of activity in the sublayer and its region. The effect of its output will correspondingly change from a transient “restorative” effect to a self-sustaining, self-reinforcing effect.

• Nov 29 / 2014

## Mathematics of HTM Part II – Transition Memory

This article is part of a series describing the mathematics of Hierarchical Temporal Memory (HTM), a theory of cortical information processing developed by Jeff Hawkins. In Part One, we saw how a layer of neurons learns to form a Sparse Distributed Representation (SDR) of an input pattern. In this section, we’ll describe the process of learning temporal sequences.

We showed in part one that the HTM model neuron learns to recognise subpatterns of feedforward input on its proximal dendrites. This is somewhat similar to the manner by which a Restricted Boltzmann Machine can learn to represent its input in an unsupervised learning process. One distinguishing feature of HTM is that the evolution of the world over time is a critical aspect of what, and how, the system learns. The premise for this is that objects and processes in the world persist over time, and may only display a portion of their structure at any given moment. By learning to model this evolving revelation of structure, the neocortex can more efficiently recognise and remember objects and concepts in the world.

## Distal Dendrites and Prediction

In addition to its one proximal dendrite, an HTM model neuron has a collection of distal (far) dendrites, which gather information from sources other than the feedforward inputs to the layer. In some layers of neocortex, these dendrites combine signals from neurons in the same layer as well as from other layers in the same region, and even receive indirect inputs from neurons in higher regions of cortex. We will describe the structure and function of each of these.

The simplest case involves distal dendrites which gather signals from neurons within the same layer.

In Part One, we showed that a layer of $$N$$ neurons converted an input vector $$\mathbf x \in \mathbb{B}^{n_{\textrm{ff}}}$$ into an SDR $$\mathbf{y}_{\textrm{SDR}} \in \mathbb{B}^{N}$$, with length $$\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N$$, where the sparsity $$s$$ is usually of the order of 2% ($$N$$ is typically 2048, so the SDR $$\mathbf{y}_{\textrm{SDR}}$$ will have about 40 active neurons).

The layer of HTM neurons can now be extended to treat its own activation pattern as a separate and complementary input for the next timestep. This is done using a collection of distal dendrite segments, which each receive as input the signals from other neurons in the layer itself. Unlike the proximal dendrite, which transmits signals directly to the neuron, each distal dendrite acts as an active coincidence detector, firing only when it receives enough signals to exceed its individual threshold.

We proceed with the analysis in a manner analogous to the earlier discussion. The input to the distal dendrite segment $$k$$ at time $$t$$ is a sample of the bit vector $$\mathbf{y}_{\textrm{SDR}}^{(t-1)}$$. We have $$n_{ds}$$ distal synapses per segment, a permanence vector $$\mathbf{p}_k \in [0,1]^{n_{ds}}$$ and a synapse threshold vector $$\vec{\theta}_k \in [0,1]^{n_{ds}}$$, where typically $$\theta_i = \theta = 0.2$$ for all synapses.

Following the process for proximal dendrites, we get the distal segment’s connection vector $$\mathbf{c}_k$$:

$$c_{k,i}=(1 + sgn(p_{k,i}-\theta_{k,i}))/2$$

The input for segment $$k$$ is the vector $$\mathbf{y}_k^{(t-1)} = \phi_k(\mathbf{y}_{\textrm{SDR}}^{(t-1)})$$ formed by the projection $$\phi_k:\lbrace{0,1}\rbrace^{N-1}\rightarrow\lbrace{0,1}\rbrace^{n_{ds}}$$ from the SDR to the subspace of the segment. There are $${N-1}\choose{n_{ds}}$$ such projections (there are no connections from a neuron to itself, so there are $$N-1$$ to choose from).

The overlap of the segment for a given $$\mathbf{y}_{\textrm{SDR}}^{(t-1)}$$ is the dot product $$o_k^t = \mathbf{c}_k\cdot\mathbf{y}_k^{(t-1)}$$. If this overlap exceeds the threshold $$\lambda_k$$ of the segment, the segment is active and sends a dendritic spike of size $$s_k$$ to the neuron’s cell body.

This process takes place before the processing of the feedforward input, which allows the layer to combine contextual knowledge of recent activity with recognition of the incoming feedforward signals. In order to facilitate this, we will change the algorithm for Pattern Memory as follows.

Each neuron begins a timestep $$t$$ by performing the above processing on its $${n_{\textrm{dd}}}$$ distal dendrites. This results in some number $$0\ldots{n_{\textrm{dd}}}$$ of segments becoming active and sending spikes to the neuron. The total predictive activation potential is given by:

$$o_{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$
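A minimal sketch of this sum for one neuron follows. It assumes the connection vectors $$\mathbf{c}_k$$ and projected inputs $$\mathbf{y}_k^{(t-1)}$$ have already been computed as bit lists, and it uses the $$\ge$$ comparison from the formula. The tuple layout and all numeric values are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# (c_k, y_k, lambda_k, s_k): connection bits, input bits, threshold, spike size
Segment = Tuple[Sequence[int], Sequence[int], int, float]


def segment_overlap(c_k: Sequence[int], y_k: Sequence[int]) -> int:
    """Dot product of the connection vector with the segment's input."""
    return sum(c * y for c, y in zip(c_k, y_k))


def predictive_potential(segments: List[Segment]) -> float:
    """Sum the spike sizes s_k of all segments whose overlap reaches lambda_k."""
    total = 0.0
    for c_k, y_k, lam_k, s_k in segments:
        if segment_overlap(c_k, y_k) >= lam_k:
            total += s_k
    return total


segments = [
    ([1, 1, 0, 1], [1, 1, 1, 0], 2, 1.0),  # overlap 2: active
    ([0, 0, 1, 0], [1, 1, 1, 1], 2, 1.0),  # overlap 1: silent
    ([1, 1, 1, 1], [1, 0, 1, 1], 3, 0.5),  # overlap 3: active
]
o_pred = predictive_potential(segments)
print(o_pred)  # 1.5
```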

The predictive potential is combined with the overlap score from the feedforward overlap coming from the proximal dendrite to give the total activation potential:

$$a_j^t=\alpha_j o_{\textrm{ff},j} + \beta_j o_{\textrm{pred},j}$$

and these $$a_j$$ potentials are used to choose the top neurons, forming the SDR $$Y_{\textrm{SDR}}$$ at time $$t$$. The mixing factors $$\alpha_j$$ and $$\beta_j$$ are design parameters of the simulation.
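To see how the predictive term can change which neurons win, here is a small sketch that ranks four neurons with and without prediction. It uses a single global $$\alpha$$ and $$\beta$$ rather than per-neuron factors, and the overlap values are invented for the example.

```python
from typing import List, Sequence


def total_activation(o_ff: Sequence[float], o_pred: Sequence[float],
                     alpha: float = 1.0, beta: float = 1.0) -> List[float]:
    """a_j = alpha * o_ff_j + beta * o_pred_j for every neuron j."""
    return [alpha * f + beta * p for f, p in zip(o_ff, o_pred)]


def ranking(a: Sequence[float]) -> List[int]:
    """Neuron indices from highest to lowest activation (ties by index)."""
    return sorted(range(len(a)), key=lambda j: (-a[j], j))


o_ff = [5, 7, 6, 2]     # feedforward overlaps (hypothetical)
o_pred = [4, 0, 0, 3]   # predictive potentials (hypothetical)

print(ranking(total_activation(o_ff, o_pred, beta=0.0)))  # [1, 2, 0, 3]
print(ranking(total_activation(o_ff, o_pred, beta=1.0)))  # [0, 1, 2, 3]
```

With $$\beta = 0$$ neuron 1 has the best feedforward match; once prediction is mixed in, the predicted neuron 0 overtakes it.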

## Learning Predictions

We use a very similar learning rule for distal dendrite segments as we did for the feedforward inputs:

$$p_i^{(t+1)} = \begin{cases} (1+\sigma_{inc})p_i^{(t)} & \text {if cell j active, segment k active, synapse i active} \\ (1-\sigma_{dec})p_i^{(t)} & \text {if cell j active, segment k active, synapse i not active} \\ p_i^{(t)} & \text{otherwise} \\ \end{cases}$$

Again, this reinforces synapses which contribute to activity of the cell, and decreases the contribution of synapses which don’t. A boosting rule, similar to that for proximal synapses, allows poorly performing distal connections to improve until they are good enough to use the main rule.

## Interpretation

We can now view the layer of neurons as forming a number of representations at each timestep. The field of predictive potentials $$o_{\textrm{pred},j}$$ can be viewed as a map of the layer’s confidence in its prediction of the next input. The field of feedforward potentials can be viewed as a map of the layer’s recognition of current reality. Combined, these maps allow for prediction-assisted recognition, which, in the presence of temporal correlations between sensory inputs, will improve the recognition and representation significantly.

We can quantify the properties of the predictions formed by such a layer in terms of the mutual information between the SDRs at time $$t$$ and $$t+1$$. I intend to provide this analysis as soon as possible, and I’d appreciate the kind reader’s assistance if she could point me to papers which might be of help.

A layer of neurons connected as described here is a Transition Memory, and is a kind of first-order memory of temporally correlated transitions between sensory patterns. This kind of memory may only learn one-step transitions, because the SDR is formed only by combining potentials one timestep in the past with current inputs.

Since the neocortex clearly learns to identify and model much longer sequences, we need to modify our layer significantly in order to construct a system which can learn high-order sequences. This is the subject of the next part of this series.

Note: For brevity, I’ve omitted the matrix treatment of the above. See Part One for how this is done for Pattern Memory; the extension to Transition Memory is simple but somewhat arduous.

• Nov 28 / 2014

## Mathematics of Hierarchical Temporal Memory

This article describes some of the mathematics underlying the theory and implementations of Jeff Hawkins’ Hierarchical Temporal Memory (HTM), which seeks to explain how the neocortex processes information and forms models of the world.

Note: Part II: Transition Memory is now available.

## The HTM Model Neuron – Pattern Memory (aka Spatial Pooling)

We’ll illustrate the mathematics of HTM by describing the simplest operation in HTM’s Cortical Learning Algorithm: Pattern Memory. Also known as Spatial Pooling, Pattern Memory forms a Sparse Distributed Representation from a binary input vector. We begin with a layer (a 1- or 2-dimensional array) of single neurons, which will form a pattern of activity aimed at efficiently representing the input vectors.

### Feedforward Processing on Proximal Dendrites

The HTM model neuron has a single proximal dendrite, which is used to process and recognise feedforward or afferent inputs to the neuron. We model the entire feedforward input to a cortical layer as a bit vector $${\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}$$, where $$n_{\textrm{ff}}$$ is the width of the input.

The dendrite is composed of $$n_s$$ synapses which each act as a binary gate for a single bit in the input vector.  Each synapse has a permanence $$p_i\in{[0,1]}$$ which represents the size and efficiency of the dendritic spine and synaptic junction. The synapse will transmit a 1-bit (or on-bit) if the permanence exceeds a threshold $$\theta_i$$ (often a global constant $$\theta_i = \theta = 0.2$$). When this is true, we say the synapse is connected.

Each neuron samples $$n_s$$ bits from the $$n_{\textrm{ff}}$$ feedforward inputs, and so there are $${n_{\textrm{ff}}}\choose{n_{s}}$$ possible choices of input for a single neuron. A single proximal dendrite represents a projection $$\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\rightarrow\lbrace{0,1}\rbrace^{n_s}$$, so a population of neurons corresponds to a set of subspaces of the sensory space. Each dendrite has an input vector $${\mathbf x}_j=\pi_j({\mathbf x}_{\textrm{ff}})$$ which is the projection of the entire input into this neuron’s subspace.

A synapse is connected if its permanence $$p_i$$ exceeds its threshold $$\theta_i$$. If we subtract $${\mathbf p}-{\vec\theta}$$, take the elementwise sign of the result, and map to $$\lbrace{0,1}\rbrace$$, we derive the binary connection vector $${\mathbf c}_j$$ for the dendrite. Thus:

$$c_i=(1 + sgn(p_i-\theta_i))/2$$

The dot product $$o_j({\mathbf x})={\mathbf c}_j\cdot{\mathbf x}_j$$ now represents the feedforward overlap of the neuron with the input, ie the number of connected synapses which have an incoming activation potential. Later, we’ll see how this number is used in the neuron’s processing.

The elementwise product $${\mathbf o}_j={\mathbf c}_j\odot{\mathbf x}_j$$ is the vector in the neuron’s subspace which represents the input vector $${\mathbf x}_{\textrm{ff}}$$ as “seen” by this neuron. This is known as the overlap vector. The length $$o_j = \lVert{\mathbf o}_j\rVert_{\ell_1}$$ of this vector corresponds to the extent to which the neuron recognises the input, and the direction (in the neuron’s subspace) is that vector which has on-bits shared by both the connection vector and the input.
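The projection, connection vector and overlap can be sketched in a few lines of plain Python. The helper names and sample values are assumptions; the sketch treats a synapse as connected only when its permanence is strictly above $$\theta$$, following the wording "exceeds" (the $$sgn$$ formula itself gives $$1/2$$ at exact equality).

```python
from typing import List, Sequence


def project(x_ff: Sequence[int], indices: Sequence[int]) -> List[int]:
    """pi_j: pick out the n_s input bits this neuron samples."""
    return [x_ff[i] for i in indices]


def connection_vector(permanences: Sequence[float],
                      theta: float = 0.2) -> List[int]:
    """c_i = 1 where the permanence exceeds the threshold, else 0."""
    return [1 if p > theta else 0 for p in permanences]


def overlap_vector(c_j: Sequence[int], x_j: Sequence[int]) -> List[int]:
    """Elementwise product: the input as seen through connected synapses."""
    return [c * x for c, x in zip(c_j, x_j)]


x_ff = [1, 0, 1, 1, 0, 1, 0, 0]     # full feedforward input (n_ff = 8)
indices = [0, 2, 3, 6]              # hypothetical projection (n_s = 4)
perms = [0.3, 0.1, 0.25, 0.6]       # hypothetical permanences

x_j = project(x_ff, indices)        # [1, 1, 1, 0]
c_j = connection_vector(perms)      # [1, 0, 1, 1]
o_vec = overlap_vector(c_j, x_j)    # [1, 0, 1, 0]
o_j = sum(o_vec)                    # overlap score: 2
print(x_j, c_j, o_vec, o_j)
```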

If we project this vector back into the input space, the result $$\mathbf{\hat{x}}_j =\pi_j^{-1}({\mathbf o}_j)$$ is this neuron’s approximation of the part of the input vector which this neuron matches. If we add a set of such vectors, we will form an increasingly close approximation to the original input vector as we choose more and more neurons to collectively represent it.

## Sparse Distributed Representations (SDRs)

We now show how a layer of neurons transforms an input vector into a sparse representation. From the above description, every neuron is producing an estimate $$\mathbf{\hat{x}}_j$$ of the input $${\mathbf x}_{\textrm{ff}}$$, with length $$o_j\ll n_{\textrm{ff}}$$ reflecting how well the neuron represents or recognises the input. We form a sparse representation of the input by choosing a set $$Y_{\textrm{SDR}}$$ of the top $$n_{\textrm{SDR}}=sN$$ neurons, where $$N$$ is the number of neurons in the layer, and $$s$$ is the chosen sparsity we wish to impose (typically $$s=0.02=2\%$$).

The algorithm for choosing the top $$n_{\textrm{SDR}}$$ neurons may vary. In neocortex, this is achieved using a mechanism involving cascading inhibition: a cell firing quickly (because it depolarises quickly due to its input) activates nearby inhibitory cells, which shut down neighbouring excitatory cells, and also nearby inhibitory cells, which spread the inhibition outwards. This type of local inhibition can also be used in software simulations, but it is expensive and is only used where the design involves spatial topology (ie where the semantics of the data is to be reflected in the position of the neurons). A more efficient global inhibition algorithm – simply choosing the top $$n_{\textrm{SDR}}$$ neurons by their depolarisation values – is often used in practice.

If we form a bit vector $${\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N\textrm{ where } y_j = 1 \Leftrightarrow j \in Y_{\textrm{SDR}}$$, we have a function which maps an input $${\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}$$ to a sparse output $${\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N$$, where the length of each output vector is $$\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N$$.
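Global inhibition as described above amounts to a top-$$n_{\textrm{SDR}}$$ selection. The sketch below is one simple way to write it; breaking ties by neuron index and truncating $$sN$$ to an integer are assumptions, and the overlap scores are generated by an arbitrary formula purely to have something to rank.

```python
from typing import List, Sequence


def global_inhibition(overlaps: Sequence[float],
                      sparsity: float = 0.02) -> List[int]:
    """Return the bit vector y_SDR with the top s*N neurons set to 1."""
    n = len(overlaps)
    n_sdr = int(sparsity * n)  # assumption: truncate to a whole neuron count
    winners = sorted(range(n), key=lambda j: (-overlaps[j], j))[:n_sdr]
    y = [0] * n
    for j in winners:
        y[j] = 1
    return y


N = 2048
overlaps = [(j * 37) % 101 for j in range(N)]  # arbitrary stand-in scores
y_sdr = global_inhibition(overlaps)
print(sum(y_sdr))  # 40 active neurons out of 2048
```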

The reverse mapping or estimate of the input vector by the set $$Y_{\textrm{SDR}}$$ of neurons in the SDR is given by the sum:

$$\mathbf{\hat{x}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf o}_j)} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j\odot{\mathbf x}_j)}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j \odot \pi_j({\mathbf x}_{\textrm{ff}}))}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j) \odot {\mathbf x}_{\textrm{ff}}}$$

## Matrix Form

The above can be represented straightforwardly in matrix form. The projection $$\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}} \rightarrow\lbrace{0,1}\rbrace^{n_s}$$ can be represented as a matrix $$\Pi_j \in \lbrace{0,1}\rbrace^{{n_s} \times\ n_{\textrm{ff}}}$$.

Alternatively, we can stay in the input space $$\mathbb{B}^{n_{\textrm{ff}}}$$, and model $$\pi_j$$ as a vector $$\vec\pi_j =\pi_j^{-1}(\mathbf 1_{n_s})$$, ie where $$\pi_{j,i} = 1 \Leftrightarrow (\pi_j^{-1}(\mathbf 1_{n_s}))_i = 1$$.

The elementwise product $$\vec{x_j} =\pi_j^{-1}(\mathbf x_{j}) = \vec{\pi_j} \odot {\mathbf x_{\textrm{ff}}}$$ represents the neuron’s view of the input vector $$\mathbf x_{\textrm{ff}}$$.

We can similarly project the connection vector for the dendrite by elementwise multiplication: $$\vec{c_j} =\pi_j^{-1}(\mathbf c_{j})$$, and thus $$\vec{o_j}(\mathbf x_{\textrm{ff}}) = \vec{c_j} \odot \mathbf{x}_{\textrm{ff}}$$ is the overlap vector projected back into $$\mathbb{B}^{n_{\textrm{ff}}}$$, and the dot product $$o_j(\mathbf x_{\textrm{ff}}) = \vec{c_j} \cdot \mathbf{x}_{\textrm{ff}}$$ gives the same overlap score for the neuron given $$\mathbf x_{\textrm{ff}}$$ as input. Note that $$\vec{o_j}(\mathbf x_{\textrm{ff}}) =\mathbf{\hat{x}}_j$$, the partial estimate of the input produced by neuron $$j$$.

We can reconstruct the estimate of the input by an SDR of neurons $$Y_{\textrm{SDR}}$$:

$$\mathbf{\hat{x}}_{\textrm{SDR}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\vec o}_j = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\vec c}_j\odot{\mathbf x_{\textrm{ff}}}} = {\mathbf C}_{\textrm{SDR}}{\mathbf x_{\textrm{ff}}}$$

where $${\mathbf C}_{\textrm{SDR}}$$ is a matrix formed from the $${\vec c}_j$$ for $$j \in Y_{\textrm{SDR}}$$.
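The reconstruction can be sketched by working directly in the input space: lift each connection vector with $$\pi_j^{-1}$$, mask the input with it, and sum over the neurons in the SDR. The three neurons below are hypothetical. Note that the sum counts how many SDR neurons cover each input bit, so entries can exceed 1.

```python
from typing import Dict, List, Sequence, Tuple


def lift(vec: Sequence[int], indices: Sequence[int], n_ff: int) -> List[int]:
    """pi_j^{-1}: place a subspace vector back into the input space."""
    out = [0] * n_ff
    for v, i in zip(vec, indices):
        out[i] = v
    return out


def estimate(x_ff: Sequence[int],
             neurons: Dict[int, Tuple[List[int], List[int]]],
             sdr: Sequence[int]) -> List[int]:
    """x_hat = sum over j in the SDR of (lifted c_j) * x_ff, elementwise."""
    n_ff = len(x_ff)
    x_hat = [0] * n_ff
    for j in sdr:
        indices, c_j = neurons[j]
        c_lifted = lift(c_j, indices, n_ff)
        for i in range(n_ff):
            x_hat[i] += c_lifted[i] * x_ff[i]
    return x_hat


x_ff = [1, 0, 1, 1, 0, 1, 0, 0]
neurons = {                       # j: (sampled input indices, c_j)
    0: ([0, 2, 3, 6], [1, 0, 1, 1]),
    1: ([1, 2, 5, 7], [1, 1, 1, 0]),
    2: ([0, 3, 4, 5], [1, 1, 0, 0]),
}
print(estimate(x_ff, neurons, [0]))        # [1, 0, 0, 1, 0, 0, 0, 0]
print(estimate(x_ff, neurons, [0, 1]))     # [1, 0, 1, 1, 0, 1, 0, 0]
print(estimate(x_ff, neurons, [0, 1, 2]))  # [2, 0, 1, 2, 0, 1, 0, 0]
```

In this toy case neurons 0 and 1 together already recover every on-bit of the input, and adding neuron 2 only increases the counts on bits that were already covered.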

## Optimisation Problem

We can now measure the distance between the input vector $$\mathbf x_{\textrm{ff}}$$ and the reconstructed estimate $$\mathbf{\hat{x}}_{\textrm{SDR}}$$ by taking a norm of the difference. Using this, we can frame learning in HTM as an optimisation problem. We wish to minimise the estimation error over all inputs to the layer. Given a set of (usually random) projection vectors $$\vec\pi_j$$ for the N neurons, the parameters of the model are the permanence vectors $$\vec{p}_j$$, which we adjust using a simple Hebbian update model.

The update model for the permanence of a synapse $$p_i$$ on neuron $$j$$ is:

$$p_i^{(t+1)} = \begin{cases} (1+\delta_{inc})p_i^{(t)} & \text{if } j \in Y_{\textrm{SDR}},\ (\mathbf x_j)_i=1, \text{ and } p_i^{(t)} \ge \theta_i \\ (1-\delta_{dec})p_i^{(t)} & \text{if } j \in Y_{\textrm{SDR}}, \text{ and } ((\mathbf x_j)_i=0 \text{ or } p_i^{(t)} < \theta_i) \\ p_i^{(t)} & \text{otherwise} \\ \end{cases}$$

This update rule increases the permanence of active synapses, those that were connected to an active input when the cell became active, and decreases those which were either disconnected or received a zero when the cell fired. In addition to this rule, an external process gently boosts synapses on cells which either have a lower than target rate of activation, or a lower than target average overlap score.
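The three cases of the update rule can be sketched in a few lines of Python. This is only an illustration: the function name, the default values of `delta_inc` and `delta_dec`, and the example numbers are assumptions of mine, and the boosting process mentioned above is left out.

```python
def update_permanences(perms, x_j, thresholds, in_sdr,
                       delta_inc=0.1, delta_dec=0.05):
    """Apply the multiplicative Hebbian rule to one neuron's synapses.

    perms      -- current permanences p_i for neuron j
    x_j        -- the neuron's view of the input (0/1 per synapse)
    thresholds -- connection thresholds theta_i
    in_sdr     -- True if neuron j is in the active SDR
    delta_inc, delta_dec -- hypothetical learning rates
    """
    if not in_sdr:
        # "otherwise" case: inactive neurons do not learn
        return list(perms)
    updated = []
    for p, x, theta in zip(perms, x_j, thresholds):
        if x == 1 and p >= theta:
            updated.append((1 + delta_inc) * p)   # connected and active
        else:
            updated.append((1 - delta_dec) * p)   # disconnected or inactive input
    return updated

# Example: three synapses with a common threshold of 0.2.
new_perms = update_permanences([0.5, 0.5, 0.1], [1, 0, 1], [0.2, 0.2, 0.2], True)
```

Here the first synapse is reinforced, while the second (zero input) and third (below threshold) both decay.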

I do not yet have the proof that this optimisation problem converges, or whether it can be represented as a convex optimisation problem. I am confident such a proof can be easily found. Perhaps a kind reader who is more familiar with a problem framed like this would be able to confirm this. I’ll update this post with more functions from HTM in coming weeks.

Note: Part II: Transition Memory is now available.

• Nov 13 / 2014

## Part 1 – Introduction and Description.

In any attempt to create a theoretical scientific framework, breakthroughs are often made when a single key “law” is found to underlie what previously appeared to be a number of observed lesser laws. An example from Physics is the key principle of Relativity: that the speed of light is a constant in all inertial frames of reference, which quickly leads to all sorts of unintuitive phenomena like time dilation, length contraction, and so on. This discussion aims to do this for HTM by proposing that its key underlying principle is the efficiency of predicted sparseness at all levels. I’ll attempt to show how this single principle not only explains several key features of HTM identified so far, but also explains in detail how to model any required structural component of the neocortex.

The neocortex is a tremendously expensive organ in mammals, and particularly in humans, so it seems certain that the benefits it provides are proportionately valuable to the genes of an animal. We can use this relationship between cost and benefit, with sparseness and prediction as mediating metrics, to derive detailed design rules for the neocortex at every level, down to individual synapses and their protein machinery.

“If you take one thing away from this talk, it should be that Sparse Distributed Representations are the key to Intelligence.” – Jeff Hawkins

Note: The next post in this series describes the Mathematics of Hierarchical Temporal Memory.

Sparse Distributed Representations are a key concept in HTM theory. In any functional piece of cortex, only a small fraction of a large population of neurons will be active at a given time; each active neuron encodes some component of the semantics of the representation; and small changes in the exact SDR correspond with small differences in the detailed object or concept being represented. Ahmad 2014 describes many important properties of SDRs.

SDRs are one efficient solution to the problem of representing something with sufficient accuracy at optimal cost in resources, and in the face of ambiguity and noise. My thesis is that in forming SDRs, neocortex is striving to optimise a lossy compression process by representing only those elements of the input which are structural and ignoring everything else.

Shannon proposed that any message has a concrete amount of information, measured in bits, which reflects the amount of surprise (i.e. something you couldn’t compute from the message so far, or by other means) contained in the message.

The most efficient message has zero length – it’s the message you don’t need to send. The next most efficient message contains only the information the receiver lacks to reconstruct everything the sender wishes her to know. Thus, by using memory and the right encoding to connect with it, a clever receiver (or memory system) can become very efficient indeed.

We will see that neocortex implements this idea literally, at all levels, as it attempts to represent, remember and predict events in the world as usefully as possible and at minimal cost.

The organising principle in cortical design is that components (from the whole organism down to a synapse) can do little about the amount of signal they receive, but they can – and do – adapt and learn to make best use of that signal to control what they do, only acting – sending a signal – when it’s the predicted optimal choice. This gives rise to sparseness in space and time everywhere, which directly reflects the degree of successful prediction present in any part of the system.

The success metric for a component in neocortex is the ratio of input data rate to output information rate, where the component has either a fixed minimum, or (for neurons and synapses) a fixed maximum, output level.

Deviations from the target indicate some failure to predict activity. This failure is either an opportunity to learn (and predict better next time), or, failing that, something which needs to be acted upon in some other way, by taking a different action or by passing new information up the hierarchy.

Note that inputs in this context are any kind of signal coming in to the component under study. In the case of regions, layers and neurons, these include top-down feedback and lateral inputs as well as feedforward.

### Hierarchy

Neocortex is a hierarchy because it has finite space to store its model of the world, and a hierarchy is an optimal strategy when the world itself has hierarchical structure. Each region in the hierarchy is subjected (by design) to a necessarily overwhelming rate of input, so it will run at capacity to absorb its data stream, reallocating its finite resources to contain an optimal model of the world it perceives.

### Regions

The memory inside a region of cortex is driven towards an “ideal” state in which it always predicts its inputs and thus produces a “perfect”, minimal message – containing its learned SDR of its world’s current state – as output. Any failure to predict is indicated by a larger output, the deviation from “ideal” representing the exact surprise of the region to its current perception of the world.

A region has several output layers, each of which has a different (and usually more than one) purpose.

For each region, two layers send (different) signals up the hierarchy, therefore signalling both the current state of its world and the encoding of its unpredictability. The higher region now gets details of something it should hopefully have the capacity to handle – predict – or else it passes the problem up the chain.

Two layers send (again different) signals down to lower layers and (in the case of motor) to subcortical systems. The content of these outputs will relate to the content as well as the stability and confidence of the region’s model, and also actions which are appropriate in terms of that content and confidence level.

### Layers

A cortical layer which has fully predicted its inputs has a maximally sparse output pattern. A fully failing prediction pattern in a layer causes it to output a maximally bursting and minimally sparse pattern, at least for a short time. At any failure level in between, the exact evolution of firing in the bursting neurons encodes the precise pattern of prediction failure of the layer, and this is the information passed to other layers in the region, to other regions in cortex, or to targets outside the cortex.

The output of a cortical layer is thus a minimal message – it “starts” with the best match of its prediction and reality, followed (in a short period of time) by encodings of reality in the context of increasingly weak prediction.

### Columns

A layer’s output, in turn, is formed from the combination of its neurons, which are themselves arranged in columns. The columnar arrangement of cells in cortical columns is the key design leading to all the behaviour described previously.

Pyramidal cells, which represent both the SDR activity pattern and the “memory” in a layer, are all contained in columns. The sparse pattern of activity across a layer is dictated by how all the cells compete within this columnar array.

Columns are composed of pyramidal cells, which act independently, and a complex of inhibitory cells which act together to define how the column operates. All cells share a very similar feedforward receptive field, due to the fact that feedforward axons physically run up through the narrow column and abut the pyramidal bodies as they squeeze past.

#### Columnar Inhibition

The inhibitory cells have a broader and faster feedforward response compared with the pyramidal cells [Reference] so, in the absence of strong predictive inputs to any pyramidal cells, the entire assemblage of inhibitory neurons will be first to fire in a column. When this happens, these inhibitory cells excite those in adjacent columns, and a wave of inhibition spreads out from a successfully firing column.

The wave continues until it arrives at a column which has already been inhibited by a wave coming from elsewhere in the layer (from some recently active column). This gives rise to a pattern of inactivity around columns which are currently active.

#### Predictive Activation

Each cell in a column has its own set of feedforward and predictive inputs, so every cell has a different rate of depolarising as it is driven towards firing threshold.

Some cells may have received sufficient depolarising input from predictive lateral or top-down dendrites to reach firing threshold before the column’s sheath of inhibitory cells. In this case the pyramidal cell will fire first, trigger the column’s inhibitory sheath, and cause the wave of inhibition to spread out laterally in the layer.

#### Vertical Inhibition in Columns

When the inhibitory sheath fires, it also sends a wave of inhibitory signals vertically in the column. This wave will shut down any pyramidal cells which have not yet reached threshold, giving rise to a sparse activity pattern in the column.

The exact number of cells which get to fire before the sheath shuts them down depends mainly on how predictive each cell was and whether the sheath was triggered by a “winning cell” (previous section), by the sheath being first to fire, or as a result of neighbouring columns sending out signals.

If there is a wave of inhibition reaching a column, all cells are shut down and none (or no more) fire.

If there was a cell so predictive that it fired before the sheath, all other cells are very likely shut down and only one cell fires.

Finally, if the sheath was first to fire due to its feedforward input, the pyramidal cells are shut down quite quickly, but the most predictive may get the chance to fire just before being shut down.

This last process is called bursting, and gives rise to a short-lived pattern which encodes exactly how well the column as an ensemble has matched its predictions. Basically, the more cells which fire, the more “confused” the match between prediction and reality. This is because the inhibition happens quickly, so the gap between the first and last cell to burst must be small, reflecting similar levels of predictivity.

The bursting process may also be ended by an incoming wave of inhibition. The further away a competing column is, the longer that will take, allowing more cells to fire and extending the burst. Thus the amount of bursting also reflects the local area’s ability to respond to the inputs.
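The three cases above can be captured in a toy “race” model. This is only a sketch under strong simplifying assumptions of my own: each cell is reduced to a single time-to-threshold, the sheath to a single firing time, and vertical inhibition to a fixed delay. The function and parameter names are hypothetical.

```python
def column_firing(cell_times, sheath_time, inhibition_delay,
                  incoming_wave_time=float("inf")):
    """Return the indices of the cells that fire, in firing order.

    cell_times         -- time at which each pyramidal cell would reach threshold
    sheath_time        -- time at which the sheath would fire on feedforward input alone
    inhibition_delay   -- assumed lag before vertical inhibition silences the column
    incoming_wave_time -- arrival time of lateral inhibition from another column
    """
    # The sheath is triggered by its own feedforward drive or by a winning cell,
    # whichever happens first.
    trigger = min(sheath_time, min(cell_times))
    # Cells are shut down shortly after the trigger, or earlier if a wave of
    # inhibition arrives from a neighbouring column.
    cutoff = min(trigger + inhibition_delay, incoming_wave_time)
    ranked = sorted(enumerate(cell_times), key=lambda pair: pair[1])
    return [index for index, t in ranked if t <= cutoff]

# A strongly predicted cell beats the sheath: only that cell fires.
predicted = column_firing([1.0, 3.0, 3.2, 5.0], sheath_time=4.0, inhibition_delay=0.5)
# The sheath fires first: the most predictive cells burst before shutdown.
bursting = column_firing([3.0, 3.2, 5.0], sheath_time=2.9, inhibition_delay=0.5)
# A wave of inhibition arrives before anything fires: the column stays silent.
silenced = column_firing([3.0, 3.2, 5.0], sheath_time=2.9, inhibition_delay=0.5,
                         incoming_wave_time=0.5)
```

In this toy model, the size of the burst depends on how closely the cells’ times-to-threshold are clustered and on how long the column has before inhibition lands.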

### Neurons

Neurons are machines which use patterns of input signals to produce a temporal pattern of output signal. The neuron wastes most resources if its potential rises but just fails to fire, so the processes of adaption of the neuron are driven to a) maximise the response to inputs within a particular set, and b) minimise the response to inputs outside that set.

The excitatory inputs to one neuron are of two main types – feedforward and predictive; the number of each type of input varies from tens to tens of thousands; and the inputs arrive stochastically in combinations which contain mixtures of true structure and noise, so the “partitioning problem” a neuron faces is intractable. It simply learns to do the best it can.

Note that neurons are the biggest components in HTM which actually do anything! In fact, the regions, layers and columns are just organisational constructs, ways of looking at the sets of interacting neurons.

The neuron is the level in the system at which genetic control is exercised. The neuron’s shape, size, position in the neocortex, receptor selections, and many more things are decided per-neuron.

Importantly, many neurons have a genetically expressed “firing program” which broadly sets a target for the firing pattern, frequency and dependency setup.

Again, this gives the neuron an optimal pattern of output, and its job is to arrange its adaptations and learn to match that output.

### Dendrites

Distal dendrites have a similar but simpler and smaller scale problem of combining inputs and deciding whether to spike.

I don’t believe dendrites do much more than passively respond to global factors such as modulators and act as conduits for signals, both electrical and chemical, originating in synapses.

### Synapses

Synapses are now understood to be highly active processing components, capable of growing both in size and efficiency in a few seconds, actively managing their response to multiple inputs – presynaptic, modulatory and intracellular, and self-optimising to best correlate a stream of incoming signals with the activity of the entire neuron.

Part Two takes this idea further and details how a multilayer region uses the efficiency of predicted sparseness to learn a sensorimotor model and generate behaviour.

The next post in this series describes the Mathematics of Hierarchical Temporal Memory. This diversion is useful before proceeding with the main thread.

Blättler F, Hahnloser RHR. An Efficient Coding Hypothesis Links Sparsity and Selectivity of Neural Responses. Kiebel SJ, ed. PLoS ONE 2011;6(10):e25506. doi:10.1371/journal.pone.0025506. [Full Text]

• Sep 14 / 2014

## A Unifying View of Deep Networks and Hierarchical Temporal Memory

There’s been a somewhat less than convivial history between two of the theories of neurally-inspired computation systems over the last few years. When a leading protagonist of one school is asked a question about the other, the answer often varies from a kind of empty semi-praise to downright dismissal and the occasional snide remark. The objections of one side to the other’s approach are usually valid, and mostly admitted, but the whole thing leaves one with a feeling that it is not a very scientific way to proceed or behave. This post describes an idea which might go some way to resolving this slightly unpleasant impasse and suggests that the discrepancies may simply be the result of two groups using the same name for two quite different things.

In HTM, Jeff Hawkins’ plan is to identify the mechanisms which actually perform computation in real neocortex, abstracting them only far enough that the details of the brain’s bioengineering are simplified out, and hopefully leaving only the pure computational systems in a form which allows us to implement them in software and reason about them. On the other hand, Hinton and LeCun’s neural networks are each built “computation-first,” drawing some inspiration from and resembling the analogous (but in detail very different) computations in neocortex.

The results (i.e. the models produced), inevitably, are as different at all levels as their inventors’ approaches and goals. For example, one criterion for the Deep Network developer is that her model is susceptible to a set of mathematical tools and techniques, which allow other researchers to frame questions, examine and compare models, and so on, all in a similar mathematical framework. HTM, on the other hand, uses neuroscience as a standard test, and will not admit to a model any element which is known to be contradicted by observation of natural neocortex. The Deep Network people complain that the models of HTM cannot be analysed like theirs can (indeed it seems they cannot), while the HTM people complain that the neurons and network topologies in Deep Networks bear no relationship with any known brain structures, and are several simplifications too far.

Yann LeCun said recently on Reddit (with a great summary):

Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it’s the very idea of deep learning) and will use temporal prediction.

But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?).

I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice.

As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.

The topic of HTM and Jeff Hawkins was second out of all the major themes in the Q&A session, reflecting the fact that people in the field view this as an important issue, and (it seems to me) wish that the impressive progress made by Deep Learning researchers could be reconciled with the deeper explanatory power of HTM in describing how the neocortex works.

Of course, HTM people seldom refuse to play their own role in this spat, saying that a Deep Network sacrifices authenticity in favour of mathematical tractability and getting high scores on artificial “benchmarks”. We explain or excuse the fact that our models are several steps smaller in hierarchy and power, making the valid claim that there are shortcuts and simplifications we are not prepared to make, and speculating that we will – like the tortoise – emerge alone at the finish with the prize of AGI in our hands.

The problem is, however, a little deeper and more important than an aesthetic argument (as it sometimes appears). This gap in acknowledging the valid accomplishments of the two models, coupled with a certain defensiveness, causes a “chilling effect” when an idea threatens to cross over into the other realm. This means that findings in one regime are very slow to be noticed or incorporated in the other. I’ve heard quite senior HTM people actually say things like “I don’t know anything about Deep Learning, just that it’s wrong” – and vice versa. This is really bad science.

From reading their comments, I’m pretty sure that no really senior Deep Learning proponent has any knowledge of the current HTM beyond what he’s read in the popular science press, and the reverse is nearly as true.

I consider a very good working knowledge of Deep Learning to be a critical part of any area of computational neuroscience or machine learning. Obviously I feel at least the same way about HTM, but recognise that the communication of our progress (or even the reporting of results) in HTM has not made it easy for “outsiders” to achieve the levels of understanding they feel they need to take part. There are historical reasons for much of this, but it’s never too late to start fixing a problem like this, and I see this post (and one of my roles) as a step in the right direction.

## The Neuron as the Unit of Computation

In both models, we have identified the neuron as the atomic unit of computation, and the connections between neurons as the location of the memory or functional adjustment which gives the network its computational power. This sounds fine, and clearly the brain uses neurons and connections in some way like this, but this is exactly where the two schools mistakenly diverge.

Jeff Hawkins rejects the NN integrate-and-fire model and builds a neuron with vastly higher complexity. Geoff Hinton admits that, while impossible to reason about mathematically, HTM’s neuron is far more realistic if your goal is to mimic neocortex. Deep Learning, using neurons like Lego bricks, can build vast hierarchies and huge networks, find cats in Youtube videos, and win prizes in competitions. HTM, on the other hand, struggles for years to fit together its “super-neurons” and builds a tiny, single-layer model which can find features and anomalies in low-dimensional streaming data.

Looking at this, you’d swear these people were talking about entirely different things. They’ve just been using the same names for them. And, it’s just dawned on me, therein lies both the problem and its solution. The answer’s been there all the time:

Each and every neuron in HTM is actually a Deep Network.

In a HTM neuron, there are two types of dendrite. One is the proximal dendrite, which contains synapses receiving inputs from the feedforward (mainly sensory) pathway. The other is a set of coincidence-detecting, largely independent, distal dendrite segments, which receive lateral and top-down predictive inputs from the same layer or higher layers and regions in neocortex.

My thesis here is that a single neuron can be seen as composed of many elements which have direct analogues in various types of Deep Learning networks, and that there are enough of these, with a sufficient structural complexity, that it’s best to view the neuron as a network of simple, Deep Learning-sized nodes, connected in a particular way. I’ll describe this network in some detail now, and hopefully it’ll become clear how this approach removes much of the dichotomy between the models.

Firstly, a synapse in HTM is very much like a single-input NN node, where HTM’s permanence value is akin to the bias in a NN node, and the weight on the input connection is fixed at 1.0. If the input is active, and the permanence exceeds the threshold, the synapse produces a 1. In HTM we call such a synapse connected, in that the gate is open and the signal is passed through.

The dendrite or dendrite segment is like the next layer of nodes in NN, in that it combines its inputs and passes the result up. The proximal dendrite effectively acts as a semi-rectifier, summing inputs and generating a scalar depolarisation value to the cell body. The distal segments, on the other hand, act like thresholded coincidence detectors and produce a depolarising spike only if the sum of the inputs exceeds a threshold.

These depolarising inputs (feedforward and recurrent) are combined in the cell body to produce an activation potential. This only potentially generates the output of the entire neuron, because a higher-level inhibition system is used to identify those neurons with highest potential, allow those to fire (producing a binary 1), and suppress the others to zero (a winner-takes-all step with multiple local winners in the layer).

So, a HTM layer is a network of networks, a hierarchy in which neuron-networks communicate with connections between their sub-parts. At the HTM layer level, each neuron has two types of input and one output, and we wire them together as such, but each neuron is really hiding an internal, network-like structure of its own.
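To make the “network inside a neuron” picture concrete, here is a minimal Python sketch of the pieces just described. It is an illustration only, not NuPIC code: the function names, the default connection threshold, the `spike_boost` weighting used to combine distal spikes with the proximal sum, and the global top-k inhibition are all simplifying assumptions of mine.

```python
def synapse(input_bit, permanence, connect_threshold=0.2):
    """Single-input node: passes a 1 only if the input is active and connected."""
    return 1 if input_bit and permanence > connect_threshold else 0

def proximal_dendrite(inputs, permanences):
    """Sums its synapses into a scalar feedforward depolarisation."""
    return sum(synapse(x, p) for x, p in zip(inputs, permanences))

def distal_segment(inputs, permanences, segment_threshold):
    """Coincidence detector: spikes only if enough synapses are active."""
    total = sum(synapse(x, p) for x, p in zip(inputs, permanences))
    return 1 if total > segment_threshold else 0

def cell_potential(ff_inputs, ff_perms, segments, spike_boost=0.5):
    """Combine feedforward and predictive depolarisation in the cell body.

    segments is a list of (inputs, permanences, segment_threshold) triples.
    """
    feedforward = proximal_dendrite(ff_inputs, ff_perms)
    spikes = sum(distal_segment(i, p, t) for i, p, t in segments)
    return feedforward + spike_boost * spikes

def k_winners(potentials, k):
    """Crude stand-in for inhibition: the k highest potentials fire."""
    ranked = sorted(range(len(potentials)),
                    key=lambda i: potentials[i], reverse=True)
    winners = set(ranked[:k])
    return [1 if i in winners else 0 for i in range(len(potentials))]

# One neuron with three feedforward synapses and one distal segment.
potential = cell_potential([1, 1, 0], [0.3, 0.1, 0.9],
                           [([1, 1, 1], [0.5, 0.5, 0.5], 2)])
layer_output = k_winners([potential, 0.2, 3.0, 1.0], k=2)
```

Seen this way, the layer-level wiring only ever touches the outputs of `cell_potential` and `k_winners`, while the synapse and dendrite nodes stay hidden inside each neuron.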

• Aug 23 / 2014

## Suggested Naming in HTM Theory and White Paper

“There are only two hard things in Computer Science: cache invalidation and naming things.” – Phil Karlton

In the case of HTM, we also have the much bigger problem of explaining how neocortex may work, and how a non-obvious CLA operates to use cortical principles. Extra confusion caused by poor naming multiplies the difficulties.

A key component of the art of naming consists in identifying the scope of each name. We need to have names which are just specific enough to capture the underlying concept, but not so specific that they entangle non-essential details. Names also need to be memorable and comfortable, while not being too easy to misconstrue, because they resemble or contain words which have other meanings.

I’d like to begin a reasoned discussion about key names in HTM and CLA. The goal of the discussion is to arrive at a set of names which everyone strongly believes captures the concepts for both theory and implementation.

As a famous Supreme Court judge once said of pornography, “we cannot define it but we know it when we see it.” We are looking for this kind of name, with the added advantage that HTM can actually precisely define the concept behind each name.

Until we arrive at a good name for something (i.e. one which magically gets everyone’s support), we should identify the key flaws in each candidate and agree that they invalidate that candidate. This is a healthy process which should not be regarded as a criticism of any proposer.

Please treat that as an open invitation to tell me how poor my proposed names are, but only for reasons you’d accept as rational if they were directed at yours!

I’m currently re-reading the 2011 White Paper with a view to updating and improving it. This document is a very rich source of information pertinent to this discussion, and in fact appears to answer a couple of the thorniest ones! I’d very strongly recommend re-reading it as preparation for taking part in this discussion.

I’d like to go through the main named concepts one by one, discuss the strengths and weaknesses of the current names, and propose a new name for each concept with some supporting motivations and argument. I don’t expect that my proposals will stick, but they should get us a noticeable step in the right direction, or at least throw light on the relevant issues.

### Sparse Distributed Representation.

I start with this one because, in my experience of learning, reasoning about, writing about, talking about, and explaining HTM, the term SDR is as close to perfect as I can imagine. It has the property of monotonically improving understanding the more you find out about each of the three concepts named.

It is also an easily testable name. We all remember when Francisco showed us the CEPT Retina SDRs; in fact, they were so SDRish that some of us thought they were too good to be true!

### Spatial Pooling.

There are several problems with this term. We understand that “spatial” was chosen to indicate that each presentation of the data has some properties and structure in the sensory domain (such as a shape, size or colour), and it’s called “spatial” as opposed to “temporal”.

A difficulty arises for newcomers who read too much into this use of the word. There is a strong temptation to rely on our commonsense ideas of space when Jeff is really talking about mathematical, vector spaces and the abstract “spaces” of SDRs.

HTM does not require the kind of retinotopic mapping found in V1. The only reason we have literal spatial layouts in just a few primary areas of sensory cortex is because it is a simpler evolutionary and developmental design, not because it is needed for the algorithm. The RDSE, the Geospatial Encoder and the CEPT retina are all superb examples of how “pseudorandom” representations are better than more pictorially understandable spatial representation regimes.

Lastly, we’ve already tripped over this when we started talking about the new sensorimotor theory. L4 cells are now dealing with motor inputs as well as “spatial”, and L3 cells are now expected to “see” a set of L4 outputs whose members are substituted over time. So the word “spatial” really needs to go.

The word “Pooling” has, for many, either no meaning at all (most cases), or worse, the wrong meanings in this context. If you are trying to capture the notion of a noise-tolerant, largely stable representation of closely related sensory input, “pooling” isn’t going to do that for most people.

I’m not sure there is a good word for this, so my suggestion drops this aspect. As mentioned several times in the 2011 White Paper, the concept of pooling (noise-tolerance, high-overlap) is already embedded as a property of the product of SP – the SDR.

I propose the term Pattern Memory for what we currently call Spatial Pooling. This captures the fact that patterns in the data are recognised-learned and that the CLA is developing a memory of patterns it has seen. By not being too specific about which patterns we mean, it also allows us to say that the CLA learns to recognise and remember patterns of input data, stores patterns of synaptic connections, and forms patterns of activation (SDRs) to represent its inputs.

This name is also robust to adopting the new theory. L4 cells can learn sensorimotor patterns, and L3 cells can learn to recognise patterns of membership in a sequence-set.
We can run this in the top-down direction too, talking about patterns appearing in L1, motor patterns, patterns of depolarisation, and so on.

### (old) Temporal Pooling.

The problems with using this term in its old context have been well-rehearsed, and it’s now used for the much more appropriate concept of representing a stable(r) sequence-identifying SDR in Layer 3 when sensorimotor transitions from that sequence are occurring in Layer 4. Temporal Pooling, in that sense, is another great name.

I had previously offered the term “Transition Prediction” for the component of CLA involving lateral connections and predictive states. Jeff and Numenta are currently using “Temporal Memory”. I believe both are flawed.

My suggestion accurately captured the limited, 1-timestep scope of this component, and also the fact that prediction is the key to temporal learning. However, it sounds like we need to add words to the name, to reflect “something missing” from the two word name.

Temporal Memory, on the other hand, is too high-ranking and valuable a name for this relatively basic component. It carries the risk that people will think HTM is just a hierarchy of TMs. Also, “temporal” is too general – the same word is currently used for single-timestep (old TP/TM) all the way up to entire sequences (new TP).

I propose Transition Memory for this second core component of CLA. This captures most literally what the algorithm is doing – learning single transitions. It is also the temporal equivalent of Pattern Memory, using distal dendrites to link to past SDRs just as PM uses proximal dendrites to link to feedforward patterns.

Importantly, the term Transition Memory is not trying to work too hard. We can explain that learned transitions are used to put cells into predictive states, and that these predictive patterns are used both in sensory (variable order) and sensorimotor (first order) temporal learning. They are used to match predicted and actual inputs, detect anomalies and create patterns which indicate continuing successful prediction or trigger a pattern of bursting columns. It seems impossible to me to have one name capture all these aspects, so I propose we stop trying and give the name a break!

In a variation on Pattern Memory (SP), depolarisation due to Transition Memory is combined with feedforward inputs to assist recognition and increase noise-tolerance. In Jeff’s new sensorimotor theory, combining distal with proximal inputs is likely to be key to the function.

### Old and New Versions of HTM/CLA Theory.

In previous posts, I used “old and new” or “2013 and 2014” to distinguish these two generations of the theory. In reworking the White Paper, I’ve recognised that these two theories are akin to the Newtonian versus Relativistic or Quantum views of mechanics. You need to quite deeply understand the simpler theory before you can begin to deal with the far more complex and realistic one. And for many purposes, the simpler theory is perfectly sufficient both for understanding how the neocortex works, and for useful application in software.

I thus propose that the older, simpler theory and model be called the “Sensory Cortical Learning Algorithm” or “Sensory CLA”, the newer being called the “Sensorimotor CLA”.

SCLA (or just CLA) and SMCLA are simple, distinguishable acronyms.

This also allows us to talk about HTM systems with SCLA single-layer regions (as NuPIC can/does), which just do feedforward, sensory hierarchy, or else fuller HTMs which incorporate behaviour, stable sequences, temporal pooling, and true bidirectional hierarchy using SMCLA in each region.

• Aug 14 / 2014

## Implications of the NuPIC Geospatial Encoder

Numenta’s Chetan Surpur recently demoed and explained the details of a new encoder for NuPIC which creates Sparse Distributed Representations (SDRs) from GPS data. Quite apart from the direct applications which this development immediately suggests, I believe that Chetan’s invention has a number of much more profound implications for NuPIC and even HTM in general. This post will explore a few of the most important of these. Chetan’s demo, in which he presents the encoder to Numenta people and discusses it with them, and another excellent hands-on tutorial by Matt Taylor are both available on YouTube.

### Mechanism

I’ll begin by describing the encoder itself. The Geospatial Encoder takes as input a triple [Lat, Long, Speed] and returns a Sparse Distributed Representation (SDR) which uniquely identifies that position for the given speed. The speed is important because we want the “resolution” of the encoding to vary depending on how quickly the position is changing, and Chetan’s method does this very elegantly.

The algorithm is quite simple. First, a 2D space (Lat, Long) is divided up (virtually) into squares of a given scale (a parameter provided for each encoder), so each square has an x and y integer co-ordinate (the Lat-Long pair is projected using a given projection scheme for convenient display on mapping software). This co-ordinate pair can then be used as a seed for a pseudorandom number generator (Python and numpy use the cross-platform Mersenne Twister MT19937), which is used to produce a real-valued order between 0 and 1, and a bit position chosen from the n bits in the encoding. These can be generated on demand for each square in the grid, always yielding the same results.

To create the SDR for a given position and speed, the algorithm first converts the speed to a radius, forms a box of squares surrounding the position, and calculates the pair [order, bit] for each square in the box. The top w squares (those with the highest order) are chosen, and their bit values are used to choose the w active bits in the SDR.
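The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NuPIC implementation: the names `order_and_bit` and `encode`, the SHA-256 seeding scheme and the default values of `n` and `w` are all hypothetical, and the sketch takes an integer square and a radius directly, leaving out the map projection and the speed-to-radius conversion.

```python
import hashlib
import itertools

import numpy as np


def order_and_bit(square, n):
    """Return the reproducible (order, bit) pair for one grid square."""
    # Assumption: hash the integer co-ordinates into a 32-bit seed.
    digest = hashlib.sha256(repr(tuple(square)).encode("utf-8")).digest()
    rng = np.random.RandomState(int.from_bytes(digest[:4], "big"))  # MT19937
    order = rng.random_sample()  # real-valued order in [0, 1)
    bit = int(rng.randint(n))    # bit position in [0, n)
    return order, bit


def encode(square, radius, n=1024, w=21):
    """Return the sorted active bit indices for an integer square and radius."""
    # Form the box of squares surrounding the position.
    axes = [range(c - radius, c + radius + 1) for c in square]
    box = list(itertools.product(*axes))
    if len(box) < w:
        raise ValueError("radius too small: the box holds fewer than w squares")
    scored = [(order_and_bit(s, n), s) for s in box]
    # Keep the w squares with the highest order (ties broken by co-ordinate).
    top = sorted(scored, key=lambda item: (item[0][0], item[1]), reverse=True)[:w]
    # Two winning squares may share a bit, so fewer than w bits can be active.
    return sorted({bit for (_, bit), _ in top})


sdr = encode((120, 45), radius=3)
print(len(sdr), "active bits:", sdr)
```

Because every square’s values depend only on its own co-ordinates, calling `encode` twice with the same arguments gives the same bits, whatever else has been encoded in between.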

### Initial Interpretation

The first thing to say is that this encoder is an exemplar of transforming real-world data (location in the context of movement) into a very “SDR-like” SDR. It has the key properties we seek in an SDR encoder, in that semantically similar inputs will yield highly overlapping representations. It is robust to noise and measurement error in both space and time, and the representation is both unique (given a set scale parameter) and reproducible (given a choice of cross-platform random number generator), independently of the order of presentation of the data. The reason for this “SDR-style” character is that the entire space of squares forms an infinite field of “virtual neurons”, each of which has some activation value (its order) and position in the input bit vector (its bit). The algorithm first sparsifies this representation by restricting its sampling subspace to a box of squares around the position, and then enforces the exact sparseness by picking the w squares using a competitive analogue of local inhibition.

### Random Spatial Neuron Field (Spatial Retina)

This idea can be generalised to produce a “spatial retina” in n-dimensional space which provides a (statistically) unique SDR fingerprint for every point in the space. The SDRs specialise (or zoom in) when you reduce the radius factor, and generalise (or zoom out) when radius is increased. This provides a distance metric between two points which involves the interplay of spatial zoom and the fuzziness of overlap. Any two points will have identical SDRs (w bits of overlap) if you increase the radius sufficiently, and entirely disparate SDRs (0 bits overlap) if you zoom in sufficiently (down to the order of w*scale). Since the Coordinate Encoder operates in a world of integer-indexed squares, we first need to transform each dimension using its own scale parameter (the Geospatial Encoder uses the same scale for each direction, but this is not necessary).

We thus have a single, efficient, simple mechanism which allows HTM to navigate in any kind of spatial environment. This is, I believe, a really significant invention which has implications well beyond HTM and NuPIC. As Jeff and others mentioned during Chetan’s talk, this may be the mechanism underlying some animals’ ability to navigate using the Earth’s magnetic field. It is possible to envisage a (finite, obviously) field of real neurons which each have a unique response to position in the magnetic field.

Humans have a similar ability to navigate, using sensory input to provide an activation pattern which varies over space and identifies locations. We combine whichever modalities work best (blind people use sound and memories of movement to compensate for impaired vision), and as long as the pipeline produces SDRs of an appropriate character, we can now see how this just works.

### Comparison with Random Distributed Scalar Encoder (RDSE)

The Geospatial Encoder uses the more general Coordinate Encoder, which takes an n-dimensional integer vector and a radius, and produces the corresponding SDR. It is easy to see how a 1D spatial encoder with a fixed speed would produce an SDR for arbitrary scalars, given an initial scale which would decide the maximum resolution of the encoder. This encoder would be an improved replacement for the RDSE, with the following advantages:

• When encoding a value, the RDSE needs to encode all the values between existing encodings and the new value (so that the overlap guarantees are honoured). A 1D-Geo encoder can compute each value independently, saving significantly in time and memory footprint.
• In order to produce identical values for all inputs regardless of the order of presentation, the RDSE needs to “precompute” even more values in batches around a fixed “centre” (eg to compute f(23) starting at 0, we might have to compute [f(-30),…,f(30)]). Again, 1D-Geo scalar encoding computes each value uniquely and independently.
• Assuming scale (which decides the max resolution) is fixed, the 1D-Geo scalar encoding can compute encodings of variable resolution with semantic degradation by varying speed. The SDR for a value is exactly unique for the same speed, but changes gradually as speed is increased or decreased. The RDSE has no such property.

This would strongly suggest that we can replace the RDSE with a 1D coordinate spatial encoder in NuPIC, and get all the above benefits without any compromise.
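To make the comparison concrete, here is a rough sketch of what such a 1D scalar encoder might look like. It is an illustration only: `encode_scalar`, its parameters and the hash-based seeding are assumptions of mine rather than anything in NuPIC, and it returns the winning squares alongside the bits so that the behaviour is easy to inspect.

```python
import hashlib

import numpy as np


def _order_and_bit(index, n):
    """Reproducible (order, bit) pair for one integer square on the line."""
    # Assumption: seed a Mersenne Twister from a hash of the square's index.
    digest = hashlib.sha256(str(index).encode("utf-8")).digest()
    rng = np.random.RandomState(int.from_bytes(digest[:4], "big"))
    return rng.random_sample(), int(rng.randint(n))


def encode_scalar(value, scale=1.0, radius=20, n=400, w=21):
    """Return (winning squares, active bits) for a scalar value."""
    if 2 * radius + 1 < w:
        raise ValueError("radius too small: the window holds fewer than w squares")
    centre = int(round(value / scale))  # scale fixes the maximum resolution
    window = range(centre - radius, centre + radius + 1)
    scored = [(_order_and_bit(i, n), i) for i in window]
    # Keep the w squares with the highest order (ties broken by index).
    top = sorted(scored, key=lambda item: (item[0][0], item[1]), reverse=True)[:w]
    squares = {i for _, i in top}
    bits = {bit for (_, bit), _ in top}  # shared bits can leave fewer than w
    return squares, bits


# Each value is computed on its own; nothing in between needs encoding first.
squares_23, bits_23 = encode_scalar(23.0)
squares_24, bits_24 = encode_scalar(24.0)
print("squares shared by 23 and 24:", len(squares_23 & squares_24))
```

Nothing is cached or precomputed here, so the encoding of a value cannot depend on the order of presentation. Widening `radius` plays the role of increasing speed, trading resolution for generality.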

### Combination with Spatially-varying Data

It is clear how you could combine this encoding scheme with data which varies by location, to create a richer idea of “order” in feeding the SDR generation algorithm. For example, you could combine random “order” with altitude or temperature data to choose the top w squares. Alternatively, the pure spatial bit signature of a location may be combined in parallel with the encoded values of scalar quantities found at the current location, so that an HTM system associatively learns the spatial structure of the given scalar field.

The Geospatial Encoder computes a symbolic SDR address for a spatial location, effectively a “name” or “word” for each place. The elements or alphabet of this encoding are simply random order activation values of nearby squares, so any more “real” semantic SDR-like activation pattern will do an even better job in computing spatial addresses. We use memories of spatial cues (literally, landmarks), emotional memories, maps, memories of moving within the space, textual directions, and so on to encode and reinforce these representations. This model explains why memory experts often use Memory Palaces (aka the Method of Loci) to remember long sequences of data items. They associate each item (or an imagined, memorable visual proxy) with a location in a very familiar spatial environment. It also explains the existence of “place neurons” in rodent hippocampi – these neurons are each participating in generating a spatial encoding similar in character to the Geospatial Encoder.

### Zooming, Panning and Attention

This is a wonderful model for how we “zoom in” or “zoom out” and perceive a continuously and smoothly varying model of the world. It also models how we can perceive gracefully degrading levels of detail depending on how much time or attention we devote to a perception. In this case, the “encoder” detailed here would be a subcortical structure or a thalamus-gated (attention controlled) input or relay between regions.

If we could find a mechanism in the brain which controls the size and position of a “window” of signals (akin to our variable box of squares), we would have a candidate for our ability to use attention to control spatial resolution and centre of focus. Such a mechanism may automatically arise from preferentially gating neurons at the edges of a “patch”, by virtue of the inhibition mechanism’s ability to smoothly alter the representation as inputs are added or removed.

This mechanism would also explain boundary extension error, in which we “fill out” areas surrounding the physical boundaries of objects and images. As explained in detail in her talk at the Royal Institution, Eleanor Maguire believes that the hippocampus is crucial for both this phenomenon and our ability to navigate in real space. As one of the brain components at the “top” of the hierarchies, the hippocampus may be the place where we can perform the crucial “zooming and panning” operations and where we manipulate spatial SDRs as suggested by the current discovery.

### Implementation Details

The coordinate encoder has a deterministic, O(1), order-independent algorithm for computing both “order” and bit choice. One important issue is that the pseudorandom number generator is Python-specific, and so a Java encoder (which uses a different pseudorandom number generator) will produce completely different answers. The solution is to use the same generator as Python (and numpy), the Mersenne Twister MT19937, which is also the default in numerous other languages.

I believe it would be worth exploring using Perlin noise to generate the order and bit choice values. This would give you a) identical encodings across platforms, b) pseudorandom, uncorrelated values when the noise samples are far enough apart (e.g. when the inputs are integers as in this case), and c) smoothly changing values if you use very small step sizes.

Just one point about changing radius and its effect on the encoding. I’m very confident that the SDR is very robust to changes in radius, due to the sparsity of the SDRs. In other words, the overlap in an SDR at radius r with that at radius r’ (at the same GPS position) will be high, because you are only adding or removing an annulus around the same position (this will be similar to adding or removing a strip of squares when a small position change occurs).
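The robustness to radius changes can be checked with a small experiment. The sketch below is my own illustration, not NuPIC code: `order_of`, `top_squares` and `in_box` are hypothetical helpers, the seeding scheme is an assumption, and the comparison is made on the winning squares rather than on the final bits.

```python
import hashlib
import itertools

import numpy as np


def order_of(square):
    """Reproducible order in [0, 1) for one integer grid square."""
    # Assumption: seed a Mersenne Twister from a hash of the co-ordinates.
    digest = hashlib.sha256(repr(tuple(square)).encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    return np.random.RandomState(seed).random_sample()


def top_squares(centre, radius, w=21):
    """Return the set of w squares with the highest order in the box."""
    axes = [range(c - radius, c + radius + 1) for c in centre]
    box = itertools.product(*axes)
    ranked = sorted(box, key=lambda s: (order_of(s), s), reverse=True)
    return set(ranked[:w])


def in_box(square, centre, radius):
    """True if the square lies inside the box of the given radius."""
    return all(abs(s - c) <= radius for s, c in zip(square, centre))


centre = (50, 50)
small = top_squares(centre, radius=5)
large = top_squares(centre, radius=6)

# A winner at the larger radius that lies inside the smaller box has no
# more competitors there, so it must also be a winner at the smaller radius.
survivors = {s for s in large if in_box(s, centre, 5)}
assert survivors <= small
print(len(small & large), "of 21 winning squares are shared")
```

The only winners that can change when the radius grows are those displaced by squares in the added annulus, which is why the overlap between the two encodings stays high.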

### Links to the Demo and Encoder Code

Chetan’s demo code (which is really comprehensive) is at https://github.com/numenta/nupic.geospatial. The Geospatial Encoder code is at https://github.com/numenta/nupic/blob/master/nupic/encoders/geospatial_coordinate.py and the Coordinate Encoder is at https://github.com/numenta/nupic/blob/master/nupic/encoders/coordinate.py.

• Mar 31 / 2014

## Real Machine Intelligence now Available on Leanpub.com

The first three chapters of my new book, Real Machine Intelligence with Clortex and NuPIC, have just gone live on Leanpub.com. Lean Publishing is a new take on a very old idea – serial publishing – which goes back to Charles Dickens in the 19th century. The idea is to evolve the book based on reader feedback, and to give people a chance to read the whole book before committing to buy.

I’ve also set up a Google Group and Facebook Page. Prior to going live with Clortex, you can read a snapshot of the documentation on its GitHub.io page.

• Nov 24 / 2013

## Book Preview: Chapter 1 – Some Context for Machine Intelligence

The following is the draft of Chapter One of my upcoming book, Real Machine Intelligence with NuPIC – Using Neuroscience to Build Truly Intelligent Machines. The book is intended as an introduction to Jeff Hawkins’ Hierarchical Temporal Memory theory, which seeks to explain in detail the principles underlying the human brain, and the open source software he’s built based on those principles. The book, aimed at the interested non-expert, will be out on Amazon in early December. You might like to read the Introduction first.

This book is about a new theory of how the brain works, and a piece of software which uses this theory to solve real-world problems intelligently in the same way that the brain does. In order to understand both the theory and the software, a little context is useful. That’s the purpose of this chapter.

Before we start, it’s important to scotch a couple of myths which surround both Artificial Intelligence (AI) and Neuroscience.

The first myth is that AI scientists are gradually working towards a future human-style intelligence. They’re not. Despite what they tell us (and they themselves believe), what they are really doing is building computer programs which merely appear to behave in a way which we might consider “smart” or “intelligent” as long as we ignore how they work. Don’t get me wrong, these programs are very important in our understanding of what constitutes intelligence, and they also provide us with huge improvements in understanding the nature and structure of problems solved by brains. The difficulty is that brains simply don’t work the way computer programs do, and there is no reason to believe that human-style intelligence can be approached just by adding more and more complex computer programs.

The other myth is that Neuroscience has figured out how our brains work. Neuroscience has collected an enormous amount of data about the brain, and there is good understanding of some detailed mechanisms here and there. We know (largely) how individual cells in the brain work. We know that certain regions of the brain are responsible for certain functions, for example, because people with damage there exhibit reduced efficiency in particular tasks. And we know to some extent how many of the pieces of the brain are connected together, either by observing damaged brains or by using modern brain-mapping technologies. But there is no systematic understanding which could be called a Theory of Neuroscience, one which explains the working of the brain in detail.

In order to understand how traditional AI does not provide a basis for human-like intelligence, let’s take a look inside a digital computer.

A computer chip contains a few billion very simple components called transistors. Transistors act as a kind of switch, in that they can allow a signal through or not, based on a control signal sent to them. Computer chip, or hardware, designers produce detailed plans for how to combine all these switches to produce the computer you’re reading this on. Some of these transistors are used to produce the logic in the computer, making decisions and performing calculations according to a program written by others: software engineers. The program, along with the data the program uses, is stored in yet more chips – the memory – using transistors which are either on or off. The on or off states of these memory “bits” make up a code which stands for data – whether numbers, text, image pixels, or program codes which tell the computer what instruction to perform at a particular time.

If you open up a computer, you can clearly see the different parts. There’s a big chip (usually with a fan on top to cool it), called the Central Processing Unit or CPU, which is where the hardware logic is housed. Separate from this, a bank of many smaller chips houses the Random Access Memory (RAM) which is the fastest kind of memory storage. There will also be either a hard disk (HD) or a solid state disk (SSD, a kind of chip-based long-term memory, faster than a HD, bigger but slower than RAM) which is where all your bulk data (programs, documents, photos, music and video) is stored for use by the computer. When your computer is running, the CPU is constantly fetching data from the memory and disks, doing some work on it, and writing the results back out to storage.

Computers have clearly changed the world. With these magical devices, we can calculate in one second with a spreadsheet program what would have taken months or years to do by hand. We can fly unflyable aircraft. We can predict the weather 10 days ahead. We can create 3D movies in high definition. We can, using other electronic “senses”, observe the oxygen and sugar consumption inside our own brains, and create a “map” of what’s happening when we think.

We write programs for these computers which are so well thought out that they appear to be “smart” in some way. They look like they’re able to out-think us; they look like they can be faster on the draw. But it turns out that they’re only good at certain things, and they can only really beat us at those things. Sure, they can calculate how to fly through the air and get through anti-aircraft artillery defences, or they can react to other computer programs on the stock exchange. They seem to be “superhuman” in some way, yet the truth is that there is no “skill” involved, no “knowledge” or “understanding” of what they’re doing. Computer programs don’t learn to do these amazing things, and we don’t teach them. We must provide exhaustive lists of absolutely precise instructions, detailing exactly what to do at any moment. The programs may appear to behave intelligently, but internally they are blindly following the scripts we have written for them.

The brain, on the other hand, cannot be programmed, and yet we learn a million things and acquire thousands of skills during our lives. We must be doing it some other way. The key to figuring this out is to look in some detail at how the brain is put together and how this structure creates intelligence. And just like we’ve done with a computer, we will examine how information is represented and processed by the structures in the brain. This examination is the subject of Chapter Two. Meanwhile, let’s have a quick look at some of the efforts people have made to create an “artificial brain” over the past few decades.

Artificial Intelligence is a term which was coined in the mid-1950s, but people have been thinking about building intelligent machines for over two thousand years. This remained in the realm of fantasy and science fiction until the dawn of the computer age, when machines suddenly became available which could provide the computational power needed to build a truly intelligent machine. It is fitting that some of the main ideas about AI came from the same legendary intellects behind the invention of digital computers themselves: Alan Turing and John von Neumann.

Turing, who famously helped to break the Nazi Enigma codes during WWII, theorised about how a machine could be considered intelligent. As a thought experiment, he suggested a test involving a human investigator who is communicating by text with an unknown entity – either another human or a computer running an AI program. If the investigator is unable to tell whether he is talking to a human or not, then Turing considers the computer to have passed his test and must be regarded as “intelligent” by this definition. This became known as the Turing Test and has unfortunately become a kind of Holy Grail for AI researchers for more than sixty years.

Meanwhile, the burgeoning field of AI attracted some very smart people, who all dreamed of soon becoming the designer of a machine one could talk to and which could help one solve real-world problems. All sorts of possibilities seemed within easy reach, and so the researchers often made grand claims about what was “just around the corner” for their projects. For instance, one of the “milestones” would be a computer which could beat the World Chess Champion, a goal which was promised “within 5 years” every year since the mid-50s, and which was only achieved in 1997 using a huge computer and a mixture of “intelligent” and “brute-force” techniques, none of which resembled how Garry Kasparov’s brain worked.

Everyone recognised early on that intelligence at the level of the Turing Test would have to wait, so they began by trying to break things down into simpler, more achievable tasks. Having no clue about how our brains and minds worked as machines, they decided instead to theorise about how to perform some of the tasks which we can perform. Some of the early products included programs which could play Noughts and Crosses (tic-tac-toe) and Draughts (checkers), programs which could “reason” about placing blocks on top of other blocks (in a so-called micro-world), and a program called Eliza which used clever and entertaining tricks to mimic a psychiatrist interviewing a patient.

Working on these problems, developing all these programs, and thinking about intelligence in general has had profound effects beyond Computer Science in the last sixty years. Our understanding of the mind as a kind of computer or information processor is directly based on the knowledge and understanding gained from AI research. We have AI to thank for Noam Chomsky’s foundational Universal Grammar, and the field of Computational Linguistics is now required for anyone wishing to understand linguistics and human language in general. Brain surgeons use the computational model of the brain to identify and assess birth defects, the effects of disease and brain injuries, all in terms of the “functional modules” which might be affected. Cognitive psychology is now one of the basic ways to understand the way that our perceptions and internal processes operate. And the list goes on. Many, many fields have benefited indirectly from the intense work of AI researchers since 1950.

However, traditional AI has failed to live up to even its own expectations. At every turn, it seems that the “last 10%” of the problem is bigger than the first 90%. A lot of AI systems require vast amounts of programmer intelligence and do not genuinely embody any real intelligence themselves. Many such systems are incapable of flexibly responding to new contexts or situations, and they do not learn of their own accord. When they fail, they do not do so in a graceful way like we do, because they are brittle and capable only of working while “on-tracks” in some way. In short, they are nothing like us.

Yet AI researchers kept on going, hoping that some new program or some new technique would crack the code of intelligent machine design. They have built ever-more-complex systems, accumulated enormous databases of information, and employed some of the most powerful hardware available. The triumphs of Deep Blue (beating Kasparov at chess in 1997) and Watson (winning at the Jeopardy! quiz game in 2011) have been the result of combining huge, ultra-fast computers with enormous databases and vast, complex, intricate programs costing tens of millions of dollars. While impressive, neither of these systems can do anything else which could be considered intelligent without reinvesting similar resources in the development of those new programs.

It seems to many that this is leading us away from true machine intelligence, not towards it. Human brains are not running huge, brittle programs, nor consulting vast databases of tabulated information. Our brains are just like those of a mouse, and it seems that we differ from mice only in the size and number of pieces (or regions) of brain tissue, and not in any fundamental way.

It appears very likely that intelligence is produced in the brain by the clever arrangement of brain regions, which appear to organise themselves and learn how to operate intelligently. This can be proven in the lab, when experimenters cut connections, shut down some regions, breed mutants and so on. There is very little argument in Neuroscience that this is how things work. The question then is: how do these regions work in detail? What are they doing with the information they are processing? How do they work together? If we can answer these questions, it is possible that we can both learn how our brains work and build truly intelligent machines.

I believe we can now answer these questions. That’s what this book claims to be about, after all!

• Nov 13 / 2013

## Book Preview: Introduction to “Real Machine Intelligence with NuPIC”

The following is the (draft) Introduction to my upcoming book, Real Machine Intelligence with NuPIC – Using Neuroscience to Build Truly Intelligent Machines. The book is intended as an introduction to Jeff Hawkins’ Hierarchical Temporal Memory theory, which seeks to explain in detail the principles underlying the human brain, and the open source software he’s built based on those principles. The book, aimed at the interested non-expert, will be out on Amazon in early December.

This book is about a true learning machine you can start using today. This is not science fiction, and it’s not some kind of promised technology we’re hoping to see in the near future. It’s already here, ready to download and use. It is already being used commercially to help save energy, predict mechanical breakdowns, and keep computers running on the Internet. It’s also at the centre of a vibrant open source community with growing links to leading-edge academic and industrial research. Based on more than a decade of research and development by Jeff Hawkins and his team at Grok, NuPIC is a system built on the principles of the human brain, a theory called Hierarchical Temporal Memory (or HTM).

NuPIC stands for Numenta Platform for Intelligent Computing. On the face of it, it’s a piece of software you can download for free, do the setup, and start using right away on your own data, to solve your own problems. This book will give you the information you need to do just that. But, as you’ll learn, the software (and its usefulness to you as a product) is only a small part of the story.

NuPIC is, in fact, a working model in software of a developing theory of how the brain works, Hierarchical Temporal Memory. Its design is constrained by what we know of the structure and function of the brain. As with an architect’s miniature model, a spreadsheet in the financial realm, or a CAD system in engineering, we can experiment with and adjust the model in order to gain insights into the system we’re modelling. And, just as with those tools, we can also do useful work, solve real-world problems, and derive value from using them.

And, as with other modelling tools, we can use NuPIC as a touchstone for a growing discussion of the basic theory of what is going on inside the brain. We can compare it with all the facts and understanding from decades of neuroscience research, a body of knowledge which grows daily. We believe that the theories underlying NuPIC are the best candidates for a true understanding of human intelligence, and that NuPIC is already providing compelling evidence that these theories are valid.

This book begins with an overview of how NuPIC fits into the worlds of Artificial Intelligence and Neuroscience. We’ll then delve a little deeper into the theory of the brain which underlies the project, including the key principles which we believe are both necessary and sufficient for intelligence. In Chapter 3, we’ll see how the design of NuPIC corresponds to these principles, and how it works in detail. Chapter 4 describes the NuPIC software at time of writing, as well as its commercial big brother, Grok. Finally, we’ll describe what the near future holds for HTM, NuPIC and Grok, and how you can get involved in this exciting work. The details of how to download and operate NuPIC are found in the Appendices, along with details of how to join the NuPIC mailing list.
