Better Living through Thoughtful Technology

Posts Categorized / Clortex (HTM in Clojure)

• Dec 17 / 2015

New Paper and Talk: Symphonies from Synapses

Just in time for Christmas, I’ve completed this paper on my theory of the brain as a Universal Dynamical Systems Computer, analogous to the Turing Machine as a Universal Symbolic Computer. The world is made of hierarchical complex systems, so our brains have evolved to use the power of coupled dynamical systems to automatically model and interact with the external and internal world. The paper uses results from Applied Maths to show precisely how this can be achieved, and combines that with a concrete design which gives a role to all 6 layers of a region of neocortex.

I gave a talk on this to the HTM Community Meetup in November:

I’d welcome any comments and feedback.

• Jan 02 / 2015

Self-Stabilisation in Hierarchical Temporal Memory

This post was written in response to Jeff Hawkins’ comments on last week’s article on a new Multilayer Model of Neocortex in Hierarchical Temporal Memory (HTM). Jeff expressed concerns about the clarity or correctness of my claim that sublayers in a cortical region act to self-stabilise in the face of unpredicted changes in the world (including changes in top-down feedback from higher regions). This discussion is a companion to an earlier description of the Efficiency of Predicted Sparseness, but goes into much more detail when describing how a non-sparse output from one sublayer is absorbed and processed by downstream sublayers.

In the earlier posts, we described how each sublayer in a region combines context inputs with feedforward inputs to form a sparse, predicted representation of the world in context. When this succeeds perfectly, each column in the sublayer has only a single active cell, and that cell represents the best combination of prediction from context and recognition of the feedforward input. The single-cell-per-column representation occurs when the single cell is sufficiently depolarised by distal (predictive/context) inputs to beat its columnar inhibitory sheath and fire first. If this does not happen, then the sheath fires first, allowing some number of contained pyramidal cells to fire before vertical inhibition reduces the column’s activity to just the one, best-predicted cell.

In order to understand the stabilising effect, we need to zoom in temporally and watch how the potentials evolve in extreme “slow-motion” in which the time steps correspond to individual synaptic events. At this framerate, we can observe the individual neurons’ potentials rising towards firing and the effect of inhibition both vertically and horizontally on the patterns of activation. This level of granularity also allows us to characterise the opportunities for synapses to adapt, which turns out to be crucial for understanding the model.

Synapses grow when there is a temporal correlation between their pre-synaptic inputs and the action potentials of the post-synaptic cell. The more often the cell fires within a short (c. 10ms) window of time after the synapse receives an action potential, the bigger and more receptive the synapse grows. In HTM, we model this with a scalar value we call permanence, which varies between 0.0 and 1.0, and we say that the synapse is connected when its permanence is above a threshold (usually 0.2), otherwise it is disconnected.

The current “official” Cortical Learning Algorithm (or CLA, the detailed computational model in HTM) separates feedforward and predictive stages of processing. A modification of this model (which I call prediction-assisted recognition or paCLA) combines these into a single step involving competition between highly predictive pyramidal cells and their surrounding columnar inhibitory sheaths. Though this has been described in summary form before, I’ll go through it in detail here.

Neural network models generally model a neuron as somehow “combining” a set of inputs to produce an output. This is based on the idea that input signals cause ion currents to flow into the neuron’s cell body, which raises its voltage (depolarises), until it reaches a threshold level and fires (outputs a signal). paCLA also models this idea, with the added complication that there are two separate pathways (proximal and distal) for input signals to be converted into effects on the voltage of the cell. In addition, paCLA treats the effect of the inputs as a rate of change of potential, rather than as a final potential level as found in standard CLA.

Slow-motion Timeline of paCLA

[Note: this section relates to Mathematics of HTM Part I and Part II – see those posts for a full treatment].

Consider a single column of pyramidal cells in a layer of cortex. Along with the set of pyramidal cells $$\{P_1,P_2 .. P_n\}$$, we also model a columnar sheath of inhibitory cells as a single cell $$I$$. All the $$P_i$$ and $$I$$ are provided with the same feedforward input vector $$\mathbf{x}_t$$, and they also have similar (but not necessarily identical) synaptic connection vectors $$\mathbf{c}_{P_i}$$ and $$\mathbf{c}_{I}$$ to those inputs (the bits of $$\mathbf{x}_t$$ are the incoming sensory activation potentials, while bit $$j$$ of a connection vector $$\mathbf{c}$$ is 1 if synapse $$j$$ is connected). The feedforward overlap $$o^{\textrm{ff}}_{P_i}(\mathbf{x}_t) = \mathbf{x}_t \cdot \mathbf{c}_{P_i}$$ is the output of the proximal dendrite of cell $${P_i}$$ (and similarly for cell $$I$$).

In addition, each pyramidal cell (but not the inhibitory sheath) receives signals on its distal dendrites. Each dendrite segment acts separately on its own inputs $$\mathbf{y}_k^{t-1}$$, which come from other neurons in the same layer as well as other sublayers in the region (and from other regions in some cases). When a dendrite segment $$k$$ has a sufficient distal overlap, exceeding a threshold $$\lambda_k$$, the segment emits a dendritic spike of size $$s_k$$. The output of the distal dendrites is then given by:

$$o^{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$

The predictive potential is combined with the overlap score from the feedforward overlap coming from the proximal dendrite to give the total depolarisation rate:

$$d_j = \frac{\partial V_j}{\partial t} = \alpha_j o^{\textrm{ff}}_{P_j} + \beta_j o^{\textrm{pred}}_{P_j}$$

where $$\alpha_j$$ and $$\beta_j$$ are parameters which transform the proximal and distal contributions into a rate of change of potential (and also control the relative effects of feedforward and predictive inputs). For the inhibitory sheath $$I$$, there is only the feedforward component $$\alpha_I o^{\textrm{ff}}_I$$, but we assume this is larger than any of the feedforward contributions $$\alpha_j o^{\textrm{ff}}_{P_j}$$ for the pyramidal cells [cite evidence].

Now, the time a neuron takes to reach firing threshold is inversely proportional to its depolarisation rate. This imposes an ordering of the set $$\{P_1..P_n,I\}$$ according to their (prospective) firing times $$\tau_{P_j} = \gamma_P \frac{1}{d_j}$$ (and $$\tau_I = \gamma_I \frac{1}{d_I}$$).
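The ordering by prospective firing time can be illustrated with a small sketch. The function names, the parameter values for $$\alpha$$, $$\beta$$ and $$\gamma$$, and the sample overlaps are all illustrative assumptions, not values taken from paCLA or any HTM implementation.

```python
def depolarisation_rate(ff_overlap, pred_potential, alpha=1.0, beta=1.0):
    """d = alpha * o_ff + beta * o_pred (a rate of change of potential)."""
    return alpha * ff_overlap + beta * pred_potential


def firing_time(rate, gamma=1.0):
    """tau = gamma / d; a cell with no drive never reaches threshold."""
    return float("inf") if rate <= 0 else gamma / rate


# Hypothetical column: three pyramidal cells and one inhibitory sheath.
# Each entry is (feedforward overlap, predictive potential).
cells = {"P1": (10, 0), "P2": (10, 12), "P3": (9, 3)}
sheath_ff_overlap = 10

times = {name: firing_time(depolarisation_rate(ff, pred))
         for name, (ff, pred) in cells.items()}
# The sheath has no distal input; its alpha is assumed larger than the cells'.
times["I"] = firing_time(depolarisation_rate(sheath_ff_overlap, 0, alpha=1.5))

firing_order = sorted(times, key=times.get)
print(firing_order)  # ['P2', 'I', 'P3', 'P1']
```

Here only the strongly predicted cell `P2` beats the sheath; the other two cells would fire after it unless inhibition reaches them first.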

Formation of the Sparse Distributed Representation (SDR)

Zooming out from the single column to a neighbourhood (or sublayer) $$L_1$$ of columns $$C_m$$, we see that there is a local sequence $$\mathbb{S}$$ in which all the pyramidal cells (and the inhibitory sheaths) would fire if inhibition didn’t take place. The actual sequence of cells which do fire can now be established by taking into account the effects of inhibition.

Let’s partition the sequence as follows:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

where:

1. $$\mathbb{P}^{\textrm{pred}}$$ is the (possibly empty) sequence of pyramidal cells in a highly predictive state, which fire before their inhibitory sheaths (ie $$\mathbb{P}^{\textrm{pred}} = \{P~|~\tau_P < \tau_{I_m}, P \in C_m\}$$);
2. $$\mathbb{I}^{\textrm{pred}}$$ is the sequence of inhibitory sheaths which fire due to triggering by their contained predictively firing neurons in $$\mathbb{P}^{\textrm{pred}}$$ – these cells fire in advance of their feedforward times due to inputs from $$\mathbb{P}^{\textrm{pred}}$$;
3. $$\mathbb{I}^{\textrm{ff}}$$ is the sequence of inhibitory sheaths which fire as a result of feedforward input alone;
4. $$\mathbb{P}^{\textrm{burst}}$$ is the sequence of cells in columns where the inhibitory sheaths have just fired but their vertical inhibition has not had a chance to reach these cells (this is known as bursting) – ie $$\mathbb{P}^{\textrm{burst}} =\{P~|~\tau_P < \tau_{I_m} + \Delta\tau_{\textrm{vert}}, P \in C_m\}$$;
5. Finally, $$\mathbb{I}^{\textrm{spread}}$$ is the sequence of all the other inhibitory sheaths which are triggered by earlier-firing neighbours, which spreads a wave of inhibition imposing sparsity in the neighbourhood.

Note that there may be some overlap in these sequences, depending on the exact sequence of firing and the distances between active columns.

The output of a sublayer is the SDR composed of the pyramidal cells from $$\mathbb{P}^{\textrm{pred}} \parallel \mathbb{P}^{\textrm{burst}}$$ in that order. We say that the sublayer has predicted perfectly if $$\mathbb{P}^{\textrm{burst}} = \emptyset$$ and that the sublayer is bursting otherwise.

The cardinality of the SDR is minimal under perfect prediction, with some columns having a sequence of extra, bursting cells otherwise. The bursting columns represent feedforward inputs which were well recognised (causing their inhibitory sheaths to fire quickly) but less well predicted (no cell was predictive enough to beat the sheath), and the number of cells firing indicates the uncertainty of which prediction corresponds to reality. The actual cells which get to burst are representative of the most plausible contexts for the unexpected input.
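As a rough sketch of the two set definitions above, the following function sorts one column's cells into predictive, bursting and inactive groups using only the two inequalities on firing times. It is a simplification under stated assumptions: it ignores the early triggering of the sheath by predictive cells and the horizontal spread of inhibition, and the names and timing values are hypothetical.

```python
def classify_column(cell_times, sheath_time, vertical_delay):
    """Split one column's cells into predictive, bursting and inactive lists.

    cell_times: dict mapping cell name -> prospective firing time tau_P
    sheath_time: prospective firing time tau_I of the column's sheath
    vertical_delay: time taken for vertical inhibition to reach the cells
    """
    predictive, bursting, inactive = [], [], []
    for name, tau in sorted(cell_times.items(), key=lambda kv: kv[1]):
        if tau < sheath_time:
            predictive.append(name)      # fires before the sheath
        elif tau < sheath_time + vertical_delay:
            bursting.append(name)        # fires before inhibition arrives
        else:
            inactive.append(name)        # inhibited before it can fire
    return predictive, bursting, inactive


# A well-predicted column: one cell beats the sheath.
print(classify_column({"P1": 0.10, "P2": 0.04, "P3": 0.20}, 0.065, 0.03))
# A recognised but unpredicted column: the fastest cells burst.
print(classify_column({"P1": 0.10, "P2": 0.07, "P3": 0.09}, 0.065, 0.03))
```

The order within each list follows firing time, which matches the idea that bursting cells appear in decreasing order of prediction-reality match.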

Transmission and Reception of SDRs

A sublayer $$L_2$$ which receives this $$L_1$$ SDR as input will first see the minimal SDR $$\mathbb{P}^{\textrm{pred}}$$ representing the perfect match of input and prediction, followed by the bursting SDR elements $$\mathbb{P}^{\textrm{burst}}$$ in decreasing order of prediction-reality match.

This favours cells in $$L_2$$ which have learned to respond to this SDR, and even more so for the subset which are also predictive due to their own contextual inputs (this biasing happens regardless of whether the receiving cells are proximally or distally innervated). The more sparse (well-predicted) the incoming SDR, the more sparse the activation of $$L_2$$.

When there is a bursting component in the SDR, this will tend to add significant (or overwhelming) extra signal to the minimal SDR, leading to high probability of a change in the SDR formed by $$L_2$$, because several cells in $$L_2$$ will have a stronger feedforward response to the extra inputs than those which respond to the small number of signals in the minimal SDR.

For example, in software we typically use layers containing 2,048 columns of 32 pyramidal neurons (64K cells), with a minimal column SDR of 40 columns (c. 2%). At perfect prediction, the SDR has 40 cells (0.06%), while total bursting would create an SDR of 1280 cells. In between, the effect is quite uneven, since each bursting column produces several signals, while all non-bursting columns stay at one. Assuming some locality of the mapping between $$L_1$$ and $$L_2$$, this will have dramatic local effects where there is bursting.
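The arithmetic behind these figures is easy to check. The sketch below uses the numbers quoted above; the helper `sdr_size` and its arguments are hypothetical, added only to show how uneven the in-between cases are.

```python
columns = 2048
cells_per_column = 32
active_columns = 40

total_cells = columns * cells_per_column               # 65536
column_sparsity = active_columns / columns             # roughly 0.0195
perfect_cells = active_columns                         # one cell per active column
full_burst_cells = active_columns * cells_per_column   # 1280


def sdr_size(bursting_columns, cells_firing_per_burst):
    """Cells in the SDR when some of the active columns burst."""
    quiet_columns = active_columns - bursting_columns
    return quiet_columns + bursting_columns * cells_firing_per_burst


print(total_cells, perfect_cells, full_burst_cells)    # 65536 40 1280
print(sdr_size(5, 8))                                  # 75
```

In this example, just 5 bursting columns with 8 cells each nearly double the number of signals in the SDR, from 40 to 75.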

The response in $$L_2$$ to bursting in its input will not only be a change in the columnar representation, but may also cause bursting in $$L_2$$ itself if the new state was not well predicted using $$L_2$$’s context. This will cause bursting to propagate downstream, from sublayer to sublayer (including cycles in feedback loops), until some sublayer can stop the cascade either by predicting its input or by causing a change in its external world which indirectly restores predictability.

Since we typically do not see reverberating, self-reinforcing cycles of bursting in neocortex, we must assume that the brain has learned to halt these cascades using some combination of eventual predictive resolution and remediating output from regions. Note that each sublayer has its own version of “output” in this sense – it’s not just the obvious motor output of L5 which can “change the world”. For example, L6 can output a new SDR which it transmits down to lower regions, changing the high-level context imposed on those regions and thus the environment in which they are trying (and failing somewhat) to predict their own inputs. L6 can also respond by altering its influence over thalamic connections, thus mediating or eliminating the source of disturbance. L2/3 and L5 both send SDRs up to higher regions, which may be able to better handle their deviations from predictability. And of course L5 can cause real changes in the world by acting on motor circuits.

How is Self-Stabilisation Learned?

When time is slowed down to the extent we’ve seen in this discussion, it is relatively easy to see how neurons can learn to contribute to self-stabilisation of sparse activation patterns in cortex. Recall the general principle of Hebbian learning in synapses – the more often a synapse receives an input within a short time before its cell fires, the more it grows to respond to that input.

Consider again the sequence of firing neurons in a sublayer:

$$\mathbb{S} = \mathbb{P}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{pred}} \parallel \mathbb{I}^{\textrm{ff}} \parallel \mathbb{P}^{\textrm{burst}} \parallel \mathbb{I}^{\textrm{spread}}$$

This sequence does not include the very many cells in a sublayer which do not fire at all. These cells are either in columns which become active, where they are not fast enough to burst, or (more commonly) in columns inhibited by a spreading wave from active columns. Let’s call this set $$\mathbb{P}^{\textrm{inactive}}$$.

A particular neuron will, at any moment, be a member of one of these sets. How often the cell fires depends on the average amount of time it spends in each set, and on the characteristic firing frequency of cells in that set. Clearly, the highly predictive cells in $$\mathbb{P}^{\textrm{pred}}$$ will have a higher typical firing frequency than those in $$\mathbb{P}^{\textrm{burst}}$$, while those in $$\mathbb{P}^{\textrm{inactive}}$$ have zero frequency when in that set.

Note that the numbers used earlier (65536 cells, 40 cells active in perfect prediction, 1280 in total bursting) mean that the percentage of the time cells are firing on average is massively increased if they are in the predictive population. Bursting cells only fire once following a failure of prediction, with the most predictive of them effectively “winning” and firing if the same input persists.

Some cells will simply be “lucky enough” to find themselves in the most predictive set and will strengthen the synapses which will keep them there. Because of their much higher frequency of firing, these cells will be increasingly hard to dislodge and demote from the predictive state.

Some cells will spend much of their time only bursting. This unstable status will cause a bifurcation among this population. A portion of these cells will simply strengthen the right connections and join the ranks of the sparsely predictive cells (which will eliminate their column from bursting on the current inputs). Others will weaken the optimal connections in favour of some other combination of context and inputs (which will drop them from bursting to inactive on current inputs). The remainder, lacking the ability to improve to predictive and the attraction of an alternative set of inputs, will continue to form part of the short-lived bursting behaviour. In order to compete with inactive cells in the same column, these “metastable” cells will have to have an output which tends to feed back into the same state which led to them bursting in the first place.

Cells which get to fire (either predictively or by bursting) have a further advantage – they can specialise their sensitivity to feedforward inputs given the contexts which caused them to fire, and this will give them an ever-improving chance of beating the inhibitory sheath (which has no context to help it learn). This is another mechanism which will allow cells to graduate from bursting to predictive on a given set of inputs (and context).

Since only active cells have any effect in neocortex, we see that there is an emergent “drive” towards stability and sparsity in a sublayer. Cells, given the opportunity, will graduate up the ladder from inactive to bursting to predictive when presented with the right inputs. Cells which fail to improve will be overtaken by their neighbours in the same column, and demoted back down towards inactive. A cell which has recently started to burst (having been inactive on the same inputs) will be reinforced in that status if its firing gives rise to a transient change in the world which causes its inputs to recur. With enough repetition, a cell will graduate to predictive on its favoured inputs, and will participate in a sparse, stable predictive pattern of activity in the sublayer and its region. The effect of its output will correspondingly change from a transient “restorative” effect to a self-sustaining, self-reinforcing effect.

• Nov 29 / 2014

Mathematics of HTM Part II – Transition Memory

This article is part of a series describing the mathematics of Hierarchical Temporal Memory (HTM), a theory of cortical information processing developed by Jeff Hawkins. In Part One, we saw how a layer of neurons learns to form a Sparse Distributed Representation (SDR) of an input pattern. In this section, we’ll describe the process of learning temporal sequences.

We showed in part one that the HTM model neuron learns to recognise subpatterns of feedforward input on its proximal dendrites. This is somewhat similar to the manner by which a Restricted Boltzmann Machine can learn to represent its input in an unsupervised learning process. One distinguishing feature of HTM is that the evolution of the world over time is a critical aspect of what, and how, the system learns. The premise for this is that objects and processes in the world persist over time, and may only display a portion of their structure at any given moment. By learning to model this evolving revelation of structure, the neocortex can more efficiently recognise and remember objects and concepts in the world.

Distal Dendrites and Prediction

In addition to its one proximal dendrite, a HTM model neuron has a collection of distal (far) dendrites, which gather information from sources other than the feedforward inputs to the layer. In some layers of neocortex, these dendrites combine signals from neurons in the same layer as well as from other layers in the same region, and even receive indirect inputs from neurons in higher regions of cortex. We will describe the structure and function of each of these.

The simplest case involves distal dendrites which gather signals from neurons within the same layer.

In Part One, we showed that a layer of $$N$$ neurons converted an input vector $$\mathbf x \in \mathbb{B}^{n_{\textrm{ff}}}$$ into a SDR $$\mathbf{y}_{\textrm{SDR}} \in \mathbb{B}^{N}$$, with length $$\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N$$, where the sparsity $$s$$ is usually of the order of 2% ($$N$$ is typically 2048, so the SDR $$\mathbf{y}_{\textrm{SDR}}$$ will have 40 active neurons).

The layer of HTM neurons can now be extended to treat its own activation pattern as a separate and complementary input for the next timestep. This is done using a collection of distal dendrite segments, which each receive as input the signals from other neurons in the layer itself. Unlike the proximal dendrite, which transmits signals directly to the neuron, each distal dendrite acts as an active coincidence detector, firing only when it receives enough signals to exceed its individual threshold.

We proceed with the analysis in a manner analogous to the earlier discussion. The input to the distal dendrite segment $$k$$ at time $$t$$ is a sample of the bit vector $$\mathbf{y}_{\textrm{SDR}}^{(t-1)}$$. We have $$n_{ds}$$ distal synapses per segment, a permanence vector $$\mathbf{p}_k \in [0,1]^{n_{ds}}$$ and a synapse threshold vector $$\vec{\theta}_k \in [0,1]^{n_{ds}}$$, where typically $$\theta_i = \theta = 0.2$$ for all synapses.

Following the process for proximal dendrites, we get the distal segment’s connection vector $$\mathbf{c}_k$$:

$$c_{k,i}=(1 + sgn(p_{k,i}-\theta_{k,i}))/2$$

The input for segment $$k$$ is the vector $$\mathbf{y}_k^{(t-1)} = \phi_k(\mathbf{y}_{\textrm{SDR}}^{(t-1)})$$ formed by the projection $$\phi_k:\lbrace{0,1}\rbrace^{N-1}\rightarrow\lbrace{0,1}\rbrace^{n_{ds}}$$ from the SDR to the subspace of the segment. There are $${N-1}\choose{n_{ds}}$$ such projections (there are no connections from a neuron to itself, so there are $$N-1$$ to choose from).

The overlap of the segment for a given $$\mathbf{y}_{\textrm{SDR}}^{(t-1)}$$ is the dot product $$o_k^t = \mathbf{c}_k\cdot\mathbf{y}_k^{(t-1)}$$. If this overlap exceeds the threshold $$\lambda_k$$ of the segment, the segment is active and sends a dendritic spike of size $$s_k$$ to the neuron’s cell body.

This process takes place before the processing of the feedforward input, which allows the layer to combine contextual knowledge of recent activity with recognition of the incoming feedforward signals. In order to facilitate this, we will change the algorithm for Pattern Memory as follows.

Each neuron begins a timestep $$t$$ by performing the above processing on its $${n_{\textrm{dd}}}$$ distal dendrites. This results in some number $$0\ldots{n_{\textrm{dd}}}$$ of segments becoming active and sending spikes to the neuron. The total predictive activation potential is given by:

$$o_{\textrm{pred}}=\sum\limits_{o_k^{t} \ge \lambda_k}{s_k}$$
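A minimal sketch of this step, assuming each segment is stored as a tuple of permanences, threshold $$\lambda_k$$ and spike size $$s_k$$ (a hypothetical layout chosen for illustration). A synapse is treated as connected when its permanence is strictly above $$\theta$$, and a segment as active when its overlap is at least $$\lambda_k$$, following the condition under the sum.

```python
def segment_overlap(permanences, inputs, theta=0.2):
    """Count connected synapses (permanence > theta) whose input bit is 1."""
    return sum(1 for p, y in zip(permanences, inputs) if p > theta and y == 1)


def predictive_potential(segments, inputs_per_segment, theta=0.2):
    """Sum the spike sizes s_k of all segments whose overlap reaches lambda_k.

    segments: list of (permanences, lambda_k, s_k) tuples
    inputs_per_segment: list of bit lists y_k^(t-1), one per segment
    """
    total = 0.0
    for (perms, lam, spike), y in zip(segments, inputs_per_segment):
        if segment_overlap(perms, y, theta) >= lam:
            total += spike
    return total


# Two hypothetical segments with four synapses each.
segments = [
    ([0.30, 0.10, 0.50, 0.25], 2, 1.0),   # overlap 2 -> active
    ([0.05, 0.40, 0.10, 0.10], 2, 1.0),   # overlap 1 -> silent
]
inputs = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(predictive_potential(segments, inputs))  # 1.0
```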

The predictive potential is combined with the overlap score from the feedforward overlap coming from the proximal dendrite to give the total activation potential:

$$a_j^t=\alpha_j o_{\textrm{ff},j} + \beta_j o_{\textrm{pred},j}$$

and these $$a_j$$ potentials are used to choose the top neurons, forming the SDR $$Y_{\textrm{SDR}}$$ at time $$t$$. The mixing factors $$\alpha_j$$ and $$\beta_j$$ are design parameters of the simulation.
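To see how prediction can change which neurons are chosen, here is a small sketch using a single shared $$\alpha$$ and $$\beta$$ for all neurons and a simple top-$$n$$ selection. The function name, the tie-breaking by index and the sample values are assumptions for illustration.

```python
def choose_sdr(ff_overlaps, pred_potentials, n_active, alpha=1.0, beta=1.0):
    """Return the sorted indices of the n_active neurons with the highest a_j."""
    activation = [alpha * ff + beta * pr
                  for ff, pr in zip(ff_overlaps, pred_potentials)]
    # Rank by descending activation; ties are broken by lower index.
    ranked = sorted(range(len(activation)), key=lambda j: (-activation[j], j))
    return sorted(ranked[:n_active])


ff = [5, 7, 7, 3, 6]
pred = [0, 0, 2, 0, 4]

print(choose_sdr(ff, [0] * 5, n_active=2))  # [1, 2]  feedforward only
print(choose_sdr(ff, pred, n_active=2))     # [2, 4]  prediction-assisted
```

Neuron 4 has only a moderate feedforward overlap, but its predictive potential lifts it into the SDR in place of neuron 1.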

Learning Predictions

We use a very similar learning rule for distal dendrite segments as we did for the feedforward inputs:

$$p_i^{(t+1)} = \begin{cases} (1+\sigma_{inc})p_i^{(t)} & \text {if cell j active, segment k active, synapse i active} \\ (1-\sigma_{dec})p_i^{(t)} & \text {if cell j active, segment k active, synapse i not active} \\ p_i^{(t)} & \text{otherwise} \\ \end{cases}$$

Again, this reinforces synapses which contribute to activity of the cell, and decreases the contribution of synapses which don’t. A boosting rule, similar to that for proximal synapses, allows poorly performing distal connections to improve until they are good enough to use the main rule.
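The main rule can be sketched as follows. This assumes that "synapse active" means the synapse received a 1 on its input, and it clips permanences at 1.0, which the formula itself does not state. The rate values are placeholders, and boosting is left out.

```python
def update_segment(permanences, synapse_inputs, cell_active, segment_active,
                   sigma_inc=0.1, sigma_dec=0.05):
    """Apply the multiplicative rule to one distal segment's permanences."""
    if not (cell_active and segment_active):
        return list(permanences)               # "otherwise" case: unchanged
    updated = []
    for p, y in zip(permanences, synapse_inputs):
        if y == 1:
            p = min(1.0, (1 + sigma_inc) * p)  # clipping is an assumption
        else:
            p = (1 - sigma_dec) * p
        updated.append(p)
    return updated


print(update_segment([0.5, 0.2], [1, 0], cell_active=True, segment_active=True))
print(update_segment([0.5, 0.2], [1, 0], cell_active=True, segment_active=False))
```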

Interpretation

We can now view the layer of neurons as forming a number of representations at each timestep. The field of predictive potentials $$o_{\textrm{pred},j}$$ can be viewed as a map of the layer’s confidence in its prediction of the next input. The field of feedforward potentials can be viewed as a map of the layer’s recognition of current reality. Combined, these maps allow for prediction-assisted recognition, which, in the presence of temporal correlations between sensory inputs, will improve the recognition and representation significantly.

We can quantify the properties of the predictions formed by such a layer in terms of the mutual information between the SDRs at time $$t$$ and $$t+1$$. I intend to provide this analysis as soon as possible, and I’d appreciate the kind reader’s assistance if she could point me to papers which might be of help.

A layer of neurons connected as described here is a Transition Memory, and is a kind of first-order memory of temporally correlated transitions between sensory patterns. This kind of memory can only learn one-step transitions, because the SDR is formed only by combining potentials one timestep in the past with current inputs.

Since the neocortex clearly learns to identify and model much longer sequences, we need to modify our layer significantly in order to construct a system which can learn high-order sequences. This is the subject of the next part of this series.

Note: For brevity, I’ve omitted the matrix treatment of the above. See Part One for how this is done for Pattern Memory; the extension to Transition Memory is simple but somewhat arduous.

• Nov 28 / 2014

Mathematics of Hierarchical Temporal Memory

This article describes some of the mathematics underlying the theory and implementations of Jeff Hawkins’ Hierarchical Temporal Memory (HTM), which seeks to explain how the neocortex processes information and forms models of the world.

Note: Part II: Transition Memory is now available.

The HTM Model Neuron – Pattern Memory (aka Spatial Pooling)

We’ll illustrate the mathematics of HTM by describing the simplest operation in HTM’s Cortical Learning Algorithm: Pattern Memory, also known as Spatial Pooling, which forms a Sparse Distributed Representation from a binary input vector. We begin with a layer (a 1- or 2-dimensional array) of single neurons, which will form a pattern of activity aimed at efficiently representing the input vectors.

Feedforward Processing on Proximal Dendrites

The HTM model neuron has a single proximal dendrite, which is used to process and recognise feedforward or afferent inputs to the neuron. We model the entire feedforward input to a cortical layer as a bit vector $${\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}$$, where $$n_{\textrm{ff}}$$ is the width of the input.

The dendrite is composed of $$n_s$$ synapses which each act as a binary gate for a single bit in the input vector. Each synapse has a permanence $$p_i\in{[0,1]}$$ which represents the size and efficiency of the dendritic spine and synaptic junction. The synapse will transmit a 1-bit (or on-bit) if the permanence exceeds a threshold $$\theta_i$$ (often a global constant $$\theta_i = \theta = 0.2$$). When this is true, we say the synapse is connected.

Each neuron samples $$n_s$$ bits from the $$n_{\textrm{ff}}$$ feedforward inputs, and so there are $${n_{\textrm{ff}}}\choose{n_{s}}$$ possible choices of input for a single neuron. A single proximal dendrite represents a projection $$\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}\rightarrow\lbrace{0,1}\rbrace^{n_s}$$, so a population of neurons corresponds to a set of subspaces of the sensory space. Each dendrite has an input vector $${\mathbf x}_j=\pi_j({\mathbf x}_{\textrm{ff}})$$ which is the projection of the entire input into this neuron’s subspace.

A synapse is connected if its permanence $$p_i$$ exceeds its threshold $$\theta_i$$. If we subtract $${\mathbf p}-{\vec\theta}$$, take the elementwise sign of the result, and map to $$\lbrace{0,1}\rbrace$$, we derive the binary connection vector $${\mathbf c}_j$$ for the dendrite. Thus:

$$c_i=(1 + sgn(p_i-\theta_i))/2$$

The dot product $$o_j({\mathbf x})={\mathbf c}_j\cdot{\mathbf x}_j$$ now represents the feedforward overlap of the neuron with the input, ie the number of connected synapses which have an incoming activation potential. Later, we’ll see how this number is used in the neuron’s processing.
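A tiny worked example may help here. The sketch below represents the projection $$\pi_j$$ as a list of sampled input indices (an assumed representation) and uses made-up permanences. A synapse counts as connected when its permanence is strictly above $$\theta$$.

```python
def connection_vector(permanences, theta=0.2):
    """c_i = 1 where the permanence exceeds the threshold, else 0."""
    return [1 if p > theta else 0 for p in permanences]


def project(x_ff, input_indices):
    """pi_j: pick out the bits of x_ff that this dendrite samples."""
    return [x_ff[i] for i in input_indices]


def overlap(c, x_j):
    """o_j = c_j . x_j, the count of connected synapses with an on-bit."""
    return sum(ci * xi for ci, xi in zip(c, x_j))


x_ff = [1, 0, 1, 1, 0, 0, 1, 0]          # hypothetical 8-bit input
input_indices = [0, 2, 3, 5]             # the n_s = 4 bits this neuron samples
permanences = [0.35, 0.10, 0.21, 0.60]   # one permanence per synapse

c_j = connection_vector(permanences)     # [1, 0, 1, 1]
x_j = project(x_ff, input_indices)       # [1, 1, 1, 0]
print(overlap(c_j, x_j))                 # 2
```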

The elementwise product $${\mathbf o}_j={\mathbf c}_j\odot{\mathbf x}_j$$ is the vector in the neuron’s subspace which represents the input vector $${\mathbf x}_{\textrm{ff}}$$ as “seen” by this neuron. This is known as the overlap vector. The length $$o_j = \lVert{\mathbf o}_j\rVert_{\ell_1}$$ of this vector corresponds to the extent to which the neuron recognises the input, and the direction (in the neuron’s subspace) is that vector which has on-bits shared by both the connection vector and the input.

If we project this vector back into the input space, the result $$\mathbf{\hat{x}}_j =\pi^{-1}({\mathbf o}_j)$$ is this neuron’s approximation of the part of the input vector which this neuron matches. If we add a set of such vectors, we will form an increasingly close approximation to the original input vector as we choose more and more neurons to collectively represent it.

Sparse Distributed Representations (SDRs)

We now show how a layer of neurons transforms an input vector into a sparse representation. From the above description, every neuron is producing an estimate $$\mathbf{\hat{x}}_j$$ of the input $${\mathbf x}_{\textrm{ff}}$$, with length $$o_j\ll n_{\textrm{ff}}$$ reflecting how well the neuron represents or recognises the input. We form a sparse representation of the input by choosing a set $$Y_{\textrm{SDR}}$$ of the top $$n_{\textrm{SDR}}=sN$$ neurons, where $$N$$ is the number of neurons in the layer, and $$s$$ is the chosen sparsity we wish to impose (typically $$s=0.02=2\%$$).

The algorithm for choosing the top $$n_{\textrm{SDR}}$$ neurons may vary. In neocortex, this is achieved using a mechanism involving cascading inhibition: a cell firing quickly (because it depolarises quickly due to its input) activates nearby inhibitory cells, which shut down neighbouring excitatory cells, and also nearby inhibitory cells, which spread the inhibition outwards. This type of local inhibition can also be used in software simulations, but it is expensive and is only used where the design involves spatial topology (ie where the semantics of the data is to be reflected in the position of the neurons). A more efficient global inhibition algorithm – simply choosing the top $$n_{\textrm{SDR}}$$ neurons by their depolarisation values – is often used in practice.

If we form a bit vector $${\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N\textrm{ where } y_j = 1 \Leftrightarrow j \in Y_{\textrm{SDR}}$$, we have a function which maps an input $${\mathbf x}_{\textrm{ff}}\in\lbrace{0,1}\rbrace^{n_{\textrm{ff}}}$$ to a sparse output $${\mathbf y}_{\textrm{SDR}}\in\lbrace{0,1}\rbrace^N$$, where the length of each output vector is $$\lVert{\mathbf y}_{\textrm{SDR}}\rVert_{\ell_1}=sN \ll N$$.

The reverse mapping or estimate of the input vector by the set $$Y_{\textrm{SDR}}$$ of neurons in the SDR is given by the sum:

$$\mathbf{\hat{x}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf o}_j)} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j\odot{\mathbf x}_j)}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j \odot \pi_j({\mathbf x}_{\textrm{ff}}))}= \sum\limits_{j \in Y_{\textrm{SDR}}}{\pi_j^{-1}({\mathbf c}_j) \odot {\mathbf x}_{\textrm{ff}}}$$
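The selection and reconstruction steps can be sketched together on a toy layer. Everything here is illustrative: four neurons, a hand-written set of sampled indices and connection vectors, and global inhibition with ties broken by index. A sparsity of 2 out of 4 is far higher than the 2% discussed above, and is used only to keep the example small.

```python
def neuron_estimate(x_ff, input_indices, connections):
    """x_hat_j: the on-bits of x_ff seen through this neuron's connected synapses."""
    estimate = [0] * len(x_ff)
    for idx, c in zip(input_indices, connections):
        estimate[idx] = c * x_ff[idx]
    return estimate


def global_inhibition(overlaps, n_sdr):
    """Sorted indices of the n_sdr neurons with the highest overlap."""
    ranked = sorted(range(len(overlaps)), key=lambda j: (-overlaps[j], j))
    return sorted(ranked[:n_sdr])


x_ff = [1, 0, 1, 1, 0, 0, 1, 0]
# Hypothetical layer: (sampled input indices, connection vector) per neuron.
layer = [
    ([0, 2, 3, 5], [1, 0, 1, 1]),
    ([1, 4, 6, 7], [1, 1, 1, 0]),
    ([0, 1, 2, 6], [1, 1, 1, 1]),
    ([3, 4, 5, 7], [0, 1, 1, 1]),
]

estimates = [neuron_estimate(x_ff, idx, c) for idx, c in layer]
overlaps = [sum(e) for e in estimates]       # o_j is the L1 length of x_hat_j
sdr = global_inhibition(overlaps, n_sdr=2)

# Sum the partial estimates of the neurons in the SDR.
x_hat = [sum(estimates[j][i] for j in sdr) for i in range(len(x_ff))]
print(overlaps, sdr, x_hat)
# [2, 1, 3, 0] [0, 2] [2, 0, 1, 1, 0, 0, 1, 0]
```

Note that the summed estimate is a vector of counts rather than bits: input bit 0 is seen by both chosen neurons, so it appears as 2.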

Matrix Form

The above can be represented straightforwardly in matrix form. The projection $$\pi_j:\lbrace{0,1}\rbrace^{n_{\textrm{ff}}} \rightarrow\lbrace{0,1}\rbrace^{n_s}$$ can be represented as a matrix $$\Pi_j \in \lbrace{0,1}\rbrace^{{n_s} \times\ n_{\textrm{ff}}}$$.

Alternatively, we can stay in the input space $$\mathbb{B}^{n_{\textrm{ff}}}$$, and model $$\pi_j$$ as a vector $$\vec\pi_j =\pi_j^{-1}(\mathbf 1_{n_s})$$, ie where $$\pi_{j,i} = 1 \Leftrightarrow (\pi_j^{-1}(\mathbf 1_{n_s}))_i = 1$$.

The elementwise product $$\vec{x_j} =\pi_j^{-1}(\mathbf x_{j}) = \vec{\pi_j} \odot {\mathbf x_{\textrm{ff}}}$$ represents the neuron’s view of the input vector $$\mathbf x_{\textrm{ff}}$$.

We can similarly project the connection vector for the dendrite by elementwise multiplication: $$\vec{c_j} =\pi_j^{-1}(\mathbf c_{j})$$, and thus $$\vec{o_j}(\mathbf x_{\textrm{ff}}) = \vec{c_j} \odot \mathbf{x}_{\textrm{ff}}$$ is the overlap vector projected back into $$\mathbb{B}^{n_{\textrm{ff}}}$$, and the dot product $$o_j(\mathbf x_{\textrm{ff}}) = \vec{c_j} \cdot \mathbf{x}_{\textrm{ff}}$$ gives the same overlap score for the neuron given $$\mathbf x_{\textrm{ff}}$$ as input. Note that $$\vec{o_j}(\mathbf x_{\textrm{ff}}) =\mathbf{\hat{x}}_j$$, the partial estimate of the input produced by neuron $$j$$.
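As a quick check that working in the input space gives the same overlap score as working in the neuron’s subspace, here is a sketch in which $$\pi_j^{-1}$$ is modelled by a hypothetical `lift` helper that writes subspace values back to their input positions. The vectors are made up for illustration.

```python
def lift(values, input_indices, n_ff):
    """pi_j^{-1}: place subspace values back at their input positions."""
    lifted = [0] * n_ff
    for idx, v in zip(input_indices, values):
        lifted[idx] = v
    return lifted


x_ff = [1, 0, 1, 1, 0, 0, 1, 0]
input_indices = [0, 2, 3, 5]     # the bits sampled by neuron j
c_j = [1, 0, 1, 1]               # connection vector in the subspace

pi_vec = lift([1] * len(input_indices), input_indices, len(x_ff))
c_vec = lift(c_j, input_indices, len(x_ff))

x_view = [p * x for p, x in zip(pi_vec, x_ff)]   # neuron's view of the input
o_vec = [c * x for c, x in zip(c_vec, x_ff)]     # overlap vector in input space

o_input_space = sum(o_vec)
o_subspace = sum(c * x_ff[i] for c, i in zip(c_j, input_indices))
print(x_view, o_vec, o_input_space, o_subspace)
```

Both routes give an overlap of 2 for this example.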

We can reconstruct the estimate of the input by an SDR of neurons $$Y_{\textrm{SDR}}$$:

$$\mathbf{\hat{x}}_{\textrm{SDR}} = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\mathbf{\hat{x}}}_j} = \sum\limits_{j \in Y_{\textrm{SDR}}}{\vec o}_j = \sum\limits_{j \in Y_{\textrm{SDR}}}{{\vec c}_j\odot{\mathbf x_{\textrm{ff}}}} = {\mathbf C}_{\textrm{SDR}}{\mathbf x_{\textrm{ff}}}$$

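where $${\mathbf C}_{\textrm{SDR}} = \textrm{diag}\big(\sum_{j \in Y_{\textrm{SDR}}}{\vec c}_j\big)$$ is the diagonal matrix formed from the sum of the $${\vec c}_j$$ for $$j \in Y_{\textrm{SDR}}$$. This works because $$\sum_j {\vec c}_j \odot {\mathbf x_{\textrm{ff}}} = \big(\sum_j {\vec c}_j\big) \odot {\mathbf x_{\textrm{ff}}}$$.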
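
As a rough illustration of the overlap and reconstruction formulas above, here is a minimal Clojure sketch. It assumes binary vectors are plain Clojure vectors of 0s and 1s, already projected into the input space. The names `hadamard`, `overlap-vector`, `overlap-score` and `reconstruct` are hypothetical helpers invented for this example; they are not taken from Clortex or NuPIC.

```clojure
;; Illustrative sketch only: binary vectors are plain Clojure vectors of 0/1.
;; All names here are hypothetical, not Clortex or NuPIC API.

(defn hadamard
  "Elementwise product of two equal-length vectors."
  [a b]
  (mapv * a b))

(defn overlap-vector
  "Overlap vector of one neuron in input space: c-j (.) x-ff."
  [c-j x-ff]
  (hadamard c-j x-ff))

(defn overlap-score
  "Overlap score of one neuron: the dot product c-j . x-ff."
  [c-j x-ff]
  (reduce + (overlap-vector c-j x-ff)))

(defn reconstruct
  "Estimate of the input: the sum of the overlap vectors of the
   active neurons. Assumes at least one active neuron."
  [active-cs x-ff]
  (apply mapv + (map #(overlap-vector % x-ff) active-cs)))

;; A tiny made-up example with six input bits and two active neurons.
(def x-ff [1 0 1 1 0 1])
(def c-1  [1 1 0 1 0 0])
(def c-2  [0 0 1 1 0 0])

(overlap-score c-1 x-ff)      ; 2
(reconstruct [c-1 c-2] x-ff)  ; [1 0 1 2 0 0]
```

Note that the estimate is a vector of counts rather than bits: the 2 in the fourth position says that two active neurons both have a connected synapse on that active input bit.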

Optimisation Problem

We can now measure the distance between the input vector $$\mathbf x_{\textrm{ff}}$$ and the reconstructed estimate $$\mathbf{\hat{x}}_{\textrm{SDR}}$$ by taking a norm of the difference. Using this, we can frame learning in HTM as an optimisation problem. We wish to minimise the estimation error over all inputs to the layer. Given a set of (usually random) projection vectors $$\vec\pi_j$$ for the N neurons, the parameters of the model are the permanence vectors $$\vec{p}_j$$, which we adjust using a simple Hebbian update model.

The update model for the permanence of a synapse $$p_i$$ on neuron $$j$$ is:

$$p_i^{(t+1)} = \begin{cases} (1+\delta_{inc})p_i^{(t)} & \text{if } j \in Y_{\textrm{SDR}},\ (\mathbf x_j)_i=1 \text{ and } p_i^{(t)} \ge \theta_i \\ (1-\delta_{dec})p_i^{(t)} & \text{if } j \in Y_{\textrm{SDR}} \text{ and } \big((\mathbf x_j)_i=0 \text{ or } p_i^{(t)} < \theta_i\big) \\ p_i^{(t)} & \text{otherwise} \end{cases}$$

This update rule increases the permanence of active synapses, those that were connected to an active input when the cell became active, and decreases those which were either disconnected or received a zero when the cell fired. In addition to this rule, an external process gently boosts synapses on cells which either have a lower than target rate of activation, or a lower than target average overlap score.
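
The three cases of the update rule can be sketched in Clojure as follows. This is only a sketch under some assumptions: it uses a single threshold `theta` for all synapses (the formula allows a per-synapse $$\theta_i$$), it leaves out the external boosting process, and the function names and parameter values are made up for the example.

```clojure
;; Illustrative sketch of the permanence update rule.
;; Names and parameter values are assumptions, not Clortex or NuPIC API.

(defn update-permanence
  "Next permanence of one synapse.
   p        current permanence
   active?  true when the neuron is in the SDR
   x-i      the input bit (0 or 1) seen by this synapse
   theta    connection threshold (shared by all synapses here)
   d-inc    multiplicative increment
   d-dec    multiplicative decrement"
  [p active? x-i theta d-inc d-dec]
  (cond
    (not active?)                p                  ; neuron not in the SDR: no change
    (and (= 1 x-i) (>= p theta)) (* (+ 1 d-inc) p)  ; connected synapse on an active input
    :else                        (* (- 1 d-dec) p))) ; disconnected, or input was zero

(defn update-neuron
  "Apply the rule to every synapse of one neuron."
  [perms inputs active? theta d-inc d-dec]
  (mapv #(update-permanence %1 active? %2 theta d-inc d-dec) perms inputs))

;; Values chosen as exact binary fractions so the arithmetic is easy to check.
(update-neuron [0.5 0.5 0.125] [1 0 1] true 0.25 0.5 0.25)
;; [0.75 0.375 0.09375]
```

The first synapse is connected and saw a 1, so it grows; the second saw a 0 and the third is below threshold, so both shrink. An inactive neuron is returned unchanged.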

I do not yet have a proof that this optimisation problem converges, nor do I know whether it can be represented as a convex optimisation problem. I am confident such a proof can be found. Perhaps a kind reader who is more familiar with problems framed like this would be able to confirm this. I’ll update this post with more functions from HTM in coming weeks.

Note: Part II: Transition Memory is now available.

• May 01 / 2014
Clortex (HTM in Clojure)

Clortex Pre-Alpha Now Public

This is one of a series of posts on my experiences developing Clortex in Clojure, a new dialect of LISP which runs on the Java Virtual Machine. Clortex is a re-implementation of Numenta’s NuPIC, based on Jeff Hawkins’ theories of computational neuroscience. You can read my in-progress book by clicking on the links to the right.

Until today, I’ve been developing Clortex using a private repo on Github. While far from complete, I feel that Clortex is now at the stage where people can take a look at it, give feedback on the design, and help shape the completion of the first alpha release over the coming weeks.

I’ll be hacking on Clortex this weekend (May 3rd-4th) at the NuPIC Spring Hackathon in San José. Please join us on the live feeds and stay in touch using the various Social Media tools.

WARNING: Clortex is not even at the alpha stage yet. I’ll post instructions over the next few days which will allow you to get some visualisations running.

You can find Clortex on Github at https://github.com/fergalbyrne/clortex

A new kind of computing requires a new kind of software design.
Hierarchical Temporal Memory (HTM) and the Cortical Learning Algorithm (CLA) represent a new kind of computing, in which many, many millions of tiny, simple, unreliable components interact in a massively parallel, emergent choreography to produce what we would recognise as intelligence.
Jeff Hawkins and his company, Numenta, have built a system called NuPIC using the principles of the neocortex. Clortex is a reimagining of CLA, using modern software design ideas to unleash the potential of the theory.
Clortex’ design is all about turning constraints into synergies, using the expressive power and hygiene of Clojure and its immutable data structures, the unique characteristics of the Datomic database system, and the scalability and portability characteristics of the Java Virtual Machine. Clortex will run on hosts as small as a Raspberry Pi, and a version will soon run in browsers and phones, yet it will scale layers and hierarchies across huge clusters to deliver real power and test the limits of HTM and CLA in production use.
How can you get involved?
Clortex is just part of a growing effort to realise the potential of Machine Intelligence based on the principles of the brain.
• Visit the Numenta.org site for videos, white papers, details of the NuPIC mailing list, wikis, etc.
• Have a look at (and optionally pre-purchase) my Leanpub.com book: Real Machine Intelligence with Clortex and NuPIC.
• We’ll be launching an Indiegogo campaign during May 2014 to fund completion of Clortex. Please let us know if you’re interested in supporting us when we launch.
• Apr 29 / 2014

Clortex: The Big Ideas

As you might know, I’ve been working away on “yet another implementation of Hierarchical Temporal Memory/Cortical Learning Algorithm” for the last few months. While that would have been nice as a hobby project, I see what I’m doing as something more than that. Working with NuPIC over the last year, I’ve gradually come to a realisation which seems kind of obvious, but remains uncomfortable:

A new kind of computing requires a new kind of software design.

HTM and CLA represent a new kind of computing, in which many, many millions of tiny, simple, unreliable components interact in a massively parallel, emergent choreography to produce what we would recognise as intelligence. This sounds like just another Neural Net theory, which is truly unfortunate, because that comparison has several consequences.

CLA – Not Another Neural Network

The first consequence relates to human comprehension. Most people who are interested in (or work in) Machine Learning and AI are familiar with the various flavours of Neural Nets which have been developed and studied over the decades. We now know that a big enough NN is the equivalent of a Turing Universal Computer, and we have a lot of what I would call “cultural knowledge” about NNs and how they work. Importantly, we have developed a set of mathematical and computational tools to work with NNs and build them inside traditional computing architectures.

This foreknowledge is perhaps the biggest obstacle facing HTM-CLA right now. It is very difficult to shake off the idea that what you’re looking at in CLA is some new twist on a familiar theme. There are neurons, with synapses connecting them; the synapses have settings which seem to resemble “weights” in NNs; the neurons combine their inputs and produce an output, and so on. The problem is not with CLA: it’s that the NN people got to use these names first and have hijacked (and overwritten) their original neuroscientific meanings.

CLA differs from NNs at every single level of granularity. And the differences are not subtle; they are fundamentally different operating concepts. Furthermore, the differences compound to create a whole idea which paradoxically continues to resemble a Neural Net but which becomes increasingly different in what it does as you scale out to layers, regions and hierarchies.

It’s best to start with a simplified idea of what a traditional NN neuron is. Essentially, it’s a pure function of its inputs. Each neuron has a local set of weights which are dot-producted with the inputs, run through some kind of “threshold filter” and produce a single scalar output. A NN is a graph of these neurons, usually organised into a layered structure with a notion of “feedforward” and “feedback” directions. Some NNs have horizontal and cyclical connections, giving rise to some “memory” or “temporal” features; such NNs are known as Recurrent Neural Networks.

In order to produce a useful NN, you must train it. This involves using a training set of input data and a learning algorithm which adjusts the weights so as to approach a situation where the collective effect of all the pure functions is the desired transformation from inputs to outputs.

In sharp contrast, CLA models each neuron as a state machine. The “output” of a neuron is not a pure function of its feedforward inputs, but incorporates the past history of its own and other neurons’ activity. A neuron in CLA effectively combines a learned probability distribution of its feedforward inputs (as in a NN) with another learned probability distribution in the temporal domain (ie transitions between input states). Further, instead of being independent (as in NNs), the “outputs” of a set of neurons are dictated using an online competitive process. Finally, the output of a layer of CLA neurons is a Sparse Distributed Representation, which can only be interpreted as a whole, a collective, or a holographic encoding of what the layer is representing.
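
To make the contrast concrete, here is a toy Clojure sketch. It is deliberately simplistic and makes several assumptions: the state keys, the overlap thresholds and the output labels are all invented for illustration, and it does not model the competitive process or the SDR. It only shows a pure-function unit next to a unit whose output depends on remembered state.

```clojure
;; Toy contrast only; neither function is taken from NuPIC or Clortex.

(defn nn-neuron
  "A traditional NN unit: a pure function of its current inputs
   (dot product with the weights, then a threshold)."
  [weights threshold inputs]
  (if (>= (reduce + (map * weights inputs)) threshold) 1 0))

(defn step-neuron
  "A toy state-machine unit. The state remembers whether the unit was
   predicted by context on the previous step, so the output depends on
   the feedforward input AND on that history."
  [state {:keys [ff-overlap context-overlap]}]
  (let [{:keys [ff-threshold context-threshold predicted?]} state
        recognised? (>= ff-overlap ff-threshold)]
    (assoc state
           :output     (cond
                         (and recognised? predicted?) :active-predicted
                         recognised?                  :active-unpredicted
                         :else                        :inactive)
           ;; remember, for the next step, whether context now predicts us
           :predicted? (>= context-overlap context-threshold))))

(def neuron {:ff-threshold 2 :context-threshold 1 :predicted? false})

;; The same feedforward overlap arrives twice; only the history differs.
(def inputs [{:ff-overlap 3 :context-overlap 1}
             {:ff-overlap 3 :context-overlap 0}])

(mapv :output (rest (reductions step-neuron neuron inputs)))
;; [:active-unpredicted :active-predicted]
```

The pure function would give the same answer both times; the state machine gives two different answers to identical feedforward input, because the first step left it in a predicted state.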

These fundamental differences, along with the unfortunate duplicate use of names for the components, mean that your grandfather’s tools and techniques for reasoning about, working with, and mathematically modelling NNs do not apply at all to CLA.

In fact, Jeff’s work has been criticised because he’s allegedly “thrown away” the key ability to mathematically demonstrate some properties of NNs, which the consensus considers necessary if you want your theory to be admitted as valid (or your paper published). This view would have it that CLA sacrifices validity for a questionable (and in some opinions, vacuous) gain.

Jeff gave a talk a few weeks back to the London Deep Learning Meetup – it’s perhaps the best single overview of the current state of the art for CLA:

The second consequence relates to implementation in software. Numenta began work on what is now NuPIC in 2005, and most of the current theory – itself still in its early stages in 2014 – has only appeared in stages over the intervening years. Each innovation in the theory has had to be translated into and incorporated into a pre-existing system, with periodic redesigns of this or that component as understanding developed. It is unfair to expect the resulting software artefacts to magically transmute, Doctor Who-like, in a sequence of fully-formed designs, each perfectly appropriate for a new generation of the theory.

The fact that NuPIC is a highly functional, production-ready, and reasonably faithful implementation of much of CLA is a testament to the people at Numenta and their dedication to bring Jeff’s theories into an engineered reality. The details of how NuPIC works, and how it can be used, are a consequence of its history, co-evolving with the theory, and the software design and development techniques which were available over that history.

Everything involves tradeoffs, and NuPIC is no exception. I have huge respect for the decisions which have led to the NuPIC of 2014, and I would like to view Clortex as nothing other than NuPIC, metamorphosed for a new phase of Machine Intelligence based on HTM and CLA, with a different set of tradeoffs and the chance to stretch the boundaries yet again.

So, rather than harp on what might be limiting or difficult with NuPIC, I’ll now describe some of the key improvements which are possible when a “new kind of software” is created for HTM and CLA.

Architectural Simplicity and Antifragile Software – Russ Miles

It’s odd, but Clortex’ journey began when I followed a link to a talk Jeff gave last year [free registration required] at GOTO Aarhus 2013, and decided to watch one, then two, and finally all three talks given by Russ Miles at the same event. If you’re only able to watch one, the one to watch is Architectural Simplicity through Events. In that talk, Russ outlines his main Axioms for building adaptable software:

1. Your Software’s First Role is to be Useful

Clearly, NuPIC is already useful, but there is a huge opportunity for Clortex to be useful in several new ways:

a) As a Teaching Tool to help understand the CLA and its power. HTM and CLA are difficult to understand at a deep level, and they’re very different from traditional Neural Networks in every way. A new design is needed to transparently communicate an intuitive view of CLA to layman, machine learning expert, and neuroscientist alike. The resulting understanding should be as clear to an intelligent and interested viewer as it is to Jeff himself.

b) As a Research and Development platform for Machine Intelligence. Jeff has recently added – literally – a whole set of layers to his theory, involving a new kind of temporal pooling, sensorimotor modelling, multilayer regions, behaviour, subcortical connections, and hierarchy. This is all being done with thought experiments, whiteboards, pen and paper, and slides. We’ll see this in software sometime, no doubt, but that process has only begun. A new system which allows many of these ideas to be directly expressed in software and tested in real time will accelerate the development of the theory and allow many more people to work on it.

c) As a Production Platform for new Use Cases. NuPIC is somewhat optimised for a certain class of use cases – producing predictions and detecting anomalies in streaming machine-generated numerical data. It’s also been able to demonstrate capabilities in other areas, but there is a huge opportunity for a new design to allow entirely new types of information to be handled by HTM and CLA techniques. These include vision, natural language, robotics and many other areas to which traditional AI and ML techniques have been applied with mixed results. A new design, which emphasises adaptability, flexibility, scalability and composability, will allow CLA to be deployed at whatever scale (in terms of hierarchy, region size, input space etc as well as machine resources) is appropriate to the task.

2. The best software is that which is not needed at all

Well, we have our brains, and the whole point of this is to build software which uses the principles of the brain. On the other hand, we can minimise over-production by only building the components we need, once we understand how they work and how they contribute to the overall design. Clortex embraces this using a design centred around immutable data structures, surrounded by a growing set of transforming functions which work on that data.

3. Human Comprehension is King

This axiom is really important for every software project, but so much more so when the thing you’re modelling is so difficult to understand for many. The key with applying this axiom is to recognise that the machine is only the second most important audience for your code – the most important being other humans who will interact with your code as developers, researchers and users. Clortex has as its #1 requirement the need to directly map the domain – Jeff’s theory of the neocortex – and to maintain that mapping at all costs. This alone would justify building Clortex for me.

4. Machine Sympathy is Queen

This would seem to contradict Axiom 3, but the use of the word “Queen” is key. Any usable system must also address the machine environment in which it must run, and machine sympathy is how you do that. Clortex’ design is all about turning constraints into synergies, using the expressive power and hygiene of Clojure and its immutable data structures, the unique characteristics of the Datomic database system, and the scalability and portability characteristics of the Java Virtual Machine. Clortex will run on Raspberry Pi, a version will run in browsers and phones, yet it will scale layers and hierarchies across huge clusters to deliver real power and test the limits of HTM and CLA in production use.

5. Software is a Process of R&D

This is obviously the case when you’re building software based on an evolving theory of how the brain does it. Russ’ key point here is that our work always involves unknowns, and our software and processes must be designed in such a way as not to slow us down in our R&D work. Clortex is designed as a set of loosely coupled, interchangeable components around a group of core data structures, and communicating using simple, immutable data.

6. Software Development is an Extremely Challenging Intellectual Pursuit

Again, this is so true in this case, but the huge payoff you can derive if you can come up with a design which matches the potential of the CLA is hard to beat. I hope that Clortex can meet this most extreme of challenges.

Stay tuned for Pt II – The Clortex System Design.

• Apr 11 / 2014
Clojure

Doc-driven Development Using lein-midje-doc

This is one of a series of posts on my experiences developing Clortex in Clojure, a new dialect of LISP which runs on the Java Virtual Machine. Clortex is a re-implementation of Numenta’s NuPIC, based on Jeff Hawkins’ theories of computational neuroscience. You can read my in-progress book by clicking on the links to the right. Clortex will become public this month.

One of the great things about Clojure is the vigour and quality of work being done to create the very best tools and libraries for developing in Clojure and harnessing its expressive power. Chris Zheng’s lein-midje-doc is an excellent example. As its name suggests, it uses the comprehensive Midje testing library, but in a literate programming style which produces documentation or tutorials.

Doc-driven Development

Before we get to DDD, let’s review its antecedent, TDD.

Test-driven Development

Test-driven Development (TDD) has become practically a tradition, arising from the Agile development movement. In TDD, you develop your code based on creating small tests first (these specify what your code will have to do); the tests fail at first because you haven’t yet written code to make them pass. You then write some code which makes a test pass, and proceed until all tests pass. Keep writing new tests, failing and coding to pass, until the tests encompass the full spec for your new feature or functionality.

For example, to write some code which finds the sum of two numbers, you might first do the following:
 

(fact "adding two numbers returns their sum" ; "fact" is a Midje term for a property under test
(my-sum 7 5) => 12 ; "=>" says the form on the left should return or match the form on the right
)

 

This will fail, firstly because there is no function my-sum. To fix this, write the following:
 

(defn my-sum [a b]
  12) ; hard-coded: the least code that makes the single test pass

 

Note that this is the correct way to pass the test (it’s minimal). Midje will go green (all tests pass). Now we need to force the code to generalise:
 

(fact "adding two numbers returns their sum"
(my-sum 7 5) => 12
(my-sum 14 10) => 24
)

 

Which makes us actually write the code in place of 12:
 

(defn my-sum [a b]
(+ a b)
)

 

The great advantage of TDD is that you don’t ever write code unless it is to pass a test you’ve created. As Rich Hickey says: “More code means… more bugs!” so you should strive to write the minimum code which solves your problem, as specified in your tests. The disadvantage of TDD is that it shifts the work into designing a series of tests which (you hope) defines your problem well. This is better than designing by coding, but another level seems to be required. Enter literate programming.

Literate Programming

This style of development was invented by the famous Donald Knuth back in the early 1980s. Knuth’s argument is that software involves two forms of communication. The first is obvious: a human (the developer) is communicating instructions to the machine in the form of source code. The second is less obvious but perhaps more important: you are communicating your requirements, intentions and design decisions to other humans (including your future self). Knuth designed and built a system for literate programming, and this forms the basis for all similar systems today.

This post is an example of literate programming (although how ‘literate’ it is is left to the reader to decide), in that I am forming a narrative to explain a concrete software idea, using text, and interspersing it with code examples which could be executed by a system processing the document.

Doc-driven Development

DDD is essentially a combination of TDD and Literate Programming. You write a document about some part of your software: a narrative describing what it should do, with examples of how it should work. The examples are framed as “given this, the following should be expected to happen”, which you write as facts (a type of test which is easier for humans to read). The DDD system runs the examples and checks the return values to see if they match the expectations in your document.

The big advantage of this is that your documentation is a much higher-level product than a list of unit tests, in that your text provides the reader (including your future self returning to the code) with much more than a close inspection of test code would yield. In addition, your sample code and docs are guaranteed to stay in sync with your code, because they actually run your code every time it changes.

lein-midje-doc

lein-midje-doc was developed by Chris Zheng as a plugin for the Clojure Leiningen project and build tool. It leverages Midje to convert documents written in the Literate Programming style into suites of tests which can verify the code described.

It’s simple to set up. You have to add dependencies to your project.clj file, and then you add an entry for each document you wish to include (instructions are in the README.md, full docs on Chris’ site). Then you use two shells to run things. In one, you run lein midje-doc, which repeatedly creates the readable documents as HTML files from your source files, and in the other you run midje.repl’s (autotest) function to actually spin through your tests.

Here’s Chris demonstrating lein-midje-doc:

• Mar 31 / 2014

Real Machine Intelligence now Available on Leanpub.com

The first three chapters of my new book, Real Machine Intelligence with Clortex and NuPIC, have just gone live on Leanpub.com. Lean Publishing is a new take on a very old idea – serial publishing – which goes back to Charles Dickens in the 19th century. The idea is to evolve the book based on reader feedback, and to give people a chance to read the whole book before committing to buy.

I’ve also set up a Google Group and Facebook Page. Prior to going live with Clortex, you can read a snapshot of the documentation on its GitHub.io page.