A Roadmap to Intelligence

This post looks at some key ideas for artificial intelligence systems. It acts as a guide to the landmarks on our path to improved computing.

DALL-E Mini knows all of your AI roadmap stock photo clichés!

The good thing about being an old dilettante is that you don’t need superpowers, just a curious receptive ear and an ability to survive. Over time certain ideas stick in your head, and this leads to a feedback process where you begin to see echoes of those ideas in new material.

The main ideas that stick in my head are:

  • Boolean / ternary representations;
  • Composed linear approximations;
  • Brute force over design;
  • Noisy probabilistic context; and
  • Integration for continuous.

We’ll now take these in turn, highlighting where we see the echoes in current ideas and riffing on future developments.

Boolean / Ternary Representations

One of the most striking aspects of modern neural networks is their reliance on high-precision floating-point element values for data representations. One reason for this is that they just seem to work – values are typically stored as 32-bit floats (and 64-bit capacities come more or less for free in modern computing architectures), so if they work why not use them? However, in both classical computing and neural codes, we see roughly binary or ternary representations – [0 or 1] or [-1, 0, 1]. The reason for this is that they offer the lowest common denominator of robustness – they are the smallest informative representation. In the presence of noise or variation, they are also the easiest to distinguish.

DALL-E Mini does binary or ternary

I bundle ternary schemes in here with binary schemes as they can be seen as an extension of a [-1 or 1] binary scheme with 0 being a null space between the two representations. Brains seem to contain a mixture of excitatory and inhibitory synaptic connections, which is more suggestive of a ternary scheme. You can see a ternary scheme as the simplest correlation abstraction – things are either [negatively correlated, not-correlated, or positively correlated].

Traditional neural networks attempted to provide binary or ternary representations – activation functions such as sigmoid or tanh saturate, so they effectively output binary or ternary values. However, the practical success of the ReLU activation function (which zeros any value below zero and passes through values above zero unchanged) has led away from these to a more real-valued logit output.
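To make the contrast concrete, here is a minimal sketch of the three activation functions mentioned above, using only the Python standard library. The probe values are arbitrary and purely illustrative. The point is that sigmoid and tanh squash large inputs towards their limits (0/1 and -1/1), while ReLU leaves positive values unbounded.

```python
import math


def sigmoid(x):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Zeros negative inputs and passes positive inputs through unchanged."""
    return max(0.0, x)


# Arbitrary probe values, chosen only to show saturation at the extremes.
for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(
        f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  "
        f"tanh={math.tanh(x):+.3f}  relu={relu(x):.1f}"
    )
```

For large positive or negative inputs the first two columns sit close to their limits, which is the "roughly binary" behaviour, whereas the ReLU column keeps the real value.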

Why should we worry about binary or ternary representations when high-precision float representations work just as well or even better?

  • There is some evidence to suggest that it is a loss of information that provides the abstractive power of deep neural networks (see below for more details of this). Actively selecting one or a small set of values and then using these as the input of a subsequent layer may increase this abstractive power.
  • Binary or ternary representations provide a simplicity that could reduce storage and computation demands, e.g. for embedded devices.
  • Binary or ternary representations tend to echo within even the high level dichotomies of human thought – things “are” or “are not”, rationality is built on binary causality, and human beings seem unusually bad at probabilities.
  • Action is binary – you “act” or you don’t “act” in the end.
  • You can retrieve continuous real representations by integrative approaches (see later below).

Binary or ternary representations are actually hiding within many modern neural network designs. Most successful architectures employ normalisation, typically in batches. For a set of data samples (the “batch”), you subtract the mean of the values and divide by their standard deviation, which sets the variance (and standard deviation) to 1. The mean is normally thrown away. This is similar to standardising the data input (a simple form of whitening). This is also why it has become important to shuffle your data samples well in training – if you get clusters of similar values in a batch this can throw off training by throwing off the batch norm. The result of normalisation is typically a field of values that are centred around 0 (due to mean subtraction) where most of the values are between -1 and 1 (due to the variance scaling). This gives you a roughly ternary input range.

Residual networks (e.g., ResNet) may be doing something similar by passing forward the original signal – learning deviations from the value that are typically small and centred on 0 rather than the value itself. I have a hunch this is what the thalamus is doing in the brain – removing a spatiotemporally local mean and maybe scaling based on a variance, whereas the cortex is working on finding patterns within the resulting residual or differential signal. The brain is maybe different in that it “remembers” at least this local mean to add it back into signals for generative reconstructions and output.
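The normalisation step described above can be sketched in a few lines of standard-library Python. This is a simplified illustration on a batch of scalars, not the batch norm layer of any particular framework (which would also learn a scale and shift). The input distribution, the `to_ternary` helper and its 0.5 threshold are my own assumptions for the sake of the example.

```python
import random
import statistics


def batch_normalise(batch, eps=1e-5):
    """Subtract the batch mean and divide by the batch standard deviation.

    Returns the normalised values plus the mean and std that are usually discarded.
    """
    mean = statistics.fmean(batch)
    std = statistics.pstdev(batch)
    normalised = [(x - mean) / (std + eps) for x in batch]
    return normalised, mean, std


def to_ternary(values, threshold=0.5):
    """Hypothetical hard ternarisation: map each value to -1, 0 or 1."""
    return [0 if abs(v) < threshold else (1 if v > 0 else -1) for v in values]


rng = random.Random(0)
# Illustrative raw activations with an arbitrary offset and scale (assumed values).
batch = [rng.gauss(40.0, 12.0) for _ in range(1000)]

normalised, mean, std = batch_normalise(batch)
inside = sum(1 for v in normalised if -1.0 <= v <= 1.0) / len(normalised)

print(f"stored mean={mean:.2f}, std={std:.2f}")
print(f"fraction of normalised values in [-1, 1]: {inside:.2f}")
print("first few ternary codes:", to_ternary(normalised[:8]))
```

For roughly Gaussian inputs about two thirds of the normalised values land between -1 and 1, which is the "roughly ternary" range. Keeping `mean` and `std` around, rather than throwing them away, is what would let you reconstruct the original scale later.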

Composed Linear Approximations

The relative success of deep neural networks as function approximators points to the power of representing complex high-dimensionality non-linear functions with a composition of linear functions.

DALL-E Mini does composed linear approximations

This actually has a long history in mathematics and engineering. Most engineering courses involve teaching undergraduates to memorise first-order linear approximations, telling them to ignore the second-order non-linear effects until they do a doctorate. The Taylor series shows that any sufficiently smooth function can be approximated locally (at first) using linear terms. Principal Component Analysis decomposes a signal into a sum of linear projections ordered by the variance of the signal they capture. Even calculus is based on imagining a series of linear approximations to the gradient of a function and then further imagining that the length of those linear segments reduces to zero. Gradient descent can often be reduced to ever-decreasing linear journeys within a non-linear topology.

What was somewhat surprising was the effectiveness of stacking linear approximations on top of each other, with non-linearities in between to break linear chaining. This has links with the new field of emergence – reductionism tries to break everything down to a lowest representation but emergence recognises that derived population approximations can be combined in synergistic ways to produce outputs that are a synergistic sum of the approximations, the sum providing information beyond that of the lower representations individually. This is also apparent from network science, where information is seen to reside in the interrelations between things as well as the composition of the things themselves.
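A tiny worked example shows why the non-linearities in between matter. Without them, two stacked linear maps collapse into a single linear map (their matrix product). With a ReLU in the middle, the same two maps can compute something non-linear. The weights `W1` and `W2` below are hand-picked by me for illustration: together with the ReLU they compute |a - b|, whereas the purely linear stack collapses to the zero map.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vector]


# Hand-picked illustrative weights (an assumption, not from any trained model).
W1 = [[1.0, -1.0], [-1.0, 1.0]]  # first layer: computes (a - b, b - a)
W2 = [[1.0, 1.0]]                # second layer: sums the two hidden units


def linear_stack(x):
    """Two linear layers with nothing in between."""
    return matvec(W2, matvec(W1, x))


def relu_stack(x):
    """The same two layers with a ReLU in between."""
    return matvec(W2, relu(matvec(W1, x)))


collapsed = matmul(W2, W1)  # the single matrix equivalent to linear_stack

for x in ([3.0, 1.0], [1.0, 3.0], [2.0, 2.0]):
    print(x, "linear:", linear_stack(x), "with ReLU:", relu_stack(x))
print("collapsed linear map:", collapsed)
```

The linear stack can never do better than one matrix, however many layers you add. It is the break in linear chaining that lets the composition represent something new.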

Why do stacked linear approximations work? It’s maybe because in the real world things are constrained in space and time. This results in patterns in the sensory input and constrains an information processing substrate. Things exist with a shape that provides extent in both space and time. In any processing architecture there are limits on connectivity, and on long distance transmission. This is at the heart of representation per se – we say that individual elements may be represented by an information encoding.

So far conventional neural network architectures have concentrated on backpropagation to optimise the network parameters – we start from an error in an output and propagate that error back based on the derivatives of the error with respect to the parameters, using the chain rule to link layers. There is possibly room to supplement or supplant this with a “forward” learning based on Hebbian (fire together, wire together) learning.
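The difference between the two learning styles can be sketched for a single linear unit. This is a deliberately minimal illustration with made-up weights and inputs: with only one layer there is no chain rule to apply, and the plain Hebbian rule shown here grows weights without bound, so practical variants add some form of normalisation. The function names and the learning rate are my own choices.

```python
def forward(weights, inputs):
    """Output of a single linear unit."""
    return sum(w * x for w, x in zip(weights, inputs))


def gradient_step(weights, inputs, target, lr=0.1):
    """Backward, error-driven update for E = 0.5 * (output - target) ** 2.

    dE/dw_i = (output - target) * x_i, so a target signal is required.
    """
    error = forward(weights, inputs) - target
    return [w - lr * error * x for w, x in zip(weights, inputs)]


def hebbian_step(weights, inputs, lr=0.1):
    """Forward, local update: strengthen weights whose input is active when the unit is."""
    output = forward(weights, inputs)
    return [w + lr * output * x for w, x in zip(weights, inputs)]


# Illustrative starting point (assumed values).
weights = [0.5, 0.0]
inputs = [1.0, 1.0]

print("gradient step towards target 0.0:", gradient_step(weights, inputs, target=0.0))
print("Hebbian step (no target needed): ", hebbian_step(weights, inputs))
```

The gradient step needs an error signal from downstream, whereas the Hebbian step only uses quantities available at the unit itself, which is what makes it attractive as a "forward" rule.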

Brute Force Over Design

The history of computing is one of solving problems by throwing more and more compute at the problem. Purists hate this – often it’s not clever algorithmic design that wins the day but the power of raw iteration.

DALL-E Mini does a photo of brute force over design

But we are well aware of this from our most familiar algorithm: evolution. Evolution takes mindless repetition and variation to create the complexity of us.

In the field of deep neural networks we also see this. The systems of the 80s and 90s were those of hand-picked features, and the performance of these systems in many different fields was quickly eclipsed by deep neural networks. Larger models and more training data reliably translate into better performance. We also see it in our individual brains – we learn to see, to hear, and to feel not because these abilities are hard-wired but because they are adaptively tuned in the face of the tsunami of data every one of us experiences daily.

As a design approach, we should always look at trainable and iterative solutions over those where we hand-craft the algorithm. Performance-wise it also seems that it is better to compose easy-to-use generic building blocks than try to optimise the design of those blocks.

Since 2017, the success of the Transformer architecture across natural language processing, reinforcement learning, and computer vision suggests that we can begin to think in terms of networks of pretrained systems, which we adapt to different specific representations by adding a few smaller linear layers.

Noisy Probabilistic Context

Real-world signals are naturally “noisy”. In fact, the definition of “noise” and “signal” is often a somewhat erroneous one based on our expectations – “signal” is a placeholder for that which we are trying to measure and “noise” is everything else including that which we don’t model or understand. Look at any set of neuron recordings and you see highly noisy binary signals. A surprisingly high proportion of synaptic transmissions simply peter out.

DALL-E Mini does an oil painting of a noisy probabilistic context

The precision of modern sensor systems often blinds us to the inherent chaos of the world around us, and the fact that all measurements are imperfect. The signals from the retina and cochlea are binary neural impulses, not precise 64-bit floats. Many recent vision systems seem surprisingly susceptible to adversarial attacks that we cannot even perceive – they are picking up on tiny patterns in input values that would classically be lost in the noise of a less precise system.

Our own conscious experience also tends to hide from us the noisiness and unpredictability of our lower level sensory systems. We feel we see a coherent scene around us and this is mimicked in our pictures; we hear clear voices and can detect “noise” in an analogue transmission line. But when neuroscientists measure the signals in our sensory organs and input sensory cortex they do not find clear easily discriminative neural patterns – they find variation and noise. There do seem to be patterns there – neural networks trained on the neural recordings are surprisingly good at accurately guessing a neural correlate – but the patterns are probabilistic not deterministic, they are unpredictable at the neuron level and only form patterns at a population level over space and time.

Because modern computing provides an illusion of precision, we can often gain by injecting randomness into our systems, the computing equivalent of biologically messy transmission. Dropout works along these lines in neural networks to improve robustness – we just throw away some of our hard-earned representations to improve performance! This seems counterintuitive but the underlying idea is that if we learn to expect noisy and unpredictable signal elements we pay more attention to the general patterns over space and time rather than overfitting to spurious or surface patterns in the “precise” values.
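As a rough sketch of what dropout does, here is the common "inverted" variant in plain Python: each element is zeroed with some probability during training and the survivors are scaled up so the expected value is unchanged. The function name, the drop probability and the toy activations are illustrative assumptions rather than any framework's API.

```python
import random


def dropout(values, drop_prob, rng, training=True):
    """Inverted dropout: zero each element with probability drop_prob.

    Surviving elements are divided by the keep probability so that the
    expected value of each element stays the same. At inference time
    (training=False) the values pass through untouched.
    """
    if not training or drop_prob == 0.0:
        return list(values)
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10  # toy "hard-earned" representation

print("training: ", dropout(activations, drop_prob=0.5, rng=rng))
print("inference:", dropout(activations, drop_prob=0.5, rng=rng, training=False))
```

Each training pass sees a different random subset of the representation, so no single element can be relied upon and the downstream layers are pushed towards population-level patterns.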

Using “noise” creatively is also at the heart of successful image generation systems. Both generative adversarial networks (GANs) and diffusion networks leverage noise to make small jumps towards a coherent image – without adding or removing noise there is no output.

One idea I’ve been playing with is treating a set of element-wise normalised logits as a set of individual probabilities for a ternary signal. For example, if each element in a vector signal is normalised to be primarily within a -1 to 1 range then you can compare the absolute value with a random value (between 0 and 1) to pick one of the three ternary values (-1, 0 or 1), possibly treating as a signed binary selection where the sign is added back in later (e.g. comparing the absolute value to the random value and selecting 1 if above, then adding back / multiplying by the sign).
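Here is a minimal sketch of that sampling idea, under the assumption that the input has already been normalised so most magnitudes fall within 0 to 1 (anything larger is clipped). The function name `sample_ternary` is my own. Each element's absolute value is treated as the probability of firing, and the sign is multiplied back in afterwards.

```python
import random


def sample_ternary(values, rng):
    """Stochastically map each normalised value to -1, 0 or 1.

    The absolute value (clipped to 1) is compared with a uniform random
    number in [0, 1): if it is above, the element "fires" as 1, otherwise 0.
    The original sign is then multiplied back in.
    """
    out = []
    for v in values:
        magnitude = min(abs(v), 1.0)
        fired = 1 if magnitude > rng.random() else 0
        sign = -1 if v < 0 else 1
        out.append(sign * fired)
    return out


rng = random.Random(0)
signal = [0.0, 1.0, -1.0, 0.5, -0.25]  # illustrative normalised values

for _ in range(3):
    print(sample_ternary(signal, rng))
```

Values of exactly 0, 1 and -1 always map to themselves, while anything in between flickers between 0 and its sign with a frequency set by its magnitude.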

Integration for Continuous

If we take it that we should try to use binary or ternary representations but also assume that individual elements are unpredictable for individual instance values, how can we construct the kind of real-valued function output we see in all our modern data representations?

DALL-E Mini does a sketch of integration for continuous values

The key to this, I think, is to look at summation or integration of binary or ternary signals over space and/or time.

This integration can be normalised if necessary by simply counting the time base, and if necessary scaling. If individual elements are taken within the same spatial or temporal range then their relative values represent continuous differences. You can easily go from binary or ternary to probabilities by summation, and go from probabilities to binary or ternary by sampling.

This also works well with the normalisation discussed above – if we store a mean value (and possibly a variance scaling factor), then this can be added to an integrated mean-difference to adjust the mean.
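The round trip between continuous values and ternary samples can be sketched as follows. A normalised value is repeatedly sampled into -1, 0 or 1, the samples are summed and divided by the number of time steps, and a stored mean and scale are then added back in. The stored mean of 40, the scale of 12, the value 0.3 and the 5000 steps are all made-up numbers for illustration.

```python
import random


def sample_ternary(value, rng):
    """Sample -1, 0 or 1, firing with probability equal to the clipped magnitude."""
    magnitude = min(abs(value), 1.0)
    fired = 1 if magnitude > rng.random() else 0
    return fired if value >= 0 else -fired


def integrate(value, steps, rng):
    """Sum ternary samples over time and normalise by the number of time steps."""
    total = sum(sample_ternary(value, rng) for _ in range(steps))
    return total / steps


rng = random.Random(0)

# Hypothetical stored statistics from an earlier normalisation step.
stored_mean, stored_scale = 40.0, 12.0
normalised_value = 0.3  # the mean-difference signal, in normalised units

estimate = integrate(normalised_value, steps=5000, rng=rng)
reconstructed = stored_mean + stored_scale * estimate

print(f"integrated estimate of {normalised_value}: {estimate:.3f}")
print(f"reconstructed real-valued output: {reconstructed:.2f}")
```

No individual sample carries the value 0.3, yet the integrated count converges on it as the time base grows, and the precision improves with more steps. The same averaging could be done over a population of elements in space instead of over time.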

Another aspect of composed representations is that we can also generate an output in a composed manner. This seems to be what brains do – we experience a coherent seemingly real-valued sensory world but neural evidence suggests this is composed from widely distributed binary or ternary signals. This composition can take place over space and/or time. For example, we can have different scales for space and/or time and compose those representations over time to create a rich continuous output from a population of randomly fluctuating binary or ternary signals. This is also seen in many successful generative systems where a start is made with a low-resolution image which is used as the seed for higher-resolution versions. Neural networks seem particularly suited to learning small adjustments to an existing input, e.g. performing a form of adaptive upsampling, and image pyramids from noisy steered seeds are a way to obtain a chicken from an egg.

Summing Up

So we’ve taken a look here at:

  • Boolean / ternary representations;
  • Composed linear approximations;
  • Brute force over design;
  • Noisy probabilistic context; and
  • Integration for continuous.

Sketching neural network architectures where these are set as constraints, with relative flexibility on the structure of the trainable computation engines, seems a fruitful area of research.

Initially, I imagine such architectures will fall short of the state-of-the-art evaluation results achieved by specialised neural architectures. However, over time this may be overcome by the “brute force” approach, and when scaling up, the lack of deterministic precision may actually become the superpower for more generalist abilities, as our underlying multi-modal representations need to be robust and efficient.
