This post looks at composition. It starts with Lego. It then looks at a theory of why deep neural networks work and how they could be trained. It ends on how brains may embody these theories.
What is composition?
Composition is where we attempt to construct an entity (a signal or object) from smaller sub-components.
In 3D space, Lego models are constructed by composing small brick units into a larger whole.
When picturing a 2D scene, we sense that the scene comprises a number of objects in particular spatial configurations.
For a 1D signal, such as an audio channel in time, we can use the Fourier transform to decompose the signal into a series of combined waveforms of different frequencies.
Is composition the same as reductionism?
No. Composition is related to reductionism, but is not quite the same.
Reductionism says we can describe a system solely in terms of its component parts. We can descend layers of description and the last layer is the “true” and “meaningful” layer.
Composition presents a similar hierarchical organisation (or description) of matter, but does not focus on any one level as being “better” or more explanatory. It does not deny emergent properties that may arise at each level, for example, that a horizontal line and a vertical line at a lowest level may be represented by a cross representation at a next level, where the cross may in turn be used to construct objects at a yet higher level.
What are some other examples?
Language is full of examples of composition.
Word embeddings have become a successful application of composition to the field of natural language processing.
Word embeddings map a word or word part that is represented as an index in a dictionary (e.g. “dog” may be the 1200th item in a list of used words) to a vector of real numbers (e.g. a list of 300 values stored as floating point numbers). The vector of real numbers is generated by processing billions of words and looking at the relationships between words.
Returning to composition, each element of this vector can be thought of as a particular category or type of relation, and a word is a weighted combination of different relations. The meaning of a word is “composed” of elements indicating its relation to other categories of words.
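As a rough sketch of the lookup described above, the snippet below maps a word to its index and then to a vector. The three-word vocabulary, the three-element vectors and the helper names `embed` and `cosine` are all made up for illustration; real embeddings are learned from large text collections and have hundreds of elements.

```python
import math

# Hypothetical toy vocabulary: word -> index in the dictionary.
vocabulary = {"dog": 0, "cat": 1, "car": 2}

# Hypothetical embedding table: one row of made-up numbers per word.
embeddings = [
    [0.9, 0.1, 0.0],  # dog
    [0.8, 0.2, 0.1],  # cat
    [0.0, 0.1, 0.9],  # car
]

def embed(word):
    """Look up the vector for a word via its dictionary index."""
    return embeddings[vocabulary[word]]

def cosine(a, b):
    """Cosine similarity: near 1 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

dog_cat = cosine(embed("dog"), embed("cat"))
dog_car = cosine(embed("dog"), embed("car"))
```

With these made-up numbers, "dog" sits closer to "cat" than to "car", because the two animals share large weights on the same elements.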
Nested grammar structures also show a form of composition. We generate ever more complex sentences by nesting clauses and other grammatical structures. We have a number of grammatical primitives, typically parts of speech, that we “compose” according to particular rules of composition.
Deductive theories are also a form of composition. We start with a set of primitives we call axioms. These are independent generalisations that can be said to be “true” without further decomposition (whether this is actually the case is debatable). We then “compose” these axioms in ordered sequences to generate more complex statements, and/or to arrive at realisations.
The equations of motion are a form of composition. The position of an object can be determined by starting with a static initial position and then adding or subtracting different components based on different differentials or rates of change (e.g. a component based on constant velocity, a component based on a constant acceleration, etc.). Indeed, a Taylor series is a form of weighted sum of different elements.
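A minimal sketch of this idea, assuming constant acceleration so that the standard kinematic formula applies. The function name `position` and the numbers are only illustrative.

```python
def position(t, x0, v, a):
    """Position at time t as a sum of three separate components."""
    static_part = x0                      # where the object started
    velocity_part = v * t                 # contribution of constant velocity
    acceleration_part = 0.5 * a * t ** 2  # contribution of constant acceleration
    return static_part + velocity_part + acceleration_part

# Start at 1.0, move at 3.0 per second, accelerate at 4.0 per second squared.
x_after_two_seconds = position(2.0, x0=1.0, v=3.0, a=4.0)
```

Each term can be inspected or switched off on its own, which is the point of treating the position as a composition.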
Matter is itself a composition. There are a limited number of elements in the periodic table, and an even more limited number of elements that are abundant, useful or stable in normal environments. But we have an infinitely varied world from just this limited number.
One way of seeing composition is in the form of a weighted sum.
If you have solely positive weights, then each weight indicates the strength of a contribution from a particular sub-element. A weight of zero indicates that a particular sub-element is not used.
If you have positive and negative weights, you can see a positive weight as adding a contribution from a particular sub-component and a negative weight as removing a contribution from a particular sub-component. You can alternatively consider positive and negative contributions separately, e.g. as two stages – a first stage where you add and a second stage where you take away. Weighted sums thus become similar to sculpture: you add material of certain primitive forms, and you remove material of similar or different primitive forms.
Often when considering weighted sums, we wish to use a set of normalised weights. In this case, we say that all the weights should sum to a predefined number, typically ‘1’. In this way, each weight represents a relative contribution of each sub-element.
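The weighted sum with normalised weights can be sketched in a few lines. The helper names `normalise` and `compose`, and the three example components, are assumptions made for illustration.

```python
def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return [w / total for w in weights]

def compose(weights, components):
    """Weighted sum of equal-length components (lists of numbers)."""
    length = len(components[0])
    return [
        sum(w * component[i] for w, component in zip(weights, components))
        for i in range(length)
    ]

components = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
]
# A weight of zero means the second component is not used at all.
weights = normalise([2.0, 0.0, 2.0])
signal = compose(weights, components)
```

Here the normalised weights are 0.5, 0 and 0.5, so the first and third components contribute equally and the second contributes nothing.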
Why is composition useful?
Composition is useful because it allows us to concentrate on a stylised collection of primitives or components that we can individually control. We thus overcome the challenge of dimensionality – we can represent a complex high-dimensional signal with smaller, lower-dimensional representations.
For example, the Fourier Transform helps us understand 1D signals – a complex sound wave with millions of samples with 16 bits per sample can often be represented using a much lower dimensionality representation – just a few 8-bit values representing a handful of frequencies.
From Lego, to software libraries, to modern manufacturing, we are able to construct complex artefacts from a small number of well-tested and well-designed sub-components that are composable in many different configurations. Instead of having to generate a new artefact from scratch every time something new is desired, we can quickly generate an approximation to the new thing, using re-usable components from old things.
Is composition compression?
Composition and compression are related concepts, but they differ in a number of ways.
Often composition provides compression. The Fourier Transform example above shows how this may be the case: we can replace a large number of values with a smaller number of values. However, we can have composition without compression and compression without composition.
With word embeddings, we can change an integer number that may be in a range of 0 to 100,000 (the size of an average dictionary) to a “dense” representation of 300 elements. This is not compression in the traditional sense – each element may be a 64-bit floating point number, so the word embedding may require 19200 bits as opposed to 32 bits to represent one integer number in the aforementioned range. However, it is compression if we are considering a general “number of things” – we have 300 things as opposed to 100,000 things (ignoring the information content of each “thing”).
In the Fourier Transform case, compression occurs because often we can ignore small contributions, concentrating on the frequencies that make up most of the signal. The small contributions can be seen as noise. In this way we have lossy compression; we are throwing away information, but we can do this because we don’t really notice.
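A small sketch of this kind of lossy compression, using NumPy's real FFT. The test signal (two sine waves plus a little noise) and the choice to keep only two frequency bins are assumptions made for the example.

```python
import numpy as np

n = 1024
t = np.arange(n) / n
rng = np.random.default_rng(0)

# Two "real" frequencies (5 and 40 cycles) plus low-level noise.
clean = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
signal = clean + 0.05 * rng.standard_normal(n)

# Decompose into frequency components.
spectrum = np.fft.rfft(signal)

# Keep only the strongest bins and zero everything else.
keep = 2
strongest = np.argsort(np.abs(spectrum))[-keep:]
compressed = np.zeros_like(spectrum)
compressed[strongest] = spectrum[strongest]

# Rebuild an approximation from the few values we kept.
approx = np.fft.irfft(compressed, n=n)
relative_error = np.linalg.norm(approx - clean) / np.linalg.norm(clean)
```

The 1024 samples are replaced by two complex numbers and their positions, and the reconstruction stays close to the noise-free signal because the discarded bins held mostly noise.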
When thinking of 3D objects, a composed object is often not any smaller or less massive than a version generated from scratch. For example, you could build a model house from Lego bricks or 3D print an identical copy. In this case, there is no difference in the mass of the end object, but the Lego house has advantages in construction. Again, there is a reduction in the number of things – we have a discrete number of bricks, rather than a much larger number of atoms or plastic particles.
Why does composition work?
Two reasons: composition reflects the patterns within the world and composition is an efficient way to act in the same world.
Composition reflects how the world works. It reflects the way matter is constructed. It reflects the way evolution has built complex organisms. And it especially reflects the world humans have built.
Chemistry dictates regular patterns that cause discrete clustering of particular forms of matter.
The natural world has independent entities with repeating structure. This is because nature just copies and pastes; evolution takes what works and makes more of it. There are animals, trees, and plants. We can pick up stones and taste raindrops. Motifs such as eyes and cells are used by most things.
The Fourier Transform works because many of the sounds we hear are caused by vibrations within gaseous matter (air), where the vibrations have discrete and independent sources that have vibrating elements of particular lengths.
Animals and plants all act in the world. Humans build things in the world. But the world has space for near infinite combinations of matter in time. The machine code that is executed when you surf the web is likely different to the machine code of your neighbour. If you had to construct each website from scratch using machine code there would likely be no websites. Instead, most websites have similarities that exist because, to build them quickly, most people just use existing libraries and arrange parts in various combinations.
What does composition have to do with intelligence?
Thinking about composition is hard for us, as our brains naturally parse our reality into a composed representation. We experience a world of objects but we don’t sense a world of objects. Our sensory input is noisy, confusing, contradictory; i.e. generally a mess. Instead of the world you know and love, reality can only typically be measured through multiple streams of numbers, Matrix-style.
As the world has regular patterns, our brains can build representations or models of these patterns to navigate the world.
Also to act in a reasonable time frame, it makes sense to spend time building mini-models of the regularities in the world, and then try to reuse those mini-models.
If you want to survive in a chaotic and changing world, it turns out that it is a good idea to try to predict both the world and your actions within it. To predict what the world will be like (or even is like), our brains build models of the world. These are likely to be generative models – we can generate (guess) what a sensory input will be based on the model, where the model may be different from the sensory input. Generative models can use the two properties of composition: they can model the regularities and they can efficiently generate a predicted representation.
Modelling the world involves representing a complex representation (e.g. a signal, an image, a set of numbers) using a simpler set of rules or representations. A model can thus be a form of composition, and we can create complexity by composing different simple models. A model is like a mini machine – it creates an output from an input. Often we can feed in a simple input (e.g. a time series that is just a linear range) and generate a complex output (e.g. a series of numbers in a pattern) by applying a common rule to the input. We can also take a complex measurement (like a noisy graph) and determine a simple set of parameters that mostly represent the measurement. Many signal processing models assume a regular underlying pattern and noise. These in combination can create a complex signal that is difficult to predict. But we can generate a representation of that signal by applying a few simple rules and then adding a (pseudo)-random set of noise. We can use the simple rules to predict the signal and ignore the noise.
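As an illustration of "simple rule plus noise", the sketch below generates a noisy straight line and then recovers the two parameters of the rule with a least-squares fit. The slope, intercept and noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simple input: a linear range of values.
x = np.linspace(0.0, 10.0, 200)

# Complex-looking measurement: a simple rule plus random noise.
true_slope, true_intercept = 2.0, -1.0
measurement = true_slope * x + true_intercept + rng.normal(0.0, 0.5, size=x.size)

# Recover a two-parameter model from the 200 noisy values.
slope, intercept = np.polyfit(x, measurement, deg=1)

# Use the rule to predict the signal; what is left over is treated as noise.
prediction = slope * x + intercept
residual = measurement - prediction
```

Two numbers stand in for 200 measurements, and the residual is what the model chooses to ignore.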
Can we live with “good enough”?
One of the powerful realisations that comes from the success of composition is that simple rules or models can be surprisingly effective, and allow for the prediction of a surprising amount of a signal. Or to put it another way, we can throw away a surprising amount and still retain something that has predictive utility.
Due to the regularities in the world, it turns out that we can adopt a view from Pareto: that much (e.g. 80%) of the variation that we see comes from just a small proportion of causes (e.g. 20%). If we can model those causes using simple rules, we can generate or re-generate a representation of the world that roughly reflects what is there. From Pareto (or a general power-law approach), it turns out that the detail of a representation actually takes a lot of effort to predict accurately, but this only has a relatively small effect on the appearance of the signal. We can do a surprising amount with “good enough”.
We can view “good enough” through a lens of variation. It turns out that much of the variation in the world has macro causes that can be roughly predicted. For a time-series 1D measurement, working out where a data value roughly is within a given scale is easier than working out where specifically it is.
What are our “units” of composition?
If we want to model how the world works, how do we select the sub-components for a composed representation?
If we look at composition as a weighted sum, one approach is to look at basis vectors.
First, imagine we have an N-dimensional signal. This could be a frequency (magnitude) spectrum across 512 different frequencies (i.e. a 512-length signal) or a 32 by 32 pixel image that is flattened row by row into a 1024-length signal. Each unique signal can be thought of as a vector (or point) in 512 or 1024 dimensional space.
Within this N-dimensional space, we can define a set of basis vectors. Each basis vector is an N-dimensional vector. The set of basis vectors has certain properties:
- Each of the basis vectors in the set has to be linearly independent.
- The set of basis vectors needs to be configured such that we can construct every vector within the N-dimensional space as a weighted sum of the basis vectors.
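The first property means that no basis vector can be built as a weighted sum of the others. A particularly convenient choice, and the one PCA gives us, is a set of vectors that are all at right angles (orthogonal) within the N-dimensional space; orthogonal vectors are always linearly independent. This is fairly easy to picture for 2 dimensions, but harder for 512 dimensions!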
I like to think of basis vectors as a co-ordinate system for the space. For example, for a two-dimensional space, the vectors [1, 0] and [0, 1] form one set of basis vectors. These may be seen as a unit in the x-direction and a unit in the y-direction. Each vector in two-dimensional space may then be thought of as a particular number of units in the x-direction and a particular number of units in the y-direction.
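To make the co-ordinate picture concrete, the sketch below expresses the same 2D point in two different bases: the standard x and y units, and a pair of axes rotated by 45 degrees. The rotated basis and the helper `weights_for` are assumptions chosen for the example.

```python
import numpy as np

# Each row is one basis vector.
standard = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
rotated = np.array([[1.0, 1.0],
                    [-1.0, 1.0]]) / np.sqrt(2.0)

def weights_for(basis, point):
    """Solve for the weights w such that w @ basis equals the point."""
    return np.linalg.solve(basis.T, point)

point = np.array([3.0, 2.0])
w_standard = weights_for(standard, point)
w_rotated = weights_for(rotated, point)

# Either set of weights rebuilds the same point as a weighted sum.
rebuilt = w_rotated @ rotated
```

The point does not change; only the number of "units" of each basis vector changes with the co-ordinate system.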
Basis vectors may thus comprise our sub-components for a composed representation.
How do we select the set of basis vectors?
It turns out that there is a statistical technique for determining a set of orthogonal basis vectors for an N-dimensional signal, where the basis vectors represent the ordered directions of variance within the signal. This approach is called Principal Component Analysis (PCA).
PCA is normally described with respect to a 2D example with an elliptical distribution (blob) of points. The “principal components”, i.e. the basis vectors, are computed by looking at the covariance between the dimensions. The direction where the most variation occurs is the first basis vector, the direction where the next most variation occurs is the second basis vector and so on.
One way to think of the “principal components” of PCA is as building blocks for constructing a signal. The first building block represents the greatest possible range of locations within the set of possible signals and so the first weight represents a “number” of these first building blocks (which can be negative and fractional).
If we were thinking in terms of geography, then the first building block may represent a direction along a motorway, with subsequent building blocks representing side roads, and the last building blocks representing small country lanes. You can get mostly there with the motorway, and the bits off the motorway represent a smaller distance (but take up much of your time). If you had to describe how to get to where you live, you would start with the motorway portion rather than the twists and turns of the last country lanes.
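The 2D blob example can be sketched with plain NumPy by taking the eigenvectors of the covariance matrix. The function name `pca` and the synthetic blob (wide along the diagonal, narrow across it) are assumptions for illustration; a library implementation would normally be used in practice.

```python
import numpy as np

def pca(data):
    """Return (mean, components, variances) for rows of samples.

    Components are returned as rows, ordered from most to least variance.
    """
    mean = data.mean(axis=0)
    centred = data - mean
    covariance = np.cov(centred, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    return mean, eigenvectors[:, order].T, eigenvalues[order]

rng = np.random.default_rng(2)
along = rng.normal(0.0, 3.0, size=500)   # the "motorway" direction
across = rng.normal(0.0, 0.5, size=500)  # the "country lane" direction
data = np.column_stack([along + across, along - across]) / np.sqrt(2.0)

mean, components, variances = pca(data)

# Weights: how many "units" of each building block every sample uses.
weights = (data - mean) @ components.T
```

The first component lines up with the diagonal along which the blob is stretched (up to sign), and its variance is much larger than that of the second.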
So can we stop with PCA and basis vectors?
No, unfortunately we face one major downside: linearity.
Basis vectors only provide for linear weighted sums. This means they are not effective when the signal is non-linear. Most signals you record in the world are non-linear.
For example, a circle of points within two dimensions is not described efficiently as a linear combination of vectors; instead, it is best represented by a quadratic function of elements in each dimension.
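A quick numerical check of the circle example, assuming evenly spaced points on a unit circle. The variance is the same in every direction, so no linear direction stands out, yet a single quadratic quantity is constant for every point.

```python
import numpy as np

angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

# Linear view: the variance along each principal direction.
variances = np.linalg.eigvalsh(np.cov(circle, rowvar=False))

# Quadratic view: x squared plus y squared for every point.
radius_squared = (circle ** 2).sum(axis=1)
```

Both variances come out equal, so dropping either direction loses half of the variation, while the quadratic description needs only one number (the radius).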
Can we get around the constraint of linearity?
The unexpected success of “deep” neural networks in providing complex X>Y mappings suggests one way we can avoid the constraint of linearity: it turns out we can cheat.
When we stack up a series of linear transformations, and add in an (arbitrary) non-linear function between the layers, it turns out we can learn complex non-linear functions. Popular non-linear activation functions include sigmoid, tanh, and ReLU. All of these seek to saturate variation in one or more directions, e.g. ReLU sets values below 0 to 0, sigmoid caps values between 0 and 1 and tanh caps values between -1 and 1.
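The sketch below shows why the non-linear function matters: two stacked linear transformations collapse into a single linear transformation, whereas inserting tanh between them does not. The random weight matrices are placeholders, not trained values.

```python
import numpy as np

def relu(v):
    """Set values below 0 to 0."""
    return np.maximum(v, 0.0)

def sigmoid(v):
    """Squash values into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(3)
w1 = rng.standard_normal((4, 3))  # first layer: 3 inputs -> 4 values
w2 = rng.standard_normal((2, 4))  # second layer: 4 values -> 2 outputs
x = rng.standard_normal(3)

# Two linear layers are equivalent to one combined matrix.
linear_stack = w2 @ (w1 @ x)
collapsed = (w2 @ w1) @ x

# With tanh in between, no single matrix reproduces the mapping in general.
nonlinear_stack = w2 @ np.tanh(w1 @ x)
```

Without the activation the extra layer adds nothing, which is the sense in which the non-linearity lets us "cheat".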
PCA + non-linear activation function?
Let’s start with sigmoid and tanh activation functions. These have fallen out of favour compared to ReLU and others but they are slightly better examples for understanding what is going on.
Now when we generate a set of basis vectors for PCA, these are represented as a matrix transform that maps our original signal to the directions of the basis vectors, so that the result is a new signal that represents the number of “units” of each basis vector. This is just what a layer of a feed-forward neural network does when it multiplies its input by a weight matrix. The number of “units” can also be seen as a distance measure in each of the basis vector directions.
Now if we are applying sigmoid or tanh activation functions we are limiting the number of “units” of each basis vector we can use (either 0 to 1 or -1 to 1). The sigmoid (and ReLU) activation limits us to non-negative numbers; the tanh allows negative “units” but these are limited to a minimum of -1.
If we understand this, we see how scaling becomes important. If we get the scaling wrong, then limiting to 1 unit in a basis vector direction can severely corrupt our signal. Imagine a set of 2D points within the range 0 to 10 in each of the [1, 0] and [0, 1] directions; if we were to limit to one “unit” in these directions, we would lose much of our signal. However, if we divided our signal by 10, and allowed fractional units, then we lose little by applying a sigmoid or tanh activation function.
So one option to represent non-linear functions could be to perform PCA, then apply an activation function, then repeat this process. BUT we need our input to be well scaled with respect to our basis vectors. This isn’t so unusual; it turns out we also need this for PCA to be effective.
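The effect of scaling can be checked numerically. The three 2D points below are made up, with values in the range 0 to 10; the sketch compares how much of their spread survives tanh with and without dividing by 10 first.

```python
import numpy as np

# Hypothetical 2D points with values between 0 and 10.
points = np.array([[2.0, 9.0],
                   [5.0, 5.0],
                   [10.0, 3.0]])

def spread(values):
    """Range (max minus min) of the values in each dimension."""
    return values.max(axis=0) - values.min(axis=0)

# Unscaled: almost everything saturates close to 1.
spread_raw = spread(np.tanh(points))

# Scaled into 0..1 first: the points stay clearly separated.
spread_scaled = spread(np.tanh(points / 10.0))
```

Without scaling, the points become nearly indistinguishable after the activation; with scaling, most of the variation is preserved.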
An interesting question is: what happens if we limit the resolution of our transformed signals?
In the limit, thinking back to our sigmoid and tanh activation functions, if we have a particularly steep slope we start to approximate a step function. This means that our number of “units” is 0 or 1, or -1 or 1, i.e. a discrete integer (binary) number of units – in the limit we have a contribution from a basis vector or not (or a negative contribution). We are building with Lego blocks again.
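A small sketch of this limit, assuming a sigmoid with an adjustable slope. The `gain` parameter is introduced here for illustration and is not part of the usual sigmoid definition.

```python
import numpy as np

def sigmoid(v, gain=1.0):
    """Sigmoid with an adjustable slope; a larger gain gives a steeper curve."""
    return 1.0 / (1.0 + np.exp(-gain * v))

values = np.array([-2.0, -0.5, 0.5, 2.0])

soft = sigmoid(values)               # graded values between 0 and 1
steep = sigmoid(values, gain=50.0)   # values pushed almost to 0 or 1
binary = (values > 0).astype(float)  # the step function being approximated
```

With a large gain, each basis vector is effectively either used or not used, which is the Lego-block picture.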
Of course, our approximate signal will be a lossy approximation; we have thrown away information. Whether we throw away too much information depends a lot on the scaling and distribution of our input signal. It may be that we throw away too much useful information, leaving us with garbage. But also: maybe not.
One advantage of limiting the transformed values using an activation function is we avoid large weights that may be associated with overfitting.
How could we get back our signal?
If we have a signal that is composed of binary quantities of sub-components (in the form of basis vectors), one possibility is we can use this as a first estimate or prediction, and then spend effort estimating or predicting what is left over as a second stage.
To reconstruct the original signal we just sum up the basis vectors using the transformed values as the weights for the sum.
To generate a second estimate we can subtract our reconstructed first estimate or prediction from our original signal and then repeat the process. Subtracting the reconstructed signal may be seen as generating an error signal. We can repeat the process again and again until we get closer to our original signal. At some point we’ll become good enough. Often our first or second predictions will be good enough.
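One way to sketch this repeated estimate-and-subtract loop is shown below. Several details are assumptions that the text does not specify: the basis is a random orthonormal one rather than a PCA basis, each stage allows -1, 0 or +1 units of each basis vector, and the size of a unit is halved at every stage.

```python
import numpy as np

def ternary_units(coefficients, scale):
    """Quantise each coefficient to -1, 0 or +1 units of size `scale`."""
    return np.clip(np.round(coefficients / scale), -1.0, 1.0)

def staged_estimate(signal, basis, first_scale, stages):
    """Build up an estimate in stages, each stage modelling the previous error."""
    estimate = np.zeros_like(signal)
    residual = signal.copy()
    errors = []
    scale = first_scale
    for _ in range(stages):
        coefficients = basis @ residual           # project error onto the basis
        units = ternary_units(coefficients, scale)
        estimate = estimate + scale * (units @ basis)  # sum of basis vectors
        residual = signal - estimate              # the new error signal
        errors.append(float(np.linalg.norm(residual)))
        scale /= 2.0                              # assumed: finer units each stage
    return estimate, errors

rng = np.random.default_rng(4)
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
basis = q.T                                       # rows are orthonormal vectors
signal = rng.uniform(-1.0, 1.0, size=8)

estimate, errors = staged_estimate(signal, basis, first_scale=2.0, stages=6)
```

Under these assumptions the error shrinks as stages are added, and the loop can be stopped as soon as the estimate is good enough.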
PCA also provides a bonus advantage.
It turns out our basis vectors are ranked based on the variation they model; the first few basis vectors will cover most of the variation. The eigenvalues that result from PCA even give us an indication of the contribution of each basis vector. We can thus set limits on how accurately we want to model the signal. Typically, we can ignore many of the lower basis vectors as these can be seen as noise, and imperceptibly change our final representation.
This gives us a convenient knob to twiddle when computing with limited resources in time. We can turn up or down the fidelity of our predictions. If we are using a layered approach we can do this across the layers. We can get 80% of the benefit with 20% of the effort.
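The "knob" can be sketched as a function that says how many basis vectors are needed to cover a chosen fraction of the variance. The function name `components_needed` and the eigenvalues below are made up for illustration.

```python
import numpy as np

def components_needed(eigenvalues, target_fraction):
    """Smallest number of leading components covering the target variance."""
    ordered = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(ordered) / ordered.sum()
    index = int(np.searchsorted(cumulative, target_fraction))
    return min(index + 1, len(ordered))

# Hypothetical eigenvalues from a PCA, i.e. variance per basis vector.
eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]

low_fidelity = components_needed(eigenvalues, 0.40)
medium_fidelity = components_needed(eigenvalues, 0.85)
high_fidelity = components_needed(eigenvalues, 0.99)
```

With these numbers, one component covers 40% of the variance, three cover 85%, and all five are needed for 99%, so fidelity can be traded against effort.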
Inside the Brain
This cascade of PCA computation could be what the cortex is computing.
There appear to be several routes by which information flows across the cortex:
- a first direct route that goes sensory organ > brainstem > thalamus (relay nuclei) > cortex (x1) > further cortex (x2, x3 etc);
- a second indirect route that goes cortex (x1 – L5) > thalamus (relay nuclei) > cortex (x2 – L4) etc;
- a third modulating route that goes cortex (xn – L6) > thalamus (relay nuclei); and
- a fourth global configuring route that goes brainstem > thalamus (intralaminar) > cortex (x1 – L1/L6).
The first direct route appears to pass a result of computation from one portion of the cortex to another. The second indirect route appears to pass a copy of the sensory input (as pre-processed by the brainstem and thalamus) to each portion of cortex. The third and fourth routes appear to provide feedback and set global configurations.
The second indirect route is interesting – why provide this?
One answer could be that this is providing a copy of the input signal that allows different portions of the cortex to determine these estimates of the original signal minus the earlier PCA reconstructions. The direct cortical path passes the result of the PCA transformation and this is subtracted from the copy of the input signal.
How can composition work with constraints of locality in the brain?
In the brain we are constrained by spatio-temporal locality: a neuron is only connected to around 1000 other neurons and most of these are relatively short-range connections. Also neurons have finite firing and refresh times: activity often takes 300-700ms to cross the cortex.
When we think of composition, we often think of a combination of distinguished features. These combinations are typically considered in a manner that is independent of any spatio-temporal constraints. For example, a face has eyes, nose and a mouth; or a side view of a car has two circles and a rectangle. But in the brain multiple neurons likely represent “eyes”, and these must be coupled to the multiple neurons that represent “face”, and all these must be connected in a way that doesn’t break the rules of locality and limited connectivity.
So how do we reconcile these two points?
The cortex can be thought of as a two-dimensional plane. Signals must pass in space and time over this plane. Shortcuts can be provided by the thalamus, which can link two distant coordinates on the plane. The claustrum and corpus callosum also couple different portions of the brain (intra and inter hemisphere respectively). When the cortex determines properties or features from multiple streams of sensory input these may be thought of as different locations on this plane. Signal processing flows in multiple directions across the plane in time. For meaning to be composed, the result of processing at these different locations must be integrated.
Returning to our PCA example, we can see each basis vector as a sub-component of some kind. Each element of the sub-component may be represented by a neuron. If each neuron is only connected to around 1000 other neurons this limits our dimensionality to around 1000 elements – not enough for complex representations (e.g. you could just about represent the main ways that a 32 by 32 pixel image square varies).
One answer is that we could have a continual composition across the cortical surface, where the composition is performed in stages and a result of the composition is encapsulated in a signal that propagates over the cortex. In this case, the later portions of the signal may be almost incomprehensible with reference to the original input; instead, it will comprise a complex non-linear function over multiple stages.
Another answer is that centralised brain areas such as the thalamus and the claustrum may connect disparate portions of the cortex that each hold different sub-component representations. These areas may also provide for a consistency of representation, providing feedback that forces bottom-up and top-down signals to agree. If we were to connect 1000 widely distributed areas, we could have rich composed representations.
We may also have temporal pooling as we move across the cortex towards the association areas, so that our later signal representations change less than our input signal. Often analysis assumes a static input but inputs are changing all the time. Also processing in each area of the cortex, as well as signal propagation, will take a finite time. Hence, by the time we get to temporal lobes (e.g. the inferior temporal gyrus or IT) we are maybe 400-500ms behind the input.
For example, for signal classification, areas within the IT may pool over the time of the “lower” processing, e.g. integrate over lower level features. V1 features will typically change 4-5 times over the time it takes for one signal to reach the IT, while V2 features will change 3-4 times etc. If we are looking at generative modelling where we are trying to predict a signal input using top-down processing, then we also need to accommodate processing times – e.g. if we want to predict low level sensory features based on IT activity we will possibly need to predict 0.5s into the future.
We also need to remember though that there is no magical Wizard of Oz at the top of a processing hierarchy. Instead, we have a central “muddle” area of multi-sensory representations that are then translated into specific motor signal outputs. But composition can also work on the way out as well as the way in. The control of complex motor functions may equally be controlled via composed representations, e.g. “raise your arm” will involve many different muscles and many different instructing signals – on the way out we decompose the complex representation into the constituent parts.
So in this post we have looked at composition.
We have seen how general aspects of the world can be looked at in terms of composed elements. Because of this, we can see how composition is a useful strategy for expressing “meaning” in intelligent things.
We have then looked at one way we could model composition using basis vectors and principal component analysis.
We then ended by turning to a complex organ of computation, the brain, and looking at how some of our theories about composition fit in with what we know of brain function. This gives us some ideas for how we can build machines that perceive, and act in, the world.