I often find myself thinking deeply about probability. What *is* it?
Let’s start by assuming there is some form of local reality at a point of interest. Let’s then assume that the “true” nature of this local reality is unknowable. This may be due to the limits of our senses or the limits of time and space. We’re using the word “true” here to denote something like “complete”. As the true nature is ultimately unknowable we can give it any label we like: “reality”, “truth”, “God”, etc. I’ve purposely referred to “local” reality as relativity leads us to believe that there is no block universe reality – only local portions connected by interactions.
Although reality is ultimately unknowable it is not necessarily totally unknowable; we can know bits of reality at varying levels of accuracy. Or to put it differently, we can make measurements at certain points in space and time and have those measurements follow certain patterns that are conserved over relatively large local areas (such as the surface of the Earth).
Now reality is a set of interconnected and interacting things. We can’t measure them all to have a complete picture of reality. But because of the patterns, we can measure some things (imperfectly) and use the patterns to guess or predict other things. As there are multiple things interacting, at multiple scales, there are multiple patterns that may exist at any one time. Our measurements typically carve out a small subset of things – most human science is based on a laughable notion of looking at only two things at the same time (see 2D graphs). So we need a way of representing all we don’t or can’t know. One answer to this is probability.
Probability began when bored French mathematicians tried to work out how to win more money at simple human games of chance – mainly dice and cards. What many miss is that these games of chance are artificial human constructs – very, very simplified models. So many grew up thinking that probability was a simple tool to understand chance, rather than seeing its initial simplicity as a product of the artificial chance humans had constructed. Reality is much messier (as quantum physics and the weather easily show us). In fact, all brains have evolved to cope with this messy uncertainty. Brains thus need a way to act in the world despite the messiness. To do this they use a form of probability.
Starting with the simple models, probability is a way to set out the possible outcomes of an event (localised in space and time) and the relative likelihoods or frequencies of those outcomes. This is close to the frequentist view. You can roll a die hundreds of times and see that one of six numbers lands on top, and that each number occurs roughly 1/6 of the time (on a fair die). A clever invention of probability is using a dimensionless range from 0 to 1 to represent a likelihood metric. Once you do this, you can use the same range in other clever ways, such as to represent a belief in a Bayesian sense. So even though you’ve never seen something before, you can make a guess as to the range of outcomes and then update this as you make repeated observations.
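Both views can be sketched in a few lines of Python (a toy illustration; the number of rolls and the simple pseudo-count update are my own choices here):

```python
import random
from collections import Counter

random.seed(0)

# Frequentist view: roll a fair six-sided die many times and check
# that each face comes up roughly 1/6 of the time.
rolls = [random.randint(1, 6) for _ in range(60_000)]
freqs = {face: count / len(rolls) for face, count in Counter(rolls).items()}
for face in sorted(freqs):
    print(face, round(freqs[face], 3))  # each close to 0.167

# Bayesian view: start with a guess and update it with observations.
# Here we track a belief in P(rolling a six) via two pseudo-counts
# (a Beta(1, 1) prior: one imagined six, one imagined non-six).
alpha, beta = 1, 1
for roll in rolls[:600]:
    if roll == 6:
        alpha += 1
    else:
        beta += 1
print(round(alpha / (alpha + beta), 2))  # belief in P(six), near 1/6
```

The belief starts at 1/2 (total ignorance) and is dragged towards 1/6 by the evidence – the updating, not the final number, is the point.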
As the mathematics of probability expanded, it began to consider multiple events (normally two). There are two ways to consider multiple events: jointly or sequentially. These lead to joint probability models and conditional probability models. In a joint probability model, you are trying to obtain a probability for the outcomes of the two events considered together. In a conditional probability model, you are trying to obtain a probability for one of the events, given knowledge about the other event. Joint and conditional probability models are related, and Bayes’ rule provides a way to map between the two.
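A small sketch of the mapping, using two everyday events and invented numbers (the weather and whether someone carries an umbrella):

```python
# Joint probability table P(weather, umbrella); the four entries
# are made-up illustrative numbers that sum to 1.
joint = {
    ("rain", "umbrella"): 0.25,
    ("rain", "no_umbrella"): 0.05,
    ("dry", "umbrella"): 0.10,
    ("dry", "no_umbrella"): 0.60,
}

# Marginal: P(umbrella) is the sum of the joint over the weather.
p_umbrella = sum(p for (w, u), p in joint.items() if u == "umbrella")

# Conditional: P(rain | umbrella) = P(rain, umbrella) / P(umbrella).
p_rain_given_umbrella = joint[("rain", "umbrella")] / p_umbrella

# Bayes' rule maps back the other way, from P(umbrella | rain)
# and the marginals to P(rain | umbrella).
p_rain = joint[("rain", "umbrella")] + joint[("rain", "no_umbrella")]
p_umbrella_given_rain = joint[("rain", "umbrella")] / p_rain
via_bayes = p_umbrella_given_rain * p_rain / p_umbrella

print(round(p_rain_given_umbrella, 3))  # ≈ 0.714
print(round(via_bayes, 3))              # same value, via Bayes' rule
```

The two routes land on the same number, which is all "Bayes' rule maps between the two" means in practice.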
A bit later still, in the second half of the twentieth century, mathematicians also began to consider more complex representations of reality. One way to do this is to use multidimensional vectors to represent a data point. You can think of this as representing different aspects or measurements taken at the same time (e.g., colour, height, type, speed, etc.). Hence, instead of working with just single values, we begin to work with multiple values at the same time (normally in parallel via linear algebra).
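A minimal sketch of a data point as a vector of simultaneous measurements (the dimension names and values are invented):

```python
import math

# A "data point" as several measurements taken at the same time.
point_a = {"hue": 0.21, "height_m": 0.92, "speed_mps": 0.0}
point_b = {"hue": 0.25, "height_m": 0.95, "speed_mps": 0.1}

# Linear algebra treats these as vectors and operates on all the
# values in parallel, e.g. a Euclidean distance between two points:
def distance(a, b):
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

print(round(distance(point_a, point_b), 3))  # ≈ 0.112
```

In practice libraries like NumPy do this with arrays rather than dicts, but the idea is the same: many values, one operation.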
At this stage we begin to hit one of the problems of simple probability models: size. As we add more events, and more measured properties, we increase the potential space of all the possible outcomes. The mathematics of probability often tries to fit functions to probability distributions and then integrate over these functions (e.g., using well-developed rules), or alternatively to wander around the potential space by sampling from it. Both run into problems as the space grows.
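The growth is easy to see with a little arithmetic (the dice and the 100-bin discretisation are just illustrative choices):

```python
# The outcome space grows exponentially with each added event:
# one six-sided die has 6 outcomes; n dice have 6**n joint outcomes.
for n in (1, 2, 5, 10, 20):
    print(n, 6 ** n)

# With continuous measurements it is worse: discretising each of d
# dimensions into just 100 bins gives 100**d cells to cover.
d = 10
print(100 ** d)  # 10**20 cells for a modest 10-dimensional space
```

No amount of integrating or wandering covers 10^20 cells exhaustively, which is why the fitted functions and the samplers both struggle.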
Large search spaces with float values are actually a bit silly. They are more a result of us taking something simple and stretching it to extremes. Generally, this hasn’t caused as much of a problem as we feared, because we also invented computers whose computing speeds increased exponentially over time. Machine learning researchers dream of manifolds – lower-dimensional structures within these huge spaces. This is because the evidence shows that reality doesn’t spread itself evenly around the huge space. Instead, it hugs narrow roads and valleys. Why? Because it’s very difficult to produce complexity from scratch one point at a time. Instead, most of the complexity of reality comes from simple local rules and patterns, repeated endlessly over time, and combined over large areas with simple interactions. This is the root of chaos theory – simple interactions over multiple iterations can create beautiful, messy complexity.
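The classic toy demonstration of "simple rule, iterated, gives messy complexity" is the logistic map – one line of arithmetic repeated (parameter values here are the standard chaotic-regime choice):

```python
# Chaos from a simple local rule applied repeatedly: the logistic
# map x -> r * x * (1 - x). Two starting points one millionth apart
# end up far apart, even though each individual step is trivial.
def max_gap(x, y, r=4.0, steps=50):
    gap = 0.0
    for _ in range(steps):
        x = r * x * (1 - x)
        y = r * y * (1 - y)
        gap = max(gap, abs(x - y))
    return gap

print(max_gap(0.2, 0.200001))  # order 1, despite a 1e-6 starting gap
```

All the "complexity" lives in the repetition, not in the rule – which is the point about reality hugging narrow roads carved by simple interactions.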
The patterns of patterns that exist in the world operate over multiple sensory spaces at the same time. There is locality of events but diversity of change. Using multiple modalities – different types of measurements and measurements of different things – we can narrow down the search space by considering each modality in parallel and then pooling our knowledge. As we can never be sure of anything, our measurements are probabilities representing a likelihood of something and using the rules of probability (basically arithmetic with a dimensionless metric between 0 and 1) we can combine different sources of knowledge into a holistic model, where we use different sources of knowledge to constrain wild flights of search space fancy.
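One simple way to pool modalities like this is a naive-Bayes-style combination – multiply per-modality likelihoods into a prior and renormalise back into the 0-to-1 range. The sensors, hypotheses, and numbers below are all invented for illustration:

```python
# Pooling evidence from three independent modalities for the
# hypothesis "a cat is present" vs "no cat".
prior = {"cat": 0.5, "no_cat": 0.5}
likelihoods = {
    "camera":     {"cat": 0.7, "no_cat": 0.2},
    "microphone": {"cat": 0.6, "no_cat": 0.3},
    "lidar":      {"cat": 0.5, "no_cat": 0.4},
}

# Multiply each modality's likelihood into the prior, then
# renormalise so the hypotheses again sum to 1.
posterior = dict(prior)
for sensor in likelihoods:
    for h in posterior:
        posterior[h] *= likelihoods[sensor][h]
total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}
print({h: round(p, 3) for h, p in posterior.items()})  # cat ≈ 0.897
```

Each weakly informative modality constrains the others, and the pooled belief is far sharper than any single sensor – the "constrain wild flights of search space fancy" idea in miniature.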
Graph structures for probability begin to capture some of this. Here you can break down huge joint probability distributions into a series of conditional probability distributions. This is inherently easy to understand: to draw a chair you don’t individually and independently choose the position and colour of every pixel. You say: my chair is roughly this shape; given that rough shape it has this type of back and these legs; given that, I want this cushion; and so on – a series of choices where previous choices inform and constrain future choices.
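The chair example can be sketched as a chain of conditional choices (all the categories and probabilities below are invented):

```python
import random

random.seed(1)

# Sample the rough shape first, then choose a back conditioned on it.
shape_dist = {"armchair": 0.3, "stool": 0.2, "office": 0.5}
back_given_shape = {
    "armchair": {"high_back": 0.8, "no_back": 0.2},
    "stool":    {"high_back": 0.1, "no_back": 0.9},
    "office":   {"high_back": 0.6, "no_back": 0.4},
}

def sample(dist):
    return random.choices(list(dist), weights=dist.values())[0]

shape = sample(shape_dist)
back = sample(back_given_shape[shape])  # earlier choice constrains this one
print(shape, back)

# The joint probability factorises the same way:
# P(shape, back) = P(shape) * P(back | shape)
p_joint = shape_dist["stool"] * back_given_shape["stool"]["no_back"]
print(round(p_joint, 2))  # 0.2 * 0.9 = 0.18
```

Instead of one enormous table over every combination of parts, you store a few small conditional tables and multiply along the chain – exactly the trick graphical models use at scale.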
Probability at scale also offers nice solutions for robotic behaviour. We don’t need to select every motor torque over time in the form of probabilities – we just need a single metric representing “threat”, represented as a probability, and we can then act on a sample of that with fight-or-flight responses. Probability distributions can also be sampled and refined (as in our chair example) at different scales over time – events typically don’t change randomly throughout time, so we can look at base levels first and at more detailed refinements later.
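A toy sketch of that single-metric idea (the sensor readings, the crude averaging, and the action names are all invented for illustration):

```python
import random

random.seed(2)

# Collapse several sensor readings (each a 0-1 threat likelihood)
# into one "threat" probability via crude averaging.
def threat_probability(readings):
    return sum(readings) / len(readings)

p_threat = threat_probability([0.9, 0.7, 0.8])

# Act on a sample of the distribution rather than planning every
# motor torque: flee with probability p_threat, otherwise carry on.
action = "flee" if random.random() < p_threat else "carry_on"
print(round(p_threat, 2), action)
```

The robot never computes an exact plan; it samples one behaviour from a single summary probability, which is cheap and usually good enough.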