We are given a list of samples
[x1 ; x2 ; x3; ...; xn] and we know that
they originate from a uniform distribution. What are the parameters of the
For simplicity, we can assume that
x[i] <= x[i+1] and
x < x[n]. This
condition implies that we presort the sampled numbers and there are at least
2 unique numbers in the sample.
I am writing this text mostly because my mentor at my internship mentioned this problem to me. As I was solving the problem, I felt that I have managed to grasp a better understanding of bayesian methods simply by looking at such toy examples. In other words, an unintended consequence of this blog post is to hopefully explain what bayesian estimation mean in the context of model fitting. In particular, we will be studying a specific case where the distribution is known to be uniform and we want to obtain a probability distribution over possible parameter values.
What is a Uniform Distribution
A continuous uniform distribution
is a distribution defined by two parameters,
and where the probability of getting any number between
A Naive Argument
When presented this problem, I immediately claim that
a = x
b = x[n]. This is the most intuitive answer, but I was implicitly
- The estimated parameters is independent of the number of samples.
- The estimated parameters would only be a function of two numbers, that is and .
A Bayesian Argument
What we are trying to do here is to perform a Maximum a posteriori (MAP) Esimation of the parameters. In this particular problem, we can write bayes rule as:
In the above equation, the refers to the prior belief on probability of a given set of parameters. If we are dealing with a discrete distribution, would refer to actual probabilities. In the case of continuous distributions, refers to the probability density function.
For example, in our problem, would refer to the probability that is the right distribution without considering any of the observed samples.
In the text that follows, I will abbreviate as and as (D stands for data). Using the abbreviations, we can rewrite Bayes rule for MAP as follows:
- Posterior distribution on parameter space accounting for the observed data.
- Likelihood function - informally, this is the probability that we obtain the given observation assuming some values for the parameters.
- Prior belief on the parameter space without making any observations.
- Normalizing constant. This constant ensures that the probability distribution integrates to 1.
In machine learning methods that most are familiar with, i.e: some variant of gradient descent, we are searching for the set of parameters that maximizes likelihood function. We can imagine the loss function as some kind of probabilistic measure of the probability that we obtain . This class of methods in called Maximum Likelihood Estimation. The primary downside of this method is that we do not get a distribution over the parameter values and it is fairly easy to overfit the model’s training without taking special precaution.
To Infinity, and Beyond!
We have Bayes’ rule - let’s apply it! However, without defining any bounds on the samples and parameters, there is no clear notion on how we can compute the terms we required to apply Bayes’ rule.
One might argue that , and one would be (somewhat) correct! The intuitive argument to this claim is to imagine an infinitely long line, where the probability of choosing points in random points without any knowledge of the choice’s distribution, is essentially zero. Similarly, you can make such an argument for - without any knowledge of the underlying bounds. The problem with either claims is that , and this on its own violates one of the fundamental property of probabilities.
The trick here is to think in terms of limits and bounds. Let’s assume that , where L is the bounds of our sample space. By constraining our problem, we can express the prior, likelihood function and normalizing constant in terms of , and .
We can do a sanity check by integrating the above equation from to and making sure it is equal to 1. The bounds of the integral of is from to , whereas the bounds of the integral of is from to . Intuitively, you can imagine the integration area as a triangle of length and height 2L. The area of the triangle is and since all possible values of the parameters in the parameter space with non-zero probability space are equally likely, the pdf over the non-zero parameter space should be the reciprocal of the total area.
The normalizing constant is simply the integral of over the entire parameter space. In most cases when we are dealing with complex distributions, especially in the context of machine learning, this constant is non tractable. This is one of the reasons why Bayesian machine learning is “hard”. One of the methods used to estimate this normalizing constant is using Monte Carlo Markov Chain methods to sample from this distribution and estimate the integral.
In the problem we are dealing with, fortunately, is analytically tractable:
To see how we arrive in this equation, one can observe that and and similarly, . Since we’ve assumed that the samples (and implicitly, the parameters) are confined by a space of to , the non-zero probability values of our limits are for and for .
And we can proceed with integration as follows. The resultant integral is as follows, I’ll leave solving the integral an exercise for the reader.
I promise the final equation will be less monstrous and contain less terms than this!
Applying Bayes Rule
Now that we have all three components required for bayes rule, we can apply them together to obtain the posterior distribution of .
Recall that bayes rules is as follows:
Applying the terms we’ve obtained above, we’d arrive at the following equation. I would once again leave out most of the messy maths.
Applying the Limits
Earlier when we are solving this equation, we have constrained the samples (and implicitly, the parameters) to be in the range . As it turns out, to obtain the above posterior distribution of parameters for the entire space of real numbers, all we need to do is take
Forfunately for us, most of the terms cancel out, and we end up with a nice simple expression. This is the probability distribution of the parameter space .
And that is the probability distribution of our parameters.
The obvious answer to our original problem is fairly simple - we want to take the parameters the yield the higher value for . Evidently, that’s going to be the distribution with the smallest width (in other words, we want the values of such that is minimum) which meets the requirement of . It is straightforward arrive at the solution , which proves both arguments of the above solution.
So what now?
But wait? We do have a probability distribution of the parameters, don’t we? This is the primary benefits of bayesian-based model fitting (or bayesian machine learning). Not only do we get “learnt” parameter values, you we a distribution over them. In this case, we’ve obtained a distribution over all possible values of . As a sanity check, we can make sure that does integrate to . This shall be left as an exercise for the reader.
In the space of parameters with non-zero probability values, we have
We can rewrite the above probability distribution as the distribution over the width of the uniform distribution as follows::
This suggests that the decay probability of width as we increase it from decays at a rate with exponential with respect to . This is somewhat expected, but illustrated the strength of bayesian-based parameters estimation - we get an idea of how confident we are with our parameter estimations.
When we searching for the value of that yields that maximum posterior probability, we are essentially looking for the mode of the posterior distribution. What if we consider the mean (the expected value) of ? This can be obtained by solving the integral:
Which will give us the following:
I find this to be a really nice and intriguing result. Nice because we have managed to show that in a unbounded sample space, we have the mean width of the distribution to be proportional to the width of our sample size, and (somewhat) inversely proportional to the number of samples we have collected. Intriguing because of the term. From an intuition perspective, I do not understand why would be the denominator.
To go about the cases where , we will need to account for some corner cases in integration (Recall that integrating when yields ).
I’d be interested if anyone can come up with a plausible explanation
for the term below. If you do, please do drop me an email at
fuyong [at] fyquah.me.