Bernstein–von Mises theorem

A transcript of how I think about the BvM theorem, with interactive graphics and collapsible document structure.

Posted on November 25, 2018



In this article we will think about the frequentist and Bayesian approaches to statistics and when they are equivalent. This is the focus of the Bernstein–von Mises theorem. The idea of this post is to make papers like Leahu, "On the Bernstein-von Mises phenomenon in the Gaussian white noise model", more accessible to someone not particularly familiar with either frequentist or Bayesian statistics. If you have comments and/or suggestions, drop me a line at contact@pwacker.com.

Our discussion will revolve around the following minimal toy problem: We are interested in inferring the value of an unknown parameter \(\theta\in \mathbb R\). We only have indirect knowledge of it: someone gave us a noisy estimate \(X\), polluted by an additive \(N(0,\sigma^2)\) Gaussian random variable ("noise"):

\[\underbrace{X}_{\text{observation}} \;=\; \underbrace{\theta}_{\text{unknown parameter}} \;+\; \underbrace{\sigma}_{\text{noise level}} \cdot \underbrace{\epsilon}_{N(0,1)\text{ Gaussian noise}}\]
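To make this concrete, here is a minimal sketch (in Python, with illustrative values for \(\theta\) and \(\sigma\) that are not prescribed anywhere above) of how such an observation could be simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.7   # "true" parameter (illustrative value; unknown in practice)
sigma = 0.5   # noise level (illustrative value)

# One noisy observation X = theta + sigma * epsilon, with epsilon ~ N(0, 1)
epsilon = rng.standard_normal()
X = theta + sigma * epsilon
print(f"observed X = {X:.3f}  (true theta = {theta})")
```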

So the problem we want to solve is:

Given \(X\), what is \(\theta\)?

Now there are two ways of approaching this problem: The frequentist ("classical") way and the Bayesian way. We will discuss how a frequentist and a Bayesian will "solve" this.

The way of the frequentist

Skip this paragraph if you already know: Maximum likelihood estimator, consistency.

The frequentist devises a clever estimator \(\hat\theta\). This is a function which takes the data \(X\) and outputs a number which is believed to be a good approximation for \(\theta.\)

In this simple case, a common estimator would be the maximum likelihood estimator \[\hat\theta_{\text{ML}}(X) = X.\]

In other words, for a given observation \(X\) we will just take the position of \(X\) as an estimate for \(\theta\). This sounds plausible: \(X\) is generated by adding a symmetric noise term to \(\theta\), so we cannot tell whether \(\theta\) is "actually" to the left or to the right of \(X\), so we just stick with the middle, i.e. \(X\).

The frequentist is interested in the consistency of his estimator. This means that he is happy if his estimator converges, as the noise level \(\sigma \to 0\), to the true value \(\theta\) (in a suitable probabilistic sense).

You can play with the figure above to see that for small \(\sigma\), samples \(X\) (and hence the estimator \(\hat\theta_{\text{ML}}\)) will get closer and closer to the true parameter \(\theta\).

If you object that letting \(\sigma \to 0\) is unrealistic, consider the case where we have not one observation \(X\) but a growing number of samples \[\begin{aligned}X_1 &= \theta + \sigma \cdot \epsilon_1\\X_2 &= \theta + \sigma \cdot \epsilon_2\\ \vdots &\qquad \vdots\\X_N &= \theta + \sigma \cdot \epsilon_N\end{aligned}\]

Note that the dataset is generated by one common true parameter \(\theta\).

Then the maximum likelihood estimator given the whole dataset \(\{X_1,\ldots,X_N\}\) is the arithmetic mean \[\hat\theta_{\text{ML}} = \frac{1}{N}\sum_{n=1}^N X_n.\]

A simple analysis shows that for growing \(N\), the estimator \(\hat\theta_{\text{ML}}\) converges (again in a suitable probabilistic sense) to the true value \(\theta\). This is equivalent to having just one sample \(X\) and letting \(\sigma\to 0\) (the sample mean has the same distribution as a single observation with noise level \(\sigma/\sqrt N\)). From now on, we will work with the latter scenario because it has fewer indices. For an interpretation, we can always think about the "growing dataset" case.
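As a quick numerical illustration (a sketch with arbitrary illustrative values of \(\theta\) and \(\sigma\)), the arithmetic mean does indeed approach the true parameter as \(N\) grows:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 0.7, 0.5   # illustrative true parameter and noise level

for N in (10, 100, 1_000, 10_000):
    X = theta + sigma * rng.standard_normal(N)   # N noisy observations
    theta_ml = X.mean()                          # maximum likelihood estimate
    print(f"N = {N:6d}   theta_ML = {theta_ml:.4f}   error = {abs(theta_ml - theta):.4f}")
```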

[Interactive figure: the real line with \(0\) and the observation \(X = \hat\theta_\text{ML}\) marked.]

The Bayesian approach

Skip this paragraph if you know how the posterior is computed from prior and likelihood.

Along comes the Bayesian. He would like to ask questions like

  • If \(X=1\), what is the probability that \(\theta > 0.5\)?
  • If \(X=1\), what is the "most probable" value of \(\theta\) (meaning the value of highest probability density)?
  • If the last question can be answered in the affirmative, is there a symmetric interval around that value which carries 95% of the probability that \(\theta\) lies in it?

The frequentist tries to tell him that it makes no sense to ask questions about the probability distribution of \(\theta\): after all, \(\theta\) is fixed, just like the face of a poker card is fixed even when you cannot see it. But the Bayesian has already started writing down conditional probabilities (note that in the following, \(\mathbb P\) sometimes denotes a probability measure and sometimes a probability density): \[\mathbb P(\theta|X) = \frac{\mathbb P(X|\theta) \cdot \mathbb P(\theta)}{\mathbb P(X)}. \]

The Bayesian realizes that in order to continue further, he needs to define a prior distribution \(\mathbb P(\theta)\). This will reflect his belief about \(\theta\):

  • If he believes that \(\theta\) has to be very close to \(1\), he will choose a probability distribution \(\mathbb P(\theta)\) which gives a lot of probability mass to intervals containing \(1\), like a sharply peaked normal distribution around \(1\).
  • If he has a strong belief that \(\theta\) cannot be negative, he will choose \(\mathbb P(\theta)\) to have only probability mass on the positive numbers, like a Gamma distribution.
  • If he is certain that \(\theta\) is equally likely to be any of the numbers \(1, 3, 6\), he will choose a mixture of point masses \(\mathbb P(\theta) = \frac{1}{3}\delta_1(\theta)+\frac{1}{3}\delta_3(\theta)+\frac{1}{3}\delta_6(\theta).\)
  • If he has no strong opinion about \(\theta\), he might choose a non-committal probability distribution \(\mathbb P(\theta) = \mathcal N(0, \tau^2)\) for some \(\tau\) not too small.

Let's see how this works. We choose a prior \(\mathbb P(\theta) = \mathcal N(0, \tau^2)\) (in the figure below, \(\tau = 1.0\)) and we assume that \(X\) is given and fixed (in the figure below, \(X = 1.3\)). The product of prior and likelihood (divided by the normalizing constant \(\mathbb P(X)\) to make it a proper probability measure) yields the posterior \(\mathbb P(\theta|X).\)

In this case, the posterior is again a Gaussian probability measure (this is due to a property called "conjugacy" of this particular choice of prior and likelihood). If we go through the math, we get \(\mathbb P(\theta|X) = \mathcal N\left(\frac{\tau^2}{\sigma^2 + \tau^2}\cdot X, \frac{\sigma^2\tau^2 }{\sigma^2+\tau^2} \right).\)

[Interactive figure: the prior \(\mathbb P(\theta)\), the likelihood \(\mathbb P(X|\theta)\) (a function of \(\theta\), as \(X\) is fixed!), and the posterior \(C\cdot\mathbb P(X|\theta)\cdot \mathbb P(\theta)\), where \(C=\frac{1}{\mathbb P(X)}\) is a constant w.r.t. \(\theta\).]

Note how the posterior is now not centered at \(X=1.3\) anymore, but at \(\frac{\tau^2}{\sigma^2 + \tau^2}\cdot X = 1.04.\) This is due to the prior "pulling" the mean towards \(0.\)
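If you want to check the conjugate update yourself, here is a small sketch using the closed-form expressions from above and the figure's values \(X = 1.3\), \(\tau = 1\), \(\sigma = 0.5\):

```python
import numpy as np

X, tau, sigma = 1.3, 1.0, 0.5   # values used in the figure

# Conjugate Gaussian update: prior N(0, tau^2), likelihood N(theta, sigma^2)
post_mean = tau**2 / (sigma**2 + tau**2) * X          # 0.8 * 1.3 = 1.04
post_var  = sigma**2 * tau**2 / (sigma**2 + tau**2)   # 0.2
print(f"posterior mean = {post_mean:.3f}, posterior std = {np.sqrt(post_var):.3f}")
```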

The posterior measure \(\mathbb P(\theta|X)\) is the Bayesian's answer to "Given \(X\), what is \(\theta\)?". Only when put under pressure will he consent to also give a point estimator (usually the posterior mean or the point of highest posterior density, which are identical in this case). Contrast this with the frequentist approach, where we always obtain a point estimator but never ever talk about a "probability distribution of \(\theta\)", neither prior nor posterior, because \(\theta\) is fixed and non-random (albeit unknown).

A frequentist's confidence intervals

Skip this paragraph if you know confidence intervals.

Given an estimator \(\hat \theta\) of \(\theta\) (recall that \(\hat \theta\) will be a function of the data \(X\)), the frequentist statistician is interested in how good it is, i.e. how close to the true parameter \(\theta\) he can expect it to be.

An important tool for this is the confidence interval, which is constructed as follows:

  1. Choose a confidence level \(\gamma\) (typically \(\gamma = 0.95,\) \(\gamma = 0.99\) or similar).
  2. Choose numbers \(u(X)\) and \(v(X)\) (depending on \(X\)) such that \[\mathbb P\!\left[u(X) < \theta < v(X)\right] \geq \gamma. \]
  3. The interval \(\left[ u(X), v(X)\right]\) is then a confidence interval with confidence level \(\gamma.\)

If you know Bayesian credible sets and think that this looks suspiciously like the definition of credible sets, consult the next section on credible sets.

We have to be careful in the interpretation of the probability statement in step 2. This does NOT mean:

"The probability that \(\theta\) lies between \(u(X)\) and \(v(X)\) is at least \(\gamma.\)",

but rather:

"The probability that the interval \(\left[ u(X), v(X)\right]\) contains \(\theta\) is at least \(\gamma.\)"

This may seem like nitpicking but remember that in the frequentist setting we cannot assign a probability to \(\theta\)! This value is fixed and non-random but we can talk about the probability distribution of \(X\) (and thus, also of \(u(X),v(X)\)).

Let's work through our toy example here: Our estimator is the maximum likelihood estimator, i.e. \(\hat\theta = X\). Now we need to choose \(u(X)\) and \(v(X)\) such that the probability statement in step 2 holds. As the setting is very symmetric, we set \(u(X) = X-\delta\) and \(v(X) = X + \delta\), and now we only need to find the correct value of \(\delta\) such that \[\mathbb P\left[ X - \delta < \theta < X + \delta\right] > \gamma.\] We recall that \(X = \theta + \mathcal N(0, \sigma^2)\) (\(\theta\) is fixed but unknown), and hence this is equivalent to \[\mathbb P\left[ -\frac{\delta}{\sigma} < \frac{X-\theta}{\sigma} < \frac{\delta}{\sigma}\right] > \gamma. \] But \(\frac{X-\theta}{\sigma}\) is now a \(N(0,1)\) random variable and we can consult tables for the normal distribution which will tell us the correct value of \(\delta\). For example, for \(\delta = 2\sigma\), the probability statement is true for \(\gamma = 0.95\). This means:

If we set \(\gamma = 0.95\) and \(u(X) = X - 2\sigma, v(X) = X+2\sigma\), then \[ \mathbb P\left[u(X) < \theta < v(X)\right] > \gamma.\]
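A one-line sanity check of the quantile argument (using scipy; the exact two-sided 95% half-width is about \(1.96\sigma\), so \(\delta = 2\sigma\) is sufficient):

```python
from scipy.stats import norm

# P(-2 < Z < 2) for Z ~ N(0, 1): slightly more than 0.95
print(norm.cdf(2) - norm.cdf(-2))   # ~0.9545
print(norm.ppf(0.975))              # ~1.96, the exact half-width in units of sigma
```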

Let's check this computationally. For a fixed \(\theta\), we sample 100 datapoints for \(X\) and check whether the confidence interval for \(\gamma = 0.95\) contains \(\theta\) or not. In the latter case, the interval will be displayed in red. We expect this to happen in roughly 5% of cases, i.e. approximately 5 of the 100 intervals will not contain the parameter.
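The experiment can be sketched as follows (the values of \(\theta\) and \(\sigma\) are again illustrative; the coverage statement does not depend on them):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma = 0.7, 0.5   # illustrative true parameter and noise level
n_repeats = 100

X = theta + sigma * rng.standard_normal(n_repeats)   # 100 independent observations
u, v = X - 2 * sigma, X + 2 * sigma                   # confidence intervals [u(X), v(X)]
covered = (u < theta) & (theta < v)
print(f"{covered.sum()} of {n_repeats} intervals contain theta (expected roughly 95)")
```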

Bayesian credible sets

Skip this paragraph if you know Bayesian credible sets.

The Bayesian is usually content with his posterior distribution as the result of his statistical analysis, but external constraints sometimes compel him to extract more information from it. We sometimes want to know a set which contains the true parameter with high probability. This was the motivation for frequentist confidence intervals, and it is the reason for the definition of Bayesian credible sets. The definition is as follows:

  1. Choose a credibility level \(\gamma\) (typically \(\gamma = 0.95,\) \(\gamma = 0.99\) or similar).
  2. Choose numbers \(u(X)\) and \(v(X)\) (depending on \(X\)) such that \[\mathbb P\!\left[u(X) < \theta < v(X)\right] \geq \gamma. \]
  3. The interval \(\left[ u(X), v(X)\right]\) is then a credible interval with credibility level \(\gamma.\)

If you compare this to the definition of confidence intervals (see the last section), you will see that I just exchanged "confidence" for "credibility". So what's the difference?

Distinction between credible and confidence intervals

Skip this paragraph if you know what the difference between Bayesian credible sets and frequentist confidence sets is.

In the frequentist setting, we considered \(\theta\) as unknown, but fixed. We were not allowed to think about probabilistic statements in terms of \(\theta\). The stochasticity consisted solely of the unknown perturbation which was responsible for the generation of the sample \(X\). The frequentist confidence interval got its meaning by (mentally) repeating the measuring process \(X = \theta + \sigma\cdot\epsilon\) over and over again, with more and more data \(X\).

The Bayesian viewpoint assumes that there is just one observation \(X\) which is fixed and we're stuck with it. There is no noise strength \(\sigma\) going to 0, so frequentist consistency is useless for us. On the other hand, we think about \(\theta\) being not only unknown, but probabilistic, with a prior \(\mathbb P(\theta)\) describing our belief about a-priori possible values of \(\theta\) (and their relative weighting).

The differing philosophies of frequentist and Bayesian statistics can thus be described (admittedly oversimplified) by the following comparison:

  • Parameter \(\theta\): for the frequentist it is unknown but fixed; for the Bayesian it is unknown and probabilistic, described by the prior \(\mathbb P(\theta)\).
  • Observation \(X\): the frequentist assumes, for the analysis, that \(X\) is the result of a repeatable experiment, so we can gather more data and get a more accurate estimate; the Bayesian is stuck with one fixed value of \(X\) and has to deal with it.
  • An interval \(I=[u,v]\) such that \(\mathbb P(\theta \in I) \geq \gamma\): for the frequentist this is a confidence interval: \(I\) is probabilistic and tries to "catch" the fixed \(\theta\), successful with probability at least \(\gamma\), and the interval becomes fixed for a concrete observation \(X\). For the Bayesian this is a credible interval: \(I\) is fixed, determined by our single fixed \(X\), and the probabilistic parameter \(\theta\) falls in \(I\) with posterior probability at least \(\gamma\).

Coming back to our example, we derived the following form of the posterior: \[\mathbb P(\theta|X) = \mathcal N\left(\frac{\tau^2}{\sigma^2 + \tau^2}\cdot X, \frac{\sigma^2\tau^2 }{\sigma^2+\tau^2} \right).\] For convenience we denote the posterior mean and standard deviation by \[\begin{aligned}\mu_{\theta|X} &:= \frac{\tau^2}{\sigma^2 + \tau^2}\cdot X \\ \sigma_{\theta|X}&:= \frac{\sigma\tau }{\sqrt{\sigma^2+\tau^2}}.\end{aligned}\] This is again a symmetric setting (but this time around the posterior mean \(\mu_{\theta|X}\)), so we look for \(u = \mu_{\theta|X} - \delta\) and \(v = \mu_{\theta|X} + \delta\) such that \[\mathbb P(u < \theta < v|X) \geq \gamma. \] Again, we just need to find the number \(\delta\) from a Gaussian quantile table, and for \(\gamma = 0.95\) we obtain \(\delta = 2 \sigma_{\theta|X}.\) This means that for \[\begin{aligned}u &= \mu_{\theta|X} -\delta = \frac{\tau^2}{\sigma^2 + \tau^2}\cdot X - 2\frac{\sigma\tau }{\sqrt{\sigma^2+\tau^2}} \\ v &= \mu_{\theta|X} +\delta = \frac{\tau^2}{\sigma^2 + \tau^2}\cdot X + 2\frac{\sigma\tau }{\sqrt{\sigma^2+\tau^2}}, \end{aligned}\] the parameter \(\theta\) will a posteriori lie inside \([u,v]\) with probability at least \(\gamma\).

Visually, this looks like this: We plot the posterior from above for \(X=1.3,\)
\(\tau = 1\) and \(\sigma = 0.5\). The interval bracketed in black contains more than 95% of the posterior probability mass.
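The corresponding computation can be sketched like this (same values as in the figure; we keep the rounded factor 2 from the text rather than the exact quantile 1.96):

```python
import numpy as np
from scipy.stats import norm

X, tau, sigma = 1.3, 1.0, 0.5   # values used in the figure

post_mean = tau**2 / (sigma**2 + tau**2) * X
post_std  = sigma * tau / np.sqrt(sigma**2 + tau**2)

delta = 2 * post_std                        # rounded half-width, as in the text
u, v = post_mean - delta, post_mean + delta

# Posterior probability that theta lies in [u, v]; should exceed 0.95
prob = norm.cdf(v, post_mean, post_std) - norm.cdf(u, post_mean, post_std)
print(f"[u, v] = [{u:.3f}, {v:.3f}], posterior mass = {prob:.4f}")
```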

Reconciling Bayesian and frequentist statistical analysis

This is the main paragraph of this article. If you can skip it, this whole article is wasting your time.

We have talked about how a frequentist and a Bayesian would approach the problem of inferring the parameter \(\theta\) if they are given a noisy observation \[ X = \theta + \sigma \cdot \mathcal N(0,1).\] The two approaches differ radically: The frequentist designs an estimator \(\hat \theta\) (for example the maximum likelihood estimator \(\hat\theta_{\text{ML}} = X\)) and proves its asymptotic consistency for \(\sigma\to 0\). This shows that, given enough data (of course, in reality we have neither vanishing observation error nor arbitrary amounts of data, something the Bayesian readily points out), the estimator works nicely.

The Bayesian conjectures the existence of a prior probability distribution (the frequentist does not know how to explain to him, for the hundredth time, that the parameter is merely unknown and that there is nothing "random" about it), which he chooses according to how he "feels" about \(\theta\) before looking at data. Then he takes a given observation \(X\), calculates the product of prior and likelihood function, normalizes it to a probability distribution and calls this the "posterior" distribution.

The frequentist and the Bayesian have argued enough about who is more wrong and decide to see whether there is maybe something they both can agree on. There is no general prior which can be chosen such that the frequentist estimator can be interpreted as a Bayesian posterior measure, but there is a way of looking at Bayesian statistics from a frequentist viewpoint.

One obstacle to comparing frequentist and Bayesian statistics is that the former considers the likelihood \(\mathbb P(X|\theta)\) while the latter talks about the posterior \(\mathbb P(\theta|X).\) First we define the posterior mean \(\hat \theta := \mathbb E(\theta|X).\) This is a point estimator, and the frequentist is fine with objects like that (regardless of the fact that it was constructed by means of a forbidden probability distribution over the parameter). This estimator is called the maximum a posteriori estimator or the conditional mean estimator (two concepts that coincide in this special case, since the posterior is Gaussian).

We now employ a very simple trick: We define the quantity \[\Delta := \theta - \hat \theta.\] The frequentist and the Bayesian see something completely different when looking at \(\Delta\):

Frequentist:

\[\Delta = \underbrace{\theta}_{\text{unknown, fixed parameter}} \;-\; \underbrace{\tfrac{\tau^2}{\sigma^2 + \tau^2}\cdot X}_{\text{MAP estimator, depends on (yet to be specified) } X}\]

Bayesian:

\[\Delta = \underbrace{\theta}_{\text{random variable}} \;-\; \underbrace{\hat \theta}_{\text{random variable's posterior mean}}\]
Now the frequentist's expression needs to be conditioned on some \(\theta\), because the frequentist believes in the invariance and constancy of \(\theta\), and the Bayesian's expression "wants" to be conditioned on \(X\), because that's what Bayesians do: incorporate data into the posterior. Then the quantity \(\Delta\) is viewed, respectively, as:

Frequentist:

\[\Delta|\theta = \underbrace{\theta}_{\text{known, fixed parameter}} \;-\; \underbrace{\tfrac{\tau^2}{\sigma^2 + \tau^2}\cdot (\theta + \sigma \cdot \epsilon)}_{\text{depends on noise term } \epsilon}\]

Bayesian:

\[\Delta|X = \underbrace{\theta|X}_{\text{posterior distribution}} \;-\; \underbrace{\hat \theta | X}_{\text{mean of posterior distribution}}\]

Hence, we have (as seen by the frequentist) \[\mathbb P(\Delta|\theta) = \mathcal N\left(\frac{\sigma^2}{\sigma^2+\tau^2}\theta, \frac{\sigma^2\tau^4}{(\sigma^2+\tau^2)^2}\right)\] and (as seen by the Bayesian) \[\mathbb P(\Delta|X) = \mathcal N\left(0, \frac{\sigma^2\tau^2}{\sigma^2+\tau^2}\right).\] Below is a plot of both measures. You can vary the numerical value of \(\sigma\) and you will see that the two measures converge to each other for \(\sigma \to 0\).
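A quick Monte Carlo sanity check of the frequentist's distribution of \(\Delta|\theta\) (with illustrative values of \(\theta\), \(\sigma\), \(\tau\); the Bayesian's \(\mathbb P(\Delta|X)\) is just the posterior shifted by its mean, so it needs no simulation):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, tau = 0.7, 0.5, 1.0   # illustrative values

eps = rng.standard_normal(200_000)               # repeated noise realizations
X = theta + sigma * eps
theta_hat = tau**2 / (sigma**2 + tau**2) * X     # posterior mean estimator
delta = theta - theta_hat                        # samples of Delta | theta

print("empirical mean/std:", delta.mean(), delta.std())
print("predicted mean/std:",
      sigma**2 / (sigma**2 + tau**2) * theta,    # sigma^2 * theta / (sigma^2 + tau^2)
      sigma * tau**2 / (sigma**2 + tau**2))      # sqrt of sigma^2 tau^4 / (sigma^2 + tau^2)^2
```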

In fact, the statement \[\|\mathbb P(\Delta|X) - \mathbb P(\Delta|\theta)\|_\text{TV} \xrightarrow{\sigma\to 0} 0,\] i.e. that the measures converge in the total variation norm, is a suitable formulation of the Bernstein–von Mises theorem which still makes sense in infinite dimensions (see Leahu, "On the Bernstein-von Mises phenomenon in the Gaussian white noise model").
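Finally, here is a numerical sketch of that convergence (illustrative \(\theta\) and \(\tau\); the total variation distance is approximated by integrating \(\tfrac12\int|p_{\Delta|X}-p_{\Delta|\theta}|\) on a fine grid):

```python
import numpy as np
from scipy.stats import norm

theta, tau = 0.7, 1.0   # illustrative true parameter and prior scale

for sigma in (1.0, 0.5, 0.1, 0.01):
    # Frequentist view: Delta | theta
    m_freq = sigma**2 / (sigma**2 + tau**2) * theta
    s_freq = sigma * tau**2 / (sigma**2 + tau**2)
    # Bayesian view: Delta | X (centered at 0, regardless of X)
    s_bayes = sigma * tau / np.sqrt(sigma**2 + tau**2)

    s = max(s_freq, s_bayes)
    x = np.linspace(m_freq - 8 * s, m_freq + 8 * s, 20_001)
    dx = x[1] - x[0]
    tv = 0.5 * np.sum(np.abs(norm.pdf(x, m_freq, s_freq) - norm.pdf(x, 0.0, s_bayes))) * dx
    print(f"sigma = {sigma:5.2f}   TV distance ≈ {tv:.4f}")
```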