title: Advanced Statistics
subtitle: DLMDSAS01
authors: Prof. Dr. Unknown
publisher: IU International University of Applied Sciences
date: 2023

Unit 4: Bayesian Statistics

Our learning objectives are as follows:

Introduction

p. 96

The course book begins with an example of traffic lights: if ours is green, we assume the crossing light is red. In this case, based on past experience, we hold a prior belief that if our traffic light is green, the other is red.

Incorporating prior knowledge plays a major role in decision making and directly affects the outcomes and the decisions we make. This way of using and interpreting probabilities is called Bayesian statistics, named after the Reverend Thomas Bayes (Wiki).

4.1 - Bayes’ Rule

Central to Bayesian analysis are conditional probabilities:

$$P(B|A) = \frac{P(B\cap A)}{P(A)}$$

It is said "the probability of $B$ given $A$", meaning that event $A$ has already occurred.

Derivation of Bayes’ Formula:

$$\begin{align*} P(A|B) &= \frac{P(A \cap B)}{P(B)}\\ P(A \cap B) &= P(A|B)\,P(B)\\ P(B \cap A) &= P(A\cap B) = P(A|B)\,P(B) \\ &\text{then...} \\ P(B|A) &= \frac{P(B\cap A)}{P(A)} = \frac{P(A|B)\,P(B)}{P(A)} \end{align*}$$

Writing it another way, using $H$ to denote our hypothesis and $D$ to denote data (sometimes people use $E$ for "evidence"):

$$P(H|D) = \frac{P(D|H)\,P(H)}{P(D)}$$

Let's discuss the meaning of each component:

- $P(H|D)$ is the posterior: the probability of the hypothesis given the data.
- $P(D|H)$ is the likelihood: the probability of observing the data if the hypothesis holds.
- $P(H)$ is the prior: our belief in the hypothesis before seeing the data.
- $P(D)$ is the evidence: the overall probability of observing the data.

Note: the evidence does not depend on the hypothesis $H$ and is therefore the same for all hypotheses we might want to test.

The posterior (what we want) is given by the likelihood times the prior, normalized by the evidence.

The course book gives an example relating smoke to fire. However, I always like medical test examples to show how accurate tests need to be.

For a Covid-19 test, suppose that when someone has Covid, the test comes back positive 99% of the time. The chance that a randomly selected person has Covid is a stunning 1% (made up). The overall chance that people test positive is 1.1%, meaning we have a slight false-positive rate. We want to know: what is the chance someone has Covid given that they test positive?

Let $P(C+) = 1\%$ be the probability of having Covid, $P(T+|C+)=99\%$ the sensitivity, and $P(T+)=1.1\%$ the probability that a person tests positive.

$$\begin{align*} P(C+|T+) &= \frac{P(T+|C+)\,P(C+)}{P(T+)}\\ &= \frac{0.99\cdot 0.01}{0.011}\\ &= 0.90 \end{align*}$$

Even with that small false-positive rate, you can see the value of the test result drop quite a bit: a positive test means only a 90% chance of actually having Covid.
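A quick sanity check in Python (my own sketch, not from the course book), plugging the made-up numbers into Bayes' rule:

```python
# Bayes' rule with the made-up Covid test numbers.
sensitivity = 0.99   # P(T+ | C+)
prevalence = 0.01    # P(C+)
p_positive = 0.011   # P(T+), overall positive rate

# P(C+ | T+) = P(T+ | C+) * P(C+) / P(T+)
posterior = sensitivity * prevalence / p_positive
print(f"P(C+|T+) = {posterior:.2f}")  # 0.90
```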

I made up a lot of the stats to make the example easy, but in many cases we do not have direct access to these quantities. However, we can decompose the evidence using the law of total probability:

$$P(A) = \sum_i P(A|B_i)\,P(B_i)$$

This sums up all of the ways something can happen, each counted separately. We then tuck that into Bayes' theorem:

$$P(H|D) = \frac{P(D|H)\,P(H)}{\sum_i P(D|H_i)\,P(H_i)}$$

For the Covid example, the total probability of a positive test is the probability of a positive test given that the patient has Covid, times the probability they have Covid, plus the probability of a positive test given that the patient does not have Covid, times the probability they do not have Covid.
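In symbols, spelling out that sentence:

$$P(T+) = P(T+|C+)\,P(C+) + P(T+|C-)\,P(C-)$$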

4.2 - Estimating the Prior, Benford's Law and Jeffreys' Rule

The Role of the Prior

We made up the prior for our example, but it is actually an important piece. The course book covers HIV tests. This is much like the Covid example above, but we are given the sensitivity $P(T+|H+)=0.999$ and the false-positive rate $P(T+|H-)=0.005$.

We are not given the prior, so we must find it before we can compute $P(H+|T+)$. The prior is $P(H+) = 0.00035$ according to some data from Germany.

The evidence is then:

$$\begin{align*} P(T+)&=P(T+|H+)\,P(H+)+P(T+|H-)\,P(H-)\\ &= 0.999\cdot 0.00035 + 0.005\cdot(1-0.00035)\\ &\approx 0.0053 \end{align*}$$

Then we have:

$$\begin{align*} P(H+|T+) &= \frac{P(T+|H+)\,P(H+)}{P(T+)} \\ &=\frac{0.999\cdot 0.00035}{0.0053} \\ &\approx 0.065 \end{align*}$$

So if someone randomly takes a test and it comes back positive, there's really only about a 7% chance it's correct, because of the relatively high false-positive rate compared to the very low prevalence.
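The same computation in Python (my own sketch), using the law of total probability for the evidence:

```python
# Law of total probability + Bayes' rule for the HIV test example.
sensitivity = 0.999       # P(T+ | H+)
false_positive = 0.005    # P(T+ | H-)
prior = 0.00035           # P(H+), prevalence from the German data

# Evidence: P(T+) = P(T+|H+)P(H+) + P(T+|H-)P(H-)
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(f"P(T+)    = {evidence:.4f}")   # ~0.0053
print(f"P(H+|T+) = {posterior:.3f}")  # ~0.065
```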

Doctors may narrow the sample pool by also including risk factors and symptoms.

Defining the prior can be difficult in practice.

Benford’s Law

Getting the prior is one of the most difficult aspects of Bayesian analysis. It is sometimes suggested to use a uniform, or flat, prior, meaning a uniform distribution. However, this is one of the most common mistakes.

Most systems are not uniform. Newcomb (1881) noticed that the front pages of logarithm tables (those for numbers with low leading digits) were more worn than the later ones, and described the frequency of leading digit $d$ with the formula:

$$P(d) = \log_{10}\left( 1 + \frac{1}{d} \right)$$
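To get a feel for the distribution, a tiny Python sketch (my own) tabulating the leading-digit probabilities:

```python
import math

# Benford's law: probability that a number's leading digit is d.
for d in range(1, 10):
    p = math.log10(1 + 1 / d)
    print(f"P({d}) = {p:.3f}")
# Leading digit 1 appears ~30.1% of the time; 9 only ~4.6%.
```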

Benford (1938) revisited this equation, and so it became known as "Benford's Law". It has many applications across many fields.

The key point in understanding the emergence of this phenomenon is that we are combining many different numbers from many different sources.

Jeffreys' Rule

We also might not know much about the system we want to analyse using Bayesian statistics. We would want to avoid specifying a prior, but Bayes’ formula demands it. We could assume a uniform prior, but as we have seen, that is probably not correct.

We would want to use the uniform distribution as a prior to express that we do not know what value the parameter should take and that we do not want to impose any constraints.

Consider a family of random variables whose parameter $\theta$ is assumed to follow a uniform distribution on the interval $(0,1)$; we hope this expresses that we do not know anything about $\theta$.

When setting up our model, there is no single best parameterization. We could express $\theta$ in terms of the logit function and apply the transformation:

$$\theta'= \log\left( \frac{\theta}{1-\theta} \right)$$

where $\theta' \in (-\infty, \infty)$. The transformed parameter is no longer uniformly distributed. Let's try to define this more formally:

Start with $\theta$. Transform $\theta$ with a transformation function $g$ to get the new variable $\phi=g(\theta)$. The transformation rule is defined as:

$$f(\phi) = f\left(g^{-1}(\phi)\right)\left| \frac{dg^{-1}(\phi)}{d\phi} \right|$$

Because $\theta$ is assumed to follow the uniform distribution, $f(\theta)$ is constant, yet the transformed density $f(\phi)$ is generally not. This contradicts our previous assumption that we have no knowledge about the prior: merely applying a transformation should not make a difference. Therefore, the uniform distribution is not suitable as an uninformative prior.
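To make this concrete, here is my own worked step for the logit case: with $\phi = \log(\theta/(1-\theta))$ we have $g^{-1}(\phi) = \frac{e^\phi}{1+e^\phi}$, and since $f(\theta)=1$ on $(0,1)$,

$$f(\phi) = 1 \cdot \left| \frac{d}{d\phi}\,\frac{e^\phi}{1+e^\phi} \right| = \frac{e^\phi}{(1+e^\phi)^2}$$

which is the standard logistic density, clearly not flat.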

The Jeffreys prior (Jeffreys, 1946) defines a prior for a parameterized random variable that is invariant under transformations, defined by the pdf:

$$f(\theta) \propto \sqrt{J(\theta)}$$

The $\propto$ means "proportional to", and $J(\theta)$ is the expected Fisher information of the parameter $\theta$. There is a constant $c$ such that $f(\theta) = c\cdot\sqrt{J(\theta)}$, and $f(\theta)$ is unique because its integral over the whole parameter space must be 1.

What is the Fisher information? It is a measure of the amount of information that a random variable $X$ contains about the parameter $\theta$, given an observed sample $x_1, \dots, x_n$. It is given by the negative of the second derivative of the log-likelihood function:

$$I(\theta) = - \frac{d^2\log(L(\theta))}{d\theta^2}$$

Another book says that the Fisher information is the variance of a specific random variable, written as:

$$I(\theta) = \text{Var}\left( \frac{\partial \log f(X;\theta)}{\partial\theta} \right)$$

Log-Likelihood Function

We will denote the likelihood function $\mathcal L$. It measures the probability, or likelihood, of observing the current data given a specific model that depends on one or more parameters:

$$\mathcal L(\theta) = f_X(x|\theta)$$

We assume that we know the density of the underlying probability distribution $f_X(\cdot\,|\theta)$, except for the parameter $\theta$. For practical reasons we often use the log-likelihood function, $\ln(\mathcal L)$.

The first derivative of the log-likelihood function is called the score function:

$$S(\theta) = \frac{d\log(L(\theta))}{d\theta}$$

The Fisher Information can be written as:

$$I(\theta) = -\frac{d^2\log(L(\theta))}{d\theta^2} = -\frac{dS(\theta)}{d\theta}$$

The expected Fisher information is then the expectation value of $I(\theta)$:

$$J(\theta)=E[I(\theta)]$$

The regularization assumption is that we can change the order of differentiation and integration. With this we can show that:

$$\begin{align*} E[S(\theta)] &= 0\\ \text{Var}[S(\theta)] &= E[S(\theta)^2] = J(\theta) \end{align*}$$
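As a concrete example (my own, not from the book), take a single Bernoulli observation with $f(x;\theta) = \theta^x(1-\theta)^{1-x}$:

$$\begin{align*} \log L(\theta) &= x\log\theta + (1-x)\log(1-\theta)\\ S(\theta) &= \frac{x}{\theta} - \frac{1-x}{1-\theta}\\ I(\theta) &= \frac{x}{\theta^2} + \frac{1-x}{(1-\theta)^2}\\ J(\theta) &= E[I(\theta)] = \frac{1}{\theta(1-\theta)} \end{align*}$$

So the Jeffreys prior is $f(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, which is a Beta(1/2, 1/2) distribution.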

The book, on pp. 105-106, shows that the Jeffreys prior is invariant under bijective transformations.

Then pp. 106-108 give an example of the Jeffreys prior for the Poisson distribution. This means you make the likelihood Poisson, then take the derivative of the log-likelihood. The Jeffreys prior follows from the variance of the score function.
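Condensing that derivation in my own words: for $X\sim \text{Pois}(\lambda)$,

$$\begin{align*} \log L(\lambda) &= -\lambda + x\log\lambda - \log(x!)\\ S(\lambda) &= -1 + \frac{x}{\lambda}\\ I(\lambda) &= \frac{x}{\lambda^2}, \qquad J(\lambda) = \frac{E[x]}{\lambda^2} = \frac{1}{\lambda} \end{align*}$$

so the Jeffreys prior is $f(\lambda) \propto \sqrt{J(\lambda)} = \lambda^{-1/2}$.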

Other Approaches

We will often know many details about the system we want to analyze. We can, in a way, interpret the training data used in a machine learning approach as the prior knowledge. This highlights the role of data quality as faulty or biased data can potentially have a significant impact on the output.

4.3 - Conjugate Priors

p. 109

The prior is very important. We just looked at how to use the prior to include our "a priori" knowledge of the system, and how to encode as little information as possible. Now, if we write the posterior probability with the law of total probability:

$$P(A_j|B) = \frac{P(B|A_j)\,P(A_j)}{\sum_i P(B|A_i)\,P(A_i)}$$

And if we move to the continuous case, we can replace the sum with an integral and the probabilities with likelihood functions:

$$f(\theta|x) = \frac{f(x|\theta)\,f(\theta)}{\int f(x|\theta)\,f(\theta)\,d\theta}$$

From what I understand, the numerator probabilities would be so small they are basically zero, but let's keep learning… this is only creating a new pdf to integrate over!

Let's cover the parts of that equation: $f(\theta|x)$ is the posterior, $f(x|\theta)$ is the likelihood, $f(\theta)$ is the prior, and the integral in the denominator is the evidence, which normalizes the result.

The likelihood formalizes the description of the observed data. We cannot influence it too much, but we can influence the prior. Since we need to perform the sum (or integral) over the likelihood times the prior, we can choose the parameterization of the prior so the sum (or integral) becomes easier. Basically, when we combine the two, we want the result to be a probability distribution that we can use easily, in particular one that can be expressed in closed form as a commonly used probability distribution.

A class of priors is called a Conjugate Prior with respect to a given likelihood function if the a posteriori distribution is of the same family of probability distributions as the prior.

Keep in mind that choosing a conjugate prior is a convenience. If we cannot describe our a priori knowledge in terms of a conjugate prior, then we should not try to "force" it. However, a conjugate prior does make further handling of Bayes' formula easier.

The book lists conjugate priors and continues into an example of a conjugate beta prior.
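A minimal sketch of the beta-binomial case (my own example with made-up numbers, not the book's): a Beta(a, b) prior combined with a binomial likelihood gives a Beta(a + k, b + n - k) posterior, so the update is just arithmetic:

```python
# Conjugate update: Beta prior + binomial likelihood -> Beta posterior.
a, b = 2.0, 2.0   # prior Beta(a, b): mild belief that the coin is fair
n, k = 10, 7      # observed: n tosses, k heads

a_post, b_post = a + k, b + (n - k)   # closed-form posterior parameters
posterior_mean = a_post / (a_post + b_post)
print(f"Posterior: Beta({a_post:.0f}, {b_post:.0f}), mean = {posterior_mean:.3f}")
# Beta(9, 5), mean ~0.643; no integration needed, which is the convenience.
```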

4.4 - Bayesian and Frequentist Approach

There are two ways to think about probabilities. The typical way is through counting, or frequency. If you flip a coin, the probability of heads is the number of times we observe heads facing up, divided by the total number of tosses. For a large number of events, a large sample:

$$P(E) = \lim_{n \to \infty} \frac{k}{n}$$

This reasoning suggests probability is a frequency of events in the long run. This is the Frequentist approach to statistics.
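A quick simulation (my own sketch) of that long-run frequency for a fair coin:

```python
import random

# Frequentist intuition: the relative frequency k/n of heads
# approaches 0.5 as the number of tosses n grows.
random.seed(42)
for n in (10, 1_000, 100_000):
    k = sum(random.random() < 0.5 for _ in range(n))
    print(f"n = {n:>6}: k/n = {k / n:.3f}")
```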

However, this is not the way we usually actually think of probabilities. When you look into the night sky and see a flashing light you probably think of a plane, not an alien spaceship. We assign a degree of plausibility, or belief, to what we observe. This is the Bayesian way of thinking about probabilities.

With the light-in-the-sky example, we use a prior for the probability of what it is, much like how we assume a coin is 50-50 heads or tails. But we can also update our belief and adapt the prior so that it matches our observations (such as confirming that the coin is a fair coin).

Two major differences: a frequentist treats probability as the long-run frequency of repeatable events, while a Bayesian treats it as a degree of belief; and the Bayesian approach lets us encode prior knowledge and update it as data arrives.

Predicting the results of an election doesn't make sense to a frequentist, because an election with the same settings and population cannot be replicated an infinite number of times to observe outcomes and determine the probabilities. But in Bayesian statistics, we can model our assumptions in the prior and calculate the a posteriori probability.


Assess Yo-Self

Suppose that we know $P(T)$, $P(U)$ and $P(U|T)$. What can we find using Bayes' formula?

I'm not too sure what the question is really asking, to be honest, but if we have the latter, $P(U|T)$:

$$P(U|T) = \frac{P(T|U)\,P(U)}{P(T)}$$

So we can then solve for $P(T|U)$. (yes)

Given an experiment yielding a RV $X$ whose distribution is defined by $\lambda$, which we suppose to follow a Gamma distribution $\Gamma(3,5)$, what can we say about $X$?

A conjugate prior… I'm saying $X\sim \Gamma(3\lambda, 5\lambda)$ (no). I am saying $X\sim \Gamma(3\lambda, 5)$ (no). Just $X\sim \Gamma(3,5)$ (no). Only that $\lambda\sim\Gamma(3,5)$, and nothing about $X$. (yes)

Now we have an experiment yielding $X\sim P(\lambda)$, and we have $\lambda \sim \Gamma(3,5)$ (look familiar?). An experiment following this model gives a sample $[2,3,5,7]$. How can we calculate the posterior?

The calculation of the posterior can be done by evaluating a statistic on the sample using the Bayes formula. (yes)
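Working that out (my own sketch, assuming $P(\lambda)$ means a Poisson distribution and the rate parameterization of the Gamma): the Gamma is conjugate to the Poisson, so with prior $\Gamma(\alpha, \beta)$ and $n$ observations summing to $\sum_i x_i$, the posterior is $\Gamma(\alpha + \sum_i x_i,\ \beta + n)$. Here $\sum_i x_i = 2+3+5+7 = 17$ and $n = 4$:

$$\lambda\,|\,x \sim \Gamma(3 + 17,\ 5 + 4) = \Gamma(20, 9)$$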

Benford's law tells us that lower leading digits appear more frequently than higher ones, in which medium?

It was first reported in logarithm tables, but later applied just about everywhere. (no) In numbers found in the literature…(yes)

What information do you need to calculate a Jeffreys prior?

The prior distribution and the derivative of the parameter? (no) The distribution family, from which the Fisher information follows. (yes)