title: Advanced Statistics
subtitle: DLMDSAS01
authors: Prof. Dr. Unknown
publisher: IU International University of Applied Sciences
date: 2023

Unit 7: Hypothesis Testing

pp. 181 - 205

Our learning objectives are as follows:

Buckle up for a wild ride

Introduction

We often want to make a “statement” about the entire population, but we only have samples of said population. It is usually impossible to analyse an entire population of anything. We therefore must define suitable samples that we can use to make such a “statement” about the entire population. For example, we cannot give medicine to every human on earth, but we still want to be able to state “this vaccine protects you against Covid-19.”

One challenge is to avoid biases and ensure the sample is representative of the population.

If we assume that we have gathered such an unbiased sample that is a good representative of the population then we want to develop and test a hypothesis, deciding if it is true based on the sample.

The Null Hypothesis is a statement that represents no effect. It is typically denoted as $H_0$ (said “H-naught”), and it denotes the absence of an effect. The Alternative Hypothesis, denoted as $H_1$, represents the presence of the effect we want to establish.

How do we do this?

The course book dives into a quick example about cholesterol medication that I will include to learn the notation. When participants take the medicine we gather data, say the changes to their cholesterol (you would measure the cholesterol levels and compute the changes yourself). Then, we would be interested in discriminating between the null hypothesis, $H_0: \theta=0$, that the medicine did not have an effect, versus $H_1:\theta\ne 0$, that the medicine caused some change in cholesterol.

Because the aim is to lower cholesterol, you might fold the case of increased cholesterol into the null hypothesis and let the alternative hypothesis cover only decreased cholesterol levels, turning this into a one-sided test.

The hypothesis test is defined as a rule that we can use to decide between two actions: reject the null hypothesis, or fail to reject it, based on the sample.

A paired test is the case when we have at least two measurements from the same source, a person in the example above.

A randomized controlled trial removes biases by having two groups: a control group that is monitored without introducing the change, e.g. a new medicine, often using a placebo, and a test group that has the change introduced. Participants are randomly assigned to the groups, and ideally the assignment is concealed from the people administering the tests. Instead of pairing data points as above, this trial considers two independent samples.

When we take measurements we obtain a large dataset of values taken from each sample, $x_1, x_2, \dots, x_n$. You can say we obtain a distribution of values for each group. To perform the hypothesis test to decide whether to accept or reject the null hypothesis, we must construct a test statistic $t=f(x_1,x_2,\dots,x_n) = f(x)$. We use this to derive our decision formally.

The easiest way to discriminate between the null and alternative hypothesis is to compare the (sample) means of the two groups. Per the central limit theorem, the sum of many independent, identically distributed random variables approximately follows a normal distribution. If the number of individuals in each sample is sufficiently large, then the sample means follow an approximate normal distribution.

What is “sufficiently large”? In an old statistics course I took, we were told $n\ge 40$ was a good rule of thumb. The course book drops this down to $n \gt 30$.

We must walk before we run, and stand before we walk. We start with only one group, for which we want to determine whether it is compatible with a given hypothesis. But first, what is a z-score? The z-score, for a statistic $t$, is defined as:

$$z=\frac{t-\mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the population. I don’t think this is a proper introduction as it is rooted in the normal distribution. Basically, we let $z$ follow the standard normal distribution. Then we note that $t=z\sigma+\mu$ follows a normal distribution whose location and spread are determined by $\mu$ and $\sigma$.

Our test $\mathbf{z}$-score, with the group mean written as $\langle \mathbf x \rangle$:

$$\mathbf z= \frac{\langle \mathbf x \rangle - \mu}{\sigma/\sqrt n}$$

It is good to note that the more elements we consider in our sample, the larger $n$ becomes and the smaller the standard error $\sigma/\sqrt n$. If the sample is taken from a normally distributed population, $z$ will follow a normal distribution exactly, without relying on the central limit theorem, because it is then the standard normal distribution.
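As a quick aside, here is a minimal sketch of this one-sample $z$-test in Python. The sample values, the hypothesised mean, and the “known” population standard deviation are all invented for illustration; only the formula for the test statistic comes from above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; the population standard deviation is assumed known.
x = np.array([15.8, 16.2, 14.9, 15.5, 16.1, 15.7, 15.3, 16.4])
mu_0 = 15.0      # population mean under the null hypothesis
sigma = 3.0      # assumed (known) population standard deviation

n = len(x)
z = (x.mean() - mu_0) / (sigma / np.sqrt(n))

# One-sided p-value for H1: mu > mu_0, from the standard normal survival function.
p_value = stats.norm.sf(z)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```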

In most cases, we do not know the population statistics and need to rely on sample statistics, like the sample standard deviation $s$, which has one less degree of freedom than the population.

$$t = {\bar x - \mu \over s/\sqrt n}$$

I changed the sample mean to be $\bar x$ because that is how I was taught and the $\LaTeX$ way for angle brackets is a lot of typing.

Quick recall of sample variance:

$$s^2={1 \over n-1}\sum_{i=1}^n\left(x_i - \bar x\right)^2$$

This is an unbiased estimator because of the correction factor $1/(n-1)$.
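A small sketch of the one-sample $t$-test, with invented data, may help here: the statistic is computed by hand from the formulas above and then cross-checked against SciPy's built-in `ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (values invented for illustration).
x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])
mu_0 = 5.0                        # hypothesised population mean

n = len(x)
s = x.std(ddof=1)                 # sample standard deviation, with the 1/(n-1) correction
t_stat = (x.mean() - mu_0) / (s / np.sqrt(n))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Cross-check against SciPy's built-in one-sample t-test.
res = stats.ttest_1samp(x, popmean=mu_0)
print(t_stat, p_manual)
print(res.statistic, res.pvalue)
```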

We have two random variables in the definition of $t$: the sample mean and the sample variance. Our value $t$ does not follow a normal distribution because of the use of the sample variance instead of the population variance. However, it will follow the Student’s t-distribution, which has a single parameter, the number of degrees of freedom.

The Student’s t-distribution | wiki is a generalization of the standard normal distribution. It’s bell shaped and symmetric around zero. The difference is that $t_\nu$ (the t-distribution) has heavier tails, and the amount of probability given to the tails is controlled by its parameter $\nu$. For $\nu=1$ the t-distribution becomes the standard Cauchy distribution with fat tails, and for $\nu \to \infty$ it becomes the standard normal distribution.

The Student’s t-distribution | Simple.Wiki was developed by William Sealy Gosset in 1908 and named after the pseudonym he used whilst publishing the paper.

More information about the math can be found on the page Student’s t-Distribution | Wolfram MathWorld.

If the variable $X$ follows a normal distribution, perhaps approximately via the central limit theorem, then $t$ will follow the t-distribution. In the sample variance case, the number of degrees of freedom is $\nu=n-1$.

The Student’s t-distribution behaves much like the standard normal distribution but with more pronounced tails, essentially shifting some of the probability towards extreme outcomes. If a variable $X$ follows a Student $t$ distribution with $n$ degrees of freedom, we write $X\sim t(n)$. The PDF of $X$ is given as follows:

$$f_n(t) = {1 \over \sqrt{\pi n}} \frac {\Gamma((n+1)/2)} {\Gamma(n/2)} \left(1+ {t^2 \over n}\right)^{-(n+1)/2}$$

Obviously not the easiest function to work with.
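It is not easy to work with by hand, but it is easy to check numerically. The sketch below types out the PDF above and compares it against `scipy.stats.t.pdf`; the degrees of freedom are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma

def t_pdf(t, n):
    """Student's t PDF with n degrees of freedom, written out from the formula above."""
    return (1 / np.sqrt(np.pi * n)) * gamma((n + 1) / 2) / gamma(n / 2) \
        * (1 + t**2 / n) ** (-(n + 1) / 2)

t_vals = np.linspace(-4, 4, 9)
print(np.allclose(t_pdf(t_vals, 5), stats.t.pdf(t_vals, df=5)))          # matches SciPy
print(np.allclose(stats.t.pdf(t_vals, df=200), stats.norm.pdf(t_vals),
                  atol=1e-3))                                            # ~normal for large df
```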

Now, let us look at the case of two groups. We construct a test statistic similar to the $z$-score:

$$t=\frac{(\bar x_1-\bar x_2) - \delta}{SE\left[ \bar x_1 - \bar x_2 \right]}$$

The numerator contains the difference in sample means, and the denominator its standard error. The constant $\delta$ is the known difference between the population means under the null hypothesis, and $SE[\bar x_1 - \bar x_2]$ is an estimate of the standard error of the difference between the two sample means.

In most practical applications we would set $\delta = 0$, because we would not know that information unless we had further or external knowledge about the population.

From the propagation of uncertainties, we can recall that the standard error of the difference is given by:

$$SE\left[ \bar x_1 - \bar x_2 \right] = \sqrt{V[\bar x_1] + V[\bar x_2]}$$

Therefore, if you somehow know the variance of the population, the equation becomes:

$$t=\frac{\bar x_1 - \bar x_2}{ \sqrt{{\sigma^2_1 \over n_1}+{\sigma^2_2 \over n_2}} }$$

However, it would be a bit odd (in my opinion) to have the population variances but not the population means. Therefore, we typically estimate them from the sample variances, giving:

$$t=\frac{\bar x_1 - \bar x_2}{ \sqrt{{s^2_1 \over n_1}+{s^2_2 \over n_2}} }$$

This follows a $t$-distribution (proven by Hogg et al., theorem 3.6.1), but the degrees of freedom (d.o.f.) can be difficult to determine when neither the sample variances nor the numbers of elements in the samples are the same.

The value of the number of degrees of freedom is between the limits:

$$\min(n_1-1, n_2-1) \le \mathrm{d.o.f.} \le n_1+n_2-2$$

If $n \approx 1000$, you can see the range would be quite large.

A reasonable approximation is the harmonic mean:

$$\mathrm{d.o.f.} = \nu= \frac{2}{n_1^{-1} + n_2^{-1}}$$

If we assume the variances in the two groups are the same, this is called the Two-Sample Student’s t-Test. If we allow the variances to be different from each other, it is called the Welch’s Test.
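As a sketch of the difference in practice, SciPy exposes both variants through the `equal_var` flag of `ttest_ind`; the two groups below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Two hypothetical independent groups (e.g. control vs. treatment); values invented.
group1 = np.array([5.2, 4.9, 5.6, 5.1, 4.8, 5.4, 5.0, 5.3])
group2 = np.array([4.6, 4.9, 4.4, 4.8, 4.5, 4.7, 4.3, 5.0, 4.6])

# Two-sample Student's t-test: assumes equal variances in both groups.
t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)

# Welch's test: allows the two variances to differ.
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
```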

For Paired Samples, the test can be expressed as a special case of the one-sample t-test. Instead of viewing start and end as two different groups, we take the difference and work with that as just one group. Let $d_i=x_i-y_i$ be the difference between the two paired measurements. The test statistic becomes:

$$t = \frac{\bar d - \mu}{s_d/\sqrt n}$$

The course book writes the denominator with a division sign; who uses the division sign anymore?

We know that $\bar d$ is the sample mean of the difference and that $s_d$ is the sample standard deviation of the differences. The population mean, $\mu$, is the hypothesis we want to test. And $n$ is the number of paired measurements, giving $n-1$ degrees of freedom.
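A short sketch of the paired case, with invented before/after values: computing the statistic from the differences by hand agrees with SciPy's `ttest_rel`.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements on the same individuals (values invented).
before = np.array([210.0, 195.0, 230.0, 205.0, 220.0, 240.0, 200.0])
after  = np.array([202.0, 196.0, 219.0, 200.0, 212.0, 228.0, 195.0])

d = before - after                   # work with the differences as a single group
n = len(d)
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # mu = 0 under the null of no change
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# The built-in paired test gives the same result.
res = stats.ttest_rel(before, after)
print(t_stat, p_manual)
print(res.statistic, res.pvalue)
```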

There is some confusion around number of samples and degrees of freedom, especially in the two-case scenario.

Now, given that the null hypothesis is true, we can use the distribution of this test statistic to decide whether the observed value of the test statistic is unlikely or not. The observed value of the test statistic comes from replacing all the estimators by their observed values:

$$t_{\text{observed}} = \frac{\bar x_1 - \bar x_2}{ \sqrt{ {s_1^2 \over n_1} + {s_2^2 \over n_2} } }$$

To determine if this test statistic is statistically significant, we need to see where it falls in the distribution. And to make a decision, we need to set a cut-off value, which we will denote $t_c$.

The book shows a graph for the example of the standard normal distribution. The cutoff point is at the tail(s) of the distribution, depending on your null hypothesis. If we have a null hypothesis, and our test statistic lands in the high probability region of the distribution, the main bell, then we fail to reject the null and accept there was no change. However, if the test statistic lands in the tail beyond the cut-off, we can more safely reject the null hypothesis.

A note on the distribution. When we take a “large enough” sample, we generally just assume the sample mean follows a normal distribution. However, when we perform the analysis, we need to check if the distribution actually follows a normal distribution. The t-test works well for symmetric distributions, like the normal distribution. It does not work well for asymmetric distributions.

This means that, in practice, if we have to deal with a situation where the assumption that the underlying distribution is normal is not fulfilled, we have to take this into consideration as a source of systematic uncertainty and quantify how big the effect of this non-normality is on our final result.

7.1 - Type I and Type II Errors

p. 187

From above, to decide between the hypotheses we need to construct a function of the measurements that maps the measured values into a number:

$$t=f(x_1,x_2,\dots,x_n) = f(x)$$

This function can be anything from simple to a sophisticated machine learning algorithm. Using this function, we can show the distribution of this test statistic for the null and the alternative hypothesis, which we will discuss below.

To determine if we should reject the null hypothesis, we need to define a critical value $t_c$ such that, if the observed test statistic is at least as extreme as $t_c$, we reject the null hypothesis; otherwise we fail to reject it.

Unfortunately, in most realistic cases we cannot distinguish between the null and alternative hypothesis perfectly. The distributions of the test statistic for the two cases will overlap. The course book shows a graph of two bell curves, which I think is an interesting perspective. Basically, the test statistic has one distribution under the null hypothesis and another under the alternative, and their tails overlap over a meaningful portion of the curves.

This leads to two types of errors: a Type I error, where we reject the null hypothesis even though it is actually true, and a Type II error, where we fail to reject the null hypothesis even though it is actually false.

The course book also provides a nice table…

In more mathematical terms:

\begin{align*} \alpha &= P(\text{reject } H_0|H_0 \text{ is true})\\ \beta &= P(\text{fail to reject }H_0| H_0\text{ is false}) \end{align*}

The cut-off value $t_c$ will affect these probabilities. In many practical applications, $\alpha$ is chosen in advance, which determines the cut-off value $t_c$. We would decide ahead of time what kind of Type I error we are willing to tolerate. We do this before looking at the data because once we look at the data our tolerance becomes biased. Consider an edge case: if we look at the data and believe we should reject $H_0$, we might adjust the probabilities to make it so.

Power of the test

Our probabilities $\alpha$ and $\beta$ have an inverse relationship, which should be clear.

I do love this definition… The rejection region (RR) is the set of values of the statistic $t$ that are at least as extreme as the cut-off value. It was hard at first for me to understand, but these values are at least as extreme. By “extreme”, we mean that they are not near the null hypothesis; they are extreme values living in the tail end of the probability distribution. Anything at least as extreme, or more extreme, falls into the rejection region and causes us to reject the null hypothesis.

The chosen value of $\alpha$ is a naive metric for evaluating the performance of the hypothesis test. A more useful metric is the “power of the test”.

The Power of the Test is a measure of the quality of a test with respect to the probability that the testing process will detect an effect. Informally, it is the probability that the hypothesis test will yield a decision to reject $H_0$:

$$\text{Power} = P(\text{reject } H_0|H_1 \text{ is true})$$

and describes the probability to detect an effect if it is indeed there.

Ok, let’s have a look considering a test about the parameter $\theta$, our favourite parameter. Suppose we have $H_0: \theta=\theta_0$. The alternative is $H_1:\theta \in W \subseteq \mathbb R$. Given $\theta_1 \in W$, the test statistic $T$, and the rejection region, the power of the test when the actual parameter is $\theta=\theta_1$ is:

\begin{align*} \text{Power}(\theta_1) &= P(T\in RR\ |\ \theta=\theta_1)\\ &= P(\text{rejecting } H_0 \ |\ \theta=\theta_1) \end{align*}

The probability that our test statistic $T$ is in the rejection region, given that the parameter in fact takes the alternative value $\theta_1$, is equal to the probability of rejecting the null hypothesis.

Now, we go to our Type II error: $\beta(\theta_1)$ is the probability of failing to reject $H_0$ when the true value of the parameter is $\theta=\theta_1$:

$$\text{Power}(\theta_1) = 1 - \beta(\theta_1)$$

The book then flies into an example with $\sigma=3$, $\mu=15$, and a significance level of $\alpha=0.05$.

The Hypotheses tests are:

\begin{align*} H_0&:\mu=15\\ &\text{vs}\\ H_1&: \mu > 15 \end{align*}

Note, sometimes the alternative is $H_1:\mu \ne c$ and so we would consider both tails of the distribution. That is like splitting $\alpha$ in half and applying each half to each tail of the distribution (you can see why we like symmetry). In the above case, since we are only concerned with the greater-than side, that is the right side of the distribution.

The one-sample Student’s t-test is given by:

$$t(x|\mu=15) = \frac{\bar x - 15}{\sigma/\sqrt N}$$

That little formula is based on the sample mean and the population standard deviation. If $N$ is sufficiently large, then by the central limit theorem the distribution of $t$ will follow a normal distribution, given that we do know the population standard deviation. If we only have the sample standard deviation (the typical case), then we fall back on the Student’s t-distribution with $N-1$ degrees of freedom.

With our chosen cut-off, we have $P(Z \gt z_c) = 0.05$. Our $z$-score will only follow the normal distribution in this example because we are pretending we know the population standard deviation.

Now, I haven’t covered how to look up values in a normal distribution table, but it’s quite simple. You can find tables online, e.g. Standard Normal Distribution Table | MathIsFun.com, which has a cool slider. Because we know our alpha, we look in the table for its value. This one is actually a more difficult table because it only covers the right half, but the graph is symmetric. As such, we are looking for $0.50 - \alpha = 0.45$. It lands right between $z_c=[1.64, 1.65]$, so we split the difference linearly, $z_c=1.645$. This gives us $P(Z \gt 1.645) = 0.05$.

\begin{align*} Z &\gt 1.645\\ \frac{\bar x - 15}{\sigma/\sqrt N} &\gt 1.645\\ \bar x &\gt 1.645(\sigma/\sqrt N) + 15\\ \bar x &\gt 1.645(3/\sqrt N) + 15 \end{align*}

We don’t have $N$, but if we did, it’s just a matter of comparison.
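Instead of a printed table, the inverse CDF gives the same cut-off directly. The sketch below also evaluates the rejection threshold for the sample mean using $N=36$, the sample size the book uses a bit further down; that choice is only for illustration here.

```python
import numpy as np
from scipy import stats

alpha = 0.05
z_c = stats.norm.ppf(1 - alpha)          # inverse CDF of the standard normal
print(z_c)                               # ~1.6449, matching the 1.645 read off the table

# Rejection threshold for the sample mean with sigma = 3 and an assumed N = 36.
N, mu_0, sigma = 36, 15.0, 3.0
threshold = mu_0 + z_c * sigma / np.sqrt(N)
print(threshold)                         # reject H0 if the observed sample mean exceeds this
```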

Suppose the true value of the population mean is $\mu=16$. Let’s look at $\beta(16)$ and the power of the test at 16 too. Remember, our null hypothesis assumes $\mu = 15$.

\begin{align*} \beta(16) &= P\left(\frac{\bar X - 16}{3/\sqrt N} \le \frac{ 15+1.645\cdot 3/\sqrt N-16 }{3/\sqrt N}\right)\\ &= P\left( Z \le 1.645 - \sqrt N/3 \right) \end{align*}

And the power of the test:

\begin{align*} \text{power}(16) &= 1-P\left(Z \le 1.645 - \sqrt N /3\right)\\ &= P\left( Z\gt 1.645 - \sqrt N /3\right) \end{align*}

The probability of the Type II error decreases as the sample size increases, and the power of the test increases with the sample size.

The book continues to evaluate the case $N=36$. The probability of a Type II error is about 36%. That is a large chance of failing to reject the null hypothesis when the real parameter is just one unit larger than assumed. If that is the case, and that one unit would be significant, we would need more data to decrease that probability.
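To reproduce that figure, here is a minimal sketch of the calculation for $N=36$, following the expressions for $\beta(16)$ and the power derived above.

```python
import numpy as np
from scipy import stats

sigma, mu_0, mu_true, N = 3.0, 15.0, 16.0, 36
z_c = stats.norm.ppf(0.95)                                   # 1.645 for alpha = 0.05

# beta(16) = P(Z <= 1.645 - sqrt(N) * (16 - 15) / 3) when the true mean is 16.
beta = stats.norm.cdf(z_c - np.sqrt(N) * (mu_true - mu_0) / sigma)
power = 1 - beta
print(f"beta(16) = {beta:.3f}, power(16) = {power:.3f}")     # roughly 0.36 and 0.64
```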

Note that the probability of a type I error and the power of a test have an inverse relationship. We want the power to be high and the probability of committing a type I error to be low.

The Neyman-Pearson Lemma

p. 193

We can approach hypothesis testing from another angle. If we look at the distributions of the test statistic for the null and alternative hypothesis, we might be tempted to choose the cut-off value $t_c$ where they intersect. This choice is not optimal if the a priori probabilities of the null and alternative hypothesis are very different.

Using the Bayesian priors for the null hypothesis $P(H_0)$ and the alternative hypothesis $P(H_1)$, we can define the cut-off as the biggest $t$ such that:

$$P(H_0)f_0(t) \ge P(H_1)f_1(t)$$

Here $f_0(t)$ and $f_1(t)$ are the distributions of the test statistic under the null and alternative hypothesis, respectively. The best choice is determined by the Neyman-Pearson Theorem.

Per the Introduction to Mathematical Statistics text, by Hogg et al., p. 472

Theorem 8.1.1 - Neyman-Pearson Theorem: Let $X_1, X_2,\dots, X_n$, where $n$ is a fixed positive integer, denote a random sample from a distribution that has pdf or pmf $f(x;\theta)$. Let $\theta$ be the unknown distribution parameter(s). The likelihood of $X_1, X_2,\dots, X_n$ is

$$L(\theta; \boldsymbol x) = \prod_{i=1}^n f(x_i;\theta), \quad \text{for } \boldsymbol x'=(x_1,\dots,x_n)$$

Let $\theta'$ and $\theta''$ be distinct fixed values of $\theta$ so that $\Omega = \{ \theta \ : \ \theta=\theta',\theta'' \}$. We will also let $k$ be some positive number. And we let $C$ be a subset of the sample space such that

$$\begin{array}{clr} (a) & \dfrac{L(\theta';\boldsymbol x)}{L(\theta'';\boldsymbol x)}\le k, & \text{for each point } \boldsymbol x\in C\\ (b) & \dfrac{L(\theta';\boldsymbol x)}{L(\theta'';\boldsymbol x)}\ge k, & \text{for each point } \boldsymbol x\in C^c\\ (c) & \alpha = P_{H_0}[\boldsymbol X\in C] & \\ \end{array}$$

Ok… Then $C$ is the best critical region of size $\alpha$ for testing the simple hypothesis $H_0:\theta=\theta'$ against the alternative simple hypothesis $H_1:\theta=\theta''$.

\Box

The text book continues on with a proof.

Our course book states, the most powerful test has a rejection region given by:

$$\frac{\mathcal L(\theta_0)}{\mathcal L(\theta_1)} \le k$$

where $k$ is chosen so that the probability of the Type I error is $\alpha$. I do like the latter notation as it’s not giving the impression that there’s a derivative anywhere.

The course book then dives into an example with a Poisson distribution. Poisson is a decent example because the exponents make easy work of the product of pdfs.
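I won’t reproduce the book’s numbers, but a quick sketch shows why Poisson is convenient. Testing $H_0:\lambda=\lambda_0$ against $H_1:\lambda=\lambda_1$ with $\lambda_1 > \lambda_0$, the likelihood ratio collapses onto the sample sum:

$$\frac{L(\lambda_0;\boldsymbol x)}{L(\lambda_1;\boldsymbol x)} = \frac{\prod_{i=1}^n e^{-\lambda_0}\lambda_0^{x_i}/x_i!}{\prod_{i=1}^n e^{-\lambda_1}\lambda_1^{x_i}/x_i!} = e^{n(\lambda_1-\lambda_0)}\left(\frac{\lambda_0}{\lambda_1}\right)^{\sum_i x_i}$$

Since $\lambda_0/\lambda_1 < 1$, this ratio is small exactly when $\sum_i x_i$ is large, so the most powerful test rejects $H_0$ when the total count exceeds a threshold chosen to give the desired $\alpha$.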

7.2 - P-Values

p. 196

We are going to discuss in more detail how we can quantify the significance of the result of the hypothesis test. A good question is, “As we have observed a difference between the groups and, say, rejected the null hypothesis and accepted the alternative hypothesis, how sure are we that this is not due to random chance?”

If we repeat the same experiment many times using different samples, how sure are we that we would always observe this difference between the groups?

Questions like these are important in practice. Take the field of medicine, for example. You want to be very sure that a drug is both safe and effective. We need a metric that allows us to make a statement that an observed outcome, for example the difference of the sample means of two groups, is larger than we would expect from random variation and chance alone.

For a result to be statistically significant, we require the expected variation due to random fluctuation to be smaller than what we see. This is expressed by the $p$-value, defined by the American Statistical Association as:

Informally, a $p$-value is the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value.

For a test statistic $T$, the $p$-value is the smallest value of $\alpha$ for which the observed data suggest that the null hypothesis should be rejected. So, the smaller the $p$-value, the more unlikely it is that the data come from the distribution specified by the null hypothesis.

We have already seen the use of the threshold $p \lt 0.05$, which is practical. We make two important observations:

The $p$-value is not easily understood though. The previous author continues:

While the $p$-value can be a useful statistical measure, it is commonly misused and misinterpreted. This has led to some scientific journals discouraging the use of $p$-values, and some scientists and statisticians recommending their abandonment, with some arguments essentially unchanged since $p$-values were first introduced.

Now, how do we calculate the $p$-value? We assume we have two hypotheses, our null and alternative. The $p$-value is the probability of observing a value of our test statistic $T$ that is at least as extreme as the observed value, assuming $H_0$ is true at first.

$$p=P(T\gt T_o | H_0)$$

I am letting $T_o$ be the observed test statistic, as I don’t want to write the word “observed” in the formula.

The $p$-value is based on a random sample. Therefore it is itself a random variable distributed on the interval $[0,1]$.

If the null hypothesis is true, the distribution of $p$-values is uniform; each value is equally likely. That is an interesting take.
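This uniformity is easy to see in a quick simulation. The sketch below repeatedly draws samples for which the null hypothesis is true and collects the resulting $p$-values; all parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many experiments in which the null hypothesis is true:
# samples of size 30 drawn from N(mu_0, 1), each tested against mu_0.
mu_0, n, n_experiments = 0.0, 30, 10_000
p_values = np.empty(n_experiments)
for i in range(n_experiments):
    sample = rng.normal(mu_0, 1.0, size=n)
    p_values[i] = stats.ttest_1samp(sample, popmean=mu_0).pvalue

# Under H0 the p-values are (approximately) uniform on [0, 1]:
# about 5% land below 0.05, about 10% below 0.10, and so on.
print(np.mean(p_values < 0.05), np.mean(p_values < 0.10))
```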

Summary of what the $p$-value is:

What the $p$-value is not:

7.3 - Multiple Hypothesis Testing

p. 201

Reporting a “significant” result based on the pp-value can be very misleading if we do not describe the way we test the hypothesis. We must also report all the tests that did not yield a significant result.

The course book gives a jelly bean example… Let’s say it is observed that people who eat jellybeans have more acne. So we test each colour independently and notice that the green ones cross the threshold we set of $p\lt 0.05$. What does this mean?

Well, if we publish that we tested 100 different colours and we found the green ones to be significant, we must remember that even at $p \lt 0.05$ we are accepting that 1 out of 20 results will cross the threshold by chance alone.

If we perform multiple hypothesis tests, then we need a mechanism that can prevent us from obtaining statistically significant results by testing several hypotheses and cherry-picking the ones that give us promising results.

The Familywise Error Rate (FWE) aims at reducing the overall probability of rejecting true null hypotheses in multiple hypothesis testing.

The Family-wise error rate | Wiki is the probability of making one or more false discoveries (I like this definition), or Type I errors when performing multiple hypothesis tests.

There’s also the discussion Familywise Error Rate | StatisticsHowTo.com, which has a very similar definition: the probability of making at least one Type I error.

If we have 100 tests, like colours of jellybeans, each performed at $\alpha=0.05$, then each one has a 95% chance of being correctly assessed. If they are independent, then of the 100 there is a $0.95^{100}=0.00592$ chance they are all assessed correctly, or about a 99.4% chance that at least one true null hypothesis is rejected by chance.

This begs the question, what should we set α\alpha to then? We would like the overall value to be 0.05, so we rearrange the equation as follows:

\begin{align*} 1-(1-\alpha)^{100} &\lt 0.05\\ (1-\alpha)^{100} &\gt 0.95\\ (1-\alpha) &\gt 0.95^{1/100}\\ \alpha &\lt 1 - 0.95^{1/100} \approx 0.0005 \end{align*}

Note, the above result is slightly approximate because $0.95^{1/100}$ has a lot of decimal places.
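These numbers are quick to verify; the sketch below computes the family-wise error rate for 100 independent tests and the per-test level needed to bring it back to 0.05, alongside the simpler $\alpha/m$ choice discussed next.

```python
# Family-wise error rate for m = 100 independent tests, each run at alpha = 0.05.
m, alpha = 100, 0.05
fwer = 1 - (1 - alpha) ** m
print(fwer)                     # ~0.994: almost certainly at least one false rejection

# Per-test level needed to keep the family-wise rate at 0.05 (the exact value),
# compared with the simpler Bonferroni-style choice alpha / m.
alpha_exact = 1 - (1 - alpha) ** (1 / m)
print(alpha_exact, alpha / m)   # ~0.000513 vs 0.0005
```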

The typical method to control the Familywise Error Rate (FWE) is the Bonferroni method. This method rejects a specific null hypothesis if its corresponding $p$-value is less than $\alpha / m$, where $m$ is the number of hypotheses. This works because, when $p$ is small:

$$1-(1-p)^{1/m} \approx p/m$$

An issue with this type of control is that the multiple testing procedure might result in a low power. The power is the ability to reject a false null hypothesis. It can become too restrictive, so we look for an alternative measure.

When we have many hypothesis tests, it might make sense to allow a small proportion of true null hypotheses to be rejected. This means accepting a small fraction of false discoveries, as long as we are able to quantify the level and can introduce a control method that allows us to specify the level at which we accept these false discoveries. This is so the power is not too low.

We will define a measure that controls the proportion of true null hypotheses rejected. Let’s cover some notation:

- $m$: the total number of hypotheses tested
- $D$: the number of discoveries, i.e. tests in which we reject the null hypothesis
- $N$: the number of tests in which we do not reject the null hypothesis
- $TD$, $FD$: the numbers of true and false discoveries among the $D$ rejections

Since we do not know whether the null hypothesis is true or not during testing, the quantities prefixed with $T$ or $F$ are technically not accessible. We can measure the number of experiments we do, the number of discoveries $D$ that exceed the threshold, and the number $N$ where we do not exceed the threshold.

In a single hypothesis test, we want to have a small Type I error (rejecting a true null hypothesis). That is, we want to control the probability of a false positive (detecting an effect when there is none). The False Discovery Rate (FDR) is the corresponding quantity for multiple hypothesis test, which is the expected proportion of the false positive with respect to all positives:

$$FDR = E\left({FD \over D}\right) = E\left( {FD \over FD+TD} \right)$$

We compute the expected value because the discovery rate is a random variable. We have unfortunately defined our formula with values that are not observable and therefore it is not computable.

For uncorrelated or positively correlated tests, we can use the following approach!

For each test, we define the null hypotheses, $H_1, H_2, \dots, H_m$. Each hypothesis has an associated $p$-value. Now, order the hypotheses in ascending order by their $p$-values:

$$p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$$

Now choose the largest value $k$ such that

$$p_{(k)} \le {k \over m} \cdot \alpha$$

Then the hypotheses $H_{(1)},\dots,H_{(k)}$ are rejected and we say that we control the FDR at level $\alpha$. By taking the largest such $k$, the number of rejections is maximised.

In case that the hypotheses are correlated between the tests, we modify the procedure such that:

$$p_{(k)} \le \frac{k}{m\cdot c(m)}\cdot \alpha$$

Where

$$c(m) = \begin{cases} \sum_{i=1}^m 1/i & \text{if negatively correlated} \\ 1 & \text{otherwise} \end{cases}$$
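Here is a small sketch of the uncorrelated/positively correlated case (i.e. $c(m)=1$), written out as a step-up procedure; the ten $p$-values at the bottom are invented for illustration.

```python
import numpy as np

def fdr_step_up(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses, controlling the FDR at level alpha.

    Implements the procedure described above for independent or positively
    correlated tests, i.e. with c(m) = 1.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                    # indices that sort the p-values ascending
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])     # largest k with p_(k) <= (k/m) * alpha
        reject[order[:k + 1]] = True         # reject H_(1), ..., H_(k)
    return reject

# Hypothetical p-values from ten independent tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print(fdr_step_up(p_vals, alpha=0.05))       # only the two smallest survive here
```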

Knowledge Check

Q1: We aim at testing the hypothesis that a new sample shows a greater mean for a variable which we assume follows a normal distribution of known average and variance. Which test is appropriate, considering that we do not know the size of the sample?

a.) Student’s t-test
b.) $\chi^2$-test
c.) $Z$-test
d.) $F$-test

The Chi-Squared test | Wiki was not, I think, mentioned in this unit. It is apparently used when the sample sizes are large. There’s a little more to it, but because it sounds like it requires the sample size, we cannot use it.

The F-Test | Wiki also was not yet discussed. The test statistic is used to determine if the data has an F-distribution under the null hypothesis. Looking through the maths, it is contingent on degrees of freedom that depend on the sample size.

The $t$-test also depends on the sample size. Therefore, because it sounds like we know the population statistics and we don’t know the sample size, we should use the $Z$-test, (c).

Q2: What is a Type I error, given the null and alternative hypotheses $H_0$ and $H_a$?

a.) Cases when the statistics calculation makes our test conclude that $H_0$ is true while $H_a$ is actually true.
b.) Cases when the statistics calculation makes our test conclude that $H_a$ is true while $H_0$ is actually true.
c.) Cases of any of the two errors.
d.) Measured with the probability $p$, the power of the test.

The definition in the book states a Type I error is when the sample appears extreme, significant enough to incorrectly reject $H_0$ and accept $H_a$. I think it would be (b).

Q3: What kind of relationship does the power of the test have with the $\alpha$ value? That is, as $\alpha$ increases, how does the power of the test behave?

a.) Increases
b.) Varies unpredictably
c.) Stays equal
d.) Decreases

I believe the course book stated that the power has an inverse relationship with $\alpha$. Yes, on p. 192 it states, ”… it is important to consider that the probability of a type I error and the power of a test have an inverse relationship.” So if $\alpha$ increases, the power decreases (d).

The power of a statistical test is the probability of correctly rejecting the null hypothesis when it is false. In other words, it is the probability of detecting a real effect when it is present. The power of a test is influenced by several factors, including the significance level, the effect size, the sample size, and the variability in the data.

A test with a high power is more likely to correctly detect a real effect. Conversely, a test with a low power is more likely to miss a real effect. When designing a study, it is important to choose a significance level, effect size, and sample size that will provide a reasonable level of power.

Here is a table that summarizes the relationship between the significance level, effect size, and power of a test:

| Significance level (α) | Effect size | Power |
| --- | --- | --- |
| 0.05 | Small | 0.20 |
| 0.05 | Medium | 0.80 |
| 0.01 | Small | 0.05 |
| 0.01 | Medium | 0.60 |

As you can see, the power of a test increases as the significance level increases and as the effect size increases. However, it is important to note that the power of a test can never be 100%, even if the effect size is large. This is because there is always some chance that the observed difference between the two groups could be due to random chance.

The correct answer is apparently (a), increases.

The power of a statistical test and the significance level (often denoted by alpha, α) are closely linked through the Type II error rate. The power of a test is the probability that it correctly rejects a false null hypothesis. In other words, it is the ability of the test to detect a true effect or difference when it exists.

The significance level (alpha, α) is the probability of incorrectly rejecting a true null hypothesis. It is the probability of making a Type I error, which is the error of rejecting a null hypothesis that is actually true.

The relationship can be summarized as follows:

  1. Power (1 - β): The power of a test increases as the probability of making a Type II error (β) decreases. Power is influenced by factors such as the effect size, sample size, and the variability in the data.

  2. Significance Level (α): The significance level is the probability of making a Type I error. It is typically set by the researcher before conducting the test and is denoted by α.

  3. Relationship: the significance level and the power move in the same direction, while α and β trade off against each other. As you decrease the significance level (α), you increase the probability of making a Type II error (β), which in turn decreases the power of the test. Conversely, as you increase the significance level, you decrease the probability of a Type II error, thereby increasing the power of the test.

In summary, there is a trade-off between the risk of Type I errors (α) and Type II errors (β). Researchers must carefully choose the significance level based on the desired balance between these two types of errors, taking into consideration the specific context and consequences of each type of error in their study.

OK, so I think I thought of alpha incorrectly. It’s almost like we always view it backwards: if we drop a 95% confidence interval down to just 90%, then α grows from 0.05 to 0.10, the rejection region gets larger, and the probability of correctly rejecting a false null hypothesis (the power) goes up…

Q4: An $F$-test tests that:

a.) The variance of the second sample is left stable.
b.) The probability that the variances differ is high.
c.) How good is an estimate of the variance.
d.) The variances of two samples differ significantly, provided they are (at least approximately) normally distributed.

Section 3.6.2 of Introduction to Mathematical Statistics specifically covers the $F$-distribution, starting on page 212. It begins with two chi-square random variables, divides each by its degrees of freedom, and then divides those by each other. That is,

$$F={U/r_1 \over V / r_2}$$

The F-Distribution | Wiki arises frequently as the null distribution of a test statistic, notably in the analysis of variance (ANOVA), and other F-Tests.

The F-distribution is a continuous probability distribution that is used in hypothesis tests to compare two or more variances. It is named after Sir Ronald Fisher, an English statistician who developed it in the early 20th century.

The F-distribution is typically used in two types of hypothesis tests: one-way ANOVA, which compares the means of several groups, and tests of the equality of two variances.

Here is a table of some of the common uses of the F-distribution:

| Hypothesis test | Null hypothesis | Test statistic | Critical value | Interpretation |
| --- | --- | --- | --- | --- |
| One-way ANOVA | All groups have the same mean | F = (between-groups variance)/(within-groups variance) | F > critical value | Reject null hypothesis if means are significantly different |
| Testing equality of variances | All populations have the same variance | F = (larger sample variance)/(smaller sample variance) | F > critical value | Reject null hypothesis if variances are significantly different |

The F-distribution is a versatile tool that can be used to test a variety of hypotheses about variances. It is a widely used distribution in statistical analysis.

The F-distribution is a fairly robust test, meaning that it can still be used to compare variances even if the samples are not strictly normally distributed. However, if the samples are heavily skewed or have outliers, then the F-test may not be as accurate. In these cases, it may be better to use a non-parametric test, such as the Kruskal-Wallis test or the Welch’s ANOVA.

Here is a table summarizing the conditions under which the F-test is valid:

| Condition | Requirement |
| --- | --- |
| Independence | The samples are independent of each other. |
| Equal variances | The populations from which the samples were drawn have equal variances. |
| Normality | The samples are approximately normally distributed. |

If the samples are not normally distributed, but they are not severely skewed or have outliers, then the F-test may still be a reasonable choice. However, it is important to be aware of the limitations of the test and to interpret the results cautiously. If the samples are severely skewed or have outliers, then it is best to use a non-parametric test.
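For completeness, here is a minimal sketch of option (d), the equality-of-variances use: two invented, approximately normal samples, the variance-ratio statistic, and a p-value from the F-distribution.

```python
import numpy as np
from scipy import stats

# Two hypothetical, approximately normal samples (values generated for illustration).
rng = np.random.default_rng(1)
sample1 = rng.normal(10.0, 2.0, size=25)
sample2 = rng.normal(10.0, 3.0, size=30)

s1_sq, s2_sq = sample1.var(ddof=1), sample2.var(ddof=1)

# Put the larger sample variance on top so F >= 1, with matching degrees of freedom.
if s1_sq >= s2_sq:
    F, df1, df2 = s1_sq / s2_sq, len(sample1) - 1, len(sample2) - 1
else:
    F, df1, df2 = s2_sq / s1_sq, len(sample2) - 1, len(sample1) - 1

# Two-sided p-value: double the upper-tail probability of the F-distribution.
p_value = min(2 * stats.f.sf(F, df1, df2), 1.0)
print(f"F = {F:.3f}, p = {p_value:.3f}")
```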

Let’s go with (d) for the win.

Q5: When applying the Student’s $t$-test on a normal variable, what are the parts we must calculate?

a.) The sample mean and size.
b.) The sample size and theoretical variance.
c.) The accuracy and sample size.
d.) The sample mean, variance, and size, as well as the theoretical mean.

This question does not involve the second case in the course book where we compare 2 different groups. The test statistic is

$$t={\bar x - \mu \over s/\sqrt n}$$

So we have everything in (d).