title: Advanced Statistics
subtitle: DLMDSAS01
authors: Prof. Dr. Unknown
publisher: IU International University of Applied Sciences
date: 2023
Our learning objectives are as follows:
p. 68
A deterministic system is like a series of events where we can calculate the outcome with great certainty. Think of trajectories in physics: a catapult launching a melon could be a deterministic system.
However, as with adding aerodynamics to the melon example, even a simple system can become complex quickly. A coin toss could be determined given the force of the flip, the placement of the force, the duration the force is exerted on the coin, the height of the flip, etc… but with the ever-growing number of variables that need to be known, we end up with more of a random system.
More examples include how many customers enter a shop in an hour, or when a component will fail. These could be determined given enough information, but that information is nearly impossible to obtain. So, we gather what information we can about the system, studying events and outcomes, to find if they follow a pattern. This means the occurrence of events is not fully random.
We mean that, given some information, we can match the system to a probability distribution and then provide a suite of outcomes and associated probabilities. We cannot provide exact numbers, like exactly when a component will fail, but we can give the probability of something, like a 95% chance of failure within the next 100 hours.
We can describe the distributions in terms of their descriptive statistics such as expected value and variance.
A formal description is: we say that a given system or specific events are described by a random variable that follows a specific probability distribution, and the values of the random variable represent measurable outcomes.
p. 68
The book gives a few examples, including A/B testing. A Bernoulli trial is basically an event that has 2 outcomes. For the A/B testing example, a user visits a web page and is served one of two versions. It is then assessed later which version is more successful at generating revenue.
This makes probabilities quite simple. If outcome $A$ has probability $p$, then outcome $\bar{A}$ has probability $1 - p$.
When a random variable $X$ describes a Bernoulli trial, we generally write $X \sim \text{Bern}(p)$.
This is not in the course book. At some point before this, probably in a discussion of set theory in an introductory probability course (usually taught at the beginning), one learns counting rules. Check out the following book perhaps.
title: Introduction to Mathematical Statistics
edition: 8
authors:
- Robert V. Hogg
- Joseph W. McKean
- Allen T. Craig
date: 2019
publisher: Pearson
To set the stage, we let $A$ be a set with $n$ elements. We are interested in $k$-tuples whose components are elements of $A$.
A permutation is each such $k$-tuple. It has many notations; the book uses the following:

$$P^n_k = n(n-1)\cdots(n-k+1) = \frac{n!}{(n-k)!}$$
The book covers the well-known Birthday Problem, where you calculate the probability that, given the number of people in a room, at least 2 people have the same birthday.
I now want to also look at this birthday problem. We start with the counting… each person that enters the room has 365 different birthday possibilities, excluding leap years for simplicity. The first 2 people in a room can have sets like (1 Jan, 1 Jan), or (1 Jan, 2 Jan), etc… Because we are allowing people to have the same birthdays, 2 people form a set of $365^2$ outcomes. A group of $k$ people will have a total of $365^k$ outcomes.
Because we say “at least”, a very important phrase in probability, you should instantly consider the complement of the situation. Why? Because, if it is not true that everyone has a unique birthday, then it is true that at least one pair of people have duplicate birthdays, or more. This saves us the hassle of figuring out the probability of exactly 2 people matching, or 3 people, or a pair and a triple, etc…
If we have 4 people in a room, and we want to find the number of possible outcomes where everyone has a unique birthday… permutations, we start with:

$$P^{365}_k = \frac{365!}{(365-k)!}$$
If there are 4 people in the room, we have then:

$$\frac{365!}{(365-4)!} = 365 \cdot 364 \cdot 363 \cdot 362 = 17{,}458{,}601{,}160$$
Different permutations, which is also $P^{365}_4$.
The formula for the probability that at least two of $k$ people share a birthday thus becomes:

$$P(\text{shared}) = 1 - \frac{365!}{(365-k)!\,365^k}$$
So, how many people in a room for a 50% chance?
It is here that it becomes obvious that it is not easy to solve for $k$. In fact, Understanding the Birthday Paradox | BetterExplained approximates with an exponential function to find the answer of 23.
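We can also just compute it numerically. A quick sketch of the exact calculation (the function name and the threshold loop are mine, not from the course book):

```python
from math import prod

def p_shared_birthday(k: int) -> float:
    """Probability that at least two of k people share a birthday (365-day year)."""
    # P(all unique) = 365/365 * 364/365 * ... * (365-k+1)/365
    p_unique = prod((365 - i) / 365 for i in range(k))
    return 1 - p_unique

# Smallest group size with at least a 50% chance of a shared birthday.
k = 1
while p_shared_birthday(k) < 0.5:
    k += 1
print(k, p_shared_birthday(k))  # 23, ~0.507
```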
A combination is:

$$C^n_k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
Basically, the outcomes counted by permutations are considered distinct from each other where they are not distinct in a combination. A hand of cards might be a good example. If you draw 3 cards and want Ace, King, Queen, in that order, you are more along the lines of a permutation. However, if you just want Ace, King, Queen in any order, that is a combination.
$$P(\text{Ace, King, Queen in that order}) = \frac{4}{52} \cdot \frac{4}{51} \cdot \frac{4}{50} \approx 0.00048$$

Or

$$P(\text{Ace, King, Queen in any order}) = 3! \cdot \frac{4}{52} \cdot \frac{4}{51} \cdot \frac{4}{50} \approx 0.0029$$
Just an example of why the probabilities would be different. Not my best work but also not the focus of this section.
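A small sketch of the card example, assuming draws without replacement from a standard 52-card deck (the numeric values are my own, not the course book's):

```python
from math import perm

# Drawing 3 cards from a 52-card deck with four of each rank.
# Ace, then King, then Queen in exactly that order:
p_ordered = (4 / 52) * (4 / 51) * (4 / 50)

# Ace, King, Queen in any order: 3! orderings of the same three ranks.
p_any_order = perm(3, 3) * p_ordered  # perm(3, 3) == 3! == 6

print(p_ordered)    # ~0.00048
print(p_any_order)  # ~0.0029
```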
If you expand the binomial series:

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$
This interesting property is why $\binom{n}{k}$ is sometimes referred to as the binomial coefficient.
p. 69
The Binomial Distribution describes the general outcomes of performing $n$ independent Bernoulli trials. Independence implies that the result does not depend on the sequence of previous observations.
The random variable $X$ then describes the probability that we observe event $A$, the success, $k$ times in our $n$ trials. $X$ is a discrete random variable taking on values $k = 0, 1, \dots, n$. We then have $k$ successes, or occurrences of event $A$, and $n - k$ failures, or occurrences of event $\bar{A}$.
We let $P(A) = p$ and $P(\bar{A}) = 1 - p$. The overall probability of a specific sequence with $k$ successes and $n - k$ failures is given by $p^k (1-p)^{n-k}$ because they are independent events.
Let’s look at a small set of just 5 events. Of these 5 events, we want 2 successes, just so we need to do a little bit of math. Given the symbols $S$ and $F$ to represent success and failure, one possible outcome could be $SSFFF$, and another $SFSFF$, or $FFFSS$, etc…
We have:

$$P^5_2 = \frac{5!}{(5-2)!} = \frac{120}{6} = 20$$
So, there are 20 ordered selections. However, picking success positions $(1, 2)$ gives the same sequence as picking $(2, 1)$: both are $SSFFF$, with 3 failures and 2 successes. This is how the probability in the distribution is set up; order does not matter. Therefore, we go from permutation to combination:

$$\binom{5}{2} = \frac{5!}{2!\,3!} = 10$$
In the context of our situation, we have 10 uniquely different combinations. The binomial distribution does not care about the order in which successes and failures occur. Its goal is to describe the probability of observing $k$ “successes” in $n$ trials, where $p$ is the probability of observing a “success”. The distribution is given as follows:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
You can continue the example yourself, giving reasonable values for success and failure.
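Continuing the 5-trial, 2-success example with a made-up success probability of $p = 0.3$ (my value, not the book's), a quick check against scipy:

```python
from math import comb
from scipy.stats import binom

n, k, p = 5, 2, 0.3  # 5 trials, 2 successes, hypothetical success probability

# By hand: number of orderings times the probability of any one ordering.
p_manual = comb(n, k) * p**k * (1 - p) ** (n - k)

# Same thing via scipy.
p_scipy = binom.pmf(k, n, p)

print(p_manual, p_scipy)  # both ~0.3087
```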
The expected value is given by:

$$E[X] = np$$
In this situation, $X$ is the number of successes, which is why it is written this way I think. This is kind of intuitive: the probability of success times the number of trials.
The variance is given by:

$$\text{Var}(X) = np(1-p)$$
You can check out Binomial Distribution | Wiki for proofs I think, and just more information.
The following recurrence formula is helpful in practical applications:

$$P(X = k + 1) = \frac{n - k}{k + 1} \cdot \frac{p}{1 - p} \, P(X = k)$$
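A sketch of how that recurrence builds the whole pmf from $P(X = 0) = (1-p)^n$ without computing any factorials (the code and parameter values are mine):

```python
def binomial_pmf_by_recurrence(n: int, p: float) -> list[float]:
    """Build P(X=0..n) iteratively: P(k+1) = P(k) * (n-k)/(k+1) * p/(1-p)."""
    pmf = [(1 - p) ** n]
    for k in range(n):
        pmf.append(pmf[-1] * (n - k) / (k + 1) * p / (1 - p))
    return pmf

pmf = binomial_pmf_by_recurrence(5, 0.3)
print(pmf[2])    # ~0.3087, matches the direct formula above
print(sum(pmf))  # ~1.0
```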
The Negative Binomial Distribution | Wiki predicts the number of failures in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of successes occurs. An example could be rolling a die, where a success is rolling a one and we want to roll two of them. The probability distribution is thus over the number of failures that will occur before our successes, with their associated probabilities of occurring.
The question is “What is the probability of exactly $k$ failures before we observe the $r$-th success?” The Negative Binomial Distribution is given by:

$$P(X = k) = \binom{k + r - 1}{k} p^r (1 - p)^k$$
where $k$ is the number of failures before the $r$-th success and $p$ is the probability to observe a success.
The mean is given by:

$$E[X] = \frac{r(1 - p)}{p}$$
And the variance is given by:

$$\text{Var}(X) = \frac{r(1 - p)}{p^2}$$
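A sketch of the die example: success is rolling a one ($p = 1/6$) and we want $r = 2$ successes. `scipy.stats.nbinom` counts failures before the $r$-th success, matching the convention above (the specific printed values are my own choices):

```python
from scipy.stats import nbinom

r, p = 2, 1 / 6  # want two ones; success probability per roll

# Probability of exactly k failures (non-ones) before the second one appears.
for k in range(5):
    print(k, nbinom.pmf(k, r, p))

print(nbinom.mean(r, p))  # r*(1-p)/p  = 10 expected failures
print(nbinom.var(r, p))   # r*(1-p)/p² = 60
```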
p. 72
The author refers to “The Evolution of the Normal Distribution” in Mathematics Magazine by S. Stahl (2006) for a deeper historical dive into the normal distribution. A normal random variable $X$ for $x \in \mathbb{R}$ has PDF:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
where $\exp(x) = e^x$, written only to better view the exponent. The density only depends on $\mu$ and $\sigma^2$, the mean and the variance.
Setting $\mu = 0$ and $\sigma^2 = 1$ gives the standard normal distribution.
Notation for the normal distribution is often indicated with:

$$X \sim \mathcal{N}(\mu, \sigma^2)$$
Some texts may parameterize with the standard deviation $\sigma$ instead, which can indeed make some notations confusing. I will do my best to use the variance when possible, for consistency and because variance comes before standard deviation.
The Law of Large Numbers | Wiki is a theorem stating that the average of the results obtained from a large number of independent and identical random samples converges to the true value, if it exists. That is, the sample mean converges to the true mean. This guarantees stable long-term results for the averages of some random events.
The LLN only applies to the average of the results obtained from repeated trials and claims that this average converges to the expected value. It does not claim that the sum of results gets close to the expected value times $n$ as $n$ increases.
Distributions such as the Cauchy distribution have problems converging as $n$ becomes larger because of heavy tails (the Cauchy distribution has no finite mean).
Mathematically we have something like, the sample average:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
converges to the expected value:

$$\bar{X}_n \to \mu \quad \text{as } n \to \infty$$
Also, for identical finite variance $\sigma^2$ and no correlation between the random variables, the variance of the average of $n$ random variables is:

$$\text{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$$
Variance is a very fun topic.
There is a strong law and a weak law. You can read about them online.
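A quick simulation of the law in action, assuming fair die rolls (the sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Fair die: true mean is 3.5. Watch the running average settle toward it.
rolls = rng.integers(1, 7, size=100_000)
for n in (10, 100, 1_000, 100_000):
    print(n, rolls[:n].mean())
# The averages approach 3.5 as n grows; the *sum*, by contrast,
# keeps drifting away from 3.5 * n in absolute terms.
```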
The normal distribution is very important to the Central Limit Theorem | Wiki. The Wiki page provides a proof that spans over a good portion of mathematics from characteristic functions and little “o” notation to Taylor’s theorem.
Also check out the book referenced earlier, section 5.3.
Theorem - Central Limit Theorem: Let $X_1, \dots, X_n$ be observations of a random sample from a distribution that has mean $\mu$ and positive variance $\sigma^2$. Create random variable $Y_n$:

$$Y_n = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}$$
This random variable converges in distribution to a random variable that has a normal distribution with mean 0 and variance 1.
The book “Introduction to Mathematical Statistics”, section 5.3, also provides a proof using moment generating functions.
The course book states that if a random variable is the sum of $n$ independent random variables that each follow some probability distribution with mean $\mu$ and variance $\sigma^2$, it will converge towards a normal distribution with mean $n\mu$ and variance $n\sigma^2$. That means the standard deviation grows by a factor of the square root of $n$.
Additionally, the random variables are generally required to be identically and independently distributed.
The book refers to the Galton Board, sometimes called a “bean machine”. Each bean, as it falls through the machine, goes through many Bernoulli trials, an experiment with two distinct outcomes. This is repeated many times for many beans, creating a normal distribution.
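A small Galton-board-style sketch: each bean undergoes many left/right Bernoulli trials, and the final positions pile up into an approximately normal shape (the board size and bean count are arbitrary values of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

n_pegs, n_beans = 50, 10_000
# Each row is one bean; each column is one peg bounce (0 = left, 1 = right).
bounces = rng.binomial(1, 0.5, size=(n_beans, n_pegs))
positions = bounces.sum(axis=1)  # final bin of each bean = sum of Bernoulli trials

print(positions.mean())  # ~ n*p       = 25
print(positions.var())   # ~ n*p*(1-p) = 12.5
# A histogram of `positions` looks close to N(25, 12.5), as the CLT predicts.
```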
Because of its enormous importance in practical applications, we typically assume a normal distribution when we report measured or determined values together with a measure of dispersion.
Deviations are as follows for a normal distribution:

- $\mu \pm 1\sigma$: about 68.27% of values
- $\mu \pm 2\sigma$: about 95.45% of values
- $\mu \pm 3\sigma$: about 99.73% of values
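Those coverage numbers can be checked directly from the standard normal CDF:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)  # P(mu - k*sigma < X < mu + k*sigma)
    print(k, round(coverage, 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```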
p. 76
We will now consider questions like: how many lightning bolts will strike the ground in a given area? What amount of a specific product will be sold on any given day at a specific store location? The idea is a random count over some given dimension, be it area, location, possibly time, etc…
I was looking at Poisson Distribution | LibreTexts initially, a good resource.
Poisson Distribution | Statology also appears to be a nice resource.
The Poisson distribution describes the count data for a fully random process. Popular for modelling the number of times an event occurs in an interval of time or space. Like shoppers buying items or entering a store.
It is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space. The distribution, written $\text{Pois}(\lambda)$, is defined as:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
where $k$ is the number of events we would like to observe and $\lambda$ is the average number of events we observe per unit of time or space. So, if you want to know the number of customers entering a store per hour, then $\lambda$ must be the average per hour.
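A quick sketch with a made-up rate of $\lambda = 4$ customers per hour (my number, not from the book):

```python
from scipy.stats import poisson

lam = 4  # hypothetical average of 4 customers per hour

for k in range(8):
    print(k, poisson.pmf(k, lam))  # P(exactly k customers in one hour)

print(poisson.mean(lam), poisson.var(lam))  # both equal lambda
```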
The Poisson distribution emerges as the limit of the binomial distribution if we take many trials while keeping the mean $np = \lambda$ fixed.
The course book gives some examples and notes that as increases, the Poisson distribution approximates a Gaussian or Normal Distribution. How do we increase lambda? You can increase the time span for events to occur. Then more events would occur and lambda would get bigger.
The course book also gives the proof that the Poisson distribution emerges as the limit of the Binomial distribution as $n \to \infty$ with $np = \lambda$ held fixed. It is more of an approximation.
From this approximation though, we can derive two important characteristics:

$$E[X] = \lambda, \qquad \text{Var}(X) = \lambda$$
Another fun fact is:
The course book quickly covers an example of customers entering a store and noting that the variance does not quite equal the mean. This means that the assumption of a Poisson distribution does not quite hold. However, the Poisson distribution can still describe the data reasonably well.
Since the Central Limit Theorem concerns the sum of independent and identically distributed random variables, what is the sum of independent Poisson random variables? The sum of independent numbers distributed according to a Poisson distribution is… a Poisson distribution!
The book shows the proof on pp. 78-81.
The new lambda is $\lambda = \lambda_1 + \lambda_2$. And remember that the larger the parameter lambda becomes, the closer it is to resembling a normal distribution, which we would expect from the central limit theorem.
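A simulation sketch of that closure property, with two arbitrary rates of my choosing:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
lam1, lam2 = 2.0, 5.0

# Sum of two independent Poisson samples...
total = rng.poisson(lam1, 100_000) + rng.poisson(lam2, 100_000)

# ...compared against a single Poisson with lambda = lam1 + lam2.
for k in (3, 7, 10):
    print(k, np.mean(total == k), poisson.pmf(k, lam1 + lam2))
```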
p. 81
The Poisson distribution has the limitation that it is not great for modelling over-dispersed data where $\text{Var}(X) > E[X]$. This will require a more general approach to describe discrete count data.
Having the parameter lambda for Poisson implies that we know the value of the mean without uncertainties. However, in practical applications, the mean itself is typically a random variable and therefore needs to be described by a probability distribution. This can be done with Bayesian Statistics. This field uses the information of a sample and the prior assumption to allow you to compute the modelled distribution.
Definition - Conjugate Prior: The prior is of the same family of distributions as the posterior.
The Gamma family of distributions is a conjugate prior for the Poisson distribution. We will use it as a prior for the Poisson parameter $\lambda$.
This is typically parameterized as $\text{Gamma}(\alpha, \beta)$.
That is a bit to take in. So, $\alpha$ is a form (shape) parameter and $\beta$ is the rate. That indicates the prior describes a total count of $\alpha$ events in $\beta$ prior observation intervals.
We can also express this by the following convolution integral:

$$P(X = k) = \int_0^\infty \text{Pois}(k \mid \lambda)\, \text{Gamma}(\lambda \mid \alpha, \beta)\, d\lambda$$
The Gamma-Poisson mixture model can also be expressed as the negative binomial distribution. You start with the convolution integral above and insert the definition of the distributions.
Let $\alpha$ be a shape parameter and $\beta$ be an inverse scale (rate) parameter:

$$\text{Gamma}(\lambda \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta\lambda}$$
Gamma has some fun properties. The Gamma function arises from the extension of the factorial to non-integer numbers via:

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$$
And, for positive integers $n$:

$$\Gamma(n) = (n - 1)!$$
We let $r = \alpha$ and $p = \frac{\beta}{1 + \beta}$ (because $\beta$ is an inverse scale), and pop those into our equations. The book does a great job on pp. 82-83. It ends up with the negative binomial distribution.
This means that we can describe a Poisson process where we treat the parameter $\lambda$ as a random variable following a Gamma distribution as prior using a negative binomial distribution. We can use the following relationship between the mean and variance of the data and the parameters of the negative binomial distribution:

$$p = \frac{\mu}{\sigma^2}, \qquad r = \frac{\mu^2}{\sigma^2 - \mu}$$
where we assume $\sigma^2 > \mu$.
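A simulation sketch of the mixture: draw $\lambda$ from a Gamma prior, then a Poisson count given that $\lambda$, and compare with the negative binomial pmf. The parameter mapping below ($r = \alpha$, $p = \beta/(1+\beta)$) is one common convention and may differ from the course book's; the shape and rate values are my own.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(7)
alpha, beta = 3.0, 0.5  # hypothetical Gamma shape and rate (inverse scale)

# Gamma-Poisson mixture: lambda ~ Gamma(alpha, rate=beta), then k ~ Poisson(lambda).
lambdas = rng.gamma(shape=alpha, scale=1 / beta, size=200_000)
counts = rng.poisson(lambdas)

# Equivalent negative binomial under this convention.
r, p = alpha, beta / (1 + beta)
for k in (0, 3, 8):
    print(k, np.mean(counts == k), nbinom.pmf(k, r, p))

print(counts.mean(), counts.var())  # variance exceeds mean: over-dispersion
```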
p. 85
We have been counting events occurring in an amount of time or space. But we can flip the question and ask how long do we wait to observe the next event? This begins down the road of failure modelling and life contingencies.
A fixed rate means that the rate and mean are static and connected by $\mu = \frac{1}{\lambda}$. In that equation, we are letting $\lambda$ be the fixed rate.
For the Poisson distribution, lambda was the mean, so we can replace that with the more expressive mean equation. There’s a good section here, The Exponential Distribution | Stats.LibreTexts, about this distribution.
Basically, you can derive the cumulative distribution function from the Poisson like so: the probability of seeing zero events in an interval of length $t$ is $P(T > t) = e^{-\lambda t}$, so

$$F(t) = P(T \le t) = 1 - e^{-\lambda t}$$
We let $P(T > t)$ mean the probability that the time $T$ it takes for one event to occur is greater than $t$. Sorry to throw a random variable (pun intended) into the mix halfway through.
The exponential distribution has a fun memoryless property. If you wait some time for an event to occur and it does not occur, then you check the probability again, it will not change. Something like:

$$P(T > s + t \mid T > s) = P(T > t)$$
The “Introduction to Mathematical Statistics” book leaves 2 cases of this property to the reader to solve. Good luck.
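A sketch of the memoryless property by simulation, with an arbitrary rate and waiting times of my choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.5  # hypothetical rate: on average one event every 2 time units

waits = rng.exponential(scale=1 / lam, size=1_000_000)

s, t = 2.0, 1.5
p_unconditional = np.mean(waits > t)               # P(T > t)
p_conditional = np.mean(waits[waits > s] > s + t)  # P(T > s + t | T > s)
print(p_unconditional, p_conditional)  # both ~ exp(-lam * t) ~ 0.472
```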
p. 87
The exponential distribution describes when it is most likely that the next event occurs. Looking at the density function though, you’ll see that the single most likely moment for the next event is always “now”. This is because the exponential density falls monotonically. This is not always desired.
The Weibull Distribution was originally developed in the context of materials science to describe the failure of materials under stress. It has many other applications. It is given by:

$$f(t) = \frac{k}{\lambda} \left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^k}, \quad t \ge 0$$
where $k$ is the “shape” parameter or the Weibull modulus. In the case that $k = 1$, the Weibull distribution reduces to the exponential distribution! There are 3 different regimes for the shape parameter $k$:

- $k < 1$: the failure rate decreases over time (early “infant mortality” failures)
- $k = 1$: the failure rate is constant, the exponential case
- $k > 1$: the failure rate increases over time (ageing or wear-out failures)
The distribution becomes more symmetric and localized for values $k \gg 1$, much greater than 1. The course book shows graphs where even for moderate values of $k$ it begins looking more symmetric.
For predictive maintenance, we can use observational data from the component to predict the parameters of the Weibull distribution. We could then define a threshold where the risk of failure is acceptable and schedule the maintenance once the probability of failure approaches this threshold.
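A quick look at the density in the three regimes of the shape parameter, using `scipy.stats.weibull_min` with unit scale (the specific shape values and grid are mine):

```python
import numpy as np
from scipy.stats import weibull_min

t = np.linspace(0.05, 3, 5)
for k in (0.5, 1.0, 5.0):  # decreasing hazard, exponential case, "ageing" case
    print(k, np.round(weibull_min.pdf(t, k), 3))
# k = 1 reproduces the exponential density; large k gives a more symmetric,
# localized peak, which is what the course book's plots show.
```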
The new question is centred around multivariate random variables. The book “Introduction to Mathematical Statistics” dedicates Chapter 2 to Multivariate Distributions. It starts with a basic coin flip example.
Definition - Random Vector: given a random experiment with a sample space $\mathcal{C}$, consider two random variables $X_1$ and $X_2$, which assign to each element $c$ of $\mathcal{C}$ one and only one ordered pair of numbers, $(X_1(c), X_2(c))$. The ordered pair $(X_1, X_2)$ is the random vector and lives in the space, or set of ordered pairs, $\mathcal{D} = \{(x_1, x_2) : x_1 = X_1(c),\ x_2 = X_2(c),\ c \in \mathcal{C}\}$.
The course book looks at 2 Gamma-distributed random variables $X_1$ and $X_2$. What can we then say about the sum $Y = X_1 + X_2$?
We can say that $Y$ is also a random variable that maps the sample space $\mathcal{C}$ to some real numbers $y$, with any real values. We can also compute the density of $Y$, provided the following general theorem, which requires us to reason over multivariate random variables.
We need to lay down some of the groundwork.
A multivariate continuous random variable $X = (X_1, \dots, X_n)$ is a mapping from the sample space $\mathcal{C}$ to the set of vectors in $\mathbb{R}^n$.
Let $\mathcal{S} \subset \mathbb{R}^n$ be the set of possible values of $X$. So, in the above, we have that $X$ is like a function that takes in $c \in \mathcal{C}$ and maps to $\mathbb{R}^n$ now. Something like, in our Gamma example, $\mathcal{S}$ is the set of pairs $(x_1, x_2)$ where both $x_1, x_2 > 0$.
Now, suppose we have a mapping $g : \mathcal{S} \to \mathcal{T}$, where $\mathcal{T}$ is a subset of $\mathbb{R}^m$. It is like we have another function that takes in elements of $\mathcal{S}$, which are themselves functions of $c \in \mathcal{C}$. We have then $Y = g(X)$, often written as $g \circ X$ or $g(X(c))$.
The course book continues that it is the same as having a series of continuous random variables, expressed as $X_1, \dots, X_n$, and a series of functions $g_1, \dots, g_m$. Consider:

$$Y = \big(\, g_1(X_1, \dots, X_n),\ \dots,\ g_m(X_1, \dots, X_n) \,\big)$$
I added spaces to hopefully clarify what is encapsulated. For the two Gamma random variables example, we’d have that $n = 2$ and $X = (X_1, X_2)$. Then, we can take, as transformation, the mapping from $\mathcal{S}$ to $\mathcal{T}$.
Now suppose that $g$ is a differentiable transformation. This means that $g$ is differentiable, that $\partial g_i / \partial x_j$ exists for each $i, j$. We also suppose that the inverse $g^{-1}$ of $g$ exists. We can then state the following transformation theorem, proven in the “Introduction to Mathematical Statistics” book by Hogg:

$$f_Y(y) = f_X\big(g^{-1}(y)\big)\, \lvert J \rvert$$
$J$ is the Jacobian of the inverse transformation and is non-zero (at least not identically zero) on $\mathcal{T}$. I think I have notes from the Advanced Maths course, but the Jacobian is the determinant of the matrix of the partial derivatives of $g^{-1}$.
The transformation theorem allows us to calculate the probability density function of a transformed random variable by combining the density with the inverse of the transformation and multiplying by the Jacobian of the inverse of the transformation.
An example of this is provided on pp. 91-93.
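I won't reproduce the book's derivation, but here is a quick Monte Carlo sketch of the kind of result this transformation machinery yields: the sum of two independent Gamma variables with a common scale is again Gamma, with the shapes added (the parameter values are my own, not the book's):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(11)
a1, a2, scale = 2.0, 3.5, 1.0  # hypothetical shapes and common scale

# Simulate the sum of two independent Gamma random variables.
y = rng.gamma(a1, scale, 500_000) + rng.gamma(a2, scale, 500_000)

# Compare the empirical CDF of the sum with Gamma(a1 + a2, scale) at a few points.
for q in (2.0, 5.0, 9.0):
    print(q, np.mean(y <= q), gamma.cdf(q, a1 + a2, scale=scale))
```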
Exponential Distribution | ProbabilityCourse
I am not a fan of how the questions are worded and what is chosen to be quizzed on.
Define a random variable that follows the Bernoulli distribution.
It only has one parameter, and its two outcomes can be labelled with any values really, not just 0 or 1.
Suppose $X$ follows an exponential distribution with parameter $\lambda$.
The parameter $\lambda$ is the inverse of the mean ($E[X] = 1/\lambda$), and the variance is then the mean squared ($\text{Var}(X) = 1/\lambda^2$).
Which is correct?
Which is correct?
One can say that a variable follows, approximately, a normal distribution when…?
When $X$ is the sum of many iid variables, as per the central limit theorem. Not when $X$ is the average of many RVs or the sum of many uniform RVs, something I don’t think was even covered yet.