title: Advanced Statistics
subtitle: DLMDSAS01
authors: Prof. Dr. Unknown
publisher: IU International University of Applied Sciences
date: 2023
p. 11 - 42
Learning goals include:
I want to put Statistics | LibreTexts here, as they have an entire statistics website with many free resources.
Statistics as a science is the science of collecting, presenting, analysing, and interpreting facts and data. This sounds a bit like Data Science.
Statistics can also be the plural for statistic, which is a single measure of a sample. More about this later.
What are the two main branches into which Statistics as a science is commonly subdivided?
How do we bridge the two branches of statistics, descriptive and inferential statistics?
What is a “population” in the context of statistics?
What does the term “measurement” mean in statistics?
Not everything can be measured the same way. Therefore, there are several different levels of measurement. What are four commonly used levels of measurement?
I imagine one day, in the far future, if I revisit these notes I would separate probability from statistics, as they are actually two different fields of study.
What is probability theory generally concerned with?
What is the difference between deterministic and random events?
What is a random experiment?
What is a sample space?
What is a set?
What is an event?
What is an outcome?
What are the two extreme events?
What is a random variable?
What is the Expectation value (aka: expected value)?
What is the union of 2 events?
What is the intersection of 2 events?
What do we mean when 2 events are mutually exclusive (aka: disjoint)?
There are other things, like the empty set $\emptyset$, which is a set with no elements; it is like the zero of set theory. We say $A^c$ is the complement of $A$, which consists of all outcomes in the sample space that are not in $A$. You may also see it written as $\bar{A}$.
What are some fun properties?
We use $x \in A$ to say element $x$ is contained in the event $A$. Also, $A \subseteq \Omega$ means that all elements of event $A$ are contained in the sample space $\Omega$. We use $\subseteq$ for "subset or equal". If you see $|A|$, that is the count of elements in an event.
A popular way to visualize sets of events is with a Venn diagram, introduced by John Venn in the 19th century. The book provides examples on page 16. Spend time learning Venn diagrams if you are unfamiliar with them.
In 1933, Andrey Kolmogorov introduced 5 axioms that became central to probability theory (the English translation of his work appeared in 1956). They are commonly reduced to 3 axioms you need to remember.
What are Kolmogorov’s 3 Axioms?
Remember, mutually exclusive just means that $A$ and $B$ have no elements in common. So if one event happens, the other cannot.
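For reference, the three axioms in symbols (standard statement; they are also restated in the self-check answers at the end of these notes):

$$1)\; P(A) \ge 0 \qquad 2)\; P(\Omega) = 1 \qquad 3)\; P(A \cup B) = P(A) + P(B) \;\text{ if } A \cap B = \emptyset$$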
If two events are not mutually exclusive, how do we add them?
That is, we add $P(A)$ (which includes the part of $A$ that intersects $B$) and $P(B)$ (which includes the part of $B$ that intersects $A$). If that sounds confusing, it is because we have essentially added the intersection of the two events twice, which is why we subtract it once. The same logic applies when we add mutually exclusive events, but in that special case the intersection of $A$ and $B$ is empty, so $P(A \cap B) = 0$.
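In symbols, this is the inclusion-exclusion rule for two events:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$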
How is conditional probability different than regular probability?
How do we define conditional probability mathematically?
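The standard definition, assuming $P(B) > 0$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$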
Also, $P(A \mid B)$ is read as "the probability of A given B." I have also realized we are discussing conditional probability without having defined the simpler case of independence first.
What is meant when we say that two events are independent?
How is independence represented mathematically?
Simple multiplication gives the probability of both events occurring. If it is anything but this, that is indicative that the events have some dependence.
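In symbols, independence of $A$ and $B$ means:

$$P(A \cap B) = P(A)\,P(B)$$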
What happens to conditional probability if events $A$ and $B$ are independent?
Based on the above, you would have $P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A)$.
This is a good explanation of the above, and I will walk through it logically (hopefully). If events $A$ and $B$ are independent, then $P(A \mid B) = P(A)$. That is, the probability of event $A$ happening, given that event $B$ has happened, is still just $P(A)$, because there is no dependence on event $B$.
Therefore, in this sort of backwards logical way: $P(A \cap B) = P(A \mid B)\,P(B) = P(A)\,P(B)$.
A sticking point for me is independence and mutual exclusivity.
What is the difference between two events being independent and two events being mutually exclusive?
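A compact way to keep these apart (standard definitions):

$$\text{mutually exclusive: } P(A \cap B) = 0 \qquad\qquad \text{independent: } P(A \cap B) = P(A)\,P(B)$$

So two events that both have non-zero probability can never be mutually exclusive and independent at the same time.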
What is the law of total probability?
Bear with me… the Law of Total Probability | wiki is a theorem that allows us to decompose the probability of an event into its constituents. It is a fundamental rule relating marginal probabilities to conditional probabilities, and it expresses the total probability of an outcome as a sum over several distinct events.
It’s like a weighted average. Because of this, the marginal probability may be called the “average probability”.
Consider an example of trying to determine the probability of a machine failing. It can be expressed as a sum of probabilities, one for each way the machine can fail.
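In symbols, for a partition $B_1, \ldots, B_n$ of the sample space (e.g., the distinct ways the machine can fail):

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$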
The course book covers an example on p. 22 that also covers formulas for sensitivity and specificity. Sensitivity asks: did we catch all the positive cases? This is important in medical testing.
What are Type I and Type II errors?
What is set theory?
What is a mapping between two sets, $A$ and $B$?
How do we characterize a discrete probability distribution?
What does a probability mass function do?
How do we characterize a continuous probability distribution?
What does a probability density function do?
How can we derive probabilities from the PDF?
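In symbols, for a continuous random variable $X$ with density $f$:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$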
What is the mathematical definition of cumulative distribution function?
Often, the cumulative distribution function is denoted with a capital letter.
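For reference, the standard definition for a continuous random variable $X$ with density $f$:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$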
Can you describe the Gaussian distribution?
Also called the Normal distribution, it is expressed in terms of its mean $\mu$ and standard deviation $\sigma$: $X \sim \mathcal{N}(\mu, \sigma^2)$.
And it has a density function written as (one of a few ways):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
You see, we use a lowercase $x$ to denote values of the random variable $X$. Sometimes you will also see the random variable as a subscript, such as $f_X(x)$. It looks similar to the partial-derivative subscript notation but should not be confused with it.
Additionally, you may also have the parameters of the distribution provided in the function's definition, such as $f(x \mid \mu, \sigma)$ or $f(x;\, \mu, \sigma)$.
This makes the distribution parameters explicit. I don't mind this notation.
What characteristics arise in probability distributions because of the Kolmogorov axioms?
For the continuous case, probability is always assigned to a range and not to a specific value, because the probability of any single pinpoint value is zero ($P(X = x) = 0$).
p. 26 covers important probability distributions!
Discrete:
Continuous:
Words of warning:
To save on space, we did not include explicit distribution parameters in the function definition, but you can find those in the notation.
We can extend these concepts and define probability distributions for two or more random variables.
What do we call a probability distribution that combines multiple random variables?
What is the notation for a joint probability distribution for random variables $X$ and $Y$?
What is a marginal distribution?
Considering random variables $X$ and $Y$ as continuous, what are their marginal distribution functions?
Integrate the joint probability distribution function over the variable you are looking to exclude. The book goes on to explain how this notation can be confused with partial derivatives. To avoid confusion, we will use the fraction notation like $\frac{\partial f}{\partial x}$ to denote partial derivatives.
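In symbols, writing $f_{X,Y}$ for the joint density:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy \qquad\qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$$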
We can extend the two-dimensional case to even more random variables! We are going to look at the Normal Distribution | wiki, but more specifically the Multivariate Normal Distribution | Wiki. Unfortunately, the text does not cover this in much depth and only gives an example.
The Wiki page gives several proofs, using convolutions, Fourier transforms, and geometric arguments. However, the outcome is that if $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ are independent normally distributed random variables,
Then,
$$Z = X + Y \sim \mathcal{N}\!\left(\mu_X + \mu_Y,\; \sigma_X^2 + \sigma_Y^2\right)$$
Yes, the sum is also a normally distributed random variable.
This law goes by very many names, like the variance decomposition formula or "Eve's Law" (most commonly, the law of total variance). It states that if $X$ and $Y$ are random variables on the same probability space, and the variance of $Y$ is finite, then:
$$\operatorname{Var}(Y) = \operatorname{E}\!\big[\operatorname{Var}(Y \mid X)\big] + \operatorname{Var}\!\big(\operatorname{E}[Y \mid X]\big)$$
Right, this doesn’t relate to our current topic but I will leave it in because why not.
Bienaymé’s identity states that:
The second expression is just shorthand notation for the first. An important thing to note is that the . So, whenever we are actually summing the variance of that particular random variable in the mix. This is why, if the random variables are independent, all of the covariance factors will equal zero, and we just sum the variance.
Correlation, or dependence, is any statistical relationship between two random variables or bivariate data.
The (Pearson) correlation coefficient is $\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$, defined for $\sigma_X\,\sigma_Y > 0$.
Ok, if we again have $X$ and $Y$ as univariate independent normally distributed random variables and their sum is $Z = X + Y$, then $Z \sim \mathcal{N}(\mu_X + \mu_Y,\, \sigma_X^2 + \sigma_Y^2)$. However, I am not sure if this goes as far as Analysis of Variance | wiki; probably not, but it's a good read.
The answer is in Sum of Normally Distributed Random Variables | Wiki. If the random variables are not independent (but are still jointly normal), then we have:
$$Z = X + Y \sim \mathcal{N}\!\left(\mu_X + \mu_Y,\; \sigma_X^2 + \sigma_Y^2 + 2\rho\,\sigma_X\sigma_Y\right)$$
where $\rho$ is the correlation between $X$ and $Y$.
It may seem silly to use correlation instead of covariance, but in other calculations, we may be using a covariance matrix | wiki.
The Multivariate Normal Distribution | Wiki of a k-dimensional random vector $\mathbf{X} = (X_1, \ldots, X_k)^{\top}$ can be written as:
$$\mathbf{X} \sim \mathcal{N}_k(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
Where you have a k-dimensional mean vector
$$\boldsymbol{\mu} = \operatorname{E}[\mathbf{X}] = \left(\operatorname{E}[X_1], \ldots, \operatorname{E}[X_k]\right)^{\top}$$
and a covariance matrix:
$$\boldsymbol{\Sigma}_{ij} = \operatorname{Cov}(X_i, X_j), \qquad i, j = 1, \ldots, k$$
The course book continues to something like the density:
$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \,|\boldsymbol{\Sigma}|}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
Where $\boldsymbol{\Sigma}^{-1}$ is the inverse of the covariance matrix, sometimes called the precision matrix and denoted $\boldsymbol{\Lambda}$ (notation varies).
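A minimal sketch (my own code, not the course book's) showing how the mean vector and covariance matrix define a multivariate normal in practice, using NumPy to draw samples and check the sample moments:

```python
# Sample from a 2-D multivariate normal and compare the sample moments
# against the true mean vector and covariance matrix.
import numpy as np

mu = np.array([0.0, 1.0])              # k-dimensional mean vector (here k = 2)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])         # covariance matrix (symmetric, positive definite)

rng = np.random.default_rng(seed=42)
samples = rng.multivariate_normal(mu, Sigma, size=10_000)

print(samples.mean(axis=0))            # should be close to mu
print(np.cov(samples, rowvar=False))   # should be close to Sigma
```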
p. 28
A population in statistics refers to the entire set of measurable objects under study. We understand that a population includes all elements of a set of data, as opposed to a sample drawn from the population.
We would do this simply because the population is often too large to measure each individual object.
What are order statistics?
These describe the sample in an ascending order to help us describe the sample in a structured manner.
How do we denote ordered statistics?
Of variable $X$, with a sample of size $n$:
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$
Why would ordered statistics be useful?
You could divide the number of observations into four equal parts, called quartiles. Ordering aids in finding the median, a useful and popular statistic. It also helps find the range, which is just the maximum minus the minimum.
So, very good for providing summary statistics of a given series.
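A minimal sketch (made-up sample values) of how the ordered sample gives the quartiles, median, and range:

```python
# Order statistics and the summary values mentioned above.
import numpy as np

x = np.array([7.0, 10.0, 7.5, 7.0, 9.2, 8.1])

x_ordered = np.sort(x)                      # the order statistics x_(1) <= ... <= x_(n)
q1, median, q3 = np.percentile(x_ordered, [25, 50, 75])
data_range = x_ordered[-1] - x_ordered[0]   # max minus min

print(x_ordered, q1, median, q3, data_range)
```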
p. 29
Buckle up, it’s a long ride.
Another name for problems of dimensionality is “curse of dimensionality”.
The main question is as follows: if we keep adding features (dimensions), will our classifier keep getting better?
No, this will not work as intended. There will probably be a point where increasing the dimensionality by adding new features actually degrades the performance of the classifier. This implies that there is an optimal number of features for each model that maximizes the classifier's performance.
The book continues to look at adding features one at a time to a linear classifier starting with average “red” colour of a photo. Then average “green” colour and then average “blue” colour.
Notice that the density of training samples decreases exponentially when we increase the dimensionality of the problem. For example, if we have 10 images of cats and dogs to train on and we have one feature, then we have something like $10/5 = 2$ samples per interval, assuming our feature range of width 10 is divided into 5 intervals. Seems arbitrary, but it illustrates the point.
However, with the second dimension, we still have just 10 samples, but they now cover a feature space with an area of $5 \times 5 = 25$ square units. The sample density quickly falls to $10/25 = 0.4$ samples per interval. Go ahead and consider the 3rd dimension ($5^3 = 125$ cells, so $10/125 = 0.08$ samples per interval).
Adding additional features increases the dimensionality of the feature space, giving it more space and making sample instances more sparse.
This is a plane in a higher-dimensional vector space.
A Hyperplane | Wiki is a subspace whose dimension is one less than that of its ambient space. So, if you have points in 3-dimensional space, a hyperplane is a 2-dimensional plane.
Due to sparsity, it becomes easier to find separating hyperplanes with higher dimensions because the likelihood that a training sample lies on the wrong side of the best hyperplane becomes infinitely small when the number of features becomes infinitely large.
That sounds like what we want, so what is the issue?
Given the graph in the book, you can see there is obvious overfitting.
What is overfitting | Wiki?
In contrast to underfitting, overfitting occurs when a training model learns the training data too well and cannot generalize to real world data well, or will fail to fit additional data or predict future observations reliably.
How does overfitting occur?
In our case of classifying dogs and cats, adding the third dimension to obtain better classification results corresponds to using a complicated non-linear classifier in the lower-dimensional feature space. The classifier learns the appearance of specific instances and specific details of our training data set, leading to overfitting.
This is a result of the curse of dimensionality.
In the context of ML models, what does the term generalize refer to?
It refers to the ability of a classifier to perform well on unseen data, even if that data is not exactly the same as the training data.
By using fewer features, we can avoid the curse of dimensionality with our classification model. It may perform worse on training data, but because it did not overfit it will generalize to other data more effectively.
Then there's this train of thought: suppose we train a classifier using only a single feature whose value ranges from zero to one and is unique for each cat and dog. If we want the training data to cover 20% of this range, we need 20% of the complete population of cats and dogs. Adding another feature results in a two-dimensional feature space. To cover 20% of the two-dimensional range, we need about 45% of the population along each dimension. This is because $0.45^2 \approx 0.2$.
The data required to cover a fixed percentage of the feature range grows exponentially with the number of dimensions. You can think of it like this: as you increase dimensionality, gaps in the data emerge, and almost all of the sample space becomes empty. The book illustrates this with images.
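To make the coverage numbers concrete: to cover a fraction $p$ of the feature range in $d$ dimensions, you need roughly $p^{1/d}$ of the range along each axis.

$$0.20^{1/1} = 0.20 \qquad 0.20^{1/2} \approx 0.45 \qquad 0.20^{1/3} \approx 0.58$$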
The more features we use, the more sparse the data becomes, and accurate estimation of classifier’s parameters becomes more difficult.
What is a hypercube | Wiki?
In geometry, a hypercube is an $n$-dimensional analogue of a square and a cube. For our purposes, it is a generalization of a cube to more than 3 dimensions. Wiki shows how to build up from a 3-cube to a tesseract (aka a 4-cube).
We note that the data sparseness may not be, and probably is not, uniformly distributed over the search space. In terms of a hypercube, the average of the feature space would be the centre of the unit square. Data that is not near the average will end up in the corners of the hypercube and will be difficult to classify. Classification is easier if most samples fall inside the inscribed unit circle.
The issue is that the volume of the inscribed hypersphere, which contains those "average" samples, tends to zero as the dimensionality tends to infinity, while the volume of the surrounding hypercube remains constant.
This surprising, and kind of counter-intuitive, observation partly explains the problems with the curse of dimensionality in classification: in high-dimensional spaces, most of the training data reside in the corners of the hypercube defining the feature space, and data in the corners is harder to classify.
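A minimal sketch (my own code) of the shrinking-hypersphere effect: the unit hypercube $[0,1]^d$ always has volume 1, while the inscribed ball of radius $0.5$ has volume $\pi^{d/2} r^d / \Gamma(d/2 + 1)$, which goes to zero:

```python
# Volume of the ball inscribed in the unit hypercube, as a function of dimension.
import math

def inscribed_ball_volume(d: int, r: float = 0.5) -> float:
    # Standard formula for the volume of a d-dimensional ball of radius r.
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (1, 2, 3, 5, 10, 20):
    print(d, round(inscribed_ball_volume(d), 6))   # shrinks rapidly towards zero
```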
If you get anything from this discussion, it is that dimensionality is a double-edged sword.
p. 35
Real-world data are often structured in a complex manner. The challenge is reducing dimensions with minimal loss of information.
Principal component analysis and discriminant analysis
Good datasets to work with are:
PCA and DA are both linear transformation methods.
PCA is used to find components (directions) that maximize the variance in our dataset. Discriminant Analysis is additionally interested in finding the components (directions) that maximize the separation (discrimination) between different classes. In contrast, PCA ignores class labels.
So PCA will treat the entire data set as one class, whereas DA retains the classes within the data set. As such, DA maximizes the spread between classes while keeping the variance within each class small.
The book begins its analysis with the Iris dataset. The first step is visualization of the data. Some features, like the sepal length, overlap between species. This means we cannot separate one species from another with this feature alone.
The goal is to find the components (directions) that maximize the separation (discrimination) between different classes.
This technique transforms data into a subspace that summarizes properties of the whole dataset with a reduced number of dimensions. The new dimensions are called principal components. We can use these newly formed dimensions to visualize the data.
Where is most variability expressed in PCA? The least?
The first principal component captures the largest share of the data's variability; each subsequent component captures less.
Note: principal components are orthogonal to each other and therefore not correlated. That is part of the method's power: it de-correlates the variables.
There are 6 steps:
Covariance is expressed by:
$$\operatorname{Cov}(X, Y) = \operatorname{E}\!\big[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])\big]$$
You can also take a different route following Wiki:
$$\operatorname{Cov}(X, Y) = \operatorname{E}\!\big[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])\big] = \operatorname{E}[XY] - \operatorname{E}[X]\,\operatorname{E}[Y]$$
The middle of the expansion may be a little confusing: remember that the expected value of an expectation is just the expected value, e.g. $\operatorname{E}\big[\operatorname{E}[X]\big] = \operatorname{E}[X]$, which is why the cross terms collapse.
Using a tool like scikit-learn to perform PCA can show us how much variance is explained by each new variable.
Note that the new variables created by PCA do not have intuitive names. This is because the PCA method finds the best linear combinations of the original features, such that the new variables are ordered by how much of the variance in the data they retain. We keep the components above our (somewhat arbitrary) variance threshold. Thus, we can limit ourselves to a much smaller list of new variables.
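A minimal sketch (not the course book's code) of PCA on the Iris data with scikit-learn, showing the explained variance per component:

```python
# PCA on Iris: standardize, project onto two components, inspect explained variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales

pca = PCA(n_components=2)                   # keep the first two principal components
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)        # e.g. roughly [0.73, 0.23] for Iris
```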
The Iris data set has only four features, so the power of PCA can be hard to see. Imagine reducing a dataset of a hundred features (or more) down to twenty features (or fewer). This offers significant computational advantages.
Linear Discriminant Analysis (LDA) goes by many names and is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications.
The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting and to reduce computational costs.
LDA is similar to PCA in that they both find component axes that maximize the variance of data. However, LDA is additionally interested in the axes that maximize the separation between multiple classes.
Main goal of Linear Discriminant Analysis is to project a feature space onto a smaller subspace while maintaining the class-discriminatory information.
There are 5 steps (Raschka, 2014b) for LDA: compute the $d$-dimensional mean vectors for each class; compute the within-class and between-class scatter matrices $S_W$ and $S_B$; compute the eigenvectors and eigenvalues of $S_W^{-1} S_B$; sort the eigenvalues and choose the top $k$ eigenvectors to form a $d \times k$ matrix $W$; and project the samples onto the new subspace.
In step 5 we can write the projection as a matrix multiplication, $Y = X \times W$, where $X$ represents the whole original dataset, and so $Y$ is the whole new dataset.
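A minimal sketch (not the course book's code) of LDA on Iris with scikit-learn; with three classes it can produce at most two discriminant components:

```python
# LDA on Iris: project the data onto the class-discriminating directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)             # projection Y = X W onto 2 discriminants

print(X_lda.shape)                          # (150, 2)
print(lda.explained_variance_ratio_)        # share of between-class variance per component
```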
Is a density function symmetric around the zero axis?
Is a density function (of a real continuous random variable) also a continuous function?
Does a discrete random variable always map to the set of natural numbers (so $X: \Omega \to \mathbb{N}$)?
Can the density function of a real continuous random variable be extended to be defined over all real numbers?
Consider the random variable whose probability distribution is defined by for .
Is the above distribution actually a probability distribution?
Yes, although I think they are usually expressed as , the sum of all probabilities will equal one.
What is the density function of the above random variable?
What is the expected value of the above distribution function?
You find that a friend tells you that it is easy to calculate the density of the sum of two random variables given the density of each of them. You then search the web for this. Which citation could you use for this so as to include in a workbook or in a report for customers of the data-processing department you work for?
Taboga, Marco (2017). Sums of independent random variables, Lectures on probability theory and mathematical statistics, Third edition. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-probability/sums-of-independent-random-variables (accessed 2021-03-01).
What is Kolmogorov’s 1st Axiom?
Positivity! The probability of an event must be a non-negative real number: $P(A) \ge 0$.
What is Kolmogorov’s 2nd Axiom?
Normalization! The probability that some event in the sample space occurs must be 1: $P(\Omega) = 1$.
What is Kolmogorov’s 3rd Axiom?
Additivity! If events $A$ and $B$ are disjoint, then the probability of either $A$ or $B$ occurring is the sum of their individual probabilities: $P(A \cup B) = P(A) + P(B)$.
Consider a sample 7, 10, 7.5, 7 of a random variable $X \sim \mathcal{N}(0, 10)$. Which of the following is a statistic?
We are told that $X \sim \mathcal{N}(0, 10)$ is a random variable. We are then given a sample from this random variable: 7, 10, 7.5, 7. Find the statistics mean, median, and variance of this sample.
Right, so we are told what the distribution is, but then given a sample! Use the sample:
The median, based on the ordered sample 7, 7, 7.5, 10, is $(7 + 7.5)/2 = 7.25$. The mode of the sample is 7.
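For completeness, the remaining statistics worked out (my own arithmetic; the variance uses the $n-1$ sample convention):

$$\bar{x} = \frac{7 + 10 + 7.5 + 7}{4} = 7.875$$

$$s^2 = \frac{(7 - 7.875)^2 + (10 - 7.875)^2 + (7.5 - 7.875)^2 + (7 - 7.875)^2}{4 - 1} = \frac{6.1875}{3} = 2.0625$$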