title: Advanced Mathematics
subtitle: DLMDSAM01
author: Dr. Robert Graf
publisher: IU International University of Applied Sciences
year: 2023
I would imagine this to be a dense section where the notes would (eventually) be split into different categories. Until then, they are here.
p. 127
Quick overview of our learning goals:
What is information theory? There's always a Wikipedia article, which states it is "the mathematical study of the quantification, storage, and communication of information." It sits in the realm of applied mathematics, combining probability theory, statistics, computer science, and some other fields. It is a vast field with roots in many other applications such as statistical inference, cryptography, neurobiology, quantum computing, black holes, etc.
A major concept of information theory is entropy. It was introduced by Claude Shannon.
To predict a quantity from observed values, we can turn to regression. Independent variables are often called features and the dependent variable is called a label.
The general model is a mapping $\hat{y} = f(x; \theta)$, where $\theta$ can be one or more parameters.
So, $X$ is a class, which is a collection of variables in a given dataset. A corresponding $x$ is a specific set of values. This vector is more of a notation, and the domain of the vector may not necessarily fulfil all requirements of an actual vector space, as we've studied previously.
For observations $x_i$, the mapping results in a prediction, $\hat{y}_i$. Linear regression is the simplest nontrivial example of building a prediction model. It takes a form like $\hat{y} = \beta_0 + \beta_1 x$.
Definition - Free parameters: Free Parameters are those parameters in a model that need to be determined using the data.
We have free parameters $\beta_0$ and $\beta_1$. To determine their values, we need observations of both $x$ and $y$.
Without diving into regression analysis, weβll just look at ways to assess the accuracy of the model. The simplest metric is the mean squared error.
Definition - Mean Squared Error: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, an equation for assessing the accuracy of a prediction model.
There's also a mean absolute error, and the difference between the two is interesting, easier seen with a plot, but that should be left for the regression notes. I believe the use of squaring over absolute values is for the mathematical convenience of using the MSE in other equations; absolute values don't play as nicely with others.
Because the MSE squares the difference, it "puts a strong penalty on predictions that are far from the observed value $y_i$." This means the metric can become dominated by a few extreme values.
Model accuracy metrics like MSE can be used during model construction and during model testing and assessment.
It would be interesting to build a small algorithm to optimize the accuracy of a regression model; a rough sketch follows below.
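Here is a minimal sketch (my own, not from the course text) of that idea: it fits the two free parameters of a linear model by gradient descent on the MSE and then reports both MSE and MAE. The synthetic data, learning rate, and iteration count are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)

# Free parameters beta0 (intercept) and beta1 (slope), fitted by gradient descent on the MSE.
beta0, beta1 = 0.0, 0.0
learning_rate = 0.01
for _ in range(2000):
    error = beta0 + beta1 * x - y
    beta0 -= learning_rate * 2.0 * error.mean()        # d(MSE)/d(beta0)
    beta1 -= learning_rate * 2.0 * (error * x).mean()  # d(MSE)/d(beta1)

mse = np.mean((beta0 + beta1 * x - y) ** 2)
mae = np.mean(np.abs(beta0 + beta1 * x - y))
print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, MSE={mse:.3f}, MAE={mae:.3f}")
```

Swapping the squared error for the absolute error in the loop would give a fit that is less sensitive to outliers, which echoes the MSE/MAE comparison above.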
p. 129
Gini Coefficient (from Wolfram MathWorld)
The Gini coefficient is a summary statistic of the Lorenz curve and a measure of inequality in a population. It can range from 0 to 1, from all individuals being equal to a theoretical infinite population in which every individual, except one, has a size of zero.
It is a statistical measure of the degree of inequality of values in frequency distributions.
Investopedia seems to agree that it measures the distribution of something across a population. A high Gini index indicates greater inequality. That is why, if all values are zero except for a single 1, the Gini index is 1: that one value holds all of the wealth.
Wikipedia also describes it as a measure of statistical dispersion.
It was developed by the Italian statistician and sociologist Corrado Gini in 1921. A value of 0 means perfect equality, and a value of 1 means perfect inequality. You can also obtain values greater than one if you consider negative values, like people with debt.
Finding the Gini index is probably easiest to discuss when talking about the income of people in a country. Gather all the data you can. Present the data as the cumulative percentage of the population against the cumulative share of income earned. You'll get a resulting Lorenz curve.
A simple equation is something like $G = \frac{A}{A + B}$,
where $A$ is the area between the line of equality and the Lorenz curve, and $B$ is the area under the Lorenz curve.
We can say a few things about that equation: the total area under the line of equality is $\frac{1}{2}$, so $A + B = \frac{1}{2}$ and therefore $G = 2A = 1 - 2B$.
EXAMPLE
Consider representing the Lorenz curve by some function $L(x)$ on $[0, 1]$. Can we determine the Gini index?
The area under the Lorenz curve is $B$. That is simply $B = \int_0^1 L(x)\,dx$.
And now the line of perfect equality: $\int_0^1 x\,dx = \frac{1}{2}$.
This is the entire area. $A$ is the area between perfect equality and the Lorenz curve, $A = \frac{1}{2} - B$.
Which leads to the Gini index $G = \frac{A}{A + B} = 2A = 1 - 2B$.
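As a rough numerical cross-check (my own sketch, not from the text), the same result can be reached by integrating a Lorenz curve with the trapezoidal rule. The example curve $L(x) = x^2$ is an assumption chosen only because its Gini index, $\frac{1}{3}$, is easy to verify by hand.

```python
import numpy as np

def gini_from_lorenz(lorenz, n=10_001):
    """Approximate G = 1 - 2B, where B is the area under the Lorenz curve."""
    x = np.linspace(0.0, 1.0, n)
    y = lorenz(x)
    B = np.sum((y[:-1] + y[1:]) * np.diff(x) / 2.0)  # trapezoidal rule for the integral of L
    return 1.0 - 2.0 * B                             # A + B = 1/2, so G = A / (A + B) = 1 - 2B

# Hypothetical Lorenz curve for illustration: L(x) = x**2 should give G = 1 - 2/3 = 1/3.
print(gini_from_lorenz(lambda x: x**2))
```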
p. 134
The Gini Impurity is often confused with the Gini Index. However,
Definition - Gini Impurity: A measure of the homogeneity of a distribution of elements in a set. It is related to the probability of incorrectly classifying an object in a data set.
Suppose we have a data set and there are $m$ classification groups, or classes. Let $p_i$ be the probability of a random instance belonging to class $i$. We have the following cases for two subsequent experiments of assigning a class to an element: either both assignments agree (probability $p_i^2$ for class $i$), or they disagree (probability $p_i(1 - p_i)$ for class $i$).
Proposition - Gini Impurity: To find this, we find the probability of being wrong about any given classification and sum over all classifications, $G = \sum_{i=1}^{m} p_i (1 - p_i)$.
Now, we can use probability theory to rewrite the equation. Mainly, $\sum_{i=1}^{m} p_i = 1$.
That just means all of the probabilities of a system must sum to one. That is, for a system to be complete, something must happen. On the same note, $1 - p_i = \sum_{j \neq i} p_j$.
The above reflects that all of the probabilities must, again, sum to 1. This time, we are using that fact to solve for the particular one we want.
Now, for the rewrite, $G = \sum_{i=1}^{m} p_i (1 - p_i) = \sum_{i=1}^{m} p_i - \sum_{i=1}^{m} p_i^2 = 1 - \sum_{i=1}^{m} p_i^2.$
We trade the double summation $\sum_i \sum_{j \neq i} p_i p_j$ for one minus a sum of squares.
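A small sketch (my own) of the resulting formula $G = 1 - \sum_i p_i^2$, computed from raw class labels; the label sets are made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity G = 1 - sum(p_i**2), with p_i estimated from label frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 -> perfectly homogeneous set
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 -> maximally mixed for two classes
```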
p. 135
Entropy is a measure of the degree of randomness in a system.
It is said that entropy defines the arrow of time. The arrow of time is the concept that time always moves forward, not backward, and that reactions follow this direction. So, what you do now cannot affect the past.
Hold tight, we are diving into entropy on a fundamental level before introducing applications to information science.
The reader is encouraged to read "Physical Chemistry" (Atkins & de Paula, 2006, p. 573 ff). Entropy is introduced in thermodynamics. Think of a cup of coffee as it cools to room temperature. It doesn't spontaneously heat up or combust. Or does it?
In thermodynamics, entropy relies on the idea that change in a system is related to the energy dispersed in the process. This can usually be expressed by the amount of energy transferred as heat.
The internal energy is a measure of how much work a system can do. The energy changes by either transferring energy as heat or performing work: $dU = \delta q + \delta w$.
Now, we denote entropy as $S$, let $\delta q_{\text{rev}}$ be the incremental amount of heat exchanged reversibly, and $T$ the temperature of the system. We can propose the formula $dS = \frac{\delta q_{\text{rev}}}{T}$.
The larger $T$ is, the less of an impact a given amount of heat has. Isn't that neat?
That's cool, thinking about a ball hitting the ground and transferring energy into random movements to reduce its bounce height, but that doesn't help with our study of information science.
Let's look at statistical physics. This field looks more at the bigger picture and less at just a few atoms or molecules on a quantum level. A mole is a measure of count: anything containing $N_A \approx 6.022 \times 10^{23}$ particles. It's such a large number that we can more or less overlook the fact that they are individual particles, much like when we approximate really big things as infinite.
Just to state, highly improbable events do happen in that mole, and they happen all of the time because there are so many chances for them to happen. However, they are so few and far between that they get drowned out by the regular occurrences, and hence we see more consistent physical properties.
So, we neglect the individual contribution of each individual particle and assume things fly around as tiny "billiard balls", or something. They bounce off each other, exchanging energy and changing modes of motion. The system has $N$ balls, each in a state of energy $\epsilon_i$. The concept of energy is seen as discrete. The ground state is the lowest energy of a particle.
Because we have a mole of shite, on average $n_i$ molecules occupy energy state $\epsilon_i$. And the distribution of molecules across the possible states is governed by a single parameter, the temperature.
In general, the population of the system's energy states is described by the set of occupation numbers $\{n_0, n_1, n_2, \dots\}$. Let $W$ be the weight of a configuration, given by $W = \frac{N!}{n_0!\, n_1!\, n_2! \cdots}$
We now define the Boltzmann entropy $S = k_B \ln W$,
where $k_B$ is the Boltzmann constant and $W$ is the weight of the configuration. This equation behaves the same way as the thermodynamic definition. The only parameter that defines the system is the temperature. And if we take the limit $T \to 0$, only the ground state is accessible. Then only one configuration is possible, $W = 1$, and hence $\ln W = 0$ and $S = 0$.
Since only one state is accessible, the amount of randomness is minimal and increases as we increase temperature because more states become accessible.
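To make the counting concrete, here is a tiny sketch (my own, with toy occupation numbers chosen as assumptions) that evaluates $W$ and $S = k_B \ln W$ directly; the log-gamma function is used so the factorials do not overflow.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K

def boltzmann_entropy(occupations):
    """S = k_B * ln W with W = N! / (n_0! * n_1! * ...), via ln(n!) = lgamma(n + 1)."""
    n_total = sum(occupations)
    ln_w = math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in occupations)
    return K_B * ln_w

print(boltzmann_entropy([20, 0, 0]))  # only the ground state occupied: W = 1, so S = 0
print(boltzmann_entropy([10, 6, 4]))  # molecules spread over more states: S > 0
```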
The book covers a fun example using carbon monoxide (CO). I'm no chemist, but apparently as the molecule's temperature gets lower and lower, the accessible states are actually the two orientations CO and OC. The configurations are similar in terms of energy, and it can happen that after a flip the system doesn't have enough energy to flip again. This can result in a lattice looking like a frozen-in mixture of the two, something like CO OC CO CO OC OC.
That is apparently also a glimpse of how entropy can be used in relation to information science. You can imagine the sequence representing a bit stream; with CO as 0 and OC as 1, the lattice above might read like 010011.
You can rewrite the Boltzmann entropy as $S = k_B \ln \frac{N!}{n_0!\, n_1!\, n_2! \cdots}$
And, based on properties of logarithms, you can expand it to $S = k_B \left( \ln N! - \sum_i \ln n_i! \right)$
Proposition - Stirling's Approximation: Without proof, we have a factorial approximation for large $N$: $\ln N! \approx N \ln N - N$
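A quick numerical sanity check (my own sketch) that the approximation improves as $N$ grows:

```python
import math

# Compare ln(N!) with Stirling's approximation N*ln(N) - N for increasing N.
for n in (10, 100, 1_000, 10_000):
    exact = math.lgamma(n + 1)        # ln(N!) without computing the huge factorial
    stirling = n * math.log(n) - n
    print(f"N={n}: relative error = {(exact - stirling) / exact:.3%}")
```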
Let us apply this to our current situation. For the last bit, note that $N = \sum_i n_i$.
We can then pass this approximation into our calculation of the Boltzmann entropy: $S \approx k_B \left[ (N \ln N - N) - \sum_i (n_i \ln n_i - n_i) \right] = -k_B \sum_i n_i \ln \frac{n_i}{N} = -N k_B \sum_i p_i \ln p_i,$
where $p_i = n_i / N$ is the fraction of molecules in state $i$, or the probability that a molecule is in state $i$.
p. 139
Claude Shannon introduced the term entropy to describe the minimum encoding size necessary to send a message without data loss. It has two components, which I believe are source coding (compressing the message) and channel coding (transmitting it reliably).
We will focus now on the first part, as it relates to entropy.
In information theory, we are concerned with the amount of information that we can obtain from a system, and the information content of some event $x$ is defined as $I(x) = -\log_2 p(x),$
where $p(x)$ is the probability that event $x$ occurs.
$p(x) = 1$ is an extreme case where an event always occurs and no further information is added, since then $I(x) = 0$.
Also, information due to independent events is additive: $I(x, y) = I(x) + I(y)$.
Now we look at a larger system that is described by some discrete variable $X$ that can take values $x_i$ according to a probability distribution $p_i$. The Shannon entropy is defined as the average information content of an outcome, $H(X) = -\sum_i p_i \log_2 p_i.$
Of course, $H(X) = \mathbb{E}[I(X)]$, where $\mathbb{E}$ represents the expected value of a random variable.
Now, consider a probability density function (continuous) instead of our discrete probability distribution function. From probability theory, we rewrite the equation as $H(X) = -\int p(x) \log p(x)\, dx.$
EXAMPLE
Calculate the entropy of a coin toss. Since the coin is fair, we have $p(\text{heads}) = p(\text{tails}) = \frac{1}{2}$, so $H = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1$ bit.
So, we need one bit per coin toss to encode the resulting information, namely whether heads or tails comes up. Apparently the entropy is maximal because we cannot predict the outcome of the next coin toss from what we have observed so far.
EXAMPLE
Can we calculate the Shannon entropy of the string "00100010"?
Our system has 2 states, either zero or one. Of the 8 characters, we have 6 zeros and 2 ones. We can say $p_0 = \frac{6}{8} = 0.75$ and $p_1 = \frac{2}{8} = 0.25$.
Then, we plug and chug: $H = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) \approx 0.811$ bits.
Because the zero occurs more frequently than the one, the entropy drops below 1 bit.
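To double-check both examples, here is a short sketch (my own) that estimates the Shannon entropy from observed symbol frequencies:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """H = -sum(p_i * log2(p_i)), with p_i estimated from symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

print(shannon_entropy("01"))        # fair-coin analogue: 1.0 bit
print(shannon_entropy("00100010"))  # about 0.811 bits, matching the example above
```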
p. 141
Entropy can be used to compare probability distributions. As we dive into probability distributions, there's a discrete and a continuous version of our topics now. The relative entropy, or the Kullback-Leibler (KL) divergence, between $p$ and $q$ is
Discrete: $D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$
Continuous: $D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$
It satisfies Gibbs' inequality, $D_{KL}(p \| q) \geq 0$, with equality if and only if $p = q$.
It is a measure of how different two distributions are over the same random variable.
Note that $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ in general.
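A small sketch (my own, with made-up distributions) illustrating the discrete KL divergence, Gibbs' inequality, and the asymmetry:

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(p || q) = sum p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # hypothetical distributions for illustration
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # >= 0 by Gibbs' inequality
print(kl_divergence(q, p))  # generally not equal to D_KL(p || q)
```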
p. 142
There are times when we incorrectly infer a probability distribution from a data set. Cross Entropy helps discuss this possibility more formally.
Suppose we have two probability distributions, $p$ and $q$, on the same set of variables. Let $p$ be the true distribution and $q$ be the distribution that we optimized. We define the cross entropy as the entropy of the random variable plus the Kullback-Leibler divergence between the true probability distribution and the one we used to estimate it, that is, $H(p, q) = H(p) + D_{KL}(p \| q)$.
We rewrite the KL divergence with properties of logarithms, $D_{KL}(p \| q) = \sum_i p_i \log p_i - \sum_i p_i \log q_i = -H(p) - \sum_i p_i \log q_i.$
Note the following: $H(p, q) = H(p) + D_{KL}(p \| q) = -\sum_i p_i \log q_i.$
This is called the cross entropy of $p$ and $q$. It represents the average number of bits required for us to identify an event, given that we have built our coding scheme using distribution $q$ when the true distribution is $p$. Note that $H(p, q) \neq H(q, p)$ in general. This is because of the asymmetry of the KL divergence.
Another basic attribute of cross entropy is that it is bounded below by the entropy of the true distribution. The smallest cross entropy is obtained when the true distribution is the one we use in our coding scheme. That is, $H(p, q) \geq H(p)$, with equality when $q = p$, in which case $H(p, p) = H(p)$.
And that goes back to Shannon Entropy.
Machine learning algorithms are not explicitly programmed, but use data to learn specific relationships. Cross entropy is often used during optimization of the model, more specifically for classification tasks. It determines how well the model $q$ describes the true distribution $p$.
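To make that concrete, here is a short sketch (my own, with made-up distributions) showing that the cross entropy is bounded below by $H(p)$ and grows as the model $q$ drifts away from the true $p$, which is why it works as a classification loss.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log2(q_i); p is the true distribution, q the model."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]         # hypothetical true distribution
q_good = [0.6, 0.25, 0.15]  # a model close to p
q_bad = [0.2, 0.3, 0.5]     # a model far from p

print(entropy(p))                # the lower bound H(p)
print(cross_entropy(p, q_good))  # slightly above H(p)
print(cross_entropy(p, q_bad))   # much larger: this model describes p poorly
print(cross_entropy(p, p))       # equals H(p): the minimum
```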
Question 1
What is the mean squared error?
It assesses the accuracy of a model by averaging the squared difference between the modelβs predictions and the actual observations.
In a way, you can consider it a symmetric measurement because overestimates and underestimates of the same size are penalized equally; the model should balance predictions around the observed values.
I do not believe a model should or would make large overestimates to compensate for large underestimates to reduce the MSE.
Note that the MSE can be dominated by a few large errors, especially because the errors are squared.
A more realistic, but less mathematically friendly, measurement could be the mean absolute error.
Question 2
If we are looking at a country with a high level of income inequality, what can we say about the Gini coefficient?
It would/should be close to 1.
It would not be greater than 1 unless, I believe, we are taking debt (negative values) into consideration. And the Gini coefficient would not be exactly equal to 1, because that is more of a theoretical case of perfect inequality.
The Lorenz curve is probably quite far from the line of equality.
Question 3
What is the equation for Shannon Entropy?
$H(X) = -\sum_i p_i \log_2 p_i$. Looks about right.
Question 4
What probability of a coin toss results in the highest possible Shannon entropy?
If we toss a fair coin, there is a 50-50 chance of either heads or tails ($p = 0.5$). This creates the most unknown and random results, generating the highest entropy.
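A quick scan (my own sketch) over biased coins backs this up; the entropy $H(p) = -(p \log_2 p + (1-p)\log_2(1-p))$ peaks at $p = 0.5$.

```python
import math

def coin_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p}: H={coin_entropy(p):.3f} bits")  # maximum of 1 bit at p = 0.5
```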
Question 5
Tell me about Cross Entropy.
Cross entropy can be used to measure how well a model describes the true probability distribution on an underlying set of variables.
It can be used when building models.
In general, $H(p, q) \geq H(p)$, and the cross entropy is minimal when $q = p$.