title: Advanced Mathematics
subtitle: DLMDSAM01
Author: Dr. Robert Graf
publisher: IU International University of Applied Sciences
year: 2023

I would imagine this to be a dense section where the notes would (eventually) be split into different categories. Until then, they are here 😊

Unit 6 - Information Theory

p. 127

Quick overview of our learning goals:

Introduction

What is information theory? There's always a Wikipedia article, which states it is "the mathematical study of the quantification, storage, and communication of information." It sits in the realm of applied mathematics and combines probability theory, statistics, computer science, and other fields. It is a vast field with roots in many other applications, such as statistical inference, cryptography, neurobiology, quantum computing, black holes, etc.

A major concept of information theory is entropy. It was introduced by Claude Shannon.

6.1 - Mean Squared Error (MSE)

To predict a quantity from observed values, we can turn to regression. Independent variables are often called features and the dependent variable is called a label.

$$y = f(X, a_i)$$

where $a_i$ can be one or more parameters.

So, $X$ is a class, which is a collection of variables in a given dataset. A corresponding $\vec{x}$ is a specific set of values. This vector is more of a notation; its domain may not necessarily fulfil all requirements of an actual vector space, as we've studied previously.

For observations, the mapping $f(\ldots)$ results in a prediction, $\hat{y}$. Linear regression is the simplest nontrivial example of building a prediction model. It takes a form like $y = mx + b$.

Definition - Free parameters: Free Parameters are those parameters in a model that need to be determined using the data.

We have free parameters $a_1 = m$ and $a_2 = b$. To determine their values, we need observations of both $x$ and $y$.

Without diving into regression analysis, we’ll just look at ways to assess the accuracy of the model. The simplest metric is the mean squared error.

Definition - Mean Squared Error: the mean of the squared differences between observed values and the model's predictions; used to assess the accuracy of a prediction model.

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2$$

There's also the mean absolute error; the difference between the two is interesting and easier to see with a plot, but that belongs in the regression notes. I believe squaring is used over absolute values for mathematical convenience: squares are differentiable everywhere and play more nicely in other equations.
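A minimal sketch comparing the two, using made-up numbers (the arrays and the outlier are invented for illustration); the squared metric blows up on the single outlier while the absolute one barely moves:

```python
import numpy as np

# Hypothetical observations and predictions (made-up numbers for illustration).
y_true = np.array([3.0, 5.0, 7.0, 9.0, 30.0])   # last value is an outlier
y_pred = np.array([2.8, 5.1, 6.9, 9.2, 10.0])

mse = np.mean((y_true - y_pred) ** 2)   # squares each residual, so the outlier dominates
mae = np.mean(np.abs(y_true - y_pred))  # absolute residuals treat all errors linearly

print(f"MSE = {mse:.3f}")  # dominated by the single large error
print(f"MAE = {mae:.3f}")  # much smaller, less sensitive to the outlier
```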

Because the MSE squares the difference, it "puts a strong penalty on predictions $\hat{y}$ that are far from the observed value $y$." This means the metric can become dominated by a few extreme values.

Model accuracy metrics like MSE can be used during model construction and during model testing and assessment.

It would be interesting to build a small algorithm to optimize accuracy of a regression model.
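As a rough sketch of what that could look like (the synthetic data, learning rate, and iteration count are arbitrary assumptions, not from the course text): gradient descent on the free parameters $m$ and $b$ of $y = mx + b$, minimizing the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed "true" line y = 2x + 1 plus noise (for illustration only).
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)

m, b = 0.0, 0.0          # free parameters a1 = m, a2 = b
lr = 0.01                # learning rate (arbitrary choice)

for _ in range(2000):
    y_hat = m * x + b
    err = y_hat - y
    # Gradients of MSE = mean(err^2) with respect to m and b.
    grad_m = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    m -= lr * grad_m
    b -= lr * grad_b

print(f"fitted m ≈ {m:.2f}, b ≈ {b:.2f}, MSE = {np.mean((m * x + b - y) ** 2):.3f}")
```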

6.2 - Gini Index

p. 129

Gini Coefficient β€” from Wolfram MathWorld

The Gini coefficient $G$ is a summary statistic of the Lorenz curve and a measure of inequality in a population. It can range from 0 to 1, from all individuals being equal to a theoretical infinite population in which every individual, except one, has a size of zero.

It is a statistical measure of the degree of inequality of values in frequency distributions.

Investopedia seems to agree that it measures the distribution of something across a population. A high Gini index indicates greater inequality. That is why, if all values are zero and one is 1, the Gini index is 1: that one value holds all of the wealth.

Wikipedia also describes it as a measure of statistical dispersion.

It was developed by the Italian statistician and sociologist Corrado Gini in 1921. A value of 0 means perfect equality, and a value of 1 means perfect inequality. You can also obtain values greater than one if you consider negative values, like people with debt.

Finding the Gini index is probably easiest to discuss when talking about the income of people in a country. Gather all the data you can, then plot the cumulative percentage of the population against the cumulative share of income earned. The resulting curve is the Lorenz curve.

A simple equation is something like

$$G = \frac{A}{A+B}$$

where $A$ is the area between the line of equality and the Lorenz curve, and $B$ is the area under the Lorenz curve.

We can say a few things about that equation by working through an example.

EXAMPLE

Consider if we represent the Lorenz curve as $y = x^5$. Can we determine the Gini index?

The area under the Lorenz curve is $B$. That is simply

$$\begin{align*} B &= \int_0^1 x^5\, dx \\ &= \left. \frac{1}{6}x^6 \right|_0^1\\ &= \frac{1}{6}(1) - 0 = \frac{1}{6} \end{align*}$$

And now the total area under the line of perfect equality:

$$\begin{align*} A+B &= \int_0^1 x\, dx \\ &= \left. \frac{1}{2}x^2 \right|_0^1\\ &= \frac{1}{2}(1) - 0 = \frac{1}{2} \end{align*}$$

This is the entire area. $A$ is the area between the line of perfect equality and the Lorenz curve:

$$\begin{align*} A + \frac{1}{6} &= \frac{1}{2}\\ A &= \frac{1}{2} - \frac{1}{6}\\ A &= \frac{3}{6} - \frac{1}{6}\\ A &= \frac{2}{6} = \frac{1}{3} \end{align*}$$

Which leads to the Gini Index

$$\begin{align*} G &= \frac{ \frac{1}{3}}{\frac{1}{2}}\\ &= \frac{1}{3}\cdot\frac{2}{1}\\ &= \frac{2}{3} \end{align*}$$
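A quick numerical sanity check of this example (plain numpy with an explicit trapezoid rule; the grid size is an arbitrary choice):

```python
import numpy as np

# Lorenz curve from the example: L(x) = x**5 on [0, 1].
x = np.linspace(0.0, 1.0, 100_001)
lorenz = x ** 5

def trapezoid(y, x):
    """Simple trapezoid-rule integral, written out explicitly."""
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

B = trapezoid(lorenz, x)       # area under the Lorenz curve ≈ 1/6
half = trapezoid(x, x)         # area under the line of equality = 1/2
A = half - B                   # area between the two curves ≈ 1/3

G = A / (A + B)
print(f"A ≈ {A:.4f}, B ≈ {B:.4f}, G ≈ {G:.4f}")   # expect G ≈ 2/3
```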

Gini Impurity

p. 134

The Gini Impurity is often confused with the Gini Index. However,

Definition - Gini Impurity: A measure of the homogeneity of a distribution of elements in a set. It is related to the probability of incorrectly classifying an object in a data set.

Suppose we have a data set and there are $N$ classification groups, or classes. Let $p_i$ be the probability of a random instance belonging to class $i$. Now consider two subsequent experiments of assigning a class to an element: pick an element at random (it belongs to class $i$ with probability $p_i$), then assign it a label at random according to the same distribution (label $j$ with probability $p_j$). A misclassification is any case where $j \ne i$.

Proposition - Gini Impurity: To find this, we find the probability of being wrong about any given classification, and sum over all classifications

$$G = \sum_{i=1}^N \sum_{j \ne i} p_i p_j$$

Now, we can use probability theory to rewrite the equations. Mainly,

$$1 = \sum_{i=1}^N p_i$$

That just means all of the probabilities of a system must sum to one. That is, for a system to be complete, something must happen. On the same note,

$$\begin{align*} p_i &= 1-\sum_{j \ne i} p_j\\ \sum_{j \ne i} p_j &= 1-p_i \end{align*}$$

The above again reflects that all of the probabilities must sum to 1; this time we use that fact to solve for the particular term we want.

Now, for the rewrite,

$$\begin{align*} G &= \sum_{i=1}^N p_i \sum_{j \ne i}p_j\\ &= \sum_{i=1}^N p_i (1-p_i)\\ &= \sum_{i=1}^N \left( p_i-p_i^2 \right)\\ &= \sum_{i=1}^N p_i- \sum_{i=1}^N p_i^2\\ &= 1 - \sum_{i=1}^N p_i^2 \end{align*}$$

Trade a double summation for sum of squares.
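A small sketch computing the Gini impurity both ways, via the double-sum definition and via $1 - \sum_i p_i^2$ (the class probabilities are made up for illustration):

```python
import numpy as np

def gini_impurity_double_sum(p):
    """Definition: sum of p_i * p_j over all pairs with j != i."""
    p = np.asarray(p, dtype=float)
    return sum(p[i] * p[j] for i in range(len(p)) for j in range(len(p)) if j != i)

def gini_impurity(p):
    """Equivalent closed form: 1 - sum of squared class probabilities."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Hypothetical class probabilities (must sum to 1); values chosen for illustration.
probs = [0.5, 0.3, 0.2]
print(gini_impurity_double_sum(probs))  # 0.62
print(gini_impurity(probs))             # 0.62, same result
```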

6.3 - Entropy, Shannon Entropy, Kullback-Leibler Divergence

p. 135

Entropy

Entropy is a measure of the degree of randomness in a system.

It is said that entropy defines the arrow of time. The arrow of time is a concept that time always moves forward, not backward, and that reactions follow this direction. So, what you do now cannot affect the past.

Hold tight, we are diving into entropy on a fundamental level before introducing applications to information science.

The reader is encouraged to read "Physical Chemistry" (Atkins & de Paula, 2006, p. 573 ff). Entropy is introduced in thermodynamics. Think of a cup of coffee as it cools to room temperature. It doesn't spontaneously heat up or combust. Or does it?

In thermodynamics, entropy relies on the idea that change in a system is related to the energy dispersed in the process. This can usually be expressed by the amount of energy transferred as heat.

The internal energy is a measure of how much work a system can do. It changes by either transferring energy as heat or performing work:

$$dU=\delta Q + \delta W$$

Now, denote entropy as $S$, let $\delta Q$ be the incremental amount of heat exchanged, and $T$ the temperature of the system. We can propose the formula

$$dS = \frac{\delta Q}{T}$$

The larger $T$ is, the less of an impact a given amount of heat has. Isn't that neat.
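As an illustrative calculation with made-up numbers, transferring the same $\delta Q = 100\,\mathrm{J}$ of heat at two different temperatures:

$$dS = \frac{100\,\mathrm{J}}{300\,\mathrm{K}} \approx 0.33\,\mathrm{J/K}, \qquad dS = \frac{100\,\mathrm{J}}{600\,\mathrm{K}} \approx 0.17\,\mathrm{J/K}$$

The hotter system gains less entropy from the same amount of heat.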

That’s cool, thinking about a ball hitting the ground and transferring energy into random movements to reduce bounce height, but that doesn’t help with our study of information science.

Let's look at statistical physics. This field looks more at the bigger picture and less at just a few atoms or molecules on a quantum level. A mole is a measure of count: anything containing $6.02214076\times 10^{23}$ particles. It's such a large number that we can mostly overlook the fact that they are individual particles, much like when we approximate really big things as infinite.

Just to state: highly improbable events do happen in that mole, and they happen all the time because there are so many chances for them to happen, over $10^{23}$. However, they are so few and far between that they get drowned out by the regular occurrences, and hence we see consistent physical properties.

So, we neglect the contribution of each individual particle and assume things fly around as tiny "billiard balls", or something. They bounce off each other, exchanging energy and changing modes of motion. The system has $N$ balls, each in a state of energy $\epsilon_i$. Energy is treated as discrete; the ground state is the lowest energy of a particle.

Because we have a mole of shite, on average $n_i$ molecules occupy energy state $\epsilon_i$. And the distribution of molecules across the possible states is governed by the single parameter, temperature.

In general, the population of the system's energy states is described by $\{ n_0, n_1, n_2, \ldots \}$. Let $W$ be the weight of the configuration, given by

$$W=\frac{N!}{n_0!\,n_1!\ldots}$$

We now define the Boltzmann entropy

$$S=k_B \ln(W)$$

where $k_B$ is the Boltzmann constant ($k_B=1.38\times 10^{-23}\,\mathrm{m^2\,kg\,s^{-2}\,K^{-1}}$) and $W$ is the weight of the configuration. This equation acts the same way as the thermodynamic definition. The only parameter that defines the system is the temperature $T$. If we take the limit $T \to 0$, only the ground state is accessible. Then only one configuration is possible, $W=1$, and hence $S=0$ because $\ln(1) =0$.

Since only one state is accessible, the amount of randomness is minimal and increases as we increase temperature because more states become accessible.

The book covers a fun example using carbon monoxide (CO). I'm no chemist, but apparently as the molecule's temperature gets lower and lower, the accessible states are actually $OC$ and $CO$. The configurations are similar in terms of energy, and it can happen that after a flip the system doesn't have enough energy to flip again. This can result in a lattice looking like $COOCCO\cdots$.

That is apparently also a glimpse of how entropy can be used in relation to information science. You can imagine the sequence representing a bit stream; it might read like 010010.

You can rewrite Boltzmann entropy as

$$\ln(W) = \ln\left(\frac{N!}{n_0!\,n_1!\ldots}\right)$$

And, based on properties of logarithms, you can expand it to

$$\begin{align*} &= \ln(N!) - (\ln(n_0!)+\ln(n_1!)+\ldots) \\ &= \ln(N!) - \sum_i \ln(n_i!) \end{align*}$$

Proposition - Stirling’s Approximation: Without proof, we have a factorial approximation that

$$\ln(x!) \approx x \ln(x) - x$$
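A quick numerical check of how good this approximation is (the sample values of $x$ are arbitrary); the relative error shrinks as $x$ grows, which is why it works so well for something like $10^{23}$ particles:

```python
import math

# Compare ln(x!) against Stirling's approximation x ln(x) - x for a few sizes.
for x in (10, 100, 1_000, 10_000):
    exact = math.lgamma(x + 1)          # ln(x!) without overflowing
    approx = x * math.log(x) - x
    rel_err = (exact - approx) / exact
    print(f"x = {x:>6}: ln(x!) = {exact:12.2f}, Stirling = {approx:12.2f}, rel. error = {rel_err:.2%}")
```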

Let us apply this to our current situation. For the last step, note that $N=\sum_i n_i$.

$$\begin{align*} \ln(W) &= \ln\left(\frac{N!}{n_0!\,n_1!\ldots}\right) \\ &= \ln(N!) - (\ln(n_0!)+\ln(n_1!)+\ldots) \\ &= \ln(N!) - \sum_i \ln(n_i!) \\ &\approx (N\ln(N)-N) - \sum_i \left( n_i\ln(n_i)-n_i \right)\\ &= N\ln(N) - \sum_i n_i\ln(n_i) +\sum_i n_i - N\\ &= N\ln(N) - \sum_i n_i\ln(n_i) \end{align*}$$

We can then pass this approximation into our calculation of Boltzmann entropy.

$$\begin{align*} S &= k_B \ln(W)\\ &\approx k_B \left( N\ln(N) - \sum_i n_i\ln(n_i) \right)\\ &= k_B \left( \ln(N)\sum_i n_i - \sum_i n_i\ln(n_i) \right)\\ &= k_B \sum_i \left( n_i\ln(N) - n_i\ln(n_i) \right)\\ &= k_B \sum_i n_i\left(\ln(N) - \ln(n_i) \right)\\ &= k_B \sum_i n_i\ln\!\left(\frac{N}{n_i}\right)\\ &= -k_B \sum_i n_i\ln\!\left(\frac{n_i}{N}\right)\\ &= -N k_B \sum_i \frac{n_i}{N}\ln\!\left(\frac{n_i}{N}\right)\\ &= -N k_B \sum_i p_i\ln(p_i) \end{align*}$$

where $p_i=n_i / N$ is the fraction of molecules in state $i$, or the probability that a molecule is in state $i$.
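A small sketch comparing the exact $S = k_B\ln(W)$ (computed with log-gamma to avoid enormous factorials) against the Stirling-based form $-Nk_B\sum_i p_i \ln(p_i)$; the occupation numbers are invented for illustration:

```python
import math

k_B = 1.380649e-23  # Boltzmann constant in J/K

# Hypothetical occupation numbers of four energy states (made up for illustration).
n = [6_000, 3_000, 800, 200]
N = sum(n)

# Exact Boltzmann entropy S = k_B ln(W), with ln(W) = ln(N!) - sum ln(n_i!).
ln_W = math.lgamma(N + 1) - sum(math.lgamma(n_i + 1) for n_i in n)
S_exact = k_B * ln_W

# Stirling-based form S ≈ -N k_B sum p_i ln p_i with p_i = n_i / N.
p = [n_i / N for n_i in n]
S_approx = -N * k_B * sum(p_i * math.log(p_i) for p_i in p)

print(f"S from k_B ln(W):        {S_exact:.4e} J/K")
print(f"S from -N k_B sum p ln p: {S_approx:.4e} J/K")   # very close for large N
```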

Shannon Entropy

p. 139

Claude Shannon introduced the term entropy to describe the minimum encoding size necessary to send a message without data loss. The concept has two components; we will focus now on the first, as it relates to entropy.

In information theory, we are concerned with the amount of information that we can obtain from a system. The information content of some event $A$ is defined as

$$I(A) = - \log_2(p(A)) = \log_2\left(\frac{1}{p(A)}\right)$$

where $p(A)$ is the probability that event $A$ occurs.

$p(A)=1$ is an extreme case where an event always occurs, so observing it adds no further information: $I(A) = -\log_2(1) = 0$.

Also, information due to independent events is additive: $I(A_1\cap A_2)=I(A_1) + I(A_2)$.
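For example, consider two independent fair coin flips, each coming up heads:

$$I(A_1) = I(A_2) = -\log_2\left(\frac{1}{2}\right) = 1 \text{ bit}, \qquad I(A_1 \cap A_2) = -\log_2\left(\frac{1}{4}\right) = 2 \text{ bits} = I(A_1) + I(A_2)$$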

Now we look at a larger system that is described by some discrete variable $X$ that can take values $\{ x_1, x_2, \ldots, x_n \}$ according to a probability distribution $p(X)$. The Shannon entropy is defined as the average information content of an outcome:

$$\begin{align*} H &= E[I(X)]=E[-\log_2(p(X))]\\ &= - \sum_i p(x_i)\log_2(p(x_i)) \end{align*}$$

Of course, $E[X]$ represents the expected value of a random variable.

Now, consider a probability density function (continuous) instead of our discrete probability distribution function. From probability theory, we rewrite the equation as

$$\begin{align*} H(X) &= E[I(X)]=E[-\log_2(p(X))]\\ &= - \int p(x)\log_2(p(x))\, dx \end{align*}$$

EXAMPLE

Calculate the entropy of a coin toss. Since the coin is fair, we have

$$\begin{align*} H &= E[I(X)]=E[-\log_2(p(X))]\\ &= - \sum_i p(x_i)\log_2(p(x_i))\\ &= - \sum_{i=1}^2 \frac{1}{2}\log_2\left(\frac{1}{2}\right)\\ &= - \left( \frac{1}{2}\log_2\left(\frac{1}{2}\right) + \frac{1}{2}\log_2\left(\frac{1}{2}\right)\right)\\ &= - \left( -\frac{1}{2} - \frac{1}{2}\right)\\ &= 1 \end{align*}$$

So, we need one bit per coin toss to encode whether heads or tails came up. The entropy is maximal because we cannot predict the outcome of the next coin toss from what we have observed so far.

EXAMPLE

Can we calculate the Shannon entropy of the string "00100010"?

Our system has two states, either zero or one. Of the 8 characters, we have six zeros and two ones. We can say

$$p(x_i) = \begin{cases} \frac{6}{8}=\frac{3}{4}, & x_i = 0\\ \frac{2}{8}=\frac{1}{4}, & x_i=1 \end{cases}$$

Then, we plug and chug

$$\begin{align*} H &= E[I(X)]=E[-\log_2(p(X))]\\ &= - \sum_i p(x_i)\log_2(p(x_i))\\ &= - \sum_{i=0}^1 p(x_i)\log_2(p(x_i))\\ &= -\left( p(x_0)\log_2(p(x_0))+p(x_1)\log_2(p(x_1)) \right)\\ &= - \left( \frac{3}{4}\log_2\left(\frac{3}{4}\right)+\frac{1}{4}\log_2\left(\frac{1}{4}\right) \right)\\ &= - \left( \frac{3}{4}(-0.415)+\frac{1}{4}(-2) \right)\\ &= 0.811 \end{align*}$$

Because the zero occurs more frequently than the one, the entropy drops below 1 bit.
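A small sketch that reproduces both results, using empirical symbol frequencies as the probabilities (the helper function name is my own):

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """H = -sum p_i log2 p_i, in bits (terms with p = 0 contribute nothing)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Fair coin toss: two outcomes with probability 1/2 each.
print(shannon_entropy([0.5, 0.5]))          # 1.0 bit

# Empirical symbol probabilities from the string "00100010".
s = "00100010"
counts = Counter(s)
probs = [c / len(s) for c in counts.values()]
print(round(shannon_entropy(probs), 3))     # 0.811 bits
```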

The Kullback-Leibler Divergence

p. 141

Entropy can be used to compare probability distributions. As we dive into probability distributions, our topics now have both a discrete and a continuous form. The relative entropy, or Kullback-Leibler (KL) divergence, between $p(x)$ and $q(x)$ is:

Discrete:

$$D_{KL}(p \| q)=\sum_i p(x_i) \log_2\left( \frac{p(x_i)}{q(x_i)} \right)$$

Continuous:

$$D_{KL}(p \| q)=\int p(x) \log_2\left( \frac{p(x)}{q(x)} \right) dx$$

It satisfies Gibbs' inequality

$$D_{KL}(p \| q) \ge 0$$

It is a measure of how different two distributions are over the same random variable.

Note that $D_{KL}(p \| q) =0 \iff p(x) = q(x)$.
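A minimal sketch of the discrete KL divergence (the function name and the example distributions, the bit string's $[3/4, 1/4]$ against a fair coin, are my own choices):

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.75, 0.25]          # "true" distribution (the bit string above)
q = [0.5, 0.5]            # model distribution (a fair coin)

print(kl_divergence(p, p))    # 0.0: identical distributions
print(kl_divergence(p, q))    # > 0 (Gibbs' inequality), ≈ 0.189 bits
print(kl_divergence(q, p))    # ≈ 0.208 bits: the divergence is not symmetric
```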

6.4 - Cross Entropy

p. 142

There are times when we incorrectly infer a probability distribution from a data set. Cross Entropy helps discuss this possibility more formally.

Suppose we have two probability distributions, $p$ and $q$, on the same set of variables. Let $p$ be the true distribution, and $q$ the distribution we estimated (the one our model optimized). The cross entropy is the entropy of the true distribution plus the Kullback-Leibler divergence between the true distribution and the one we used to estimate it, that is, between $p$ and $q$.

We rewrite the KL divergence with properties of logarithms,

$$\begin{align*} D_{KL}(p\ \| \ q) &= \sum_{i=1}^N p(x_i)\log_2 \left( \frac{p(x_i)}{q(x_i)} \right)\\ &= \sum_{i=1}^N p(x_i)\log_2(p(x_i)) -\sum_{i=1}^N p(x_i)\log_2(q(x_i))\\ &= -H(p) + H(p,q) \end{align*}$$

Note the following

$$H(p,q) = - \sum_{i=1}^N p(x_i)\log_2(q(x_i))$$

This is called the cross entropy of $p$ and $q$. It represents the average number of bits required for us to identify an event, given that we have coded our scheme using distribution $q$ when the true distribution is $p$. Note that $H(p,q) \ne H(q,p)$; this is because of the asymmetry of the KL divergence.

Another basic attribute of cross entropy is that it is bounded below by the entropy of the true distribution. The smallest cross entropy is obtained when the true distribution is the one we use in our coding scheme. That is…

$$H(p,p) = H(p) = -\sum_{i=1}^N p(x_i)\log_2(p(x_i))$$

And that goes back to Shannon Entropy.

Machine learning algorithms are not explicitly programmed, but use data to learn specific relationships. Cross entropy is often used during optimization of the model, more specifically for classification tasks. It determines how well the model $q$ describes the true $p$.
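A small sketch tying the pieces together, checking that $H(p,q) = H(p) + D_{KL}(p \| q)$, that $H(p,p) = H(p)$, and that cross entropy is not symmetric (the distributions are made up for illustration):

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """H(p, q): average bits needed if we code with q while the truth is p."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.75, 0.25]   # true distribution
q = [0.6, 0.4]     # hypothetical model distribution

print(cross_entropy(p, q))               # ≈ 0.883 bits
print(entropy(p) + kl_divergence(p, q))  # same number: H(p,q) = H(p) + D_KL(p||q)
print(cross_entropy(p, p), entropy(p))   # lower bound: H(p,p) = H(p) ≈ 0.811
print(cross_entropy(q, p))               # ≈ 1.049: cross entropy is not symmetric
```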

$\Box$


Knowledge Check

Question 1

What is the mean squared error?

It assesses the accuracy of a model by averaging the squared difference between the model’s predictions and the actual observations.

In a way, you can consider it a symmetric measurement: an overestimate and an underestimate of the same size contribute the same penalty, so the model should balance predictions around the observed values.

I do not believe a model should or would make large overestimates to compensate for large underestimates to reduce the MSE.

Note that the MSE can be dominated by a few large errors, especially because the errors are squared.

A more realistic, but less mathematically friendly, measurement could be the mean absolute error.

Question 2

If we are looking at a country with a high level of income inequality, what can we say about the Gini coefficient?

It would/should be close to 1.

It would not be greater than 1 unless, I believe, we are taking debt into consideration. And the Gini coefficient would not be equal to 1, because that is more of a theoretical case of perfect inequality.

The Lorenz curve is probably quite far from the line of equality.

Question 3

What is the equation for Shannon Entropy?

$$H = -\sum_i p_i \log_2(p_i)$$

This is the average information content of an outcome, measured in bits. The closely related statistical-physics (Boltzmann) form from earlier is $S = -Nk_B \sum_i p_i \ln(p_i)$.

Question 4

What probability of a coin toss results in the highest possible Shannon entropy?

If we toss a fair coin, there is a 50-50 chance of either heads or tails. This creates the most unknown and random results, generating the highest entropy.

Question 5

Tell me about Cross Entropy.

Cross entropy can be used to measure how well a model describes the true probability distribution on an underlying set of variables.

It can be used when building models.

In general $H(p,q) \ne H(q,p)$, and the cross entropy is minimal when $p=q$.