title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022

Not sure if this belongs here or in the Maths section, but here is good for now.

Unit 5: Selected Mathematical Techniques

You will learn about:

Introduction

We will discuss mathematical techniques and models used to transform data into insightful information. There are two modeling approaches for prediction:

Basically, the flow of this section follows the same order as the learning objectives. It's noted that we treat time-series data separately because it is both common and requires additional considerations.

Feature Extraction

Feature extraction is achieved by reducing the dimensions to the ones that significantly contribute to the features of interest. It assumes that data approximately lie on a lower dimensional space. This is achieved by methods such as convolution, radial transforms, integral transforms, or principal components analysis.

We want to isolate important characteristics. Linear regression is simple feature extraction. A reduction from $R^3$ to $R^1$, that is, a reduction of 3-dimensional data to 1 dimension, will be included in the last session.

5.1 - Principal Component Analysis

Input data usually include correlated variables, either redundant or irrelevant. If the correlation is high but not 100%, then there is still some amount of independent information within each variable. However, the juice isn't worth the squeeze: such variables place more burden on the prediction models than they are worth.

So, we come up with a correlation threshold, and if a variable exceeds that threshold with another variable, we can treat it as redundant information and safely remove it from the dataset.

Definition - Principal Component Analysis (PCA): A statistical analysis method applied to transform potentially correlated variables into uncorrelated variables called principal components (PCs).

This is a method that is applied to transform linearly-correlated variables into uncorrelated variables called principal components (PCs). PCA also sorts the produced uncorrelated variables according to their variance along the data records. Variables at the bottom of the list are said to have low changeability, and can be excluded. This results in a desired reduction in dimensionality of the dataset.

The goal is to construct a new set of variables such that most information is contained within the first few variables. Then, in the machine learning model and regressions phase, we might only use a subset of the new variables.

Are there steps? Actually, the book shows the PCA Algorithm, very cool.

The PCA Algorithm

Step 1: Get and Subtract the Mean

The book describes this a little weirdly, so I will give my interpretation. If we have a table of data, each record is a row, and the columns are our features, or variables as the book says. But if a record is a person, then one column could be "age", another could be "height", and another "gross income", etc… We can have $M$ different variables, each represented as…

$$\left\{ x_i \;\middle|\; i \in \mathbb{N},\, 1 \le i \le M \right\}$$

We also have $N$ records. The average (mean) is:

$$\bar{x}_i = \frac{1}{N} \sum_{k=1}^{N} x_{ik}$$

Where $i = 1,\ 2,\ \dots,\ M$. Then, you simply subtract the mean from each value, resulting in a dataset whose mean is centred around 0, simplifying the remaining steps of the PCA algorithm.

$$x_i = x_i - \bar{x}_i$$

Step 2: Calculate the Covariance Matrix

Covariance $C(x_i, x_j)$ is the measure of changes in variable $x_i$ with respect to the changes in variable $x_j$ according to:

$$C(x_i,\ x_j) = \frac{1}{N-1} \sum_{k=1}^{N} (x_i \cdot x_j)_k$$

Always fun to note that the covariance of a variable with itself is the variance. Since we are solving for the covariance of all variables with all other variables, you get a symmetric matrix with dimensions $\left[ M \times M \right]$.

$$C = \begin{bmatrix} C(x_1, x_1) & C(x_1, x_2) & \dots & C(x_1, x_M)\\ C(x_2, x_1) & C(x_2, x_2) & \dots & C(x_2, x_M)\\ \vdots & \vdots & \ddots & \vdots \\ C(x_M, x_1) & C(x_M, x_2) & \dots & C(x_M, x_M) \end{bmatrix}$$

Remember that covariance can be either positive or negative, and a covariance close to 0 indicates variables are uncorrelated.

For 2 dimensions, it might look like:

$$\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY}\\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix}$$

Step 3: Calculate the EigenValues and EigenVectors

If you have read through the advanced Maths notes, you would have come across these.

The EigenVector shows the direction of the largest variance of the sample covariance matrix, the direction we want to map onto. The largest EigenValue is the spread along that direction of largest variance. Yes, EigenVectors are typical features of typical data. They have meaning, in this case related to the variance. But more generally, they are the directions that remain invariant (up to scaling) under the transformation.

The objective of PCA is to transform the calculated covariance matrix into an optimum form where all variables are uncorrelated linearly to first order. That is, $C(x_i, x_j) = 0,\ i \ne j$. Notice how we state uncorrelated linearly to the first order.

For $y = \ln{(x)}$, even though $y$ depends solely on $x$, it wouldn't appear correlated because correlation measures the linear relationship between variables.

The resulting matrix is a diagonal matrix where all elements equal 0 except those on the diagonal:

$$C' = \begin{bmatrix} \lambda_1 & & 0\\ & \ddots & \\ 0 & & \lambda_M \end{bmatrix}$$

The diagonal elements of the transformed matrix are called the eigenvalues ($\lambda$). To solve for them, you take the determinant of the bloody thing and set it to zero:

$$\det(C - \lambda \cdot I) = \left| C - \lambda \cdot I \right| = 0$$

And the Principal Components (PCs) are the EigenVectors corresponding to the calculated EigenValues.

What is an EigenVector in this context? It is a vector that, when transformed by the covariance matrix, results in a scaled version of the vector. Remember that in the pure mathematical context, the EigenVector was typically a unit vector whose elements could be chosen freely, in a manner of speaking.

The scale is the associated eigenvalue, given by

$$C \cdot PC_i = \left( \lambda_i \cdot I \right) \cdot PC_i, \qquad i = 1,\ 2,\ \dots,\ M$$

So, you get the $i^{th}$ principal component.

Since there are no correlations between the obtained PCs, the eigenvectors are orthogonal vectors.

EigenVectors of a transformation are those vectors that don't change their direction, but just their magnitude; the scaling factor is the EigenValue. Multiplying the transformation matrix with the EigenVector results in the EigenVector scaled by the EigenValue $\lambda$.

$$A \vec{v} = \lambda \vec{v}$$

Here, $A$ is the transformation matrix and $\vec{v}$ is the EigenVector. EigenVectors show the direction of a feature, which is important to us, and the EigenValue tells us something about the magnitude of the feature.

Step 4: Formulate the PCs

The next step is to order all other PCs according to their EigenValues, highest to lowest. The percentage of how much variance each PC represents is calculated by:

$$H_{PC_i} = \frac{\lambda_i}{\lambda_1 + \cdots + \lambda_M} \cdot 100\%$$

So, it is each eigenvalue's share of the total variance. And since we centred the variables' means around 0, their weights should be comparable.

Step 5: Dimensionality Reduction

You can choose to ignore the PCs with less significance. They will appear at the bottom of the PC list with the lowest eigenvalues. You will now have a dataset with $M'$ variables, where $M' \lt M$.

Step 6: Reconstruct the dataset

Then, the data set is reconstructed by the produced PCs with the following:

$$[y]^T = \begin{bmatrix} PC_1 & \cdots & PC_{M'} \end{bmatrix}^T \cdot [x]^T$$

Notice we have $M'$, which might be obvious to some but just in case…

PCA Example

If you are anything like me you need an example to set the record straight. Luckily, the book comes with one.

| $d$ | $x_1$ | $x_2$ |
| --- | --- | --- |
| 1 | 2.5 | 2.4 |
| 2 | 0.5 | 0.7 |
| 3 | 2.2 | 2.9 |
| 4 | 1.9 | 2.2 |
| 5 | 3.1 | 3.0 |
| 6 | 2.3 | 2.7 |
| 7 | 2.0 | 1.6 |
| 8 | 1.0 | 1.1 |
| 9 | 1.5 | 1.6 |
| 10 | 1.1 | 0.9 |

![[/images/notes/data-science/pca-datascience-graph-0001.png]]

The data records are scattered around the diagonal. This means that the diagonal itself would be a better primary axis because it captures the most important variance of the data records.

The logic here is a little confusing. Not all of the data records are on the diagonal because of some variance. We expect that a second axis perpendicular to the diagonal will capture the second-highest variability of these data records. Not sure what that means…

The Principal Component Analysis (PCA) algorithm is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.

When PCA is applied to a dataset, it finds the principal components of the data. The principal components are new axes that are perpendicular to each other and capture the maximum variability in the data.

In the example provided, the data records are not all on the diagonal. This means that the data is not perfectly correlated with any one axis. However, we can expect that a second axis perpendicular to the diagonal will capture the second-highest variability of the data records.

This means that the information in the graph is better described using the diagonal and a new axis perpendicular to it. If we need to reduce the number of variables, we could use only the new (diagonal) axis, as it captures most of the information, and neglect the second new axis which contains less significant information about the variance of the data points.

Here is a simple example to help you understand this concept:

Imagine that we have a dataset of two features: height and weight. We can plot this data in a two-dimensional scatter plot. If the data is perfectly correlated, then all of the data points will lie on a straight line. However, if the data is not perfectly correlated, then the data points will be spread out over the two dimensions.

We can use PCA to find the principal components of this dataset. The first principal component will be the axis that captures the maximum variability in the data. In this case, the first principal component will be an axis that runs along the diagonal of the scatter plot.

The second principal component will be the axis that captures the second-highest variability in the data. In this case, the second principal component will be an axis that is perpendicular to the first principal component.

If we need to reduce the number of variables in our dataset, we can use only the first principal component. This will capture most of the information in the data, and we can neglect the second principal component without losing too much information.

First, find the mean of each variable.

$$\begin{gather*} \bar{x}_1 = 1.81\\ \bar{x}_2 = 1.91 \end{gather*}$$

Second, subtract mean from values.

| $d$ | $x_1$ | $x_2$ | $x_1-\bar{x}_1$ | $x_2-\bar{x}_2$ | $(x_1-\bar{x}_1)(x_2-\bar{x}_2)$ | $(x_1-\bar{x}_1)^2$ | $(x_2-\bar{x}_2)^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.5 | 2.4 | 0.69 | 0.49 | 0.3381 | 0.4761 | 0.2401 |
| 2 | 0.5 | 0.7 | -1.31 | -1.21 | 1.5851 | 1.7161 | 1.4641 |
| 3 | 2.2 | 2.9 | 0.39 | 0.99 | 0.3861 | 0.1521 | 0.9801 |
| 4 | 1.9 | 2.2 | 0.09 | 0.29 | 0.0261 | 0.0081 | 0.0841 |
| 5 | 3.1 | 3.0 | 1.29 | 1.09 | 1.4061 | 1.6641 | 1.1881 |
| 6 | 2.3 | 2.7 | 0.49 | 0.79 | 0.3871 | 0.2401 | 0.6241 |
| 7 | 2.0 | 1.6 | 0.19 | -0.31 | -0.0589 | 0.0361 | 0.0961 |
| 8 | 1.0 | 1.1 | -0.81 | -0.81 | 0.6561 | 0.6561 | 0.6561 |
| 9 | 1.5 | 1.6 | -0.31 | -0.31 | 0.0961 | 0.0961 | 0.0961 |
| 10 | 1.1 | 0.9 | -0.71 | -1.01 | 0.7171 | 0.5041 | 1.0201 |
| SUM | - | - | - | - | 5.539 | 5.549 | 6.449 |

Third is the covariance matrix. Just a sanity check on the covariance…

$$\begin{align*} C(x_1,\ x_2) &= \frac{1}{N-1} \sum_{k=1}^{N} \left( (x_1-\bar{x}_1) \cdot (x_2-\bar{x}_2) \right)_k\\ &= \frac{1}{10-1} \left( 5.539 \right)\\ &= \frac{5.539}{9}\\ &\approx 0.6154\bar{4} \end{align*}$$

But the covariance matrix is a mixture of variances and covariances. Also, it is computed from the mean-centred values, i.e., the deviations around the mean, which makes the numbers smaller…

$$\begin{align*} C &= \begin{bmatrix} C(x_1,x_1) & C(x_1,x_2) \\ C(x_2,x_1) & C(x_2,x_2) \end{bmatrix} \\ \\ C &= \frac{1}{9} \begin{bmatrix} 5.549 & 5.539 \\ 5.539 & 6.449 \end{bmatrix}\\ C &= \begin{bmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{bmatrix} \end{align*}$$

Fourth comes the EigenValues:

$$\begin{align*} 0 &= \left| C-\lambda \cdot I \right| \\ 0 &= \begin{vmatrix} 0.6166-\lambda & 0.6154 \\ 0.6154 & 0.7166-\lambda \end{vmatrix} \\ 0 &= (0.6166-\lambda) \cdot (0.7166-\lambda)-(0.6154^2)\\ 0 &= \lambda^2 - 0.6166 \lambda - 0.7166 \lambda + 0.4418 - 0.3788 \\ 0 &= \lambda^2 - 1.3331 \lambda + 0.063 \end{align*}$$

I’m not going to use the quadratic equation, but trust in the book…

$$\therefore \lambda_1 = 1.284,\ \lambda_2 = 0.049$$

Fifth, we calculate the EigenVectors:

$$\begin{align*} 0 &= (C-\lambda_1 \cdot I) \cdot PC_1 \\ 0 &= \begin{bmatrix} 0.6166-\lambda_1 & 0.6154 \\ 0.6154 & 0.7166-\lambda_1 \end{bmatrix} \cdot \begin{bmatrix} a_1\\ b_1 \end{bmatrix}\\ 0 &= \begin{bmatrix} 0.6166-1.284 & 0.6154 \\ 0.6154 & 0.7166-1.284 \end{bmatrix} \cdot \begin{bmatrix} a_1\\ b_1 \end{bmatrix}\\ 0 &= \begin{bmatrix} -0.6674 & 0.6154 \\ 0.6154 & -0.5674 \end{bmatrix} \cdot \begin{bmatrix} a_1\\ b_1 \end{bmatrix}\\ 0 &= \begin{bmatrix} -0.6674 \cdot a_1 + 0.6154 \cdot b_1 \\ 0.6154 \cdot a_1 - 0.5674 \cdot b_1 \end{bmatrix}\\ \therefore PC_1 &= \begin{bmatrix} a_1\\ b_1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1.084 \end{bmatrix} \end{align*}$$

What does this mean? It's the slope of the line of best fit, or something like that, where

$$m = \frac{b}{a}$$

I guess, as such, it's advantageous to solve for $a_1 = 1$.

So, what about $PC_2$? This will be the line perpendicular to $PC_1$. You can already then guess the values of the eigenvector. Because the math is exactly the same, just with different numbers, here are the values:

$$\therefore PC_2 = \begin{bmatrix} a_2\\ b_2 \end{bmatrix} = \begin{bmatrix} -1.084 \\ 1 \end{bmatrix}$$

According to This Article with Example, this is where you can arrange and select your top-most eigenvalues.

Sixth, we reconstruct the dataset by re-orientating the data from the original axes. The final data set will be the standardized original data set times the feature vectors.

$$y_i = \begin{bmatrix} PC_1 & PC_2 \end{bmatrix}^T \cdot [x_i]^T$$

Per ChatGPT 3.5

Certainly! Principal Component Analysis (PCA) involves transforming your data into a new coordinate system based on the eigenvectors of the data’s covariance matrix. This transformation can be represented as a matrix multiplication.

Here’s how you can perform the PCA transformation step by step:

  1. Center the Data: Start by centering your data by subtracting the mean of each feature (column) from each data point. This ensures that the data is mean-centered.

  2. Compute the Covariance Matrix: Calculate the covariance matrix of the mean-centered data. The covariance matrix represents how each feature relates to every other feature in the data.

  3. Eigenvalue Decomposition: Perform an eigenvalue decomposition of the covariance matrix. This involves finding the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the directions (components) along which the data varies the most, and the eigenvalues represent the amount of variance explained by each eigenvector.

  4. Select Principal Components: Sort the eigenvectors in descending order based on their corresponding eigenvalues. These eigenvectors represent the principal components of the data. You typically choose the top k eigenvectors, where k is the number of dimensions you want to reduce your data to.

  5. Create the Projection Matrix: The projection matrix is composed of the top k eigenvectors, typically arranged as columns. This matrix defines the transformation from the original feature space to the new principal component space.

  6. Transform the Data: To transform your original data into the principal component space, simply multiply your mean-centered data by the projection matrix.

Here’s the matrix multiplication formula for the transformation:

NewData = MeanCenteredData * ProjectionMatrix

Where NewData is the transformed data in the principal-component space, MeanCenteredData is the centred data from step 1, and ProjectionMatrix contains the top k eigenvectors as columns.

To summarize, the key steps are mean-centering your data, calculating the covariance matrix, finding the eigenvectors and eigenvalues, selecting the top k eigenvectors as your projection matrix, and then using this matrix to transform your data into the principal component space.

Keep in mind that the transformed data will have as many columns as the number of eigenvectors you selected (k), which is typically fewer than the original number of features, achieving dimensionality reduction while retaining most of the data’s variance.

Me again, I’ll now walk through calculating the new first row of data. Note, we will use transpose so that the data fits matrix multiplication.

$$\begin{align*} y_i &= \begin{bmatrix} PC_1 & PC_2 \end{bmatrix}^T \cdot [x_i]^T\\ y_1 &= \begin{bmatrix} 1 & -1.084\\ 1.084 & 1 \end{bmatrix}^T \cdot \begin{bmatrix} 0.69 & 0.49 \end{bmatrix}^T\\ y_1 &= \begin{bmatrix} 1 & 1.084\\ -1.084 & 1 \end{bmatrix}\cdot \begin{bmatrix} 0.69 \\ 0.49 \end{bmatrix}\\ y_1 &= \begin{bmatrix} 1 \cdot 0.69 + 1.084 \cdot 0.49 \\ -1.084 \cdot 0.69 + 1 \cdot 0.49 \end{bmatrix}\\ y_1 &= \begin{bmatrix} 1.22116\\ -0.25796 \end{bmatrix} \end{align*}$$

You would, again, look at the transpose of that, but it's your new $x_1$ and $x_2$ values! 🥳

I solved row 2 and compared to the book. Looks like that is how you do it. It amazingly produces a graph with what looks to be 0 correlation. I suppose here you could also then see how the new variables relate to your predictor and drop one if the correlation is low.
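
To tie the steps together, here is a minimal NumPy sketch of the same example (my own illustration, not the book's code). Note that `np.linalg.eigh` returns unit-length eigenvectors, so the projected values differ from the hand calculation above (which used the unnormalised vector $[1, 1.084]$) by a constant scale, and signs may be flipped.

```python
import numpy as np

# The 10 records from the example above (columns: x1, x2)
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

Xc = X - X.mean(axis=0)               # Step 1: subtract the mean
C = np.cov(Xc, rowvar=False)          # Step 2: covariance matrix (uses N-1)
eigvals, eigvecs = np.linalg.eigh(C)  # Step 3: eigenvalues / eigenvectors

order = np.argsort(eigvals)[::-1]     # Step 4: sort PCs by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5/6: (optionally keep only the first column, then) project the data
Y = Xc @ eigvecs
print(eigvals)   # approx [1.284, 0.049]
print(Y[0])      # first transformed record (scale/sign differ from the book)
```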

Per Video Session 6

With data points spread, say, going up and to the right, that trend direction is the direction of the largest variance. Then, the perpendicular line captures the second-largest variance. We use the vectors of variance to transform the data into a new coordinate system. This can be done for any $N$ dimensions.

In signal processing, this is also called the discrete Karhunen-Loève Transform (KLT).

EigenVector Application

As an example of how EigenVectors are used in data science, consider how pixels change during a movie. A change of orientation is a transformation calculation in 3-D space. If we look at a rotation, the EigenVector is the vector that does not change during the rotation; it essentially becomes the axis of rotation.

This can help recognize pixels in images.

5.2 - Cluster Analysis

p. 91

Definition - Clustering: Method of grouping objects together such that objects in the same group have more in common with one another than with objects in other groups. Clustering is an unsupervised learning technique that permits the input data to be grouped into unlabeled, meaningful clusters. Each group shares a certain level of similarity.

There are several clustering approaches.

We will look at K-means clustering and agglomerative clustering.

K-Means Clustering

K-means clustering is an algorithm for grouping a given $N$ data records into $K$ clusters. It does this by using the centroids (arithmetic means). The distance measure is the Euclidean distance. It is considered to be unsupervised since an initial classification does not exist.

The algorithm is:

Step 1: Decide the number of clusters, $K$.

Step 2: Select random data records to represent the centroids of these clusters. The video says, more like… calculate $k$ prototype centroids from random groups of similar vectors.

Step 3: Create the $k$ clusters by calculating the distance between each data record and the defined centroids. Then assign each data record to the cluster containing the centroid closest to it.

We use Euclidean distance, given by:

$$d(i,\ c)= \sqrt{ (x_{1,i}-x_{1,c})^2 + \cdots + (x_{M,i}-x_{M,c})^2 }$$

Where $(x_1,\ x_2,\ \dots,\ x_M)$ are the $M$ data variables, $i$ denotes the data record, and $c$ denotes the cluster's centroid.

Step 4: Recalculate the new centroid for each cluster by averaging the included data records.

Step 5: repeat steps 3 and 4 until there are no further changes in the calculated centroids.

Step 6: The final clusters comprise the data records included within them.

The video lecture goes over interesting graph images. It’s a process starting with random centroids. Calculate distance to create random clusters. Calculate new centroids from the clusters. Calculate new distances, perhaps some values change centroids. And continue until you cannot make anymore meaningful changes.

Check rules of thumb for choosing $k$. Too many clusters will not lead to convergence because there is too much fragmentation.
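
Before the book's example, here is a small NumPy sketch of the loop described in steps 1-6 above. It is my own illustrative implementation, not the book's, and it does not handle the corner case of an empty cluster.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data records as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every record to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 4: new centroid = mean of the records assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1], [8.8, 9.3]])
labels, centroids = k_means(X, k=3)
print(labels)
print(centroids)
```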

Example

The book includes an example but the concept seems straight-forward enough.

So, the first centroid is a point of data. The second and further centroids actually appear to be average points. The example shows that even though the distances may update with the new centroid, since no records changed clusters, there’s no need to recalculate everything.

K-Nearest Neighbour

This is from the video lecture. It sounds similar to K-means clustering. However, K-Nearest Neighbour assigns new data to a cluster according to the proximity to $k$ classified neighbours. Distance can be measured by many methods like Manhattan, Euclidean, Cosine, MSE, etc…

It is like if we already did classification and we just want to assign a random data point into a known cluster.

This is a supervised method since the classification already exists. The selection of $k$ is critical:

The rule of thumb is to use the square root of $n$ as well as an odd $k$ to prevent ties between two classes.

Uses include:

Hierarchical Clustering

Hierarchical Clustering is applied to data that has an underlying hierarchy. The book uses the example that apples and bananas are fruits, while aubergine and courgette are vegetables; however, both groups are fresh produce.

There are two approaches to hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).

Agglomerative Clustering

Agglomerative Clustering creates a bottom-up tree (dendrogram) of clusters that repeatedly merge two nearest points, or clusters, into a bigger super cluster. Algorithm is formulated as follows:

  1. Assign each record of the given $N$ data records a unique cluster, forming $N$ clusters.
  2. Merge data records, or clusters, with minimum Euclidean distance between them into a single cluster.
  3. Repeat this process until there is only one cluster remaining, forming a hierarchy of clusters.

Example

Those steps, to me, are a little confusing, so let's dive into an example. The book looks at a very small dataset of 6 records with 2 variables each. Each record is its own cluster at first. The Euclidean distance is found between points. Data with the smallest distances are merged.

You may think you go straight from 6 clusters to 3; I did at first. But you only merge smallest distances for non-overlapping points. So record 5 cannot merge with record 4 and record 3 in the same round.

If the distance from row 4 to 5 is the smallest, then we create cluster $C_1$. Then, if row 5 to 3 has the smallest distance, we create cluster $C_2$, which incorporates all of $C_1$, not just row 5, and the new point. Continue on this trend, clustering points and clusters.

Apparently you still measure distances between points even after you cluster. I thought you’d get the cluster’s centre, but doesn’t look to be the case. This creates sort of a hierarchy of points as well.
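
For reference, SciPy can build this kind of bottom-up tree directly. `method="single"` merges clusters based on the minimum point-to-point distance, which matches the point-wise distances described above; the six records here are made up by me, not the book's example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up records with two variables each
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size]
Z = linkage(X, method="single", metric="euclidean")
print(Z)

# Cut the resulting dendrogram into, say, 2 flat clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```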

5.3 - Linear Regression

p. 99

Definition - Linear Regression: A method for modeling linear relationships between a dependent variable and one or more independent variables.

The objective of Regression is prediction based on historic data. Building a model is an iterative process to relate independent variables to the dependent variables.

Linear Regression Model

Again, we assume $n$ data records, each with $m$ features, or variables. We want to discover a relationship something like:

$$\hat{y}=w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$$

We are using $w$ as the coefficient to indicate the weight of the variable. Of course, $\hat{y}$ is the predicted value and differs from the actual value by a random error variable,

$$y_i = \hat{y}_i + \varepsilon_i$$

This is because the input data is not an exact linear relationship between independent and dependent variables. Or perhaps we are missing information. Note that:

$$\varepsilon_i = |\hat{y}_i - y_i|$$

We use the absolute value because when you plot points, the error is the vertical distance between the points. And distance cannot be negative.

Correlation

Before we dive into any examples, the instructor talks about the correlation coefficient, or the product-moment correlation according to Pearson. It determines the extent to which values of two variables are linearly related to each other.

population:

$$\rho=\frac{cov(x,y)}{\sigma_x \cdot \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \cdot \sigma_y}$$

sample:

$$\rho=\frac{s_{xy}}{s_x \cdot s_y}= \frac{\sum_{i=1}^n(x_i- \bar{x})(y_i- \bar{y})}{ \sqrt{\sum_{i=1}^n(x_i- \bar{x})^2} \sqrt{\sum_{i=1}^n(y_i- \bar{y})^2} }$$

where you can have

$$\begin{align*} s_{xy} &= \frac{1}{n-1}\sum_{i=1}^n(x_i- \bar{x})(y_i- \bar{y})\\ s_x &= \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i- \bar{x})^2}\\ s_y &= \sqrt{\frac{1}{n-1}\sum_{i=1}^n(y_i- \bar{y})^2} \end{align*}$$

The $n-1$ is what makes it a sample. However, you can see that in the grand scheme of things it cancels out.

The population is much more than just the entire set of data, but consider that you only know the data right now. You don’t have future data and therefore just a sample of data.

The correlation coefficient ranges between $\pm 1$. You cannot say something is twice as related if the coefficients are 0.25 and 0.50; that's not really how it works. But yes, it is more related. Additionally, high correlation does not necessarily indicate a causative relationship between variables. However, if you hypothesise a relationship, the correlation can support your hypothesis.

Coefficient of Determination

The Coefficient of Determination ($r^2$) estimates the proportion of the variation explained by the regression line. In linear regression with a single variable, it is the square of the Pearson correlation coefficient:

$$r^2=\frac{s_{xy}^2}{s_x^2 \cdot s_y^2}$$

It answers the question: how much of the variation in $y$ is described by the variation in $x$? That is, $1-r^2$ shows how much of the change in $y$ is attributed to random error rather than to $x$. This coefficient ranges from $(0,\ 1)$, and can be interpreted as a percentage.

What does 64% mean? Changing something on one axis will change on the other. It means 64% of the variation in one axis is represented in the other, leaving 36% left for the bias. So it is like how much of the variation in yy is explained by the variation in xx.

Might be helpful to look at examples comparing $r$ and $r^2$. You'll notice that $r^2$ changes much faster as variation increases. Note that $r^2$ will also always be positive.
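
A quick sketch of that comparison with made-up data; `np.corrcoef` returns the Pearson $r$, and squaring it gives $r^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + rng.normal(0, 4, size=50)   # linear relationship plus noise

r = np.corrcoef(x, y)[0, 1]
print(r, r**2)   # r^2: share of the variation in y explained by x
```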

Simple Linear Regression Model

This is like a $y=mx+b$ situation, just one independent variable. In this easy example, the line of best fit for a dataset would minimize the sum of squared error, also called the least-squares method.

$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

So, we pick the weights $w_n$ that minimize this function. When you think minimize, you may think solving for when the derivative is 0.

$$\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (\hat{y}_i - w_0-w_1x)^2$$

Since we are solving for weights, I guess that is what we take derivatives with respect to.

$$\frac{\partial}{\partial w_0} \sum_{i=1}^{n} \varepsilon_i^2 = 0$$

This would provide a point where the error is minimal. Work that into our summation:

$$\begin{align*} 0 &= \frac{\partial}{\partial w_0} \sum_{i=1}^{n} (\hat{y}_i - w_0-w_1x)^2\\ &= \sum_{i=1}^{n} 2(\hat{y}_i - w_0-w_1x)\cdot(0-1-0)\\ &= 2 \sum_{i=1}^{n} (w_0+w_1x-\hat{y}_i)\\ &= 2nw_0 + 2 \sum_{i=1}^{n} (w_1x-\hat{y}_i)\\ -2nw_0 &= 2 \sum_{i=1}^{n} (w_1x-\hat{y}_i)\\ w_0 &= \frac{2}{-2n} \sum_{i=1}^{n} (w_1x-\hat{y}_i)\\ w_0 &= \frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i) - \frac{w_1}{n} \sum_{i=1}^{n}(x) \end{align*}$$

That wasn't too bad, but we are going to make an even bigger mess finding $w_1$. This isn't a statistics course, and I will have notes on a statistics course soon… The steps to finding $w_1$ are similar but more complicated, so I'll leave the derivation to the statistics course for the sake of my time and sanity.

$$\begin{align*} 0 &= \frac{\partial}{\partial w_1} \sum_{i=1}^{n} \varepsilon_i^2\\ 0 &= \frac{\partial}{\partial w_1} \sum_{i=1}^{n} (\hat{y}_i - w_0-w_1x)^2\\ &= \sum_{i=1}^{n} 2(\hat{y}_i - w_0-w_1x)\cdot(-x)\\ &= \dots\\ w_1 &= \frac{ \displaystyle n \sum_{i=1}^n (\hat{y}x) - \sum_{i=1}^n (\hat{y})\sum_{i=1}^n (x) }{ \displaystyle n \sum_{i=1}^n (x^2) -\left(\sum_{i=1}^n(x)\right)^2 } \end{align*}$$

Since $w_1$ appears in the formula for $w_0$, you calculate that first.

Example

Basically, you need to calculate the following

$$\begin{gather*} \sum_{i=1}^{n}(x)\\ \sum_{i=1}^{n}(x^2)\\ \sum_{i=1}^{n}(\hat{y}x)\\ \sum_{i=1}^{n}(\hat{y}) \end{gather*}$$

The $\hat{y}$ values that we use are the dependent variables of our dataset. It makes me think the formula is backwards from the beginning, where we should have expanded $\hat{y}$, but the expansion of that is actually more like $\hat{y} = y + \varepsilon$. So, expanding just $y$ is the right call.

Then, with those calculations, you plug-n-chug for the weights / coefficients. Once you have those, you can reconstruct the regression formula.
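
Here is a minimal sketch of that plug-n-chug with made-up numbers; the $\hat{y}$ in the formulas is the observed dependent variable, called `y` below.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # observed dependent variable
n = len(x)

# w1 first, since w0 depends on it
w1 = (n * np.sum(x * y) - np.sum(y) * np.sum(x)) / (n * np.sum(x**2) - np.sum(x)**2)
w0 = np.mean(y) - w1 * np.mean(x)
print(w0, w1)

# Sanity check against NumPy's own least-squares fit (returns [w1, w0])
print(np.polyfit(x, y, deg=1))
```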

Multiple Linear Regression Model

p. 104

What about more realistic datasets, with many records, each having many independent variables? Let’s extend our understanding to a dataset with 2 independent variables.

$$y = w_0 + w_1 x_1 + w_2 x_2$$

The process would be taking the derivative of the error with respect to each coefficient and setting it equal to zero. Now, I just want to look at the following derivatives,

$$\begin{align*} 0 &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n} \varepsilon_i^2\\ &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n}(\hat{y}-y)^2\\ &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n}(\hat{y}-(w_0 + w_1 x_1 + w_2 x_2))^2\\ &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n}(\hat{y}-w_0 - w_1 x_1 - w_2 x_2)^2\\ &= 2\sum_{i=1}^{n}\left( (\hat{y}-w_0 - w_1 x_1 - w_2 x_2)\cdot(0-1-0-0) \right) \end{align*}$$

So, $w_0$ is composed of $w_1$ and $w_2$, making it a little harder to work with. Additionally,

$$\begin{align*} 0 &= \frac{\partial}{\partial w_1}\sum_{i=1}^{n} \varepsilon_i^2\\ &= \frac{\partial}{\partial w_1}\sum_{i=1}^{n}(\hat{y}-w_0 - w_1 x_1 - w_2 x_2)^2\\ &= 2\sum_{i=1}^{n}\left( (\hat{y}-w_0 - w_1 x_1 - w_2 x_2)\cdot(0-0-x_1-0) \right)\\ &= 2\sum_{i=1}^{n}\left( w_0x_1 + w_1 x_1^2 + w_2 x_2 x_1 - \hat{y}x_1 \right) \end{align*}$$

I think it is apparent that additional variables drastically complicate what we are solving for. However, we have as many equations as we do unknown coefficients, and therefore it is solvable. It would be easier to construct a system of linear equations and then solve for the unknowns using techniques of linear algebra, though.
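
A sketch of that linear-algebra route with made-up data: stack a column of ones with the independent variables and let a least-squares solver handle the system, which is equivalent to the zero-derivative conditions above.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 5, size=50)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(0, 0.3, size=50)

# Design matrix: a column of ones (for w0) plus the two variables
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve X w ≈ y in the least-squares sense
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # approx [2.0, 0.5, -1.5]
```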

The book also notes that a large coefficient / weight $w_m$ implies that the target $y$ is highly correlated to the variable $x_m$, and vice versa. But as the number of independent variables increases, the assumption of a linear relationship between the target variable and the other variables becomes weak.

Consequently, nonlinear regression models will produce more accurate predictions.

5.4 - Time-Series Forecasting

p. 105

Definition - Forecasting Models: This is a model for forecasting (predicting) future events based on data and information gleaned from the past.

When people think about forecasting models, the first thought is probably stocks or sales. Some questions you might ask are:

One difference between regular regression analysis and this time-series forecasting is that individual data records depend on previous data records (to some extent). Thus, the forecasting technique must take this into consideration. Observations must be ordered with respect to time instances.

One popular linear forecasting technique is the autoregressive method (AR). This method assumes the expected output is a linear function of some past outputs. This means if the underlying relationship is nonlinear then the AR approach yields suboptimal results. For better results with nonlinear data, we can upgrade the AR method to include moving averages (ARMA) and integral terms (ARIMA).

Concept of Stationarity

Definition - Stationary Time-Series: This type of data has a constant mean and standard deviation over time.

To apply a forecasting model to time-series data, it should be stationary time-series. Only then can the model correctly self-predict its future response from past data points.

Most time-series data is not stationary, but we can convert the data into stationary data with the concept of $d_{\Delta t}$, which is the difference between every two data points on an interval $\Delta t$. I suppose, even if the data increases or decreases over time, we then aim to model the change over each time interval, hoping that is stationary. You can difference back over multiple intervals as well.

There is an optimum value for $\Delta t$, which results in a completely stationary form of the time-series data.

Autoregressive (AR) Model

p. 107

Definition - $lag(n)$: Not sure about the math definition format… This is the backshift of a time-series by $n$ time steps.

The AutoRegressive (AR) model is a linear model developed to predict the value of an observation at the very next point in time using a linear combination of its values at previous time instances.

The name comes from it using data from the same variable at past points in time. You might see $AR(n)$ for a model of order $n$. This refers to the lag defined above, so it is a linear combination of previous data points going back (backshift) $n$ points.

$$y_t = p_0 + p_1y_{t-1} + p_2y_{t-2} + \cdots + p_ny_{t-n} + \varepsilon_t$$

The $p_n$ values are model coefficients, and $\varepsilon_t$ is a white noise term $WN(0, \sigma^2)$, which looks a bit standard normal to me.

Per the video, the model is defined like

$$AR(p): \quad x_t = c + \sum_{i=1}^p\varphi_i x_{t-i}$$

The next point is calculated via the previous $p$ points.
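
A toy sketch of that idea: simulating an $AR(1)$ series $x_t = c + \varphi_1 x_{t-1} + \varepsilon_t$ with coefficients I made up.

```python
import numpy as np

rng = np.random.default_rng(0)
c, phi, n = 0.5, 0.8, 200

x = np.zeros(n)
for t in range(1, n):
    # each point is a linear function of the previous point plus white noise
    x[t] = c + phi * x[t - 1] + rng.normal(0, 1)

print(x[:5])
```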

Moving Average (MA) Model

The moving average model predicts future observations:

$$y_t = q_0 + q_1 \varepsilon_{t-1} + q_2 \varepsilon_{t-2} + \cdots + q_n \varepsilon_{t-n}$$

Here, each $\varepsilon_{t-n}$ is a white noise error term $WN(0,\ \sigma^2)$, and the $q_n$ are model coefficients.

Per the video,

$$MA(q): \quad x_t = \sum_{i=1}^q \theta_i \varepsilon_{t-i}$$

The next point is calculated from the previous $q$ error terms.

Moving Average Model | Wiki: per Wikipedia, it looks a little different,

$$\begin{align*} X_t &= \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \\ &= \mu + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t \end{align*}$$

Where $q$ is the order of the MA model, the errors are white noise, and $\mu$ is the mean of the series. The moving-average model is essentially a linear regression of the current value of the series against current and previous (observed) white noise error terms. The errors are assumed mutually independent and typically follow a normal distribution with mean 0.

Autocorrelation

This is like correlating data with itself, previous data.

The correlation coefficient is how linearly related two variables are, represented by a number in $\left\{ x \in \mathbb{R}\ |\ -1 \le x \le 1 \right\}$, or just $[-1,\ 1]$. We calculate the correlation between the forecasted variable and its value at a specific lag.

The autocorrelation coefficient $ACF(n)$ at $lag(n)$ is given by:

$$ACF(n) = \frac{C(y_t,\ y_{t-n})}{\sqrt{V(y_t)\cdot V(y_{t-n})}}$$

We have the covariance divided by variance.

Autocorrelation (serial correlation) is the correlation of a signal with temporally succeeding portions of the same signal. The function considers having a signal $X=\{ x_1, \dots,x_n \}$ and a delayed portion $X_d=\{ x_{1+d},\dots,x_{n-d} \}$; the function is basically the same as above.

Rewritten to express that we are comparing data with a delayed version of itself, hence the $n-d$, as the last $d$ values have nothing to be paired with… Hard to word, but if we are comparing data to itself back 3 periods, the basic concept is that the first comparison pairs $x_1$ with $x_4$, assuming we don't have an $x_0$ or earlier…

$$r_d = \frac{ \sum_{i=1}^{n-d}(x_i-\bar x)(x_{i+d}-\bar x) }{ \sqrt{ \sum_{i=1}^{n-d}(x_i-\bar x)^2 } \cdot \sqrt{ \sum_{i=1}^{n-d}(x_{i+d}-\bar x)^2 } }$$

The formula starts at $i$ and compares to the future $i+d$, instead of starting at $i+d$ and looking back. It's the same thing, but using a plus $d$ instead of a minus $d$.
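
A small sketch of the $r_d$ formula above as a helper function (my own code, applied to a made-up noisy sine series).

```python
import numpy as np

def acf(x, d):
    """Lag-d autocorrelation, following the r_d formula above."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    a = x[: len(x) - d] - xbar   # x_i       for i = 1 .. n-d
    b = x[d:] - xbar             # x_{i+d}   for i = 1 .. n-d
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 100)) + rng.normal(0, 0.1, size=100)
print([round(acf(x, d), 2) for d in range(1, 5)])
```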

Autocorrelation can be used to detect non-randomness in time series. Randomness is one of the key assumptions to define a univariate statistical process. It assumes that location, scale, and distribution are constant and that the bias is a random error rather than a systematic error. If randomness is not given, a non-linear or time-series model is required.

Typical Applications include:

You find whether data is truly random or if there are patterns in the data over time.

Autocorrelation | Wiki in the discrete time case is the correlation of a signal with a delayed copy of itself as a function of delay.

AutoCorrelation Plot

Autocorrelation plots show a series' autocorrelation coefficients for increasing delays (lags) and are used for checking randomness in a data set. Computing correlations for data at varying temporal delays and plotting them against the delay allows you to identify:

Partial Autocorrelation

We have a partial autocorrelation function $\text{PACF}_k$ at $lag(k)$: the autocorrelation between $y_t$ and $y_{t-k}$ that is not accounted for by the autocorrelations from the $1^{st}$ to the $(k-1)^{st}$ lags.

$$\begin{bmatrix} ACF(0) & ACF(1) & \cdots & ACF(k-1)\\ ACF(1) & ACF(0) & \cdots & ACF(k-2)\\ \vdots & \vdots & \ddots & \vdots \\ ACF(k-1) & ACF(k-2) & \cdots & ACF(0) \end{bmatrix} \begin{bmatrix} PACF_1\\ PACF_2\\ \vdots\\ PACF_k \end{bmatrix} = \begin{bmatrix} ACF(1) \\ ACF(2) \\ \vdots \\ ACF(k) \end{bmatrix}$$

Looks hectic.

Partial Autocorrelation Function (PACF) | Wiki: in time series analysis, this gives the partial correlation of a stationary time series with its own lagged values, regressed on the values of the time series at all shorter lags. Apparently, the contrast is that autocorrelation does not control for the shorter lags. Its purpose is more aimed at identifying the extent of the lag in an autoregressive (AR) model.

ACF vs. PACF

For the plot, you would have the lag on the x-axis. So, at $x=1$, that is a delay of one. And the y-axis is $r$. That means you plot nearly a histogram, or bar chart, of correlation against lag. That is autocorrelation.

Partial autocorrelation somehow doesn’t consider steps between lag.

Let's try to understand with an example. Suppose you have a bag of coloured blocks and you are drawing one block out at a time.

To get a high partial autocorrelation, you need a sequence of draws where the colours are highly correlated, even after taking into account the colours of the previous blocks. Consider drawing a pattern of blocks like: red, blue, red, red, blue, red, red, blue,…

The partial autocorrelation would be high because each block's colour is strongly correlated with the previous block's colour, even after taking into account the colours of the blocks in between.

Partial autocorrelation is more for, like, seasonal things. Partial autocorrelation is rarely high when autocorrelation is low. That really only happens in situations where data is highly seasonal. The autocorrelation would be low because the data is not correlated with itself at different lags. But the partial autocorrelation could be high because the values of the time series are correlated with themselves at different lags, after taking into account the seasonal pattern.

Autocorrelation is simply the correlation between time series and itself at different lags. Partial autocorrelation is much more involved, as it is the correlation between a time series and itself at a given lag, after taking into account the correlations between the time series and itself at all shorter lags. One calculation for partial autocorrelation is to use Yule-Walker equations. Another way to calculate is to use autoregressive (AR) models, which were discussed above.

Calculate the AR model for a time series. Then calculate the partial autocorrelation coefficients for the time series with:

$$\phi_k = \phi_{k-1} + a_k \cdot \phi_{k-1}$$

Where:

This process is recursive, meaning you start with lag 1 and calculate partial autocorrelation coefficients at subsequent lags using the above formula.

Other statistical methods for calculating Partial Autocorrelation are:

Autoregressive Integrated Moving Average (ARIMA) Model

Something also important is stationary data, meaning that neither the mean nor the variance changes over time. If the data is not stationary, then correlation analysis is apparently not possible, or at least not as easy. That is when you resort to methods such as taking differences between intervals, so that the rate of change might be constant.

Two stochastic processes are strongly stationary if their joint probability distribution does not change over time. This means that their mean and variance also does not change significantly.

ARIMA models mix both the AR and MA models with integrated parameters in one model to obtain a better understanding of time-series data. An integrated parameter is the degree of differencing which is performed on the dataset to transform it into a stationary time-series.

Enter ARIMA = Autoregressive Integrated Moving Average model.

The notation is $ARIMA(p,\ d,\ q)$, where we have:

It looks like the sum of a constant (y-intercept), the weighted sum of the previous $p$ values of $y$, and the weighted sum of the previous $q$ forecast errors.

$$y_t = c + p_1 y_{t-1} + \cdots + p_n y_{t-n} + q_1 \varepsilon_{t-1} + \cdots + q_m \varepsilon_{t-m}$$

So, how many terms of $p$ and $q$ do we use? Typically, you must look at plots of the autocorrelation and partial autocorrelation functions of the time-series, which is why we covered them, then consider the following:

  1. If the ACF plot cuts off sharply at $lag(k)$ while there is a more gradual decay in the PACF plot (significance beyond $lag(k)$), then set $q=k$ and $p=0$.
  2. If the reverse, the PACF plot cuts off sharply at $lag(k)$ while the ACF plot has a more gradual decay beyond $lag(k)$, then set $p=k$ and $q=0$.
  3. If there is a single spike at $lag(1)$ in both the ACF and PACF plots, then $p=1$ and $q=0$ if the spike is positive, and $p=0$ and $q=1$ if the spike is negative.

A spike should exceed the 95% confidence interval.
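
In practice you would usually read those spikes off plots rather than compute them by hand. Here is a hedged sketch using statsmodels' plotting helpers; the series is simulated by me, not taken from the book.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Simulate a rough AR(2)-like series so there is something to plot
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(2, len(y)):
    y[t] = 0.6 * y[t - 1] + 0.2 * y[t - 2] + rng.normal()

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=20, ax=axes[0])    # shaded band is roughly the 95% interval
plot_pacf(y, lags=20, ax=axes[1])
plt.show()
```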

The book then covers some examples. The PACF and ACF charts are almost like histograms with correlation on the y-axis and the lag term on the x-axis.

The first example shows the PACF around $0.5$ for lags 1 and 2, then dropping close to zero, while the ACF is also high at first but has a more gradual decline over lag, not dropping off until around lag 7. Therefore, we set $p=2$ and $q=0$ and get:

$$\text{ARIMA}(2,0,0): \quad y_t = c + p_1\ y_{t-1} + p_2\ y_{t-2}$$

The second example is for $ARIMA(0,0,1)$ because there is a spike at $lag(1)$ in the ACF and the PACF decays out to $lag(6)$. This means:

$$ARIMA(0,\ 0,\ 1): \quad y_t = c + q_1 \varepsilon_{t-1}$$

Then we have example 3. They do a one-degree differencing, so $d=1$, to create more stationary data. Then the graphs are quite confusing, but the spikes are at $ACF_2$ and $PACF_5$, giving $ARIMA(5,1,2)$. I won't write out $y_t$ because that would be a lot.

Once a good estimation of the number of terms in the developed model $y_t$ is made, we use the input time-series data to obtain the unknown coefficients. The residuals time-series $R_t$ is the difference between the input time-series and the model's forecasted time-series.

If the ARIMA model was correctly developed, there should be no significant spikes in the ACF and PACF plots of $R_t$. Spikes at later lags don't automatically mean the model is wrong. However, if there's a spike at, say, $lag(2)$ and we used $ARIMA(1,0,0)$ or $ARIMA(0,0,1)$, then we use the following rules:

  1. If the spike is in the ACF plot, we increase q += 1 and refit the model.
  2. If the spike was in the PACF plot, we increase p += 1 and refit the model.

In practice, $p,q \le 3$ for any developed ARIMA model for a business application. It is also advised to avoid mixed models, with both $q$ and $p$ terms. And if you do add additional terms to a developed model on the basis of the residual analysis recommendation, do so one parameter at a time.

Also, some time-series data may first need nonlinear transformation to convert into a form with consistent distribution over time and symmetry in appearance.

Per the video,

$$ARMA(p,q): \quad x_t = c+\sum_{i=1}^p \varphi_ix_{t-i} +\sum_{i=1}^q\theta_i \varepsilon_{t-i}$$

A combination of the $p$ previous points and $q$ error terms. I think it is really interesting to see such a distinction between $p$ and $q$.

There’s also:

$$ARIMA(p,d,q): \quad x_t = c + \sum_{i=1}^p \varphi_i(x_t-x_{t-1}) + \sum_{i=1}^q \theta_i \varepsilon_{t-i}$$

This generates a different stationary series by calculating differences of the $d^{th}$ order, such as:

$$\begin{align*} d_1 &= (x_t-x_{t-1})\\ d_2 &= (x_t-x_{t-1})-(x_{t-1}-x_{t-2})\\ &\ \ \vdots \end{align*}$$

Some More ARIMA Information

Let’s cover what some of the models are used for, it’s interesting…

The course text book has some rules of thumb about using certain ARIMA functions. The rules are part of the exam! Know them well! Check around p. 109.

Seasonal Autoregressive Integrated Model (SARIMA)

ARIMA cannot handle datasets with seasonal components. Seasonal time-series data are cyclical, and require the seasonal autoregressive integrated model (SARIMA). The notation is $SARIMA(p,d,q)(P,D,Q)s$. The first three parameters are the same as for $ARIMA$. The $(P,D,Q)$ are for the backshifts of the seasonal period. The $s$ denotes the number of time steps for a single seasonal period.

Python's statsmodels library supports complete designing, fitting, and forecasting of SARIMA models.

There's also the SARIMAX model, which permits the existence of an exogenous variable $X$ in the dataset. That is an external variable that influences the time-series variations and needs to be considered during the analysis.
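
A hedged sketch of fitting such a model with statsmodels; the order values, the seasonal period of 12, and the simulated series are placeholders of my own, not from the book.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Simulated data: a random walk plus a 12-step seasonal pattern
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=120)) + 5 * np.sin(np.arange(120) * 2 * np.pi / 12)

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 0, 12))
result = model.fit(disp=False)

print(result.summary())
print(result.forecast(steps=12))   # forecast the next seasonal period
```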

5.5 - Transformation Approaches

A dataset transformation involves replacing each $x_i$ with $x_j$ using some function $f$:

$$x_j = f(x_i)$$

The book uses $X_i$ for the new variable, but capital variables typically denote random variables, and there's no random component I am aware of yet.

Why would you transform your data? You might transform variables into a new space, like Cartesian to Radial, and improve interpretability… that’s really all they listed.

It’s kind of like, if you see the dependent and independent variables have a relationship, but it isn’t linear, you want to understand how to make it linear. Then you can regress on it.

Logarithm Transformation

One sort of common transformation is a logarithmic transformation, where we have:

$$x_i'=\ln(x_i)$$

I prefer the natural log, but you can try base 10. This can help make clumpy data appear linear.

Power Law Transformation

The power law transformation represents a family of transformations based on a nonnegative parameter ($\gamma$):

$$x_i'=x_i^{\gamma}$$

Basically, $\gamma$ is initially estimated and then continuously updated during the training phase of the model-building process until the highest performance accuracy is achieved.

The parameter is meant to be nonnegative, but it can still be fractional, giving an interesting shape to the curve.

Reciprocal Transformation

This is:

$$x_i'=\frac{c}{x_i}$$

You can do just $\frac{1}{x}$, but I thought I would liven it up with $c$. Sometimes you are given data like gallons per mile, but you need it in miles per gallon.

Radial Transformation

Radial Transformation focuses on the distance between the value of a variable and the origin. I guess you combine two variables and transform them into the radial coordinates of radius and angle:

$$\begin{gather*} r=\sqrt{x_1^2+x_2^2}\\ \theta=\tan^{-1}\left( \frac{x_1}{x_2} \right) \end{gather*}$$

You might need to look for examples of this.

Discrete Fourier Transform

Well, if it isn’t our old friend…

The Fourier transform shifts variables from their traditional domain of $x$ versus time $t$ to a frequency domain. The Fourier transform determines which frequencies can represent the distribution of a given variable, and is given by:

$$x_n= \sum_{k=0}^{N-1}\left( x_k e^{-2 \pi i n k / N} \right)$$

Where $N$ is the length of the selected frequency band.
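
NumPy's FFT implements this sum; a small sketch with a made-up two-frequency signal:

```python
import numpy as np

N = 128
t = np.arange(N)
x = np.sin(2 * np.pi * 5 * t / N) + 0.5 * np.sin(2 * np.pi * 20 * t / N)

X = np.fft.fft(x)                    # complex spectrum of length N
magnitudes = np.abs(X[: N // 2])     # keep the positive-frequency half
dominant = np.argsort(magnitudes)[::-1][:2]
print(dominant)                      # expect frequency bins 5 and 20
```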


Video (Session 6 of 7)

Going to try and integrate the lecture into these notes since it's probably going to have a lot of equations.

Didn’t really cover different types of transformations oddly enough.


title: Forecasting
subtitle: Principles and Practice
edition: 3
authors: 
	- Rob J. Hyndman
	- George Athanasopoulos
publisher: OTexts
date: 2018
location: Melbourne, Australia
url: "https://otexts.com/fpp3/"

Forecasting: Principles and Practice

This should probably be its own information, but that is for the future perhaps. For now, we combine a couple additional chapters in my Data Science notes.

Preface

This is regarding the 3rd edition, but the 2nd is also available online. The 2nd edition uses the forecast package in R, where the 3rd edition uses tsibble and fable packages.

The course for Data Science suggests reading Ch. 9 of the 2nd edition, which is Dynamic regression models, but in the 3rd edition it is ARIMA models. Both are relevant and the 3rd edition seems to be more current, so I’ll go with that.

Check out the book yourself Forecasting: Principles and Practice.


Ch. 9 - ARIMA Models

Exponential smoothing (Ch. 8 of this book) and ARIMA models are probably the most widely used time series forecasting approaches. ARIMA models try to describe the autocorrelations in the data.

Stationarity and Differencing

A stationary time series has statistical properties that do not depend on the time at which the series is observed. So, a series with seasonality (or a trend) is not stationary, and apparently might require something like exponential smoothing instead. But a white noise series is stationary.

However, cyclic behaviour can be stationary if it doesn't have a trend or seasonality, as long as it doesn't have predictable patterns in the long-term.

Differencing

Differencing is a way to make non-stationary time series stationary by computing the differences between consecutive observations. Additionally, transformation like logarithms can help stabilise the variance of a time series.

The ACF is useful for identifying non-stationary time series. If the data is stationary, the ACF will drop to zero relatively quickly. However, if the data is non-stationary, the ACF will decay slowly.

Random Walk

For a series with $T$ values, if you calculate the difference series, you have $T-1$ values, as you cannot obtain $y_0'$ because you don't have a $y_{-1}$ value (best I can describe).

$$y_t'=y_t-y_{t-1}$$

For a white-noise series, you have

$$y_t-y_{t-1} = \varepsilon_t$$

So, the $\varepsilon_t$ denotes the white noise. And you get a random walk model with

$$y_t=y_{t-1} + \varepsilon_t$$

Random walks typically have:

Second-order differencing

This extends out beyond second, but might look like this…

$$\begin{align*} y_t'' &= y_t' - y_{t-1}'\\ &=(y_t-y_{t-1}) - (y_{t-1}-y_{t-2})\\ &= y_t - 2y_{t-1} + y_{t-2} \end{align*}$$

So, this set now has $T-2$ records. Apparently in practice, you typically won't go beyond second-order differences.

Seasonal Differencing

Seasonal differencing is difference between observation and the previous observation from the same season. Therefore:

$$y_t'=y_t-y_{t-m}$$

These are called lag-m differences.

Sometimes you would take both a seasonal and ordinary difference (AKA first difference) to obtain stationary data. Sorry if the AKA isn’t clear, it only applies to ordinary difference, not the combo of ordinary and seasonal.

It is recommended to do seasonal differences first because sometimes the resulting series is stationary without applying a further first difference. Applying more differences than required can induce incorrect autocorrelations.
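
A quick sketch of first, second-order, and seasonal (lag-12) differences with NumPy; the series is a made-up trend-plus-season example.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=60)) + np.tile(np.arange(12), 5)  # drift + season

first = np.diff(y)            # y'_t = y_t - y_{t-1}, length T-1
second = np.diff(y, n=2)      # y''_t, length T-2
seasonal = y[12:] - y[:-12]   # lag-12 seasonal difference

print(len(first), len(second), len(seasonal))
```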

Unit Root Tests

A unit root test is a statistical hypothesis test of stationarity designed to determine whether differencing is required. There are many such tests; this book covers the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The null hypothesis is that the data are stationary, and we test whether the null hypothesis is false. Small p-values (e.g., $p \lt 0.05$) suggest differencing is required.

R has a function called unitroot_kpss, check the book for examples as I don’t know R well enough at this time.

I did a little digging and found that Python's statsmodels package has a KPSS test tool as well.
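
A small sketch of that statsmodels KPSS test (function and argument names follow the statsmodels docs; the two series are simulated). As above, a small p-value suggests differencing is required.

```python
import numpy as np
from statsmodels.tsa.stattools import kpss

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=200))   # non-stationary
differenced = np.diff(random_walk)              # first difference

for name, series in [("random walk", random_walk), ("differenced", differenced)]:
    stat, p_value, lags, crit = kpss(series, regression="c", nlags="auto")
    print(f"{name}: KPSS stat = {stat:.3f}, p-value = {p_value:.3f}")
```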

The book discusses a unitroot_nsdiffs() R function for determining if seasonal differencing is required.

Backshift notation

The backshift operator $B$ works like:

$$By_t=y_{t-1}$$

Sometimes people use $L$ for "lag", but usually not. This means something like $y_{t-12}=B^{12}y_t$, where $B^{12}$ is just notation for applying the backshift twelve times rather than an ordinary 12th power.

The notation works for differencing like this:

$$y_t'=y_t-y_{t-1} = y_t - By_t=(1-B)y_t$$

In general, it would look like $y_t^{(d)}=(1-B)^dy_t$.

Seasonal and first difference looks like…

$$\begin{align*} (1-B)(1-B^m)y_t &= (1-B-B^m+B^{m+1})y_t\\ &= y_t-y_{t-1}-y_{t-m}+y_{t-m-1} \end{align*}$$

Autoregressive models

The term autoregression indicates that it is a regression of the variable against itself. An autoregressive model of order p can be written as

y_t = c + \phi_1 \cdot y_{t-1} + \phi_2 \cdot y_{t-2} + \cdots + \phi_p \cdot y_{t-p} + \varepsilon_t

where \varepsilon_t is white noise. It is like multiple regression but with lagged values of y_t as predictors. We call this an AR(p) model, an autoregressive model of order p.

We usually restrict autoregressive models to stationary data. We also put some constraints on the values of the parameters:

- For an AR(1) model: -1 \lt \phi_1 \lt 1.
- For an AR(2) model: -1 \lt \phi_2 \lt 1, \phi_1 + \phi_2 \lt 1, and \phi_2 - \phi_1 \lt 1.

When p \ge 3 the restrictions get more complicated. Luckily, R has the fable package that can take care of these restrictions for us when estimating a model.
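Here's a small Python sketch (my own toy simulation, not from the book) of simulating an AR(2) process that satisfies those constraints and recovering the coefficients with statsmodels:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate y_t = c + phi1*y_{t-1} + phi2*y_{t-2} + eps_t
rng = np.random.default_rng(1)
c, phi1, phi2 = 2.0, 0.5, -0.3          # these values satisfy the AR(2) constraints
eps = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

# An AR(2) model is ARIMA(2,0,0); the fitted AR coefficients should be near 0.5 and -0.3
res = ARIMA(y, order=(2, 0, 0)).fit()
print(res.params)
```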

Moving Average Models

A moving average model uses past forecast errors in a regression-like model, like so

y_t = c + \varepsilon_t + \theta_1 \cdot \varepsilon_{t-1} + \theta_2 \cdot \varepsilon_{t-2} + \cdots + \theta_q \cdot \varepsilon_{t-q}

This is called an MA(q) model, a moving average model of order q. Note that you wouldn't actually observe values of \varepsilon_t, so it is not a regression in the typical sense. Each y_t can be thought of as a weighted moving average of the past few forecast errors.

Note: Do not confuse moving average models with the moving average smoothing discussed in a previous chapter of the book.

Any stationary AR(p) model can be written as an MA(\infty) model. For the AR(1) case (with c=0), repeated substitution shows why:

\begin{align*} y_t &= \phi_1y_{t-1}+\varepsilon_t\\ &= \phi_1\left( \phi_1y_{t-2}+\varepsilon_{t-1} \right)+\varepsilon_t\\ &= \phi_1^2 y_{t-2}+ \phi_1 \varepsilon_{t-1} +\varepsilon_t\\ &= \phi_1^2 \left( \phi_1y_{t-3}+\varepsilon_{t-2} \right) + \phi_1 \varepsilon_{t-1} +\varepsilon_t\\ &= \phi_1^3 y_{t-3}+\phi_1^2 \varepsilon_{t-2} + \phi_1 \varepsilon_{t-1} +\varepsilon_t\\ &= \dots \end{align*}

Provided that -1 \lt \phi_1 \lt 1, \phi_1^k gets smaller as k gets bigger, so eventually \phi_1^k y_{t-k} \approx 0 and y_t is just a weighted sum of past errors.
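A quick numerical sanity check of that claim (my own toy example): build an AR(1) series by recursion and compare it with a truncated MA(\infty) sum of the same errors.

```python
import numpy as np

rng = np.random.default_rng(2)
phi1 = 0.6
eps = rng.normal(size=300)

# AR(1) recursion y_t = phi1*y_{t-1} + eps_t (no constant, y_0 = 0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = phi1 * y[t - 1] + eps[t]

# MA(infinity) representation y_t = sum_j phi1^j * eps_{t-j}, truncated at 50 terms
t = 299
approx = sum(phi1 ** j * eps[t - j] for j in range(50))
print(y[t], approx)   # the two numbers agree to many decimal places
```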

The reverse can also hold true with some constraints on the MA parameters; such an MA model is called invertible. Invertible models have some desirable mathematical properties, if you wanted to know why we jumped down this rabbit hole.

The book provides an example and explains why | \theta_1 | \lt 1 is required; it has to do with how the weights on lagged observations behave.

Non-Seasonal ARIMA Models

Combine differencing with autoregression and a moving average model to obtain a non-seasonal ARIMA model.

ARIMA = AutoRegressive Integrated Moving Average. "Integration" is the reverse of differencing in this context.

The model resembles what I have discussed from the course book:

y_t' = c + \phi_1 y_{t-1}' + \cdots + \phi_p y_{t-p}' + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t

y_t' is the differenced series. The book here repeats much of what is in the course notes for data science - selected math techniques. There are some neat special cases of the ARIMA model.

Recall ARIMA(p,d,q):

- p = order of the autoregressive part
- d = degree of first differencing involved
- q = order of the moving average part

Special cases of ARIMA(p,d,q):

- White noise: ARIMA(0,0,0) with no constant
- Random walk: ARIMA(0,1,0) with no constant
- Random walk with drift: ARIMA(0,1,0) with a constant
- Autoregression: AR(p) is ARIMA(p,0,0)
- Moving average: MA(q) is ARIMA(0,0,q)

Recommended to use backshift notation when working with more complicated models.
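Sketch of fitting a non-seasonal ARIMA in Python on made-up data (the book does this with R's fable; here I'm assuming statsmodels):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy random-walk-with-drift style series
rng = np.random.default_rng(3)
y = pd.Series(np.cumsum(0.2 + rng.normal(size=300)))

# ARIMA(1,1,1): one AR term, one difference, one MA term
res = ARIMA(y, order=(1, 1, 1)).fit()
print(res.summary())
```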

Understanding ARIMA Models

The book goes over more in-depth definition of components so you know what an automated function is doing.

ACF and PACF Plots

See book for examples.
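For the plots themselves, statsmodels has helpers (again my own minimal sketch, not the book's R examples):

```python
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(size=300))   # whatever series is under study

fig, axes = plt.subplots(2, 1)
plot_acf(y, ax=axes[0], lags=30)      # autocorrelation function
plot_pacf(y, ax=axes[1], lags=30)     # partial autocorrelation function
plt.show()
```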

Estimation and Order Selection

Maximum likelihood estimation

The R package fable uses maximum likelihood estimation when estimating the ARIMA model.

We have discussed this before.

Information Criteria

Akaike’s Information Criterion (AIC) is useful for selecting predictors for regression and for determining the order of an ARIMA model. It can be expressed as:

AIC = -2 \log(L) + 2 (p + q + k + 1)

where L is the likelihood of the data, k=1 if c \ne 0, and k=0 if c=0.

The corrected AIC for ARIMA models can be written as:

AICc=AIC + \frac{ 2(p+q+k+1)(p+q+k+2) }{ T-p-q-k-2 }

Bayesian Information Criterion can be written as:

BIC = AIC + \left[ \log(T) - 2 \right] \left( p+q+k+1 \right)

Basically, good models are obtained by minimising the AIC, AICc, or BIC. The book prefers the AICc. These criteria are good at selecting values for p and q, but not d.
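The corrections are easy to compute by hand from a model's AIC; a small Python helper using the two formulas above (the numbers in the example call are made up):

```python
import numpy as np

def aicc_and_bic(aic: float, p: int, q: int, k: int, T: int) -> tuple[float, float]:
    """Corrected AIC and BIC from the formulas above; T is the number of observations."""
    aicc = aic + (2 * (p + q + k + 1) * (p + q + k + 2)) / (T - p - q - k - 2)
    bic = aic + (np.log(T) - 2) * (p + q + k + 1)
    return aicc, bic

# Hypothetical example: ARIMA(2,1,1) with a constant, 150 observations, AIC = 412.3
print(aicc_and_bic(aic=412.3, p=2, q=1, k=1, T=150))
```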

ARIMA Modelling in fable

This is about the ARIMA() function in R's fable package.

Read the book for an overview.

Forecasting

Point Forecasts

Point forecasts are calculated using the following 3 steps:

1. Expand the ARIMA equation so that y_t is on the left-hand side and all other terms are on the right.
2. Rewrite the equation by replacing t with T+h.
3. On the right-hand side, replace future observations with their forecasts, future errors with zero, and past errors with the corresponding residuals.

The book gives an example. It’s a good example.

Prediction Intervals

Calculating ARIMA prediction intervals is more difficult, and beyond the scope of… this.
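In practice the software handles both point forecasts and intervals; a hedged statsmodels sketch on toy data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
y = pd.Series(np.cumsum(rng.normal(size=200)))

res = ARIMA(y, order=(0, 1, 1)).fit()

# Point forecasts for the next 10 steps, plus 95% prediction intervals
fc = res.get_forecast(steps=10)
print(fc.predicted_mean)
print(fc.conf_int(alpha=0.05))
```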

Seasonal ARIMA models

The ARIMA(p,d,q)(P,D,Q)_m model is used to represent seasonal data; the latter portion holds the seasonal terms, and m is the seasonal period (the number of observations per year).

Let m=4 (quarterly data); then ARIMA(1,1,1)(1,1,1)_4 is written as:

(1-\phi_1B)(1-\Phi_1B^4)(1-B)(1-B^4)y_t = (1+\theta_1B)(1+\Theta_1B^4)\varepsilon_t
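A statsmodels sketch of that seasonal specification on a made-up quarterly series (seasonal_order is (P, D, Q, m)):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy quarterly series with a repeating seasonal pattern plus noise
rng = np.random.default_rng(6)
idx = pd.period_range("2000Q1", periods=120, freq="Q")
y = pd.Series(10 + np.tile([0.0, 2.0, -1.0, 3.0], 30) + rng.normal(size=120), index=idx)

# ARIMA(1,1,1)(1,1,1)_4
res = ARIMA(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 4)).fit()
print(res.summary())
```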

ACF/PACF

The seasonal part of an AR or MA model will be seen in the seasonal lags of the PACF and ACF.

Plenty of examples in the book.

ARIMA vs ETS

ETS models and ARIMA models have some overlap, but each class contains many models that have no counterpart in the other.

Read more in the book.


Ch. 10 - Dynamic Regression Models

Exponential smoothing and ARIMA models allow for past observations, but not for the inclusion of other, possibly relevant, information. A regular regression model can include a lot of such information, but captures nothing of the subtle time series dynamics.

Recap - regression models can be expressed as:

y_t = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t} + \varepsilon_t

We have k predictor variables, and \varepsilon_t is usually assumed to be an uncorrelated error term (i.e. white noise).

The book mentions a Ljung-Box test for assessing whether the resulting residuals were significantly correlated. Worth looking into perhaps.
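statsmodels has a Ljung-Box implementation; a minimal sketch on a toy regression (an assumed workflow, not the book's):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Fit an ordinary regression, then check its residuals for autocorrelation
rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
ols_res = sm.OLS(y, sm.add_constant(x)).fit()

# Small Ljung-Box p-values would suggest the residuals are significantly correlated
print(acorr_ljungbox(ols_res.resid, lags=[10]))
```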

We are going to allow the errors from a regression to contain autocorrelation! To emphasise this change, we replace \varepsilon_t with \eta_t. This error is assumed to follow an ARIMA model.

So, the regression error is \eta_t and the ARIMA error is \varepsilon_t. Only the ARIMA error is treated as white noise, since the regression error is now assumed to follow an ARIMA process (which itself contains the white noise \varepsilon_t).

Interesting thought process.

Estimation

Based on our previous assumption about the error terms, if we try to minimise the sum of squared errors \eta_t, there are several issues:

- The estimated coefficients are no longer the best estimates, as some information has been ignored.
- Any statistical tests associated with the model (e.g. t-tests on the coefficients) will be incorrect.
- The AICc values of the fitted models are not a good guide to which model is best for forecasting.
- In most cases the p-values of the coefficients will be too small, so some predictors appear important when they are not ("spurious regression").

Instead, we can minimise the sum of squares of \varepsilon_t or use maximum likelihood estimation.

If estimating a regression with ARMA errors, ensure all variables are stationary. Otherwise, the estimated coefficients will not be consistent estimates, and therefore garbage. The exception is when the non-stationary variables are co-integrated.

Therefore, we difference non-stationary variables in the model. To maintain the form of the relationship between yty_t and the predictors, it is common to difference all the variables if any of them need differencing. The resulting model is called a model in differences. A model in levels is what we obtain when the original data are used without differencing.

If all variables in the model are stationary, we only need to consider an ARMA process for the errors. A regression model with ARIMA errors is equivalent to a regression model in differences with ARMA errors.

Regression with ARIMA errors using fable

In R, the ARIMA() function will fit a regression model with ARIMA errors if the exogenous regressors are included in the formula. Suppose we have a model where \eta_t is an ARIMA(1,1,0) error. Our regression model is

\begin{align*} y_t &= \beta_0 + \beta_1 x_t + \eta_t\\ y_t' &= (\beta_0 + \beta_1 x_t + \eta_t)-(\beta_0 + \beta_1 x_{t-1} + \eta_{t-1})\\ y_t' &= (\beta_0-\beta_0) + (\beta_1 x_t - \beta_1 x_{t-1})+ (\eta_t - \eta_{t-1})\\ y_t' &= \beta_1 x_t' + \eta_t'\\ &\tt{where}\\ \eta_t' &= \phi_1 \eta_{t-1}' + \varepsilon_t \end{align*}

Take that with a pinch of salt as it’s a good explanation for why the constant term drops, but not 100% sure about that pesky error term.
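In Python, one hedged way to get this kind of model is statsmodels' ARIMA with an exog argument, which fits the regression and the ARIMA error jointly by maximum likelihood (toy data below; note the constant drops out because of the differencing, as in the derivation above):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(8)
x = pd.Series(rng.normal(size=200)).cumsum()      # an exogenous predictor
eta = pd.Series(rng.normal(size=200)).cumsum()    # a non-stationary error term
y = 3.0 + 1.5 * x + eta

# Regression on x with ARIMA(1,1,0) errors
res = ARIMA(y, exog=x, order=(1, 1, 0)).fit()
print(res.params)     # includes the estimated coefficient on x (should be near 1.5)
```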

Whether differencing is required is determined by applying a KPSS test to the residuals from the regression model estimated using ordinary least squares.

If differencing is required, all variables are differenced and the model re-estimated using maximum likelihood estimation.

The AICc should be calculated for the final model, which can be used to determine the best predictors.

The book continues with examples in R.

Forecasting

To forecast using a regression model with ARIMA errors:

- forecast the regression part of the model,
- forecast the ARIMA (error) part of the model, and
- combine the results.

Future values of the predictors must either be known in advance or forecast themselves.

Book contains many examples.

Stochastic and Deterministic Trends

There are 2 ways to model a linear trend:

- a deterministic trend
- a stochastic trend

Deterministic Trend

y_t=\beta_0 + \beta_1t+\eta_t

Here \eta_t is an ARMA process (i.e. an ARIMA process with d=0).

A stochastic trend is the same model but where \eta_t is an ARIMA process with d=1. You can difference both sides so that you have:

y_t = y_{t-1} + \beta_1 + \eta_t'

Similar to a random walk with drift, but the error term is an ARMA process instead of white noise.

The models have different forecasting characteristics. The book gives examples.

Dynamic Harmonic Regression

Oh God… When there are long seasonal periods, a dynamic regression with Fourier terms is often better than the other models considered so far. Seasonal versions of ARIMA and ETS models are designed for shorter periods, like 12 months for monthly data or 4 quarters for quarterly data. Basically, there are m-1 parameters to be estimated for the initial seasonal states, so for a big m the estimation becomes unmanageable.

Even in R, the ARIMA() function only allows for m \le 350.

For something like a daily period, we prefer a harmonic regression approach, where the seasonal pattern is modelled using Fourier terms and the short-term time series dynamics are handled by an ARMA error.
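A hedged Python sketch of the idea: build K pairs of Fourier terms for the seasonal period by hand and pass them as regressors to a model with ARMA errors (names and data are made up; as far as I can tell, the book's R workflow uses a fourier() helper for the same thing):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def fourier_terms(n: int, m: float, K: int) -> pd.DataFrame:
    """K pairs of sin/cos(2*pi*k*t/m) terms for a seasonal period m."""
    t = np.arange(1, n + 1)
    cols = {}
    for k in range(1, K + 1):
        cols[f"sin{k}"] = np.sin(2 * np.pi * k * t / m)
        cols[f"cos{k}"] = np.cos(2 * np.pi * k * t / m)
    return pd.DataFrame(cols)

# Toy daily series with weekly seasonality (m = 7)
rng = np.random.default_rng(9)
n = 365
y = 10 + 3 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(size=n)

X = fourier_terms(n, m=7, K=3)                  # K controls how smooth the seasonal pattern is
res = ARIMA(y, exog=X, order=(1, 0, 1)).fit()   # ARMA(1,1) error handles short-term dynamics
print(res.params)
```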

Many advantages and one significant disadvantage, check the book.

Lagged Predictors

Consider an advertising campaign, or Covid. Sitting in a room of people with Covid won’t immediately get you sick. You would have a lag of several days. A model that allows for lag can be expressed as:

y_t = \beta_0 + \gamma_0 x_t + \gamma_1 x_{t-1} + \cdots + \gamma_k x_{t-k} + \eta_t

Again, \eta_t is an ARIMA process. The number of lags k can be selected using the AICc, along with the values of p and q for the ARIMA error.
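Building the lagged predictor columns is mostly a data-wrangling step; a small pandas sketch (hypothetical column names):

```python
import pandas as pd

def add_lags(df: pd.DataFrame, col: str, max_lag: int) -> pd.DataFrame:
    """Add columns x_lag1 ... x_lagK holding lagged copies of a predictor."""
    out = df.copy()
    for lag in range(1, max_lag + 1):
        out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna()          # the first max_lag rows have no lagged values

data = pd.DataFrame({"y": range(10), "x": range(10, 20)})
print(add_lags(data, "x", max_lag=3))
```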


Knowledge Check

Q1: Which model or analysis involves the operation of sorting data variables according to their level of changeability along data records?

This is the principal component analysis.

Q2: Why apply clustering analysis to a dataset?

You would do this to group data records according to their similarities.

This analysis does not remove irrelevant variables, reduce data dimensionality, nor estimate missing values.

Q3: What model or analysis provides the value of the independent variable?

It’s not principal component analysis, nor correlation analysis.

It’s a Regression Model! I clearly don’t understand the wording of the question.

Q4: The auto-regressive model assumes what?

A linear relationship between the future output and past outputs.

Q5: What transformation approach transforms data variables to their frequency domain?

That is the Fourier transformation.

Better know the others too: radial, reciprocal, logarithm, etc…