title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022

Not sure if this belongs here or in the Maths section, but here is good for now.

Unit 5: Selected Mathematical Techniques

You will learn about:

Introduction

We will discuss mathematical techniques and models used to transform data into insightful information. There are two modeling approaches for prediction:

Basically, the flow of this section follows the same order as the learning objectives. It's noted that we treat time-series data separately because it is both common and requires additional considerations.

Feature Extraction

Feature extraction is achieved by reducing the dimensions to the ones that significantly contribute to the features of interest. It assumes that data approximately lie on a lower dimensional space. This is achieved by methods such as convolution, radial transforms, integral transforms, or principal components analysis.

We want to isolate important characteristics. Linear regression is simple feature extraction. A reduction from $R^3$ to $R^1$, that is, a reduction of 3-dimensional data to 1 dimension, will be included in the last session.

5.1 - Principal Component Analysis

Input data usually include correlated variables, either redundant or irrelevant. If the correlation is high but not 100%, then there is still some amount of independent information within each variable. However, the juice isn't worth the squeeze: such variables place more burden on the prediction models than they are worth.

So, we come up with a correlation threshold, and if a variable exceeds that threshold with another variable, we can treat it as redundant information and safely remove it from the dataset.

Definition - Principal Component Analysis (PCA): A statistical analysis method applied to transform potentially correlated variables into uncorrelated variables called principal components (PCs).

This is a method that is applied to transform linearly-correlated variables into uncorrelated variables called principal components (PCs). PCA also sorts the produced uncorrelated variables according to their variance along the data records. Variables at the bottom of the list are said to have low changeability, and can be excluded. This results in a desired reduction in dimensionality of the dataset.

The goal is to construct a new set of variables such that most information is contained within the first few variables. Then, in the machine learning model and regressions phase, we might only use a subset of the new variables.

Are there steps? Actually, the book shows the PCA Algorithm, very cool.

The PCA Algorithm

Step 1: Get and Subtract the Mean

The book describes this a little weirdly, so I will give my interpretation. If we have a table of data, each record is a row, and the columns are our features, or variables as the book says. But if a record is a person, then one column could be "age", another could be "height", and another "gross income", etc… We can have $M$ different variables, each represented as…

$$\left\{ x_i \;\middle|\; i \in \mathbb{N},\, 1 \le i \le M \right\}$$

We also have $N$ records. The average (mean) is:

$$\bar{x}_i = \frac{1}{N} \sum_{k=1}^{N} x_{ik}$$

Where $i = 1,\ 2,\ \dots,\ M$. Then, you simply subtract the mean from each value, resulting in a dataset whose mean is centred around 0, simplifying the remaining steps of the PCA algorithm.

$$x_i = x_i - \bar{x}_i$$

Step 2: Calculate the Covariance Matrix

Covariance $C(x_i, x_j)$ is the measure of changes in variable $x_i$ with respect to the changes in variable $x_j$ according to:

$$C(x_i,\ x_j) = \frac{1}{N-1} \sum_{k=1}^{N} (x_i \cdot x_j)_k$$

Always fun to note that the covariance of a variable with itself is the variance. Since we are solving for the covariance of all variables with all other variables, you get a symmetric matrix with dimensions $\left[ M \times M \right]$.

$$C = \begin{bmatrix} C(x_1, x_1) & C(x_1, x_2) & \dots & C(x_1, x_M)\\ C(x_2, x_1) & C(x_2, x_2) & \dots & C(x_2, x_M)\\ \vdots & \vdots & \ddots & \vdots \\ C(x_M, x_1) & C(x_M, x_2) & \dots & C(x_M, x_M) \end{bmatrix}$$

Remember that covariance can be either positive or negative, and a covariance close to 0 indicates variables are uncorrelated.

For 2 dimensions, it might look like:

$$\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY}\\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix}$$

Step 3: Calculate the EigenValues and EigenVectors

If you have read through the advanced Maths notes, you would have come across these.

The EigenVector shows the direction of the largest variance of the sample covariance matrix, the direction we want to map onto. The largest EigenValue is the spread along that direction of largest variance. Yes, EigenVectors are typical features of typical data. They have meaning, in this case related to the variance. But more generally, they are the directions that remain invariant (up to scaling) under the transformation.

The objective of PCA is to transform the calculated covariance matrix into an optimum form where all variables are uncorrelated linearly to first order. That is, $C(x_i, x_j) = 0,\ i \ne j$. Notice how we state uncorrelated linearly to the first order.

For $y = \ln{(x)}$, even though $y$ depends solely on $x$, it wouldn't appear correlated because correlation measures the linear relationship between variables.

The resulting matrix is a diagonal matrix where all elements equal 0 except those on the diagonal:

$$C' = \begin{bmatrix} \lambda_1 & & 0\\ & \ddots & \\ 0 & & \lambda_M \end{bmatrix}$$

The diagonal elements of the transformed matrix are called the eigenvalues ($\lambda$). To solve for them, you take the determinant of the bloody thing and set it to zero:

$$\det(C - \lambda \cdot I) = \left| C - \lambda \cdot I \right| = 0$$

And the Principal Components (PCs) are the EigenVectors corresponding to the calculated EigenValues.

What is an EigenVector in this context? It is a vector that, when transformed by the covariance matrix, results in a scaled version of the vector. Remember that in the pure mathematical context, the EigenVector was typically a unit vector whose elements could be chosen freely, in a manner of speaking.

The scale is the associated eigenvalue, given by

$$C \cdot PC_i = \left( \lambda_i \cdot I \right) \cdot PC_i, \qquad i = 1,\ 2,\ \dots,\ M$$

So, you get the $i^{th}$ principal component.

Since there are no correlations between the obtained PCs, the eigenvectors are orthogonal vectors.

EigenVectors of a transformation are those vectors that don't change their direction, but just their magnitude; the scaling factor is the EigenValue. Multiplying the transformation matrix with the EigenVector results in the EigenVector scaled by the EigenValue $\lambda$.

$$A \vec{v} = \lambda \vec{v}$$

Here, $A$ is the transformation matrix and $\vec{v}$ is the EigenVector. EigenVectors show the direction of a feature, which is important to us, and the EigenValue tells us something about the magnitude of the feature.

Step 4: Formulate the PCs

The next step is to order all other PCs according to their EigenValues, highest to lowest. The percentage of how much variance each PC represents is calculated by:

$$H_{PC_i} = \frac{\lambda_i}{\lambda_1 + \cdots + \lambda_M} \cdot 100\%$$

So, it is each eigenvalue's share of the total variance. And since we centred the variables' means around 0, their weights should be comparable.

Step 5: Dimensionality Reduction

You can choose to ignore the PCs with less significance. They will appear at the bottom of the PC list with the lowest eigenvalues. You will now have a dataset with $M'$ variables, where $M' \lt M$.

Step 6: Reconstruct the dataset

Then, the data set is reconstructed by the produced PCs with the following:

$$[y]^T = \begin{bmatrix} PC_1 & \cdots & PC_{M'} \end{bmatrix}^T \cdot [x]^T$$

Notice we have $M'$, which might be obvious to some but just in case…

PCA Example

If you are anything like me you need an example to set the record straight. Luckily, the book comes with one.

| $d$ | $x_1$ | $x_2$ |
| --- | --- | --- |
| 1 | 2.5 | 2.4 |
| 2 | 0.5 | 0.7 |
| 3 | 2.2 | 2.9 |
| 4 | 1.9 | 2.2 |
| 5 | 3.1 | 3.0 |
| 6 | 2.3 | 2.7 |
| 7 | 2.0 | 1.6 |
| 8 | 1.0 | 1.1 |
| 9 | 1.5 | 1.6 |
| 10 | 1.1 | 0.9 |

![[/images/notes/data-science/pca-datascience-graph-0001.png]]

The data records are scattered around the diagonal. This means that the diagonal itself would be a better primary axis because it captures the most important variance of the data records.

The logic here is a little confusing. Not all of the data records are on the diagonal because of some variance. We expect that a second axis perpendicular to the diagonal will capture the second-highest variability of these data records. Not sure what that means…

The Principal Component Analysis (PCA) algorithm is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.

When PCA is applied to a dataset, it finds the principal components of the data. The principal components are new axes that are perpendicular to each other and capture the maximum variability in the data.

In the example provided, the data records are not all on the diagonal. This means that the data is not perfectly correlated with any one axis. However, we can expect that a second axis perpendicular to the diagonal will capture the second-highest variability of the data records.

This means that the information in the graph is better described using the diagonal and a new axis perpendicular to it. If we need to reduce the number of variables, we could use only the new (diagonal) axis, as it captures most of the information, and neglect the second new axis which contains less significant information about the variance of the data points.

Here is a simple example to help you understand this concept:

Imagine that we have a dataset of two features: height and weight. We can plot this data in a two-dimensional scatter plot. If the data is perfectly correlated, then all of the data points will lie on a straight line. However, if the data is not perfectly correlated, then the data points will be spread out over the two dimensions.

We can use PCA to find the principal components of this dataset. The first principal component will be the axis that captures the maximum variability in the data. In this case, the first principal component will be an axis that runs along the diagonal of the scatter plot.

The second principal component will be the axis that captures the second-highest variability in the data. In this case, the second principal component will be an axis that is perpendicular to the first principal component.

If we need to reduce the number of variables in our dataset, we can use only the first principal component. This will capture most of the information in the data, and we can neglect the second principal component without losing too much information.

First, find the mean of each variable.

$$\begin{gather*} \bar{x}_1 = 1.81\\ \bar{x}_2 = 1.91 \end{gather*}$$

Second, subtract mean from values.

| $d$ | $x_1$ | $x_2$ | $x_1-\bar{x}_1$ | $x_2-\bar{x}_2$ | $(x_1-\bar{x}_1)(x_2-\bar{x}_2)$ | $(x_1-\bar{x}_1)^2$ | $(x_2-\bar{x}_2)^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.5 | 2.4 | 0.69 | 0.49 | 0.3381 | 0.4761 | 0.2401 |
| 2 | 0.5 | 0.7 | -1.31 | -1.21 | 1.5851 | 1.7161 | 1.4641 |
| 3 | 2.2 | 2.9 | 0.39 | 0.99 | 0.3861 | 0.1521 | 0.9801 |
| 4 | 1.9 | 2.2 | 0.09 | 0.29 | 0.0261 | 0.0081 | 0.0841 |
| 5 | 3.1 | 3.0 | 1.29 | 1.09 | 1.4061 | 1.6641 | 1.1881 |
| 6 | 2.3 | 2.7 | 0.49 | 0.79 | 0.3871 | 0.2401 | 0.6241 |
| 7 | 2.0 | 1.6 | 0.19 | -0.31 | -0.0589 | 0.0361 | 0.0961 |
| 8 | 1.0 | 1.1 | -0.81 | -0.81 | 0.6561 | 0.6561 | 0.6561 |
| 9 | 1.5 | 1.6 | -0.31 | -0.31 | 0.0961 | 0.0961 | 0.0961 |
| 10 | 1.1 | 0.9 | -0.71 | -1.01 | 0.7171 | 0.5041 | 1.0201 |
| SUM | - | - | - | - | 5.539 | 5.549 | 6.449 |

Third is the covariance matrix. Just a sanity check on the covariance…

$$\begin{align*} C(x_1,\ x_2) &= \frac{1}{N-1} \sum_{k=1}^{N} \left( (x_1-\bar{x}_1) \cdot (x_2-\bar{x}_2) \right)_k\\ &= \frac{1}{10-1} \left( 5.539 \right)\\ &= \frac{5.539}{9}\\ &\approx 0.6154\bar{4} \end{align*}$$

But the covariance matrix is a mixture of variances and covariances. Also, it is computed from the mean-centred values, i.e., the deviations around the mean, which makes the numbers smaller…

$$\begin{align*} C &= \begin{bmatrix} C(x_1,x_1) & C(x_1,x_2) \\ C(x_2,x_1) & C(x_2,x_2) \end{bmatrix} \\ \\ C &= \frac{1}{9} \begin{bmatrix} 5.549 & 5.539 \\ 5.539 & 6.449 \end{bmatrix}\\ C &= \begin{bmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{bmatrix} \end{align*}$$

Fourth comes the EigenValues:

$$\begin{align*} 0 &= \left| C-\lambda \cdot I \right| \\ 0 &= \begin{vmatrix} 0.6166-\lambda & 0.6154 \\ 0.6154 & 0.7166-\lambda \end{vmatrix} \\ 0 &= (0.6166-\lambda) \cdot (0.7166-\lambda)-(0.6154^2)\\ 0 &= \lambda^2 - 0.6166 \lambda - 0.7166 \lambda + 0.4418 - 0.3788 \\ 0 &= \lambda^2 - 1.3331 \lambda + 0.063 \end{align*}$$

I’m not going to use the quadratic equation, but trust in the book…

$$\therefore \lambda_1 = 1.284,\ \lambda_2 = 0.049$$

Fifth, we calculate the EigenVectors:

$$\begin{align*} 0 &= (C-\lambda_1 \cdot I) \cdot PC_1 \\ 0 &= \begin{bmatrix} 0.6166-\lambda_1 & 0.6154 \\ 0.6154 & 0.7166-\lambda_1 \end{bmatrix} \cdot \begin{bmatrix} a_1\\ b_1 \end{bmatrix}\\ 0 &= \begin{bmatrix} 0.6166-1.284 & 0.6154 \\ 0.6154 & 0.7166-1.284 \end{bmatrix} \cdot \begin{bmatrix} a_1\\ b_1 \end{bmatrix}\\ 0 &= \begin{bmatrix} -0.6674 & 0.6154 \\ 0.6154 & -0.5674 \end{bmatrix} \cdot \begin{bmatrix} a_1\\ b_1 \end{bmatrix}\\ 0 &= \begin{bmatrix} -0.6674 \cdot a_1 + 0.6154 \cdot b_1 \\ 0.6154 \cdot a_1 - 0.5674 \cdot b_1 \end{bmatrix}\\ \therefore PC_1 &= \begin{bmatrix} a_1\\ b_1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1.084 \end{bmatrix} \end{align*}$$

What does this mean? It's the slope of the line of best fit, or something like that, where

$$m = \frac{b}{a}$$

I guess, as such, it's advantageous to solve for $a_1 = 1$.

So, what about $PC_2$? This will be the line perpendicular to $PC_1$. You can already then guess the values of the eigenvector. Because the math is exactly the same, just with different numbers, here are the values:

$$\therefore PC_2 = \begin{bmatrix} a_2\\ b_2 \end{bmatrix} = \begin{bmatrix} -1.084 \\ 1 \end{bmatrix}$$

According to This Article with Example, this is where you can arrange and select your top-most eigenvalues.

Sixth, we reconstruct the dataset by re-orientating the data from the original axes. The final data set will be the standardized original data set times the feature vectors.

$$y_i = \begin{bmatrix} PC_1 & PC_2 \end{bmatrix}^T \cdot [x_i]^T$$

Per ChatGPT 3.5

Certainly! Principal Component Analysis (PCA) involves transforming your data into a new coordinate system based on the eigenvectors of the data’s covariance matrix. This transformation can be represented as a matrix multiplication.

Here’s how you can perform the PCA transformation step by step:

  1. Center the Data: Start by centering your data by subtracting the mean of each feature (column) from each data point. This ensures that the data is mean-centered.

  2. Compute the Covariance Matrix: Calculate the covariance matrix of the mean-centered data. The covariance matrix represents how each feature relates to every other feature in the data.

  3. Eigenvalue Decomposition: Perform an eigenvalue decomposition of the covariance matrix. This involves finding the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the directions (components) along which the data varies the most, and the eigenvalues represent the amount of variance explained by each eigenvector.

  4. Select Principal Components: Sort the eigenvectors in descending order based on their corresponding eigenvalues. These eigenvectors represent the principal components of the data. You typically choose the top k eigenvectors, where k is the number of dimensions you want to reduce your data to.

  5. Create the Projection Matrix: The projection matrix is composed of the top k eigenvectors, typically arranged as columns. This matrix defines the transformation from the original feature space to the new principal component space.

  6. Transform the Data: To transform your original data into the principal component space, simply multiply your mean-centered data by the projection matrix.

Here’s the matrix multiplication formula for the transformation:

NewData = MeanCenteredData * ProjectionMatrix

Where NewData is the transformed data in the principal-component space, MeanCenteredData is the centred data from step 1, and ProjectionMatrix contains the top k eigenvectors as columns.

To summarize, the key steps are mean-centering your data, calculating the covariance matrix, finding the eigenvectors and eigenvalues, selecting the top k eigenvectors as your projection matrix, and then using this matrix to transform your data into the principal component space.

Keep in mind that the transformed data will have as many columns as the number of eigenvectors you selected (k), which is typically fewer than the original number of features, achieving dimensionality reduction while retaining most of the data’s variance.

Me again, I’ll now walk through calculating the new first row of data. Note, we will use transpose so that the data fits matrix multiplication.

$$\begin{align*} y_i &= \begin{bmatrix} PC_1 & PC_2 \end{bmatrix}^T \cdot [x_i]^T\\ y_1 &= \begin{bmatrix} 1 & -1.084\\ 1.084 & 1 \end{bmatrix}^T \cdot \begin{bmatrix} 0.69 & 0.49 \end{bmatrix}^T\\ y_1 &= \begin{bmatrix} 1 & 1.084\\ -1.084 & 1 \end{bmatrix}\cdot \begin{bmatrix} 0.69 \\ 0.49 \end{bmatrix}\\ y_1 &= \begin{bmatrix} 1 \cdot 0.69 + 1.084 \cdot 0.49 \\ -1.084 \cdot 0.69 + 1 \cdot 0.49 \end{bmatrix}\\ y_1 &= \begin{bmatrix} 1.22116\\ -0.25796 \end{bmatrix} \end{align*}$$

You would, again, look at the transpose of that, but it's your new $x_1$ and $x_2$ values! 🥳

I solved row 2 and compared to the book. Looks like that is how you do it. It amazingly produces a graph with what looks to be 0 correlation. I suppose here you could also then see how the new variables relate to your predictor and drop one if the correlation is low.
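
To tie the steps together, here is a minimal NumPy sketch of the same example (my own illustration, not the book's code). Note that `np.linalg.eigh` returns unit-length eigenvectors, so the projected values differ from the hand calculation above (which used the unnormalised vector $[1, 1.084]$) by a constant scale, and signs may be flipped.

```python
import numpy as np

# The 10 records from the example above (columns: x1, x2)
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

Xc = X - X.mean(axis=0)               # Step 1: subtract the mean
C = np.cov(Xc, rowvar=False)          # Step 2: covariance matrix (uses N-1)
eigvals, eigvecs = np.linalg.eigh(C)  # Step 3: eigenvalues / eigenvectors

order = np.argsort(eigvals)[::-1]     # Step 4: sort PCs by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5/6: (optionally keep only the first column, then) project the data
Y = Xc @ eigvecs
print(eigvals)   # approx [1.284, 0.049]
print(Y[0])      # first transformed record (scale/sign differ from the book)
```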

Per Video Session 6

With data points spread, say, going up and to the right, that trend direction is the direction of the largest variance. Then, the perpendicular line captures the second-largest variance. We use the vectors of variance to transform the data into a new coordinate system. This can be done for any $N$ dimensions.

In signal processing, this is also called the discrete Karhunen-Loève Transform (KLT).

EigenVector Application

As an example of how EigenVectors are used in data science, consider how pixels change during a movie. A change of orientation is a transformation calculation in 3-D space. If we look at a rotation, the EigenVector is the vector that does not change during the rotation; it essentially becomes the axis of rotation.

This can help recognize pixels in images.

5.2 - Cluster Analysis

p. 91

Definition - Clustering: Method of grouping objects together such that objects in the same group have more in common with one another than with objects in other groups. Clustering is an unsupervised learning technique that permits the input data to be grouped into unlabeled, meaningful clusters. Each group shares a certain level of similarity.

There are several clustering approaches.

We will look at K-means clustering and agglomerative clustering.

K-Means Clustering

K-means clustering is an algorithm for grouping a given $N$ data records into $K$ clusters. It does this by using the centroids (arithmetic means). The distance measure is the Euclidean distance. It is considered to be unsupervised since an initial classification does not exist.

The algorithm is:

Step 1: Decide the number of clusters, $K$.

Step 2: Select random data records to represent the centroids of these clusters. The video says, more like… calculate $k$ prototype centroids from random groups of similar vectors.

Step 3: Create the $k$ clusters by calculating the distance between each data record and the defined centroids. Then assign each data record to the cluster containing the centroid closest to it.

We use Euclidean distance, given by:

$$d(i,\ c)= \sqrt{ (x_{1,i}-x_{1,c})^2 + \cdots + (x_{M,i}-x_{M,c})^2 }$$

Where $(x_1,\ x_2,\ \dots,\ x_M)$ are the $M$ data variables, $i$ denotes the data record, and $c$ denotes the cluster's centroid.

Step 4: Recalculate the new centroid for each cluster by averaging the included data records.

Step 5: repeat steps 3 and 4 until there are no further changes in the calculated centroids.

Step 6: The final clusters comprise the data records included within them.

The video lecture goes over interesting graph images. It’s a process starting with random centroids. Calculate distance to create random clusters. Calculate new centroids from the clusters. Calculate new distances, perhaps some values change centroids. And continue until you cannot make anymore meaningful changes.

Check rules of thumb for choosing $k$. Too many clusters will not lead to convergence because there is too much fragmentation.
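
Before the book's example, here is a small NumPy sketch of the loop described in steps 1-6 above. It is my own illustrative implementation, not the book's, and it does not handle the corner case of an empty cluster.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data records as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every record to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 4: new centroid = mean of the records assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1], [8.8, 9.3]])
labels, centroids = k_means(X, k=3)
print(labels)
print(centroids)
```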

Example

The book includes an example but the concept seems straight-forward enough.

So, the first centroid is a point of data. The second and further centroids actually appear to be average points. The example shows that even though the distances may update with the new centroid, since no records changed clusters, there’s no need to recalculate everything.

K-Nearest Neighbour

This is from the video lecture. It sounds similar to K-means clustering. However, K-Nearest Neighbour assigns new data to a cluster according to the proximity to $k$ classified neighbours. Distance can be measured by many methods like Manhattan, Euclidean, Cosine, MSE, etc…

It is like if we already did classification and we just want to assign a random data point into a known cluster.

This is a supervised method since the classification already exists. The selection of $k$ is critical:

The rule of thumb is to use the square root of $n$ as well as an odd $k$ to prevent ties between two classes.

Uses include:

Hierarchical Clustering

Hierarchical Clustering is applied to data that has an underlying hierarchy. The book uses the example that apples and bananas are fruits, while aubergine and courgette are vegetables; however, both groups are fresh produce.

There are two approaches to hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).

Agglomerative Clustering

Agglomerative Clustering creates a bottom-up tree (dendrogram) of clusters that repeatedly merge two nearest points, or clusters, into a bigger super cluster. Algorithm is formulated as follows:

  1. Assign each record of the given $N$ data records a unique cluster, forming $N$ clusters.
  2. Merge data records, or clusters, with minimum Euclidean distance between them into a single cluster.
  3. Repeat this process until there is only one cluster remaining, forming a hierarchy of clusters.

Example

Those steps, to me, are a little confusing, so let's dive into an example. The book looks at a very small dataset of 6 records with 2 variables each. Each record is its own cluster at first. The Euclidean distance is found between points. Data with the smallest distances are merged.

You may think you go straight from 6 clusters to 3; I did at first. But you only merge smallest distances for non-overlapping points. So record 5 cannot merge with record 4 and record 3 in the same round.

If the distance from row 4 to 5 is the smallest, then we create cluster $C_1$. Then, if row 5 to 3 has the smallest distance, we create cluster $C_2$, which incorporates all of $C_1$, not just row 5, and the new point. Continue on this trend, clustering points and clusters.

Apparently you still measure distances between points even after you cluster. I thought you’d get the cluster’s centre, but doesn’t look to be the case. This creates sort of a hierarchy of points as well.
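
For reference, SciPy can build this kind of bottom-up tree directly. `method="single"` merges clusters based on the minimum point-to-point distance, which matches the point-wise distances described above; the six records here are made up by me, not the book's example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up records with two variables each
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size]
Z = linkage(X, method="single", metric="euclidean")
print(Z)

# Cut the resulting dendrogram into, say, 2 flat clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```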

5.3 - Linear Regression

p. 99

Definition - Linear Regression: A method for modeling linear relationships between a dependent variable and one or more independent variables.

The objective of Regression is prediction based on historic data. Building a model is an iterative process to relate independent variables to the dependent variables.

Linear Regression Model

Again, we assume $n$ data records, each with $m$ features, or variables. We want to discover a relationship something like:

$$\hat{y}=w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$$

We are using $w$ as the coefficient to indicate the weight of the variable. Of course, $\hat{y}$ is the predicted value and differs from the actual value by a random error variable,

$$y_i = \hat{y}_i + \varepsilon_i$$

This is because the input data is not an exact linear relationship between independent and dependent variables. Or perhaps we are missing information. Note that:

$$\varepsilon_i = |\hat{y}_i - y_i|$$

We use the absolute value because when you plot points, the error is the vertical distance between the points. And distance cannot be negative.

Correlation

Before we dive into any examples, the instructor talks about the correlation coefficient, or the product-moment correlation according to Pearson. It determines the extent to which values of two variables are linearly related to each other.

population:

$$\rho=\frac{cov(x,y)}{\sigma_x \cdot \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \cdot \sigma_y}$$

sample:

$$\rho=\frac{s_{xy}}{s_x \cdot s_y}= \frac{\sum_{i=1}^n(x_i- \bar{x})(y_i- \bar{y})}{ \sqrt{\sum_{i=1}^n(x_i- \bar{x})^2} \sqrt{\sum_{i=1}^n(y_i- \bar{y})^2} }$$

where you can have

$$\begin{align*} s_{xy} &= \frac{1}{n-1}\sum_{i=1}^n(x_i- \bar{x})(y_i- \bar{y})\\ s_x &= \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i- \bar{x})^2}\\ s_y &= \sqrt{\frac{1}{n-1}\sum_{i=1}^n(y_i- \bar{y})^2} \end{align*}$$

The $n-1$ is what makes it a sample. However, you can see that in the grand scheme of things it cancels out.

The population is much more than just the entire set of data, but consider that you only know the data right now. You don’t have future data and therefore just a sample of data.

The correlation coefficient ranges between $\pm 1$. You cannot say something is twice as related if the coefficients are 0.25 and 0.50; that's not really how it works. But yes, it is more related. Additionally, high correlation does not necessarily indicate a causative relationship between variables. However, if you hypothesise a relationship, the correlation can support your hypothesis.

Coefficient of Determination

The Coefficient of Determination ($r^2$) estimates the proportion of the variation explained by the regression line. In linear regression with a single variable, it is the square of the Pearson correlation coefficient:

$$r^2=\frac{s_{xy}^2}{s_x^2 \cdot s_y^2}$$

It answers the question: how much of the variation in $y$ is described by the variation in $x$? That is, $1-r^2$ shows how much of the change in $y$ is attributed to random error rather than to $x$. This coefficient ranges from $(0,\ 1)$, and can be interpreted as a percentage.

What does 64% mean? Changing something on one axis will change on the other. It means 64% of the variation in one axis is represented in the other, leaving 36% left for the bias. So it is like how much of the variation in yy is explained by the variation in xx.

Might be helpful to look at examples comparing $r$ and $r^2$. You'll notice that $r^2$ changes much faster as variation increases. Note that $r^2$ will also always be positive.
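
A quick sketch of that comparison with made-up data; `np.corrcoef` returns the Pearson $r$, and squaring it gives $r^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + rng.normal(0, 4, size=50)   # linear relationship plus noise

r = np.corrcoef(x, y)[0, 1]
print(r, r**2)   # r^2: share of the variation in y explained by x
```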

Simple Linear Regression Model

This is like a $y=mx+b$ situation, just one independent variable. In this easy example, the line of best fit for a dataset would minimize the sum of squared error, also called the least-squares method.

$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

So, we pick the weights $w_n$ that minimize this function. When you think minimize, you may think solving for when the derivative is 0.

$$\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (\hat{y}_i - w_0-w_1x)^2$$

Since we are solving for weights, I guess that is what we take derivatives with respect to.

$$\frac{\partial}{\partial w_0} \sum_{i=1}^{n} \varepsilon_i^2 = 0$$

This would provide a point where the error is minimal. Work that into our summation:

$$\begin{align*} 0 &= \frac{\partial}{\partial w_0} \sum_{i=1}^{n} (\hat{y}_i - w_0-w_1x)^2\\ &= \sum_{i=1}^{n} 2(\hat{y}_i - w_0-w_1x)\cdot(0-1-0)\\ &= 2 \sum_{i=1}^{n} (w_0+w_1x-\hat{y}_i)\\ &= 2nw_0 + 2 \sum_{i=1}^{n} (w_1x-\hat{y}_i)\\ -2nw_0 &= 2 \sum_{i=1}^{n} (w_1x-\hat{y}_i)\\ w_0 &= \frac{2}{-2n} \sum_{i=1}^{n} (w_1x-\hat{y}_i)\\ w_0 &= \frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i) - \frac{w_1}{n} \sum_{i=1}^{n}(x) \end{align*}$$

That wasn't too bad, but we are going to make an even bigger mess finding $w_1$. This isn't a statistics course, and I will have notes on a statistics course soon… The steps to finding $w_1$ are similar but more complicated, so I'll leave the derivation to the statistics course for the sake of my time and sanity.

$$\begin{align*} 0 &= \frac{\partial}{\partial w_1} \sum_{i=1}^{n} \varepsilon_i^2\\ 0 &= \frac{\partial}{\partial w_1} \sum_{i=1}^{n} (\hat{y}_i - w_0-w_1x)^2\\ &= \sum_{i=1}^{n} 2(\hat{y}_i - w_0-w_1x)\cdot(-x)\\ &= \dots\\ w_1 &= \frac{ \displaystyle n \sum_{i=1}^n (\hat{y}x) - \sum_{i=1}^n (\hat{y})\sum_{i=1}^n (x) }{ \displaystyle n \sum_{i=1}^n (x^2) -\left(\sum_{i=1}^n(x)\right)^2 } \end{align*}$$

Since $w_1$ appears in the formula for $w_0$, you calculate that first.

Example

Basically, you need to calculate the following

$$\begin{gather*} \sum_{i=1}^{n}(x)\\ \sum_{i=1}^{n}(x^2)\\ \sum_{i=1}^{n}(\hat{y}x)\\ \sum_{i=1}^{n}(\hat{y}) \end{gather*}$$

The $\hat{y}$ values that we use are the dependent variables of our dataset. It makes me think the formula is backwards from the beginning, where we should have expanded $\hat{y}$, but the expansion of that is actually more like $\hat{y} = y + \varepsilon$. So, expanding just $y$ is the right call.

Then, with those calculations, you plug-n-chug for the weights / coefficients. Once you have those, you can reconstruct the regression formula.
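
Here is a minimal sketch of that plug-n-chug with made-up numbers; the $\hat{y}$ in the formulas is the observed dependent variable, called `y` below.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # observed dependent variable
n = len(x)

# w1 first, since w0 depends on it
w1 = (n * np.sum(x * y) - np.sum(y) * np.sum(x)) / (n * np.sum(x**2) - np.sum(x)**2)
w0 = np.mean(y) - w1 * np.mean(x)
print(w0, w1)

# Sanity check against NumPy's own least-squares fit (returns [w1, w0])
print(np.polyfit(x, y, deg=1))
```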

Multiple Linear Regression Model

p. 104

What about more realistic datasets, with many records, each having many independent variables? Let’s extend our understanding to a dataset with 2 independent variables.

$$y = w_0 + w_1 x_1 + w_2 x_2$$

The process would be taking the derivative of the error with respect to each coefficient and setting it equal to zero. Now, I just want to look at the following derivatives,

$$\begin{align*} 0 &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n} \varepsilon_i^2\\ &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n}(\hat{y}-y)^2\\ &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n}(\hat{y}-(w_0 + w_1 x_1 + w_2 x_2))^2\\ &= \frac{\partial}{\partial w_0}\sum_{i=1}^{n}(\hat{y}-w_0 - w_1 x_1 - w_2 x_2)^2\\ &= 2\sum_{i=1}^{n}\left( (\hat{y}-w_0 - w_1 x_1 - w_2 x_2)\cdot(0-1-0-0) \right) \end{align*}$$

So, $w_0$ is composed of $w_1$ and $w_2$, making it a little harder to work with. Additionally,

$$\begin{align*} 0 &= \frac{\partial}{\partial w_1}\sum_{i=1}^{n} \varepsilon_i^2\\ &= \frac{\partial}{\partial w_1}\sum_{i=1}^{n}(\hat{y}-w_0 - w_1 x_1 - w_2 x_2)^2\\ &= 2\sum_{i=1}^{n}\left( (\hat{y}-w_0 - w_1 x_1 - w_2 x_2)\cdot(0-0-x_1-0) \right)\\ &= 2\sum_{i=1}^{n}\left( w_0x_1 + w_1 x_1^2 + w_2 x_2 x_1 - \hat{y}x_1 \right) \end{align*}$$

I think it is apparent that additional variables drastically complicate what we are solving for. However, we have as many equations as we do unknown coefficients, and therefore it is solvable. It would be easier to construct a system of linear equations and then solve for the unknowns using techniques of linear algebra, though.
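
A sketch of that linear-algebra route with made-up data: stack a column of ones with the independent variables and let a least-squares solver handle the system, which is equivalent to the zero-derivative conditions above.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 5, size=50)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(0, 0.3, size=50)

# Design matrix: a column of ones (for w0) plus the two variables
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve X w ≈ y in the least-squares sense
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # approx [2.0, 0.5, -1.5]
```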

The book also notes that a large coefficient / weight $w_m$ implies that the target $y$ is highly correlated to the variable $x_m$, and vice versa. But as the number of independent variables increases, the assumption of a linear relationship between the target variable and the other variables becomes weak.

Consequently, nonlinear regression models will produce more accurate predictions.

5.4 - Time-Series Forecasting

p. 105

Definition - Forecasting Models: This is a model for forecasting (predicting) future events based on data and information gleaned from the past.

When people think about forecasting models, the first thought is probably stocks or sales. Some questions you might ask are:

One difference between regular regression analysis and this time-series forecasting is that individual data records depend on previous data records (to some extent). Thus, the forecasting technique must take this into consideration. Observations must be ordered with respect to time instances.

One popular linear forecasting technique is the autoregressive method (AR). This method assumes the expected output is a linear function of some past outputs. This means if the underlying relationship is nonlinear then the AR approach yields suboptimal results. For better results with nonlinear data, we can upgrade the AR method to include moving averages (ARMA) and integral terms (ARIMA).

Concept of Stationarity

Definition - Stationary Time-Series: This type of data has a constant mean and standard deviation over time.

To apply a forecasting model to time-series data, it should be stationary time-series. Only then can the model correctly self-predict its future response from past data points.

Most time-series data is not stationary, but we can convert the data into stationary data with the concept of $d_{\Delta t}$, which is the difference between every two data points on an interval $\Delta t$. I suppose, even if the data increases or decreases over time, we then aim to model the change over each time interval, hoping that is stationary. You can difference back over multiple intervals as well.

There is an optimum value for $\Delta t$, which results in a completely stationary form of the time-series data.

Autoregressive (AR) Model

p. 107

Definition - $lag(n)$: Not sure about the math definition format… This is the backshift of a time-series by $n$ time steps.

The AutoRegressive (AR) model is a linear model developed to predict the value of an observation at the very next point in time using a linear combination of its values at previous time instances.

The name comes from it using data from the same variable at past points in time. You might see $AR(n)$ for a model of order $n$. This refers to the lag defined above, so it is a linear combination of previous data points going back (backshift) $n$ points.

$$y_t = p_0 + p_1y_{t-1} + p_2y_{t-2} + \cdots + p_ny_{t-n} + \varepsilon_t$$

The $p_n$ values are model coefficients, and $\varepsilon_t$ is a white noise term $WN(0, \sigma^2)$, which looks a bit standard normal to me.

Per the video, the model is defined like

$$AR(p): \quad x_t = c + \sum_{i=1}^p\varphi_i x_{t-i}$$

The next point is calculated via the previous $p$ points.
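
A toy sketch of that idea: simulating an $AR(1)$ series $x_t = c + \varphi_1 x_{t-1} + \varepsilon_t$ with coefficients I made up.

```python
import numpy as np

rng = np.random.default_rng(0)
c, phi, n = 0.5, 0.8, 200

x = np.zeros(n)
for t in range(1, n):
    # each point is a linear function of the previous point plus white noise
    x[t] = c + phi * x[t - 1] + rng.normal(0, 1)

print(x[:5])
```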

Moving Average (MA) Model

The moving average model predicts future observations:

$$y_t = q_0 + q_1 \varepsilon_{t-1} + q_2 \varepsilon_{t-2} + \cdots + q_n \varepsilon_{t-n}$$

Here, each $\varepsilon_{t-n}$ is a white noise error term $WN(0,\ \sigma^2)$, and the $q_n$ are model coefficients.

Per the video,

$$MA(q): \quad x_t = \sum_{i=1}^q \theta_i \varepsilon_{t-i}$$

The next point is calculated from the previous $q$ error terms.

Moving Average Model | Wiki: per Wikipedia, it looks a little different,

$$\begin{align*} X_t &= \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \\ &= \mu + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t \end{align*}$$

Where $q$ is the order of the MA model, the errors are white noise, and $\mu$ is the mean of the series. The moving-average model is essentially a linear regression of the current value of the series against current and previous (observed) white noise error terms. The errors are assumed mutually independent and typically follow a normal distribution with mean 0.

Autocorrelation

This is like correlating data with itself, previous data.

The correlation coefficient is how linearly related two variables are, represented by a number in $\left\{ x \in \mathbb{R}\ |\ -1 \le x \le 1 \right\}$, or just $[-1,\ 1]$. We calculate the correlation between the forecasted variable and its value at a specific lag.

The autocorrelation coefficient $ACF(n)$ at $lag(n)$ is given by:

$$ACF(n) = \frac{C(y_t,\ y_{t-n})}{\sqrt{V(y_t)\cdot V(y_{t-n})}}$$

We have the covariance divided by variance.

Autocorrelation (serial correlation) is the correlation of a signal with temporally succeeding portions of the same signal. The function considers having a signal $X=\{ x_1, \dots,x_n \}$ and a delayed portion $X_d=\{ x_{1+d},\dots,x_{n-d} \}$; the function is basically the same as above.

Rewritten to express that we are comparing data with a delayed version of itself, hence the $n-d$, as the last $d$ values have nothing to be paired with… Hard to word, but if we are comparing data to itself back 3 periods, the basic concept is that the first comparison pairs $x_1$ with $x_4$, assuming we don't have an $x_0$ or earlier…

$$r_d = \frac{ \sum_{i=1}^{n-d}(x_i-\bar x)(x_{i+d}-\bar x) }{ \sqrt{ \sum_{i=1}^{n-d}(x_i-\bar x)^2 } \cdot \sqrt{ \sum_{i=1}^{n-d}(x_{i+d}-\bar x)^2 } }$$

The formula starts at $i$ and compares to the future $i+d$, instead of starting at $i+d$ and looking back. It's the same thing, but using a plus $d$ instead of a minus $d$.
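
A small sketch of the $r_d$ formula above as a helper function (my own code, applied to a made-up noisy sine series).

```python
import numpy as np

def acf(x, d):
    """Lag-d autocorrelation, following the r_d formula above."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    a = x[: len(x) - d] - xbar   # x_i       for i = 1 .. n-d
    b = x[d:] - xbar             # x_{i+d}   for i = 1 .. n-d
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 100)) + rng.normal(0, 0.1, size=100)
print([round(acf(x, d), 2) for d in range(1, 5)])
```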

Autocorrelation can be used to detect non-randomness in time series. Randomness is one of the key assumptions to define a univariate statistical process. It assumes that location, scale, and distribution are constant and that the bias is a random error rather than a systematic error. If randomness is not given, a non-linear or time-series model is required.

Typical Applications include:

You find whether data is truly random or if there are patterns in the data over time.

Autocorrelation | Wiki in the discrete time case is the correlation of a signal with a delayed copy of itself as a function of delay.

AutoCorrelation Plot

Autocorrelation plots show a series' autocorrelation coefficients for increasing delays (lags) and are used for checking randomness in a data set. Computing correlations for data at varying temporal delays and plotting them against the delay allows you to identify:

Partial Autocorrelation

We have a partial autocorrelation function $\text{PACF}_k$ at $lag(k)$: the autocorrelation between $y_t$ and $y_{t-k}$ that is not accounted for by the autocorrelations from the $1^{st}$ to the $(k-1)^{st}$ lags.

$$\begin{bmatrix} ACF(0) & ACF(1) & \cdots & ACF(k-1)\\ ACF(1) & ACF(0) & \cdots & ACF(k-2)\\ \vdots & \vdots & \ddots & \vdots \\ ACF(k-1) & ACF(k-2) & \cdots & ACF(0) \end{bmatrix} \begin{bmatrix} PACF_1\\ PACF_2\\ \vdots\\ PACF_k \end{bmatrix} = \begin{bmatrix} ACF(1) \\ ACF(2) \\ \vdots \\ ACF(k) \end{bmatrix}$$

Looks hectic.

Partial Autocorrelation Function (PACF) | Wiki: in time series analysis, this gives the partial correlation of a stationary time series with its own lagged values, regressed on the values of the time series at all shorter lags. Apparently, the contrast is that autocorrelation does not control for the shorter lags. Its purpose is more aimed at identifying the extent of the lag in an autoregressive (AR) model.

ACF vs. PACF

For the plot, you would have the lag on the x-axis. So, at $x=1$, that is a delay of one. And the y-axis is $r$. That means you plot nearly a histogram, or bar chart, of correlation against lag. That is autocorrelation.

Partial autocorrelation somehow doesn’t consider steps between lag.

Let's try to understand with an example. Suppose you have a bag of coloured blocks and you are drawing one block out at a time.

To get a high partial autocorrelation, you need a sequence of draws where the colours are highly correlated, even after taking into account the colours of the previous blocks. Consider drawing a pattern of blocks like: red, blue, red, red, blue, red, red, blue,…

The partial autocorrelation would be high because each block's colour is strongly correlated with the previous block's colour, even after taking into account the colours of the blocks in between.

Partial autocorrelation is more for, like, seasonal things. Partial autocorrelation is rarely high when autocorrelation is low. That really only happens in situations where data is highly seasonal. The autocorrelation would be low because the data is not correlated with itself at different lags. But the partial autocorrelation could be high because the values of the time series are correlated with themselves at different lags, after taking into account the seasonal pattern.

Autocorrelation is simply the correlation between time series and itself at different lags. Partial autocorrelation is much more involved, as it is the correlation between a time series and itself at a given lag, after taking into account the correlations between the time series and itself at all shorter lags. One calculation for partial autocorrelation is to use Yule-Walker equations. Another way to calculate is to use autoregressive (AR) models, which were discussed above.

Calculate the AR model for a time series. Then calculate the partial autocorrelation coefficients for the time series with:

$$\phi_k = \phi_{k-1} + a_k \cdot \phi_{k-1}$$

Where:

This process is recursive, meaning you start with lag 1 and calculate partial autocorrelation coefficients at subsequent lags using the above formula.

Other statistical methods for calculating Partial Autocorrelation are:

Autoregressive Integrated Moving Average (ARIMA) Model

Something also important is stationary data, meaning that neither the mean nor the variance changes over time. If the data is not stationary, then correlation analysis is apparently not possible, or at least not as easy. That is when you resort to methods such as taking differences between intervals, so that the rate of change might be constant.

Two stochastic processes are strongly stationary if their joint probability distribution does not change over time. This means that their mean and variance also does not change significantly.

ARIMA models mix both the AR and MA models with integrated parameters in one model to obtain a better understanding of time-series data. An integrated parameter is the degree of differencing which is performed on the dataset to transform it into a stationary time-series.

Enter ARIMA = Autoregressive Integrated Moving Average model.

The notation is $ARIMA(p,\ d,\ q)$, where we have:

It looks like the sum of a constant (y-intercept), the weighted sum of the previous $p$ values of $y$, and the weighted sum of the previous $q$ forecast errors.

$$y_t = c + p_1 y_{t-1} + \cdots + p_n y_{t-n} + q_1 \varepsilon_{t-1} + \cdots + q_m \varepsilon_{t-m}$$

So, how many terms of $p$ and $q$ do we use? Typically, you must look at plots of the autocorrelation and partial autocorrelation functions of the time-series, which is why we covered them, then consider the following:

  1. If the ACF plot cuts off sharply at $lag(k)$ while there is a more gradual decay in the PACF plot (significance beyond $lag(k)$), then set $q=k$ and $p=0$.
  2. If the reverse, the PACF plot cuts off sharply at $lag(k)$ while the ACF plot has a more gradual decay beyond $lag(k)$, then set $p=k$ and $q=0$.
  3. If there is a single spike at $lag(1)$ in both the ACF and PACF plots, then $p=1$ and $q=0$ if the spike is positive, and $p=0$ and $q=1$ if the spike is negative.

A spike should exceed the 95% confidence interval.
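
In practice you would usually read those spikes off plots rather than compute them by hand. Here is a hedged sketch using statsmodels' plotting helpers; the series is simulated by me, not taken from the book.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Simulate a rough AR(2)-like series so there is something to plot
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(2, len(y)):
    y[t] = 0.6 * y[t - 1] + 0.2 * y[t - 2] + rng.normal()

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=20, ax=axes[0])    # shaded band is roughly the 95% interval
plot_pacf(y, lags=20, ax=axes[1])
plt.show()
```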

The book then covers some examples. The PACF and ACF charts are almost like histograms with correlation on the y-axis and the lag term on the x-axis.

The first example shows the PACF around $0.5$ for lags 1 and 2, then dropping close to zero, while the ACF is also high at first but has a more gradual decline over lag, not dropping off until around lag 7. Therefore, we set $p=2$ and $q=0$ and get:

$$\text{ARIMA}(2,0,0): \quad y_t = c + p_1\ y_{t-1} + p_2\ y_{t-2}$$

The second example is for $ARIMA(0,0,1)$ because there is a spike at $lag(1)$ in the ACF and the PACF decays out to $lag(6)$. This means:

$$ARIMA(0,\ 0,\ 1): \quad y_t = c + q_1 \varepsilon_{t-1}$$

Then we have example 3. They do a one-degree differencing, so $d=1$, to create more stationary data. Then the graphs are quite confusing, but the spikes are at $ACF_2$ and $PACF_5$, giving $ARIMA(5,1,2)$. I won't write out $y_t$ because that would be a lot.

Once a good estimation of the number of terms in the developed model $y_t$ is made, we use the input time-series data to obtain the unknown coefficients. The residuals time-series $R_t$ is the difference between the input time-series and the model's forecasted time-series.

If the ARIMA model was correctly developed, there should be no significant spikes in the ACF and PACF plots of $R_t$. Spikes at later lags don't automatically mean the model is wrong. However, if there's a spike at, say, $lag(2)$ and we used $ARIMA(1,0,0)$ or $ARIMA(0,0,1)$, then we use the following rules:

  1. If the spike is in the ACF plot, we increase q += 1 and refit the model.
  2. If the spike was in the PACF plot, we increase p += 1 and refit the model.

In practice, $p,q \le 3$ for any developed ARIMA model for a business application. It is also advised to avoid mixed models, with both $q$ and $p$ terms. And if you do add additional terms to a developed model on the basis of the residual analysis recommendation, do so one parameter at a time.

Also, some time-series data may first need nonlinear transformation to convert into a form with consistent distribution over time and symmetry in appearance.

Per the video,

$$ARMA(p,q): \quad x_t = c+\sum_{i=1}^p \varphi_ix_{t-i} +\sum_{i=1}^q\theta_i \varepsilon_{t-i}$$

A combination of the $p$ previous points and $q$ error terms. I think it is really interesting to see such a distinction between $p$ and $q$.

There’s also:

$$ARIMA(p,d,q): \quad x_t = c + \sum_{i=1}^p \varphi_i(x_t-x_{t-1}) + \sum_{i=1}^q \theta_i \varepsilon_{t-i}$$

This generates a different stationary series by calculating differences of the $d^{th}$ order, such as:

$$\begin{align*} d_1 &= (x_t-x_{t-1})\\ d_2 &= (x_t-x_{t-1})-(x_{t-1}-x_{t-2})\\ &\ \ \vdots \end{align*}$$

Some More ARIMA Information

Let’s cover what some of the models are used for, it’s interesting…

The course text book has some rules of thumb about using certain ARIMA functions. The rules are part of the exam! Know them well! Check around p. 109.

Seasonal Autoregressive Integrated Model (SARIMA)

ARIMA cannot handle datasets with seasonal components. Seasonal time-series data are cyclical, and require the seasonal autoregressive integrated model (SARIMA). The notation is $SARIMA(p,d,q)(P,D,Q)s$. The first three parameters are the same as for $ARIMA$. The $(P,D,Q)$ are for the backshifts of the seasonal period. The $s$ denotes the number of time steps for a single seasonal period.

Python's statsmodels library supports complete designing, fitting, and forecasting of SARIMA models.

There's also the SARIMAX model, which permits the existence of an exogenous variable $X$ in the dataset. That is an external variable that influences the time-series variations and needs to be considered during the analysis.
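
A hedged sketch of fitting such a model with statsmodels; the order values, the seasonal period of 12, and the simulated series are placeholders of my own, not from the book.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Simulated data: a random walk plus a 12-step seasonal pattern
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=120)) + 5 * np.sin(np.arange(120) * 2 * np.pi / 12)

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 0, 12))
result = model.fit(disp=False)

print(result.summary())
print(result.forecast(steps=12))   # forecast the next seasonal period
```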

5.5 - Transformation Approaches

A dataset transformation involves replacing each $x_i$ with $x_j$ using some function $f$:

$$x_j = f(x_i)$$

The book uses $X_i$ for the new variable, but capital variables typically denote random variables, and there's no random component I am aware of yet.

Why would you transform your data? You might transform variables into a new space, like Cartesian to Radial, and improve interpretability… that’s really all they listed.

It’s kind of like, if you see the dependent and independent variables have a relationship, but it isn’t linear, you want to understand how to make it linear. Then you can regress on it.

Logarithm Transformation

One sort of common transformation is a logarithmic transformation, where we have:

$$x_i'=\ln(x_i)$$

I prefer the natural log, but you can try base 10. This can help make clumpy data appear linear.

Power Law Transformation

The power law transformation represents a family of transformations based on a nonnegative parameter ($\gamma$):

$$x_i'=x_i^{\gamma}$$

Basically, $\gamma$ is initially estimated and then continuously updated during the training phase of the model-building process until the highest performance accuracy is achieved.

The parameter is meant to be nonnegative, but it can still be fractional, giving an interesting shape to the curve.

Reciprocal Transformation

This is:

$$x_i'=\frac{c}{x_i}$$

You can do just $\frac{1}{x}$, but I thought I would liven it up with $c$. Sometimes you are given data like gallons per mile, but you need it in miles per gallon.

Radial Transformation

Radial Transformation focuses on the distance between the value of a variable and the origin. I guess you combine two variables and transform them into the radial coordinates of radius and angle:

$$\begin{gather*} r=\sqrt{x_1^2+x_2^2}\\ \theta=\tan^{-1}\left( \frac{x_1}{x_2} \right) \end{gather*}$$

You might need to look for examples of this.

Discrete Fourier Transform

Well, if it isn’t our old friend…

The Fourier transform shifts variables from their traditional domain of $x$ versus time $t$ to a frequency domain. The Fourier transform determines which frequencies can represent the distribution of a given variable, and is given by:

$$x_n= \sum_{k=0}^{N-1}\left( x_k e^{-2 \pi i n k / N} \right)$$

Where $N$ is the length of the selected frequency band.
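
NumPy's FFT implements this sum; a small sketch with a made-up two-frequency signal:

```python
import numpy as np

N = 128
t = np.arange(N)
x = np.sin(2 * np.pi * 5 * t / N) + 0.5 * np.sin(2 * np.pi * 20 * t / N)

X = np.fft.fft(x)                    # complex spectrum of length N
magnitudes = np.abs(X[: N // 2])     # keep the positive-frequency half
dominant = np.argsort(magnitudes)[::-1][:2]
print(dominant)                      # expect frequency bins 5 and 20
```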


Video (Session 6 of 7)

Going to try and integrate the lecture into these notes since it's probably going to have a lot of equations.

Didn’t really cover different types of transformations oddly enough.


title: Forecasting
subtitle: Principles and Practice
edition: 3
authors: 
	- Rob J. Hyndman
	- George Athanasopoulos
publisher: OTexts
date: 2018
location: Melbourne, Australia
url: "https://otexts.com/fpp3/"

Forecasting: Principles and Practice

This should probably be its own information, but that is for the future perhaps. For now, we combine a couple additional chapters in my Data Science notes.

Preface

This is regarding the 3rd edition, but the 2nd is also available online. The 2nd edition uses the forecast package in R, where the 3rd edition uses tsibble and fable packages.

The course for Data Science suggests reading Ch. 9 of the 2nd edition, which is Dynamic regression models, but in the 3rd edition it is ARIMA models. Both are relevant and the 3rd edition seems to be more current, so I’ll go with that.

Check out the book yourself Forecasting: Principles and Practice.


Ch. 9 - ARIMA Models

Exponential smoothing (Ch. 8 of this book) and ARIMA models are probably the most widely used time series forecasting approaches. ARIMA models try to describe the autocorrelations in the data.

Stationarity and Differencing

A stationary time series has statistical properties that do not depend on the time at which the series is observed. So, a series with seasonality (or a trend) is not stationary, and apparently might require something like exponential smoothing instead. But a white noise series is stationary.

However, cyclic behaviour can be stationary if it doesn't have a trend or seasonality, as long as it doesn't have predictable patterns in the long-term.

Differencing

Differencing is a way to make non-stationary time series stationary by computing the differences between consecutive observations. Additionally, transformation like logarithms can help stabilise the variance of a time series.

The ACF is useful for identifying non-stationary time series. If the data is stationary, the ACF will drop to zero relatively quickly. However, if the data is non-stationary, the ACF will decay slowly.

Random Walk

For a series with $T$ values, if you calculate the difference series, you have $T-1$ values, as you cannot obtain $y_0'$ because you don't have a $y_{-1}$ value (best I can describe).

$$y_t'=y_t-y_{t-1}$$

For a white-noise series, you have

$$y_t-y_{t-1} = \varepsilon_t$$

So, the $\varepsilon_t$ denotes the white noise. And you get a random walk model with

$$y_t=y_{t-1} + \varepsilon_t$$

Random walks typically have:

Second-order differencing

This extends out beyond second, but might look like this…

$$\begin{align*} y_t'' &= y_t' - y_{t-1}'\\ &=(y_t-y_{t-1}) - (y_{t-1}-y_{t-2})\\ &= y_t - 2y_{t-1} + y_{t-2} \end{align*}$$

So, this set now has $T-2$ records. Apparently in practice, you typically won't go beyond second-order differences.

Seasonal Differencing

Seasonal differencing is difference between observation and the previous observation from the same season. Therefore:

$$y_t'=y_t-y_{t-m}$$

These are called lag-m differences.

Sometimes you would take both a seasonal and ordinary difference (AKA first difference) to obtain stationary data. Sorry if the AKA isn’t clear, it only applies to ordinary difference, not the combo of ordinary and seasonal.

It is recommended to do seasonal differences first because sometimes the resulting series is stationary without applying a further first difference. Applying more differences than required can induce incorrect autocorrelations.
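
A quick sketch of first, second-order, and seasonal (lag-12) differences with NumPy; the series is a made-up trend-plus-season example.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=60)) + np.tile(np.arange(12), 5)  # drift + season

first = np.diff(y)            # y'_t = y_t - y_{t-1}, length T-1
second = np.diff(y, n=2)      # y''_t, length T-2
seasonal = y[12:] - y[:-12]   # lag-12 seasonal difference

print(len(first), len(second), len(seasonal))
```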

Unit Root Tests

A unit root test is a statistical hypothesis test of stationarity designed to determine whether differencing is required. There are many such tests; this book covers the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The null hypothesis is that the data are stationary, and we test whether the null hypothesis is false. Small p-values (e.g., $p \lt 0.05$) suggest differencing is required.

R has a function called unitroot_kpss, check the book for examples as I don’t know R well enough at this time.

I did a little digging and found that Python's statsmodels package has a KPSS test tool as well.
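
A small sketch of that statsmodels KPSS test (function and argument names follow the statsmodels docs; the two series are simulated). As above, a small p-value suggests differencing is required.

```python
import numpy as np
from statsmodels.tsa.stattools import kpss

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=200))   # non-stationary
differenced = np.diff(random_walk)              # first difference

for name, series in [("random walk", random_walk), ("differenced", differenced)]:
    stat, p_value, lags, crit = kpss(series, regression="c", nlags="auto")
    print(f"{name}: KPSS stat = {stat:.3f}, p-value = {p_value:.3f}")
```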

The book discusses a unitroot_nsdiffs() R function for determining if seasonal differencing is required.

Backshift notation

The backshift operator $B$ works like:

$$By_t=y_{t-1}$$

Sometimes people use $L$ for "lag", but usually not. This means something like $y_{t-12}=B^{12}y_t$, where $B^{12}$ is just notation for applying the backshift twelve times rather than an ordinary 12th power.

The notation works for differencing like this:

$$y_t'=y_t-y_{t-1} = y_t - By_t=(1-B)y_t$$

In general, it would look like $y_t^{(d)}=(1-B)^dy_t$.

Seasonal and first difference looks like…

$$\begin{align*} (1-B)(1-B^m)y_t &= (1-B-B^m+B^{m+1})y_t\\ &= y_t-y_{t-1}-y_{t-m}+y_{t-m-1} \end{align*}$$

Autoregressive models

The term autoregression indicates that it is a regression of the variable against itself. An autoregressive model of order p can be written as

y_t = c + \phi_1 \cdot y_{t-1} + \phi_2 \cdot y_{t-2} + \cdots + \phi_p \cdot y_{t-p} + \varepsilon_t

where \varepsilon_t is white noise. It is like multiple regression but with lagged values of y_t as predictors. We call this an AR(p) model, an autoregressive model of order p.

We usually restrict autoregressive models to stationary data. We also put some constraints on the values of the parameters:

- For an AR(1) model: -1 \lt \phi_1 \lt 1.
- For an AR(2) model: -1 \lt \phi_2 \lt 1, \phi_1 + \phi_2 \lt 1, and \phi_2 - \phi_1 \lt 1.

When p \ge 3 the restrictions get more complicated. Luckily, R has the fable package that can take care of these restrictions for us when estimating a model.
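Here's a small Python sketch (my own toy simulation, not from the book) of simulating an AR(2) process that satisfies those constraints and recovering the coefficients with statsmodels:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate y_t = c + phi1*y_{t-1} + phi2*y_{t-2} + eps_t
rng = np.random.default_rng(1)
c, phi1, phi2 = 2.0, 0.5, -0.3          # these values satisfy the AR(2) constraints
eps = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

# An AR(2) model is ARIMA(2,0,0); the fitted AR coefficients should be near 0.5 and -0.3
res = ARIMA(y, order=(2, 0, 0)).fit()
print(res.params)
```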

Moving Average Models

A moving average model uses past forecast errors in a regression-like model, like so

y_t = c + \varepsilon_t + \theta_1 \cdot \varepsilon_{t-1} + \theta_2 \cdot \varepsilon_{t-2} + \cdots + \theta_q \cdot \varepsilon_{t-q}

This is called an MA(q) model, a moving average model of order q. Note that you wouldn't actually observe values of \varepsilon_t, so it is not a regression in the typical sense. Each y_t can be thought of as a weighted moving average of the past few forecast errors.

Note: Do not confuse moving average models with the moving average smoothing discussed in a previous chapter of the book.

Any stationary AR(p) model can be written as an MA(\infty) model. For the AR(1) case (with c=0), repeated substitution shows why:

\begin{align*} y_t &= \phi_1y_{t-1}+\varepsilon_t\\ &= \phi_1\left( \phi_1y_{t-2}+\varepsilon_{t-1} \right)+\varepsilon_t\\ &= \phi_1^2 y_{t-2}+ \phi_1 \varepsilon_{t-1} +\varepsilon_t\\ &= \phi_1^2 \left( \phi_1y_{t-3}+\varepsilon_{t-2} \right) + \phi_1 \varepsilon_{t-1} +\varepsilon_t\\ &= \phi_1^3 y_{t-3}+\phi_1^2 \varepsilon_{t-2} + \phi_1 \varepsilon_{t-1} +\varepsilon_t\\ &= \dots \end{align*}

Provided that -1 \lt \phi_1 \lt 1, \phi_1^k gets smaller as k gets bigger, so eventually \phi_1^k y_{t-k} \approx 0 and y_t is just a weighted sum of past errors.
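A quick numerical sanity check of that claim (my own toy example): build an AR(1) series by recursion and compare it with a truncated MA(\infty) sum of the same errors.

```python
import numpy as np

rng = np.random.default_rng(2)
phi1 = 0.6
eps = rng.normal(size=300)

# AR(1) recursion y_t = phi1*y_{t-1} + eps_t (no constant, y_0 = 0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = phi1 * y[t - 1] + eps[t]

# MA(infinity) representation y_t = sum_j phi1^j * eps_{t-j}, truncated at 50 terms
t = 299
approx = sum(phi1 ** j * eps[t - j] for j in range(50))
print(y[t], approx)   # the two numbers agree to many decimal places
```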

The reverse can also hold true with some constraints on the MA parameters; such an MA model is called invertible. Invertible models have some desirable mathematical properties, if you wanted to know why we jumped down this rabbit hole.

The book provides an example and explains why | \theta_1 | \lt 1 is required; it has to do with how the weights on lagged observations behave.

Non-Seasonal ARIMA Models

Combine differencing with autoregression and a moving average model to obtain a non-seasonal ARIMA model.

ARIMA = AutoRegressive Integrated Moving Average. "Integration" is the reverse of differencing in this context.

The model resembles what I have discussed from the course book:

y_t' = c + \phi_1 y_{t-1}' + \cdots + \phi_p y_{t-p}' + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t

y_t' is the differenced series. The book here repeats much of what is in the course notes for data science - selected math techniques. There are some neat special cases of the ARIMA model.

Recall ARIMA(p,d,q):

- p = order of the autoregressive part
- d = degree of first differencing involved
- q = order of the moving average part

Special cases of ARIMA(p,d,q):

- White noise: ARIMA(0,0,0) with no constant
- Random walk: ARIMA(0,1,0) with no constant
- Random walk with drift: ARIMA(0,1,0) with a constant
- Autoregression: AR(p) is ARIMA(p,0,0)
- Moving average: MA(q) is ARIMA(0,0,q)

Recommended to use backshift notation when working with more complicated models.
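Sketch of fitting a non-seasonal ARIMA in Python on made-up data (the book does this with R's fable; here I'm assuming statsmodels):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy random-walk-with-drift style series
rng = np.random.default_rng(3)
y = pd.Series(np.cumsum(0.2 + rng.normal(size=300)))

# ARIMA(1,1,1): one AR term, one difference, one MA term
res = ARIMA(y, order=(1, 1, 1)).fit()
print(res.summary())
```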

Understanding ARIMA Models

The book goes over more in-depth definition of components so you know what an automated function is doing.

ACF and PACF Plots

See book for examples.
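For the plots themselves, statsmodels has helpers (again my own minimal sketch, not the book's R examples):

```python
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(size=300))   # whatever series is under study

fig, axes = plt.subplots(2, 1)
plot_acf(y, ax=axes[0], lags=30)      # autocorrelation function
plot_pacf(y, ax=axes[1], lags=30)     # partial autocorrelation function
plt.show()
```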

Estimation and Order Selection

Maximum likelihood estimation

The R package fable uses maximum likelihood estimation when estimating the ARIMA model.

We have discussed this before.

Information Criteria

Akaike’s Information Criterion (AIC) is useful for selecting predictors for regression and for determining the order of an ARIMA model. It can be expressed as:

AIC = -2 \log(L) + 2 (p + q + k + 1)

where L is the likelihood of the data, k=1 if c \ne 0, and k=0 if c=0.

The corrected AIC for ARIMA models can be written as:

AICc=AIC + \frac{ 2(p+q+k+1)(p+q+k+2) }{ T-p-q-k-2 }

Bayesian Information Criterion can be written as:

BIC = AIC + \left[ \log(T) - 2 \right] \left( p+q+k+1 \right)

Basically, good models are obtained by minimising the AIC, AICc, or BIC. The book prefers the AICc. These criteria are good at selecting values for p and q, but not d.
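The corrections are easy to compute by hand from a model's AIC; a small Python helper using the two formulas above (the numbers in the example call are made up):

```python
import numpy as np

def aicc_and_bic(aic: float, p: int, q: int, k: int, T: int) -> tuple[float, float]:
    """Corrected AIC and BIC from the formulas above; T is the number of observations."""
    aicc = aic + (2 * (p + q + k + 1) * (p + q + k + 2)) / (T - p - q - k - 2)
    bic = aic + (np.log(T) - 2) * (p + q + k + 1)
    return aicc, bic

# Hypothetical example: ARIMA(2,1,1) with a constant, 150 observations, AIC = 412.3
print(aicc_and_bic(aic=412.3, p=2, q=1, k=1, T=150))
```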

ARIMA Modelling in fable

This is about the ARIMA() function in R's fable package.

Read the book for an overview.

Forecasting

Point Forecasts

Point forecasts are calculated using the following 3 steps:

1. Expand the ARIMA equation so that y_t is on the left-hand side and all other terms are on the right.
2. Rewrite the equation by replacing t with T+h.
3. On the right-hand side, replace future observations with their forecasts, future errors with zero, and past errors with the corresponding residuals.

The book gives an example. It’s a good example.

Prediction Intervals

Calculating ARIMA prediction intervals is more difficult, and beyond the scope of… this.
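In practice the software handles both point forecasts and intervals; a hedged statsmodels sketch on toy data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
y = pd.Series(np.cumsum(rng.normal(size=200)))

res = ARIMA(y, order=(0, 1, 1)).fit()

# Point forecasts for the next 10 steps, plus 95% prediction intervals
fc = res.get_forecast(steps=10)
print(fc.predicted_mean)
print(fc.conf_int(alpha=0.05))
```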

Seasonal ARIMA models

The ARIMA(p,d,q)(P,D,Q)_m model is used to represent seasonal data; the latter portion holds the seasonal terms, and m is the seasonal period (the number of observations per year).

Let m=4 (quarterly data); then ARIMA(1,1,1)(1,1,1)_4 is written as:

(1-\phi_1B)(1-\Phi_1B^4)(1-B)(1-B^4)y_t = (1+\theta_1B)(1+\Theta_1B^4)\varepsilon_t
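A statsmodels sketch of that seasonal specification on a made-up quarterly series (seasonal_order is (P, D, Q, m)):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy quarterly series with a repeating seasonal pattern plus noise
rng = np.random.default_rng(6)
idx = pd.period_range("2000Q1", periods=120, freq="Q")
y = pd.Series(10 + np.tile([0.0, 2.0, -1.0, 3.0], 30) + rng.normal(size=120), index=idx)

# ARIMA(1,1,1)(1,1,1)_4
res = ARIMA(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 4)).fit()
print(res.summary())
```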

ACF/PACF

The seasonal part of an AR or MA model will be seen in the seasonal lags of the PACF and ACF.

Plenty of examples in the book.

ARIMA vs ETS

ETS models and ARIMA models have some overlap, but each class contains many models that have no counterpart in the other.

Read more in the book.


Ch. 10 - Dynamic Regression Models

Exponential smoothing and ARIMA models allow for past observations, but not for the inclusion of other, possibly relevant, information. A regular regression model can include a lot of such information, but captures nothing of the subtle time series dynamics.

Recap - regression models can be expressed as:

y_t = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t} + \varepsilon_t

We have k predictor variables, and \varepsilon_t is usually assumed to be an uncorrelated error term (i.e. white noise).

The book mentions a Ljung-Box test for assessing whether the resulting residuals were significantly correlated. Worth looking into perhaps.
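statsmodels has a Ljung-Box implementation; a minimal sketch on a toy regression (an assumed workflow, not the book's):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Fit an ordinary regression, then check its residuals for autocorrelation
rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
ols_res = sm.OLS(y, sm.add_constant(x)).fit()

# Small Ljung-Box p-values would suggest the residuals are significantly correlated
print(acorr_ljungbox(ols_res.resid, lags=[10]))
```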

We are going to allow the errors from a regression to contain autocorrelation! To emphasise this change, we replace \varepsilon_t with \eta_t. This error is assumed to follow an ARIMA model.

So, the regression error is \eta_t and the ARIMA error is \varepsilon_t. Only the ARIMA error is treated as white noise, since the regression error is now assumed to follow an ARIMA process (which itself contains the white noise \varepsilon_t).

Interesting thought process.

Estimation

Based on our previous assumption about the error terms, if we try to minimise the sum of squared errors \eta_t, there are several issues:

- The estimated coefficients are no longer the best estimates, as some information has been ignored.
- Any statistical tests associated with the model (e.g. t-tests on the coefficients) will be incorrect.
- The AICc values of the fitted models are not a good guide to which model is best for forecasting.
- In most cases the p-values of the coefficients will be too small, so some predictors appear important when they are not ("spurious regression").

Instead, we can minimise the sum of squares of \varepsilon_t or use maximum likelihood estimation.

If estimating a regression with ARMA errors, ensure all variables are stationary. Otherwise, the estimated coefficients will not be consistent estimates, and therefore garbage. The exception is when the non-stationary variables are co-integrated.

Therefore, we difference non-stationary variables in the model. To maintain the form of the relationship between yty_t and the predictors, it is common to difference all the variables if any of them need differencing. The resulting model is called a model in differences. A model in levels is what we obtain when the original data are used without differencing.

If all variables in the model are stationary, we only need to consider an ARMA process for the errors. A regression model with ARIMA errors is equivalent to a regression model in differences with ARMA errors.

Regression with ARIMA errors using fable

In R, the ARIMA() function will fit a regression model with ARIMA errors if the exogenous regressors are included in the formula. Suppose we have a model where \eta_t is an ARIMA(1,1,0) error. Our regression model is

\begin{align*} y_t &= \beta_0 + \beta_1 x_t + \eta_t\\ y_t' &= (\beta_0 + \beta_1 x_t + \eta_t)-(\beta_0 + \beta_1 x_{t-1} + \eta_{t-1})\\ y_t' &= (\beta_0-\beta_0) + (\beta_1 x_t - \beta_1 x_{t-1})+ (\eta_t - \eta_{t-1})\\ y_t' &= \beta_1 x_t' + \eta_t'\\ &\tt{where}\\ \eta_t' &= \phi_1 \eta_{t-1}' + \varepsilon_t \end{align*}

Take that with a pinch of salt as it’s a good explanation for why the constant term drops, but not 100% sure about that pesky error term.
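In Python, one hedged way to get this kind of model is statsmodels' ARIMA with an exog argument, which fits the regression and the ARIMA error jointly by maximum likelihood (toy data below; note the constant drops out because of the differencing, as in the derivation above):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(8)
x = pd.Series(rng.normal(size=200)).cumsum()      # an exogenous predictor
eta = pd.Series(rng.normal(size=200)).cumsum()    # a non-stationary error term
y = 3.0 + 1.5 * x + eta

# Regression on x with ARIMA(1,1,0) errors
res = ARIMA(y, exog=x, order=(1, 1, 0)).fit()
print(res.params)     # includes the estimated coefficient on x (should be near 1.5)
```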

Whether differencing is required is determined by applying a KPSS test to the residuals from the regression model estimated using ordinary least squares.

If differencing is required, all variables are differenced and the model re-estimated using maximum likelihood estimation.

The AICc should be calculated for the final model, which can be used to determine the best predictors.

The book continues with examples in R.

Forecasting

To forecast using a regression model with ARIMA errors:

- forecast the regression part of the model,
- forecast the ARIMA (error) part of the model, and
- combine the results.

Future values of the predictors must either be known in advance or forecast themselves.

Book contains many examples.

Stochastic and Deterministic Trends

There are 2 ways to model a linear trend:

- a deterministic trend
- a stochastic trend

Deterministic Trend

y_t=\beta_0 + \beta_1t+\eta_t

Here \eta_t is an ARMA process (i.e. an ARIMA process with d=0).

A stochastic trend is the same model but where \eta_t is an ARIMA process with d=1. You can difference both sides so that you have:

y_t = y_{t-1} + \beta_1 + \eta_t'

Similar to a random walk with drift, but the error term is an ARMA process instead of white noise.

The models have different forecasting characteristics. The book gives examples.

Dynamic Harmonic Regression

Oh God… When there are long seasonal periods, a dynamic regression with Fourier terms is often better than the other models considered so far. Seasonal versions of ARIMA and ETS models are designed for shorter periods, like 12 months for monthly data or 4 quarters for quarterly data. Basically, there are m-1 parameters to be estimated for the initial seasonal states, so for a big m the estimation becomes unmanageable.

Even in R, the ARIMA() function only allows for m \le 350.

For something like a daily period, we prefer a harmonic regression approach, where the seasonal pattern is modelled using Fourier terms and the short-term time series dynamics are handled by an ARMA error.
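A hedged Python sketch of the idea: build K pairs of Fourier terms for the seasonal period by hand and pass them as regressors to a model with ARMA errors (names and data are made up; as far as I can tell, the book's R workflow uses a fourier() helper for the same thing):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def fourier_terms(n: int, m: float, K: int) -> pd.DataFrame:
    """K pairs of sin/cos(2*pi*k*t/m) terms for a seasonal period m."""
    t = np.arange(1, n + 1)
    cols = {}
    for k in range(1, K + 1):
        cols[f"sin{k}"] = np.sin(2 * np.pi * k * t / m)
        cols[f"cos{k}"] = np.cos(2 * np.pi * k * t / m)
    return pd.DataFrame(cols)

# Toy daily series with weekly seasonality (m = 7)
rng = np.random.default_rng(9)
n = 365
y = 10 + 3 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(size=n)

X = fourier_terms(n, m=7, K=3)                  # K controls how smooth the seasonal pattern is
res = ARIMA(y, exog=X, order=(1, 0, 1)).fit()   # ARMA(1,1) error handles short-term dynamics
print(res.params)
```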

Many advantages and one significant disadvantage, check the book.

Lagged Predictors

Consider an advertising campaign, or Covid. Sitting in a room of people with Covid won’t immediately get you sick. You would have a lag of several days. A model that allows for lag can be expressed as:

y_t = \beta_0 + \gamma_0 x_t + \gamma_1 x_{t-1} + \cdots + \gamma_k x_{t-k} + \eta_t

Again, \eta_t is an ARIMA process. The number of lags k can be selected using the AICc, along with the values of p and q for the ARIMA error.
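Building the lagged predictor columns is mostly a data-wrangling step; a small pandas sketch (hypothetical column names):

```python
import pandas as pd

def add_lags(df: pd.DataFrame, col: str, max_lag: int) -> pd.DataFrame:
    """Add columns x_lag1 ... x_lagK holding lagged copies of a predictor."""
    out = df.copy()
    for lag in range(1, max_lag + 1):
        out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna()          # the first max_lag rows have no lagged values

data = pd.DataFrame({"y": range(10), "x": range(10, 20)})
print(add_lags(data, "x", max_lag=3))
```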


Knowledge Check

Q1: Which model or analysis involves the operation of sorting data variables according to their level of changeability along data records?

This is the principal component analysis.

Q2: Why apply clustering analysis to a dataset?

You would do this to group data records according to their similarities.

This analysis does not remove irrelevant variables, reduce data dimensionality, nor estimate missing values.

Q3: What model or analysis provides the value of the independent variable?

It’s not principal component analysis, nor correlation analysis.

It’s a Regression Model! I clearly don’t understand the wording of the question.

Q4: The auto-regressive model assumes what?

A linear relationship between the future output and past outputs.

Q5: What transformation approach transforms data variables to their frequency domain?

That is the Fourier transformation.

Better know the others too: radial, reciprocal, logarithm, etc…