title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022
Not sure if this belongs here or in the Maths section, but here is good for now.
You will learn about:
We will discuss mathematical techniques and models used to transform data into insightful information. There are two modeling approaches for prediction:
Basically, the flow of this section follows the same order as the learning objectives. It’s noted that we separate time-series data because it is both common and requires additional considerations.
Feature extraction is achieved by reducing the dimensions to the ones that significantly contribute to the features of interest. It assumes that data approximately lie on a lower dimensional space. This is achieved by methods such as convolution, radial transforms, integral transforms, or principal components analysis.
We want to isolate important characteristics. Linear regression is a simple form of feature extraction. A dimensionality reduction of 3-dimensional data down to fewer dimensions will be included in the last session.
Input data usually include correlated variables, either redundant or irrelevant. If the correlation is high but not 100%, then there is still some amount of independent information within each variable. However, the juice isn’t worth the squeeze: such variables place more burden on the prediction models than they are worth.
So, we come up with a correlation threshold, and if a variable exceeds that threshold with another variable, we can treat it as redundant information and safely remove it from the dataset.
Definition - Principal Component Analysis (PCA): A statistical analysis method applied to transform potentially correlated variables into uncorrelated variables called principal components (PCs).
This is a method that is applied to transform linearly-correlated variables into uncorrelated variables called principal components (PCs). PCA also sorts the produced uncorrelated variables according to their variance along the data records. Variables at the bottom of the list are said to have low changeability, and can be excluded. This results in a desired reduction in dimensionality of the dataset.
The goal is to construct a new set of variables such that most information is contained within the first few variables. Then, in the machine learning model and regressions phase, we might only use a subset of the new variables.
Are there steps? Actually, the book shows the PCA algorithm, very cool.
The book describes this a little weird so I will give my interpretation. If we have a table of data, each record is a row, and the columns are our features, or variables as the book says. If a record is a person, then one column could be “age”, another could be “height”, and another “gross income”, etc. We can have $m$ different variables, each represented as $x_j$ for $j = 1, \dots, m$.
We also have $n$ records. The average (mean) of variable $x_j$ is:
$$\bar{x}_j = \frac{1}{n}\sum_{d=1}^{n} x_{j,d}$$
Where $n$ is the number of records. Then, you simply subtract the mean from each value, resulting in a dataset whose mean is centred around 0, simplifying the remaining steps of the PCA algorithm.
Covariance is the measure of changes in variable $x_i$ with respect to the changes in variable $x_j$ according to:
$$\text{cov}(x_i, x_j) = \frac{1}{n-1}\sum_{d=1}^{n}\left(x_{i,d} - \bar{x}_i\right)\left(x_{j,d} - \bar{x}_j\right)$$
Always fun to note that the covariance of a variable with itself is the variance. Since we are solving for the covariance of all variables with all other variables, you get a symmetric matrix with dimensions $m \times m$.
Remember that covariance can be either positive or negative, and a covariance close to 0 indicates variables are uncorrelated.
For 2 dimensions, it might look like:
$$C = \begin{pmatrix} \text{cov}(x_1, x_1) & \text{cov}(x_1, x_2) \\ \text{cov}(x_2, x_1) & \text{cov}(x_2, x_2) \end{pmatrix}$$
If you have read through the advanced Maths notes, you would have come across these.
The EigenVector shows the direction we must map to: the direction of the largest variance of the sample covariance matrix. The largest EigenValue is the span along the largest variance. Yes, EigenVectors are typical features of typical data. They have meaning, in this case as the variance. But more generally, they are the directions left invariant (up to scaling) by the transformation.
The objective of PCA is to transform the calculated covariance matrix into an optimum form where all variables are uncorrelated linearly to first order. That is, $\text{cov}(\text{PC}_i, \text{PC}_j) = 0$ for $i \neq j$. Notice how we state uncorrelated linearly to the first order.
For $y = x^2$, even though $y$ is dependent solely on $x$, it wouldn’t appear correlated because correlation measures the linear relationship between variables.
The resulting matrix is a diagonal matrix where all elements equal 0 except those on the diagonal.
And the diagonal elements of the transformed matrix are called the eigenvalues ($\lambda$). To solve, you take the determinant of the bloody thing and set it to zero: $\det(C - \lambda I) = 0$.
And, the Principal Components (PCs) are the EigenVectors corresponding to the calculated EigenValues.
What is an EigenVector in this context? It is a vector that, when transformed by the covariance matrix, results in a scaled version of the vector. Remember that in the pure mathematical context, the EigenVector was typically a unit vector whose elements could be chosen freely, in a manner of speaking.
The scale is the associated eigenvalue, given by
$$C v = \lambda v$$
So, you get the principal component.
Since there are no correlations between the obtained PCs, the eigenvectors are orthogonal vectors.
EigenVectors of a transformation are those vectors that don’t change their direction, but just their magnitude, which is scaled by the EigenValue. Multiplying the transformation matrix $A$ with the EigenVector $v$ results in the EigenVector scaled by the EigenValue $\lambda$: $Av = \lambda v$.
In our case, the covariance matrix plays the role of the transformation matrix. EigenVectors show the direction of a feature, which is important to us, and the EigenValue tells us something about the magnitude of the feature.
The next step is to order all PCs according to their EigenValues, highest to lowest. The percentage of how much variance each PC represents is calculated by:
$$\text{variance explained by PC}_i = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j} \times 100\%$$
So, it is a weighted share of the total. And since we centred the variables’ means around 0, their weights should be comparable.
You can choose to ignore the PCs with less significance. They will appear at the bottom of the PC list with the lowest eigenvalues. You will have a dataset now with $k$ variables, where $k < m$.
Then, the data set is reconstructed from the produced PCs with the following:
$$\text{NewData} = \text{MeanCenteredData} \times \text{ProjectionMatrix}$$
where the ProjectionMatrix holds the chosen EigenVectors as columns.
Notice we now end up with only $k$ columns, which might be obvious to some but just in case…
If you are anything like me you need an example to set the record straight. Luckily, the book comes with one.
d | x1 | x2 |
---|---|---|
1 | 2.5 | 2.4 |
2 | 0.5 | 0.7 |
3 | 2.2 | 2.9 |
4 | 1.9 | 2.2 |
5 | 3.1 | 3 |
6 | 2.3 | 2.7 |
7 | 2 | 1.6 |
8 | 1 | 1.1 |
9 | 1.5 | 1.6 |
10 | 1.1 | 0.9 |
![[/images/notes/data-science/pca-datascience-graph-0001.png]]
The data records are scattered around the diagonal. This means that the diagonal itself would be a better primary axis because it captures the most important variance of the data records.
The logic here is a little confusing. Not all of the data records are on the diagonal because of some variance. We expect that a second axis perpendicular to the diagonal will capture the second-highest variability of these data records. Not sure what that means…
The Principal Component Analysis (PCA) algorithm is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.
When PCA is applied to a dataset, it finds the principal components of the data. The principal components are new axes that are perpendicular to each other and capture the maximum variability in the data.
In the example provided, the data records are not all on the diagonal. This means that the data is not perfectly correlated with any one axis. However, we can expect that a second axis perpendicular to the diagonal will capture the second-highest variability of the data records.
This means that the information in the graph is better described using the diagonal and a new axis perpendicular to it. If we need to reduce the number of variables, we could use only the new (diagonal) axis, as it captures most of the information, and neglect the second new axis which contains less significant information about the variance of the data points.
Here is a simple example to help you understand this concept:
Imagine that we have a dataset of two features: height and weight. We can plot this data in a two-dimensional scatter plot. If the data is perfectly correlated, then all of the data points will lie on a straight line. However, if the data is not perfectly correlated, then the data points will be spread out over the two dimensions.
We can use PCA to find the principal components of this dataset. The first principal component will be the axis that captures the maximum variability in the data. In this case, the first principal component will be an axis that runs along the diagonal of the scatter plot.
The second principal component will be the axis that captures the second-highest variability in the data. In this case, the second principal component will be an axis that is perpendicular to the first principal component.
If we need to reduce the number of variables in our dataset, we can use only the first principal component. This will capture most of the information in the data, and we can neglect the second principal component without losing too much information.
First, find the mean of each variable: $\bar{x}_1 = 1.81$ and $\bar{x}_2 = 1.91$.
Second, subtract the mean from the values.
d | $x_1$ | $x_2$ | $x_1 - \bar{x}_1$ | $x_2 - \bar{x}_2$ | $(x_1-\bar{x}_1)(x_2-\bar{x}_2)$ | $(x_1-\bar{x}_1)^2$ | $(x_2-\bar{x}_2)^2$ |
---|---|---|---|---|---|---|---|
1 | 2.5 | 2.4 | 0.69 | 0.49 | 0.3381 | 0.4761 | 0.2401 |
2 | 0.5 | 0.7 | -1.31 | -1.21 | 1.5851 | 1.7161 | 1.4641 |
3 | 2.2 | 2.9 | 0.39 | 0.99 | 0.3861 | 0.1521 | 0.9801 |
4 | 1.9 | 2.2 | 0.09 | 0.29 | 0.0261 | 0.0081 | 0.0841 |
5 | 3.1 | 3 | 1.29 | 1.09 | 1.4061 | 1.6641 | 1.1881 |
6 | 2.3 | 2.7 | 0.49 | 0.79 | 0.3871 | 0.2401 | 0.6241 |
7 | 2 | 1.6 | 0.19 | -0.31 | -0.0589 | 0.0361 | 0.0961 |
8 | 1 | 1.1 | -0.81 | -0.81 | 0.6561 | 0.6561 | 0.6561 |
9 | 1.5 | 1.6 | -0.31 | -0.31 | 0.0961 | 0.0961 | 0.0961 |
10 | 1.1 | 0.9 | -0.71 | -1.01 | 0.7171 | 0.5041 | 1.0201 |
SUM | - | - | - | - | 5.539 | 5.549 | 6.449 |
Third is the covariance matrix. Just a sanity check on covariance: using the sums from the table and $n-1 = 9$,
$$C = \begin{pmatrix} 5.549/9 & 5.539/9 \\ 5.539/9 & 6.449/9 \end{pmatrix} \approx \begin{pmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{pmatrix}$$
But the covariance matrix is a mixture of variance and covariance. Also, it’s computed from the mean-centred values rather than the raw values, which makes the individual numbers smaller…
Fourth come the EigenValues, from $\det(C - \lambda I) = 0$:
$$\lambda_1 \approx 1.284, \qquad \lambda_2 \approx 0.049$$
I’m not going to grind through the quadratic equation, but trust in the book…
Fifth, we calculate the EigenVectors. For $\lambda_1$ (up to sign):
$$v_1 \approx \begin{pmatrix} 0.678 \\ 0.735 \end{pmatrix}$$
What does this mean? It’s roughly the slope of the line of best fit, or something like that, where the slope is $v_{1,2}/v_{1,1} \approx 1.08$.
I guess, as such, it’s advantageous to solve for the EigenVector of the largest EigenValue first.
So, what about $\lambda_2$? This will be the line perpendicular to $v_1$. You can already then guess the values of the eigenvector. Because the math is exactly the same, just different numbers, here are the values:
$$v_2 \approx \begin{pmatrix} -0.735 \\ 0.678 \end{pmatrix}$$
According to This Article with Example, this is where you can arrange and select your top-most eigenvalues.
Sixth, we reconstruct the dataset by re-orientating the data from the original axes. The final data set will be the standardized original data set times the feature vectors.
Per ChatGPT 3.5
Certainly! Principal Component Analysis (PCA) involves transforming your data into a new coordinate system based on the eigenvectors of the data’s covariance matrix. This transformation can be represented as a matrix multiplication.
Here’s how you can perform the PCA transformation step by step:
Center the Data: Start by centering your data by subtracting the mean of each feature (column) from each data point. This ensures that the data is mean-centered.
Compute the Covariance Matrix: Calculate the covariance matrix of the mean-centered data. The covariance matrix represents how each feature relates to every other feature in the data.
Eigenvalue Decomposition: Perform an eigenvalue decomposition of the covariance matrix. This involves finding the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the directions (components) along which the data varies the most, and the eigenvalues represent the amount of variance explained by each eigenvector.
Select Principal Components: Sort the eigenvectors in descending order based on their corresponding eigenvalues. These eigenvectors represent the principal components of the data. You typically choose the top k eigenvectors, where k is the number of dimensions you want to reduce your data to.
Create the Projection Matrix: The projection matrix is composed of the top k eigenvectors, typically arranged as columns. This matrix defines the transformation from the original feature space to the new principal component space.
Transform the Data: To transform your original data into the principal component space, simply multiply your mean-centered data by the projection matrix.
Here’s the matrix multiplication formula for the transformation:
NewData = MeanCenteredData * ProjectionMatrix
Where:
- `NewData` is the transformed data in the principal component space.
- `MeanCenteredData` is your original data after centering (each row represents a data point, and each column represents a feature).
- `ProjectionMatrix` consists of the top k eigenvectors as columns.

To summarize, the key steps are mean-centering your data, calculating the covariance matrix, finding the eigenvectors and eigenvalues, selecting the top k eigenvectors as your projection matrix, and then using this matrix to transform your data into the principal component space.
Keep in mind that the transformed data will have as many columns as the number of eigenvectors you selected (k), which is typically fewer than the original number of features, achieving dimensionality reduction while retaining most of the data’s variance.
Me again. I’ll now walk through calculating the new first row of data. Note, we will use the transpose so that the data fits the matrix multiplication.
You would again look at the transpose of that, but it’s your new values along $PC_1$ and $PC_2$! 🥳
I solved row 2 and compared to the book. Looks like that is how you do it. It amazingly produces a graph with what looks to be 0 correlation. I suppose here you could also then see how the new variables relate to your predictor and drop one if the correlation is low.
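To convince myself, here is a small numpy sketch (mine, not the book’s) that reproduces the walk-through end to end; the variable names and the `eigh` call are my choices, and the signs of the eigenvectors may be flipped relative to the book.

```python
# PCA on the 10-record (x1, x2) example: centre, covariance,
# eigen-decomposition, then project onto the principal components.
import numpy as np

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

centred = data - data.mean(axis=0)           # subtract the mean of each variable
cov = np.cov(centred, rowvar=False)          # covariance matrix (divides by n-1)
eig_vals, eig_vecs = np.linalg.eigh(cov)     # eigenvalues/eigenvectors (ascending order)

order = np.argsort(eig_vals)[::-1]           # sort PCs by variance, highest first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

explained = eig_vals / eig_vals.sum() * 100  # % of variance carried by each PC
new_data = centred @ eig_vecs                # MeanCenteredData * ProjectionMatrix

print(cov)           # ~[[0.6166, 0.6154], [0.6154, 0.7166]]
print(eig_vals)      # ~[1.284, 0.049]
print(explained)     # PC1 carries roughly 96% of the variance
print(new_data[:2])  # first two records in PC coordinates (signs may differ)
```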
With the spread of data as points say going up and to the right, we say that that trend is the largest variance. Then, the perpendicular line captures the second largest variance. We use the vectors of variance to transform the data into a new coordinate system. This can be done for any dimensions.
In signal processing, this is also called discrete Karhunen-Loeve Transform (KLT).
Example of how EigenVectors are used in data science, we talk about how pixels change during a movie. Moving orientation would be a transformation calculation in 3-d space. If we look at a rotation, the EigenVector is the vector such that it does not change during the rotation? It essentially becomes the axis of rotation.
This can help recognize pixels in images.
p. 91
Definition - Clustering: Method of grouping objects together such that objects in the same group have more in common with one another than with objects in other groups. Clustering is an unsupervised learning technique that permits the input data to be grouped into unlabeled, meaningful clusters. Each group shares a certain level of similarity.
Several clustering approaches:
We will look at K-means clustering and agglomerative clustering.
K-means clustering is an algorithm for grouping given data records into $k$ clusters. It does this by using centroids (arithmetic means). The distance measure is the Euclidean distance. It is considered to be unsupervised since an initial classification does not exist.
The algorithm is:
Step 1: Decide the number of clusters, $k$.
Step 2: Select $k$ random data records to represent the centroids of these clusters. The video says more like… calculate prototype centroids from random groups of similar vectors.
Step 3: Create the clusters by calculating the distance between each data record and the defined centroids. Then assign the data record to the cluster containing the centroid closest to the data record.
We use Euclidean distance, given by:
$$D_{d,c} = \sqrt{\sum_{j=1}^{m}\left(x_{j,d} - x_{j,c}\right)^2}$$
Where the $x_j$ are the data variables, $d$ denotes the data record, and $c$ denotes the cluster’s centroid.
Step 4: Recalculate the new centroid for each cluster by averaging the included data records.
Step 5: repeat steps 3 and 4 until there are no further changes in the calculated centroids.
Step 6: The final clusters comprise the data records included within them.
The video lecture goes over interesting graph images. It’s a process starting with random centroids. Calculate distance to create random clusters. Calculate new centroids from the clusters. Calculate new distances, perhaps some values change centroids. And continue until you cannot make anymore meaningful changes.
Check rules of thumb for choosing $k$. Too many clusters will not lead to convergence because of too much fragmentation.
The book includes an example but the concept seems straight-forward enough.
So, the first centroid is a point of data. The second and further centroids actually appear to be average points. The example shows that even though the distances may update with the new centroid, since no records changed clusters, there’s no need to recalculate everything.
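Here’s a minimal K-means sketch in plain numpy following the steps above; the two-blob data, $k = 2$, and the iteration cap are all illustrative choices of mine.

```python
# K-means: random centroids, assign by Euclidean distance, recompute
# centroids, repeat until nothing changes.
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
k = 2

centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random records as centroids
for _ in range(100):                                      # steps 3-5: iterate until stable
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # Euclidean distances
    labels = dists.argmin(axis=1)                         # assign each record to nearest centroid
    new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    if np.allclose(new_centroids, centroids):             # no further change -> done
        break
    centroids = new_centroids

print(centroids)  # step 6: the final cluster centres
```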
This is from the video lecture. It sounds similar to K-means clustering. However, K-Nearest Neighbour assigns new data to a cluster according to its proximity to classified neighbours. Distance can be measured by many methods like Manhattan, Euclidean, Cosine, MSE, etc…
It is like if we already did classification and we just want to assign a random data point into a known cluster.
This is a supervised method since the classification already exists. The selection of $k$ is critical:
The rule of thumb is to use the square root of $n$ (the number of records) for $k$, as well as an odd $k$ to prevent ties between two classes.
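Here’s a small KNN sketch with scikit-learn’s `KNeighborsClassifier` on my own toy data; it just shows the supervised assign-to-nearest-neighbours idea with an odd $k$.

```python
# KNN: labels are already known (supervised); a new point is assigned
# the majority class of its k nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # class 0
              [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # odd k to avoid ties between two classes
knn.fit(X, y)
print(knn.predict([[2.8, 3.1]]))           # -> [1], the class of its nearest neighbours
```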
Uses include:
Hierarchical Clustering is applied to data that has an underlying hierarchy. The book uses the example that apples and bananas are fruits, while aubergine and courgette are vegetables; however, both groups are fresh produce.
There are two approaches to hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).
Agglomerative Clustering creates a bottom-up tree (dendrogram) of clusters that repeatedly merges the two nearest points, or clusters, into a bigger super cluster. The algorithm is formulated as follows:
Those steps, to me, are a little confusing so lets dive into an example. The book looks at a very small dataset of 6 records with 2 variables each. Each dataset is its own cluster at first. The Euclidean distance is found between points. Data with smallest distances are merged.
You may think you go straight from 6 clusters to 3; I did at first. But you only merge the smallest distances for non-overlapping points. So record 5 cannot merge with record 4 and record 3 in the same round.
If the distance from row 4 to 5 is the smallest, then we create cluster $C_1 = \{4, 5\}$. Then, if row 5 to 3 has the smallest distance, we create cluster $C_2 = \{3, 4, 5\}$, which incorporates all of $C_1$, not just row 5, and the new point. Continue on this trend, clustering points and clusters.
Apparently you still measure distances between points even after you cluster. I thought you’d get the cluster’s centre, but doesn’t look to be the case. This creates sort of a hierarchy of points as well.
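A quick agglomerative sketch with scipy; `method="single"` merges on the smallest point-to-point distance, which is my reading of the point-wise merging described above. The six points are made up.

```python
# Bottom-up clustering: linkage() records the merge history (the dendrogram),
# fcluster() cuts the tree into a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.2],
              [5.1, 4.9], [5.0, 5.0], [9.0, 9.1]])

Z = linkage(X, method="single", metric="euclidean")  # each row: clusters merged, distance, size
print(Z)
print(fcluster(Z, t=3, criterion="maxclust"))        # cut the tree into 3 clusters
# dendrogram(Z) would draw the tree (needs matplotlib).
```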
p. 99
Definition - Linear Regression: A method for modeling linear relationships between a dependent variable and one or more independent variables.
The objective of Regression is prediction based on historic data. Building a model is an iterative process to relate independent variables to the dependent variables.
Again, we assume $n$ data records, each with $m$ features, or variables. We want to discover a relationship something like:
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$$
We are using $w_j$ as the coefficient to indicate the weight of variable $x_j$. Of course $\hat{y}$ is the predicted value and differs from the actual value $y$ by a random error variable, $\epsilon$.
This is because the input data is not an exact linear relationship between independent and dependent variables. Or perhaps we are missing information. Note that:
$$\epsilon = \left| y - \hat{y} \right|$$
We use the absolute value because when you plot points, the error is the vertical distance between the points. And distance cannot be negative.
Before we dive into any examples, the instructor talks about the correlation coefficient, or the product-moment correlation according to Pearson. It determines the extent to which values of two variables are linearly related to each other.
population:
$$\rho_{x,y} = \frac{\sum_{i=1}^{N}\left(x_i - \mu_x\right)\left(y_i - \mu_y\right)}{N\, \sigma_x \sigma_y}$$
sample:
$$r_{x,y} = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{(n-1)\, s_x s_y}$$
where $s_x$ and $s_y$ are the sample standard deviations, which themselves contain an $n-1$.
The $n-1$ is what makes it a sample. However, you can see that in the grand scheme of things it cancels out.
The population is much more than just the entire set of data, but consider that you only know the data right now. You don’t have future data and therefore just a sample of data.
The correlation coefficient ranges between $-1$ and $+1$. You cannot say something is twice as related if the coefficients are 0.25 and 0.50; that’s not really how it works. But yes, it is more related. Additionally, high correlation does not necessarily indicate a causative relationship between variables. However, if you hypothesise a relationship, the correlation can support your hypothesis.
The Coefficient of Determination estimates the proportion of the variation explained by the regression line. In linear regression with a single variable, it is the square of the Pearson correlation coefficient:
$$R^2 = r_{x,y}^2$$
It answers the question: how much of the variation in $y$ is described by the variation in $x$? The remainder, $1 - R^2$, is attributed to random error. This coefficient ranges from $0$ to $1$, and can be interpreted as a percentage.
What does 64% mean? Changing something on one axis will change something on the other. It means 64% of the variation in one axis is represented in the other, leaving 36% for the random error. So it is like how much of the variation in $y$ is explained by the variation in $x$.
Might be helpful to look at examples comparing $r$ and $R^2$. You’ll notice that $R^2$ changes much faster as the variation increases. Note that $R^2$ will also always be positive.
This is like a $\hat{y} = w_0 + w_1 x$ situation, just one independent variable. In this easy example, the line of best fit for a dataset would minimize the sum of squared errors, also called the least-squares method:
$$SSE = \sum_{d=1}^{n}\left(y_d - \hat{y}_d\right)^2 = \sum_{d=1}^{n}\left(y_d - w_0 - w_1 x_d\right)^2$$
So, we pick the weights $w_0$ and $w_1$ that minimize this function. When you think minimize, you may think solving for when the derivative is 0.
Since we are solving for weights, I guess that is what we take derivatives with respect to:
$$\frac{\partial\, SSE}{\partial w_0} = -2\sum_{d=1}^{n}\left(y_d - w_0 - w_1 x_d\right) = 0$$
This would provide a point where the error is minimal. Work that into our summation:
$$w_0 = \bar{y} - w_1 \bar{x}$$
That wasn’t too bad, but we are going to make an even bigger mess finding $w_1$. This isn’t a statistics course, and I will have notes on a statistics course soon… The steps to finding $w_1$ are more complicated but similar. I’ll leave the derivation to the statistics course for the sake of my time and sanity:
$$w_1 = \frac{\sum_{d=1}^{n}\left(x_d - \bar{x}\right)\left(y_d - \bar{y}\right)}{\sum_{d=1}^{n}\left(x_d - \bar{x}\right)^2}$$
Since $w_1$ appears in the formula for $w_0$, you calculate $w_1$ first.
Basically, you need to calculate the following: $\bar{x}$, $\bar{y}$, $\sum_d (x_d - \bar{x})(y_d - \bar{y})$, and $\sum_d (x_d - \bar{x})^2$.
The values that we use are the dependent variables of our dataset. It makes me think the formula is backwards from the beginning where we should have expanded , but the expansion of that is actually more like . So, expanding just is the right call.
Then, with those calculations, you plug-n-chug for the weights / coefficients. Once you have those, you can reconstruct the regression formula.
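Here’s the plug-n-chug in numpy on a made-up dataset, just to see the $w_1$-then-$w_0$ order in action; `np.polyfit` is only there as a sanity check.

```python
# Single-variable least squares: slope w1 from the sums of centred
# products, then intercept w0 from the means.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
w0 = y_bar - w1 * x_bar                                            # intercept uses w1
print(w0, w1)               # regression formula: y_hat = w0 + w1 * x
print(np.polyfit(x, y, 1))  # sanity check: returns [w1, w0]
```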
p. 104
What about more realistic datasets, with many records, each having many independent variables? Let’s extend our understanding to a dataset with 2 independent variables.
The process would be taking the derivative of the error with respect to each coefficient and setting it equal to zero. Now, I just want to look at the derivatives themselves.
So, $\hat{y}$ is now composed of $w_0$, $w_1 x_1$, and $w_2 x_2$, making it a little harder to work with.
I think it is apparent that additional variables drastically complicate what we are solving for. However, we have as many equations as we do unknowns, and therefore it is solvable. It would be easier to construct a system of linear equations and then solve for the unknowns using techniques of linear algebra, though; a sketch of that is below.
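A sketch of that linear-algebra route with numpy’s least-squares solver; the two-variable dataset and the true coefficients are invented for illustration.

```python
# Multiple regression: stack an intercept column onto X and let
# lstsq solve the normal equations for all weights at once.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # x1, x2
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(0, 0.1, 100)

A = np.column_stack([np.ones(len(X)), X])     # column of 1s for the intercept w0
w, *_ = np.linalg.lstsq(A, y, rcond=None)     # solves the least-squares system
print(w)                                      # ~[1.5, 2.0, -0.7]
```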
The book also notes that a large coefficient / weight implies that the target is highly correlated to the corresponding variable, and vice versa. But as the number of independent variables increases, the assumption of a linear relationship between the target variable and the other variables becomes weak.
Consequently, nonlinear regression models will produce more accurate predictions.
p. 105
Definition - Forecasting Models: This is a model for forecasting (predicting) future events based on data and information gleaned from the past.
When people think about forecasting models, the first thought is probably stocks or sales. Some questions you might ask are:
One difference between regular regression analysis and this time-series forecasting is that individual data records depend on previous data records (to some extent). Thus, the forecasting technique must take this into consideration. Observations must be ordered with respect to time instances.
One popular linear forecasting technique is the autoregressive method (AR). This method assumes the expected output is a linear function of some past outputs. This means if the underlying relationship is nonlinear then the AR approach yields suboptimal results. For better results with nonlinear data, we can upgrade the AR method to include moving averages (ARMA) and integral terms (ARIMA).
Definition - Stationary Time-Series: This type of data has a constant mean and standard deviation over time.
To apply a forecasting model to time-series data, it should be stationary time-series. Only then can the model correctly self-predict its future response from past data points.
Most time-series data are not stationary, but we can convert the data into stationary data with the concept of differencing with lag $d$, which is the difference between every two data points separated by an interval $d$. I suppose, even if the data increase or decrease over time, we aim to model the change over each time interval, hoping that is stationary. You can model back multiple intervals as well.
There is an optimum value for $d$, which results in a completely stationary form of the time-series data.
p. 107
Definition - Lag (backshift): Not sure about the math definition format… This is the backshift of a time-series by $d$ time steps, i.e., $y_{t-d}$.
The AutoRegressive (AR) model is a linear model developed to predict the value of an observation at the very next point in time using a linear combination of its values at previous time instances.
The name comes from it using data from the same variable at past points in time. You might see $AR(p)$ for a model of order $p$. This refers to the lag defined above, so it is a linear combination of previous data points going back, by backshift, $p$ points.
The $\phi_i$ values are model coefficients, and $\epsilon_t$ is a white noise term, $\epsilon_t \sim N(0, \sigma^2)$, which looks a bit standard normal to me.
Per the video, the model is defined like
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t$$
The next point is calculated via the previous points.
The moving average model predicts future observations from past forecast errors:
$$y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$
Here, each $\epsilon_{t-i}$ is a white noise error term, and the $\theta_i$ are model coefficients.
Per the video, the next point is calculated from the previous error terms.
Moving Average Model | Wiki, per Wikipedia, looks a little different,
$$y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}$$
Where $q$ is the order of the MA model, the errors $\epsilon_t$ are white noise, and $\mu$ is the mean of the series. The Moving-Average model is essentially a linear regression of the current value of the series against current and previous (observed) white noise error terms. The errors are assumed mutually independent and typically follow a normal distribution with mean 0.
This is like correlating data with itself, previous data.
The correlation coefficient is how linearly related two variables are, represented by a number between $-1$ and $+1$, or just $[-1, 1]$. We calculate the correlation between a forecasted variable and its value at a specific lag.
The autocorrelation coefficient at lag $k$ is given by:
$$r_k = \frac{\sum_{t=k+1}^{n}\left(y_t - \bar{y}\right)\left(y_{t-k} - \bar{y}\right)}{\sum_{t=1}^{n}\left(y_t - \bar{y}\right)^2}$$
We have the covariance divided by the variance.
Autocorrelation (serial correlation) is the correlation of a signal with temporally succeeding portions of the same signal. The function considers having a signal $y_t$ and a delayed portion $y_{t-k}$; the function is basically the same as above.
It is written to express that we are comparing data with a delayed version of itself, hence the sum starting at $t = k + 1$, as we cannot compare the first $k$ points to delayed versions… Hard to word, but if we are looking at comparing data to itself back 3 periods, then we must start at $t = 4$ is the basic concept, assuming we don’t have a $y_0$ or earlier…
Some write the formula starting at $t = 1$ and comparing to the future $y_{t+k}$, instead of starting at $t = k + 1$ and looking back. It’s the same thing, but using a plus $k$ instead of a minus $k$.
Autocorrelation can be used to detect non-randomness in time series. Randomness is one of the key assumptions to define a univariate statistical process. It assumes that location, scale, and distribution are constant and that the bias is a random error rather than a systematic error. If randomness is not given, a non-linear or time-series model is required.
Typical Applications include:
You find whether data is truly random or if there are patterns in the data over time.
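Here’s a small numpy sketch of the autocorrelation coefficient at lag $k$ as defined above (covariance with the $k$-step-delayed copy, divided by the variance); the seasonal-looking series is made up.

```python
# Autocorrelation at lag k, computed directly from the definition.
import numpy as np

def autocorr(y, k):
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    num = np.sum((y[k:] - y_bar) * (y[:-k] - y_bar))  # pairs (y_t, y_{t-k}), t = k+1..n
    den = np.sum((y - y_bar) ** 2)                    # variance term
    return num / den

t = np.arange(120)
y = np.sin(2 * np.pi * t / 12) + np.random.default_rng(1).normal(0, 0.2, 120)
print([round(autocorr(y, k), 2) for k in range(1, 13)])  # peaks again near lag 12
```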
Autocorrelation | Wiki in the discrete time case is the correlation of a signal with a delayed copy of itself as a function of delay.
Autocorrelation plots show a series autocorrelation coefficients for increasing delays (lags) and are used for checking randomness in a data set. Computing correlations for data at varying temporal delays and plotting them against the delay, allows you to identify:
We have a partial autocorrelation function: the partial autocorrelation at lag $k$ is the autocorrelation between $y_t$ and $y_{t-k}$ that is not accounted for by the autocorrelations from the $1^{\text{st}}$ to the $(k-1)^{\text{th}}$ lags.
Looks hectic.
Partial Autocorrelation Function (PACF) | Wiki: in time series analysis, the PACF gives the partial correlation of a stationary time series with its own lagged values, regressed on the values of the time series at all shorter lags. Apparently, the contrast is that autocorrelation does not control for the shorter lags. Its purpose is more aimed at identifying the extent of the lag in an autoregressive (AR) model.
For the plot, you would have the lag on the x-axis. So, at lag 1, that is a delay of one. And the y-axis is the correlation coefficient $r_k$. That means you plot nearly a histogram, or bar chart, of correlation against lag. That is autocorrelation.
Partial autocorrelation somehow doesn’t consider the steps between lags.
Let’s try to understand with an example. Suppose you have a bag of coloured blocks and you are drawing one block out at a time.
Autocorrelation would be a measure of how similar the colour of the next block is to the colour of the previous block. That is, if you draw a red block, then a blue block, then red again, the autocorrelation would be high. This is because the colour of the next block was similar to the colour of the previous one.
Partial autocorrelation would be the measure of how similar the colour of the next block is to the colour of the previous block, after taking into account the colour of the current block. So, in the same situation of drawing red, blue, red, the partial autocorrelation would be low because the colour of the next block (red) was not similar to the colour of the previous block (blue), after taking into account the colour of the current block (red).
Autocorrelation
Partial Autocorrelation
To get a high partial autocorrelation, you need a sequence of draws where the colours are highly correlated, even after taking into account the colour of the previous blocks. Consider drawing a pattern of blocks like: red, blue, red, red, blue, red, red, blue,…
The partial autocorrelation would be high because each block’s colour is strongly correlated with the previous block’s colour, even after taking into account the colour of the current block.
Partial autocorrelation is more for, like, seasonal things. Partial autocorrelation is rarely high when autocorrelation is low. That really only happens in situations where data is highly seasonal. The autocorrelation would be low because the data is not correlated with itself at different lags. But the partial autocorrelation could be high because the values of the time series are correlated with themselves at different lags, after taking the seasonal pattern into account.
Autocorrelation is simply the correlation between time series and itself at different lags. Partial autocorrelation is much more involved, as it is the correlation between a time series and itself at a given lag, after taking into account the correlations between the time series and itself at all shorter lags. One calculation for partial autocorrelation is to use Yule-Walker equations. Another way to calculate is to use autoregressive (AR) models, which were discussed above.
Calculate the AR model for a time series. Then calculate the partial autocorrelation coefficients for the time series with:
Where:
This process is recursive, meaning you start with lag 1 and calculate partial autocorrelation coefficients at subsequent lags using the above formula.
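Rather than hand-rolling the recursion, here’s a hedged sketch using statsmodels, whose `pacf()` defaults to a Yule-Walker style estimate; the AR(2)-ish series is simulated just to show the ACF decaying while the PACF cuts off.

```python
# ACF vs PACF on a simulated AR(2)-like series.
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(2)
y = np.zeros(300)
for t in range(2, 300):                 # y_t = 0.6 y_{t-1} + 0.2 y_{t-2} + noise
    y[t] = 0.6 * y[t - 1] + 0.2 * y[t - 2] + rng.normal()

print(acf(y, nlags=10))                 # decays gradually over the lags
print(pacf(y, nlags=10))                # roughly cuts off after lag 2
# plot_acf / plot_pacf from statsmodels.graphics.tsaplots draw the bar charts.
```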
Other statistical methods for calculating Partial Autocorrelation are:
Something also important is stationary data, meaning that neither the mean nor the variance changes over time in the data. If the data is not stationary, then correlation analysis is apparently not possible, or at least not as easy. That is when you resort to methods such as taking differences between intervals. So, the rate of change might be constant.
Two stochastic processes are strongly stationary if their joint probability distribution does not change over time. This means that their mean and variance also do not change significantly.
ARIMA models mix both the AR and MA models with integrated parameters in one model to obtain a better understanding of time-series data. An integrated parameter is the degree of differencing which is performed on the dataset to transform it into a stationary time-series.
Enter ARIMA = Autoregressive Integrated Moving Average model.
The notation is $ARIMA(p, d, q)$, where we have:
- $p$: the number of autoregressive (AR) terms
- $d$: the degree of differencing (the integrated part)
- $q$: the number of moving average (MA) terms

It looks like the sum of a constant (y-intercept), a weighted sum of the previous $p$ values of $y$, and a weighted sum of the previous $q$ forecast errors.
So, how many terms of $p$ and $q$ do we use? Typically, you must look at plots of the autocorrelation and partial autocorrelation functions of the time-series, which is why we covered them, then consider the following:
A spike should exceed the 95% confidence interval.
The book then covers some examples. The PACF and ACF charts are almost like histograms with correlation on the y-axis and the lag term on the x-axis.
The first example shows PACF spikes for lags 1 and 2 that then drop close to zero, where the ACF is also high at first but has a more gradual decline over the lags, not dropping off until around lag 7. Therefore, we set $p = 2$ and $q = 0$ and get:
$$\hat{y}_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t$$
The second example is an $MA(1)$, so $q = 1$, because there is a spike at lag 1 for the ACF and the PACF decays gradually to zero. This means:
$$\hat{y}_t = c + \epsilon_t + \theta_1 \epsilon_{t-1}$$
Then we have example 3. They do one degree of differencing, so $d = 1$, to create more stationary data. Then the graphs are quite confusing, but the spikes suggest both AR and MA terms, giving an $ARIMA(p, 1, q)$ model. I won’t write the full model out because that would be a lot.
Once a good estimation of the number of terms in the developed model is made, we use the input time-series data to obtain the unknown coefficients. The residuals time-series is the difference between the input time-series and the model’s forecasted time-series.
If the ARIMA model was correctly developed, there should be no significant spikes in the ACF and PACF plots of the residuals. Spikes at later lags don’t automatically mean the model is wrong. However, if there’s a spike at an early lag, say lag 1, and we already used $p$ or $q$ terms, then we use the following rules:
- If there is a significant spike in the ACF of the residuals, set `q += 1` and refit the model.
- If there is a significant spike in the PACF of the residuals, set `p += 1` and refit the model.

In practice, keep the total number of $p$ and $q$ terms small for any developed ARIMA model for a business application. It is also advised to avoid mixed models, with both $p$ and $q$ terms. And if you do add additional terms to a developed model on the basis of the residual analysis recommendation, do so one parameter at a time.
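Here’s a hedged sketch of the fit-then-check-residuals loop with statsmodels’ `ARIMA`; the series and the $(1, 1, 1)$ order are illustrative, not a recommendation.

```python
# Fit an ARIMA(p, d, q), inspect the residuals' ACF/PACF, and forecast.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(0.2, 1.0, 200))     # a drifting, non-stationary series

model = ARIMA(y, order=(1, 1, 1)).fit()      # p=1, d=1, q=1
print(model.summary())

resid = model.resid                          # residual time-series e_t
print(acf(resid, nlags=10))                  # no big early spikes -> q is probably fine
print(pacf(resid, nlags=10))                 # no big early spikes -> p is probably fine
print(model.forecast(steps=5))               # next 5 forecasted values
```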
Also, some time-series data may first need nonlinear transformation to convert into a form with consistent distribution over time and symmetry in appearance.
Per the video, ARIMA is a combination of the previous points and the previous error terms. I think it is really interesting to see such a distinction between the AR and MA terms.
There’s also the integrated (I) part:
This generates a different, stationary series by calculating differences of order $d$, such as:
$$y'_t = y_t - y_{t-1}$$
Let’s cover what some of the models are used for, it’s interesting…
The course text book has some rules of thumb about using certain ARIMA functions. The rules are part of the exam! Know them well! Check around p. 109.
ARIMA cannot handle datasets with seasonal components. Seasonal time-series data are cyclical and require the seasonal autoregressive integrated moving average model (SARIMA). The notation is $SARIMA(p, d, q)(P, D, Q)_m$. The first three are for the non-seasonal part. The $(P, D, Q)$ are for the backshifts of the seasonal period. The $m$ denotes the number of time steps for a single seasonal period.
Python’s `statsmodels` library supports complete designing, fitting, and forecasting of SARIMA models.
There are also SARIMAX models, which permit the existence of an exogenous variable in the dataset. That is an external variable that influences the time-series variations and needs to be considered during the analysis.
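A sketch of a seasonal fit with an exogenous variable using statsmodels’ `SARIMAX`; the orders, the $m = 12$ period, and the data are all illustrative.

```python
# SARIMA(p, d, q)(P, D, Q)_m with an exogenous regressor (the "X" part).
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(4)
t = np.arange(120)
exog = rng.normal(size=(120, 1))                          # an external influencing variable
y = 10 + 2 * np.sin(2 * np.pi * t / 12) + 0.5 * exog[:, 0] + rng.normal(0, 0.3, 120)

model = SARIMAX(y, exog=exog, order=(1, 0, 0),
                seasonal_order=(1, 1, 0, 12)).fit(disp=False)  # m = 12 steps per season
print(model.summary())
print(model.forecast(steps=12, exog=rng.normal(size=(12, 1))))  # future exog values are needed
```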
A dataset transformation involves replacing each value $x$ with a new value $z$ using some function $f$:
$$z = f(x)$$
The book uses a capital letter for the new variable, but capital variables typically denote random variables, and there’s no random component I am aware of yet.
Why would you transform your data? You might transform variables into a new space, like Cartesian to Radial, and improve interpretability… that’s really all they listed.
It’s kind of like, if you see the dependent and independent variables have a relationship, but it isn’t linear, you want to understand how to make it linear. Then you can regress on it.
One sort of common transformation is the logarithmic transformation, where we have:
$$z = \log(x)$$
I prefer the natural log, but you can try base 10. This can help make clumpy data appear linear.
The power law transformation represents a family of transformations based on a nonnegative parameter ($\lambda$):
$$z = x^{\lambda}$$
Basically, $\lambda$ is initially estimated and then continuously updated during the training phase of the model-building process until the highest performance accuracy is achieved.
The parameter $\lambda$ is meant to be nonnegative, but it can still be fractional, giving an interesting shape to the curve.
The reciprocal transformation is:
$$z = \frac{1}{x}$$
You can do just $1/x$, but I thought I would liven it up. Sometimes you are given data like gallons per mile, but you need it in miles per gallon.
Radial Transformation focuses on the distance between the value of a variable and the origin. I guess you combine two variables and transform them into the radial coordinates of radius and angle:
$$r = \sqrt{x_1^2 + x_2^2}, \qquad \theta = \arctan\!\left(\frac{x_2}{x_1}\right)$$
You might need to look for examples of this.
Well, if it isn’t our old friend…
The Fourier Transform shifts variables from their traditional domain of value versus time $t$ to a frequency domain. The Fourier transform determines which frequencies can represent the distribution of a given variable, and in the discrete case is given by:
$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$$
Where $N$ is the length of the selected frequency band.
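A small numpy sketch of pushing a variable into the frequency domain with the discrete Fourier transform; the two-sine signal is made up so the dominant frequencies are known in advance.

```python
# DFT of a real-valued signal: find which frequencies dominate it.
import numpy as np

n = 256                                       # length of the sampled series
t = np.arange(n)
x = np.sin(2 * np.pi * 8 * t / n) + 0.5 * np.sin(2 * np.pi * 32 * t / n)

spectrum = np.fft.rfft(x)                     # DFT for a real-valued signal
freqs = np.fft.rfftfreq(n, d=1.0)             # the frequency of each coefficient
dominant = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(np.sort(dominant))                      # ~[0.03125, 0.125], i.e. 8/256 and 32/256 cycles per step
```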
going to try and integrate lecture into notes since it’s going to probably have a lot of equations.
Didn’t really cover different types of transformations oddly enough.
title: Forecasting
subtitle: Principles and Practice
edition: 3
authors:
- Rob J. Hyndman
- George Athanasopoulos
publisher: OTexts
date: 2018
location: Melbourne, Australia
url: "https://otexts.com/fpp3/"
This should probably be its own information, but that is for the future perhaps. For now, we combine a couple additional chapters in my Data Science notes.
This is regarding the 3rd edition, but the 2nd is also available online. The 2nd edition uses the `forecast` package in R, where the 3rd edition uses the `tsibble` and `fable` packages.
The course for Data Science suggests reading Ch. 9 of the 2nd edition, which is Dynamic regression models, but in the 3rd edition it is ARIMA models. Both are relevant and the 3rd edition seems to be more current, so I’ll go with that.
Check out the book yourself Forecasting: Principles and Practice.
Exponential smoothing (Ch. 8 of this book) and ARIMA models are probably most widely used time series forecasting approaches. ARIMA models try to describe the autocorrelations in data.
A stationary time series has statistical properties that do not depend on the time at which the series is observed. So, a series with trend or seasonality is not stationary, and might apparently require exponential smoothing. But a white noise series is stationary.
However, cyclic behaviour can be stationary if it doesn’t have a trend or seasonality, as long as it doesn’t have predictable patterns in the long term.
Differencing is a way to make non-stationary time series stationary by computing the differences between consecutive observations. Additionally, transformation like logarithms can help stabilise the variance of a time series.
The ACF is useful for identifying non-stationary time series. If the data is stationary, the ACF will drop to zero relatively quickly. However, if the data is non-stationary, the ACF will decay slowly.
For a series with $T$ values, if you calculate the difference series $y'_t = y_t - y_{t-1}$, you have $T - 1$ values, as you cannot obtain $y'_1$ because you don’t have a $y_0$ value (best I can describe).
For a white-noise differenced series, you have
$$y_t - y_{t-1} = \epsilon_t$$
So, the $\epsilon_t$ denotes the white noise. And you get a random walk model with
$$y_t = y_{t-1} + \epsilon_t$$
Random walks typically have long periods of apparent trends up or down, and sudden, unpredictable changes in direction.
This extends out beyond second order, but a second-order difference might look like this:
$$y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}$$
So, this set now has $T - 2$ values. Apparently in practice, you typically won’t go beyond second-order differences.
Seasonal differencing is the difference between an observation and the previous observation from the same season:
$$y'_t = y_t - y_{t-m}$$
where $m$ is the number of seasons.
These are called lag-m differences.
Sometimes you would take both a seasonal and ordinary difference (AKA first difference) to obtain stationary data. Sorry if the AKA isn’t clear, it only applies to ordinary difference, not the combo of ordinary and seasonal.
It is recommended to do seasonal differences first because sometimes the resulting series is stationary without applying a further first difference. Applying more differences than required can induce incorrect autocorrelations.
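Here’s a pandas sketch of first, second, and seasonal (lag-$m$) differences on a made-up monthly-looking series, taking the seasonal difference first as recommended above.

```python
# Ordinary and seasonal differencing with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
t = np.arange(72)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 72))

first = y.diff()            # y'_t = y_t - y_{t-1}   (one NaN at the start)
second = y.diff().diff()    # y''_t = y'_t - y'_{t-1}
seasonal = y.diff(12)       # lag-12 difference: y_t - y_{t-12}
both = y.diff(12).diff()    # seasonal difference first, then a first difference
print(first.std(), seasonal.std(), both.std())
```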
A unit root test is a statistical hypothesis test of stationarity that is designed to determine whether differencing is required. There are many tests; this book covers the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. This has a null hypothesis that the data are stationary, and we test whether the null hypothesis is false. Small p-values (e.g., less than 0.05) suggest differencing is required.
R has a function called `unitroot_kpss`; check the book for examples as I don’t know R well enough at this time.
I did a little digging and found that Python’s `statsmodels` has a KPSS test tool as well.
The book discusses a `unitroot_nsdiffs()` R function for determining if seasonal differencing is required.
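Here’s the Python side of that digging: a hedged sketch of the KPSS test via `statsmodels.tsa.stattools.kpss` on a simulated random walk, before and after differencing.

```python
# KPSS test: null hypothesis is "data are stationary"; a small p-value
# suggests differencing is needed.
import numpy as np
from statsmodels.tsa.stattools import kpss

rng = np.random.default_rng(6)
random_walk = np.cumsum(rng.normal(size=500))            # non-stationary series

stat, p_value, lags, crit = kpss(random_walk, regression="c", nlags="auto")
print(stat, p_value)                                     # small p-value -> difference the series

stat, p_value, lags, crit = kpss(np.diff(random_walk), regression="c", nlags="auto")
print(stat, p_value)                                     # the differenced series looks stationary
```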
The backshift operator is like:
$$B\, y_t = y_{t-1}$$
Sometimes people use $L$ for “lag”, but usually not. This means something like $B^{12} y_t = y_{t-12}$, which is just the notation and not actually raising anything to the 12th power.
The notation works for differencing like this:
$$y'_t = y_t - y_{t-1} = (1 - B)\, y_t$$
In general, a $d$th-order difference would look like $(1 - B)^d y_t$.
A seasonal and first difference together looks like:
$$(1 - B)(1 - B^m)\, y_t$$
The term autoregression indicates that it is a regression of the variable against itself. An autoregressive model of order $p$ can be written as
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t$$
where $\epsilon_t$ is white noise. It is like a multiple regression but with lagged values of $y_t$ as predictors. We call this an $AR(p)$ model, an autoregressive model of order $p$.
We usually restrict autoregressive models to stationary data. We also put some constraints on the values of the parameters; for an $AR(1)$ model, the requirement is $-1 < \phi_1 < 1$.
And when $p$ gets larger, the restrictions get more complicated. Luckily, R’s `fable` package can take care of these restrictions for us when estimating a model.
A moving average model uses past forecast errors in a regression-like model, like so:
$$y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$
This is called an $MA(q)$ model, a moving average model of order $q$. Note that you wouldn’t actually observe the values of $\epsilon_t$, so it is not a regression in the typical sense. Each $y_t$ can be thought of as a weighted moving average of the past few forecast errors.
Note: Do not confuse moving average with smoothing (discussed in previous chapter of book).
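To see the AR-versus-MA distinction numerically, here’s a sketch simulating one of each with statsmodels (Python rather than the book’s R); the coefficients are illustrative, and note that `ArmaProcess` expects lag-polynomial coefficients, hence the sign convention.

```python
# Simulate an AR(1) and an MA(1) series: regression on past values
# versus regression on past errors.
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

ar1 = ArmaProcess(ar=[1, -0.8], ma=[1])   # y_t = 0.8 y_{t-1} + e_t (lag-polynomial signs)
ma1 = ArmaProcess(ar=[1], ma=[1, 0.6])    # y_t = e_t + 0.6 e_{t-1}

y_ar = ar1.generate_sample(nsample=250)
y_ma = ma1.generate_sample(nsample=250)
print(y_ar[:5], y_ma[:5])
```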
The $AR(1)$ model can be written as an $MA(\infty)$ model if you consider the following: repeatedly substituting for the lagged value gives
$$y_t = \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_1^2 \epsilon_{t-2} + \phi_1^3 \epsilon_{t-3} + \dots$$
Provided that $-1 < \phi_1 < 1$, then $\phi_1^k$ will get smaller as $k$ gets bigger. I guess that means eventually we obtain an $MA(\infty)$ process.
The reverse can also hold true with some constraints on the parameters. Thus, the MA model is called invertible. These models have some desirable mathematical properties, if you wanted to know why we jump down this rabbit hole.
The book provides an example and explains why we need $-1 < \theta_1 < 1$; it has to do with the weight given to more distant lags.
Combine differencing with autoregression and a moving average model to obtain a non-seasonal ARIMA model.
AutoRegressive Integrated Moving Average. Integration is the reverse of differencing in this context.
The model resembles what I have discussed from the course book:
$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t$$
$y'_t$ is the differenced series. The book here repeats much of what is in the course notes for data science - selected math techniques. Cool beans: there are special cases of the model.
Recall $ARIMA(p, d, q)$:
- $p$ = order of the autoregressive part
- $d$ = degree of first differencing involved
- $q$ = order of the moving average part

Special cases of $ARIMA(p, d, q)$:
- White noise: $ARIMA(0, 0, 0)$ with no constant
- Random walk: $ARIMA(0, 1, 0)$ with no constant
- Random walk with drift: $ARIMA(0, 1, 0)$ with a constant
- Autoregression $AR(p)$: $ARIMA(p, 0, 0)$
- Moving average $MA(q)$: $ARIMA(0, 0, q)$
Recommended to use backshift notation when working with more complicated models.
The book goes over more in-depth definition of components so you know what an automated function is doing.
See book for examples.
The R package `fable` uses maximum likelihood estimation when estimating the model.
We have discussed this before.
Akaike’s Information Criterion (AIC) is useful for selecting predictors for regression and for determining the order of an $ARIMA(p, d, q)$ model. It can be expressed as:
$$\text{AIC} = -2\log(L) + 2(p + q + k + 1)$$
where $L$ is the likelihood of the data, $k = 1$ if $c \neq 0$, and $k = 0$ if $c = 0$.
The corrected AIC for ARIMA models can be written as:
$$\text{AICc} = \text{AIC} + \frac{2(p + q + k + 1)(p + q + k + 2)}{T - p - q - k - 2}$$
The Bayesian Information Criterion can be written as:
$$\text{BIC} = \text{AIC} + \left(\log(T) - 2\right)(p + q + k + 1)$$
Basically, good models are obtained by minimising the AIC, AICc, or BIC. The book prefers the AICc. These criteria are better at selecting values for $p$ and $q$, but not $d$.
This is about the `ARIMA()` function in R’s `fable` package.
Read the book for an overview.
Point forecasts are calculated using the following 3 steps:
- Expand the ARIMA equation so that $y_t$ is on the left-hand side and all other terms are on the right.
- Rewrite the equation by replacing $t$ with $T + h$.
- On the right-hand side, replace future observations with their forecasts, future errors with zero, and past errors with the corresponding residuals.

Beginning with $h = 1$, the steps are repeated for $h = 2, 3, \dots$ until all forecasts have been calculated.
The book gives an example. It’s a good example.
Calculating prediction intervals is more difficult, and beyond the scope of… this.
The $ARIMA(p, d, q)(P, D, Q)_m$ model is used to represent seasonal data, the latter portion being for the seasonal terms, and $m$ being the seasonal period, the number of observations per year.
Let $m = 4$ for quarterly data; an $ARIMA(1, 1, 1)(1, 1, 1)_4$ model is written as:
$$(1 - \phi_1 B)(1 - \Phi_1 B^4)(1 - B)(1 - B^4)\, y_t = (1 + \theta_1 B)(1 + \Theta_1 B^4)\, \epsilon_t$$
The seasonal part of an AR or MA model will be seen in the seasonal lags of the PACF and ACF.
Plenty of examples in the book.
ARIMA models and ETS models have some overlap, but there are many models in each class that have no similar counterparts in the other.
Read more in the book.
Exponential smoothing and the ARIMA model allow for past observations but not exactly the inclusion of other, perhaps relevant, information. A regular regression model can include a lot of information, but nothing regarding subtle time series dynamics.
Recap - regression models can be expressed as:
$$y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \epsilon_t$$
We have $k$ predictor variables, and $\epsilon_t$ is usually assumed to be an uncorrelated error term (i.e. white noise).
The book mentions a Ljung-Box test for assessing whether the resulting residuals were significantly correlated. Worth looking into perhaps.
We are going to allow the errors from a regression to contain autocorrelation! To emphasise this change, we replace $\epsilon_t$ with $\eta_t$. This error is assumed to follow an ARIMA model.
So, the error of the regression will be $\eta_t$ and the error of the ARIMA model will be $\epsilon_t$. And only the $\epsilon_t$ error will be considered white noise, since the regression error $\eta_t$ is now assumed to be composed of autocorrelation and white noise.
Interesting thought process.
Based on our previous assumption about the error terms, if we try to minimise the sum of squared errors, $\sum_t \eta_t^2$, there are several issues: the estimated coefficients are no longer the best estimates, the statistical tests and p-values on the coefficients become misleading, and the AIC is no longer a good guide for model selection.
We can instead minimise the sum of squares of $\epsilon_t$, or use maximum likelihood estimation.
If estimating a regression with ARIMA errors, ensure all variables are stationary. Else, if any variables are non-stationary, the estimated coefficients will not be consistent estimates, and therefore garbage. An exception is if the non-stationary variables are co-integrated.
Therefore, we difference non-stationary variables in the model. To maintain the form of the relationship between $y_t$ and the predictors, it is common to difference all the variables if any of them need differencing. The resulting model is called a model in differences. A model in levels is what we obtain when the original data are used without differencing.
If all variables in the model are stationary, we only need to consider an ARMA process for the errors. A regression model with ARIMA errors is equivalent to a regression model in differences with ARMA errors.
In R, the `ARIMA()` function in the `fable` package will fit a regression model with ARIMA errors if the exogenous regressors are included in the formula. Suppose we have a model where $\eta_t$ is an $ARIMA(1, 1, 0)$ error. Our regression model in differences is then
$$y'_t = \beta_1 x'_t + \eta'_t$$
where $\eta'_t$ follows an $ARIMA(1, 0, 0)$, i.e. $AR(1)$, error.
Take that with a pinch of salt, as it’s a good explanation for why the constant term drops, but I’m not 100% sure about that pesky error term.
Whether differencing is required is determined by applying a test to the residuals from the regression model estimated using ordinary least squares.
If differencing is required, all variables are differenced and the model re-estimated using maximum likelihood estimation.
The AICc should be calculated for the final model, which can be used to determine the best predictors.
The book continues with examples in R.
To forecast using a regression model with ARIMA errors, we forecast the regression part of the model and the ARIMA part of the model separately, and then combine the results; see the sketch below.
Book contains many examples.
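Since I can’t follow the R examples yet, here’s a rough Python stand-in: statsmodels’ `SARIMAX` with an exogenous regressor fits a regression with ARIMA errors by maximum likelihood. The data, the AR(1) error, and the future $x$ values are all invented.

```python
# Regression with ARIMA errors: y = const + beta * x + eta_t,
# where eta_t follows an AR(1) process.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(7)
n = 200
x = np.cumsum(rng.normal(size=n))              # a predictor
eta = np.zeros(n)
for t in range(1, n):                          # AR(1) regression error
    eta[t] = 0.7 * eta[t - 1] + rng.normal(0, 0.5)
y = 3.0 + 1.2 * x + eta

fit = SARIMAX(y, exog=x, order=(1, 0, 0), trend="c").fit(disp=False)
print(fit.params)                              # intercept, beta for x, AR coefficient, sigma2

future_x = (x[-1] + np.arange(1, 6) * 0.1).reshape(-1, 1)   # future x values are required
print(fit.forecast(steps=5, exog=future_x))
```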
2 ways to model a linear trend:
Deterministic Trend
$$y_t = \beta_0 + \beta_1 t + \eta_t$$
The $\eta_t$ is an ARMA process.
Stochastic Trend is the same model, but where $\eta_t$ is an ARIMA process with $d = 1$. You can difference both sides so that you have:
$$y'_t = \beta_1 + \eta'_t$$
where $\eta'_t$ is an ARMA process. This is similar to a random walk with drift, but the error term is an ARMA process instead of white noise.
The models have different forecasting characteristics. The book gives examples.
Oh God… When there are long seasonal periods, a dynamic regression with Fourier terms is often better than the other models considered so far. Seasonal versions of ARIMA and ETS models are designed for shorter periods, like $m = 12$ for monthly data or $m = 4$ for quarterly data. Basically, there are $m - 1$ parameters to be estimated for the initial seasonal states. So, for a big $m$, the estimation becomes unmanageable.
Even in R, the `ARIMA()` function only allows for a limited seasonal period $m$.
For something like a daily period, we prefer a harmonic regression approach where the seasonal pattern is modelled using Fourier terms, with short-term time series dynamics handled by an ARMA error.
Many advantages and one significant disadvantage, check the book.
Consider an advertising campaign, or Covid. Sitting in a room of people with Covid won’t immediately get you sick; you would have a lag of several days. A model that allows for lagged effects can be expressed as:
$$y_t = \beta_0 + \gamma_0 x_t + \gamma_1 x_{t-1} + \dots + \gamma_k x_{t-k} + \eta_t$$
Again, the $\eta_t$ is an ARIMA process. The number of lags $k$ can be found using the AICc, along with values of $p$ and $q$ for the ARIMA error.
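A small pandas sketch of building the lagged-predictor columns ($x_t, x_{t-1}, \dots, x_{t-k}$) that such a model would regress on; the advertising-style data and the lag range are made up.

```python
# Build lagged copies of a predictor so a regression (with ARIMA errors)
# can pick up delayed effects; compare models over k with AIC/AICc.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
df = pd.DataFrame({"ads": rng.uniform(0, 10, 60)})
df["sales"] = 5 + 0.8 * df["ads"] + 0.4 * df["ads"].shift(1).fillna(0) + rng.normal(0, 0.5, 60)

for k in range(1, 4):                     # candidate lags 1..3
    df[f"ads_lag{k}"] = df["ads"].shift(k)

print(df.dropna().head())                 # rows where all lagged predictors are available
```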
Q1: Which model or analysis involves the operation of sorting data variables according to their level of changeability along data records?
This is the principal component analysis.
Q2: Why apply clustering analysis to a dataset?
You would do this to group data records according to their similarities.
This analysis does not remove irrelevant variables, reduce data dimensionality, nor estimate missing values.
Q3: What model or analysis provides the value of the independent variable?
It’s not principal component analysis, nor correlation analysis.
It’s a Regression Model! I clearly don’t understand the wording of the question.
Q4: The auto-regressive model assumes what?
A linear function between the future output and past output
Q5: What transformation approach transfers data variables to their frequency domain?
That is the Fourier transformation.
Better know the others too: radial, reciprocal, logarithm, etc…