title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022

Unit 3: Data Preprocessing

p. 51 - 66

Data typically comes from something, like a sensor, and is transmitted to a computer. When collecting large amounts of data, errors can creep in, whether from the sensor, the data transmission, or some other source.

There are many articles out there, but the HowStuffWorks article "If Your Laptop or Phone Keeps Crashing, Maybe Blame Cosmic Rays" talks about how cosmic rays can interfere with electronic devices.

Preprocessing means looking for missing and incorrect values (outliers or values that do not make sense) and handling them gracefully. It also includes transforming variables with different scales onto one common scale so that all variables carry the same weight, and visualizing the data to discover errors and variable correlations.

We will discuss the four activities from above:

3.1 - Transmission of Data

Data transmission can be accomplished manually or electronically. Manual data transmission is simple yet prone to human error. Electronic transmission is performed using local and/or wireless area networks.

Electronic transmission uses serial or parallel transmission links.

Asynchronous and Synchronous transmission:

Data transmission rate is expressed in terms of number of bits transmitted per second (bps).

3.2 - Data Quality, Cleansing, and Transformation

Issues in data are due to values being noisy, inaccurate, incomplete, inconsistent, missing, duplicate, or outlying.

An outlier is a data record that is seen as an exceptional and incomparable case of the input data. There are real outliers, those that are accurately recorded but truly different from other records, and fake outliers, which are the result of poor data quality.

The vast majority of data science time is spent cleaning data. The prior in the Bayesian sense is the basis on which predictive models are built. Thus, resolving missing values and outliers will change the prior, which alters the basis on which the machine learning model operates.

Missing Values and Outliers

Several methods are typically used to resolve missing values and outliers:

Note that the median is not referenced. Additionally, you may introduce a new variable with “0” to denote normal data and “1” to denote records containing missing and/or outlier values that were handled. This ensures original information is not completely lost.
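
Here's a rough sketch of that indicator-variable idea in pandas; the income column, the IQR rule for spotting outliers, and the mean replacement are all my own choices, not something prescribed by the course book:

```python
import numpy as np
import pandas as pd

# Toy data with a missing value and an obvious outlier (hypothetical "income" column).
df = pd.DataFrame({"income": [42_000, 38_000, np.nan, 41_000, 1_000_000]})

# Flag outliers with the 1.5*IQR rule of thumb and missing values with isna().
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
df["income_flag"] = (df["income"].isna() | is_outlier).astype(int)  # 1 = record was handled

# Resolve flagged values, e.g. by replacing them with the mean of the unflagged records.
clean_mean = df.loc[df["income_flag"] == 0, "income"].mean()
df.loc[df["income_flag"] == 1, "income"] = clean_mean
print(df)
```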

Duplicate Records

Typically duplicate records are removed to reduce compute time. However, they do not degrade the outcome of the analysis.
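
A tiny pandas sketch of deduplication (toy columns, obviously):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
deduped = df.drop_duplicates()          # keeps the first occurrence of each identical row
print(len(df), "->", len(deduped))      # 3 -> 2
```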

Redundancy

Different from duplicate records, here we discuss redundant or irrelevant variables. To detect them, we perform a correlation analysis between each pair of variables. Correlation can be easy to spot visually.

Recall that the correlation coefficient $\rho$ between two data variables is

$$\rho(x,y)=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

The correlation coefficient is a statistical measure of the degree of the relationship between two variables. It ranges over $[-1, 1]$. If $\rho = 0$, there is no indication of a linear correlation between the variables.

Correlation is a measure of linearity between variables. So, just because it is close to zero does not imply there is no relation, just not a linear one. If a relationship can be seen visually, it might be worth transforming a variable.
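
As a sanity check, the formula above is exactly what numpy's corrcoef computes; the toy x and y below are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1])   # roughly linear in x

# Manual Pearson correlation, following the formula above.
rho = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)
print(rho, np.corrcoef(x, y)[0, 1])  # both values agree (close to 1)
```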

You may also apply a dimensionality reduction approach, like principal component analysis (PCA). It sorts variables according to their importance, removing those that have only a minor influence on the data’s variability. This results in a dataset with fewer variables.
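
A minimal PCA sketch with scikit-learn, assuming purely illustrative data and an arbitrary choice of keeping two components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # 100 records, 5 variables (toy data)
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)     # last variable is nearly redundant

pca = PCA(n_components=2)                # keep only the 2 most important directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variability each component explains
```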

Transformation of Data

Data transformation converts the data in the dataset into a form suitable for the data science models being applied. There are three main transformation methods:

Because scaling requires formulas, I’ll use headings to cover each method.

Variable Scaling

By scales, we can talk about different units like dollars, centimetres, rooms in a house, litres of petrol, etc… Modeling techniques work on scaled variables (e.g. between $[-1, 1]$) to ensure variables are weighted equally. There are a few ways to scale a variable.

Max Value Scaling

$$x_i'=\frac{x_i}{\max(X)}$$

Average Value Scaling

$$x_i'=\frac{x_i-\bar{x}}{\sigma_x}$$

I am using $X$, which typically refers to the random variable that represents the entire set. Also, $\sigma_x$ is the standard deviation of said random variable.
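
Both scaling formulas in a few lines of numpy (toy values; note numpy's std defaults to the population standard deviation):

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

x_max_scaled = x / x.max()                 # max value scaling: x_i / max(X)
x_avg_scaled = (x - x.mean()) / x.std()    # average value scaling: (x_i - mean) / std
print(x_max_scaled)
print(x_avg_scaled)
```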

Variable Decomposition

Some variables may require further decomposition into more variables for better representation. I think this is typical of a categorical variable, like colour. Instead of one variable holding multiple colours, you might expand it into binary/Boolean variables, one per colour. Then, you might realize it only matters whether the object is green or not. This seems to lend itself to fuzzy logic; we get to that later though.

Also, you might decompose a date into year, month, day, hours, minutes and seconds, day of week, week of month, etc… Then, you might discover that only the month is important or something.
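
A sketch of both kinds of decomposition with pandas; the colour and timestamp columns are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["green", "red", "green", "blue"],
    "timestamp": pd.to_datetime(["2022-01-03", "2022-02-14", "2022-02-28", "2022-03-07"]),
})

# Categorical variable -> one binary/Boolean column per colour.
df = pd.get_dummies(df, columns=["colour"])

# Date -> separate year / month / day-of-week components.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek
print(df)
```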

Variable Aggregation

Two separate variables may become more relevant when combined by adding, subtracting, multiplying, etc… For example, if you notice gross income and tax paid are both important variables, perhaps the net income is what you are after.
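
And aggregation is often just arithmetic on columns; gross_income and tax_paid here are made-up names:

```python
import pandas as pd

df = pd.DataFrame({"gross_income": [50_000, 62_000], "tax_paid": [12_000, 17_500]})
df["net_income"] = df["gross_income"] - df["tax_paid"]   # combine two variables into one
print(df)
```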

3.3 - Data Visualization

Common data visualizations include histograms, scatter plots, geomaps, bar chart, pie chart, bubble chart, heat maps, etc…

The course book continues to explain and show examples of different graphs. A geomap is kind of self-describing and typically shows political or economic data. It colours different regions or countries on a map to display the values in question and show the scale of the variable.

A combo chart can be something like a line chart on top of a bar chart; it shows different types of information from different types of variables. I suppose their scales on the axes would need to overlap, or the chart wouldn’t make sense.

Bubble charts can represent many concepts across several categories. The book shows chance of success as the size of the bubble, colour represents department, and the axes are monetary values.

And the heat map shown in the book is of the continuous type. Colour scale can represent densities within a selected geographical area.


Video Lectures

This begins the data pre-processing discussion. We begin discussing Transmission Direction of data:

Multiplexing is an approach to transmit multiple signals over a common medium. This can be done with different frequency bands (frequency-division) or with time slots (time-division).

Synchronous sending of data means the data goes out as a continuous stream of frames. The sender first sends a synchronisation message, a bit pattern that begins like 10101010… (technically high voltage, low voltage, and so on). This allows the receiver to calculate the rate/frequency at which the sender transmits and to synchronise its clock; or, if the receiver cannot handle the input speed, it provides feedback. This synchronisation is always necessary. Then the sender transmits the data itself, e.g. eight bits at a time.

We can also work asynchronously: we transfer data in single bytes framed by a start and a stop bit. The idle line is held at logical 1 (negative voltage). When a byte is about to be sent, the line changes to positive voltage (the start bit), so the receiver knows 8 data bits are coming.
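
A toy sketch just to make the start/stop-bit framing concrete; this is plain bit twiddling, not any real UART library:

```python
def frame_byte(value: int) -> list[int]:
    """Frame one byte as: start bit (0), 8 data bits (LSB first), stop bit (1)."""
    data_bits = [(value >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]

# The idle line sits at logical 1; the leading 0 (start bit) tells the receiver a byte follows.
print(frame_byte(0b01000001))   # framing of the byte for 'A'
```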

We need to define rules for how this works before we can send any data. We look now at an Ethernet packet, which covers everything from the preamble to the checksum. There is an interpacket gap at the end which gives the receiver a moment to process the packet; it is not counted as data bits but lasts 96 bit times.

The synchronous and asynchronous methods come into play in the packet in the payload where the Transport layer protocol is defined.

The Transmission Control Protocol (TCP) has several flag bits, and one is the synchronize (SYN) bit. If that bit is 1, the sender wants to synchronize. The receiver, if accepting, sends back a packet with the acknowledge (ACK) bit set to 1 as well. Then the sender sends ACK = 1 but SYN = 0 to tell the receiver the data is coming! There is no data up to this point; it’s all handshake.

A simpler approach is the User Datagram Protocol (UDP). It is a connectionless protocol that is faster than TCP but:

It’s good for a single request that does not need a full connection, like a query to a Domain Name System (DNS) server.
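
A minimal localhost sketch with Python's socket module: UDP just fires a datagram with no handshake, whereas a TCP connect() would trigger the SYN/SYN-ACK/ACK exchange described above. The address, port, and message are arbitrary:

```python
import socket

# A throwaway UDP "server" socket bound to an ephemeral port on localhost.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
addr = server.getsockname()

# Connectionless send: no handshake, the datagram is simply fired at the address.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"where is example.com?", addr)

payload, sender = server.recvfrom(1024)
print(payload, "from", sender)

client.close()
server.close()
```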

Now, actual pre-processing! How do we pre-process data?

Besides classification, measuring distance between two objects is the second fundamental approach in data analysis. A norm is a function that describes the abstract extent of an object. Applied to a vector, it represents the magnitude of the vector, $\|x\|$.

It is often used to normalize (balance) data of different magnitude. Again, in terms of vectors, it’s the distance between the vectors, $\|x-y\|$.

What are some of these distance measures?

The $L^p$-norm (Lebesgue norm, giving the Minkowski distance) is a generalization referring to the length of a vector of $n$ components, given by the $p$-th root of the sum (or integral) of the $p$-th powers of the absolute values of the vector components. Given $x=(x_1, x_2, \dots, x_n)$:

$$L^p = \|x\|_p = \sqrt[p]{\sum_{i=1}^n|x_i|^p}$$

Here, $p$ is a non-zero real number. But that is merely the length of a vector. What about the distance? Treat the difference between two vectors as a vector itself:

$$d(x,y) = \sqrt[p]{\sum_{i=1}^n|x_i-y_i|^p}$$

The $L^0$-norm is the number of non-zero components in a vector. It’s not actually a norm because it doesn’t behave homogeneously; it’s a cardinality function that simply counts the non-zero elements.

OK, but we start with the $L^1$-norm, which is the sum of the absolute components. The corresponding distance is the Manhattan distance.

$$d(x,y) = \sum_{i=1}^n|x_i-y_i|$$

It depends on the rotation of the coordinate system, but not on translation. Typical applications would be:

The name comes from cab drivers in Manhattan: since the roads form a grid, this is how they would calculate the shortest driving distance.

The $L^2$-norm gives the Euclidean distance. We should be familiar with this:

$$d(x,y) = \sqrt{\sum_{i=1}^n(x_i-y_i)^2}$$

Typical applications:

There’s then the Canberra Distance, which is a normalized Manhattan distance that acts like a transform of the coordinate system. It is sensitive to small changes when both coordinates are close to 0.

$$d(x,y) = \sum_{i=1}^n \frac{|x_i-y_i|}{|x_i|+|y_i|}$$

typically used:

There is the $L^{\infty}$-norm, referring to the largest absolute component of a vector.

$$\lim_{p \to \infty} \sqrt[p]{\sum_{i=1}^n|x_i|^p}$$

It is the limit of the $L^p$ generalization, i.e. the Minkowski distance of order $p$, as $p$ grows. As a distance between two vectors, it is the maximum absolute componentwise difference, aka the Chebyshev distance.

Typical Applications:

We have the Mahalanobis distance, a measure of the distance between an observation and a distribution in terms of standard deviations:

$$D = \sqrt{(\vec{x}-\vec{\mu})^{\sf T}\,\Sigma^{-1}\,(\vec{x}-\vec{\mu})}$$

$\Sigma$ is the covariance matrix (so $\Sigma^{-1}$ is its inverse). This is used to detect outliers and anomalies. In contrast to the Euclidean distance, the Mahalanobis distance does not weight every direction equally; it accounts for the scale and correlation of the variables.
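
All of these distances are available in scipy.spatial.distance; a quick sketch with toy vectors and a toy covariance matrix:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

print(distance.minkowski(x, y, p=3))   # general L^p distance
print(distance.cityblock(x, y))        # L^1, Manhattan
print(distance.euclidean(x, y))        # L^2
print(distance.canberra(x, y))         # normalized Manhattan
print(distance.chebyshev(x, y))        # L^infinity, maximum componentwise difference

# Mahalanobis needs the *inverse* covariance matrix of the distribution.
samples = np.random.default_rng(1).normal(size=(200, 3))
VI = np.linalg.inv(np.cov(samples, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```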

These measures are the very low-level building blocks for our error measures.

You can find formulas above or somewhere.

Now, we talk about normalisation and standardisation.

Normalisation can be done over a range or over the maximum. The formulas are the same and you will find them somewhere above probably. We use a normalisation approach if we don’t know anything about the probability distribution.

Standardisation assumes a Gaussian distribution: you divide by the standard deviation, or you go full Z-score on it:

$$x_i'= \frac{x_i-\mu}{\sigma}$$
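
In scikit-learn these two steps map to MinMaxScaler (normalisation onto a range) and StandardScaler (Z-score standardisation); a quick sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [6.0], [9.0]])        # one toy variable as a column

print(MinMaxScaler().fit_transform(X).ravel())    # normalisation onto [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # Z-score: subtract mean, divide by std
```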

Again, these are things we do before we even start analysing data. Correlation analysis is an analysis method, but it’s also useful for pre-processing. So we use it in both parts.

At this point, the raw data is not in a form appropriate to answer our question. That is when we do transformation. From tabular to graphical is a way to visualise patterns. But we may also transform data logarithmically. We might also use the Fourier Transform. The continuous Fourier transform is given by:

$$\mathcal{F}f(y) = \int_{-\infty}^{\infty} f(x)\, e^{-iyx}\, dx$$

with $e^{-iyx}=\cos(yx)-i\sin(yx)$. Yes, you should remember this from the Advanced Maths notes. It stems from Euler’s formula:

$$e^{i\varphi} = \cos(\varphi) + i\sin(\varphi)$$

The Fourier transform can be seen as decomposition of a signal into sine waves of different frequency.

Then we dive into an example… by explanation. But there’s the cool graph you typically always see where there’s a decomposition of a wave into individual frequencies.

A time-dependent continuous signal can be represented as a superposition of periodic sine functions:

$$\mathcal{F}f(t)=\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\,e^{-i \omega t}\, dt = F(\omega)$$

The total signal is a summation of many sine waves. You’ll have to check the maths notes, but the $\frac{1}{\sqrt{2\pi}}$ factor at the start is a normalisation constant that comes from the $2\pi$ period of the angular frequency $\omega$.

This is important in data science because sometimes we need frequencies instead of time. Consider amplitude modulation (AM): the superposition of two periodic oscillations. The example has a carrier wave of 25 Hz and a constant modulating signal at 400 Hz.

How can this be used? Sound analysis! It can show you whether there’s a cello in an orchestra, like a frequency fingerprint.
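
A sketch of this frequency-fingerprint idea with numpy's FFT, using a simple sum of a 25 Hz and a 400 Hz sine as a stand-in for the lecture's AM example:

```python
import numpy as np

fs = 1000                                   # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)                 # 1 second of samples
signal = np.sin(2 * np.pi * 25 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

spectrum = np.abs(np.fft.rfft(signal))      # magnitude of the real FFT
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two largest peaks sit at the frequencies contained in the signal.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))                        # [25.0, 400.0]
```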

The continuous Fourier transform is great mathematically, but computers must digitize their data, which makes it discrete. The Discrete Fourier Transform (DFT) is the equivalent of the continuous Fourier transform, with the signal $f(t)$ represented by $N$ samples $x_k$ taken at time interval $T$:

$$F[\omega] = \mathcal{F}f[x] = \sum_{k=0}^{N-1} x_k\, e^{-i \omega k T}$$

This can be used for audio compression (cool). The signal can be decomposed into small segments, transformed, the resulting Fourier coefficients of non-perceptible high frequencies are discarded, and a back-transformation yields a compressed signal. So, you sort of trim the audio.

This compression is great for pre-processing, whether for sharing or analysing. Like, why should we transfer frequencies that you cannot hear anyway?

Then there is the Inverse Discrete Fourier Transform (IDFT):

$$f[x] = \mathcal{F}^{-1}F[\omega] = \frac{1}{N} \sum_{\omega=0}^{N-1} F[\omega]\, e^{i \omega k T}$$
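
A sketch of the compression idea with numpy: transform, discard the small coefficients, transform back. The 1% threshold and the noisy test signal are arbitrary choices of mine:

```python
import numpy as np

fs = 1000
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 25 * t) + 0.01 * np.random.default_rng(0).normal(size=t.size)

coeffs = np.fft.rfft(signal)                                   # DFT of the sampled signal
threshold = 0.01 * np.abs(coeffs).max()
compressed = np.where(np.abs(coeffs) > threshold, coeffs, 0)   # discard tiny coefficients

reconstructed = np.fft.irfft(compressed, n=len(signal))        # inverse DFT
print(f"kept {np.count_nonzero(compressed)} of {len(coeffs)} coefficients")
print("max reconstruction error:", np.abs(signal - reconstructed).max())
```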

The same principle can be used for images, considering that an image can be represented in terms of frequencies of colour changes. Think of a webpage: if the background is white, why should we send all of that information for each pixel? Send it for just one pixel.

Image compression comes in lossless and lossy methods. But if we don’t need irrelevant information, we can accept the lossy methods. The idea is to transform RGB colours into something that removes redundancies and is easier to transfer: a colour-space transformation.

In data science, we want to find features; we want to highlight the most important aspects of the data. We aren’t removing outliers here but characteristics, hopefully unimportant ones. Which characteristics matter comes from the question we are asking.

RGB transforms to YCbCr. In JPEG, the first step is this colour-space transform. Then comes the discrete cosine transform (DCT), which is very similar to the Fourier transform. So, to create an image from the stored data, your graphics card does a lot of backwards calculations.
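
A sketch of these two JPEG steps on a single 8×8 block: the RGB-to-YCbCr weights are the standard JFIF-style values quoted from memory, and the pixel values are random stand-ins for a real image:

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(8, 8, 3)).astype(float)   # one 8x8 block of random pixels

# Colour-space transform: RGB -> YCbCr (JFIF-style weights, quoted from memory).
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
y  = 0.299 * r + 0.587 * g + 0.114 * b
cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

# 2D discrete cosine transform of the luma block (type II, as used in JPEG).
coeffs = dctn(y - 128, norm="ortho")

# Crude "quantisation": keep only the low-frequency quarter of the coefficients.
coeffs[4:, :] = 0
coeffs[:, 4:] = 0
block_back = idctn(coeffs, norm="ortho") + 128   # decode step: inverse DCT
print(np.abs(y - block_back).mean())             # average error introduced by the cut
```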

Transformations change the representation of data. We do this for feature extraction and to present the data at the end to someone who makes the decision. Visualisation is very important. Types of visualisations are:

Reasons to visualise are to present data to a decision maker or to present data in order to sell it.

Charts are usually involved in the transition of data to information. They change the representation of data in order to:

examples of charts:

Pitfall between scatterplot and line chart:

There are more rules to abide by when using a line chart. You have to think about whether connecting points with a line actually makes sense.

Charts can easily be misleading, in particular if the axes are not scaled in an appropriate manner. Additionally, if an axis does not start at 0, it can exaggerate changes.

Also, outliers can have a dramatic effect on results. Outliers can make it look like a regression explains more of the data than it really does and has a stronger correlation coefficient. This is sometimes called the King Kong effect. You can look at a histogram to check whether the data looks like a normal distribution; a King Kong value will sit far from the rest.

Additionally, do not trust a correlation coefficient without looking at the data. The professor shows 3 scatterplots, all with correlation coefficient of 0.82, which is quite high. However, one of the graphs has a couple outliers, one graph is clearly polynomial shaped, and the last is probably a good fit.

To correlate data, you would typically assume a relationship between the variables. If you do not have a reason, it seems weird to run a correlation analysis. The professor makes statements about being careful with a correlation matrix. Basically, take high correlation with a grain of salt: if there’s no actual reason, then perhaps the variables are not actually related. Correlation does not indicate causality. However, correlation can support a hypothesis of a relationship. An open question: would both variables still be included in a regression, or would one be dropped because they carry duplicate information?
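
A quick numpy illustration of the outlier pitfall: the bulk of the data is uncorrelated noise, yet one extreme point produces an impressively high coefficient:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = rng.normal(size=50)                    # unrelated to x by construction

print(np.corrcoef(x, y)[0, 1])             # close to 0

x_out = np.append(x, 20.0)                 # one King Kong point far from the rest
y_out = np.append(y, 20.0)
print(np.corrcoef(x_out, y_out)[0, 1])     # jumps close to 1
```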


Test Your Knowledge

The bit stream is combined into long frames, and there is a constant period between deliverables in which type of transmission?

The answer is synchronous data transmission. It’s important to know the other data transmission options like manual, parallel, and asynchronous.

What is not an operation of data pre-processing?

Data transformation, transmission, and cleaning are all parts of data pre-processing (supposedly). However, Data Collection is not part of pre-processing.

Why would we apply correlation analysis on a data set?

We would apply correlation analysis to identify redundant variables. Correlation analysis will not help fill in missing values, nor identify duplicate records or outliers.

What do we call the process of removing the variable’s average and dividing by the standard deviation?

This process is called Variable Scaling. Other things we may do but are not this include variable decomposition, variable correlation analysis, and variable aggregation. All are terms to be familiar with.

What kind of chart is a data visualization tool that shows portions of a whole, where the value of the variables sum to 100% of it…

It’s a pie chart. Know your charts, like the area chart, combo chart, and bubble chart.