title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022

Not sure if this belongs here or in its own Artificial Intelligence section, but here is good for now.

Unit 6: Selected Artificial Intelligence Techniques

pp. 119 - 141

Learning objectives include:

Introduction

Regression is the go-to supervised learning model in data science. Then there’s classification, where a prediction model is developed from the input dataset in order to estimate the class of a new data record.

This unit covers things from support vector machines and artificial neural networks, to fuzzy logic and genetic algorithms. A little bit of a shoutout, but “Intelligent Techniques for Data Science”, by R. Akerkar and P.S. Sajja, covers most of these topics as well and provides a high-level overview of the concepts without diving too deeply into the math and implementation.

6.1 - Support Vector Machines

Definition - Support Vector Machine (SVM): A supervised learning algorithm in machine learning that is typically used for classification problems, though it can also be used for regression. It is an algorithm that finds the maximum-margin hyperplane in supervised classification.

Note: This is not an actual machine; the name just uses the buzzword.

The instructor particularly wants to cover what the support vectors are, so know this! Support vectors come from the subset of samples that, if moved or removed, would change the position of the dividing hyperplane.

![[/images/notes/data-science/support-vector-graph.png]]

The question is, how do I separate 2 groups? You need to know the classes in advance. But if there’s new data, where does it belong? The idea is to create a hyperplane to separate the classes. Support vectors are the data points nearest the hyperplane, the ones that bound the margin, because “…they support the separation of data into groups.”

The algorithm tries to find the optimal hyperplane of separation. The above shows a linear separation, but that’s more for show as it’s probably not usually linear in practice.

A support vector machine is a binary linear classification technique in which the classification rule is a linear function of the input dataset variables $\{x_{1,k}, x_{2,k},\dots,x_{M,k}\}$, with $k=1,2,\dots,N$, where $N$ and $M$ are the number of data records and data variables, respectively.

The classification function can be formulated as the linear equation:

$$w \cdot x_i + b = 0$$

For an example (I thought about this one): we have a large group of people and give them a test on a computer. If they pass the test, we consider them computer literate. If they fail, we don’t.

Our classification line will separate the dataset according to class, with one class on each side of it.

There’s also now a separating channel, which has the classification line as its center line and is bounded by support vectors on both sides. It is generated by the classification line and defined by two lines parallel to it and equidistant from it on both sides, namely:

$$\begin{aligned} +1 &= w \cdot x_i + b\\ -1 &= w \cdot x_i + b \end{aligned}$$

The distance between the support vector lines, or the width of the channel, is called the margin. Support vectors are data records $x_i$ that lie on either side of the separating channel. The margin is apparently denoted as $l$, a lowercase L. I’ll denote it below with var-phi $(\varphi)$ to avoid mistaking it for the number 1 or something:

$$\begin{aligned} l = \varphi &= x_i^{+1} - x_i^{-1}\\ \therefore \varphi &= \frac{+1-b}{w} - \frac{-1-b}{w} = \frac{2}{\|w\|} \end{aligned}$$

The SVM technique seeks the maximum margin to obtain the optimum separation between the 2 classes. We seek the classification equation that yields the maximum $\frac{2}{\|w\|}$.
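To make this concrete for myself, here’s a minimal scikit-learn sketch (the toy 2D data, the `C` value, and the variable names are all my own) that fits a linear SVM and pulls out the hyperplane $w$, the bias $b$, the support vectors, and the margin $2/\|w\|$:

```python
# Minimal sketch: a linear SVM on a made-up 2D dataset.
import numpy as np
from sklearn.svm import SVC

# Two small clusters, one per class.
X = np.array([[1.0, 1.2], [1.5, 0.8], [1.1, 0.6],
              [3.0, 3.2], [3.5, 2.8], [2.9, 3.6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3)   # large C ~ (nearly) hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # weight vector w of the separating hyperplane
b = clf.intercept_[0]               # bias b
print("hyperplane: w·x + b = 0 with w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
```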

Kernel Trick

The Kernel Trick helps SVM handle nonlinearly separable datasets. The dataset is projected into a higher-dimensional space in which it is linearly separable. The course book has a cool figure where a 2D dataset is projected into 3D and a plane can separate classes where a line could not.

![[/images/notes/data-science/kernel-trick-graph-kevin.png]]

The SVM merely considers relative distance between data points in a virtual projection space given by an appropriate Kernel function. Actual projection is apparently unnecessary.

Common Kernel functions include:

There’s a library for support vector machines called LIBSVM. Here’s a PDF called “LIBSVM: A Library for Support Vector Machines”, and the library is at www.csie.ntu.edu.tw. It should work with Python:

pip install -U libsvm-official

Additionally, scikit-learn apparently also has SVM resources.
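As a quick hedged sketch of the kernel trick with scikit-learn (the `make_circles` dataset and the gamma value are just my illustrative choices), an RBF kernel separates data that no straight line can:

```python
# Sketch of the kernel trick: concentric circles are not linearly separable in 2D,
# but an RBF-kernel SVM handles them via an implicit higher-dimensional space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor: no separating line exists
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # near 1.0: separable in feature space
```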

A kernel is a function that transforms the input vectors into the dot product in a feature space of higher dimensions:

$$k: \mathbb{R}^2 \to \mathbb{R}^3$$

The dot (scalar, inner) product of two vectors $\vec x$ and $\vec y$ is defined as:

$$\vec x \cdot \vec y = |\vec x||\vec y|\cos(\theta)$$

It is the length of the projection of $\vec x$ onto the unit vector $\vec y$ and is zero for orthogonal vectors.

The kernel function transforms the input data, or features, into the dot product of a feature space, usually of a higher dimension.
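A tiny numpy check of that idea, matching the $\mathbb{R}^2 \to \mathbb{R}^3$ example above. The feature map $\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ is my own pick; its dot product equals the polynomial kernel $(x \cdot y)^2$, so the kernel gives the higher-dimensional dot product without actually projecting:

```python
# Sketch: verify that a simple kernel equals a dot product in a higher-dimensional
# feature space. Feature map (my choice): phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def kernel(x, y):
    return np.dot(x, y) ** 2   # polynomial kernel of degree 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))  # dot product computed in R^3
print(kernel(x, y))            # same value, computed in R^2 without projecting
```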

This method is a little contradictory because it increases dimensionality of the data to find features but data scientists often are tasked with reducing the dimensionality of data. Why do we want to reduce dimensionality?

As a sort of example, we look at the Radial Distribution Function (RDF), which is an alternative representation of n-dimensional data, showing the frequency distribution of distances in Euclidean space.

$$g(r) = \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} e^{-B(r-r_{ij})^2}$$

We measure the distance from one point to others. Then, we plot g(r) on y-axis and radius on x-axis. This creates sort of a histogram / probability distribution for a 3d model. We have just reduced a 3d model into a 1d function. Where is this useful?

The $g(r)$ is like the intensity at a specific radius. You iterate this over all atoms, I suppose, in a molecule. The inner loop ensures you don’t count the same distance twice.

Then, we look at discrete vs. pseudo-continuous RDF, because while the RDF is typically used as a discrete function, the smoothing parameter can be utilised to account for bias (e.g. due to vibration) in the position coordinates of a molecule. Let $B$ be the inverse standard deviation of peaks in the discrete distribution:

$$g(r)= \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} e^{-\left(\frac{1}{s^2}\right)(r-r_{ij})^2}$$
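Here’s a rough numpy sketch of the RDF formula above (the 3D coordinates, the $B$ value, and the radius grid are made up):

```python
# Sketch of a (pseudo-continuous) radial distribution function for a set of 3D points.
import numpy as np

def rdf(points, r_values, B=100.0):
    """g(r) = sum over pairs i<j of exp(-B * (r - r_ij)^2)."""
    g = np.zeros_like(r_values)
    n = len(points)
    for i in range(n - 1):
        for j in range(i + 1, n):            # inner loop avoids counting a distance twice
            r_ij = np.linalg.norm(points[i] - points[j])
            g += np.exp(-B * (r_values - r_ij) ** 2)
    return g

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
r = np.linspace(0.0, 2.5, 100)
print(rdf(points, r).round(2))               # a 1D "fingerprint" of the 3D point cloud
```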

You can use this for nearly any 3D data. Architects can also use this to fingerprint a building.

But why would fingerprints be useful? If you know some molecules that bond with others, and you give an ML model the fingerprint, it might detect patterns for what can bind to it.

Property-Weighted Descriptors

By including characteristic vertex properties, we can weight RDFs and extend them to multiple dimensions.

A property-weighted RDF includes characteristic vertex properties $\varphi$:

$$g(r,\varphi) = \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} \varphi_i \varphi_j e^{-B(r-r_{ij})^2}$$

Two-dimensional RDF - a joint RDF of distance and a characteristic vertex property $\phi$:

$$g(r, \phi) = \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} e^{-\left(B(r-r_{ij})^2 + B(\phi-\phi_{ij})^2\right)}$$

Property-weighted two-dimensional RDF - a joint RDF of distance and a characteristic vertex property $\phi$, weighted by characteristic vertex properties $\varphi$:

$$g(r,\varphi,\phi) = \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} \varphi_i \varphi_j e^{-\left(B(r-r_{ij})^2 + B(\phi-\phi_{ij})^2\right)}$$

Good for chemistry as well with a wide variety of atoms.

Molecules are hard to compare due to the differing number of 3D coordinates. Radial functions can be used to unify and reduce multidimensional data and act as a descriptor of the original data.

6.2 - Artificial Neural Networks

p. 122

The purpose of developing an artificial neural network (ANN) is to produce an artificial system capable of performing sophisticated calculations similar to the human brain.

Definition - Artificial Neural Network: A computing network based on the neural networks of animal brains. It contains nodes and arrows.

The response is encoded in how and to what degree various neurons are connected. The basic architecture comprises many layers of neurons:

Deep learning is the application of artificial neural networks to learning tasks with cascading hidden layers. The strength of a link between two neurons ($j$ and $i$) on two adjacent layers is represented by a weight value $w_{ji}$. The network adjusts the weights of its links to produce an output value close to the desired value of the target variable.

Somehow, the neuron sums its weighted inputs coming from its preceding links to get $a_i$ and then applies a transfer function to produce the output $z_i$.

$$\begin{gathered} a_i = \sum_{j=1}^M (w_{ji}\cdot z_j)\\ z_i = f(a_i) \end{gathered}$$

Note that the $z_j$ values come from the preceding neurons, and $z_i$ is passed on to the next neuron.

The transfer function $(f)$ is also called an activation function. It must be continuous, differentiable, non-decreasing, and easy to compute. The neurons mostly have nonlinear activation functions, which allow the network to learn both nonlinear and linear relationships between the variables.

I’ll include the list of common activation functions because it’s new to me:

Linear (lin): function generates outputs which are not confined to a specific range.

$$z = a$$

The graph is more like a $y=mx+b$ thing, but ok.

log-sigmoid (logsig): function generates outputs between 0 and 1

$$z = \frac{1}{1+e^{-a}}$$

That equation has an actuarial look to it.

Tan-sigmoid (tansig): Function generates output in the range $-1 \le z \le +1$

$$z = \tanh(a) = \frac{e^{2a}-1}{e^{2a}+1}$$

Exponential Linear Unit (ELU): Function generates outputs that are not confined to a specific range. It has a linear shape for positive inputs and an exponential shape for negative inputs.

$$z = \begin{cases} a & a > 0\\ \alpha (e^a-1) & a \le 0 \end{cases}$$

Alpha $(\alpha)$ is some positive value, usually equal to $0.01$.

Rectified Linear Unit (ReLU): Function generates outputs that are 0 for inputs with negative values and for all other inputs, the output will equal the input number.

$$z = \max(a,\ 0)$$
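Here are the activation functions above as a quick numpy reference (the ELU alpha of 0.01 follows the note above; the test input is my own):

```python
# The activation functions above, written as a quick numpy reference.
import numpy as np

def linear(a):
    return a

def logsig(a):
    return 1.0 / (1.0 + np.exp(-a))            # output in (0, 1)

def tansig(a):
    return np.tanh(a)                           # output in (-1, +1)

def elu(a, alpha=0.01):
    return np.where(a > 0, a, alpha * (np.exp(a) - 1.0))

def relu(a):
    return np.maximum(a, 0.0)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (linear, logsig, tansig, elu, relu):
    print(f.__name__, f(a).round(3))
```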

Feedforward Networks

Feedforward neural network | Wikipedia: is one of the broad types of ANN, characterized by the direction of the flow of information between its layers. The flow is uni-directional, meaning it only flows in one direction, which is forward, from input nodes to output nodes through hidden nodes (if any). Contrast this with recurrent neural networks, which have a bi-directional flow and which we will look at later. Modern feedforward networks are trained using the backpropagation method, coming up soon, and are colloquially referred to as the “vanilla” neural networks.

The Wikipedia article goes deeper into activation functions (both sigmoids), and then covers a brief history of the perceptron | Wikipedia.

Feedforward Neural Networks | Brilliant: These are ANNs where connections between units do not form a cycle. They were the first type of ANN invented and are simpler than their counterpart, the recurrent neural network. Information only travels forward in the network; there are no loops. It flows first through the input nodes, then through the hidden nodes (if any), and finally through the output nodes. They are primarily used for supervised learning where the data to be learned is neither sequential nor time-dependent.

Feed Forward Neural Network | DeepAI: Same as other definitions, information flows forward and it is the simplest form of neural network.

The number of neurons, number of hidden layers, and neurons’ activation functions are completely arbitrary. Some feedforward network rules:

  1. No connections within the neurons of a layer.
  2. No direct connections between input layer and output layer. I guess that means at least one hidden layer.
  3. Fully connected between layers. That is, each neuron in one layer is connected to all neurons in its succeeding layer.
  4. Can have a number of hidden neurons per layer that is more or less than the number of inputs.

So we specify the following:

And our network is ready to use a given dataset to self-learn. Per usual data science stuff, divide the dataset into training and testing sets, and optionally a validation set as well. The book suggests training on about 60% and testing on around 40%, which seems like too much testing to me.
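As a hedged sketch of specifying and training such a network (the dataset, layer sizes, and other settings are my own illustrative choices), scikit-learn’s MLPClassifier with a 60/40 split might look like:

```python
# Sketch: a small feedforward network with a 60/40 train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Two hidden layers with 16 and 8 neurons, logistic (log-sigmoid) activations.
net = MLPClassifier(hidden_layer_sizes=(16, 8), activation="logistic",
                    max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```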

A common learning algorithm is the back propagation algorithm.

Back Propagation Algorithm

p. 127

The back propagation algorithm is used to estimate a multilayer feedforward neural network’s weights and develop a network model which estimates an output as accurately as possible, with respect to the given desired output. The desired output can be:

I was going to make a mermaid diagram of a simple neural network but that was too much. Each node in a layer points to all nodes in the next layer and has its value multiplied by a weight.

$$a_i = \sum_{j=0}^M \left( w_{ji} z_j \right)$$

Above is the sum of weighted inputs flowing into a node. Below is the easy calculation of a node’s output:

$$z_j = f(a_j)$$

We let $f(a)$ be the transfer function and $M$ be the number of neurons in layer $j$.

The back propagation algorithm consists of the following 2 phases:

  1. The forward pass.
  2. The backward pass.

Might be worth noting that training is supervised or else you wouldn’t know if you need to perform back propagation or not.

Then back to the supervised learning example with the photos of a butterfly and a flower. Then we go on to discussing models to detect if something is just a shadow or if it is actually an object.

Using unsupervised learning to define categories in data will always provide categories. Whether they are meaningful, or the ones you were expecting, is another question.

Forward Pass Phase

During the forward pass phase, the algorithm initially assumes random values for the weights and calculates all the network’s parameters. It’s named so because calculations proceed in a forward direction from neuron to neuron, and layer to layer.

The first hidden layer calculation is referred to as “forward pass iteration one”. Subsequent layers are named accordingly if they exist. Poorly drawn, it looks like:

$$\text{inputs} \rightarrow \begin{array}{c} f_1(a) \\ f_2(a)\\ \vdots\\ f_n(a) \end{array} \rightarrow \cdots \rightarrow f_{\Omega}(a) \rightarrow y$$

The output of the network $(y)$, with the current set of weights, is compared to the desired target value $(d)$ to obtain the network error $(E)$. The network’s weights are then updated to decrease the error metric in the next forward pass.

Gradient Descent

This term should be familiar. Back propagation uses gradient descent to recursively calculate the needed adjustments to the network weights. However, there is no analytical formula to do this in one step for all weights. Therefore, we work backwards through the layers to calculate how the weights need to be adjusted to reduce the error value.

This cycle keeps repeating, a forward pass to calculate the error, and a backward pass to adjust network weights, until no more improvements can be made.

Through a mathematical lens, we have the change in weight $\Delta w$ and the rate of decrease in error $\partial E / \partial w$ in the following equations:

$$\begin{gathered} \Delta w = - \eta \left( \frac{\partial E}{\partial w} \right)\\ w_{t+1} - w_t = - \eta \left( \frac{\partial E}{\partial w} \right)\\ w_{t+1} = w_t - \eta \left( \frac{\partial E}{\partial w} \right) \end{gathered}$$

Let $\eta$ be the proportional constant called the learning rate. You would usually keep $0 < \eta < 1$ because bigger step sizes could cause nonlinearity issues.

The network error is defined by the mean square error between the network output $y$ and the desired output $d$, averaged over the training data records:

$$E = \frac{1}{2} \sum_{1}^n (d-y)^2$$

The 1/2 is included to cancel out during the upcoming differentiation step.

The change in error per weight between neuron $i$ and preceding neuron $j$ is obtained with the chain rule:

Note: Math in computers is interesting, but buckle up, it’s going to get a little wild

$$\begin{aligned} \frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial z_{i}} \cdot \frac{\partial z_i}{\partial a_{i}} \cdot \frac{\partial a_i}{\partial w_{ji}}\\ &= \frac{\partial E}{\partial z_{i}} \cdot \frac{\partial f(a_i)}{\partial a_{i}} \cdot \frac{\partial \left(\sum_{j=0}^M w_{ji} z_j \right)}{\partial w_{ji}}\\ &= \frac{\partial E}{\partial z_{i}} \cdot \frac{\partial f(a_i)}{\partial a_{i}} \cdot z_j \end{aligned}$$

That makes sense with the substitution and cancelling of terms.

Now, if the neuron is in the output layer, we have:

$$\begin{aligned} \frac{\partial E}{\partial z_i} &= \frac{\partial}{\partial z_i} \left[ \frac{1}{2}\sum_i^n (d-y)^2 \right]\\ &= \frac{\partial}{\partial z_i} \left[ \frac{1}{2}\sum_i^n (d-z_i)^2 \right]\\ &= -(d-z_i)\\[2ex] \therefore \frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial z_{i}} \cdot \frac{\partial z_i}{\partial a_{i}} \cdot \frac{\partial a_i}{\partial w_{ji}}\\ &= -(d-z_i) \cdot \frac{\partial f(a_i)}{\partial a_{i}} \cdot z_j\\ &= -\delta_i \cdot z_j \end{aligned}$$

What happened at the end? Simple: we let

$$\delta_i = (d-z_i)\cdot \frac{\partial f(a_i)}{\partial a_i}$$

Weights in the output layer will be adjusted according to the following formula:

$$(w_{ji})_{t+1} = (w_{ji})_{t} + \eta \cdot \delta_i \cdot z_j$$

But, if the neuron $(i)$ is in the last hidden layer, with links $(j)$ from its preceding layer and links $(u)$ to its succeeding layer (the output):

$$\begin{aligned} \frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial z_{i}} \cdot \frac{\partial z_i}{\partial a_{i}} \cdot \frac{\partial a_i}{\partial w_{ji}}\\ &= \left( \sum_U \left[ \frac{\partial E}{\partial z_{u}} \cdot \frac{\partial z_u}{\partial a_{u}} \cdot \frac{\partial a_u}{\partial z_{i}} \right] \right) \cdot \frac{\partial z_i}{\partial a_{i}} \cdot \frac{\partial a_i}{\partial w_{ji}}\\ &= \left( \sum_U \left[ \frac{\partial E}{\partial z_{u}} \cdot \frac{\partial f(a_u)}{\partial a_{u}} \cdot w_{iu} \right] \right) \cdot \frac{\partial f(a_i)}{\partial a_{i}} \cdot z_{j}\\ &= \left( \sum_U \left[ -(d-z_u) \cdot \frac{\partial f(a_u)}{\partial a_{u}} \cdot w_{iu} \right] \right) \cdot \frac{\partial f(a_i)}{\partial a_{i}} \cdot z_{j}\\ &= \left( \sum_U \left[ -\delta_u \cdot w_{iu} \right] \right) \cdot \frac{\partial f(a_i)}{\partial a_{i}} \cdot z_{j}\\ \therefore \frac{\partial E}{\partial w_{ji}} &= -\delta_i \cdot z_j \end{aligned}$$

Where we let

$$\delta_i = \left( \sum_U \left[ \delta_u \cdot w_{iu} \right] \right) \cdot \frac{\partial f(a_i)}{\partial a_{i}}$$

I use $U$ to represent the set of succeeding layer neurons $(u)$.

Now, a quick recap: The weights of the links going to a hidden layer will be adjusted according to the following formula:

$$(w_{ji})_{t+1} = (w_{ji})_t + \eta \cdot \delta_i \cdot z_j$$

Where $\delta_i$ is defined right above.

Backward Pass Phase

p. 133

We are now in the backward pass phase of the back propagation algorithm. We move backward from the output layer and calculate the adjusted weights of the network.

Let’s list steps to compose the algorithm:

Step 1: Select a value for the learning rate (e.g. $\eta = 0.1$).

Step 2: For the link connecting the last hidden layer’s neuron $(j)$ to the output layer’s neuron $(i)$, calculate the following:

$$\delta_i = (d-z_i) \cdot \frac{\partial f(a_i)}{\partial a_i}$$

Step 3: With delta, calculate the new weight as:

$$(w_{ji})_{t+1} = (w_{ji})_t + \eta \cdot \delta_i \cdot z_j$$

Repeat Steps 2 and 3 for all links from this hidden layer to the output layer.

Step 4: For the link between the last hidden layer’s neuron $(i)$ and its preceding layer’s neuron $(j)$, calculate $\delta_i$:

$$\delta_i = \left( \sum_U \left[ \delta_u \cdot w_{iu} \right] \right) \cdot \frac{\partial f(a_i)}{\partial a_i}$$

Step 5: With another delta, calculate the new weight again

$$(w_{ji})_{t+1} = (w_{ji})_t + \eta \cdot \delta_i \cdot z_j$$

Again, repeat steps 4 and 5 for all links connected to this hidden layer from its preceding layer.

Step 6: Repeat steps 4 and 5 in a backwards direction for the links between each remaining hidden layer’s neuron $(i)$ and its preceding layer’s neuron $(j)$ until the input layer is reached.

That is just one iteration of the back propagation algorithm, where the weights are adjusted only once.

Step 7: Re-calculate the network error $(E)$ and check whether it has reached a global minimum or a reasonably low value. If not, repeat the iteration until no additional improvement in the calculated network error can be made.
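Here’s a minimal numpy sketch of the whole cycle for a tiny 2-3-1 network with log-sigmoid activations, following the update rules above. The weights, data, and learning rate are made up, and biases are omitted for brevity, so this is an illustration rather than the book’s implementation:

```python
# Minimal sketch of repeated forward/backward passes for a 2-3-1 network.
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1                                   # Step 1: learning rate

x = np.array([0.5, -0.2])                   # one training record
d = np.array([1.0])                         # desired output

W1 = rng.normal(size=(2, 3))                # weights: input -> hidden
W2 = rng.normal(size=(3, 1))                # weights: hidden -> output

def f(a):                                   # log-sigmoid transfer function
    return 1.0 / (1.0 + np.exp(-a))

for it in range(1000):
    # Forward pass
    z_hidden = f(x @ W1)
    y = f(z_hidden @ W2)
    E = 0.5 * np.sum((d - y) ** 2)

    # Backward pass
    # Step 2: output deltas, delta_i = (d - z_i) * f'(a_i), with f' = f * (1 - f)
    delta_out = (d - y) * y * (1.0 - y)
    # Step 4: hidden deltas, delta_i = (sum_u delta_u * w_iu) * f'(a_i)
    delta_hidden = (W2 @ delta_out) * z_hidden * (1.0 - z_hidden)

    # Steps 3 and 5: w_ji(t+1) = w_ji(t) + eta * delta_i * z_j
    W2 += eta * np.outer(z_hidden, delta_out)
    W1 += eta * np.outer(x, delta_hidden)

print("final error:", E, "output:", y)
```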

interesting note: If a network’s error cannot be improved and it has not reached its goal minimum value nor an acceptable value, we may consider changing the implemented neurons’ activation functions.

For the curious, a Convolutional Neural Network | Wiki is a type of feed-forward neural network, which makes sense based on the concept of a convolution.

Recurrent Networks and Memory Cells

The human brain is more complex than a feedforward network. It permits links to occur from succeeding layers to preceding layers, or allows for feedback. In ANN, this model is called a recurrent network.

A Recurrent Neural Network | Wiki is a bi-directional artificial neural network, which means that the output of some nodes can affect subsequent input to the same nodes. Their ability to use internal state, or memory, to process arbitrary sequences of inputs makes them applicable to a diverse range of tasks like unsegmented, connected handwriting recognition or speech recognition.

Additionally, a Transformer | Wiki is a deep learning architecture that relies on the parallel multi-head attention mechanism. It requires less training time than previous RNNs. That isn’t really for topics of data science though.

So, the idea of RNNs is to allow a connection in the current layer to provide feedback to a previous layer, which eventually feeds information back into itself. It may also be trained using the back propagation algorithm, but must remember the previously obtained values of the recurrent connections.

A memory cell allows new terms to be added into the network’s mathematical functions. It is in charge of sending the output of a recurrent neuron backwards, controlling both what information is sent back and to what degree. An example is the long short-term memory (LSTM) unit, comprised of input, output, and forget gates. It remembers values from the recurrent neurons over arbitrary time intervals, and the three gates regulate the flow of this information through the network.

The Vanishing Gradient Problem | Wiki is encountered when training ANNs with gradient-based learning methods and backpropagation. I suppose the issue is that during training the weights are updated proportionally to the partial derivative of the error function with respect to the current weight, $\partial E / \partial w$. The issue arises when the gradient becomes vanishingly small, preventing the weight from updating. This can grind training to a halt. It affects many-layered feedforward networks and recurrent networks.

A Long short-term memory | Wiki network is an RNN that aims to handle the vanishing gradient problem. It aims to provide a short-term memory for the RNN that can last thousands of timesteps, hence “long short-term memory”.

It is the selecting of relevant information from the current state that allows the LSTM network to maintain useful, long-term dependencies to make predictions.

Training RNNs can take some time.

Reinforcement Learning

p. 138

Reinforcement Learning | Wiki is an area of ML concerned with how intelligent agents take actions to maximize some cumulative reward. It is the 3rd basic ML paradigm, next to supervised and unsupervised learning. Reinforcement learning doesn’t use datasets in the traditional sense but gathers data by exploring its environment. The environment is usually stated in the form of a Markov Decision Process | Wiki (MDP), which provides a mathematical framework for modelling decision-making in situations where outcomes are partially random, because many reinforcement learning algorithms use Dynamic Programming | Wiki techniques.

Reinforcement learning is a goal-oriented learning approach. It is based on interaction with the environment and investigates all possible scenarios to find the optimal correct actions. The idea is to let the machine learn which actions yield the maximum reward through many iterations. A classic structure consists of an artificial agent with a feedback loop that reinforces the agent with rewards when it has performed a correct action.

The environment is the scenario the agent is facing. The internal state of the scenario is maintained by the agent to learn about the environment.

I think this was described earlier, but the process is like:

The model is usually a pre-trained neural network.

Comparison of Learning Types

Both supervised and reinforcement learning map from inputs to outputs in their developed models. However, in reinforcement learning, a reward function acts as feedback to the agent. Supervised learning predicts the mathematical function that relates the output to a given input.

With unsupervised learning, there is no mapping. The objective of the learning algorithm is to find underlying patterns in the data.

Markov Property

Per video lecture.

Markov was a Russian mathematician. The Markov property says that the conditional probability distribution of future states of a process, given all past states, depends only on the present state and not at all on the past states. This is called memoryless.

A random process with the Markov property is called a Markov process.

A Markov Chain is denoted:

$$X_{n,n \in \mathbb{N}} = \left( x_0, \dots, x_n \right)$$

And the Markov property implies that

$$P(x_{i+1}|x_0, \dots, x_i) = P(x_{i+1}|x_i)$$

This basically means, given all of the previous states that have occurred, the probability that we enter a specific state $x_{i+1}$ only depends on where we last were, and nothing before that.

You can also think that the present state contains all the information from the past states, and therefore is all that is needed to determine the future state.

Hidden Markov Models

Again, from the video…

Hidden Markov models incorporate a discrete state variable $z$, a latent variable that behaves like a first-order Markov sequence (rather than the observations).

Each result of an observation is conditionally independent of a preceding observation given the value of its associated latent variable. However, the hidden state variable follows a $1^{st}$-order Markov model where the unknown states are directionally connected.

Since the path is $d$-connected, the probability distribution is described like:

$$P(x_3|x_1,x_2)$$

Where the $x$ values are the independent observations that sit on top of the $z$-valued dependent hidden states. The hidden Markov model is kind of an assumption.

Markov Chain

A Markov chain is a stochastic model of a sequence of events in which each transition probability depends on its predecessor only… Already stated

What if $p_{ab}$ is the transition probability from state $a$ to $b$? We can create a transition probability matrix, which is a stochastic matrix in which each entry is positive and each row sums to 1.

| events | a | b |
| --- | --- | --- |
| a | $1-p_b$ | $p_b$ |
| b | $p_a$ | $1-p_a$ |

The probability of being in state $a$ and remaining in that state is one minus the probability of moving out of that state. This easily extends to multiple states as well and ensures that the probabilities in each row sum to one.

This is right from Probability Theory and is very nice.

When you write the probabilities, you would write something like $p_{xy}$, which is the probability of going from state $x$ to $y$. You may also find yourself in a state that you cannot leave, where all leaving probabilities are 0. This is an absorbing state.

So, Markov Chains can represent your phases of sleep, with probabilities of moving between phases, and many other factors.

Because of the Markov theorems, the probability of following a certain path through the states $D=\{a,b,c,d\}$, with whatever transition matrix you can imagine within reason, is achieved by simply multiplying the probabilities along the path. That is:

$$P(x_{i+1}|x_0,\dots,x_i) = p_{ab}p_{bb}p_{bc}p_{cd}\cdots$$

Because the Markov property says these events are memoryless, they are independent. So the intersection of events, say the probability of going from $a$ to $b$ and then from $b$ to $c$, is just a multiplication.
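A small numpy sketch of this (the $p_a$ and $p_b$ values are mine): build the two-state transition matrix from the table above and multiply transition probabilities along a path:

```python
# Sketch: two-state Markov chain and the path-probability idea.
import numpy as np

states = ["a", "b"]
p_b, p_a = 0.3, 0.4                       # probability of leaving a, and of leaving b
P = np.array([[1 - p_b, p_b],             # row a: stay in a, move to b
              [p_a, 1 - p_a]])            # row b: move to a, stay in b
assert np.allclose(P.sum(axis=1), 1.0)    # each row sums to one

def path_probability(path):
    """Probability of following a given state path, starting from its first state."""
    idx = [states.index(s) for s in path]
    prob = 1.0
    for s, t in zip(idx[:-1], idx[1:]):
        prob *= P[s, t]                   # memoryless: just multiply the transition probs
    return prob

print(path_probability(["a", "b", "b", "a"]))   # p_ab * p_bb * p_ba
```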

Markov Decision Process

The Markov Decision Process, mentioned above, is a mathematical framework for solving reinforcement learning problems. The process consists of the main parameters:

A reinforcement learning example that applies the Markov decision process is the shortest path problem. The book provides a figure on p. 139. It is a bunch of letters connected by lines with weights, and you want to get from A to F as quickly as possible.

Let’s consider our parameters:

A common algorithm for policy-based reinforcement learning can be summarized in the following steps:
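The steps themselves aren’t captured in these notes, but here’s a hedged sketch of a value-iteration style approach (a dynamic-programming take on the same MDP idea, not necessarily the book’s exact algorithm) on a made-up weighted graph from A to F:

```python
# Hedged sketch: value iteration on a hypothetical weighted graph from A to F.
# Each edge cost is a negative reward, so maximizing reward = finding the shortest path.
graph = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 5, "C": 1},
    "C": {"D": 8, "E": 10},
    "D": {"E": 2, "F": 6},
    "E": {"F": 3},
    "F": {},
}

# value[s] = estimated cost-to-go from state s to the goal F
value = {s: float("inf") for s in graph}
value["F"] = 0.0

for _ in range(len(graph)):                       # sweep until the values stabilize
    for s, edges in graph.items():
        if edges:
            value[s] = min(cost + value[t] for t, cost in edges.items())

# Greedy policy: from each state, pick the neighbour with the lowest cost-to-go
policy = {s: min(edges, key=lambda t: edges[t] + value[t])
          for s, edges in graph.items() if edges}
print("cost-to-go:", value)
print("greedy policy:", policy)
```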

6.3 - Further Approaches

Soft Computing | Wiki is a set of algorithms that are tolerant of imprecision, uncertainty, partial truth, and approximation. It is taken in contrast to hard computing, meaning algorithms that find provably correct and optimal solutions to problems. Examples of soft computing are neural networks, evolutionary algorithms, and fuzzy logic.

Genetic Algorithms

Let’s start with Evolutionary Computation | Wiki, a family of algorithms for global optimization inspired by biological evolution. It is a subfield of AI and soft computing. These are population-based trial-and-error problem solvers with a metaheuristic or stochastic optimization character. Basically, the derived solution solves the problem and can be sufficiently good, especially with imperfect information, but cannot be guaranteed to be optimal. Hopefully we cover more in the AI course. In evolutionary computation, an initial set of candidate solutions is generated and iteratively updated.

Evolutionary Algorithms | Wiki are a subset of evolutionary computation and use mechanisms inspired by biological evolution such as reproduction, mutation, recombination, and selection. The Fitness Function | Wiki helps guide simulations towards optimal design solutions.

The Genetic Algorithm (GA) is an approach to solving optimization problems based on natural selection. Genetic Algorithm | Wiki is a metaheuristic inspired by the process of natural selection that is a subset of evolutionary algorithms. These algorithms provide high quality results by relying on biologically inspired operators such as mutation, crossover, and selection. Apparently optimizing decision trees can be an example.

The genetic algorithm starts by generating a population of possible solutions and applies selection rules to randomly select individuals from the current population to be parents. Crossover rules are applied to combine two parents and form children. Mutation rules apply random changes to parents to form different children. This simulates evolution over successive generations.
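Here’s a hedged toy sketch of those operators in Python: tournament selection, single-point crossover, and bit-flip mutation on bit-string individuals. The fitness function (count the 1-bits) and all the rates are illustrative, not from the course book:

```python
# Hedged sketch of a tiny genetic algorithm: selection, crossover, mutation.
import random

random.seed(0)
GENES, POP, GENERATIONS, MUTATION_RATE = 20, 30, 40, 0.02

def fitness(ind):
    return sum(ind)                                    # toy objective: count the 1s

def select(pop):                                       # tournament selection of a parent
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):                                 # single-point crossover
    cut = random.randint(1, GENES - 1)
    return p1[:cut] + p2[cut:]

def mutate(ind):                                       # flip bits with small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for gen in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print("best fitness:", fitness(best), "of", GENES)
```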

Per video…

Genetic Algorithms (GA) use mutation and recombination operations that are typically used for optimization problems. Programming techniques include:

Some programming strategies on the topic include:

Like artificial neural networks, these are kind of based on biology. They are used to find optimal solutions to problems in an unsupervised manner. One use case you might use without even realizing is writing a program that optimizes the hyperparameters of a model! Amazing how simple it is sometimes, yet unnoticed.

There’s also gene expression programming, but we aren’t Geneticists.

Genetic Programming

So, how do we actually write a genetic program? Rather than representing a function in character strings, individuals are represented as parse trees (syntax trees).

![[/images/notes/data-science/genetic-programming-tree-diagram.png]]

You split the function by its operators as far as it can go. Then you can perform mutations.

The point mutations are used to see if the new function provides a better result. So you need a measure of error as well.

This is similar to reinforcement learning.

Fuzzy Logic

The book Intelligent Techniques for Data Science has an entire chapter on this. Fuzzy logic is nearly synonymous with the theory of fuzzy sets, which deals with classes of objects with unsharp boundaries. Members have a degree of truth instead of sharp cut-offs like $\pm 1$.

Steps:

  1. Fuzzify (decompose) all input values into truth values in $[0,1]$.
  2. Fuzzy output is computed by if-then rules that mimic and | or | not operators.
  3. De-fuzzification is performed on the fuzzy truths to get a continuous output value.
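As a hedged sketch of these three steps, here’s a toy fan-speed controller in Python. The membership functions, the rules, and the output values are all made up for illustration:

```python
# Hedged sketch of the three fuzzy-logic steps for a toy fan-speed controller.
def cold(t):                       # Step 1: fuzzify temperature into a truth value in [0, 1]
    return max(0.0, min(1.0, (25.0 - t) / 10.0))

def hot(t):
    return max(0.0, min(1.0, (t - 15.0) / 10.0))

def fan_speed(t):
    # Step 2: if-then rules (classic fuzzy logic maps AND -> min, OR -> max)
    slow = cold(t)                 # "if cold then slow"
    fast = hot(t)                  # "if hot then fast"
    # Step 3: de-fuzzify with a weighted average of the rule outputs (slow=20%, fast=90%)
    if slow + fast == 0.0:
        return 0.0
    return (slow * 20.0 + fast * 90.0) / (slow + fast)

for temp in (10, 18, 22, 30):
    print(temp, "°C ->", round(fan_speed(temp), 1), "% fan speed")
```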

Fuzzy logic | Wiki is explained pretty much as above but going into greater detail and providing a bit of mathematical knowledge to the topic.

Naïve Bayes Classification

The Naïve Bayes method is a classification approach based on Bayes’ theorem and designed for categorical data. Basically, a new record is assigned to the class to which it has the highest probability of belonging.

Naïve Bayes Classifier | Wiki is a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between the features. Simple, yet effective. Check out the Wiki article for more.
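A minimal scikit-learn sketch (the tiny integer-coded “weather” style dataset is my own invention):

```python
# Sketch: Naïve Bayes on small categorical data with scikit-learn.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Features: [outlook (0=sunny, 1=rainy), temperature (0=cool, 1=hot)]
X = np.array([[0, 1], [0, 1], [1, 0], [1, 1], [0, 0], [1, 0]])
y = np.array([1, 1, 0, 0, 1, 0])          # class to predict (e.g., 1 = play outside)

model = CategoricalNB()
model.fit(X, y)

new_record = np.array([[0, 0]])           # sunny and cool
print("predicted class:", model.predict(new_record)[0])
print("class probabilities:", model.predict_proba(new_record).round(3))
```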


Knowledge Check

Q1: The Kernel trick is employed in support vector machines to do what?

The kernel trick is employed to deal with nonlinearly separable datasets.

It is not for:

Q2: What is the objective of learning the feedforward network?

The objective of a feedforward network is to get the network’s output value very close to the desired output value.

It is not for:

Q3: What learning model does the concept of a memory cell exist in and why?

A memory cell exists in a recurrent network to control what information is fed back into the learning model.

Models that do not use memory cells include support vector machines, reinforcement learning, and feedforward networks.

Q4: What learning approach is based on a theory that relates to classes of objects with unsharp boundaries?

Fuzzy logic is a theory that deals with classes of objects with unsharp boundaries.

Other approaches that do not deal with this theory are: Naïve Bayes classifier, support vector machines, and genetic algorithms.

Q5: What does the Naïve Bayes approach assume about independent variables?

This approach assumes that independent variables are also random variables.

It does not assume independent variables are also: structured data variables, normalized variables, orthogonal variables.