title: Mathematics for Machine Learning
authors:
  - Marc Peter Deisenroth
  - A. Aldo Faisal
  - Cheng Soon Ong
publisher: Cambridge University Press
year: 2020
Ch. 1 - Introduction and Motivation
p.11
Ch. 2 - Linear Algebra
Definition - Linear Algebra: The study of vectors and certain rules to manipulate vectors.
The book will use bold letters to represent vectors, but I think I will continue to use the arrow notation, $\vec{v}$, because it is easier to write in KaTeX.
A vector is a number, or set of numbers, with both magnitude and direction. A scalar is just a number, just a magnitude. As you can imagine, this changes the rules of the typical game of algebra.
Examples of vectors:
Geometric Vectors
directed line segments, typically drawn as arrows
Polynomials
can be added together and multiplied by a scalar λ∈R, resulting in a polynomial.
Audio Signals - represented as a series of numbers
Elements of $\mathbb{R}^n$
This book focuses on tuples of n real numbers
$$a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3$$
Definition - Closure: a set is closed under an operation if applying that operation to elements of the set always produces another element of the set.
Vectors are things that can be multiplied by scalars and added together. Polynomials can be considered vectors.
Closure is the question, what is the set of all things that can result from my proposed operations?
2.1 - Systems of Linear Equations
The general form of a system of linear equations is
$$\begin{aligned} a_{11}x_1 + \cdots + a_{1n}x_n &= b_1 \\ &\;\vdots \\ a_{m1}x_1 + \cdots + a_{mn}x_n &= b_m \end{aligned}$$
Where $a_{ij} \in \mathbb{R}$ and $b_i \in \mathbb{R}$. The $a_{ij}$ are known coefficients (constants), and the $x_j$ are the unknowns we solve for.
When solving a system of linear equations there are three types of solutions:
No solution - when equations contradict each other
Unique Solution - what we really want.
Infinitely many solutions - the system is underdetermined; some equations are redundant, leaving free variables.
We now introduce the more “compact” notation
$$x_1 \begin{bmatrix} a_{11} \\ \vdots \\ a_{m1} \end{bmatrix} + \cdots + x_n \begin{bmatrix} a_{1n} \\ \vdots \\ a_{mn} \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix}$$
This then becomes
$$\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix}$$
I believe this involves the matrix product, something technically not discussed yet in the book. It would be good to introduce the concept first.
It’s a system of equations like:
$$\begin{aligned} a_1 x_1 + \dots + a_n x_n &= z_1 \\ &\;\vdots \\ b_1 x_1 + \dots + b_n x_n &= z_m \end{aligned}$$
Where, at the moment, everything is an element of the set of real numbers.
The author introduces a new, fun notation for systems of linear equations:
Definition - Matrix: With $m, n \in \mathbb{N}$, a real-valued $(m, n)$ matrix $A$ is an $m \cdot n$-tuple of elements $a_{ij}$, where $i = 1, \dots, m$ and $j = 1, \dots, n$, ordered according to a rectangular scheme consisting of $m$ rows and $n$ columns.
The convention is actually opposite to the Cartesian coordinates… to me at least.
You have row vectors and column vectors as special cases.
It is important to note the following:
$$A = \begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix} \in \mathbb{R}^{2 \times 3}$$
This is just to visualize the $m \times n$ notation. To me, it's the opposite of the $(x, y)$ you would think of graphically. If we have an $a \times b$ matrix and treat the first number $a$ as belonging to the x-direction, it counts how many rows there are, not how far to traverse along the x-axis. Likewise, the second number $b$ counts the number of columns we are dealing with.
Just remember that it is not the same as coordinates.
When identifying an element in a table, it might just be easier to remember it as the opposite of coordinates. And maybe that makes sense, because the row index increases as you move downwards.
$$B = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}$$
2.2.1 Matrix Addition and Multiplication
p. 22
The sum of two matrices $A, B \in \mathbb{R}^{m \times n}$ is just element-wise: $(A + B)_{ij} = a_{ij} + b_{ij}$. I am sure I've written it down in other notes.
There’s also matrix multiplication, $c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}$. Again, I’m sure it’s written somewhere. This text introduces the useful concept that neighbouring dimensions must match for matrices to be compatible for multiplication.
Good point for some examples…
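Since the book hasn't given a numeric example at this point, here is a small NumPy sketch of my own (made-up matrices) showing element-wise addition and the neighbouring-dimension rule for multiplication:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[0, 1, 1],
              [2, 0, 3]])        # shape (2, 3)
C = np.array([[1, 0],
              [0, 1],
              [2, 2]])           # shape (3, 2)

print(A + B)   # element-wise sum, only valid because the shapes match
print(A @ C)   # (2, 3) @ (3, 2) -> (2, 2); the neighbouring dimensions (3 and 3) match
# A @ B would raise a ValueError: (2, 3) @ (2, 3) has mismatched inner dimensions
```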
Definition - Identity Matrix: In Rn×n, the identity matrix is
$$I_n := \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{n \times n}$$
Something like that, I think you get the picture: a diagonal of 1's and 0's everywhere else.
Now that we have defined matrix addition, multiplication, and the identity, here are some properties
Associativity
$$\forall A \in \mathbb{R}^{m \times n}, B \in \mathbb{R}^{n \times p}, C \in \mathbb{R}^{p \times q}: \quad (AB)C = A(BC)$$
I stress that the order of the factors still matters (matrix multiplication is not commutative); associativity only changes the grouping, and the neighbouring dimensions must still match.
Distributivity
$$\forall A, B \in \mathbb{R}^{m \times n}, C, D \in \mathbb{R}^{n \times p}: \quad (A + B)C = AC + BC, \qquad A(C + D) = AC + AD$$
Multiplication with Identity
$$\forall A \in \mathbb{R}^{m \times n}: \quad I_m A = A I_n = A$$
Note that $I_m \neq I_n$ for $m \neq n$.
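A quick numeric sanity check of these properties (my own NumPy sketch with random matrices; `np.allclose` is used because floating-point results only agree approximately):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 3))
B = rng.random((3, 4))
C = rng.random((4, 5))
D = rng.random((2, 3))
E = rng.random((3, 4))

print(np.allclose((A @ B) @ C, A @ (B @ C)))        # associativity
print(np.allclose((A + D) @ B, A @ B + D @ B))      # right distributivity
print(np.allclose(A @ (B + E), A @ B + A @ E))      # left distributivity
print(np.allclose(np.eye(2) @ A, A),
      np.allclose(A @ np.eye(3), A))                # I_m A = A I_n = A
```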
2.2.3 Multiplication by a Scalar
p. 25
They have all the fun properties.
2.2.4 Compact Representation of Systems of Linear Equations
p. 26
We can rewrite a system of equations compactly as $Ax = b$ using matrix multiplication.
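For concreteness, a tiny NumPy sketch of the compact form with made-up numbers: multiplying the coefficient matrix $A$ by a vector $x$ reproduces the left-hand sides of the system.

```python
import numpy as np

# 2*x1 + 3*x2 + 5*x3 = b1
# 4*x1 - 2*x2 - 7*x3 = b2
A = np.array([[2.0, 3.0, 5.0],
              [4.0, -2.0, -7.0]])   # coefficient matrix
x = np.array([1.0, 0.0, 0.0])       # a candidate solution
b = A @ x                           # matrix-vector product gives the left-hand sides
print(b)                            # [2. 4.]
```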
2.3 Solving Systems of Linear Equations
So, in linear equations, we have constants and unknowns.
2.3.1 Particular and General Solution
A particular solution (or special solution) is found by setting the free variables to zero and solving for the remaining unknowns, i.e., expressing $b$ as a combination of the pivot columns of $A$.
To find the other solutions, we add to the particular solution any combination of solutions of the homogeneous system $Ax = 0$, which we construct by scaling and combining columns so that they cancel out.
For a unique solution, you'll need as many useful equations as you have unknowns. I'll say "useful" with tongue in cheek to mean that the equations aren't redundant or contradictory; they add information to our analysis.
This section is basically a lead-in to Gaussian elimination as a more systematic method for solving linear systems; a quick numeric check is sketched below.
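As a rough illustration (my own NumPy sketch, not the book's worked example): a square, full-rank system has the unique solution we want, while a redundant equation drops the rank and with it the uniqueness.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
b = np.array([5.0, 5.0])

# Unique solution: A is square with full rank (no redundant/contradictory rows).
print(np.linalg.matrix_rank(A))   # 2
print(np.linalg.solve(A, b))      # [1. 2.]

# A redundant second equation (a multiple of the first) drops the rank to 1,
# so the system no longer pins down a unique solution.
A_red = np.array([[1.0, 2.0],
                  [2.0, 4.0]])
print(np.linalg.matrix_rank(A_red))  # 1
```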
2.3.2 Elementary Transformations
To solve a system of linear equations we employ elementary transformations. These keep the solution set the same while transforming the equation system into a simpler form. Here are the moves:
Exchange of two equations (rows)
Multiplication of an equation (row) by a nonzero constant
Addition of two equations (rows)
We also now introduce the augmented matrix $[A \mid b]$ to write the system $Ax = b$ more compactly, where $A$ is the coefficient matrix and $b$ is the right-hand-side vector.
$$\left[\begin{array}{cc|c} a & b & c \\ d & e & f \\ g & h & i \end{array}\right]$$
It’s actually a little bit of a pain to write in KaTeX because you have to use arrays and specify the left and right… but whatever keeps your boat afloat.
Definition - Row-Echelon Form: A matrix is in row-echelon form if the following hold:
All rows that contain zeros are at the bottom of the matrix.
Looking at nonzero rows only, the first nonzero number from the left, called the pivot or the leading coefficient, is always strictly to the right of the pivot of the row above it.
So the entries below the pivots form a triangle of zeros in the bottom-left of the matrix.
p. 30
Definition - Reduced Row-Echelon Form: An equation system is in reduced row-echelon form if (see the SymPy sketch after this list):
it is in row-echelon form
every pivot is 1
the pivot is the only nonzero entry in its column.
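A small SymPy sketch (my addition, not from the book): `Matrix.rref()` returns the reduced row-echelon form together with the pivot columns, which is exactly what Gaussian elimination is driving toward.

```python
import sympy as sp

# Augmented matrix [A | b] for a small system.
M = sp.Matrix([[1, 2, -1, 3],
               [2, 4,  0, 8],
               [1, 2,  1, 5]])

rref_M, pivot_cols = M.rref()
print(rref_M)       # reduced row-echelon form: every pivot is 1 and alone in its column
print(pivot_cols)   # indices of the pivot columns
```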
This chapter continues on through elementary transformations, groups, vector spaces, linear independence, basis, rank, linear mappings, image and kernel, affine spaces… the chapter is nearly a full-blown linear algebra course.
Ch. 3 - Analytic Geometry
p. 70
Ch. 4 - Matrix Decompositions
p. 98
Ch. 5 - Vector Calculus
p. 139
In machine learning, we are asking a model to explain data. Many algorithms are used to optimize an objective function with respect to a set of desired model parameters. Finding good parameters can be phrased as an optimization problem.
A function $f$ is a quantity that relates two quantities to each other. Our quantities are typically inputs $x \in \mathbb{R}^D$ and targets $f(x)$. We assume all are real-valued if not stated otherwise. Also, $\mathbb{R}^D$ is the domain of $f$, and the function values $f(x)$ are the image/codomain of $f$.
We will explore the difference quotient, partial derivatives, the Jacobian, the Hessian, and Taylor series, which are concepts used in many other parts of machine learning.
Let’s map the domains,
$$f : \mathbb{R}^D \to \mathbb{R}, \qquad x \mapsto f(x)$$
We clarify that $f$ is a mapping from $\mathbb{R}^D$ to $\mathbb{R}$. The second mapping specifies the explicit assignment of an input to an output. A function $f$ assigns every input $x$ exactly one function value $f(x)$. Different inputs can share the same value, but one $x$ can never point to multiple values.
5.1 Differentiation of Univariate Functions
A univariate function $y = f(x)$ has only one independent variable $x$. We will also take $x, y \in \mathbb{R}$.
Definition - The Difference Quotient: The difference quotient is
$$\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}$$
The symbols use a lowercase delta but still express a small change. This computes the slope of the secant line through two points on the graph of $f$.
Taking the limit $\delta x \to 0$, we obtain the slope of the tangent of $f$ at $x$.
Definition - Derivative: For h>0, the derivative of f at x is defined as
$$\frac{df}{dx} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
p. 142 goes through a proof of the derivative for f(x)=xn. It uses combinations.
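As a quick numeric illustration (my own sketch, not the book's proof): the difference quotient with a shrinking $h$ approaches the analytic derivative, e.g. for $f(x) = x^3$ at $x_0 = 2$, where $f'(2) = 12$.

```python
def f(x):
    return x ** 3

x0 = 2.0
for h in (1e-1, 1e-3, 1e-6):
    # difference quotient (f(x+h) - f(x)) / h: slope of the secant through x0 and x0 + h
    slope = (f(x0 + h) - f(x0)) / h
    print(h, slope)   # tends to the true derivative 3 * x0**2 = 12 as h -> 0
```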
5.1.1 Taylor Series
Definition - Taylor Polynomial: The Taylor polynomial of degree n of f:R→R at x0 is defined as
$$T_n(x) := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k$$
For clarity, $f^{(k)}(x_0)$ is the $k$-th derivative of $f$ at $x_0$. Essentially, we approximate $f$ near $x_0$ by summing up contributions from its derivatives at $x_0$.
Definition - Taylor Series: For a smooth function $f \in \mathcal{C}^\infty$, $f : \mathbb{R} \to \mathbb{R}$, the Taylor series of $f$ at $x_0$ is
$$T_\infty(x) := \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k$$
Definition - Maclaurin Series: The Maclaurin series is a special instance of the Taylor series taken for x0=0. It comes with many very useful expressions.
Note that if $f \in \mathcal{C}^\infty$, then $f$ must be continuously differentiable infinitely many times. A good example is the exponential function.
As an example, you should take some Taylor series expansions of $f(x) = \sin(x) + \cos(x) \in \mathcal{C}^\infty$. Go for $x_0 = 0$; a numeric version is sketched below.
Following along on pp. 144-145, you can end up finding the power series representations of the sine and cosine function.
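Here is a small sketch of that exercise (my own code, assuming only the Taylor definition): the Maclaurin derivatives of $\sin(x) + \cos(x)$ cycle through $1, 1, -1, -1, \dots$, so the degree-$n$ polynomial can be summed directly and compared against the exact value.

```python
import math

def taylor_sin_plus_cos(x, n):
    """Degree-n Maclaurin polynomial of f(x) = sin(x) + cos(x).

    The derivatives of f at 0 cycle through 1, 1, -1, -1, ...
    """
    cycle = (1, 1, -1, -1)
    return sum(cycle[k % 4] * x ** k / math.factorial(k) for k in range(n + 1))

x = 1.2
for n in (2, 4, 8):
    approx = taylor_sin_plus_cos(x, n)
    print(n, approx, math.sin(x) + math.cos(x))   # the approximation improves with n
```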
Definition - Power Series: A Taylor series is a special case of a power series. The power series is defined as
$$f(x) = \sum_{k=0}^{\infty} a_k (x - c)^k$$
5.1.2 Differentiation Rules
p. 145
Defines the product rule, quotient rule, sum rule, and chain rule.
5.2 Partial Differentiation and Gradients
The generalization of the derivative to functions of several variables is the gradient. We find the gradient of $f$ with respect to $x$ by varying one variable at a time and keeping the others constant. The gradient is the collection of these partial derivatives.
Definition - Partial Derivative: For a function $f : \mathbb{R}^n \to \mathbb{R}$, $x \mapsto f(x)$, $x \in \mathbb{R}^n$ of $n$ variables $x_1, \dots, x_n$, we define the partial derivatives as
$$\frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2, \dots, x_n) - f(x)}{h}, \quad \dots, \quad \frac{\partial f}{\partial x_n} = \lim_{h \to 0} \frac{f(x_1, \dots, x_{n-1}, x_n + h) - f(x)}{h}$$
and collect them in the row vector
$$\nabla_x f = \frac{df}{dx} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}.$$
The row vector just defined is called the gradient of f, or the Jacobian. It is a generalization of the derivatives from before.
The Jacobian is actually a special case of the general definition. More on this later.
You might commonly see the gradient defined as a column vector instead of like our row vector. We are using row vectors because
we can consistently generalize the gradient to vector-valued functions
we can immediately apply the multi-variate chain rule without paying attention to the dimension of the gradient.
pp. 147-148 list rules for partial differentiation.
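A minimal sketch of my own (the function $f(x_1, x_2) = x_1^2 x_2 + \sin(x_2)$ is made up for illustration): the gradient row vector is built by varying one variable at a time, here approximated with central differences and compared with the analytic partials.

```python
import numpy as np

def f(x):
    # f(x1, x2) = x1^2 * x2 + sin(x2)
    return x[0] ** 2 * x[1] + np.sin(x[1])

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient (row vector of partials) by central differences."""
    grad = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[0, i] = (f(x + e) - f(x - e)) / (2 * h)   # vary one variable at a time
    return grad

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))
# analytic gradient: [2*x1*x2, x1^2 + cos(x2)]
print(np.array([[2 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])]]))
```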
5.2.2 Chain Rule
p. 148
The book dives a little deeper into the chain rule. Consider $f : \mathbb{R}^2 \to \mathbb{R}$ of the two variables $x_1, x_2$, and let them be functions of $t$, that is $x_1(t)$ and $x_2(t)$. To compute the gradient of $f$ with respect to $t$ we need the chain rule for multivariate functions:
$$\frac{df}{dt} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \frac{dx_1(t)}{dt} \\ \frac{dx_2(t)}{dt} \end{bmatrix} = \frac{\partial f}{\partial x_1}\frac{dx_1}{dt} + \frac{\partial f}{\partial x_2}\frac{dx_2}{dt}$$
So, $d$ denotes a total (ordinary) derivative or gradient, and $\partial$ denotes partial derivatives.
We can extend this further by letting our functions $x_1(s, t)$ and $x_2(s, t)$ themselves depend on two variables. We then take partial derivatives of the original function $f$ with respect to $s$ and $t$:
$$\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial s}, \qquad \frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t}$$
Hopefully we can see how the vector (matrix) multiplication $\frac{df}{d(s,t)} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial (s,t)}$ works in this instance.
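A quick numeric check of the multivariate chain rule (my own sketch with made-up functions $f(x_1, x_2) = x_1^2 + 2x_2$, $x_1(t) = \sin t$, $x_2(t) = \cos t$): the chain-rule expression matches a direct finite-difference derivative in $t$.

```python
import numpy as np

def f(x1, x2):
    return x1 ** 2 + 2 * x2

def x1(t):
    return np.sin(t)

def x2(t):
    return np.cos(t)

t = 0.7
# chain rule: df/dt = (df/dx1) * dx1/dt + (df/dx2) * dx2/dt
chain = 2 * x1(t) * np.cos(t) + 2 * (-np.sin(t))

# direct check: differentiate g(t) = f(x1(t), x2(t)) numerically
h = 1e-6
direct = (f(x1(t + h), x2(t + h)) - f(x1(t - h), x2(t - h))) / (2 * h)

print(chain, direct)   # the two agree up to finite-difference error
```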
5.3 Gradients of Vector-Valued Functions
p. 149
Now, we generalize the concept of the gradient to vector-valued functions, or vector fields. That is, $f : \mathbb{R}^n \to \mathbb{R}^m$, where $n \geq 1$ and $m > 1$.
For a function f:Rn→Rm and a vector x=[x1,…,xn]⊤∈Rn the corresponding vector of function values is given as
$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{bmatrix} \in \mathbb{R}^m$$
It’s like running through a remapping I guess. Derivatives and partials work the same.
Definition - Jacobian: The collection of all first-order partial derivatives of a vector-valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ is called the Jacobian. The Jacobian $J$ is an $m \times n$ matrix defined as
$$J = \nabla_x f = \frac{df}{dx} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
The special case of $f : \mathbb{R}^n \to \mathbb{R}^1$, which maps a vector $x \in \mathbb{R}^n$ onto a scalar, possesses a Jacobian that is a row vector of size $1 \times n$.
We are using the numerator layout of the derivative. That is, the derivative $df/dx$ of $f \in \mathbb{R}^m$ with respect to $x \in \mathbb{R}^n$ is an $m \times n$ matrix: the elements of $f$ define the rows, and the elements of $x$ define the columns of the corresponding Jacobian.
There also exists the denominator layout, which is the transpose of the numerator layout.
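A small sketch of the numerator-layout Jacobian (my own example function $f : \mathbb{R}^2 \to \mathbb{R}^3$, not from the book): finite differences fill the $3 \times 2$ matrix column by column, rows following $f$ and columns following $x$.

```python
import numpy as np

def f(x):
    # f: R^2 -> R^3
    return np.array([x[0] * x[1],
                     np.sin(x[0]),
                     x[0] + x[1] ** 2])

def numerical_jacobian(f, x, h=1e-6):
    """m x n Jacobian in numerator layout: rows follow f, columns follow x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)   # partials w.r.t. x_j fill column j
    return J

x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))
# analytic Jacobian for comparison: [[x2, x1], [cos(x1), 0], [1, 2*x2]]
```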
In a previous section (of the book), it was shown that the determinant can be used to compute the area of a parallelogram. In particular, the area is given as the absolute value of the determinant.
Consider a parallelogram with sides $c_1 = [-2, 1]^\top$, $c_2 = [1, 1]^\top$. The area of the parallelogram described by these sides is
$$\left| \det\!\left( \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix} \right) \right| = |(-2)(1) - (1)(1)| = |-3| = 3$$
Very interesting property. The area is three times the area of the unit square. We can find this scaling factor by finding a mapping that transforms the unit square into the parallelogram!
We will look at 2 approaches to identify this mapping.
Really quick, the unit square is given by the vectors $b_1 = [1, 0]^\top$ and $b_2 = [0, 1]^\top$. The absolute value of the determinant of the matrix with these columns is 1.
Approach 1: Linear Transform
Identify that both $\{b_1, b_2\}$ (the unit square) and $\{c_1, c_2\}$ are bases of $\mathbb{R}^2$.
We effectively perform a change of basis from (b1,b2) to (c1,c2).
We identify the desired basis change such that Jb1=c1 and Jb2=c2.
$$J = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}$$
The absolute value of the determinant of J is the scaling factor we desire.
Approach 2: Nonlinear Transform
This is a more general approach using partial derivatives.
Consider a function f:R2→R2
it performs a variable transformation.
In our case, f maps coordinate representation of any vector x∈R2 with respect to (b1,b2) onto coordinate representation y∈R2 with respect to (c1,c2).
We must identify the mapping to compute how area (or volume) changes when transformed by f.
We must find out how f(x) changes if we modify x by a small amount.
The question is answered with the Jacobian matrix $\frac{df}{dx} \in \mathbb{R}^{2 \times 2}$. We write out the component functions:
$$y_1 = -2x_1 + x_2, \qquad y_2 = x_1 + x_2$$
We now have 2 functions, each with 2 variables, so we can take partial derivatives. Since everything is to the first power, holding the other variable constant nicely zeros it out. You should obtain
$$J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}$$
The Jacobian represents the coordinate transformation we are looking for.
It is exact if the transformation is linear, as it is in this case, and it recovers exactly the basis-change matrix. The Jacobian determinant is the factor by which areas or volumes are scaled when coordinates are transformed.
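To tie the two approaches together numerically (my own NumPy sketch): the Jacobian of this linear map is exactly the basis-change matrix, its columns are the images of $b_1, b_2$, and the absolute value of its determinant is the scaling factor 3.

```python
import numpy as np

J = np.array([[-2.0, 1.0],
              [1.0, 1.0]])   # Jacobian of the (linear) variable transformation

print(np.linalg.det(J))        # -3.0 (up to floating-point error)
print(abs(np.linalg.det(J)))   # 3.0: the unit square's area is scaled by 3

# The columns of J are the images of the unit-square basis vectors b1, b2:
b1, b2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(J @ b1, J @ b2)          # c1 = [-2, 1], c2 = [1, 1]
```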
Variable transformation becomes useful when transforming random variables and probability distributions. It becomes relevant in machine learning when training deep neural networks using the reparametrization trick, also known as infinitesimal perturbation analysis.
p.153 shows a generic example.
Example
Suppose
$$f(x) = Ax, \qquad f(x) \in \mathbb{R}^M, \quad A \in \mathbb{R}^{M \times N}, \quad x \in \mathbb{R}^N$$
Computing the gradient $df/dx$ requires first determining its dimensions: since $f : \mathbb{R}^N \to \mathbb{R}^M$, we have $df/dx \in \mathbb{R}^{M \times N}$. Then determine all partial derivatives and collect them into the Jacobian matrix. Here $\partial f_i / \partial x_j = A_{ij}$, so $df/dx = A$.
□
p. 154 shows an example of the gradient of a least-squares loss, which is a helpful ML technique.
5.4 Gradients of Matrices
Start with, what is a tensor? Think of a tensor like a multidimensional array that collects partial derivatives. In the future, we will need to take gradients of matrices with respect to vectors (or other matrices), which results in a multidimensional tensor.
This is better shown by example. Suppose we have matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$. The Jacobian of $A$ with respect to $B$ would be of size $(m \times n) \times (p \times q)$, which is a four-dimensional tensor. Its entries would be
$$J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}}$$
Because matrices represent linear mappings, there exists a vector-space isomorphism (a linear, invertible mapping) between the space $\mathbb{R}^{m \times n}$ of $m \times n$ matrices and the space $\mathbb{R}^{mn}$ of vectors of length $mn$. Thus, we can re-shape our matrices into vectors of lengths $mn$ and $pq$ respectively, and compute the gradient as an ordinary $mn \times pq$ Jacobian.
Matrices can be transformed into vectors by stacking the columns of the matrix (flattening).
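A short sketch of the flattening idea (my own, using NumPy's Fortran order for column stacking): an $m \times n$ matrix becomes a length-$mn$ vector, the gradient is then an ordinary $(mn) \times (pq)$ Jacobian, and reshaping recovers the four-dimensional tensor. The Jacobian here is just a zero placeholder to show the shapes.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)

vec_A = A.reshape(-1, order="F")     # column stacking ("flattening"): [1 4 2 5 3 6]
print(vec_A)

# After flattening both matrices, the gradient of vec(A) w.r.t. vec(B) is an
# ordinary (m*n) x (p*q) Jacobian; reshaping it (consistently, e.g. Fortran
# order on both axes) recovers the four-dimensional tensor of partials.
m, n, p, q = 2, 3, 2, 2
J_flat = np.zeros((m * n, p * q))                  # placeholder Jacobian
J_tensor = J_flat.reshape(m, n, p, q, order="F")   # shape (2, 3, 2, 2)
print(J_flat.shape, J_tensor.shape)
```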
5.5 Useful Identities for Computing Gradients
p. 158 lists some very useful gradients frequently required in Machine Learning context.