ML-MDS - Machine Learning
Spring 2023
Jose Antonio Lorencio Abril
Professor: Marta Arias
Student e-mail: jose.antonio.lorencio@estudiantat.upc.edu
This is a summary of the course Machine Learning taught at the Universitat Politècnica de Catalunya by Professor Marta Arias in the academic year 22/23. Most of the content of this document is adapted from the course notes by Arias [2], so I won't be citing them all the time. Other references will be provided when used.
1 Introduction
I gave ChatGPT the index of these notes and asked it for an introduction. Here is the result:
Welcome to the comprehensive study notes for your course in Machine Learning. This document is designed to provide you with a thorough overview of the primary topics in machine learning, to supplement your coursework and study.
Our journey will begin with an overview of Machine Learning and the foundational elements that build this exciting field. The aim is to provide you with a sound understanding of the core principles and techniques involved in the development and application of Machine Learning algorithms.
We will kick-off with 'Linear Regression,' a cornerstone of machine learning, going in-depth into topics like least squares method, issues encountered in linear regression, basis functions, and probabilistic approaches. It will also focus on Bias-Variance decomposition, ridge regression, LASSO regression, and finally, a full-Bayesian perspective.
Our exploration then continues with 'Clustering' where we cover important clustering techniques like k-Means, k-Means++, and ways to determine the optimal number of clusters. We delve deeper into the realm of Gaussian Mixtures and the Expectation-Maximization technique.
In the 'Linear Classifiers' chapter, we inspect the decision boundary in probabilistic models, generative classifiers, Naive Bayes, and methodologies like Perceptron and Logistic Regression.
Following this, we introduce 'Nearest Neighbor Prediction'—a different way of approaching classification problems. The section will equip you with tools to understand the concept of locality and choosing the best 'k', and how to combine outputs to make predictions.
We also explore the use of 'Trees and Random Forests' in machine learning. It will demystify decision trees and random forests, dealing with imbalanced data in classification, and providing insights into the interpretability of random forests.
Next, we journey into the world of 'Neural Networks' with the 'Multi-Layer Perceptron (MLP)', learning about the different error functions used for regression, binary classification, and multi-class classification. We also cover the critical process of training the MLP using Backpropagation and some common activation functions.
Finally, the notes conclude with an appendix section, 'Notes on probability theory, Bayes theorem and Bayesian learning.' This section is critical for understanding the probability theory basics, Bayes rule in the context of learning, maximum likelihood estimation, properties of estimators, maximum a posteriori estimation, and finally, Bayesian learning.
These notes aim to serve as a comprehensive reference guide for you as you explore machine learning, providing both a high-level understanding of the field and the mathematical and algorithmic details that underlie each concept. Happy learning!
2 Linear regression
2.1 Introduction
In Figure 1, we can observe a dataset of the population of different states plotted against the number of fatal accidents in each of the states. Here, each blue circle corresponds to a row of our data, and its coordinates are the values in that row. The red line is the linear regression model of this data. This means it is the line that 'best' approximates the data, where best refers to minimizing some kind of error: in this case, the squared error between each point and its vertical projection onto the line. This approach is called the least squares method.
2.2 Least squares method
2.2.1 Least squares in 2D
In 2D, we have a dataset $\{(x_i,y_i)\}_{i=1}^n$ and we want to find the line that best approximates $y$ as a function of $x$. As we want a line, we need to specify its slope, $\theta_1$, and its intercept, $\theta_0$. So, our estimations are $\hat{y}_i = \theta_0 + \theta_1 x_i$. The least squares linear regression method chooses $\theta_0$ and $\theta_1$ in such a way that the error function
$$E(\theta_0,\theta_1) = \sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i)^2$$
is minimized.
Note that this function only depends on the parameters $\theta_0$ and $\theta_1$, since the data is assumed to be fixed (they are observations).
To compute them, we just need to find the minimum of $E$, by taking partial derivatives and setting them to 0. Let's do this optimization. Developing the square and differentiating:
$$\frac{\partial E}{\partial \theta_0} = -2\sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i), \qquad \frac{\partial E}{\partial \theta_1} = -2\sum_{i=1}^n x_i(y_i - \theta_0 - \theta_1 x_i).$$
We have now to solve the system given by setting both derivatives to 0. We can isolate $\theta_0$ from the first equation:
$$\theta_0 = \bar{y} - \theta_1\bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the sample means, and substitute it in the second one. After dividing everything by $n$, this yields:
$$\frac{1}{n}\sum_{i=1}^n x_iy_i - \bar{x}\bar{y} = \theta_1\left(\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2\right).$$
If we now assume that the observations are equiprobable, i.e., $p_i = \frac{1}{n}$, and we call $X$ the random variable from which the observations $x_i$ are obtained, and the same for the observations $y_i$, obtained from $Y$, then the left hand side is $Cov(X,Y)$ and the parenthesis is $Var(X)$. This means that the previous equation can be rewritten as $Cov(X,Y) = \theta_1 Var(X)$. So
$$\theta_1 = \frac{Cov(X,Y)}{Var(X)}, \qquad \theta_0 = \bar{y} - \theta_1\bar{x}.$$
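To make these formulas concrete, here is a quick numeric check in Python (the data values are made up for illustration):
import numpy as np

# Hypothetical sample: x = population, y = fatal accidents
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

theta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = Cov(X,Y)/Var(X)
theta0 = y.mean() - theta1 * x.mean()               # intercept = mean(y) - slope*mean(x)
print(theta0, theta1)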
2.2.2 Least squares regression: multivariate case
Now, we can assume that we have $d$ independent variables $x_1,\dots,x_d$ which we want to use to predict the dependent variable $y$. Again, we have $n$ observations of each variable. Now, we want to construct a hyperplane in $\mathbb{R}^{d+1}$, whose predictions would be obtained as $\hat{y} = \theta^Tx = \sum_{j=0}^d \theta_jx_j$, where we define $x_0 = 1$ for all observations. We can represent the observations as the rows of a matrix $X \in \mathbb{R}^{n\times(d+1)}$, so that we can write $\hat{y} = X\theta$. The error function is defined as in the simple case, $E(\theta) = \sum_{i=1}^n(y_i - \theta^Tx_i)^2$, but now we can rewrite this as $E(\theta) = (y - X\theta)^T(y - X\theta)$. Again, to obtain $\theta$ we need to optimize this function, using matrix calculus.
Lemma 2.1.
If $x, a \in \mathbb{R}^d$, and $A \in \mathbb{R}^{d\times d}$ is a symmetric matrix, it holds:
$$\nabla_x(a^Tx) = a \qquad \text{and} \qquad \nabla_x(x^TAx) = 2Ax.$$
Proof.
First, notice that $a^Tx = \sum_j a_jx_j$, so it is $\frac{\partial(a^Tx)}{\partial x_j} = a_j$. For the second result, the procedure is the same, expanding $x^TAx = \sum_{j,k}A_{jk}x_jx_k$ and using the symmetry of $A$.
Lastly, notice that $E(\theta) = (y - X\theta)^T(y - X\theta) = y^Ty - 2\theta^TX^Ty + \theta^TX^TX\theta$, so the lemma applies to the last two terms.
Now, we can proceed and minimize $E$:
$$\nabla_\theta E = -2X^Ty + 2X^TX\theta;$$
setting this to 0, we get $X^TX\theta = X^Ty$. Thus, the 'best' linear model is given by
$$\hat{\theta} = (X^TX)^{-1}X^Ty.$$
Once we have this model, if we have an observation of $x$, and we want to make a prediction, we compute $\hat{y} = \hat{\theta}^Tx$. The approach that we have followed here is the optimization view of learning, which basically consists of the steps below (see the sketch after this list):
- Set up an error function as a function of some parameters.
- Optimize this function to find the suitable values for these parameters, assuming the data as given.
- Use incoming values to make predictions.
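A minimal NumPy sketch of these three steps for least squares, solving the normal equations rather than explicitly inverting $X^TX$:
import numpy as np

def least_squares(X, y):
    # Prepend a column of ones so theta[0] plays the role of the intercept
    Xb = np.column_stack([np.ones(len(X)), X])
    # Solve the normal equations X^T X theta = X^T y
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta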
2.2.3 Computation of least squares solution via the singular values decomposition (SVD)
Inverting $X^TX$ can entail numerical problems, so the SVD can be used instead.
Theorem 2.1.
Any matrix $X \in \mathbb{R}^{n\times d}$, $n \geq d$, can be expressed as $X = UDV^T$, where $U \in \mathbb{R}^{n\times d}$ has orthonormal columns $u_1,\dots,u_d$, $D \in \mathbb{R}^{d\times d}$ is diagonal and contains the singular values $\sigma_1 \geq \dots \geq \sigma_d \geq 0$ in its diagonal, and $V \in \mathbb{R}^{d\times d}$ is orthonormal.
Proof.
Let $A = X^TX$. $A$ is square, symmetric and positive semidefinite. Therefore, $A$ is diagonalizable, so it can be written as $A = V\Lambda V^T$, where $V$ is orthogonal and $\Lambda = diag(\lambda_1,\dots,\lambda_d)$ with $\lambda_1 \geq \dots \geq \lambda_r > \lambda_{r+1} = \dots = \lambda_d = 0$, and $r$ is $rank(X)$.
Now, define $\sigma_j = \sqrt{\lambda_j}$, and form the matrix $D = diag(\sigma_1,\dots,\sigma_d)$. Define also, for $j \leq r$, $u_j = \frac{Xv_j}{\sigma_j}$. Then, these vectors are orthonormal:
$$u_i^Tu_j = \frac{v_i^TX^TXv_j}{\sigma_i\sigma_j} \overset{(1)}{=} \frac{\lambda_jv_i^Tv_j}{\sigma_i\sigma_j} \overset{(2)}{=} \delta_{ij},$$
where $(1)$ is because $V$ is formed with the eigenvectors of $A$, and $(2)$ is because $V$ is orthonormal.
Now, we can complete the basis with $u_{r+1},\dots,u_d$ (using Gram-Schmidt) in such a way that $U = [u_1,\dots,u_d]$ is column orthonormal.
Now, if it is the case that $Xv_j = \sigma_ju_j$ for all $j$, then $XV = UD$, so $X = UDV^T$, and it is only left to see that indeed this holds. Consider two cases:
- $j \leq r$: $Xv_j = \sigma_ju_j$ by the definition of $u_j$.
- $j > r$: It is $\|Xv_j\|^2 = v_j^TX^TXv_j = \lambda_j = 0$ because $\lambda_j = 0$. As $\|Xv_j\| = 0$, it must be $Xv_j = 0$. On the other side of the equation we also have 0 because $\sigma_j = 0$.
This, added to the fact that if $X$ has full rank, $D$ is invertible and all its singular values non null, gives us:
$$\hat{\theta} = (X^TX)^{-1}X^Ty = VD^{-2}V^TVDU^Ty = VD^{-1}U^Ty.$$
Intuitive interpretation
The intuition behind the SVD is summarized in Figure 2. Basically, every linear transformation can be decomposed into a rotation, a scaling and a simpler transformation (column orthogonal).
The intuition behind SVD lies in the idea of finding a low-rank approximation of a given matrix. The rank of a matrix is the number of linearly independent rows or columns it contains. A high-rank matrix has many linearly independent rows or columns, which makes it complex and difficult to analyze. On the other hand, a low-rank matrix has fewer linearly independent rows or columns, which makes it simpler and easier to analyze.
SVD provides a way to find the best possible low-rank approximation of a given matrix by decomposing it into three components. The left singular vectors represent the direction of maximum variance in the data, while the right singular vectors represent the direction of maximum correlation between the variables. The singular values represent the magnitude of the variance or correlation in each direction.
By truncating the diagonal matrix of singular values to keep only the top-k values, we can obtain a low-rank approximation of the original matrix that retains most of the important information. This is useful for reducing the dimensionality of data, compressing images, and solving linear equations, among other applications.
Example 2.1.
How to use SVD in Python and Matlab.
import numpy as np

# Thin SVD: X = U @ np.diag(d) @ Vt
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(1/d)            # inverse of the diagonal matrix of singular values
theta = Vt.T @ D @ U.T @ y  # theta = V D^{-1} U^T y
Algorithm 1: SVD in Python.
[U, S, V] = svd(X, 'econ');   % thin SVD: X = U*S*V'
D = diag(1./diag(S));         % inverse of the singular values
theta = V*D*U'*y;             % theta = V * S^{-1} * U' * y
Algorithm 2: SVD in Matlab.
2.3 Things that could go wrong when using linear regression
2.3.1 Our independent variable is not enough
It is possible that our variable $x$ does not provide enough information to predict $y$.
2.3.2 The relationship between the variables is not linear (underfitting)
It is also possible that the variables are related in non-linear ways.
2.3.3 Outliers affect the fit
In the presence of outliers, the model obtained can be distorted, leading to bad results.
2.4 Basis Functions
In order to fix the second problem (Subsection 2.3.2), we can make use of basis functions. The idea is to apply different transformations to the data, so that we can extend the expressive power of our model.
Definition 2.1.
A feature mapping is a non-linear transformation of the inputs, $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^m$.
The resulting predictive function or model is $\hat{y}(x) = \theta^T\phi(x)$.
Example 2.2.
For example, we can consider the polynomial expansion of degree $k$, which is a commonly used feature mapping that approximates the relationship between the independent variable and the dependent variable to be polynomial of degree $k$, i.e.:
$$\hat{y} = \theta_0 + \theta_1x + \theta_2x^2 + \dots + \theta_kx^k.$$
The feature mapping is $\phi(x) = (1, x, x^2, \dots, x^k)$.
Note that the idea is to transform the data so that the fit is still linear in the parameters, even if the relationship is not. Of course, this requires applying the same transformation whenever we receive an input for which we want to make predictions. Also, the resulting model is more complex, so complexity control is necessary to avoid overfitting.
When we apply $\phi$ to the input matrix $X$, we get a new input matrix, $\Phi$, and we obtain the optimal solution as before:
$$\hat{\theta} = (\Phi^T\Phi)^{-1}\Phi^Ty.$$
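As an illustration (with synthetic data, which is an assumption of this sketch), a degree-3 polynomial fit in NumPy; note that the fit itself is still linear in $\theta$:
import numpy as np

def poly_design_matrix(x, k):
    # phi(x) = (1, x, x^2, ..., x^k), applied row-wise
    return np.vander(x, k + 1, increasing=True)

x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * np.random.randn(50)

Phi = poly_design_matrix(x, 3)
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]          # linear fit in theta
y_new = poly_design_matrix(np.array([0.5]), 3) @ theta  # same transformation at prediction time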
Example 2.3.
A MATLAB example
2.5 Probabilistic approach
A review on probability theory, Bayes theorem and Bayesian Learning is in Appendix A.
2.5.1 Least squares regression from a probabilistic perspective
We are now going to derive the linear regression estimates using the principle of maximum likelihood and the univariate Gaussian distribution, whose probability density function is given by
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
Given a sample $S = \{x_1,\dots,x_n\}$, where $x_i \sim p(x \mid \theta)$ independently, we define its likelihood as the function
$$L(\theta) = \prod_{i=1}^n p(x_i \mid \theta).$$
In Figure 6, we can see how the likelihood relates to 'how likely it is that our points have been created from a certain distribution', because the red outcomes are more likely to appear from the blue distribution than the green ones.
Now, this can be used to select the distribution that best matches our data. As an easy approach, suppose we want to decide between two distributions $p_1$ and $p_2$, and we have a dataset $S$. To decide, we can compute $L_1 = \prod_i p_1(x_i)$ and $L_2 = \prod_i p_2(x_i)$, and select the distribution whose likelihood is greater. This is visually exemplified in Figure 7, where we can see that given the three red outcomes and the two distributions (blue and green), the blue one should be preferred, because it maximizes the likelihood.
This way, we can think of the likelihood as a function of the unknown parameters of the distribution, with the dataset fixed, and we can maximize this function to obtain the parameters that best describe our data.
In the probabilistic setting of linear regression, we assume that each label we observe is normally distributed, with mean $\theta^Tx_i$ and variance $\sigma^2$:
$$y_i = \theta^Tx_i + \varepsilon_i,$$
where $\varepsilon_i \sim N(0, \sigma^2)$. This way, we seek to obtain the $\theta$ and $\sigma$ that best describe our data. Remember that $y_i \sim N(\theta^Tx_i, \sigma^2)$, so that the likelihood of the parameter vector is given by
$$L(\theta, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \theta^Tx_i)^2}{2\sigma^2}}.$$
It is usual to maximize the log-likelihood instead, basically because the likelihood tends to give values too close to zero (we may be multiplying thousands of small values), so numerical problems may arise. Thus:
$$l(\theta, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \theta^Tx_i)^2.$$
At this point, we differentiate and set equal to 0, obtaining
$$\hat{\theta} = (X^TX)^{-1}X^Ty \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{\theta}^Tx_i)^2.$$
It is noticeable that the maximum likelihood estimates coincide with the estimates we found minimizing the squared error. This is a consequence of assuming Gaussian noise; assuming other types of noise distribution corresponds to minimizing different error functions.
2.6 Bias-Variance decomposition
Let $f$ be the true function that we are trying to approximate and $D = \{(x_i, y_i)\}_{i=1}^n$ a finite training dataset, where $y_i = f(x_i) + \varepsilon_i$ and $\varepsilon_i \sim N(0, \sigma^2)$. Let $(x_0, y_0)$ be a test data point. The setup is using $D$ to train a model $\hat{y}(x; D)$ that we want to use to make predictions $\hat{y}(x_0; D)$. We are going to see how the expected squared error $E[(y_0 - \hat{y}(x_0; D))^2]$, where $y_0$ is the real value, can be decomposed as a sum of the following components:
- Irreducible error: given by the noise variance $\sigma^2$.
- Bias: the systematic limitation that the modelling assumptions impose. For example, if we choose linear approximations, we will never be able to model non-linear data well enough.
- Variance: refers to the sensitivity of the model to the training set $D$. The more the model varies when the training set is changed, the higher variance it has.
Let's do this. Writing $\bar{h}(x_0) = E_D[\hat{y}(x_0; D)]$ for the average prediction over training sets, adding and subtracting it inside the square, and using that the noise is independent of $D$:
$$E[(y_0 - \hat{y}(x_0; D))^2] = \sigma^2 + (\bar{h}(x_0) - f(x_0))^2 + E_D[(\hat{y}(x_0; D) - \bar{h}(x_0))^2].$$
And we now define
$$Bias(x_0) = \bar{h}(x_0) - f(x_0), \qquad Var(x_0) = E_D[(\hat{y}(x_0; D) - \bar{h}(x_0))^2].$$
The Bias reflects the expected difference between our assumed model and the real function, while the variance reflects the difference between the assumed model and the obtained model. In Figure 8, we can see:
- The linear model has high bias and low variance.
- The polynomial of degree 3 has low bias and moderate variance.
- The polynomial of degree 8 has low bias but high variance.
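This decomposition can be estimated empirically by re-sampling many training sets and refitting; a sketch in NumPy (the true function, noise level and sample sizes are arbitrary choices for the demo):
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # "true" function, assumed for the demo
x0, sigma = 0.3, 0.2                         # test point and noise level

preds = {deg: [] for deg in (1, 3, 8)}
for _ in range(500):                         # many training sets D
    x = rng.uniform(0, 1, 20)
    y = f(x) + sigma * rng.normal(size=20)
    for deg in preds:
        coefs = np.polyfit(x, y, deg)        # degree-8 fits may warn: they are ill-conditioned
        preds[deg].append(np.polyval(coefs, x0))

for deg, p in preds.items():
    p = np.array(p)
    bias2 = (p.mean() - f(x0)) ** 2          # squared bias at x0
    var = p.var()                            # variance across training sets
    print(deg, bias2, var)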
A summary of commonly used error functions for regression is shown in Table 1.
Name | Abbreviation | Formula
mean squared error | MSE | $\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$
root mean squared error | RMSE | $\sqrt{MSE}$
normalized root mean squared error | NRMSE | $\sqrt{MSE / \widehat{Var}(y)}$
coefficient of determination | $R^2$ | $1 - NRMSE^2$
mean absolute error | MAE | $\frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$
Table 1: Common error functions.
2.7 Ridge Regression from Gaussian prior
We are going to consider MAP estimates in this section, which are explained in Appendix A.
Assume an isotropic Gaussian prior on the $d$-dimensional $\theta$, i.e., $\theta \sim N(0, \lambda^{-1}I)$, so that the prior density is $p(\theta) \propto e^{-\frac{\lambda}{2}\theta^T\theta}$. On the other hand, the likelihood is $p(y \mid X, \theta) \propto e^{-\frac{1}{2\sigma^2}\|y - X\theta\|^2}$. To obtain the MAP, we maximize the log of the product of these expressions, which is equivalent to minimizing
$$E(\theta) = \|y - X\theta\|^2 + \lambda\sigma^2\|\theta\|^2,$$
which is the ridge regression objective with regularization parameter $\lambda\sigma^2$. We can now differentiate this expression to find its minimum:
$$\nabla_\theta E = -2X^T(y - X\theta) + 2\lambda\sigma^2\theta = 0 \implies \hat{\theta} = (X^TX + \lambda\sigma^2I)^{-1}X^Ty.$$
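In code, the resulting closed-form estimator is one line; in this sketch, `lam` stands for the whole effective regularization parameter:
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    # theta = (X^T X + lambda I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)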
2.7.1 Tuning $\lambda$
Cross-validation
Given a dataset $D$, the $k$-fold cross-validation approach starts by separating $D$ into two subsets:
- The training dataset, $D_{tr}$.
- The test dataset, $D_{te}$.
These are obtained in such a way that:
- $D = D_{tr} \cup D_{te}$.
- $D_{tr} \cap D_{te} = \emptyset$.
Now, we divide $D_{tr}$ into $k$ subsets of equal size, $F_1,\dots,F_k$, called folds, and imagine we want to decide on a set of values $\Lambda = \{\lambda_1,\dots,\lambda_m\}$ for our model's parameter $\lambda$.
- For each $\lambda \in \Lambda$:
  - For each fold $i = 1,\dots,k$:
    - Train the model on $D_{tr} \setminus F_i$.
    - Evaluate the model on $F_i$.
  - Average the evaluations to obtain an estimation of the performance of the model.
- Select the $\lambda$ that gives us the best estimation. A sketch of this procedure is shown below.
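A minimal sketch of this procedure for tuning the ridge parameter $\lambda$ (it reuses the ridge() function sketched above, which is an assumption of this sketch):
import numpy as np

def cv_ridge(X, y, lambdas, k=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    scores = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            test = folds[i]
            train = np.hstack([folds[j] for j in range(k) if j != i])
            theta = ridge(X[train], y[train], lam)   # train on all folds but one
            errs.append(np.mean((y[test] - X[test] @ theta) ** 2))
        scores.append(np.mean(errs))                 # average over folds
    return lambdas[int(np.argmin(scores))]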
Leave-one-out cross-validation (LOOCV)
LOOCV is a special case of cross-validation in which $k = |D_{tr}|$, i.e., each data point is a fold.
2.7.2 LOOCV for Ridge regression
As a particularity of linear and ridge regression, for a given value of $\lambda$, only one training is necessary for LOOCV, so we can proceed as follows (see the sketch after this list):
- For each $\lambda$:
  - Compute the optimal solution $\hat{\theta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty$.
  - Compute the hat matrix or smoothing matrix $H(\lambda) = X(X^TX + \lambda I)^{-1}X^T$.
  - Compute LOOCV directly, without folding:
$$LOOCV(\lambda) = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - H_{ii}(\lambda)}\right)^2.$$
- Return the $\lambda$ with minimum LOOCV.
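In code, this shortcut avoids any refitting; a sketch:
import numpy as np

def loocv_ridge(X, y, lam):
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)  # hat matrix H(lambda)
    y_hat = H @ y
    # LOOCV residuals without refitting: (y_i - yhat_i) / (1 - H_ii)
    return np.mean(((y - y_hat) / (1 - np.diag(H))) ** 2)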
2.7.3 Generalized Cross-Validation (GCV)
Generalized Cross-Validation (GCV) is a model selection technique used to estimate the performance of a model in terms of prediction accuracy. GCV is particularly useful for choosing the optimal parameters in regularization or smoothing methods, where the goal is to balance model complexity and goodness of fit. Examples of such methods include ridge regression, LASSO, and smoothing splines.
The main idea of GCV is to approximate the leave-one-out cross-validation (LOOCV) error without actually performing the computationally expensive process of fitting the model to all but one data point multiple times. Also, it is more computationally stable than the previous approach.
The GCV score is defined as follows:
$$GCV(\lambda) = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - tr(H(\lambda))/n}\right)^2,$$
where $tr(H(\lambda))$ is the trace of the hat matrix. That is, GCV replaces each individual leverage $H_{ii}$ of LOOCV by their average.
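A sketch mirroring the LOOCV one, with the only change being the denominator:
import numpy as np

def gcv_ridge(X, y, lam):
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    y_hat = H @ y
    # replace each 1 - H_ii of LOOCV by the average 1 - tr(H)/n
    return np.mean(((y - y_hat) / (1 - np.trace(H) / n)) ** 2)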
2.8 LASSO regression
Definition 2.2.
The $L_1$ norm of a vector $\theta$ is $\|\theta\|_1 = \sum_j|\theta_j|$.
Another very common choice is the $L_1$ norm, which leads to LASSO regression. Thus, LASSO regression minimizes the $L_1$ norm of the parameters together with the squared error:
$$\hat{\theta} = \arg\min_\theta \|y - X\theta\|^2 + \lambda\|\theta\|_1.$$
In fact, LASSO regression arises assuming a Laplace distribution prior over the parameters.
Some characteristics:
- The LASSO regularized cost function is not quadratic anymore, and it has no closed-form solution, so an approximation procedure is used: least angle regression (LARS), which provides an efficient way to compute the solutions for a list of possible values of $\lambda$, giving the regularization path.
- LASSO regression gives sparse solutions, in which many coefficients/coordinates of $\theta$ might be 0. This means that LASSO performs feature selection automatically.
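For illustration, scikit-learn provides an implementation whose alpha parameter plays the role of $\lambda$ (the synthetic data below is just for the demo):
import numpy as np
from sklearn.linear_model import Lasso

X = np.random.randn(100, 10)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(100)  # only 2 relevant features

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most coefficients come out exactly 0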
2.9 The full-Bayesian perspective
Both maximum likelihood (ML) and maximum a posteriori (MAP) produce point estimates of the parameters, while in Bayesian Learning we want the full posterior distribution of the parameters. The idea is that if we know the posterior distribution of the parameters, then we can use all the information provided by this distribution for our predictions, instead of just a single point-estimate that summarizes it. For instance, if the probability function of the posterior is $p(\theta \mid D)$ and we receive a new input point $x$, then we can compute the probability of $y$ by
$$p(y \mid x, D) = \int p(y \mid x, \theta)p(\theta \mid D)d\theta.$$
Now, when we do this for all possible values of $y$, we obtain the full distribution of the predictions, instead of just one estimation. Nonetheless, computing this integral is usually too hard, so it needs to be approximated; but in the context of linear regression all these expressions have closed-form formulas.
Technically, ML and MAP assume that $\theta$ is a fixed (if unknown) vector, so a prediction for a new test point $x$ is going to have a distribution $N(\hat{\theta}^Tx, \sigma^2)$. Note now the lack of flexibility of this approach, since the width of the normal distribution is going to be the same for any new test point, which may be a dangerous assumption.
Let $D = \{(x_i, y_i)\}_{i=1}^n$ be our dataset, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, and assume:
- The $y_i$ are independent given $\theta$.
- $y_i \sim N(\theta^Tx_i, \beta^{-1})$, with $\beta = \frac{1}{\sigma^2}$ being the precision of the noise in the observations $y_i$.
- A spherical or isotropic Gaussian for the parameter's prior, $\theta \sim N(0, \alpha^{-1}I)$.
- $\alpha$ and $\beta$ are known.
- The only parameter variables are the coefficients $\theta$.
Then, the likelihood function is
$$p(y \mid X, \theta) = \prod_{i=1}^n N(y_i \mid \theta^Tx_i, \beta^{-1}) \propto e^{-\frac{\beta}{2}\|y - X\theta\|^2}.$$
As usual, using Bayes, we can derive the posterior distribution $p(\theta \mid X, y) \propto p(y \mid X, \theta)p(\theta)$. Notice here that the exponent of this expression is quadratic in $\theta$, so the posterior is going to be a multivariate Gaussian. We need to turn the exponent into something resembling $-\frac{1}{2}(\theta - m)^TS^{-1}(\theta - m)$, so that we can derive what the mean and the precision of the posterior density are. For this, we are going to complete squares so that we can 'match the terms'; we don't mind about constant terms with respect to $\theta$, since we only care about proportionality. Doing so, the posterior probability is
$$\theta \mid X, y \sim N(m_n, S_n),$$
with
$$S_n^{-1} = \alpha I + \beta X^TX, \qquad m_n = \beta S_nX^Ty.$$
The MAP estimate can be directly obtained from here, since the maximum density is obtained at the mean in any Gaussian distribution. Additionally, in ridge regression we let $\lambda = \frac{\alpha}{\beta}$ and turn it into a parameter that we can tune to control complexity against training error.
2.9.1 Using the posterior distribution for predictions
Let's now see how to compute the predictive distribution, i.e.,
$$p(y \mid x_0, X, y) = \int p(y \mid x_0, \theta)p(\theta \mid X, y)d\theta.$$
For this, we substitute the densities, both of which are Gaussian, and we complete squares again :D The integral of a Gaussian density is 1, so what remains is itself a Gaussian in $y$. Carrying out the computation, the predictive distribution is
$$y \mid x_0, X, y \sim N\left(m_n^Tx_0,\ \frac{1}{\beta} + x_0^TS_nx_0\right).$$
Not only this, but the variance decomposes into the noise variance $\frac{1}{\beta}$ plus the term $x_0^TS_nx_0$, which comes from the posterior uncertainty about $\theta$.
We note now that the predictive distribution's mean prediction equals the point-prediction of the MAP. However, the variance of the prediction does depend on $x_0$, which is good, since the uncertainty of our predictions should depend on how far the observed samples are:
- If observed samples are near our new inputs, then we should be more certain.
- If they are far, then we should be less certain.
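A compact sketch of the posterior and predictive formulas ($\alpha$ and $\beta$ are assumed known, as in the derivation):
import numpy as np

def posterior(X, y, alpha, beta):
    d = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)  # posterior covariance
    m_N = beta * S_N @ X.T @ y                               # posterior mean (= MAP)
    return m_N, S_N

def predictive(x0, m_N, S_N, beta):
    mean = m_N @ x0
    var = 1 / beta + x0 @ S_N @ x0   # grows as x0 moves away from the data
    return mean, var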
3 Clustering
The goal of clustering is to partition a dataset into groups called clusters, in such a way that observations in the same cluster tend to be more similar than observations in different clusters. The input data is embedded in a $d$-dimensional space with a similarity/dissimilarity function defined among elements in the space, which should capture relatedness among elements in this space. Two elements are understood to be related when they are close in the space. Thus, a cluster is a compact group that is separated from other groups or elements outside the cluster.
There is a large variety of clustering algorithms, such as hierarchical bottom-up or top-down clustering, probabilistic clustering, possibilistic clustering, algorithmic clustering, spectral clustering or density-based clustering.
The problem of clustering is quite complex: if we have $n$ data points which we want to separate into $k$ clusters, then there are
$$S(n, k) = \frac{1}{k!}\sum_{j=0}^k(-1)^j\binom{k}{j}(k - j)^n$$
possibilities. This is the Stirling number of the second kind.
If in addition we don't know how many clusters we want to use, we have to add over all possible $k$, summing up to $B_n = \sum_{k=1}^n S(n, k)$ possibilities. This number is the Bell number, which is really gigantic.
3.1 k-Means
The $k$-Means clustering algorithm takes a dataset $D$ and an integer $K$ as input, and separates $D$ into $K$ disjoint clusters. It is a representative-based clustering, meaning that each cluster is represented by one single point. In the case of $k$-Means, the representative is the cluster center, $\mu_k$, i.e., the average of all the points in that cluster. Each point in $D$ is thus assigned to its closest representative point.
If $C_k$ is the $k$-th cluster, then we consider it a better cluster when the value $\sum_{x \in C_k}\|x - \mu_k\|^2$ is smaller.
Now, let's formalize all this. First, we introduce indicator variables $r_{ik} \in \{0, 1\}$, with $r_{ik} = 1$ if and only if $x_i$ is assigned to cluster $k$, and the objective function
$$J(r, \mu) = \sum_{i=1}^n\sum_{k=1}^K r_{ik}\|x_i - \mu_k\|^2,$$
which we aim to minimize by selecting appropriate $r_{ik}$ and $\mu_k$. The issue is that this problem is NP-hard, so we use a heuristic method that is only guaranteed to find local minima. This method relies on two facts:
- For fixed cluster centers $\mu_k$, it is easy to optimize the cluster assignments $r_{ik}$.
Proof.
Assume the $\mu_k$ fixed; then we assign each $x_i$ to the closest $\mu_k$, because if we assigned it to a different center, $\mu_j$, then we could decrease the sum by reassigning it to the closest one.
- For fixed cluster assignments $r_{ik}$, it is easy to optimize the cluster centers $\mu_k$.
Proof.
Assume the $r_{ik}$ fixed; then, setting $\frac{\partial J}{\partial \mu_k} = -2\sum_i r_{ik}(x_i - \mu_k) = 0$, the minimum is obtained at
$$\mu_k = \frac{\sum_i r_{ik}x_i}{\sum_i r_{ik}}.$$
This is the average of the points of the cluster.
The pseudocode is illustrated in Algorithm 3.
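Since the algorithm is short, here is a minimal NumPy sketch of the two alternating steps (empty clusters are not handled, and initialization is plain random sampling):
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]        # random initial centers
    for _ in range(n_iters):
        # assignment step: each point to its closest center
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        # update step: each center to the mean of its cluster
        new_mu = np.array([X[labels == k].mean(0) for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels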
The characteristics of -Means are:
- Pros:
- Easy implementation.
- Fast, even for large datasets.
- Cons:
- Can converge to local minimum.
- Needs the number of clusters, , as input.
- Hard cluster assignments: meaning that each point corresponds to a single cluster, which may not always be what we want.
- Sensitive to outliers and clusters of different sizes and densities.
- Sensitive to initialization, so it is usual to run it many times and keep the best run.
- Biased towards rounded clusters, because it uses the Euclidean distance.
3.2 k-Means++
$k$-Means++ is a variant of $k$-Means that uses a heuristic for initializing the cluster centers, shown in Algorithm 4.
choose first center <@$\mu_1$@> uniformly at random from all available examples
for k=2,...,K do
    choose next center <@$\mu_k$@> at random, picking example <@$x_i$@> with probability proportional to <@$||x_i-\mu_{l(i)}||^2$@>
    here, <@$\mu_{l(i)}$@> is the closest center to <@$x_i$@> picked so far
proceed with standard k-Means
3.3 Choosing the number of clusters
The number of clusters $K$ is a hyper-parameter that has to be set by the user, and there is no obvious way to choose an optimal $K$, since it may not even exist. This is due to the fact that there is no such thing as a true clustering against which to compare. Nevertheless, there are reasonable cluster quality criteria that can be used to select $K$. These criteria measure a balance between separation of clusters and their compactness. Depending on the problem, one criterion or another should be chosen.
3.3.1 Calinski-Harabasz index
The CH index uses Euclidean distances to measure cluster quality, so it is usually used with $k$-Means. It measures the ratio between:
- Separation of cluster centers: the sum of squared distances of the cluster centers to the overall mean, $B = \sum_k n_k\|\mu_k - \mu\|^2$.
- Cluster compactness: the sum of squared distances from each point to its assigned cluster center, $W = \sum_k\sum_{x \in C_k}\|x - \mu_k\|^2$.
Thus, it is:
$$CH(K) = \frac{B/(K - 1)}{W/(n - K)},$$
where $\mu$ is the overall mean of the data and $n_k = |C_k|$. Notice that the quantities are normalized by $K - 1$ and $n - K$ to avoid larger $K$ having better values.
The usual approach is to run $k$-Means with different values of $K$, and then select the $K$ that maximizes the index.
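The index is a direct transcription into code; a NumPy sketch (cluster labels are assumed to be integers 0..K-1):
import numpy as np

def calinski_harabasz(X, labels):
    n, K = len(X), len(np.unique(labels))
    overall = X.mean(0)
    B = sum((labels == k).sum() * np.sum((X[labels == k].mean(0) - overall) ** 2)
            for k in range(K))                   # between-cluster separation
    W = sum(np.sum((X[labels == k] - X[labels == k].mean(0)) ** 2)
            for k in range(K))                   # within-cluster compactness
    return (B / (K - 1)) / (W / (n - K))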
Example 3.1.
A -Means example in Matlab.
3.4 Gaussian Mixtures
A mixture of Gaussians is a distribution that is built using a convex sum of Gaussians, making it more flexible than a single Gaussian distribution. If the components of the mixture are $N(\mu_k, \Sigma_k)$, $k = 1,\dots,K$, and $\pi_k$ are the mixing coefficients, with $\sum_k\pi_k = 1$ and $\pi_k \geq 0$, then the density function of the mixture is given by
$$p(x \mid \theta) = \sum_{k=1}^K \pi_kN(x \mid \mu_k, \Sigma_k),$$
where $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ represents the parameters of the distribution.
The key assumption is that each data point has been generated from only one component; we just don't know which one.
In Figure 9, we can see different Gaussian mixtures that arise from the same components, choosing different mixing coefficients.
3.4.1 Clustering with a Gaussian mixture
The idea is to identify each component with a cluster, so we want to determine the component from which each point in the dataset is most likely to have been created, as well as the parameters of each of the distributions.
Thus, to cluster a dataset into $K$ clusters, the approach is the following:
- Use Expectation-Maximization (EM) to estimate the mixture, obtaining approximations $\hat{\pi}_k$, $\hat{\mu}_k$ and $\hat{\Sigma}_k$ for each $k$.
- Find assignments of each $x_i$ to the clusters.
In this case, the clustering is soft, as opposed to the hard clustering of $k$-Means. This means that we will obtain, for each point, the probability of belonging to each cluster.
3.4.2 A generative mixture of Gaussians
To sample from a mixture of Gaussians, we use a generative model with a latent variable $z \in \{0, 1\}^K$ whose components are all 0, except one which denotes the component from which we sample, and we do:
- Pick component $k$ with probability $\pi_k$. This means that we set $z_k = 1$ with probability $\pi_k$.
- Generate a sample $x \sim N(\mu_k, \Sigma_k)$.
The joint distribution of $x$ and $z$ is given by $p(x, z) = p(z)p(x \mid z)$, and the marginal distribution over $x$ is
$$p(x) = \sum_{k=1}^K \pi_kN(x \mid \mu_k, \Sigma_k).$$
Therefore, we can use Bayes to compute the conditional distribution of $z$ given $x$:
$$\gamma_{ik} := p(z_k = 1 \mid x_i) = \frac{\pi_kN(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_jN(x_i \mid \mu_j, \Sigma_j)}.$$
The quantity $\gamma_{ik}$ indicates how probable it is that the particular data point $x_i$ has been generated by the mixture component $k$. Or, in the context of clustering: how probable it is that $x_i$ belongs to cluster $k$. We use these quantities as the soft membership to each cluster. If a hard membership is needed, then we assign $x_i$ to cluster $k^*$, where $k^* = \arg\max_k \gamma_{ik}$.
3.4.3 Learning Gaussian mixtures with Expectation-Maximization
We have a dataset of unlabelled observations $X = \{x_1,\dots,x_n\}$ and we want to model it as a Gaussian mixture, with unknown parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}$, for a fixed $K$. For this we use the maximum likelihood approach. First, we compute the log-likelihood of $\theta$:
$$l(\theta) = \sum_{i=1}^n \log\left(\sum_{k=1}^K \pi_kN(x_i \mid \mu_k, \Sigma_k)\right).$$
This is hard to optimize... so we use the Expectation-Maximization approach. First, we can differentiate it to see what conditions must hold for local maxima: $\frac{\partial l}{\partial \mu_k} = 0$ leads to
$$\mu_k = \frac{\sum_i \gamma_{ik}x_i}{\sum_i \gamma_{ik}},$$
which is a weighted average of the points in our data, with weights being the soft assignments of each point to cluster $k$.
The problem now is that we cannot know the $\gamma_{ik}$ without $\mu_k$ and $\Sigma_k$. Next, $\frac{\partial l}{\partial \Sigma_k} = 0$ gives us
$$\Sigma_k = \frac{\sum_i \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_{ik}},$$
which is the sample covariance matrix of all the points, weighted by the soft assignments of each point to cluster $k$. We have the same problem!
Since we have the constraint $\sum_k \pi_k = 1$, we now maximize the Lagrangian with respect to $\pi_k$, obtaining
$$\pi_k = \frac{1}{n}\sum_i \gamma_{ik},$$
which is the average of all soft assignments of the points to cluster $k$. Again, the same problem is present.
Therefore, we are in a situation in which we can estimate $\pi_k$, $\mu_k$ and $\Sigma_k$ if we know the $\gamma_{ik}$, and we can compute the $\gamma_{ik}$ from the estimates of $\pi_k$, $\mu_k$ and $\Sigma_k$; and we can use this in our benefit by following the pseudocode depicted in Algorithm 5. Commonly, the initializations are done using the result of $k$-Means in the following manner:
- Run $k$-Means with $K$ clusters.
- Set $\mu_k$ to the mean of cluster $k$.
- Set $\Sigma_k$ to the sample covariance of cluster $k$.
- Set $\pi_k$ as the fraction of examples assigned to cluster $k$.
Algorithm 5: EM algorithm.
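As the pseudocode is not reproduced here, the following is a compact NumPy/SciPy sketch of the two alternating steps (random initialization is used instead of the $k$-Means-based one, for brevity):
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]
    Sigma = np.array([np.cov(X.T) for _ in range(K)])
    pi = np.full(K, 1 / K)
    for _ in range(n_iters):
        # E step: gamma_ik = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)
        gamma = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                 for k in range(K)])
        gamma /= gamma.sum(1, keepdims=True)
        # M step: re-estimate the parameters from the soft assignments
        Nk = gamma.sum(0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
        pi = Nk / n
    return pi, mu, Sigma, gamma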
Example 3.2.
An example of the EM algorithm using Matlab
3.4.4 Special cases
The shape of the Gaussians can be restricted, obtaining special cases of mixtures:
- No restrictions on $\Sigma_k$: the general case, in which each cluster can have general Gaussian shape.
- $\Sigma_k$ diagonal: each Gaussian component is forced to have no correlation among input dimensions.
- $\Sigma_k$ isotropic or spherical ($\Sigma_k = \sigma_k^2I$): each Gaussian component is forced to be spherical, so no correlation among input variables and the same scaling across each input variable.
4 Linear Classifiers
In classification, we have a labelled dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{c_1,\dots,c_K\}$ are labels. Thus, the tuple $(x_i, y_i)$ means that the vector $x_i$ is associated to the label $y_i$. With this setup, we aim at producing a classification model, meaning a function that, given a new input $x$, predicts the label that it should be associated with. Usually, we distinguish between two kinds of classification, attending to the number of labels considered:
- In binary classification, there are only two possible labels, $K = 2$.
- In multi-class classification, there are more than two labels, $K > 2$.
Now, we introduce some useful terms:
Definition 4.1.
The decision regions are a partition of the feature space, $\mathbb{R}^d$, into regions $R_1,\dots,R_K$, such that all points in $R_k$ are assigned the label $c_k$. Intuitively, they are the regions in which we divide the feature space, so that all elements inside a region have the same label.
The decision boundaries are the points in the frontier between decision regions.
A classifier can then be understood as the creation of the decision regions.
A linear classifier is a classifier in which the decision boundaries are $(d-1)$-dimensional hyperplanes.
Example 4.1.
A visualization.
In Figure 10, we can see different classification models and the regions they generate for the iris dataset. The left model corresponds to a linear classifier, and we can see how the decision boundaries are linear. The other two models are not linear.
Many useful classifiers don't just predict the input's class, but also the probabilities of the input belonging to each class. This is desirable, as it enables us to express uncertainty about the prediction that we make. For example, in binary classification it is a common approach to map the target labels to be 0 or 1, and then make predictions as a continuous value in $[0, 1]$. More explicitly, we can have labels $\{healthy, sick\}$ and encode $healthy \rightarrow 0$, $sick \rightarrow 1$, so that given a patient $x$, we obtain a prediction $\hat{y} \in [0, 1]$, indicating the probability of the patient being sick.
In classification with $K$ classes, it is common to encode the labels using one-hot encoding, meaning that we map the labels into the set $\{e_1,\dots,e_K\}$, where $e_k$ is the $k$-th standard basis vector. For example, if we have three labels, then they are encoded as $(1,0,0)$, $(0,1,0)$, $(0,0,1)$. In this scenario, predictions are usually points in the $K$-simplex, i.e., a prediction must satisfy $\hat{y} \in [0,1]^K$ and $\sum_k \hat{y}_k = 1$. Each $\hat{y}_k$ represents the probability of the input belonging to class $c_k$.
4.1 Decision boundary in probabilistic models
From a probabilistic perspective, we can think of the joint probability of examples and labels, $p(x, y)$. When we are building a classifier, we then want to minimize the expected loss, or expected error. Note, nonetheless, that loss here is a bit different from the regression loss. A natural way to think about this loss is through loss or cost matrices. A cost matrix is a table like the following, where the entry $\ell_{jk}$ is the cost of predicting $c_j$ when the real class is $c_k$:
Predicted \ Real | $c_1$ | $c_2$ | ... | $c_K$
$c_1$ | 0 | $\ell_{12}$ | ... | $\ell_{1K}$
$c_2$ | $\ell_{21}$ | 0 | ... | $\ell_{2K}$
... | ... | ... | ... | ...
$c_K$ | $\ell_{K1}$ | $\ell_{K2}$ | ... | 0
This matrix indicates the cost of each error. Note that not all errors need to have the same cost. For example, in a medical context, it has a higher cost to predict that a sick patient is healthy (this person could potentially die), than to predict that a healthy person is sick (in which case further tests would probably correct the mistake). A cost matrix for this case could look like the following:
Predicted \ Real | healthy | sick
healthy | 0 | 60
sick | 10 | 0
We will focus on the case of all errors having the same impact, which is called the 0-1 loss.
Let's now see how, given a new example $x$, we can use a 'rule' to choose the label that $x$ should have. We have random variables $X$ and $Y$ with joint distribution $p(x, y)$. We can compute the expected loss of assigning the label $c_j$ to $x$. For this, we first define the loss function as the 0-1 loss:
$$L(c_k, c_j) = \begin{cases}0 & \text{if } k = j,\\ 1 & \text{otherwise.}\end{cases}$$
Therefore:
$$E[L(y, c_j) \mid x] = \sum_{k=1}^K L(c_k, c_j)P(c_k \mid x) = 1 - P(c_j \mid x).$$
Of course, we want to minimize the expected loss, so we aim at predicting the class minimizing it:
$$\hat{y}(x) = \arg\max_{c_k} P(c_k \mid x).$$
This is called the Bayes classifier, and it is optimal when we use the 0-1 loss. Its error is given by the so-called Bayes error rate (BER):
$$BER = E_x\left[1 - \max_k P(c_k \mid x)\right].$$
Of course, we can use this classifier to partition the feature space into regions $R_1,\dots,R_K$, and we can compute the BER summing over all regions. Before, we claimed that the Bayes classifier is optimal. However, in practice we don't know the distribution $p(x, y)$, so it cannot be implemented exactly. Therefore, $P(c_k \mid x)$ is estimated from data, and these estimates are used for classification, incurring additional errors. To learn $P(c_k \mid x)$, there are two basic approaches, namely discriminative classifiers and generative classifiers.
4.2 Generative classifiers
Generative classifiers learn $P(c_k \mid x)$ through the Bayes rule: $P(c_k \mid x) \propto p(x \mid c_k)P(c_k)$.
4.2.1 Discriminant analysis
Discriminant analysis is the result of implementing a Bayes classifier assuming that the class-conditional distributions are Gaussian. This means that, having $K$ classes, it is
$$p(x \mid c_k) = N(x \mid \mu_k, \Sigma_k).$$
If we also assume that the prior distributions are $P(c_k) = \pi_k$ with $\sum_k \pi_k = 1$, then we define the discriminant functions
$$g_k(x) = \log(\pi_kN(x \mid \mu_k, \Sigma_k)) = \log\pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + const.$$
This is a quadratic discriminant function, and the corresponding classifier is implemented by predicting $\hat{y}(x) = \arg\max_k g_k(x)$. This corresponds to choosing the label with maximum probability a posteriori.
The decision boundaries in this case are those regions in which there exist $i \neq j$ with $g_i(x) = g_j(x)$.
These correspond to hyper-quadrics in the feature space, and this is a quadratic method, usually called quadratic discriminant analysis (QDA).
Of course, we can further simplify our assumptions, by assuming that all labels have the same covariance matrix, $\Sigma_k = \Sigma$ for all $k$. In this simpler case, the discriminant functions end up being
$$g_k(x) = \log\pi_k + \mu_k^T\Sigma^{-1}x - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k,$$
because now $\log|\Sigma|$ is constant for all $k$, so we can remove it. Furthermore, the term $x^T\Sigma^{-1}x$ is also constant with respect to $k$, so it will not affect the chosen label. Therefore, we end up with linear discriminant functions, in which the decision boundaries correspond to hyperplanes in the feature space. This is a linear method, usually called linear discriminant analysis (LDA).
In Figure 10, the left diagram corresponds to an LDA partitioning of the feature space for the iris dataset, while the one in the center is a QDA partitioning.
Further assumptions
Of course, we can make more simplifying assumptions for our model, such as:
- $\Sigma$ is diagonal: the features are conditionally independent given the class.
- $\Sigma$ is isotropic, i.e., $\Sigma = \sigma^2I$.
- All priors are equal, $\pi_k = \frac{1}{K}$, in which case the prior term can be dropped from the discriminant functions.
Distance-based learning perspective
In all the seen cases, we have a minimum-distance classifier in the following sense:
- The general QDA case corresponds to using a different Mahalanobis distance from $x$ to each class center $\mu_k$.
- The LDA case uses the same Mahalanobis distance from $x$ to each class center $\mu_k$.
- In the case of all covariance matrices being equal and diagonal, the distance is the weighted Euclidean distance.
- In the isotropic Gaussians case, the distance corresponds to the usual Euclidean distance.
Implementation
It is usual to use MLE and estimate the centers and covariance matrices using the training dataset. If we define the sets $C_k = \{x_i : y_i = c_k\}$ and $n_k = |C_k|$, then the estimates are
$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k}\sum_{x \in C_k}x,$$
and the covariance matrix is:
- In QDA: $\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{x \in C_k}(x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$.
- In LDA (pooled): $\hat{\Sigma} = \frac{1}{n - K}\sum_{k=1}^K\sum_{x \in C_k}(x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$.
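A compact sketch of QDA with these MLE estimates (NumPy only):
import numpy as np

def qda_fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),   # prior
                     Xc.mean(0),         # center
                     np.cov(Xc.T))       # per-class covariance (QDA)
    return params

def qda_predict(x, params):
    def g(c):
        pi, mu, S = params[c]
        diff = x - mu
        # discriminant: log prior - 1/2 log|S| - 1/2 squared Mahalanobis distance
        return np.log(pi) - 0.5 * np.linalg.slogdet(S)[1] \
               - 0.5 * diff @ np.linalg.solve(S, diff)
    return max(params, key=g)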
Final remarks on discriminant analysis
Bayesian classifiers are optimal when the class-conditional densities and priors are known. This means that, if we know the underlying distributions, then these classifiers are our best choice. Of course, this is not realistic, and estimations need to be made. However, normal distributions appear in a wide range of real scenarios, and even if we have to estimate the centers and covariance matrices, QDA and LDA are a very good choice when data resembles Gaussian distributions. In addition, they are well-principled, having a solid mathematical theory behind them, and they are fast and reliable.
Of course, if the real distribution is very far from being a Gaussian, then the model obtained will be poor, so one should take care of this.
Also, it is important to ensure that we are correctly estimating the parameters of the Gaussians, because otherwise the model will not work, not even with underlying Gaussian data.
And it is clear that once we are relying on sample estimates instead of population parameters, we lose the optimality of the method.
In practice, it is really hard to assess which assumptions hold and which ones do not, so we may be limited to a trial and error approach.
4.2.2 Regularized discriminant analysis (RDA)
When data is scarce, some problems can arise while using discriminant analysis. For example:
- If there are more dimensions than samples with some label, i.e., $n_k < d$ for some $k$, then QDA cannot be applied, because $\hat{\Sigma}_k$ is singular. The reason is that each of the samples is adding a rank-1 matrix, so to get a rank-$d$ matrix we need at least $d$ samples. Not only this: since we are subtracting the sample mean, the last sample is linearly dependent on the previous ones, so we actually need at least $d + 1$ samples to be able to construct a full-rank covariance matrix.
- If $n < d$, then we cannot use QDA nor LDA, as all covariance matrices are singular.
RDA computes the covariance matrices as
$$\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma},$$
where $\alpha \in [0, 1]$ is the regularization parameter and $\hat{\Sigma}$ is the pooled (LDA) covariance matrix. The method is QDA when $\alpha = 1$ and is LDA when $\alpha = 0$. In any other case, it is something in between.
A further way to regularize the matrices is by
$$\hat{\Sigma}_k(\alpha, \gamma) = (1 - \gamma)\hat{\Sigma}_k(\alpha) + \gamma\sigma^2I,$$
where $\gamma \in [0, 1]$, and the diagonal term improves the conditioning of the method.
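Both regularizations fit in a couple of lines; in this sketch, taking $\sigma^2 = tr(S)/d$ for the diagonal term is an assumption:
import numpy as np

def rda_cov(Sigma_k, Sigma_pooled, alpha, gamma=0.0):
    # alpha=1 -> QDA, alpha=0 -> LDA; gamma shrinks towards a scaled identity
    S = alpha * Sigma_k + (1 - alpha) * Sigma_pooled
    return (1 - gamma) * S + gamma * np.trace(S) / len(S) * np.eye(len(S))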
4.3 Naïve Bayes
The Naïve Bayes Classifier is a Bayesian classifier that assumes that the features are pair-wise independent in the class-conditional distribution. This means that the probability can be written as
$$p(x \mid c_k) = \prod_{j=1}^d p(x_j \mid c_k).$$
This assumption does not hold in general, but this approach can provide a good approximation in many cases. Also, it is practical, as the amount of parameters to estimate is small.
As before, we classify the input record in the class that maximizes the discriminant function:
$$g_k(x) = \log\pi_k + \sum_{j=1}^d \log p(x_j \mid c_k).$$
Therefore, we need to
- Estimate the class priors as the sample frequencies, $\hat{\pi}_k = \frac{n_k}{n}$.
- Estimate the class-conditional densities for each input feature independently.
Naïve Bayes can also be used in the case of categorical variables. To model binary features, we can use, for example, the Bernoulli distribution
$$p(x \mid p) = p^x(1 - p)^{1 - x},$$
where $x \in \{0, 1\}$ and $p$ is the probability of the event happening. For a binary feature, we would need to estimate $K$ parameters, one for each class. If we had all our features as binary, then the discriminant functions would be
$$g_k(x) = \log\pi_k + \sum_{j=1}^d\left[x_j\log p_{jk} + (1 - x_j)\log(1 - p_{jk})\right],$$
where $p_{jk}$ is the Bernoulli parameter for feature $j$ and class $k$. Note that this is a linear function with respect to $x$.
If instead we have categorical features with more values, we can then use the categorical distribution
$$p(x_j \mid c_k) = \prod_v p_{jvk}^{[x_j = v]},$$
where $[x_j = v]$ is 1 if $x_j = v$ is true and 0 otherwise, and $p_{jvk}$ is the categorical parameter for the value $v$ of feature $j$ and class $k$.
Now, we need to estimate these parameters, for which we can use the sample frequencies. Note how 0-frequencies can be a problem, so it is a common approach to utilize Laplace smoothing:
$$\hat{p}_{jvk} = \frac{N_{jvk} + \alpha}{N_k + \alpha V_j},$$
where $N_{jvk}$ is the number of examples of class $c_k$ whose feature $j$ takes value $v$, and $N_k$ is the number of examples of class $c_k$. Here:
- $\alpha$ is a weight parameter assigned to the prior distribution of observing the values. It is typically set to 1, just to avoid 0 values.
- $V_j$ is the number of modalities (number of distinct values) of the feature modelled.
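A sketch of categorical Naïve Bayes with this smoothing (features are assumed to be integer-coded, which is an assumption of this sketch):
import numpy as np

def nb_fit_categorical(X, y, alpha=1.0):
    # X: integer-coded categorical features, shape (n, d)
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        cond[c] = []
        for j in range(X.shape[1]):
            V = X[:, j].max() + 1                     # number of modalities of feature j
            counts = np.bincount(Xc[:, j], minlength=V)
            cond[c].append((counts + alpha) / (len(Xc) + alpha * V))  # Laplace smoothing
    return priors, cond

def nb_predict(x, priors, cond):
    def g(c):
        return np.log(priors[c]) + sum(np.log(cond[c][j][x[j]]) for j in range(len(x)))
    return max(priors, key=g)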
Example 4.2.
Naïve Bayes in MATLAB
4.3.1 Gaussian Naïve Bayes
If we have numerical features, the usual approach is to assume they follow Gaussian distributions; then we estimate their mean and variance using MLE. If all features are assumed Gaussian, this approach is equivalent to QDA with diagonal covariance matrices.
Other approaches are:
- Discretize numerical values and proceed with Categorical NB.
- Assume a different distribution and estimate its sample parameters from data.
Note that when the data is mixed, we can assume a different distribution for each feature, and then add the log-likelihoods altogether.
4.4 Perceptron and Logistic Regression
The perceptron is a mathematical model of the functioning of a neuron in our brain. In Figure 11 we can see a diagram of a neuron. Other neurons transmit signals into our neuron via the dendrites, then 'something' happens inside the neuron, and it transmits or not signals through its axon towards the neurons to which it is connected.
This behavior was mathematically modeled by Rosenblatt in his pioneering paper [4]. He did this with the concept of the perceptron: a unit that computes a weighted sum of its inputs, $S(x) = W^Tx$, and outputs an activation $F(S(x))$, where $F$ is a threshold function such as the sign function.
This is depicted in Figure 12.
This way, we can classify the input variables into two classes, 1 or -1. But... what are the weights?
Example 4.3.
Imagine we have the following classification function then it is easy to construct a perceptron that can classify the instances. We have:
- .
- .
- Therefore, we have
And we are good!
In this previous example we can grab some intuition on how the weights are chosen, but doing this by hand is not scalable at all. When we have several input variables, this process gets complicated quite a lot. Luckily, the perceptron algorithm helps us estimate the weights.
This algorithm is an on-line algorithm, meaning that it processes one training example at a time, updating $W$ incrementally. The training examples are pairs $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \{-1, 1\}$. Also, we can scale all $x$ to lie in the unit sphere, because this also scales the hyperplanes of classification and the classes remain the same. The algorithm goes as follows:
init W = 0
for epoch in range(1, e):
    for each training pair (X, y):
        prediction = <@$F(S(W^TX))$@>
        if prediction != y:
            # update W towards correctly classifying X
            W = W + lr*y*X
return W
Algorithm 6: Perceptron algorithm (input X, classes y, learning rate lr, epochs e) -> weights W
The updates are made like that because we can consider the error function $E(W) = -yW^TX$ only when $X$ is misclassified. In that case, $yW^TX < 0$, and thus the minus sign. We want to minimize this error, for which we follow a gradient descent approach (we will deepen on this later), so we update as
$$W_{new} = W - lr\cdot\nabla_WE = W + lr\cdot yX.$$
The basic idea is that if we now utilize these new weights on the same input, we would get
$$W_{new}^TX = W^TX + lr\cdot y\|X\|^2.$$
Now, say $y = 1$; then it is $W_{new}^TX > W^TX$, so we have made the input closer to being positive, and thus correctly classified. If $y = -1$, the same logic applies.
In fact, under some conditions, convergence is ensured:
Theorem 4.1.
Perceptron convergence theorem
For any finite set of linearly separable labeled examples, the Perceptron Learning Algorithm will halt after a finite number of iterations. In other words, after a finite number of iterations, the algorithm yields a vector W that classifies perfectly all the examples.
Example 4.4.
A simple perceptron in MATLAB
We can improve the approach by choosing a different activation function, ideally one that is differentiable. A usual alternative is the logistic function
$$\sigma(z) = \frac{1}{1 + e^{-z}},$$
which maps $\mathbb{R} \rightarrow (0, 1)$, so that its output can be interpreted as a probability, allowing us to represent uncertainty in the prediction. This function has the following properties:
- It verifies a pseudo-symmetry: $\sigma(-z) = 1 - \sigma(z)$.
- It is differentiable, with derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
- Its inverse is the logit function, $z = \log\frac{\sigma}{1 - \sigma}$.
When we use the logistic function, we are performing what is called logistic regression for binary classification. Again, we have a dataset of pairs $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$, associating label 1 with a positive example, and 0 with a negative example. In logistic regression we model
$$P(y = 1 \mid x) = \sigma(w^Tx),$$
where, again, $w$ is the weight vector. Therefore, we are modelling $y \mid x \sim Bernoulli(\hat{y})$, where $\hat{y} = \sigma(w^Tx)$. Notice that we don't assume anything about the distribution of $x$, though.
Why is logistic regression a linear classifier?
When we classify, we need to set a threshold above which we classify the label to be 1. Usually, this threshold is set to be $\frac{1}{2}$. Now,
$$\sigma(w^Tx) \geq \frac{1}{2} \iff w^Tx \geq 0.$$
And remember that the inverse of $\sigma$ is the logit, so, in fact, we are classifying using the same criterion as before: the hyperplane $w^Tx = 0$ separates the classes.
At this point, we can try to use maximum likelihood in the search of an optimal $w$:
$$L(w) = \prod_{i=1}^n \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i}.$$
As usual, we take the log-likelihood, but in this case we minimize the negative log-likelihood instead, which can be interpreted as an error function:
$$E(w) = -\sum_{i=1}^n\left[y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right].$$
This error expression is known as log loss or binary cross-entropy, and it is widely used in classification schemes.
Now, as we aim to find its minimum, we compute its gradient. For this, recall that $\hat{y}_i = \sigma(w^Tx_i)$ and that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so that $\nabla_w\hat{y}_i = \hat{y}_i(1 - \hat{y}_i)x_i$. Carrying out the derivative and simplifying:
$$\nabla_wE = \sum_{i=1}^n(\hat{y}_i - y_i)x_i.$$
In this case, it is not possible to find a closed-form solution, so we use iterative methods to approximate local minima. We are going to use gradient descent, which is one of the simplest approaches, but it is widely used.
4.4.1 Gradient descent
Gradient descent is a general numerical method to find local minima of a differentiable function . The idea is to use the fact that the gradient of a vector function points in the direction of maximum growth, and the negative gradient points in the direction of maximum decrease. Therefore, if we 'follow' this direction, we should approach a minimum of the function, although it might not be the global minimum.
The approach works as follows. To approximate a local minimum of :
- We start at a weight vector $w^0$ and set $t = 0$.
- Compute the gradient $\nabla E(w^t)$.
- Update the weight vector:
$$w^{t+1} = w^t - \gamma\nabla E(w^t).$$
Also update $t \leftarrow t + 1$. The parameter $\gamma$ is called the learning rate, and it quantifies 'how much' we advance in the direction of the negative gradient. It has to be chosen beforehand and it is critical, since:
  - If $\gamma$ is too big, we might jump over the minimum and the method might never converge.
  - If $\gamma$ is too small, the method might converge too slowly.
  Related to the learning rate is the importance of feature scaling, as the same learning rate impacts all features, so if they have different scales, the method might be good for some of them, but bad for others.
- Repeat until convergence or a maximum number of steps is reached.
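A minimal sketch of gradient descent on the binary cross-entropy (features are assumed standardized, since a single learning rate is shared by all of them):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, epochs=1000):
    # y in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y)   # gradient of the cross-entropy
        w -= lr * grad / len(y)
    return w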
4.4.2 Newton's algorithm
Newton's algorithm is an algorithm used to find roots of a function, and it converges faster than gradient descent and does not need a learning rate parameter. As this algorithm finds roots (i.e., zeros), we don't apply it directly to $E$, but rather to its gradient. Thus, in this case, we need $E$ to be twice differentiable and we must be able to afford computing its second derivatives, which can sometimes be costly.
In this case, the algorithm works as follows:
- We start at a weight vector $w^0$ and set $t = 0$.
- Compute the gradient $\nabla E(w^t)$ and the Hessian of $E$, $H(w^t) = \nabla^2E(w^t)$.
- Update the weight vector
$$w^{t+1} = w^t - H(w^t)^{-1}\nabla E(w^t)$$
and set $t \leftarrow t + 1$.
- Repeat until convergence or maximum number of steps is reached.
Iterated Reweighted Least Squares (IRLS)
When we apply Newton's method to the negative log-likelihood of logistic regression, it is called the IRLS method. The gradient is $\nabla E = X^T(\hat{y} - y)$ and the Hessian is $H = X^TRX$, where $R = diag(\hat{y}_i(1 - \hat{y}_i))$; each Newton step is then a weighted least squares solve, hence the name.
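A sketch of the resulting Newton/IRLS iteration:
import numpy as np

def irls(X, y, n_iters=20):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1 / (1 + np.exp(-X @ w))   # current predictions
        R = p * (1 - p)                # diagonal of the weight matrix
        grad = X.T @ (p - y)
        H = X.T @ (R[:, None] * X)     # Hessian X^T R X
        w -= np.linalg.solve(H, grad)  # Newton step
    return w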
Example 4.5.
Logistic regression example using MATLAB
4.4.3 Multi-class logistic regression
The multi-class case, $K > 2$, is handled by having a separator $w_k$ for each class $c_k$. In this case, instead of a Bernoulli distribution for the targets, we use a categorical distribution, with
$$\hat{y}_{ik} = \frac{e^{w_k^Tx_i}}{\sum_{j=1}^K e^{w_j^Tx_i}}.$$
Each target is represented with its one-hot encoding.
In this case, the likelihood function is
$$L(w_1,\dots,w_K) = \prod_{i=1}^n\prod_{k=1}^K \hat{y}_{ik}^{y_{ik}},$$
and the cross-entropy loss is
$$E = -\sum_{i=1}^n\sum_{k=1}^K y_{ik}\log\hat{y}_{ik}.$$
As in the binary case, we optimize this by using gradient descent or Newton's method.
4.4.4 Regularization
Finally, note that we can add regularization to the weights, in the same way we did for linear regression, by adding the $L_1$ or $L_2$ norm of the weights to the cross-entropy loss:
$$E_{L_2}(w) = E(w) + \lambda\|w\|_2^2 \qquad \text{and} \qquad E_{L_1}(w) = E(w) + \lambda\|w\|_1.$$
Note that this is for the binary case, but the general case is similar.
5 Nearest Neighbor Prediction
The nearest neighbor predictor uses the local neighborhood of an input point to compute a prediction. A well known family of this kind is the $k$-nearest neighbors ($k$NN) models, which predict the output of the input point by combining the known outputs of the $k$ nearest training data points, for example by voting if we are classifying the data.
The approach of these methods is quite straightforward, and is detailed in Algorithm 7.
Training
Store all training examples
Prediction
Given an input X:
1. Compute distance/similarity with all examples in the training set.
2. Locate <@$k$@> closest points.
3. Emit prediction by combining outputs of the <@$k$@> closest points.
Algorithm 7: Pseudocode.
Of course, we have to decide (a minimal sketch follows this list):
- The distance/similarity function.
- How many neighbors to choose, $k$.
- How to combine the outputs of the nearest neighbors to emit the final prediction.
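A minimal sketch of the whole predictor for classification, using the Euclidean distance and majority vote:
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distance to every training example
    dists = np.sqrt(((X_train - x) ** 2).sum(1))
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    # majority vote (classification); use votes.mean() for regression
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]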
5.1 Locality: similarities and distances
It is important to decide what distance or similarity between points means in our feature space, since this will determine the neighborhoods of the points.
Definition 5.1.
A distance function is a function $d : X \times X \rightarrow \mathbb{R}$ verifying:
- Non-negativity: $d(x, y) \geq 0$.
- Zero distance iff equal: $d(x, y) = 0 \iff x = y$.
- Symmetry: $d(x, y) = d(y, x)$.
- Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$.
A similarity function is a function $s : X \times X \rightarrow \mathbb{R}$ verifying:
- Ranged: $s(x, y) \in [0, 1]$ or $[-1, 1]$.
- Completely similar iff equal: $s(x, y) = 1 \iff x = y$.
- Symmetry: $s(x, y) = s(y, x)$.
Example 5.1.
Some examples of distances are:
- The Minkowski distance family:
$$d_p(x, y) = \left(\sum_{j=1}^d|x_j - y_j|^p\right)^{1/p},$$
which has as special cases the Euclidean distance ($p = 2$) and the Manhattan distance ($p = 1$).
- The Mahalanobis distance is an interesting distance between points that takes into account the covariance matrix of the features, $\Sigma$. Its formula is
$$d(x, y) = \sqrt{(x - y)^T\Sigma^{-1}(x - y)},$$
which might already be familiar to you, since we have used it for some probabilistic methods.
Some examples of similarities are:
- The cosine similarity function:
$$s(x, y) = \frac{x^Ty}{\|x\|\|y\|},$$
which leverages the dot product. Therefore, this captures 'how parallel' $x$ and $y$ are.
- The Pearson correlation measure is similar to cosine similarity, but centers the data:
$$s(x, y) = \frac{(x - \bar{x})^T(y - \bar{y})}{\|x - \bar{x}\|\|y - \bar{y}\|},$$
where $\bar{x}$ and $\bar{y}$ are the means of the points.
- The Hamming distance is computed between two sequences of bits, and it is just the proportion of common bits.
- The Jaccard coefficient is also computed between two sequences of bits, and it is the number of positions where both bits are 1, divided by the number of positions where at least one bit is 1.
For example, if we have $x = (1, 0, 1, 1)$ and $y = (1, 1, 0, 1)$, then the vectors agree in 2 of the 4 positions, so the Hamming similarity is $\frac{2}{4}$; and they share a 1 in 2 positions, out of 4 positions where at least one has a 1, so the Jaccard coefficient is also $\frac{2}{4}$.
5.2 Choosing $k$
Nearest neighbor methods are very sensitive to the chosen value of $k$. If it is too low, it is easy to overfit; if it is too large, we can underfit. Therefore, we need to choose it thoughtfully. This will depend on the dataset, and the typical approach is to use cross-validation, or other resampling methods, seeing $k$ as a hyperparameter that trades off bias and variance of the resulting model.
5.3 How to combine outputs to make predictions
5.3.1 Classification
- Majority vote, breaking ties randomly.
- Distance-weighted vote: a more advanced approach that weights the votes higher for closer points and lower for further points.
5.3.2 For regression
- Use the average of the outputs of the nearest neighbors.
- Use the weighted average of the outputs of the nearest neighbors. Again, the weights should be inversely proportional to distance or proportional to similarity.
When we weight the predictions, we can relax the constraint of using only $k$ examples, and we can even use the whole training set for making predictions, since this approach lowers the chances of underfitting.
5.4 Decision boundaries for nearest neighbors classifier
In 1-nearest neighbor, the decision regions correspond to the union of each example's Voronoi cell, with the appropriate class. Given a set of points $S$, the Voronoi cell of a point $x \in S$ corresponds to the set of points whose nearest neighbor in $S$ is $x$. For example, the following shows two Voronoi diagrams of 6 different points:
The decision boundaries and regions are non-linear, but they get smoother as we increase the value of . This effect can be observed in the following Figure:
5.5 Final considerations
- Making predictions can be slow, specially if the training dataset is large. To improve the prediction speed, there are several approaches:
- Use data structures like kd-trees to speed up neighbor retrieval, although this is only good when we have a moderate amount of features. A kd-tree, or k-dimensional tree, is a binary search tree data structure used to organize points in k-dimensional space. In the context of kNN models, kd-trees can help speed up neighbor retrieval by partitioning the space and reducing the search area for nearest neighbors. However, the efficiency gain of kd-trees is reduced when there are many features (i.e., high-dimensional data), as the search time approaches that of a linear search.
- Use prototypes. A protoype is a representative point or a summary of a group of points that belong to the same class. Using prototypes can reduce the size of the training dataset while still maintaining its overall structure. The idea is to replace multiple points with a single prototype, thus reducing the number of distance calculations during prediction. Examples of prototype selection methods include the nearest neighbor condensation algorithm and the edited nearest neighbor rule.
- Use approximate neighbors, for example using locality sensitive hashing. The main idea behind LSH is to hash the input items in such a way that similar items are more likely to be hashed to the same bucket. By doing this, it reduces the search space for nearest neighbors and allows for faster retrieval. LSH trades off some accuracy for improved speed, making it suitable for large-scale applications where an approximate answer is acceptable.
- kNN is prone to overfitting. To avoid this, usual approaches include:
- Remove noisy examples from the training data, for example, data points whose nearest neighbors are of a different class.
- Again, use prototypes.
- Set the appropriate value of .
- It suffers from the curse of dimensionality: as the dimension increases, all points tend to be at similar distances from each other, so the notion of 'nearest' loses contrast.
- Suffers from the presence of irrelevant features, so feature selection is an important pre-processing step.
- Standardization of features is crucial to avoid domination of features with larger values.
6 Trees and Random Forests
An ensemble method is a method that combines two or more predictors, instead of using a single prediction. These are useful when we have several models that are better than just a random guess, and that are independent of each other. The combination of the models can be done by averaging predictions, in regression, or by majority vote, in classification.
The main types of ensemble methods are:
- Stacking: involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all sub-models are trained based on the complete training set, then the meta-model is fitted based on the outputs — meta-features — of the sub-models in the ensemble. Therefore, stacking allows you to use multiple heterogeneous, possibly weak learning models and "stack" them together in a manner that allows you to use information from their predictions to make a final prediction which often has better performance.
- Bagging: is a way to decrease the variance of the prediction by generating additional training sets from the original dataset, sampling with replacement to produce multi-sets of the original data. Bagging helps to avoid overfitting by averaging or voting the predictions from the multiple models. Random Forest is a classic example of a bagging algorithm.
- Boosting: is a sequential technique in which the first algorithm is trained on the entire data set, and the following algorithms are built by fitting the residuals of the first algorithm, thereby giving higher weight to those observations that were poorly predicted by the previous model. It relies on creating a series of weak learners each of which might not be good at the entire problem, but might be good at recognizing some part of it, and then combining their predictions to get the final prediction. The idea is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far. Gradient Boosting, AdaBoost and XGBoost are examples of boosting algorithms.
6.1 Trees
A regression tree partitions the feature space into axis-parallel regions and predicts using the average of training points that fall into that region. We can think of a tree in terms of nodes and edges: each node in the tree specifies a test of some attribute, and each edge descending from that node corresponds to one of the possible outcomes for the test. The leaf nodes (or terminal nodes) of the tree contain an output value which is used to make a prediction. When a new data point is presented to the tree for prediction, it is routed down the tree based on the outcome of the tests in each node, starting from the root and ending at a leaf node. The value in the leaf node is then returned as the prediction.
A classification tree also partitions the feature space into axis-parallel regions, each of them representing a class. It is NP-hard to find the optimal tree of minimum size, so greedy approaches are usually used. A general approach is explained in Algorithm 8.
1. Feature selection: find the feature that best splits the data into two subsets, minimizing an impurity function.
2. Binary splitting: decide how to split on the feature. If it is categorical, split on each possible value. If it is continuous, find a value that divides it in two, minimizing the impurity.
3. Recursion: repeat steps 1 and 2 until the stopping criterion is met.
4. Pruning: prune the tree to avoid overfitting.
Algorithm 8: Training a tree.
We are going to use the Gini impurity metric to create trees.
Let $S$ be a subset of examples of the input data $D = \{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, K\}$. Let $p_k$ be the proportion of examples in $S$ that belong to class $k$. Then, the Gini impurity metric is computed as
$$G(S) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2.$$
Therefore, to find the node with which to expand the tree, we need to find the pair $(j, s)$, where $j$ is a feature and $s$ a split value for it, such that the weighted impurity of the induced partition $S_L = \{x \in S : x_j \le s\}$, $S_R = \{x \in S : x_j > s\}$ is minimized:
$$(j^*, s^*) = \arg\min_{j, s} \; \frac{|S_L|}{|S|} G(S_L) + \frac{|S_R|}{|S|} G(S_R).$$
In the case of a regression tree, the approach is similar, but instead we minimize the sum of squared errors relative to the average target value in each induced partition. In other words, we now seek the pair $(j^*, s^*)$ such that
$$(j^*, s^*) = \arg\min_{j, s} \; \sum_{x_i \in S_L} (y_i - \bar{y}_L)^2 + \sum_{x_i \in S_R} (y_i - \bar{y}_R)^2,$$
where $\bar{y}_L$ and $\bar{y}_R$ are the average target values in $S_L$ and $S_R$.
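To make the greedy split search concrete, here is a minimal Python sketch (my own illustration, not the course's MATLAB example); the helper names gini and best_split are invented for this sketch, and thresholds are taken as midpoints between consecutive unique values, a common but not mandatory choice.

import numpy as np

def gini(y):
    """Gini impurity 1 - sum_k p_k^2 of a label vector y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search for the (feature, threshold) pair minimizing
    the weighted Gini impurity of the induced binary partition."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        # candidate thresholds: midpoints between sorted unique values
        vals = np.unique(X[:, j])
        for s in (vals[:-1] + vals[1:]) / 2:
            left = X[:, j] <= s
            w_impurity = (left.sum() * gini(y[left])
                          + (~left).sum() * gini(y[~left])) / n
            if w_impurity < best[2]:
                best = (j, s, w_impurity)
    return best

# tiny usage example: a perfectly separable 1-feature dataset
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # splits at 2.5 with weighted impurity 0.0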
Example 6.1.
Classification trees in MATLAB
6.2 Random Forests
Let's look more closely at the bagging technique, which, as we said, can be used to reduce the variance of our estimates by averaging the estimates obtained from independent models. Assume we have models $f_1, \dots, f_B$ and an input $x$. Each model produces an estimate $\hat{y}_b = f_b(x)$ which tries to predict the real value $y$. Let's suppose these estimates are unbiased, i.e., $\mathbb{E}[\hat{y}_b] = y$. Then, the average of the estimates is
$$\bar{y} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b, \qquad \mathbb{E}[\bar{y}] = \frac{1}{B} \sum_{b=1}^{B} \mathbb{E}[\hat{y}_b] = y.$$
Therefore, this average is also an unbiased estimate of the value $y$. Let's see what happens with the variance when we perform this average. If we assume that the models are independent and that they all have the same variance $\sigma^2$, then it follows that
$$\operatorname{Var}(\bar{y}) = \frac{1}{B^2} \sum_{b=1}^{B} \operatorname{Var}(\hat{y}_b) = \frac{\sigma^2}{B}.$$
Thus, the variance is reduced by a factor of $B$, making the predictions more consistent.
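A quick numerical sanity check of this $\sigma^2 / B$ reduction, as a NumPy sketch of mine (the Gaussian noise model is only an assumption for the demo):

import numpy as np

rng = np.random.default_rng(0)
B, trials, sigma = 25, 100_000, 2.0

# B independent unbiased estimators of y = 0, each with variance sigma^2
estimates = rng.normal(loc=0.0, scale=sigma, size=(trials, B))
averaged = estimates.mean(axis=1)

print(np.var(estimates[:, 0]))  # ~ sigma^2 = 4.0
print(np.var(averaged))         # ~ sigma^2 / B = 0.16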
Trees typically suffer from high variance (overfitting), so they seem like a good candidate for this technique. The application of bagging to trees gives rise to random forests.
First, a bootstrapped dataset is created by randomly selecting samples from the original dataset with repetition. The samples that are not chosen for the bootstrapped dataset form a separate dataset, the out-of-bag (OOB) dataset.
Then, we want to generate a diverse collection of trees that are better than a random choice and as independent from each other as possible. If we construct all these trees using the usual algorithm, they will be very similar, so we have to inject stochasticity into the process. This can be done in several ways:
- The bootstrap sampling introduces stochasticity, but it is usually not enough to make the trees independent.
- We can manipulate the features or targets by leaving some of them out. This is done independently for each tree, so that different trees focus on different features, providing the independence that we seek.
- We can change the learning parameters, the impurity function, etc.
- We can combine the above methods.
The pseudocode of the random forest training algorithm is shown in Algorithm 9.
for b = 1 to n_estimators do
    Sample $D_b$ as a bootstrap of max_samples examples from $D$
    Create tree $T_b$ on $D_b$, adding nodes by:
        - Select max_features variables at random from all variables
        - Pick the best variable/split point according to Gini/SSE
        - Split the current node according to the split found
end
Output forest $\{T_1, \dots, T_B\}$
Algorithm 9: random_forest(D, n_estimators, max_samples, max_features)
To make predictions on a test example $x$:
- For classification: output class probabilities, or the majority vote among $T_1(x), \dots, T_B(x)$.
- For regression: output the average prediction $\frac{1}{B} \sum_{b=1}^{B} T_b(x)$.
Then, to estimate the generalization error, we can use the OOB dataset to compute the OOB error. This estimate can be used as a validation error to select appropriate values for the hyperparameters, so we do not need to perform cross-validation.
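As an illustrative sketch (using scikit-learn rather than the MATLAB used elsewhere in these notes), the hyperparameters of Algorithm 9 map directly onto RandomForestClassifier, and setting oob_score=True yields the OOB estimate just discussed; the dataset choice is arbitrary:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# n_estimators, max_samples and max_features as in Algorithm 9;
# oob_score=True computes the out-of-bag generalization estimate
forest = RandomForestClassifier(
    n_estimators=200,
    max_samples=0.8,      # fraction of D used in each bootstrap
    max_features="sqrt",  # variables drawn at random at each split
    oob_score=True,
    random_state=0,
).fit(X, y)

print(forest.oob_score_)  # OOB accuracy, no cross-validation needed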
6.2.1 Interpretability of random forests
If n_estimators is too big (there are many trees in the forest), it can be difficult to comprehend the decision process of the model: it is not interpretable. We can draw a variable importance plot to better interpret the results:
- Gini-based variable importance: we can add up the Gini impurity gains of each variable over all the splits of every tree in the forest, and sort the variables by their total. This approach is biased towards categorical variables with many possible splits.
- Permutation-based variable importance: for each variable, we can permute its values and compute the difference in the OOB error metrics. If a variable is important, accuracy on the permuted copy should decrease. We then sort the variables by this difference. This approach is more reliable, but slower (see the sketch below).
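A sketch of the permutation approach using scikit-learn's permutation_importance; note that, unlike the description above, this utility scores on a held-out set rather than on the OOB samples (the Gini-based importances are available as feature_importances_):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# shuffle each feature in turn and measure the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:5])  # indices of the five most important features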
6.2.2 Proximities
The idea of proximity of samples in a tree is that when two examples fall into the same leaf of a decision tree, this is evidence that the two examples are similar in some sense.
In the wider context of the forest, we can build a similarity matrix, where each entry corresponds to the fraction of trees of the forest in which the two examples end up in the same leaf. This matrix can be used in distance/similarity-based methods, or with PCA to obtain a numeric representation of the data.
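A possible NumPy sketch of this proximity matrix, assuming a scikit-learn forest, whose apply method returns the leaf index of each example in each tree:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, b] = index of the leaf of tree b into which example i falls
leaves = forest.apply(X)

# proximity[i, j] = fraction of trees in which i and j share a leaf
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
print(proximity.shape)   # (150, 150)
print(proximity[0, :5])  # similarities of example 0 to the first five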
6.2.3 Imbalanced data in classification
In random forests, there are several techniques to deal with imbalanced data:
- Use an evaluation metric that is robust to class imbalance (rather than plain accuracy) for model selection.
- Under-sampling majority class/over-sampling minority class when building the bootstrapped dataset.
- Use of weights when computing errors.
7 Multi-Layer Perceptron (Neural Networks)
In Section 4.4 we saw the Perceptron, a mathematical model of a neuron that can be used for binary classification. We noted that this model is equivalent to a linear classifier, and is thus very limited in more general scenarios. In this section, we will see how to extend this model to perform more complex tasks.
7.1 Multiclass classification
We saw how to use a single perceptron to perform binary classification, and there is a very natural way to extend this model to multi-class classification, using a one-hot encoding for the class.
For instance, let's work with a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, K\}$. We now encode each $y_i$ as a one-hot vector, i.e., $y_i \in \{0, 1\}^K$ with a 1 in position $k$ and 0 elsewhere, for class $k$. The idea is then to use $K$ perceptrons, each of them focusing on predicting one of the classes and trained independently. The idea is depicted in Figure 15. Notice that in this case it is almost compulsory to use the logistic function (or a similar function, but definitely not the step function), since this way we can interpret the outputs as probabilities, and therefore select the class with highest probability as our prediction. Also, remember that in $x$ we are adding an artificial 1 to account for the bias.
Let's formalize this idea a little bit. We have $K$ independent perceptrons, each of them predicting one output variable, so that if we call $g$ our activation function, we obtain, for each $k$,
$$\hat{y}_k = g(w_k^T x),$$
where $w_k$ is the weight vector of perceptron $k$. We can integrate all this into a single formula by introducing the notation $g(v)$, meaning that $g$ is applied to each component of the input vector $v$. This way, we can unify the previous equations as
$$\hat{y} = g(W^T x),$$
where we have arranged the weight vectors as the columns of a matrix $W = (w_1, \dots, w_K)$. Now we can define what a layer is: a layer is just a set of parallel neurons, and a layer with $K$ neurons outputs $K$ variables. Usually, all neurons in a layer use the same activation function $g$. Note that if this is not the case, then the previous explanation needs to be slightly adapted.
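Rendered in NumPy, the layer equation $\hat{y} = g(W^T x)$ is a one-liner (a sketch of mine; the logistic choice for $g$ and the shapes are assumptions of the sketch):

import numpy as np

def layer(W, x, g=lambda a: 1 / (1 + np.exp(-a))):
    """One layer of K parallel perceptrons: y_hat = g(W^T x).
    W has shape (d+1, K); x includes the artificial leading 1 for the bias."""
    return g(W.T @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # d = 3 inputs (+ bias), K = 3 classes
x = np.array([1.0, 0.2, -0.5, 0.9])   # leading 1 accounts for the bias
print(layer(W, x))                    # three per-class "probabilities"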
Since we have discussed that a good alternative in multiclass classification is to use the sigmoid function as activation function, we can rewrite our model as
$$\hat{y} = \sigma(W^T x).$$
Note, nonetheless, that here we are obtaining a collection of independent probabilities, one for each class, which could be further exploited to increase the information in our prediction. For instance, instead of taking the maximum of these independent probabilities, we can add a softmax function to the outputs, resulting in a normalized vector that can be considered a probability vector. The softmax function is
$$\operatorname{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.$$
Also, when doing this, it is not necessary to apply the activation function $g$ beforehand. This approach is shown in Figure 16.
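A minimal sketch of the softmax (the max-subtraction is a standard numerical-stability detail I add; it does not change the result):

import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) does not change
    the result but avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # raw layer outputs, no sigmoid needed
p = softmax(z)
print(p, p.sum())  # a proper probability vector summing to 1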
Note that we are still in the linear classification scenario, and the next step is to improve the flexibility of the model to be able to cope with more complex relationships.
7.2 Multi-Layer Perceptron
Notice now an interesting fact: if our activation function is $g : \mathbb{R} \to I$, where $I$ is the range of $g$, then a layer with $K$ perceptrons can be interpreted as a function
$$f : \mathbb{R}^{d+1} \to I^K.$$
For the sake of simplicity, let's say $f : \mathbb{R}^d \to \mathbb{R}^K$. Then, it is natural to think of stacking different layers, so that the output of a layer becomes the input of the following one, effectively performing the composition of the functions defined by these layers. For example, if we have two layers, which we interpret as two functions $f_1 : \mathbb{R}^d \to \mathbb{R}^m$ and $f_2 : \mathbb{R}^m \to \mathbb{R}^K$, we can take the output of the first layer and use it as input for the second one, effectively obtaining
$$f(x) = f_2(f_1(x)).$$
This is interesting mainly because a single layer, as we have seen, is only able to perform linear classification, so let's look more closely at what is happening in the second layer. The second layer uses an activation function $g_2$, which is applied in each perceptron of the second layer as
$$z^{(2)}_k = g_2\!\left(w_k^{(2)T} z^{(1)}\right).$$
This means that we are:
- Creating linear combinations of the input variables, $a^{(1)} = W^{(1)T} x$, where $x \in \mathbb{R}^{d+1}$ (the extra component is to account for the bias).
- Applying a transformation to each combination of variables. Note that this transformation can be arbitrary, but it should be non-linear if we want to leave the linear world, and differentiable if we want a smooth training process.
- For the second layer, the inputs are the transformed combinations of the input variables, which effectively become a basis of functions (just as polynomials are), and these are combined again by the weights $W^{(2)}$.
- Finally, we apply $g_2$, which is again an activation function.
The key steps for going beyond linearity are the second and third. Let's compare this to ordinary linear regression. In that case, we introduced basis functions to be able to leave the linear world, by transforming the input variables. But then we still used linear regression over this transformation. Therefore, it was the modification of the input that allowed us to model non-linear relationships, and not the algorithm itself, which was not changed at all. Here, the idea is similar. In fact, we could perform those same transformations in this new setting, and keep applying just one layer of perceptrons to obtain non-linear classifiers. But this is exactly the power of the multi-layer perceptron: we don't need to assume or guess the shape of the basis functions! The weights of the different layers will adapt to our problem, effectively auto-selecting what these functions should be. Of course, this is not free, and the cost is paid in the form of harder training.
After this explanation, we can integrate the previous formulas as
$$\hat{y} = g_2\!\left(W^{(2)T} g_1\!\left(W^{(1)T} x\right)\right).$$
We can now define the multi-layer perceptron as a composition of layers of perceptrons, with the addition of what is called the input layer, which is not really a layer of perceptrons, but rather represents and fixes the size of the input, and the output layer, which is the final layer, outputting the predictions of the model. The rest of the layers, the real perceptron layers, are called hidden layers. This is depicted in Figure 17, where the first hidden layer has $m_1$ perceptrons, the second one has $m_2$ perceptrons, and the output layer has as many perceptrons as possible labels, $K$.
More generally, a neural network is a directed graph whose nodes are perceptrons. If there are no cycles within the network, it is a feed-forward neural network (FFNN); it is called a recurrent neural network (RNN) when cycles exist within the network.
MLPs are thus a special case of feed-forward neural networks, in which:
- Neurons are arranged in layers.
- There is at least one hidden layer.
- Every layer is fully connected to the next one.
- No connections are allowed within layers.
A brief visualization of different kinds of networks is depicted in Figure 18.
7.3 Error functions
7.3.1 Regression
The error in the regression case can be the empirical mean squared error, i.e.,
$$E(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}(x_i; W)\right)^2,$$
where the dataset is $D = \{(x_i, y_i)\}_{i=1}^n$ and $\hat{y}(x_i; W)$ is the prediction for $x_i$ with the weight matrices $W$. Note that in this formula we are assuming a single output, but it can be easily generalized to $K$ outputs as
$$E(W) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \left(y_{ik} - \hat{y}_k(x_i; W)\right)^2.$$
As usual, our objective will be to minimize this error.
7.3.2 Binary classification
If the target values are $y_i \in \{0, 1\}$, then we can express
$$P(y \mid x; W) = \hat{y}(x; W)^{y} \left(1 - \hat{y}(x; W)\right)^{1 - y}.$$
If we assume independent and identically distributed examples, then we can define our likelihood function as
$$L(W) = \prod_{i=1}^{n} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}, \qquad \hat{y}_i = \hat{y}(x_i; W),$$
and the negative log-likelihood defines the cross-entropy error:
$$E(W) = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right].$$
7.3.3 Multi-class classification
The last error formula can be generalized for multi-class classification with $K$ classes, as the generalized cross-entropy error:
$$E(W) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}.$$
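Both cross-entropy errors are direct to code; in this sketch of mine the eps clipping is an implementation detail to avoid log(0), not part of the formulas:

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Negative log-likelihood of Bernoulli targets y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Generalized cross-entropy for one-hot targets Y of shape (n, K)."""
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, None)))

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, y_hat))  # ~0.685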
7.4 Training the MLP: Backpropagation
Notice that the error functions we have defined can be very complex, and there is no closed-form solution to minimize them. Therefore, some iterative approach is necessary, as we did with the perceptron. But we now encounter a new problem: if we do gradient descent on the neural network, we need to compute lots of gradients, and computing a gradient is not cheap. The idea for making this process more efficient is to use the chain rule and to notice that many of the gradients we compute are actually reused in many places along the network when computing the final gradient. This observation leads to an efficient procedure for computing gradients: the backpropagation algorithm.
7.4.1 The chain rule
The chain rule states that, given two composable differentiable functions $f$ and $g$, it holds that
$$(f \circ g)'(x) = f'(g(x)) \, g'(x).$$
In the multivariate case, when we want to compute the derivative of $f(g_1(x), \dots, g_m(x))$, it is
$$\frac{\partial f}{\partial x} = \sum_{j=1}^{m} \frac{\partial f}{\partial g_j} \frac{\partial g_j}{\partial x}.$$
This is sometimes written by naming $u_j = g_j(x)$ and $z = f(u_1, \dots, u_m)$, so that the formula becomes
$$\frac{\partial z}{\partial x} = \sum_{j=1}^{m} \frac{\partial z}{\partial u_j} \frac{\partial u_j}{\partial x}.$$
Example 7.1.
Let's do an example to understand how this can be helpful in the case of neural networks.
Imagine we have the functions: and we want to compute the gradient of in terms of . We can write this as a diagram:
We simply need to compute the gradient using the chain rule. Basically, the idea is to reuse as many values as possible (notice the coloured expressions). To do this, we can compute local derivatives in each layer and propagate them back to the previous layer, multiplying them along the way. The previous computation can be carried out on the diagram as follows:
Note that:
- We can avoid unnecessary computations by storing at each node:
- Forward values: when we evaluate the function at a node, we can keep this value stored in the node, because we will need it later. In this example we did symbolic differentiation, so this was not necessary, but in neural networks we first need to do the forward pass, and then assess the error.
- Local derivatives: when we evaluate a local derivative, we can keep its value stored in the node, because we will use it several times.
- Backwards gradients: local derivatives are sent to those nodes that will need them, so that each of these derivatives is computed only once. The gradient is then constructed little by little, backwards, in a dynamic-programming manner.
- When the gradients are flowing backwards, each node has to sum all incoming gradients, according to the multivariate chain rule (if there is only one incoming gradient, we just take that one).
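To make the bookkeeping concrete, here is a tiny reverse-mode sketch of mine (not the notes' worked diagram) for the function $z = (xy + \sin x)^2$: forward values are stored, local derivatives are propagated backwards, and the two paths through $x$ are summed, as the multivariate chain rule dictates:

import math

x, y = 1.5, -2.0

# forward pass: store every intermediate value
a = x * y            # a = x*y
b = math.sin(x)      # b = sin(x)
c = a + b            # c = a + b
z = c ** 2           # z = c^2

# backward pass: local derivatives, multiplied and summed backwards
dz_dc = 2 * c
dz_da = dz_dc * 1.0                       # dc/da = 1
dz_db = dz_dc * 1.0                       # dc/db = 1
dz_dx = dz_da * y + dz_db * math.cos(x)   # x feeds two paths: sum them
dz_dy = dz_da * x

print(dz_dx, dz_dy)
# check against the analytic gradient: dz/dx = 2(xy + sin x)(y + cos x)
print(2 * c * (y + math.cos(x)), 2 * c * x)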
7.4.2 The backpropagation algorithm
The pseudocode for backpropagation is shown in Algorithm 10. Let's break it down.
- a[layer] $= a^{(l)}$ is the input (pre-activation) vector at layer $l$.
- z[layer] $= z^{(l)} = g(a^{(l)})$ is the output vector of layer $l$, where $g$ is the activation function of layer $l$ (note that in the pseudocode we assume the same $g$ in all layers).
- W[layer] $= W^{(l)}$ is the weight matrix of layer $l$.
- w[layer] $= w^{(l)}$ is the weight matrix of layer $l$, without the weights of the bias.
- d[layer] $= \delta^{(l)}$ is the local gradient at layer $l$, computed as $\delta^{(c+1)} = g'(a^{(c+1)}) \odot (z^{(c+1)} - y)$ for the output layer (d[c+1] in the pseudocode), and as $\delta^{(l)} = g'(a^{(l)}) \odot (w^{(l+1)} \delta^{(l+1)})$ for the rest of the layers, i.e., the hidden layers. Here, $\odot$ is the Hadamard product, i.e., the component-wise product; in the pseudocode it is written '.*'.
1. Forward pass
- z[0] = x  # network input, with the artificial bias component
- for layer in {1...c+1}
    a[layer] = W[layer]^T * z[layer-1]
    z[layer] = g[a[layer]]  # apply g[.] to each component of a[layer]
2. Backward pass
- d[c+1] = dg[a[c+1]] .* (z[c+1]-y)  # .* is the component-wise product and dg is the derivative of g
- grad[c+1] = z[c]*d[c+1]^T
- for layer in {c...1}
    d[layer] = dg[a[layer]] .* (w[layer+1]*d[layer+1])  # w[k] is W[k] without the bias weight
    grad[layer] = z[layer-1]*d[layer]^T
- return grad
Algorithm 10: Backpropagation.
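A compact NumPy sketch of Algorithm 10 for a fully connected MLP with logistic activations and squared error, matching the output-layer delta above (my own rendering; the weight layout, with the bias weights in the first row of each matrix, is an assumption of this sketch):

import numpy as np

def g(a):                 # logistic activation
    return 1 / (1 + np.exp(-a))

def dg(a):                # its derivative, g(a)(1 - g(a))
    s = g(a)
    return s * (1 - s)

def backprop(Ws, x, y):
    """One forward/backward pass of Algorithm 10. Ws[l] has shape
    (1 + n_in, n_out); the first row holds the bias weights. Returns
    the gradients dE/dW[l] for E = 1/2 ||z_out - y||^2."""
    # forward pass, storing pre-activations a and outputs z
    zs, As = [np.concatenate(([1.0], x))], []
    for W in Ws:
        a = W.T @ zs[-1]
        As.append(a)
        zs.append(np.concatenate(([1.0], g(a))))  # prepend bias unit
    out = zs[-1][1:]

    # backward pass: local gradients d, then grad[l] = z[l-1] d[l]^T
    grads = [None] * len(Ws)
    d = dg(As[-1]) * (out - y)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(zs[l], d)
        if l > 0:
            d = dg(As[l - 1]) * (Ws[l][1:] @ d)  # drop the bias row of W
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(5, 2))]  # 2 inputs -> 4 hidden -> 2 outputs
grads = backprop(Ws, x=np.array([0.5, -1.0]), y=np.array([1.0, 0.0]))
print([gr.shape for gr in grads])  # same shapes as the weight matrices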
Example 7.2.
Let's perform an example. Here is our neural network:
It is a fully connected two-layer network. Let's follow the algorithm, defining each of the variables one at a time.
- We define and :
-
- Define and :
- Define and :
- Define and :
And that's it! Now we can use the gradient at each layer to train the network, updating the weights in layer $l$ according to
$$W^{(l)} \leftarrow W^{(l)} - \eta \, \nabla_{W^{(l)}} E,$$
where $\eta$ is the learning rate.
7.5 Some activation functions
7.5.1 Logistic
The logistic function is $\sigma : \mathbb{R} \to (0, 1)$, defined by
$$\sigma(a) = \frac{1}{1 + e^{-a}},$$
with derivative
$$\sigma'(a) = \sigma(a)(1 - \sigma(a)).$$
In this case, for the backpropagation algorithm, we can use
$$g'(a^{(l)}) = z^{(l)} \odot (1 - z^{(l)}),$$
which is a very efficient formula, since all these values are already computed in the forward pass.
7.5.2 Hyperbolic tangent
The hyperbolic tangent function is $\tanh : \mathbb{R} \to (-1, 1)$, defined by
$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}},$$
with derivative
$$\tanh'(a) = 1 - \tanh^2(a).$$
Thus, we end up with another very efficient formula:
$$g'(a^{(l)}) = 1 - z^{(l)} \odot z^{(l)}.$$
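Both derivative shortcuts, reusing the forward-pass outputs, in a short sketch of mine:

import numpy as np

def logistic(a):
    return 1 / (1 + np.exp(-a))

def dlogistic_from_output(z):
    """sigma'(a) reusing the forward-pass output z = sigma(a)."""
    return z * (1 - z)

def dtanh_from_output(z):
    """tanh'(a) reusing the forward-pass output z = tanh(a)."""
    return 1 - z ** 2

a = np.linspace(-2, 2, 5)
z = logistic(a)
print(np.allclose(dlogistic_from_output(z), logistic(a) * (1 - logistic(a))))  # True
z = np.tanh(a)
print(np.allclose(dtanh_from_output(z), 1 / np.cosh(a) ** 2))  # True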
Example 7.3.
Implementing a NN for digit classification in MATLAB
A Notes on probability theory, Bayes theorem and Bayesian learning
These notes are adapted from [3].
A.1 Probability theory basics
Let $\Omega$ be a sample space, i.e., the set of possible outcomes of an experiment, and let $A \subseteq \Omega$ be an event. A probability measure is a function $P$ that assigns a real number $P(A)$ to every event $A$. This number represents how likely it is that the experiment's outcome is in $A$.
For $(\Omega, P)$ to be a probability space, we have to impose three axioms:
- $P(A) \ge 0$ for every event $A$.
- $P(\Omega) = 1$.
- $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
From these axioms, some consequences can be derived:
- $P(A^c) = 1 - P(A)$.
Proof.
We can write $\Omega = A \cup A^c$ with $A \cap A^c = \emptyset$, so we have $1 = P(\Omega) = P(A) + P(A^c)$, thus $P(A^c) = 1 - P(A)$.
- $P(\emptyset) = 0$.
Proof.
$P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 0$.
- If $A \subseteq B$, then $P(A) \le P(B)$.
Proof.
In this case, we can write $B = A \cup (B \setminus A)$, with $A \cap (B \setminus A) = \emptyset$, so $P(B) = P(A) + P(B \setminus A) \ge P(A)$, and we have the inequality.
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Proof.
We have $A \cup B = A \cup (B \setminus A)$ and $B = (A \cap B) \cup (B \setminus A)$, so $P(A \cup B) = P(A) + P(B \setminus A) = P(A) + P(B) - P(A \cap B)$.
- $P(A) \le 1$.
Proof.
This is obvious from the previous results, since $A \subseteq \Omega$ and $P(\Omega) = 1$.
A.1.1 Joint probability
It is usual to be interested in the probability of two events happening simultaneously. This is called the joint probability of events $A$ and $B$:
$$P(A, B) = P(A \cap B).$$
The joint probability is useful when an event can be decomposed into simpler disjoint events, i.e., we have $B = B_1 \cup \dots \cup B_m$, with $B_i \cap B_j = \emptyset$ for $i \ne j$. In this situation, we can use the sum rule:
$$P(A) = \sum_{j=1}^{m} P(A, B_j).$$
This is also known as marginalization: when we know the joint probability $P(A, B)$ and we want to compute $P(A)$, we marginalize out $B$. This basically means that if we know the probabilities of all possible pairs $(A, B_j)$, we can know the probability of $A$ by exploring all the possibilities. Here, 'exploring' means using the sum rule:
$$P(A) = \sum_{b} P(A, B = b), \qquad \text{or} \qquad P(A) = \int P(A, B = b) \, db$$
if $B$ is continuous.
Example A.1.
We have two events:
- $S$: a person earns more than 100k, or earns less than 100k.
- $J$: the person is a professor, a software engineer, or a data scientist.
From some sources of information, we are able to determine the joint probabilities $P(S, J)$ for every pair of values; marginalizing out $J$ with the sum rule, we can then conclude the probability of each salary range on its own.
A.1.2 Conditional probability
The conditional probability of $A$ given $B$ is the probability that $A$ occurs, knowing that $B$ has occurred. This means that we have to restrict the space of possible outcomes from $\Omega$ to $B$:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}.$$
If we rearrange the terms, we obtain the product rule:
$$P(A, B) = P(A \mid B) P(B).$$
This formula can be generalized to an arbitrary number of events, giving the general product rule:
$$P(A_1, \dots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1}).$$
Exercise A.1.
Prove that $P(A, B, C) = P(A \mid B, C) P(B \mid C) P(C)$. Then, prove the general product rule.
Basically, we are asked to prove the general product rule by induction.
For $n = 2$, we have already proven it (it is the product rule).
For $n = 3$, we have
$$P(A, B, C) = P(A \mid B, C) P(B, C) = P(A \mid B, C) P(B \mid C) P(C),$$
which is what we wanted.
Now, assume it is true for $n - 1$. Then, we have
$$P(A_1, \dots, A_n) = P(A_n \mid A_1, \dots, A_{n-1}) P(A_1, \dots, A_{n-1}) = P(A_n \mid A_1, \dots, A_{n-1}) \prod_{i=1}^{n-1} P(A_i \mid A_1, \dots, A_{i-1}) = \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1}).$$
A.1.3 Bayes rule
Bayes' theorem gives an alternative formula for the conditional probability:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$
Proof.
Assuming all involved probabilities are non-zero:
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A) P(A)}{P(B)}.$$
This rule is useful to update the probability of an event happening when we are able to gather new information about related events. $P(A)$ is usually called the prior probability, and $P(A \mid B)$ is the a posteriori probability, which means that we have observed $B$ and want to update the probability estimate for $A$.
Example A.2.
Example of the Bayes rule in action
An English-speaking tourist visits a city whose language is not English. A local friend tells him that 1 in 10 natives speak English, 1 in 5 people in the streets are tourists, and that half of the tourists speak English. Our visitor stops someone in the street and finds that this person speaks English. What is the probability that this person is a tourist?
We have $P(T) = 1/5$, $P(E \mid T) = 1/2$ and $P(E \mid T^c) = 1/10$, where $T$ denotes being a tourist and $E$ speaking English.
We want to update our knowledge about the event of this person being a tourist. The prior probability is $P(T) = 0.2$, but since we know that this person speaks English, we have new information useful for updating this probability.
First, the total probability of someone speaking English is
$$P(E) = P(E \mid T) P(T) + P(E \mid T^c) P(T^c) = 0.5 \cdot 0.2 + 0.1 \cdot 0.8 = 0.18.$$
Now, the a posteriori probability of the person being a tourist, knowing that he speaks English, is
$$P(T \mid E) = \frac{P(E \mid T) P(T)}{P(E)} = \frac{0.1}{0.18} \approx 0.56.$$
As we can see (and as we should expect), knowing that the person speaks English increases our confidence that he is a tourist.
A.2 Bayes rule in the context of learning
As has been explained, Bayes' rule allows us to reason about hypotheses from data:
$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}.$$
In the jargon of parameters and datasets, this is: let $\theta$ be a random variable with support $\Theta$, and let $D$ be the data that has been observed. Then,
$$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)}.$$
Here, $P(\theta)$ is the prior distribution of $\theta$: the distribution we assume before observing $D$. $P(D \mid \theta)$ is the likelihood of $\theta$: the probability of observing $D$ if the parameters are $\theta$. $P(D)$ is the evidence or expected likelihood. $P(\theta \mid D)$ is the posterior distribution of $\theta$, our quantity of interest, expressing what we know about $\theta$ after having observed $D$.
Thus, we can continue this line of thought to tackle a new way of building a model: given some data $D$, find the best possible values for the unknown parameters $\theta$, so that the posterior or the likelihood is maximized.
There are, basically, two different approaches:
- Maximum likelihood: in this case, we choose $\theta$ so as to maximize its likelihood:
$$\hat{\theta}_{ML} = \arg\max_{\theta} P(D \mid \theta).$$
- Maximum a posteriori: in this case, we take into account a prior distribution for $\theta$ and estimate its value by maximizing its posterior distribution:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta) P(\theta).$$
A.3 Maximum likelihood estimation
Given a sample $D = \{x_1, \dots, x_n\}$, where the $x_i$ are independent and identically distributed observations from a random variable $X$ following a distribution $p(x \mid \theta)$, with $\theta$ the parameters of the distribution, our objective will be to obtain the best values for $\theta$ according to our data, assuming some particular form for $p$.
For this, the likelihood can be used: since the $x_i$ are independent, the probability of observing the sample is
$$P(D \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
The likelihood function is thus defined as
$$L(\theta) = P(D \mid \theta).$$
Note that this is computed with the data $D$ fixed, so it is not a probability distribution, but a function of $\theta$. This way, the maximum likelihood estimator for $\theta$ is given by
$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta).$$
There is a numerical issue here, though: we are multiplying probabilities, which are values between 0 and 1, and we are likely to be multiplying many of them, so we expect to obtain values very close to 0, which can lead to underflow in computations. Thus, it is convenient to use the log-likelihood:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
Example A.3.
Compute the maximum likelihood estimator (MLE) for a univariate Gaussian distribution.
For this, we assume the data $D = \{x_1, \dots, x_n\}$, where each $x_i \sim N(\mu, \sigma^2)$. In this situation, the parameters are $\mu$ and $\sigma^2$. The likelihood function is
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}.$$
Now, the log-likelihood is
$$\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.$$
At this point, to obtain the MLE, we maximize this function with respect to $\mu$ and $\sigma^2$:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 \iff \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0 \iff \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2,$$
where in the last step we substitute the value obtained for $\hat{\mu}$.
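A quick NumPy check of these closed forms on synthetic data (a sketch of mine; the true parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)  # true mu = 3, sigma = 2

mu_ml = x.mean()                      # MLE of the mean
var_ml = np.mean((x - mu_ml) ** 2)    # MLE of the variance (the 1/n, biased one)

print(mu_ml, var_ml)  # ~3.0 and ~4.0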
Example A.4.
Compute the MLE for a Bernoulli distribution.
Now the observations $x_1, \dots, x_n \in \{0, 1\}$ are the results of $n$ coin tosses. We have to compute the parameter $p$ of the Bernoulli random variable, whose probability function is given by
$$P(x \mid p) = p^{x} (1 - p)^{1 - x}.$$
The log-likelihood function is
$$\ell(p) = \sum_{i=1}^{n} \left[ x_i \log p + (1 - x_i) \log(1 - p) \right].$$
We differentiate:
$$\frac{d\ell}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p}.$$
This derivative is zero if and only if
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
A.4 Properties of estimators
Definition A.1.
If we have a dataset $D = \{x_1, \dots, x_n\}$ where the $x_i$ are drawn from a random variable $X$, we call an estimator any function $\hat{\theta} = f(x_1, \dots, x_n)$ of the sample.
Usually, we focus on estimators that tell us something about the underlying distribution of $X$. For example, it is usual to assume that $X$ belongs to a certain family of random variables, so that we need to estimate its parameters $\theta$.
There are some properties that are considered desirable for an estimator to have, such as small (or zero) bias, $\mathbb{E}[\hat{\theta}] - \theta$, and small variance.
Example A.5.
Show that
Let's start from the definition:
Example A.6.
Compute the bias and the variance of the ML estimates of a univariate Gaussian. Show that $\hat{\sigma}^2$ is biased and that we can correct its bias by using a different estimator,
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2.$$
Compute the bias and the variance of this new estimator.
Let's start with $\hat{\sigma}^2$:
Now $s^2$: and we see that this one is, in fact, unbiased.
A.5 Maximum a posteriori estimation
MAP (maximum a posteriori) estimation is a method of estimating the parameters of a statistical model by finding the parameter values that maximize the posterior probability distribution of the parameters, given the observed data and a prior probability distribution over the parameters. In contexts where the amount of data is limited or noisy, incorporating prior knowledge or beliefs can help to produce more stable and accurate estimates. The prior distribution serves as a regularization term, allowing us to control the degree of influence that the prior has on the estimate:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta) P(\theta)}{P(D)} = \arg\max_{\theta} P(D \mid \theta) P(\theta),$$
where the denominator can be ignored because it is constant for all possible $\theta$, so it does not change the $\arg\max$.
Example A.7.
Find the MAP estimate for $\mu$ of a univariate Gaussian, with Gaussian prior distribution $\mu \sim N(\mu_0, \sigma_0^2)$, where $\sigma^2$, $\mu_0$ and $\sigma_0^2$ are assumed to be known.
Let $D = \{x_1, \dots, x_n\}$, where $x_i \sim N(\mu, \sigma^2)$. We want to maximize
$$P(D \mid \mu) P(\mu) = \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right] \frac{1}{\sqrt{2\pi\sigma_0^2}} e^{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}},$$
or its log,
$$-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{(\mu - \mu_0)^2}{2\sigma_0^2} + \text{const}.$$
Thus, differentiating with respect to $\mu$,
$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) - \frac{\mu - \mu_0}{\sigma_0^2},$$
and this is zero if and only if
$$\hat{\mu}_{MAP} = \frac{\sigma_0^2 \sum_{i=1}^{n} x_i + \sigma^2 \mu_0}{n \sigma_0^2 + \sigma^2}.$$
A different way to obtain this result is to note that the posterior of $\mu$ is itself Gaussian, and to identify its mean and variance by completing the square.
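A small sketch of mine comparing the two estimates: with few, noisy observations the MAP estimate is shrunk towards the prior mean $\mu_0$, and the effect vanishes as $n$ grows (all numbers are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sigma, mu0, sigma0 = 2.0, 0.0, 1.0             # known noise scale and prior
x = rng.normal(loc=3.0, scale=sigma, size=10)  # few, noisy observations

mu_ml = x.mean()
mu_map = (sigma0**2 * x.sum() + sigma**2 * mu0) / (len(x) * sigma0**2 + sigma**2)

print(mu_ml, mu_map)  # MAP is pulled towards the prior mean mu0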
Example A.8.
The maximum likelihood estimate of $p$ for a Bernoulli r.v. is given by $\hat{p} = \frac{1}{n} \sum_i x_i$. If we have $K$ possible outcomes, then we have the categorical distribution, also known as multinoulli or generalized Bernoulli, which has support $\{1, \dots, K\}$. Its parameters are $p_1, \dots, p_K$, representing the probability of observing each of the possible outcomes, with $p_k \ge 0$ and $\sum_{k=1}^{K} p_k = 1$.
It is convenient to use the one-of-$K$ encoding (also called one-hot encoding) for each outcome, so that outcome $x$ becomes a vector with $x_k = 1$ for the observed category and 0 elsewhere. Thus, the pmf of this distribution becomes
$$P(x \mid p) = \prod_{k=1}^{K} p_k^{x_k}.$$
Now, given a sample $D = \{x_1, \dots, x_n\}$ of outcomes of a multinoulli r.v., the maximum likelihood estimate for each $p_k$ is the fraction of observations of category $k$,
$$\hat{p}_k = \frac{N_k}{n}, \qquad N_k = \sum_{i=1}^{n} x_{ik}.$$
We can write this compactly as $\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i$. If some category is not present in our sample, then its corresponding ML estimate is going to be 0. These 0-estimates are problematic in predictive applications, because unseen outcomes in the training data are considered impossible and thus can never be predicted. To avoid this, pseudocounts are used instead. These represent prior knowledge in the form of (imagined) counts $\alpha_k$ for each category $k$. The idea is to assume that the data is augmented with our pseudocounts, and then estimate using maximum likelihood over the augmented data, namely
$$\hat{p}_k = \frac{N_k + \alpha_k}{n + \sum_{j=1}^{K} \alpha_j}.$$
As an example, imagine that we obtain a sample from a die in which the outcome 2 never appears. With all pseudocounts equal to 1, the estimate for 2 becomes $\frac{0 + 1}{n + 6} > 0$: although 2 has not been observed, its probability estimate is not 0. This special case where all pseudocounts are 1 is known as Laplace smoothing.
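A sketch of pseudocount smoothing (the sample and the helper name smoothed_estimates are hypothetical; faces are encoded 0..5, so the face '2' of the text corresponds to index 1 here):

import numpy as np

def smoothed_estimates(sample, K, alpha):
    """ML estimate over data augmented with pseudocounts alpha (length K)."""
    counts = np.bincount(sample, minlength=K)
    return (counts + alpha) / (len(sample) + alpha.sum())

# hypothetical die sample in which the face '2' (index 1) never appears
sample = np.array([0, 2, 2, 5, 3, 0, 4, 5])  # faces encoded 0..5
alpha = np.ones(6)                           # Laplace smoothing
print(smoothed_estimates(sample, K=6, alpha=alpha))  # index 1 gets 1/14 > 0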
Prove that using maximum likelihood with pseudocounts $\alpha_1, \dots, \alpha_K$ corresponds to a MAP estimate with a Dirichlet prior with parameters $\alpha_1 + 1, \dots, \alpha_K + 1$.
A Dirichlet distribution, $\operatorname{Dir}(\beta_1, \dots, \beta_K)$, has the density function
$$f(p_1, \dots, p_K) = \frac{1}{B(\beta)} \prod_{k=1}^{K} p_k^{\beta_k - 1}.$$
To compute the MAP estimate, we use
$$\hat{p} = \arg\max_{p} P(D \mid p) f(p) = \arg\max_{p} \prod_{k=1}^{K} p_k^{N_k + \beta_k - 1}.$$
Thus, we want to maximize this expression or, equivalently, its log,
$$\sum_{k=1}^{K} (N_k + \beta_k - 1) \log p_k, \qquad \text{subject to } \sum_{k=1}^{K} p_k = 1.$$
The Lagrangian is
$$\mathcal{L}(p, \lambda) = \sum_{k=1}^{K} (N_k + \beta_k - 1) \log p_k + \lambda \left(1 - \sum_{k=1}^{K} p_k\right),$$
so that
$$\frac{\partial \mathcal{L}}{\partial p_k} = \frac{N_k + \beta_k - 1}{p_k} - \lambda = 0 \iff p_k = \frac{N_k + \beta_k - 1}{\lambda},$$
and then, imposing the constraint, $\lambda = \sum_{j=1}^{K} (N_j + \beta_j - 1) = n + \sum_j \beta_j - K$. Finally, substituting back with $\beta_k = \alpha_k + 1$, we obtain
$$\hat{p}_k = \frac{N_k + \alpha_k}{n + \sum_{j=1}^{K} \alpha_j},$$
as we wanted!
A.6 Bayesian Learning
In the Bayesian Learning framework, instead of working with point estimates of our unknown parameter variables $\theta$, we work with the whole posterior distribution $P(\theta \mid D)$. In this case, learning is the process by which, starting with some prior belief about the parameters and facing some observations in the form of a dataset $D$, we update our belief about the possible values of our parameters in the form of the posterior distribution. This process is iterated over newly received data, so it can be viewed as a sequential process, in which Bayes' rule is invoked each time we need a new posterior.
Example A.9.
Bayesian learning in MATLAB
A.6.1 Predictive posterior
When doing prediction in this framework, the whole posterior distribution is used. We view a prediction as a weighted average of the predictions that each value of $\theta$ can make, weighted by its posterior probability (i.e., an expected value):
$$P(y^* \mid x^*, D) = \int_{\Theta} P(y^* \mid x^*, \theta) \, P(\theta \mid D) \, d\theta.$$
Example A.10.
An insightful example
References
[1] Jose A. Lorencio Abril, "Apuntes de Inferencia Estadística" (2021).
[2] Marta Arias, "Machine Learning".
[3] Marta Arias, "Notes on probability theory, Bayes theorem and Bayesian learning".
[4] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain", Psychological Review 65, 6 (1958), pp. 386-408.