ML-MDS - Machine Learning

Spring 2023
Jose Antonio Lorencio Abril
image: 0_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Big_Data_Mng_LectureNotes_source_upc-logo.png
Professor: Marta Arias
Student e-mail: jose.antonio.lorencio@estudiantat.upc.edu

This is a summary of the course Machine Learning taught at the Universitat Politècnica de Catalunya by Professor Marta Arias in the academic year 22/23. Most of the content of this document is adapted from the course notes by Arias, [2], so I won't be citing it all the time. Other references will be provided when used.

Table of Contents


List of Figures

Figure 1: Matlab's accidents dataset and best linear fit.
Figure 2: SVD visual.
Figure 3: The independent variable does not provide enough information.
Figure 4: The variables are not linearly related.
Figure 5: The outlier distorts the fit.
Figure 6: Likelihood example.
Figure 7: Using likelihood to select the distribution.
Figure 8: Visual representation of the Bias-Variance errors.
Figure 9: Different Gaussian mixtures with the same components.
Figure 10: Different classifiers. Linear (left). Quadratic (middle). knn (right).
Figure 11: A neuron.
Figure 12: A simple perceptron.
Figure 13: Voronoi diagrams. Code in MATLAB script voronoi.mlx.
Figure 14: kNN decision regions for different values of k.
Figure 15: Multi-class classification with perceptrons.
Figure 16: Multi-class classification with perceptrons and softmax.
Figure 17: A Multi-Layer Perceptron.
Figure 18: Different kinds of NN.
Sub-Figure a: MLP.
Sub-Figure b: FFNN not MLP.
Sub-Figure c: RNN.

1 Introduction

I gave ChatGPT the index of these notes, and asked it for an introduction for them. Here is the result:
Welcome to the comprehensive study notes for your course in Machine Learning. This document is designed to provide you with a thorough overview of the primary topics in machine learning, to supplement your coursework and study.
Our journey will begin with an overview of Machine Learning and the foundational elements that build this exciting field. The aim is to provide you with a sound understanding of the core principles and techniques involved in the development and application of Machine Learning algorithms.
We will kick-off with 'Linear Regression,' a cornerstone of machine learning, going in-depth into topics like least squares method, issues encountered in linear regression, basis functions, and probabilistic approaches. It will also focus on Bias-Variance decomposition, ridge regression, LASSO regression, and finally, a full-Bayesian perspective.
Our exploration then continues with 'Clustering' where we cover important clustering techniques like k-Means, k-Means++, and ways to determine the optimal number of clusters. We delve deeper into the realm of Gaussian Mixtures and the Expectation-Maximization technique.
In the 'Linear Classifiers' chapter, we inspect the decision boundary in probabilistic models, generative classifiers, Naive Bayes, and methodologies like Perceptron and Logistic Regression.
Following this, we introduce 'Nearest Neighbor Prediction'—a different way of approaching classification problems. The section will equip you with tools to understand the concept of locality and choosing the best 'k', and how to combine outputs to make predictions.
We also explore the use of 'Trees and Random Forests' in machine learning. It will demystify decision trees and random forests, dealing with imbalanced data in classification, and providing insights into the interpretability of random forests.
Next, we journey into the world of 'Neural Networks' with the 'Multi-Layer Perceptron (MLP)', learning about the different error functions used for regression, binary classification, and multi-class classification. We also cover the critical process of training the MLP using Backpropagation and some common activation functions.
Finally, the notes conclude with an appendix section, 'Notes on probability theory, Bayes theorem and Bayesian learning.' This section is critical for understanding the probability theory basics, Bayes rule in the context of learning, maximum likelihood estimation, properties of estimators, maximum a posteriori estimation, and finally, Bayesian learning.
These notes aim to serve as a comprehensive reference guide for you as you explore machine learning, providing both a high-level understanding of the field and the mathematical and algorithmic details that underlie each concept. Happy learning!

2 Linear regression

2.1 Introduction

In Figure 1, we can observe a dataset of the population of different states plotted against the number of fatal accidents in each of the states. Here, each blue circle corresponds to a row of our data, and the coordinates are the ( population,#accidents ) values in the row. The red line is the linear regression model of this data. This means it is the line that 'best' approximates the data, where best refers to minimizing some kind of error: in this case, the sum of squared vertical distances (differences along the y axis) between each point and the line. This approach is called the least squares method.
image: 1_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado2.png
Figure 1: Matlab's accidents dataset and best linear fit.

2.2 Least squares method

2.2.1 Least squares in 2D

In 2D, we have a dataset { ( x i , y i ) ,i=1,...,n } and we want to find the line that best approximates y as a function of x . As we want a line, we need to specify its slope, θ 1 , and its intercept, θ 0 . So, our estimations are: y ˆ ( x i ) = y ˆ i = θ 0 + θ 1 x i . The least squares linear regression method chooses θ 0 and θ 1 in such a way that the error function J( θ 0 , θ 1 ) = i=1 n ( y i - y ˆ i ) 2 = i=1 n ( y i - θ 0 - θ 1 x i ) 2 is minimized.
Note that this function only depends on the parameters θ 0 and θ 1 , since the data is assumed to be fixed (they are observations).
To compute them, we just need to find the minimum of J , by taking partial derivatives and setting them to 0. Let's do this optimization. First, we can develop the square: J( θ 0 , θ 1 ) = i=1 n y i 2 + θ 0 2 + θ 1 2 x i 2 -2 θ 0 y i -2 θ 1 x i y i +2 θ 0 θ 1 x i . Thus:
J θ 0 = i=1 n 2 θ 0 -2 y i +2 θ 1 x i =2n θ 0 -2 i=1 n y i +2 θ 1 i=1 n x i and J θ 1 = i=1 n 2 θ 1 x i 2 -2 x i y i +2 θ 0 x i =2 θ 1 i=1 n x i 2 -2 i=1 n x i y i +2 θ 0 i=1 n x i . We have now to solve the system given by { 2n θ 0 -2 i=1 n y i +2 θ 1 i=1 n x i =0 2 θ 1 i=1 n x i 2 -2 i=1 n x i y i +2 θ 0 i=1 n x i =0 which is equivalent to { n θ 0 - i=1 n y i + θ 1 i=1 n x i =0 θ 1 i=1 n x i 2 - i=1 n x i y i + θ 0 i=1 n x i =0 . We can isolate θ 0 from the first equation: θ 0 = i=1 n y i - θ 1 i=1 n x i n , and substitute it in the second one θ 1 i=1 n x i 2 - i=1 n x i y i + i=1 n y i - θ 1 i=1 n x i n i=1 n x i =0, which is equivalent to θ 1 i=1 n x i 2 - i=1 n x i y i + i=1 n y i i=1 n x i - θ 1 [ i=1 n x i ] 2 n =0 or θ 1 [ i=1 n x i 2 - [ i=1 n x i ] 2 n ] - i=1 n x i y i + i=1 n y i i=1 n x i n =0. At this point, we can divide everything by n , yielding:
θ 1 [ i=1 n x i 2 n - [ i=1 n x i ] 2 n 2 ] - i=1 n x i y i n + i=1 n y i i=1 n x i n 2 =0.
If we now assume that the observations are equiprobable, i.e., P( x i ) = 1 n , and we call X the random variable from which observations x i are obtained and the same for the observations y i , obtained from Y , then: i=1 n x i 2 n =E[ X 2 ] , [ i=1 n x i ] 2 n 2 =E [ X ] 2 , i=1 n x i y i n =E[ XY ] , i=1 n y i i=1 n x i n 2 =E[ X ] E[ Y ] . This means that the previous equation can be rewritten as: θ 1 ( E[ X 2 ] -E [ X ] 2 ) -( E[ XY ] -E[ X ] E[ Y ] ) =0 θ 1 Var[ X ] -Cov[ X,Y ] =0 So θ 1 = Cov[ X,Y ] Var[ X ] , θ 0 =E[ Y ] - θ 1 E[ X ] .
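To make this concrete, here is a minimal NumPy sketch (not part of the original notes) that computes θ1 and θ0 from the covariance/variance formulas just derived on synthetic, made-up data, and cross-checks them against np.polyfit:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)        # hypothetical data for illustration

theta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # Cov[X, Y] / Var[X]
theta0 = y.mean() - theta1 * x.mean()                 # E[Y] - theta1 * E[X]

slope, intercept = np.polyfit(x, y, deg=1)            # NumPy's least-squares line, for comparison
print(theta0, theta1, intercept, slope)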

2.2.2 Least squares regression: multivariate case

Now, we can assume that we have m independent variables X 1 ,..., X m which we want to use to predict the dependent variable Y . Again, we have n observations of each variable. Now, we want to construct an hyperplane in R m+1 , whose predictions would be obtained as y ˆ ( X i ) = θ 0 + θ 1 x i1 +...+ θ m x im = θ 0 + j=1 m θ j x ij = j=0 m θ j x ij , where we define x i0 =1, for all i . We can represent X= ( x ij ) i=1,...,n;j=1,...,m ,Y=[ y 1 y n ], θ =[ θ 0 θ m ], so that we can write Y ˆ =X θ . The error function is defined as in the simple case J( θ ) = i=1 n ( y i - y ˆ i ) 2 , but now we can rewrite this as J( θ ) = i=1 n ( y i - y ˆ i ) 2 = ( Y- Y ˆ ) T ( Y- Y ˆ ) = ( Y-X θ ) T ( Y-X θ ) . Again, to obtain θ we need to optimize this function using matrix calculus.
Lemma 2.1. If A=[ a 11 ... a 1m a n1 ... a nm ] M n × m ( R ) , θ =[ θ 1 θ m ] R m and B=[ b 11 ... b 1n b n1 ... b nn ] M m × m ( R ) is a symmetric matrix, it holds: A θ θ =A, θ T A T θ =A, and θ T B θ θ =2 θ T B T .
Proof. First, notice that A θ =[ j=1 m a 1j θ j j=1 m a nj θ j ] R n , so it is A θ θ =[ j=1 m a 1j θ j θ 1 ... j=1 m a 1j θ j θ m j=1 m a nj θ j θ 1 ... j=1 m a nj θ j θ m ]=[ a 11 ... a 1m a n1 ... a nm ]=A. For the second result, the procedure is the same.
Lastly, notice that θ T B θ = k=1 m j=1 m θ k b kj θ j R , so θ T B θ θ =[ k=1 m j=1 m θ k b kj θ j θ 1 ... k=1 m j=1 m θ k b kj θ j θ m ]=[ 2 j=1 m b 1j θ j ... 2 j=1 m b mj θ j ]=2 [ B θ ] T =2 θ T B T .
Now, we can proceed and minimize J : J( θ ) θ = ( Y-X θ ) T ( Y-X θ ) θ = θ [ Y T Y- Y T X θ - θ T X T Y+ θ T X T X θ ] = 0- Y T X- Y T X+2 X T X θ = -2 Y T X+2 θ T X T X, setting this to be 0, we get θ T X T X= Y T X X T X θ = X T Y θ = ( X T X ) -1 X T Y. Thus, the 'best' linear model is given by θ lse = ( X T X ) -1 X T Y. Once we have this model, if we have an observation of X , x'=( x ' 1 ,...,x ' m ) and we want to make a prediction, we compute y'=x' θ lse . The approach that we have followed here is the optimization view of learning, which basically consists of the steps:
  1. Set up an error function as a function of some parameters.
  2. Optimize this function to find the suitable values for this parameters, assuming the data as given.
  3. Use incoming values to make predictions (see the sketch below).
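A minimal NumPy sketch of these steps (the true coefficients and noise level below are made up for illustration), solving the normal equations with np.linalg.solve instead of forming the explicit inverse:

import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # prepend the column x_{i0} = 1
true_theta = np.array([1.0, 2.0, -1.0, 0.5])                # hypothetical ground truth
y = X @ true_theta + rng.normal(0, 0.1, size=n)

theta_lse = np.linalg.solve(X.T @ X, X.T @ y)               # normal equations: (X^T X) theta = X^T y

x_new = np.array([1.0, 0.2, -0.3, 1.5])                     # new observation (leading 1 included)
y_pred = x_new @ theta_lse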

2.2.3 Computation of least squares solution via the singular values decomposition (SVD)

Inverting X T X can entail numerical problems, so the SVD can be used instead.
Theorem 2.1. Any matrix A R m × n , m>n , can be expressed as A=U Σ V T , where U R m × n has orthonormal columns ( U T U=I ) , Σ R n × n is diagonal and contains the singular values in its diagonal and V R n × n is orthonormal ( V -1 = V T ) .
Proof. Let C= A T A ∈ R n × n . C is square, symmetric and positive semidefinite. Therefore, C is diagonalizable, so it can be written as C=V Λ V T , where V= ( v i ) i=1,...,n is orthogonal and Λ =diag( λ 1 ,..., λ n ) with λ 1 ≥ ... ≥ λ r >0= λ r+1 =...= λ n and r is rank( A ) ≤ n .
Now, define σ i = √ λ i , and form the matrix Σ =[ diag( σ 1 ,..., σ r ) 0 r × ( n-r ) ; 0 ( n-r ) × r 0 ( n-r ) × ( n-r ) ] ∈ R n × n . Define also u i = 1 σ i A v i ∈ R m ,i=1,...,r. Then, these vectors are orthonormal: u i T u j = ( 1 σ i A v i ) T ( 1 σ j A v j ) = 1 σ i σ j v i T A T A v j = 1 σ i σ j v i T C v j =(*) 1 σ i σ j v i T ( λ j v j ) =( λ j = σ j 2 ) σ j σ i v i T v j =(**) δ ij , i.e., 1 if i=j and 0 otherwise, where ( * ) is because V is formed with the eigenvectors of C , and ( ** ) is because V is orthonormal.
Now, we can complete this set with u r+1 ,..., u n (using Gram-Schmidt) in such a way that U=[ u 1 ,..., u r , u r+1 ,..., u n ] ∈ R m × n has orthonormal columns.
Now, if it is the case that AV=U Σ , then A=AV V T =U Σ V T , so it only remains to see that this indeed holds. Consider two cases: for i ≤ r , A v i = σ i u i by the definition of u i ; for i>r , ‖ A v i ‖ 2 = v i T A T A v i = v i T C v i = λ i =0 , so A v i =0= σ i u i (since σ i =0 ). Hence AV=U Σ , as required.
This, added to the fact that if X has full rank, X T X is invertible and all its eigenvalues non null, gives us: θ lse = ( X T X ) -1 X T y = ( ( U Σ V T ) U Σ V T ) -1 ( U Σ V T ) T y = ( V Σ U T U Σ V T ) -1 V Σ U T y = Σ -2 V Σ U T y=V Σ -1 U T y = Vdiag( 1 σ 1 ,..., 1 σ n ) U T y.

Intuitive interpretation

The intuition behind the SVD is summarized in Figure 2. Basically, every linear transformation can be decomposed into a rotation, a scaling and a simpler transformation (column orthogonal).
image: 2_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_SVD.png
Figure 2: SVD visual.
The intuition behind SVD lies in the idea of finding a low-rank approximation of a given matrix. The rank of a matrix is the number of linearly independent rows or columns it contains. A high-rank matrix has many linearly independent rows or columns, which makes it complex and difficult to analyze. On the other hand, a low-rank matrix has fewer linearly independent rows or columns, which makes it simpler and easier to analyze.
SVD provides a way to find the best possible low-rank approximation of a given matrix by decomposing it into three components. The left singular vectors represent the direction of maximum variance in the data, while the right singular vectors represent the direction of maximum correlation between the variables. The singular values represent the magnitude of the variance or correlation in each direction.
By truncating the diagonal matrix of singular values to keep only the top-k values, we can obtain a low-rank approximation of the original matrix that retains most of the important information. This is useful for reducing the dimensionality of data, compressing images, and solving linear equations, among other applications.
Example 2.1. How to use SVD in Python and Matlab.
import numpy as np

U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(1/d)
theta = Vt.T @ D @ U.T @ y
Algorithm 1: SVD in Python.
[U, S, V] = svd(X, 'econ')   % economy-size SVD, analogous to full_matrices=False
D = diag(1./diag(S))         % invert the singular values
theta = V*D*U'*y
Algorithm 2: SVD in Matlab.

2.3 Things that could go wrong when using linear regression

2.3.1 Our independent variable is not enough

It is possible that our variable X does not provide enough information to predict Y .
image: 3_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado7.png
Figure 3: The independent variable does not provide enough information.

2.3.2 The relationship between the variables is not linear (underfitting)

It is also possible that the variables are related in non-linear ways.
image: 4_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado6.png
Figure 4: The variables are not linearly related.

2.3.3 Outliers affect the fit

In the presence of outliers, the model obtained can be distorted, leading to bad results.
image: 5_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado8.png
Figure 5: The outlier distorts the fit.

2.4 Basis Functions

In order to fix the second problem (Subsection 2.3.2), we can make use of basis functions. The idea is to apply different transformations to the data, so that we can extend the expressive power of our model.
Definition 2.1. A feature mapping is a non-linear transformation of the inputs φ : R m → R k .
The resulting predictive function or model is y= φ ( x ) θ .
Example 2.2. For example, we can consider the polynomial expansion of degree k , which is a commonly used feature mapping that approximates the relationship between the independent variable x and the dependent variable y to be polynomial of degree k , i.e.: y= θ 0 + θ 1 x+ θ 2 x 2 +...+ θ k x k . The feature mapping is φ ( x ) =( 1 x x 2 ... x k ) .
Note that the idea is to transform the data so that the fit is still linear, even if the relationship is not. Of course, this requires applying the same transformation whenever we receive an input for which we want to make predictions. Also, the resulting model is more complex, so complexity control is necessary to avoid overfitting.
When we apply φ to the input matrix X , we get a new input matrix, given by Φ =( φ ( x 1 ) φ ( x 2 ) φ ( x n ) )=( φ 1 ( x 1 ) φ 2 ( x 1 ) φ m ( x 1 ) φ 1 ( x 2 ) φ 2 ( x 2 ) φ m ( x 2 ) φ 1 ( x n ) φ 2 ( x n ) φ m ( x n ) ), and we obtain the optimal solution as before: θ min = arg min θ ( y- Φ θ ) T ( y- Φ θ ) = ( Φ T Φ ) -1 Φ T y.
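A minimal sketch (not from the course materials) of a polynomial feature mapping followed by ordinary least squares; the sine-shaped data and the degree k = 5 are arbitrary illustrative choices:

import numpy as np

def poly_features(x, k):
    # map a 1-D array x to the matrix Phi with columns x^0, x^1, ..., x^k
    return np.vander(x, N=k + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.1, size=x.size)    # non-linear relationship with noise

Phi = poly_features(x, k=5)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # least squares on the transformed inputs

x_new = np.array([0.5, 1.5])                       # the same mapping is applied to new inputs
y_new = poly_features(x_new, k=5) @ theta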
Example 2.3. A MATLAB example
image: 6_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Ma____source_scripts_example_2_3_images_figure_0.png
image: 7_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Ma____source_scripts_example_2_3_images_figure_1.png

2.5 Probabilistic approach

A review on probability theory, Bayes theorem and Bayesian Learning is in Appendix A.

2.5.1 Least squares regression from a probabilistic perspective

We are now going to derive the linear regression estimates using the principle of maximum likelihood and the univariate Gaussian distribution, whose probability density function is given by p( x ) = 1 2 π σ 2 e - 1 2 σ 2 ( x- μ ) 2 . Given a sample D={ x 1 ,..., x n } , where x i ∼ N ( μ , σ ) , we define its likelihood as the function L ( μ , σ ,D ) =P( D; μ , σ ) = ∏ i p( x i ; μ , σ ) . In Figure 6, we can see how the likelihood relates to 'how likely it is that our points have been created from a certain distribution', because the red outcomes are more likely to appear from the blue distribution than the green ones.
image: 8_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado9.png Figure 6: Likelihood example.
Now, this can be used to select the distribution that best matches our data. As an easy approach, suppose we want to decide between two distributions F 1 and F 2 , and we have a dataset D . To decide, we can compute L ( F 1 ,D ) and L ( F 2 ,D ) and select the distribution whose likelihood is greater. This is visually exemplified in Figure 7, where we can see that given the three red outcomes and the two distributions (blue and green), the blue one should be preferred, because it maximizes the likelihood.
image: 9_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado10.png
Figure 7: Using likelihood to select the distribution.
This way, we can think of the likelihood as a function of the unknown parameters of the distribution, with the dataset fixed, and we can maximize this function to obtain the parameters that best describe our data.
In the probabilistic setting of linear regression, we assume that each label y i we observe is normally distributed, with mean μ = x i θ and variance σ 2 : y i = x i θ + ϵ i , where ϵ i ∼ N ( 0, σ 2 ) . This way, we seek to obtain θ and σ 2 that best describe our data. Remember that y=( y 1 y n ),X=( 1 x 11 ... x 1d 1 x n1 ... x nd ), so that the likelihood of the parameter vector θ is given by L ( θ , σ ) =P( y|X; θ , σ 2 ) = ∏ i=1 n p( y i | x i ; θ , σ 2 ) . It is usual to maximize the log-likelihood instead, basically because the likelihood tends to give values too close to zero (we may be multiplying thousands of small values), so numerical problems may arise. Thus: l( θ , σ 2 ) = log L ( θ , σ 2 ) = log ∏ i=1 n p( y i | x i ; θ , σ 2 ) = ∑ i=1 n log p( y i | x i ; θ , σ 2 ) = ∑ i=1 n log [ 1 2 π σ 2 e - 1 2 σ 2 ( y i - x i θ ) 2 ] = ∑ i=1 n ( log 1 2 π σ 2 - 1 2 σ 2 ( y i - x i θ ) 2 ) = - n 2 log ( 2 π σ 2 ) - 1 2 σ 2 ∑ i=1 n ( y i - x i θ ) 2 = - n 2 log ( 2 π σ 2 ) - 1 2 σ 2 ( y-X θ ) T ( y-X θ ) . At this point, we differentiate and set equal to 0: ∂ θ l( θ , σ 2 ) =- 1 2 σ 2 ( -2 X T y+2 X T X θ ) =0, ∂ σ 2 l( θ , σ 2 ) =- n 2 σ 2 + 1 2 σ 4 ( y-X θ ) T ( y-X θ ) =0, obtaining θ ML = ( X T X ) -1 X T y and σ ML 2 = 1 n ( y-X θ ML ) T ( y-X θ ML ) = 1 n ∑ i=1 n ( y i - x i θ ML ) 2 =MSE. It is noticeable that the maximum likelihood estimates coincide with the estimates we found minimizing the squared error. This is a consequence of assuming Gaussian noise; assuming a different noise distribution would correspond to minimizing a different error function.

2.6 Bias-Variance decomposition

Let f: R d → R be the true function that we are trying to approximate and D={ ( x 1 , y 1 ) ,...,( x n , y n ) } a finite training dataset, where y i =f( x i ) + ϵ i and ϵ i ∼ N ( 0, σ 2 ) . Let x ∈ R d be a test data point. The setup is using D to train a model f ˆ that we want to use to make predictions y ˆ D = f ˆ ( x ) . We are going to see how the expected squared error ( y- y ˆ D ) 2 , where y is the real value, can be decomposed as a sum of the following components: the irreducible noise variance σ 2 , the squared bias, and the variance of the estimator.
Let's do this. Write y- y ˆ D = ( f( x ) -E[ y ˆ D ] ) + ϵ + ( E[ y ˆ D ] - y ˆ D ) =: A+B+C. Then E[ ( y- y ˆ D ) 2 ] = E[ ( A+B+C ) 2 ] = E[ A 2 ] +E[ B 2 ] +E[ C 2 ] +2E[ AB ] +2E[ AC ] +2E[ BC ] . The noise ϵ has zero mean and is independent of y ˆ D , and E[ C ] =E[ E[ y ˆ D ] - y ˆ D ] =0 , so the three cross terms vanish and E[ ( y- y ˆ D ) 2 ] = ( f( x ) -E[ y ˆ D ] ) 2 + σ 2 +E[ ( y ˆ D -E[ y ˆ D ] ) 2 ] . We now define Bias[ y ˆ D ] =f( x ) -E[ y ˆ D ] and Variance[ y ˆ D ] =E[ ( y ˆ D -E[ y ˆ D ] ) 2 ] , so that the expected squared error is Bias 2 + σ 2 +Variance. The Bias reflects the expected difference between our assumed model and the real function, while the variance reflects the variability of the obtained model around its average behaviour over training sets. In Figure 8, we can see a visual representation of these two sources of error.
image: 10_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado14.png
Figure 8: Visual representation of the Bias-Variance errors.
A summary of commonly used errors used for regression is shown in Table 1.
Name                                  Abbreviation    Formula
mean squared error                    MSE             (1/n) ∑ i=1 n ( y i - x i θ ) 2
root mean squared error               RMSE            √MSE
normalized root mean squared error    NRMSE           √( MSE / Var( y ) )
coefficient of determination          R 2             1 - NRMSE 2
mean absolute error                   MAE             (1/n) ∑ i=1 n | y i - x i θ |
Table 1: Common error functions.

2.7 Ridge Regression from Gaussian prior

We are going to consider MAP estimates in this section, which are explained in Appendix A.
Assume isotropic Gaussian prior on d -dimensional θ , i.e., θ ∼ N ( μ =0, Σ = τ 2 I ) , so that Σ -1 = 1 τ 2 I and det Σ = τ 2d . Then, on one hand, we have P( θ ; μ =0, Σ = τ 2 I ) = 1 det ( Σ ) 1 2 ( 2 π ) d 2 exp { - 1 2 ( θ - μ ) T Σ -1 ( θ - μ ) } = 1 ( 2 π τ 2 ) d 2 exp { - 1 2 τ 2 θ T θ } = 1 ( 2 π τ 2 ) d 2 exp { - θ 2 2 τ 2 } . On the other hand, it is P( θ |y,X ) P( y|X, θ ) P( θ ) exp { - 1 2 σ 2 ( y-X θ ) T ( y-X θ ) } exp { - θ 2 2 τ 2 } = exp { - 1 2 σ 2 ( y-X θ ) T ( y-X θ ) - θ 2 2 τ 2 } . To obtain the MAP, we maximize the log of this expression: θ MAP = arg max θ log [ P( y|X, θ ) P( θ ) ] = arg max θ [ - 1 2 σ 2 ( y-X θ ) T ( y-X θ ) - θ 2 2 τ 2 ] = arg min θ [ ( y-X θ ) T ( y-X θ ) + σ 2 τ 2 θ 2 ] = arg min θ [ ( y-X θ ) T ( y-X θ ) + λ θ 2 ] which is the ridge regression estimate with λ = σ 2 τ 2 . We can now differentiate this expression to find its minimum: ∂ θ [ y-X θ 2 + λ θ 2 ] = -2 X T y+2 X T X θ +2 λ θ =0 ( 2 X T X+2 λ I ) θ =2 X T y θ MAP = θ ridge = ( X T X+ λ I ) -1 X T y.
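A minimal sketch of this closed-form ridge estimate on synthetic data (note that in practice the intercept column is usually left unpenalized; this sketch penalizes every coefficient for simplicity):

import numpy as np

def ridge_fit(X, y, lam):
    # theta_ridge = (X^T X + lambda I)^{-1} X^T y, solved without an explicit inverse
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.1, size=50)   # made-up data for illustration

theta_ridge = ridge_fit(X, y, lam=1.0)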
Remark 2.1. There are a few remarks here:

2.7.1 Tuning λ

Cross-validation

Given a dataset D , the k -fold cross-validation approach starts by separating D into two subsets: a training set D train and a test set D test .
These are obtained in such a way that:
  1. D train D test = .
  2. D train D test =D .
Now, we divide D train into k subsets of equal size, { D train,i } i=1 k , called folds, and imagine we want to decide on a set of values for our model's parameter λ Λ ={ λ 1 ,..., λ l } .
  1. For each λ Λ :
    1. For each i=1,...,k :
      1. Train the model on D train ∖ D train,i .
      2. Evaluate the model in D train,i .
    2. Average the k evaluations to obtain an estimation of the performance of the model.
  2. Select the λ that gives us the best estimation (see the sketch below).
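A minimal sketch of this procedure for tuning λ in ridge regression (the fold construction and the ridge solver are written inline so the snippet is self-contained; lambdas is assumed to be a list of candidate values):

import numpy as np

def cv_select_lambda(X, y, lambdas, k=5, seed=0):
    # k-fold cross-validation: return the lambda with lowest average validation MSE
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for lam in lambdas:
        fold_errors = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            theta = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(X.shape[1]),
                                    X[train].T @ y[train])
            fold_errors.append(np.mean((y[val] - X[val] @ theta) ** 2))
        scores.append(np.mean(fold_errors))
    return lambdas[int(np.argmin(scores))]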

Leave-one-out cross-validation (LOOCV)

This is a special case of cross-validation in which k=n , i.e., each data point is a fold.

2.7.2 LOOCV for Ridge regression

As a particularity of linear and ridge regression, for a given value of λ , only one training is necessary for LOOCV, so we can proceed as follows:
  1. For each λ Λ :
    1. Compute the optimal solution θ ˆ λ = ( X T X+ λ I ) -1 X T y.
    2. Compute the hat matrix or smoothing matrix H λ = ( h ij ) ij =X ( X T X+ λ I ) -1 X T .
    3. Compute LOOCV directly for each λ , without folding: LOOCV( λ ) = 1 n ∑ i=1 n ( ( y i - x i θ ˆ λ ) / ( 1- h ii ) ) 2 .
  2. Return the λ with minimum LOOCV (see the sketch below).
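A minimal sketch of this closed-form LOOCV computation: it fits the model once per λ and reuses the diagonal of the hat matrix.

import numpy as np

def loocv_ridge(X, y, lam):
    # closed-form leave-one-out error for ridge: one fit per lambda, no explicit folds
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)   # hat matrix H_lambda
    residuals = (y - H @ y) / (1.0 - np.diag(H))
    return np.mean(residuals ** 2)

# usage: best_lam = min(lambdas, key=lambda lam: loocv_ridge(X, y, lam))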

2.7.3 Generalized Cross-Validation (GCV)

Generalized Cross-Validation (GCV) is a model selection technique used to estimate the performance of a model in terms of prediction accuracy. GCV is particularly useful for choosing the optimal parameters in regularization or smoothing methods, where the goal is to balance model complexity and goodness of fit. Examples of such methods include ridge regression, LASSO, and smoothing splines.
The main idea of GCV is to approximate the leave-one-out cross-validation (LOOCV) error without actually performing the computationally expensive process of fitting the model to all but one data point multiple times. Also, it is more computationally stable than the previous approach.
The GCV score is defined as follows:
GCV( λ ) = 1 n ∑ i=1 n ( ( y i - x i θ ˆ λ ) / ( 1- Tr( H λ ) / n ) ) 2 = MSE λ / ( 1- Tr( H λ ) / n ) 2 , where Tr( H λ ) is the trace of H λ .

2.8 LASSO regression

Definition 2.2. The Lp-norm of a vector θ ∈ R d is ‖ θ ‖ p = ( ∑ i=1 d | θ i | p ) 1/p .
Remark 2.2. As we have seen, assuming an isotropic Gaussian prior on the parameters leads to ridge regression, which minimizes the L2-norm of θ and the squared error.
Another very common choice is p=1 , which leads to LASSO regression. Thus, LASSO regression minimizes the L1-norm of the parameters and the squared error: θ LASSO = arg min θ [ ‖ y-X θ ‖ 2 2 + λ ‖ θ ‖ 1 ] . In fact, LASSO regression arises assuming a Laplace distribution prior over the parameters.
Some characteristics: the L1 penalty tends to drive some coefficients exactly to zero, so LASSO performs implicit feature selection and yields sparse models; on the other hand, there is no closed-form solution, so iterative optimization methods are needed.
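As an illustration, here is a minimal sketch using scikit-learn's Lasso (a coordinate-descent solver); the alpha value and the synthetic data are arbitrary choices, not from the course materials:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
theta_true = np.zeros(20)
theta_true[:3] = [2.0, -1.0, 0.5]           # only a few non-zero coefficients
y = X @ theta_true + rng.normal(0, 0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)          # alpha plays the role of lambda
print(np.sum(lasso.coef_ != 0))             # many coefficients are driven exactly to zero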

2.9 The full-Bayesian perspective

Both maximum likelihood (ML) and maximum a posteriori (MAP) produce point estimates of the parameters, while in Bayesian Learning
1
Refer to Appendix A for more details.
we want the full posterior distribution of the parameters. The idea is that if we know the posterior distribution of the parameters, then we can use all the information provided by this distribution for our predictions, instead of just a single point-estimate that summarizes it. For instance, if the probability function of the posterior is p( θ |D ) and we receive a new input point x , then we can compute the probability of f( x ) =y by p( y|x,D ) = ∫ Θ p( y|x, θ ,D ) p( θ |D ) dθ . Now, when we do this for all possible values of y , we obtain the full distribution of the predictions, instead of just one estimation. Nonetheless, computing this integral is usually too hard, so it needs to be approximated, but in the context of linear regression all these expressions have closed-form formulas
2
For computational speed we may use approximations as well.
.
Technically, ML and MAP assume that Y N ( x T θ , σ 2 ) , so a prediction for a new test point x ¯ is going to have a distribution N ( x ¯ T θ ˆ , σ 2 ) . Note now the lack of flexibility of this approach, since the width of the normal distribution is going to be the same for any new test point, which may be a dangerous assumption.
Let D= { ( x i , y i ) } i=1 n be our dataset, with x i ∈ R d and y i ∈ R , and assume a Gaussian likelihood with known noise precision a , i.e., y i = x i T θ + ϵ i with ϵ i ∼ N ( 0, 1 a ) , and an isotropic Gaussian prior on the parameters with precision b , i.e., θ ∼ N ( 0, 1 b I ) .
Then, the likelihood function is p( D| θ ) exp { - a 2 ( y-X θ ) T ( y-X θ ) } . As usual, using Bayes, we can derive the posterior distribution p( θ |D ) p( D| θ ) p( θ ) exp { - a 2 ( y-X θ ) T ( y-X θ ) } exp { - b 2 θ T θ } exp { - a 2 ( y-X θ ) T ( y-X θ ) - b 2 θ T θ } . Notice here that the exponent of this expression is quadratic on θ , so it is going to be a multivariate Gaussian
3
The idea is that the product of two Gaussians is also Gaussian. Think of it as if you are measuring height and IQ: these variables can be assumed independent, and have a normal distribution. You can multiply them and obtain a combined normal distribution in 2-D. The idea here is basically the same.
. We need to turn the exponent into something resembling ( θ - μ ) T Q( θ - μ ) , so that we can derive what the mean μ and the precision Q of the posterior density is. For this, we are going to complete squares so that we can 'match the terms' between ( θ - μ ) T Q( θ - μ ) and a ( y-X θ ) T ( y-X θ ) +b θ T θ . On one hand: ( θ - μ ) T Q( θ - μ ) = θ T Q θ - θ Q μ - μ T Q θ + μ T Q μ = θ T Q θ -2 θ T Q μ +const. We don't mind about constant terms with respect to θ , since we only care about proportionality. On the other hand, we have: a ( y-X θ ) T ( y-X θ ) +b θ T θ = a y T y-a y T X θ -a θ T X T y+a θ T X T X θ +b θ T θ = a y T y-2a θ T X T y+ θ T ( a X T X+bI ) θ . From here, we obtain that Q=a X T X+bI, and we now match -2a θ T X T y with -2 θ T Q μ
4
Q is invertible because it is positive definite, thanks to the +bI , with b>0 .
: a X T y=Q μ μ =a Q -1 X T y. Thus, the posterior probability is p( θ |D ) N ( μ , Q -1 ) with Q=a X T X+bI, μ =a Q -1 X T y. The MAP estimate can be directly obtained from here, since the maximum density is obtained at the mean in any Gaussian distribution. Additionally, in ridge regression we let λ = b a and turn it into a parameter that we can tune to control complexity against training error.
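A minimal sketch computing the posterior precision Q and mean μ just derived; a and b are the (assumed known) noise and prior precisions:

import numpy as np

def posterior(X, y, a, b):
    # posterior precision Q = a X^T X + b I and mean mu = a Q^{-1} X^T y
    d = X.shape[1]
    Q = a * X.T @ X + b * np.eye(d)
    mu = a * np.linalg.solve(Q, X.T @ y)   # also the MAP / ridge solution with lambda = b / a
    return mu, Q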

2.9.1 Using the posterior distribution for predictions

Let's now see how to compute the predictive distribution, i.e., p( y|x,D ) = Θ Θ p( y|x, θ ,D ) p( θ |D ) θ . For this, we substitute the densities p( y|x,D ) = Θ Θ N ( y| x T θ , 1 a ) N ( θ | Q -1 ) θ w.r.t.y Θ Θ exp { - a 2 ( y- x T θ ) 2 } exp { - 1 2 ( θ - μ ) T Q( θ - μ ) } θ = Θ Θ exp { - a 2 ( y 2 -2( x T θ ) y+ ( x T θ ) 2 ) - 1 2 ( θ T Q θ -2 θ T Q μ + μ T Q μ ) } θ Θ Θ exp { - a 2 ( y 2 -2( x T θ ) y+ ( x T θ ) 2 ) - 1 2 ( θ T Q θ -2 θ T Q μ ) } θ . Now, our objective is to set this integral to equate (or be proportional to) another of the form Θ Θ N ( θ |... ) g( y ) θ =g( y ) Θ Θ N ( θ |... ) θ =g( y ) , and finally to see that g( y ) N ( y|... ) . We then have p( y|x,D ) Θ Θ exp { - a 2 ( y 2 -2( x T θ ) y+ ( x T θ ) 2 ) - 1 2 ( θ T Q θ -2 θ T Q μ ) } θ = Θ Θ exp { - 1 2 [ a y 2 -2a x T θ y+a θ T x x T θ + θ T Q θ -2 θ T Q μ ] } θ = Θ Θ exp { - 1 2 [ θ T ( ax x T +Q ) θ -2 θ T ( xya+Q μ ) +a y 2 ] } θ . Again, we want to match to something of the form ( θ -m ) T L( θ -m ) = θ T L θ -2 θ T Lm+ m T Lm, so L=ax x T +Q and Lm=xya+Q μ m= L -1 ( xya+Q μ ) , assuming L -1 exists for now. Then: p( y|x,D ) Θ Θ exp { - 1 2 [ θ T ( ax x T +Q ) θ -2 θ T ( xya+Q μ ) +a y 2 ] } θ = Θ Θ exp { - 1 2 [ θ T L θ -2 θ T Lm+ m T Lm- m T Lm+a y 2 ] } θ = Θ Θ exp { - 1 2 [ ( θ -m ) T L( θ -m ) - m T Lm+a y 2 ] } θ . Notice here that m is independent of θ so that we have p( y|x,D ) Θ Θ exp { - 1 2 [ ( θ -m ) T L( θ -m ) - m T Lm+a y 2 ] } θ = Θ Θ exp { - 1 2 [ ( θ -m ) T L( θ -m ) ] } exp { - 1 2 { a y 2 - m T Lm } } θ = Θ Θ exp { - 1 2 [ ( θ -m ) T L( θ -m ) ] } g( y ) θ = g( y ) Θ Θ exp { - 1 2 [ ( θ -m ) T L( θ -m ) ] } θ g( y ) = exp { - 1 2 { a y 2 - m T Lm } } . And now... we complete squares again :D m T Lm= ( ayx+Q μ ) T L -1 L L -1 ( ayx+Q μ ) = ay x T L -1 ayx+2ay x T L -1 Q μ + = ( a 2 x T L -1 x ) y 2 +2( a x T L -1 Q μ ) y+const. So a y 2 - m T Lm=( a- a 2 x T L -1 x ) y 2 -2( a x T L -1 Q μ ) y+const If g( y ) is a Gaussian, then this should look something like λ ( y-u ) 2 = λ y 2 -2 λ uy+ λ u 2 , so λ =a- a 2 x T L -1 x and λ u=a x T L -1 Q μ , so u= 1 λ a x T L -1 Q μ . And then, we have p( y|x,D ) g( y ) exp { - λ 2 ( y-u ) 2 } , so we have that p( y|x,D ) = N ( y|u, 1 λ ) .
Not only this, but
5
The derivation for these values is a bit involved. It can be consulted in https://www.youtube.com/watch?v=LCISTY9S6SQ&t=287s.
u= μ T x and 1 λ = 1 a + x T Q -1 x.
We note now that the predictive distribution's mean prediction equals the point-prediction of the MAP. However, the variance of the prediction does depend on x , which is good, since the uncertainty of our predictions depends on how far the new point is from the observed samples: predictions in regions well covered by the data come with smaller variance, and predictions far from the data come with larger variance.
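A minimal sketch of the predictive mean and variance for a new input x, using the posterior (μ, Q) from the earlier sketch and u = μᵀx, 1/λ = 1/a + xᵀQ⁻¹x:

import numpy as np

def predictive(x, mu, Q, a):
    # predictive distribution N(u, 1/lambda) for a new input x
    mean = mu @ x
    var = 1.0 / a + x @ np.linalg.solve(Q, x)
    return mean, var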

3 Clustering

The goal of clustering is to partition a dataset into groups called clusters, in such a way that observations in the same cluster tend to be more similar than observations in different clusters. The input data is embedded in a d -dimensional space with a similarity/dissimilarity function defined among elements in the space, which should capture relatedness among elements in this space. Two elements are understood to be related when they are close in the space. Thus, a cluster is a compact group that is separated from other groups or elements outside the cluster.
There is a large variety of clustering algorithms, such as hierarchical bottom-up or top-down clustering, probabilistic clustering, possibilistic clustering, algorithmic clustering, spectral clustering or density-based clustering.
The problem of clustering is quite complex: if we have N data points which we want to separate into K clusters, then there are S( N,K ) = 1 K! ∑ i=0 K ( -1 ) i ( K i ) ( K-i ) N possibilities. This is the Stirling number of the second kind.
If in addition we don't know how many clusters we want to use, we have to add all possible K=1,...,N , summing up to B( N ) = ∑ K=1 N S( N,K ) possibilities. This number is the Bell number, which is really gigantic
6
For example, B( 71 ) ≈ 4 × 10^71 .
.

3.1 k-Means

The k -Means clustering algorithm takes a dataset D={ x 1 ,..., x n } and an integer k>1 as input, and separates D into k disjoint clusters. It is a representative-based clustering, meaning that each cluster is represented by one single point. In the case of k -means, the representative is the cluster center, μ k ∈ R d ,k=1,...,K , i.e., the average of all the points in that cluster. Each point is thus assigned to its closest representative point.
If C k is the k -th cluster, then we consider it a better cluster when the value x C k x- μ k 2 is smaller.
Now, let's formalize all this. First, we introduce an indicator variable r ik ={ 1 if x i ∈ C k ; 0 otherwise } , and the objective function J ( μ ,r ) = ∑ k=1 K ∑ i=1 n r ik ‖ x i - μ k ‖ 2 , which we aim to minimize by selecting appropriate { μ k } k and { r ik } ik . The issue is that this problem is NP-hard, so we use a heuristic method that is only guaranteed to find local minima. This method relies on two facts:
  1. For fixed cluster centers μ k , it is easy to optimize cluster assignments r ik .
    Proof. Assume fixed { μ k } k . Then we assign x i to its closest center μ k , because for any other center μ j it holds ‖ x i - μ j ‖ 2 ≥ ‖ x i - μ k ‖ 2 , so any other assignment can only increase the objective.
  2. For fixed cluster assignments r ik , it is easy to optimize cluster centers μ k .
    Proof. Assume fixed { r ij } ij . Then ∇ μ j J ( μ 1 ,..., μ K ) = ∇ μ j ∑ k=1 K ∑ i=1 n r ik ‖ x i - μ k ‖ 2 = ∑ i=1 n r ij ∇ μ j ( x i - μ j ) T ( x i - μ j ) (the terms with k≠j do not depend on μ j ) = ∑ i=1 n r ij ∇ μ j ( x i T x i -2 μ j T x i + μ j T μ j ) = ∑ i=1 n r ij ( -2 x i +2 μ j ) = -2 ∑ i=1 n r ij x i +2 μ j ∑ i=1 n r ij . Thus, the minimum is obtained at μ j = ∑ i=1 n r ij x i / ∑ i=1 n r ij = ( ∑ x∈ C j x ) / card( C j ) . This is the average of the points of the cluster.
The pseudocode is illustrated in Algorithm 3.
Algorithm 3: k-Means.
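Since the pseudocode body is not reproduced here, the following is a minimal Python sketch of the two alternating steps just described (random initialization from the data points; an empty cluster simply keeps its previous center):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
    for _ in range(iters):
        # assignment step: each point goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels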
The characteristics of k -Means are:
  1. Pros:
    1. Easy implementation.
    2. Fast, even for large datasets.
  2. Cons:
    1. Can converge to a local minimum.
    2. Needs the number of clusters, K , as input.
    3. Hard cluster assignments: meaning that each point corresponds to a single cluster, which may not always be what we want.
    4. Sensitive to outliers and clusters of different sizes and densities.
    5. Sensitive to initialization, so it is usual to run it many times and keep the best run.
    6. Biased towards rounded clusters, because it uses the Euclidean distance.

3.2 k-Means++

k -Means++ is a variant of k -Means that uses a heuristic for initializing cluster centers as in Algorithm 4.
choose the first center mu_1 uniformly at random from all available examples

for k = 2, ..., K do
	choose the next center mu_k at random, with probability proportional to ||x_i - mu_l||^2
	here, mu_l is the closest center already picked, so points far from all current centers are more likely to be chosen

proceed with standard k-Means
Algorithm 4: k-Means++.

3.3 Choosing the number of clusters K

The number of clusters is a hyper-parameter that has to be set by the user, and there is no obvious way to choose an optimal K , since it may not even exist. This is due to the fact that there is no such thing as a true clustering against to compare. Nevertheless, there are reasonable cluster quality criteria, that can be used to select K . These criteria measure a balance between separation of clusters and their compactness. Depending on the problem, one criterion or another should be chosen.

3.3.1 Calinski-Harabasz index

The CH index uses Euclidean distances to measure cluster quality, so it is usually used with k -means. It measures the ratio between the between-cluster separation (how far the cluster centers are from the global mean) and the within-cluster compactness (how far the points are from their cluster center).
Thus, it is: CH = ( N-K ) / ( K-1 ) · [ ∑ k=1 K n k ‖ μ k - x ¯ ‖ 2 ] / [ ∑ k=1 K ∑ i=1 n r ik ‖ x i - μ k ‖ 2 ] , where x ¯ = ( 1/N ) ∑ i x i . Notice that the quantities are normalized by the factor ( N-K ) / ( K-1 ) to avoid larger K automatically having better values.
The usual approach is to run k -Means with different values of K , and then select the K that maximizes the index.
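A minimal sketch of the CH index for a clustering given by centers and hard labels (for example, the output of the k-Means sketch above):

import numpy as np

def calinski_harabasz(X, centers, labels):
    N, K = len(X), len(centers)
    x_bar = X.mean(axis=0)
    between = sum(np.sum(labels == k) * np.sum((centers[k] - x_bar) ** 2) for k in range(K))
    within = sum(np.sum((X[labels == k] - centers[k]) ** 2) for k in range(K))
    return (N - K) / (K - 1) * between / within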
Example 3.1. A k -Means example in Matlab.
image: 11_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___Notes_source_scripts_kmeans_images_figure_0.png
image: 12_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___Notes_source_scripts_kmeans_images_figure_1.png
image: 13_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___Notes_source_scripts_kmeans_images_figure_2.png
image: 14_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___Notes_source_scripts_kmeans_images_figure_3.png

3.4 Gaussian Mixtures

A mixture of Gaussians is a distribution built as a convex combination of Gaussians, making it more flexible than a single Gaussian distribution. If the components of the mixture are N ( μ k , Σ k ) ,k=1,...,K and π k ,k=1,...,K are the mixing coefficients, with 0 ≤ π k ≤ 1, ∑ k π k =1 , then the density function of the mixture is given by p( x| θ ) = ∑ k=1 K π k N ( x; μ k , Σ k ) , where θ = { π k , μ k , Σ k } k=1,...,K represents the parameters of the distribution.
The key assumption is that each data point has been generated from only one component, we just don't know which one.
In Figure 9, we can see different Gaussian Mixtures that arise from the same components, choosing different mixing coefficients.
image: 15_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado15.png
Figure 9: Different Gaussian mixtures with the same components.

3.4.1 Clustering with a Gaussian mixture

The idea is to identify each component with a cluster, so we want to determine the component from which each point in the dataset is more likely to have been created, as well as the parameters of each of the distributions.
Thus, to cluster a dataset D={ x 1 ,..., x n } into K clusters, the approach is the following:
  1. Use Expectation-Maximization (EM) to estimate the mixture, obtaining approximations π ˆ k , μ ˆ k and Σ ˆ k for each k=1,...,K .
  2. Find assignments for each x i to the clusters.
In this case, the clustering is soft, in opposition to the hard clustering of k -Means. This means that we will obtain the probability for each point belonging to each cluster.

3.4.2 A generative mixture of Gaussians

To sample from a mixture of Gaussians, we use a generative model that uses a latent variable z=( z 1 ,..., z K ) whose components are all 0, except one which denotes the component from which we sample, and we do:
  1. Pick component k with probability π k . This means that we set z k =1 with probability π k .
  2. Generate a sample x according to N ( μ k , Σ k ) .
The probability of generating a sample x using this generative model is p( x ) = ∑ z p( x,z ) = ∑ k p( x, z k =1 ) = ∑ k p( x| z k =1 ) p( z k =1 ) = ∑ k π k N ( x; μ k , Σ k ) , the joint distribution of x and z is given by p( x,z ) = ∏ k π k z k N ( x; μ k , Σ k ) z k , and the marginal distribution over x is p( x ) = ∑ k π k N ( x; μ k , Σ k ) = ∑ z p( x,z ) = ∑ z ∏ k' π k' z k' N ( x; μ k' , Σ k' ) z k' . Therefore, we can use Bayes to compute the conditional distribution of z given x : p( z k =1|x ) = p( x| z k =1 ) p( z k =1 ) p( x ) = p( x| z k =1 ) p( z k =1 ) ∑ k' π k' N ( x; μ k' , Σ k' ) = π k N ( x; μ k , Σ k ) ∑ k' π k' N ( x; μ k' , Σ k' ) = γ k ( x ) . The quantity γ k ( x ) indicates how probable it is that a particular data point x has been generated by the mixture component k . Or, in the context of clustering: how probable it is that x belongs to cluster k . We use these quantities as the soft membership to each cluster. If a hard membership is needed, then we assign x to cluster j , where j= arg max k' γ k' ( x ) .
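A minimal sketch of the responsibilities γ_k(x), using scipy.stats.multivariate_normal for the component densities; pis, mus and Sigmas are assumed to hold the mixture parameters:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pis, mus, Sigmas):
    # gamma_k(x) = pi_k N(x; mu_k, Sigma_k) / sum_k' pi_k' N(x; mu_k', Sigma_k')
    dens = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                     for pi, mu, S in zip(pis, mus, Sigmas)])
    return dens / dens.sum()

# hard assignment, if needed: cluster = responsibilities(x, pis, mus, Sigmas).argmax()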

3.4.3 Learning Gaussian mixtures with Expectation-Maximization

We have a dataset of unlabelled observations D={ x 1 ,..., x n } and we want to model it as a Gaussian mixture, with unknown parameters θ = { π k , μ k , Σ k } k=1,...,K , with a fixed K . For this we use the maximum likelihood approach. First, we compute the log-likelihood of θ : l( θ ) = log L ( θ ) = log ∏ i=1 n p( x i ; θ ) = log ∏ i ∑ k π k N ( x i ; μ k , Σ k ) = ∑ i log ∑ k π k N ( x i ; μ k , Σ k ) . This is hard to optimize... so we use the Expectation-Maximization approach. First, we can differentiate it to see what conditions must hold for local maxima, with ∂ μ k l( θ ) =0 leading to μ ˆ k = ∑ i γ k ( x i ) x i / ∑ i γ k ( x i ) = ∑ i p( z k =1| x i ) x i / ∑ i p( z k =1| x i ) , which is a weighted average of the points in our data, with weights being the soft assignments of each point to cluster k .
The problem now is that we cannot know γ k ( x ) without μ k , Σ k and π k . Now, ∂ Σ k l( θ ) =0 gives us Σ ˆ k = ∑ i γ k ( x i ) ( x i - μ ˆ k ) ( x i - μ ˆ k ) T / ∑ i γ k ( x i ) = ∑ i p( z k =1| x i ) ( x i - μ ˆ k ) ( x i - μ ˆ k ) T / ∑ i p( z k =1| x i ) , which is the sample covariance matrix of all x i , weighted by the soft assignments of each point to cluster k . We have the same problem!
Since we have the constraint k π k =1 , we now maximize the Lagrangian L =l( θ ) - λ ( k π k -1 ) , obtaining π ˆ k = 1 n i γ k ( x i ) , which is the average of all soft assignments for each point x . Again the same problem is present.
Therefore, we are in a situation in which we can estimate π k , Σ k and μ k if we know γ k , and we can compute γ k from the estimates π ˆ k , Σ ˆ k and μ ˆ k ; we can use this to our benefit by alternating both steps, following the pseudocode depicted in Algorithm 5. Commonly, the initializations are done using the result of k -Means in the following manner: the means μ ˆ k are initialized to the k -Means centers, each Σ ˆ k to the sample covariance of the points assigned to cluster k , and each π ˆ k to the fraction of points assigned to cluster k .
Algorithm 5: EM algorithm.
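Since the pseudocode body is not reproduced here, the following is a minimal sketch of the EM iterations just derived (random initialization rather than the k-Means-based one, and a small diagonal term added to keep the covariances well-conditioned):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(n, size=K, replace=False)]                  # initial means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities gamma[i, k] = p(z_k = 1 | x_i)
        dens = np.stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of the parameters
        Nk = gamma.sum(axis=0)
        pis = Nk / n
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas, gamma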
Example 3.2. An example of the EM algorithm using Matlab
image: 16_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___s_source_scripts_EM_example_images_figure_0.png
image: 17_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___s_source_scripts_EM_example_images_figure_1.png

3.4.4 Special cases

The shape of the Gaussians can be restricted, obtaining special cases of mixtures: for instance, forcing all components to share a common covariance matrix, restricting each covariance matrix to be diagonal, or restricting it to be isotropic ( Σ k = σ k 2 I ), which reduces the number of parameters to estimate.

4 Linear Classifiers

In classification, we have a labelled dataset, D={ ( x 1 , y 1 ) ,...,( x n , y n ) } where x i ∈ R d and y i ∈ Y ={ l 1 ,..., l K } are labels. Thus, the tuple ( x i , y i ) means that the vector x i is associated to the label y i . With this setup, we aim at producing a classification model, meaning a function that, given a new input x' , predicts the label y' that it should be associated with. Usually, we distinguish between two kinds of classification, attending to the number of labels considered: binary classification, when there are only two possible labels ( K=2 ), and multi-class classification, when K>2 .
Now, we introduce some useful terms:
Definition 4.1. The decision regions are a partition { P 1 ,..., P K } of the feature space R d , such that P i ∩ P j = ∅ for i≠j and ⋃ j=1 K P j = R d . Intuitively, they are the regions in which we divide the feature space, so that all elements inside a region have the same label.
The decision boundaries are the points in the frontier between decision regions.
A classifier can then be understood as the creation of the decision regions.
A linear classifier is a classifier in which the decision boundaries are ( d-1 )-dimensional hyperplanes.
Example 4.1. A visualization.
In Figure 10, we can see different classification models and the regions they generate for the iris dataset. The left model corresponds to a linear classifier, and we can see how the decision boundaries are linear. The other two models are not linear.
image: 18_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado16.png image: 19_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado17.png image: 20_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado18.png
Figure 10: Different classifiers. Linear (left). Quadratic (middle). knn (right).

Many useful classifiers don't just predict the input's class, but also the probabilities of the input belonging to each class. This is desirable, as it enables us to express uncertainty about the prediction that we make. For example, in binary classification it is a common approach to map the target labels to be 0 or 1, and then make predictions as a continuous value in [ 0,1 ] . More explicitly, we can have labels Y ={ 'sick','healthy' } and encode 'sick'=1 , 'healthy'=0 , so that given a patient x' , we obtain a prediction y'[ 0,1 ] , indicating the probability of the patient being sick.
In classification with K>2 classes, it is common to encode the labels using one-hot encoding, meaning that we map the labels into the set { 0,1 } K . For example, if we have three labels, then they are encoded as { ( 1,0,0 ) ,( 0,1,0 ) ,( 0,0,1 ) } . When in this scenario, predictions are usually points in the ( K-1 ) -simplex, i.e., a prediction y'=( y ' 1 ,...,y ' K ) must be 0y ' k 1,k and k y ' k =1 . Each y ' k represents the probability of the input belonging to class k .

4.1 Decision boundary in probabilistic models

From a probabilistic perspective, we can think of the joint probability of examples and labels, p( x,y ) . When we are building a classifier, we then want to minimize the expected loss, or expected error. Note, nonetheless, that loss here is a bit different from the regression loss. A natural way to think about this loss is through loss or cost matrices. A cost matrix is a table as the following:
                 Real
                 y = l_1     y = l_2     ...     y = l_K
Predicted
  y' = l_1          0        c_{1,2}     ...     c_{1,K}
  y' = l_2       c_{2,1}        0        ...     c_{2,K}
  ...
  y' = l_K       c_{K,1}     c_{K,2}     ...        0
This matrix indicates the cost of each error. Note that not all errors need to have the same cost. For example, in a medical context, it has a higher cost to predict that a sick patient is healthy (this person could potentially die), than to predict that a healthy person is sick (in which case further tests would probably correct the mistake). A cost matrix for this case could look like the following:
                    Real
                    y = healthy     y = sick
Predicted
  y' = healthy           0             60
  y' = sick             10              0
We will focus on the case in which all errors have the same impact, which is called the 0-1 loss.
Let's now see how, given a new example x , we can use a 'rule' to choose the label that x should have. We have random variables X and Y with joint distribution p( X,Y ) . We can compute the expected loss of assigning the label c ∈ Y to x . For this, we first define the loss function as the 0-1 loss: L( a,b ) = 0 if a=b , and 1 otherwise. Therefore: E Y [ L( Y,c ) ] = ∑ y∈Y L( y,c ) p( Y=y|x ) = ∑ y≠c p( Y=y|x ) = p( Y≠c|x ) = 1-p( Y=c|x ) . Of course, we want to minimize the expected loss, so we aim at predicting the class y' minimizing E Y [ L( Y,y' ) ] : y'= arg min y E Y [ L( Y,y ) ] = arg min y { 1-p( y|x ) } = arg max y p( y|x ) . This is called the Bayes classifier, and it is optimal when we use the 0-1 loss. Its error is given by the so-called Bayes error rate: BER=1- E X [ p( y'|x ) ] =1- ∫ x p( y'|x ) p( x ) dx =1- ∫ x p( x|y' ) p( y' ) dx . Of course, we can use this classifier to partition the feature space into regions R c ,c ∈ Y , and we can compute the BER summing over all regions: BER=1- ∑ c ∫ R c p( x|c ) p( c ) dx . Before, we claimed that the Bayes classifier is optimal. However, in practice we don't know the distribution p( y,x ) , so it cannot be implemented exactly. Therefore, p( y|x ) is estimated from data, and these estimates are used for classification, incurring additional errors. To learn p( y|x ) , there are two basic approaches, namely discriminative classifiers and generative classifiers.

4.2 Generative classifiers

Generative classifiers learn p( y|x ) through the Bayes rule.

4.2.1 Discriminant analysis

Discriminant analysis is the result of implementing a Bayes classifier assuming that the class-conditional distributions p( x|y ) are Gaussian. This means that, having Y ={ c 1 ,..., c K } , then it is p( x|y= c k ) ∼ N ( μ k , Σ k ) . If we also assume that the prior distributions are p( y= c k ) = π k , with ∑ k π k =1 , then we define the discriminant functions g k ( x ) = log ( P( y= c k ) P( x|y= c k ) ) = log ( π k 1 det ( Σ k ) 1 2 ( 2 π ) d 2 exp { - 1 2 ( x- μ k ) T Σ k -1 ( x- μ k ) } ) = log π k - 1 2 log ( det Σ k ) - d 2 log ( 2 π ) - 1 2 ( x- μ k ) T Σ k -1 ( x- μ k ) = log π k - 1 2 [ log ( det Σ k ) + ( x- μ k ) T Σ k -1 ( x- μ k ) ] +const. This is a quadratic discriminant function, and the corresponding classifier is implemented by predicting y'= c k' , where k'= arg max k g k ( x ) . This corresponds to choosing the label with maximum probability a posteriori.
The decision boundaries in this case are those regions in which there exist k 1 , k 2 with g k 1 ( x ) = g k 2 ( x ) .
These correspond to hyper-quadrics in the feature space, and this is a quadratic method, usually called quadratic discriminant analysis (QDA).
Of course, we can further simplify our assumptions, by assuming that all labels have the same covariance matrix, Σ k = Σ for all k=1,...,K . In this simpler case, the discriminant functions end up being g k ( x ) = log π k + μ k T Σ -1 x- 1 2 μ k T Σ -1 μ k , because now det Σ k = det Σ is constant for all k , so we can remove it. Furthermore, the term x T Σ k -1 x= x T Σ -1 x is also constant with respect to k , so it will not affect the k chosen. Therefore, we end up with linear discriminant functions, in which the decision boundaries correspond to hyperplanes in the feature space. This is a linear method, usually called linear discriminant analysis (LDA).
In Figure 10, the left diagram corresponds to a LDA partitioning of the feature space for the iris dataset, while the one in the center is a QDA partitioning.
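A minimal sketch (not from the course materials) that evaluates the quadratic discriminant functions g_k(x) for one point and returns the index of the predicted class; the common term -d/2 log(2π) is dropped since it does not affect the argmax:

import numpy as np

def qda_predict(x, pis, mus, Sigmas):
    # g_k(x) = log pi_k - 1/2 log det Sigma_k - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k)
    scores = []
    for pi, mu, S in zip(pis, mus, Sigmas):
        diff = x - mu
        scores.append(np.log(pi) - 0.5 * np.linalg.slogdet(S)[1]
                      - 0.5 * diff @ np.linalg.solve(S, diff))
    return int(np.argmax(scores))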

Further assumptions

Of course, we can make more simplifying assumptions for our model, such as restricting the shared covariance matrix to be diagonal (features independent within each class) or even isotropic, Σ = σ 2 I , which further reduces the number of parameters to estimate.

Distance-based learning perspective

In all the cases seen, we have a minimum-distance classifier in R d : with equal priors, x is assigned to the class whose center μ k is closest, where 'closest' is measured with the Euclidean distance when Σ = σ 2 I and with the Mahalanobis distance induced by Σ when a full shared covariance matrix is used.

Implementation

It is usual to use MLE and estimate the centers and covariance matrices using the training dataset. If we define the sets S k ={ x i | y i = c k ,( x i , y i ) ∈ D } and n k =card( S k ) , then the estimates are π ˆ k = n k / n , μ ˆ k = ( 1 / n k ) ∑ x∈ S k x, and the covariance matrices are Σ ˆ k = ( 1 / n k ) ∑ x∈ S k ( x- μ ˆ k ) ( x- μ ˆ k ) T for QDA, or the pooled estimate Σ ˆ = ( 1 / n ) ∑ k ∑ x∈ S k ( x- μ ˆ k ) ( x- μ ˆ k ) T when a common covariance matrix is assumed (LDA).

Final remarks on discriminant analysis

Bayesian classifiers are optimal when the class-conditional densities and priors are known. This means that, if we know the underlying distributions, then these classifiers are our best choice. Of course, this is not realistic, and estimations need to be made. However, normal distributions appear in a wide range of real scenarios, and even if we have to estimate the centers and covariance matrices, QDA and LDA are a very good choice when data resembles Gaussian distributions. In addition, they are well-principled, having a solid mathematical theory behind them, and they are fast and reliable.
Of course, if the real distribution is very far from being a Gaussian, then the model obtained will be poor, so one should take care of this.
Also, it is important to ensure that we are correctly estimating the parameters of the Gaussians, because otherwise the model will not work, not even with underlying Gaussian data.
And it is clear that once we are relying on sample estimates instead of population parameters, we lose the optimality of the method.
In practice, it is really hard to assess which assumptions hold and which ones do not, so we may be limited to a trial-and-error approach.

4.2.2 Regularized discriminant analysis (RDA)

When data is scarce, some problems can arise while using discriminant analysis: for example, the estimated covariance matrices Σ ˆ k may be singular or badly conditioned when the number of examples of a class is small compared to the dimension d , so they cannot be inverted reliably.
RDA computes the covariance matrices as Σ ˆ k ( α ) = α Σ ˆ k +( 1- α ) Σ ˆ , where α [ 0,1 ] is the regularization parameter. The method is QDA when α =1 and is LDA when α =0 . In any other case, it is something in between.
A further way to regularize the matrices is by Σ ˆ k ( α , γ ) =( 1- γ ) Σ ˆ k ( α ) + γ σ ˆ 2 I, where σ ˆ 2 = Tr[ Σ ˆ k ( α ) ] d , and the diagonal term improves the well-conditioning of the method.

4.3 Naïve Bayes

The Naïve Bayes Classifier is a Bayesian classifier that assumes that the features are pair-wise independent in the class-conditional distribution. This means that the probability can be written as p( x|y ) = ∏ j=1 d p( x j |y ) . This assumption does not hold in general, but this approach can provide a good approximation in many cases. Also, it is practical, as the number of parameters to estimate is small.
As before, we classify the input record x in the class c k that maximizes the discriminant function: g k ( x ) = log π k + ∑ j=1 d log p( x j |y= c k ) . Therefore, we need to
  1. Estimate the class priors as the sample frequency, π k = n k n .
  2. Estimate the class-conditional densities for each input feature independently.
Naïve Bayes can also be used in the case of categorical variables. To model binary features, we can use, for example, the Bernoulli distribution P( x|p ) = p x ( 1-p ) 1-x , where x ∈ { 0,1 } and p ∈ [ 0,1 ] is the probability of the event happening. For a binary feature, we would need to estimate K parameters, one for each class, so that P( x|y= c k ) = p k x ( 1- p k ) 1-x . If we had all our features as binary, then the discriminant functions would be g k ( x ) = log π k + ∑ j=1 d log P( x j |y= c k ) = log π k + ∑ j=1 d [ x j log p k,j +( 1- x j ) log ( 1- p k,j ) ] , where p k,j is the Bernoulli parameter for the j th feature and the k th class. Note that this is a linear function with respect to x .
If instead we have categorical features with more values, we can then use the Categorical distribution g k ( x ) = log π k + j=1 d v [ x j =v ] log p k,j,v , where [ expr ] is 1 if expr is True and 0 otherwise, and p k,j,v is the Categorical parameter for the value v of the j th feature and the k th class.
Now, we need to estimate these parameters, for which we can use the sample frequencies. Note how 0-frequencies can be a problem, so it is a common approach to utilize Laplace smoothing: p ˆ ( v|y= c k ) = ( n k,v +p ) / ( n k +pV ) . Here, n k,v is the number of training examples of class c k whose feature takes the value v , n k is the number of training examples of class c k , V is the number of possible values of the feature, and p>0 is the smoothing parameter ( p=1 gives the classical Laplace correction).
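A minimal sketch of these smoothed estimates for a single categorical feature (feature and y are assumed to be arrays of the same length, classes and values lists of the possible labels and feature values):

import numpy as np

def laplace_estimates(feature, y, classes, values, p=1.0):
    # return {(class, value): (n_{k,v} + p) / (n_k + p * V)} for one categorical feature
    V = len(values)
    probs = {}
    for c in classes:
        mask = (y == c)
        n_k = mask.sum()
        for v in values:
            n_kv = np.sum(feature[mask] == v)
            probs[(c, v)] = (n_kv + p) / (n_k + p * V)
    return probs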
Example 4.2. Naïve Bayes in MATLAB

4.3.1 Gaussian Naïve Bayes

If we have numerical features, the usual approach is to assume they follow Gaussian distributions; we then estimate their mean and variance using MLE. If all features are assumed Gaussian, this approach is equivalent to QDA with diagonal covariance matrices.
Other approaches are to discretize the numerical features and treat them as categorical, or to use a non-parametric density estimate (such as kernel density estimation) for each feature.
Note that when the data is mixed, we can assume a different distribution for each feature, and then add the log-likelihoods together.

4.4 Perceptron and Logistic Regression

The perceptron is a mathematical model of the functioning of a neuron in our brain. In Figure 11 we can see a diagram of a neuron. Other neurons transmit signals into our neuron via the dendrites, then 'something' happens inside the neuron, and it may or may not transmit a signal through its axon towards the neurons to which it is connected.
image: 21_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_neuron.png
Figure 11: A neuron.
This behavior was mathematically modeled by Rosenblatt in his pioneering paper [4]. He did this with the concept of the perceptron: the inputs x 1 ,..., x d are combined into a weighted sum S= W T X , and an activation function F (originally a threshold, or sign, function) is applied to S to produce the output.
This is depicted in Figure 12.
image: 22_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___rning_LectureNotes_source_perceptron_drawio.png Figure 12: A simple perceptron.
This way, we can classify the input variables X R d into two classes, 1 or -1. But... what are the weights?
Example 4.3. Imagine we have the following classification function: class( x ) = 1 if x>0.5 , and -1 otherwise. Then it is easy to construct a perceptron that can classify the instances. We can take as inputs the constant 1 and x , with weights θ 0 =-0.5 and θ 1 =1 , and the sign function as activation: W T X = x-0.5 is positive exactly when x>0.5 .
And we are good!
In this previous example we can get some intuition on how the weights are chosen, but doing this by hand is not scalable at all. When we have several input variables, the process becomes much more complicated. Luckily, the perceptron algorithm helps us estimate the weights.
This algorithm is an on-line algorithm, meaning that it processes one training example at a time, updating W incrementally. The training examples are pairs ( X,y ) , where X ∈ R d and y ∈ { -1,1 } . Also, we can scale all X to lie in the unit sphere, because this also scales the hyperplanes of classification and the classes remain the same. The algorithm goes as follows:
init W = 0

for epoch in range(e):
	for each training pair (X, y):

		compute prediction = F(S(W^T X))   # weighted sum S, then sign activation F

		if prediction != y:
			# update W towards classifying X correctly
			W = W + lr * X * y
return W
Algorithm 6: Perceptron algorithm (input X, classes y, learning rate lr, epochs e) -> weights W
The updates are made like that because we can consider the error function E( W ) =- W T Xy only when X is misclassified. In that case, W T Xy<0 and thus the minus sign. We want to minimize this error, for which we follow a gradient descent approach (we will go deeper into this later), so we update W as W'=W- η ∇E=W+ η Xy. The basic idea is that if we now use these new weights on the same input, we would get W ' T X= ( W+ η Xy ) T X= W T X+ η y X T X. Now, say y=1 , then it is W ' T X= W T X+ η ‖ X ‖ 2 > W T X, so we have moved the input closer to being classified as positive, and thus correctly classified. If y=-1 the same logic applies.
In fact, under some conditions, convergence is ensured:
Theorem 4.1. Perceptron convergence theorem
For any finite set of linearly separable labeled examples, the Perceptron Learning Algorithm will halt after a finite number of iterations. In other words, after a finite number of iterations, the algorithm yields a vector W that classifies perfectly all the examples.
The proof can be read in https://web.mit.edu/course/other/i2course/www/vision_and_learning/perceptron_notes.pdf.
Example 4.4. A simple perceptron in MATLAB
image: 23_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M____scripts_perceptron_example_images_figure_0.png
image: 24_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M____scripts_perceptron_example_images_figure_1.png
image: 25_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M____scripts_perceptron_example_images_figure_2.png
image: 26_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M____scripts_perceptron_example_images_figure_3.png
We can improve the approach by choosing a different activation function, ideally one that is differentiable. A usual alternative is the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$, which maps $\mathbb{R} \to (0,1)$, so that the output can be interpreted as a probability, allowing us to represent uncertainty in the prediction. This function has the following properties:
When we use the logistic function, we are performing what is called logistic regression for binary classification. Again, we have a dataset of pairs $(X_i, y_i)$ where $X_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$, associating label $y_i = 1$ with a positive example, and $y_i = 0$ with a negative example. In logistic regression we model $P(y = 1 \mid x) = \sigma(W^T X)$, where, again, $W$ is the weight vector. Therefore, we are modelling $y_i \sim \mathrm{Ber}(p_i)$, where $p_i = \sigma(W^T X_i)$. Notice that we don't assume anything about the distribution of $X$, though.

Why is logistic regression a linear classifier?

When we classify, we need to set a threshold above which we predict label 1. Usually, this threshold is set to 0.5. Now, $\sigma(z) > 0.5 \iff \frac{1}{1 + e^{-z}} > 0.5 \iff 1 > 0.5(1 + e^{-z}) \iff 2 > 1 + e^{-z} \iff 1 > e^{-z} \iff 0 > -z \iff z > 0$. And remember that $z = W^T X$, so, in fact, we are classifying using the same criterion as before.
At this point, we can try to use maximum likelihood in the search for an optimal $W$: $\mathcal{L}(W) = \prod_{i=1}^n P(y_i \mid X_i, W) = \prod_{i=1}^n \mathrm{Ber}(y_i \mid \hat{y}_i = \sigma(W^T X_i)) = \prod_{i=1}^n \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i} = \prod_{i=1}^n \sigma(W^T X_i)^{y_i} (1 - \sigma(W^T X_i))^{1 - y_i}$. As usual, we take the log-likelihood, but in this case we minimize the negative log-likelihood instead, which can be interpreted as an error function: $E(W) = -\log \mathcal{L}(W) = -\sum_{i=1}^n \log P(y_i \mid X_i, W) = -\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$. This error expression is known as log loss or binary cross-entropy, and it is widely used in classification schemes.
Now, as we aim to find its minimum, we compute its gradient. For this, recall that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and notice that $\frac{\partial W^T X_i}{\partial W} = X_i$. Recall also the chain rule: $(g \circ f)'(x) = [g(f(x))]' = g'(f(x)) f'(x)$. Now, let's go for this derivative:
$\nabla E(W) = \frac{\partial E(W)}{\partial W} = -\sum_{i=1}^n \left[ y_i \frac{\partial \log \sigma(W^T X_i)}{\partial W} + (1 - y_i) \frac{\partial \log(1 - \sigma(W^T X_i))}{\partial W} \right] = -\sum_{i=1}^n \left[ \frac{y_i}{\sigma(W^T X_i)} \sigma(W^T X_i)(1 - \sigma(W^T X_i)) X_i - \frac{1 - y_i}{1 - \sigma(W^T X_i)} \sigma(W^T X_i)(1 - \sigma(W^T X_i)) X_i \right] = \sum_{i=1}^n \left[ (1 - y_i)\sigma(W^T X_i) - y_i(1 - \sigma(W^T X_i)) \right] X_i = \sum_{i=1}^n \left[ \sigma(W^T X_i) - y_i \right] X_i = \sum_{i=1}^n \left[ \hat{y}_i - y_i \right] X_i$. In this case, it is not possible to find a closed-form solution, so we use iterative methods to approximate local minima. We are going to use gradient descent, which is one of the simplest approaches, but widely used.

4.4.1 Gradient descent

Gradient descent is a general numerical method to find local minima of a differentiable function F( Z ) . The idea is to use the fact that the gradient of a vector function points in the direction of maximum growth, and the negative gradient points in the direction of maximum decrease. Therefore, if we 'follow' this direction, we should approach a minimum of the function, although it might not be the global minimum.
The approach works as follows. To approximate a local minimum of F( W ) :
  1. We start at a weight vector W 0 and set k=0 .
  2. Compute F( W k ) .
  3. Update the weight vector W k+1 = W k - γ F( W k ) . Also update k=k+1 . The parameter γ is called learning rate, and it quantifies 'how much' we advance in the direction of the gradient. It has to be chosen beforehand and it is critical, since:
    1. If γ is too big, we might jump over the minimum and the method might never converge.
    2. If γ is too small, the method might converge too slowly.
    Related to the learning rate is the importance of feature scaling: the same learning rate affects all features, so if they have very different scales, a step size that works well for some of them may work poorly for others.
  4. Repeat until convergence or maximum number of steps is reached.
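As an illustration, a minimal MATLAB sketch of gradient descent applied to the logistic-regression error defined above (the fixed learning rate and iteration count are illustrative choices):

function W = logreg_gd(X, y, gamma, iters)
% X: n-by-d data matrix (rows are examples), y: n-by-1 labels in {0,1}
% gamma: learning rate, iters: number of gradient steps
    sigmoid = @(z) 1 ./ (1 + exp(-z));
    [n, d] = size(X);
    Xb = [ones(n,1) X];                  % bias column
    W  = zeros(d+1, 1);
    for k = 1:iters
        yhat = sigmoid(Xb * W);          % predicted probabilities
        grad = Xb' * (yhat - y);         % gradient of the cross-entropy error
        W = W - gamma * grad;            % gradient descent step
    end
end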

4.4.2 Newton's algorithm

Newton's algorithm
7
There are infinitely many "Newton's algorithms", haha; this is one of them :D
is an algorithm used to find roots of a function; it converges faster than gradient descent and does not need a learning rate parameter. As this algorithm finds roots (i.e., zeros), we don't apply it directly to $F$, but rather to its gradient. Thus, in this case, we need $F$ to be twice differentiable, and we must be able to afford computing its second derivatives, which can sometimes be costly.
In this case, the algorithm works as follows:
  1. We start at a weight vector W 0 and set k=0 .
  2. Compute the Hessian of $F$: $H(F(W)) = \nabla^2 F(W) = \begin{pmatrix} \frac{\partial^2 F(W)}{\partial W_1^2} & \frac{\partial^2 F(W)}{\partial W_1 \partial W_2} & \cdots & \frac{\partial^2 F(W)}{\partial W_1 \partial W_d} \\ \frac{\partial^2 F(W)}{\partial W_2 \partial W_1} & \frac{\partial^2 F(W)}{\partial W_2^2} & \cdots & \frac{\partial^2 F(W)}{\partial W_2 \partial W_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 F(W)}{\partial W_d \partial W_1} & \frac{\partial^2 F(W)}{\partial W_d \partial W_2} & \cdots & \frac{\partial^2 F(W)}{\partial W_d^2} \end{pmatrix}$.
  3. Update the weight vector W k+1 = W k -H ( F( W k ) ) -1 F( W k ) , and set k=k+1 .
  4. Repeat until convergence or maximum number of steps is reached.
Remark 4.1. Notice that:

Iterated Reweighted Least Squares (IRLS)

When we apply Newton's method to the negative log-likelihood of logistic regression, the resulting procedure is called the IRLS method. The gradient is $\nabla E(W) = \sum_{i=1}^n (\hat{y}_i - y_i) X_i = X^T(\sigma(XW) - y)$ and the Hessian is $H(E(W)) = \sum_{i=1}^n \hat{y}_i(1 - \hat{y}_i) X_i X_i^T = X^T \mathrm{diag}(\hat{y}_i(1 - \hat{y}_i)) X$, where here $X$ denotes the design matrix whose rows are the $X_i$.
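A minimal MATLAB sketch of the corresponding Newton/IRLS update (the small ridge term added to the Hessian is only an illustrative safeguard against singular matrices):

function W = logreg_irls(X, y, iters)
% X: n-by-d data matrix (rows are examples), y: n-by-1 labels in {0,1}
    sigmoid = @(z) 1 ./ (1 + exp(-z));
    [n, d] = size(X);
    Xb = [ones(n,1) X];
    W  = zeros(d+1, 1);
    for k = 1:iters
        yhat = sigmoid(Xb * W);
        R = diag(yhat .* (1 - yhat));            % diag(yhat_i (1 - yhat_i))
        grad = Xb' * (yhat - y);                 % gradient of the error
        H = Xb' * R * Xb + 1e-8 * eye(d+1);      % Hessian (+ tiny ridge for stability)
        W = W - H \ grad;                        % Newton step
    end
end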
Example 4.5. Logistic regression example using MATLAB
image: 27_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___logistic_regression_example_images_figure_0.png
image: 28_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___logistic_regression_example_images_figure_1.png
image: 29_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___logistic_regression_example_images_figure_2.png
image: 30_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___logistic_regression_example_images_figure_3.png

4.4.3 Multi-class logistic regression

The multi-class case, $K > 2$, is handled by having a separator $W^{(k)}$ for each class $k = 1, \ldots, K$. In this case, instead of a Bernoulli distribution for the targets $y_i$, we use a categorical distribution. Each target $y_i$ is represented with its one-hot encoding.
In this case, the likelihood function is $\mathcal{L}(W^{(1)}, \ldots, W^{(K)}) = \prod_{i=1}^n \prod_{k=1}^K \hat{y}_{ik}^{y_{ik}}$, and the cross-entropy loss is $E(W^{(1)}, \ldots, W^{(K)}) = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{y}_{ik}$. As in the binary case, we optimize this by using gradient descent or Newton's method.

4.4.4 Regularization

Finally, note that we can add regularization to the weights, in the same way we did for linear regression, by adding the L1 or L2 norm of the weights to the cross-entropy loss: $E_{\text{lasso}}(W) = -\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right] + \lambda \sum_{j=1}^d |W_j|$, and $E_{\text{ridge}}(W) = -\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right] + \lambda \sum_{j=1}^d W_j^2$. Note that this is for the binary case, but the general case is similar.

5 Nearest Neighbor Prediction

The nearest neighbor predictor uses the local neighborhood of an input point to compute a prediction. A well-known family of this kind is the kNN models, which predict the output of the input point by combining the known outputs of the $k$ nearest training data points, for example by voting if we are classifying.
The approach of these methods is quite straightforward, and is detailed in Algorithm 7.
Training
	Store all training examples

Prediction
	Given an input X:
		1. Compute distance/similarity with all examples in the training set.
		2. Locate the k closest points.
		3. Emit prediction by combining outputs of the k closest points.
Algorithm 7: kNN Pseudocode.
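A minimal MATLAB sketch of Algorithm 7 for classification, using Euclidean distance and majority vote (function and variable names are illustrative; it assumes numeric labels and implicit expansion, i.e. R2016b or later):

function pred = knn_predict(Xtrain, ytrain, Xtest, k)
% Xtrain: n-by-d, ytrain: n-by-1 numeric class labels, Xtest: m-by-d
    m = size(Xtest, 1);
    pred = zeros(m, 1);
    for i = 1:m
        d2 = sum((Xtrain - Xtest(i,:)).^2, 2);   % squared Euclidean distances to all training points
        [~, idx] = sort(d2);                     % locate the k closest points
        pred(i) = mode(ytrain(idx(1:k)));        % majority vote among their labels
    end
end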
Of course, we have to decide:

5.1 Locality: similarities and distances

It is important to decide what distance or similarity between points means in our feature space, since this will determine the neighborhoods of the points.
Definition 5.1. A distance function is a function $d: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, verifying:
  1. Non-negativity: $d(a,b) \geq 0, \ \forall a, b \in \mathbb{R}^d$.
  2. Zero distance means equality: $d(a,b) = 0 \iff a = b, \ \forall a, b \in \mathbb{R}^d$.
  3. Symmetry: $d(a,b) = d(b,a), \ \forall a, b \in \mathbb{R}^d$.
  4. Triangle inequality: $d(a,b) \leq d(a,c) + d(c,b), \ \forall a, b, c \in \mathbb{R}^d$.
A similarity function is a function $s: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, verifying:
  1. Ranged: $s(a,b) \in [-1,1]$ or $s(a,b) \in [0,1]$, $\forall a, b \in \mathbb{R}^d$.
  2. Complete similarity means equality: $s(a,b) = 1 \iff a = b, \ \forall a, b \in \mathbb{R}^d$.
  3. Symmetry: $s(a,b) = s(b,a), \ \forall a, b \in \mathbb{R}^d$.
Example 5.1. Some examples of distances are:
Some examples of similarities are:

5.2 Choosing k

Nearest neighbor methods are very sensitive to the chosen value of $k$. If it is too low, it is easy to overfit; if it is too large, we can underfit. Therefore, we need to choose it thoughtfully. The right value depends on the dataset, and the typical approach is to use cross-validation, or other resampling methods, seeing $k$ as a hyperparameter that trades off bias and variance of the resulting model.
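As an illustration, a sketch of this hyperparameter search in MATLAB, assuming the Statistics and Machine Learning Toolbox (fitcknn, crossval) and the built-in fisheriris data; the grid of candidate values is arbitrary:

load fisheriris
rng(0);
ks = 1:2:15;  cvErr = zeros(size(ks));
for i = 1:numel(ks)
    mdl = fitcknn(meas, species, 'NumNeighbors', ks(i));
    cvmdl = crossval(mdl, 'KFold', 10);      % 10-fold cross-validation
    cvErr(i) = kfoldLoss(cvmdl);             % estimated misclassification rate
end
[~, best] = min(cvErr);
fprintf('Best k = %d (CV error %.3f)\n', ks(best), cvErr(best));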

5.3 How to combine outputs to make predictions

5.3.1 Classification

5.3.2 For regression

When we weight the predictions, we can relax the constraint of using only k examples, and we can even use the whole training set for making predictions, since this approach lowers the chances of underfitting.
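A sketch of distance-weighted kNN regression, with inverse-distance weights as one possible (illustrative) weighting choice:

function yhat = knn_regress_weighted(Xtrain, ytrain, xquery, k)
% Predict a continuous target as a weighted average of the k nearest outputs.
% xquery: 1-by-d row vector; setting k = size(Xtrain,1) uses the whole training set.
    d = sqrt(sum((Xtrain - xquery).^2, 2));   % Euclidean distances (implicit expansion)
    [dsort, idx] = sort(d);
    w = 1 ./ (dsort(1:k) + eps);              % inverse-distance weights (eps avoids division by zero)
    yhat = sum(w .* ytrain(idx(1:k))) / sum(w);
end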

5.4 Decision boundaries for nearest neighbors classifier

In 1-nearest neighbor, the decision regions correspond to the union of each example's Voronoi cell, labeled with the corresponding class. Given a set of points $\{p_i\}_{i \in I} \subseteq \mathbb{R}^d$, the Voronoi cell of point $p_j$ is the set of points whose nearest neighbor is $p_j$. For example, Figure 13 shows two different Voronoi diagrams of 6 points each:
image: 31_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado19.png image: 32_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado20.png
Figure 13: Voronoi diagrams. Code in MATLAB script voronoi.mlx.
The decision boundaries and regions are non-linear, but they get smoother as we increase the value of k . This effect can be observed in the following Figure:
image: 33_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado28.png image: 34_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado22.png image: 35_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado27.png
image: 36_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado26.png image: 37_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_pegado25.png
Figure 14: kNN decision regions for different values of k .

5.5 Final considerations


6 Trees and Random Forests

An ensemble method is a method that combines two or more predictors, instead of using a single prediction. Ensembles are useful when we have several models that are better than a random guess and that are independent of each other. The combination of the models can be done by averaging predictions, in regression, or by majority vote, in classification.
The main types of ensemble methods are:

6.1 Trees

A regression tree partitions the feature space into axis-parallel regions and predicts using the average target value of the training points that fall into the corresponding region. We can think of a tree in terms of nodes and edges: each node in the tree specifies a test of some attribute, and each edge descending from that node corresponds to one of the possible outcomes for the test. The leaf nodes (or terminal nodes) of the tree contain an output value which is used to make a prediction. When a new data point is presented to the tree for prediction, it is routed down the tree based on the outcome of the tests in each node, starting from the root and ending at a leaf node. The value in the leaf node is then returned as the prediction.
A classification tree also partitions the feature space into axis-parallel regions, each of them representing a class. It is NP-hard to find the optimal tree of minimum size, so greedy approaches are usually used. A general approach is explained in Algorithm 8.
1- Feature selection
	Find the feature that best splits the data into two subsets, minimizing an impurity function.

2- Binary splitting
	Decide how to split the feature. If it is categorical, split in each possible value. If it is continuous, find a value that divides it in two, minimizing the impurity.

3- Recursion
	Repeat 1- and 2- until the stopping criterion is met.

4- Pruning
	To avoid overfitting.
Algorithm 8: Training a Tree.
We are going to use the Gini impurity metric to create trees.
Let $S$ be a subset of examples of the input data $S \subseteq D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, with $x_i \in \mathbb{R}^d$ and $y_i \in Y$, with $|Y| = K \geq 2$. Let also $p_k(S)$ be the proportion of examples in $S$ that belong to class $k$. Then, the Gini impurity is computed as $\mathrm{Gini}(S) = \sum_{k=1}^K p_k(S)(1 - p_k(S)) = 1 - \sum_k (p_k(S))^2$.
Therefore, to find the node to expand the tree, we need to find the pair $(v^*, s^*)$, where $v^*$ is a feature and $s^* \in \mathbb{R}$, such that $(v^*, s^*) = \arg\min_{v,s} \left[ \frac{|S_{v \leq s}|}{|S|} \mathrm{Gini}(S_{v \leq s}) + \frac{|S_{v > s}|}{|S|} \mathrm{Gini}(S_{v > s}) \right]$.
In the case of a regression tree, the approach is similar, but instead we minimize the sum of squared errors, relative to the average target value in each induced partition. In other words, we now seek the pair $(v^*, s^*)$ such that $(v^*, s^*) = \arg\min_{v,s} \left[ \mathrm{SSE}(S_{v \leq s}) + \mathrm{SSE}(S_{v > s}) \right]$, where $\mathrm{SSE}(S) = \sum_{i \mid (x_i, y_i) \in S} (y_i - \mathrm{avg}(S))^2$.
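To make the $(v^*, s^*)$ criterion concrete, a sketch of the split search for a single continuous feature with the Gini impurity (purely illustrative, not an efficient implementation):

function [best_s, best_imp] = best_gini_split(x, y)
% x: n-by-1 values of one continuous feature, y: n-by-1 class labels
    n = numel(y);
    cand = unique(x);                      % candidate thresholds
    best_imp = inf;  best_s = cand(1);
    for s = cand'
        L = y(x <= s);  R = y(x > s);
        if isempty(L) || isempty(R), continue; end
        imp = numel(L)/n * gini_impurity(L) + numel(R)/n * gini_impurity(R);
        if imp < best_imp
            best_imp = imp;  best_s = s;   % keep the weighted-impurity minimizer
        end
    end
end

function g = gini_impurity(labels)
    [~, ~, ic] = unique(labels);           % map labels to 1..K
    p = accumarray(ic, 1) / numel(labels); % class proportions
    g = 1 - sum(p.^2);
end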
Example 6.1. Classification trees in MATLAB
image: 38_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___cripts_classification_trees_images_figure_0.png
image: 39_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___cripts_classification_trees_images_figure_1.png
image: 40_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___cripts_classification_trees_images_figure_2.png

6.2 Random Forests

Let's go deeper into the bagging technique, which we said can be used to reduce the variance of our estimates by averaging the estimates obtained from independent models. Assume we have $B$ models $\{M_b\}_{b=1}^B$ and an input $x$. Each model produces an estimate $Y_b = M_b(x)$, which tries to predict the real value $y$. Let's suppose these estimates are unbiased, i.e., $E[Y_b] = y, \ \forall b$. Then, the average of the estimates satisfies $E\left[\frac{1}{B}\sum_b Y_b\right] = \frac{1}{B}\sum_b E[Y_b] = \frac{1}{B}\sum_b y = y$. Therefore, this average is also an unbiased estimate of the value $y$. Let's see what happens with the variance when we perform this average. If we assume that the models are independent (so the covariance terms vanish) and that they all have the same variance, then it follows that $\mathrm{Var}\left[\frac{1}{B}\sum_b Y_b\right] = \frac{1}{B^2}\mathrm{Var}\left[\sum_b Y_b\right] = \frac{1}{B^2}\sum_b \mathrm{Var}[Y_b] = \frac{\mathrm{Var}[Y_b]}{B}$. Thus, the variance is reduced, making the predictions more consistent.
Trees typically suffer from high variance (overfitting), so they seem like a good candidate for this technique. The application of bagging to trees gives rise to random forests.
First, a bootstrapped dataset is created by randomly selecting samples from the original dataset with replacement. The samples that are not chosen for the bootstrapped dataset are placed in a separate dataset, the out-of-bag (OOB) dataset.
Then, we want to generate a diverse collection of trees that are better than a random choice, and are as independent from each other as possible. If we construct all these trees using the usual algorithm, then they will be very similar, so we have to inject stochasticity into this process. This is done by:
The pseudocode of the random forest training algorithm is shown in Algorithm 9.
for b=1 to n_estimators do
	Sample D_b as a bootstrap of max_samples from D
	
	Create tree T_b on D_b, adding nodes by
		- Select max_features variables at random from all variables
		- Pick the best variable/split point according to Gini/SSE
		- Split the current node according to the split found
end

Output forest {T_1, ..., T_B}
Algorithm 9: random_forest(D,n_estimators,max_samples,max_features)
To make predictions on a test example x :
Then, to estimate the generalization error, we can use the OOB dataset to compute the OOB error. This generalization error can be used as a validation error to select appropriate values for the hyperparameters, so we don't need to perform cross-validation.
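In MATLAB, this whole procedure is available through TreeBagger (Statistics and Machine Learning Toolbox); a minimal sketch, with illustrative hyperparameter values, could look as follows:

load fisheriris
rng(0);
forest = TreeBagger(100, meas, species, ...           % n_estimators = 100
                    'Method', 'classification', ...
                    'NumPredictorsToSample', 2, ...    % max_features
                    'OOBPrediction', 'on');            % keep out-of-bag information
oobErr = oobError(forest);                             % OOB error as a function of the number of trees
fprintf('OOB error with all trees: %.3f\n', oobErr(end));
pred = predict(forest, meas(1,:));                     % majority-vote prediction for one example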

6.2.1 Interpretability of random forests

If n_estimators is too big (there are many trees in the forest), it can be difficult to comprehend the decision process of the model: it is not interpretable. We can use variable importance plots to better interpret the results:

6.2.2 Proximities

The idea of proximity of samples in a tree is that when two examples fall into the same leaf of a decision tree, this is evidence supporting the fact that these two examples are similar in some sense.
In the wider sense of forests, we can build an $n \times n$ similarity matrix, where each entry $(i,j)$ corresponds to the fraction of trees in the forest in which $x_i$ and $x_j$ end up in the same leaf. This matrix can be used in distance/similarity-based methods, or to apply PCA to obtain a numeric representation of the data.

6.2.3 Imbalanced data in classification

In random forests, there are several techniques to deal with imbalanced data:

7 Multi-Layer Perceptron (Neural Networks)

In Section 4.4 we have seen the Perceptron, a mathematical model of a neuron that can be used in binary classification. We noted that this model is equivalent to a linear classifier, being thus a very limited model for a more general scenario. In this section, we are going to see how to extend this model to perform more complex tasks.

7.1 Multiclass classification

We saw how to use a single perceptron to perform binary classification, and there is a very natural way to extend this model to multi-class classification, using a one-hot encoding for the class.
For instance, let's work with a dataset $D = \{(x, y)\}$ where $x \in \mathbb{R}^d$ and $y \in L = \{l_1, \ldots, l_K\}$. We now encode $y$ as a one-hot vector, i.e., $y_k = \begin{cases} 1 & \text{if } y = l_k \\ 0 & \text{if } y \neq l_k \end{cases}$, for $k = 1, \ldots, K$. The idea is then to use $K$ perceptrons, each of them focusing on predicting one of the classes and trained independently. The idea is depicted in Figure 15. Notice that in this case it is almost compulsory to use the logistic function (or a similar function, but definitely not the step function), since this way we can interpret the values as probabilities, and therefore select the class with highest probability as our prediction. Also, remember that in $x$ we are adding an artificial 1 to account for the bias.
image: 41_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_mlp_1.png
Figure 15: Multi-class classification with perceptrons.
Let's formalize this idea a little bit. We have $K$ independent perceptrons, each of them predicting a variable, so that if we call $g$ our activation function, we obtain, for each $k = 1, \ldots, K$, $y_k(x) = g(w_k^T x)$, where $w_k$ is the weight vector of perceptron $k$. We can integrate all this into a single formula by introducing the notation $f[\cdot]$, which applies $f$ to each component of the input vector: $f[x] = (f(x_1), \ldots, f(x_d))^T$. This way, we can unify the previous equations as $y(x) = g[W^T x]$, where we have arranged the weight vectors as the columns of $W = (w_1, \ldots, w_K)$. Now, we can define what a layer is. A layer is just a set of parallel neurons, and a layer with $M$ neurons outputs $M$ variables. Usually, all neurons in a layer use the same activation function $g$. Note that if this is not the case, then our previous explanation needs to be slightly adapted.
Since we have discussed that a good alternative in multiclass classification is to use the sigmoid function as activation function, we can rewrite our model as $y(x) = \sigma[W^T x]$. Note, nonetheless, that here we are obtaining a bunch of different probabilities, one for each class, which could be further used to increase the information about our prediction. For instance, instead of taking the maximum of these independent probabilities, we can add a softmax function to the outputs, resulting in a normalized vector that can be interpreted as a probability vector. The softmax function is $\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}$.
Also, at doing this, it is not necessary to apply the activation function, g , beforehand. This approach is shown in Figure 16.
image: 42_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_mlp_2.png
Figure 16: Multi-class classification with perceptrons and softmax.
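A small sketch of a softmax function in MATLAB; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition above:

function p = softmax_vec(z)
% z: K-by-1 vector of scores; p: K-by-1 vector of probabilities summing to 1
    z = z - max(z);           % improves numerical stability, does not change the result
    e = exp(z);
    p = e / sum(e);
end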
Note that we are still in the linear classification scenario, and the next step is to improve the flexibility of the model to be able to cope with more complex relationships.

7.2 Multi-Layer Perceptron

Notice now an interesting fact: if our activation function is $g: \mathbb{R} \to I$, i.e., $I$ is the range of $g$, then a layer with $M$ perceptrons can be interpreted as a function $G: \mathbb{R}^n \to I^M$. For the sake of simplicity, let's say $G: \mathbb{R}^n \to \mathbb{R}^M$. Then, it is natural to think of stacking different layers, so that the output of a layer becomes the input of the following one, effectively performing the composition of the functions defined by these layers. For example, if we have two layers, which we interpret as two functions $G^{(1)}: \mathbb{R}^n \to \mathbb{R}^{M_1}$ and $G^{(2)}: \mathbb{R}^{M_1} \to \mathbb{R}^{M_2}$, we can take the output of the first layer and use it as input for the second one, effectively obtaining $G^{(2)} \circ G^{(1)}: \mathbb{R}^n \to \mathbb{R}^{M_2}$. This is interesting mainly because a single layer, as we have seen, is only able to perform linear classification, so let's look a bit more closely at what is happening in the second layer. The second layer uses an activation function $g^{(2)}: \mathbb{R} \to \mathbb{R}$, which is applied in each perceptron of the second layer as $y_k^{(2)}(x) = g^{(2)}\left(w_k^{(2)T} y^{(1)}(x)\right) = g^{(2)}\left(w_k^{(2)T} g^{(1)}[W^{(1)T} x]\right)$. This means that we are:
The key steps for going beyond linearity are the second and third. Let's compare this to ordinary linear regression. There, we introduced basis functions to escape the linear world by transforming the input variables, but we still applied linear regression on top of that transformation. Therefore, it was the modification of the input that allowed us to model non-linear relationships, and not the algorithm itself, which was not changed at all. Here, the idea is similar. In fact, we could perform those same transformations in this new setting and keep applying a single layer of perceptrons to obtain non-linear classifiers. But this is exactly the power of the multi-layer perceptron: we don't need to assume or guess the shape of the basis functions! The weights of the different layers will adapt to our problem, effectively auto-selecting what these functions should be. Of course, this is not free, and the cost is paid in harder training.
After this explanation, we can write the previous formula compactly as $y^{(2)}(x) = g^{(2)}[W^{(2)T} g^{(1)}[W^{(1)T} x]]$. We can also define the multi-layer perceptron as a composition of layers of perceptrons, with the addition of what is called the input layer, which is not really a layer of perceptrons, but rather represents and fixes the size of the input, and the output layer, which is the final layer, outputting the predictions of the model. The rest of the layers, the real perceptron layers, are called hidden layers. This is depicted in Figure 17. In this case, the first layer has $M_1$ perceptrons, the second one has $M_2$ perceptrons, and the output layer has as many perceptrons as there are possible labels, $K$.
image: 43_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_mlp_3.png Figure 17: A Multi-Layer Perceptron.
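As an illustration of the layer composition just described, a sketch of the forward pass of a two-hidden-layer MLP in MATLAB, with logistic hidden activations and a softmax output (all sizes and the random initialization are illustrative):

% Forward pass of a 2-hidden-layer MLP (illustrative sizes)
sigmoid = @(z) 1 ./ (1 + exp(-z));
d = 4; M1 = 8; M2 = 6; K = 3;            % input dim, hidden sizes, number of classes
W1 = 0.1*randn(d+1, M1);                 % weights of layer 1 (+1 row for the bias)
W2 = 0.1*randn(M1+1, M2);
W3 = 0.1*randn(M2+1, K);
x  = randn(d, 1);                        % one input example
z1 = sigmoid(W1' * [1; x]);              % first hidden layer:  g[W^T x]
z2 = sigmoid(W2' * [1; z1]);             % second hidden layer
a3 = W3' * [1; z2];                      % output scores
y  = exp(a3 - max(a3));  y = y / sum(y); % softmax output: class probabilities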
Remark 7.1. Observe that we have connected all outputs of each layer to all perceptrons in the next layer; also, note that two neurons within the same layer are not connected. These are just design decisions and are not compulsory; many different architectures exist.
More generally, a neural network is a directed graph whose nodes are perceptrons. If there are no cycles within the network, then it is a feed-forward neural network (FFNN), and it is called a recurrent network (RNN) in the case of cycles existing within the network.
MLP are thus a special case of feed-forward neural network, in which:
A brief visualization on different kinds of networks is depicted in Figure 18.
image: 43_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_mlp_3.png Sub-Figure a: MLP.
image: 44_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_mlp_4.png Sub-Figure b: FFNN not MLP.
image: 45_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_mlp_5.png
Sub-Figure c: RNN.
Figure 18: Different kinds of NN.

7.3 Error functions

7.3.1 Regression

The error in the regression case can be the empirical mean squared error, i.e., $E(W) = \frac{1}{2}\sum_{i=1}^n (y_i - y_W(x_i))^2$, where the dataset is $\{(x_i, y_i)\}_{i=1}^n$ and $y_W(x_i)$ is the prediction for $x_i$ with the weight matrix $W$. Note that in this formula we are assuming a single output, but it can be easily generalized as $E(W) = \frac{1}{2}\sum_{i=1}^n \|y_i - y_W(x_i)\|^2$. As usual, our objective will be to minimize this error.

7.3.2 Binary classification

If the target values are $y_i \in \{0,1\}$, then we can express $P(y \mid x) = y_W(x)^y (1 - y_W(x))^{1-y}$. If we assume independent and identically distributed examples, then we can define our likelihood function as $\mathcal{L}(W) = \prod_{i=1}^n y_W(x_i)^{y_i}(1 - y_W(x_i))^{1 - y_i}$, and the negative log-likelihood defines the cross-entropy error: $E(W) = -\log \mathcal{L}(W) = -\sum_{i=1}^n \left[ y_i \log(y_W(x_i)) + (1 - y_i)\log(1 - y_W(x_i)) \right]$.

7.3.3 Multi-class classification

The last error formula can be generalized for multi-class classification with $K > 2$ classes, as the generalized cross-entropy error: $E(W) = -\sum_{i=1}^n \sum_{k=1}^K y_{i,k} \log(y_{W,k}(x_i))$.

7.4 Training the MLP: Backpropagation

Notice that the error functions that we have defined can be very complex, and there is no closed form solution to minimize them. Therefore, some iterative approach is necessary, as we did with the perceptron. But we encounter now a new problem: if we do gradient descent on the neural network, we will need to compute lots of gradients, and computing a gradient is not cheap. The idea to make this process more efficient is to use the chain rule and to notice that many of the gradients that we compute are actually used in many places along the network for the computation of the final gradient. Therefore, an improved version for computing gradients was developed: the backpropagation algorithm.

7.4.1 The chain rule

The chain rule
8
For more information you can just visit Wikipedia.
states that, given two composable differentiable functions $f, g$, it is $\frac{d}{dx} f(g(x)) = f'(g(x)) g'(x)$. In the multivariate case, when we want to compute the derivative of $f(g_1(x), \ldots, g_m(x))$, it is $\frac{d}{dx} f(g_1(x), \ldots, g_m(x)) = \sum_{i=1}^m \frac{\partial f}{\partial y_i}(g_1(x), \ldots, g_m(x)) \frac{d g_i}{dx}(x)$. This is sometimes written by naming $z = f(g_1(x), \ldots, g_m(x))$ and $y_i = g_i(x)$, so that the formula becomes
9
Notice the abuse of notation though.
$\frac{dz}{dx} = \sum_{i=1}^m \frac{dz}{dy_i} \frac{dy_i}{dx}$.
Example 7.1. Let's do an example to understand how this can be helpful in the case of neural networks.
Imagine we have the functions $y_1 = x_1^2 e^{x_2}$, $y_2 = x_2^3$, $z = y_1(1 - e^{y_2})$, and we want to compute the gradient of $z$ with respect to $x = (x_1, x_2)$. We can write this as a diagram:
image: 46_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_backprop_1.png
We simply need to compute the gradient using the chain rule: $\nabla z = \left( \frac{dz}{dx_1}, \frac{dz}{dx_2} \right) = \left( \frac{\partial z}{\partial y_1}\frac{\partial y_1}{\partial x_1} + \frac{\partial z}{\partial y_2}\frac{\partial y_2}{\partial x_1},\ \frac{\partial z}{\partial y_1}\frac{\partial y_1}{\partial x_2} + \frac{\partial z}{\partial y_2}\frac{\partial y_2}{\partial x_2} \right) = \left( (1 - e^{y_2}) \cdot 2x_1 e^{x_2},\ (1 - e^{y_2}) \cdot x_1^2 e^{x_2} - y_1 e^{y_2} \cdot 3x_2^2 \right)$. Basically, the idea is to reuse as many values as possible (notice how several factors are shared between both components). To do this, we can compute local derivatives in each layer and propagate them back to the previous layer, multiplying them. The previous computation can be done in the diagram as follows:
image: 47_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_backprop_2.png
Note that:
  1. We can avoid unnecessary computations by storing at each node:
    1. Forward values: when we evaluate $y_1(x_1, x_2)$, we can keep this value stored in the node, because we will need it later. In this example we did symbolic differentiation, so this was not necessary, but in neural networks we first need to do the forward pass, and then assess the error.
    2. Local derivatives: when we evaluate d y 1 d x 2 , we can keep this value stored in the node, because we will use it several times.
    3. Backwards gradients: local derivatives are sent to those nodes that will need them, so that each of these derivatives is computed only once. Then the gradient is constructed little by little, backwards, in a dynamic programming manner.
  2. When the gradients are flowing backwards, each node has to sum all incoming gradients, according to the multivariate chain rule (if there is only one incoming gradient, we just take that one).

7.4.2 The backpropagation algorithm

The pseudocode for backpropagation is as in Algorithm 10. Let's break it down.
1. Forward pass
	- for layer in {1...c+1}
		a[layer] = W[layer]^T * z[layer-1]
		z[layer] = g[a[layer]] # apply g[.] to each component of a[layer]

2. Backward pass
	- d[c+1] = dg[a[c+1]] .* (z[c+1]-y) # .* is the component-wise product and dg is the derivative of g
	- grad[c+1] = z[c]*d[c+1]^T
	- for layer in {c...1}
		d[layer] = dg[a[layer]] .* (w[layer+1]*d[layer+1]) # w[k] is W[k] without the bias weight
		grad[layer] = z[layer-1]*d[layer]^T

return grad
Algorithm 10: Backpropagation.
Example 7.2. Let's perform an example. Here is our neural network:
image: 48_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_backprop_3.png
It is a fully connected two-layer network. Let's follow the algorithm, defining each of the variables at a time.
  1. We define $a^{(1)} = x$, $a^{(2)} = z^{(1)} = g_1[a^{(1)}]$, $a^{(out)} = z^{(2)} = g_2[a^{(2)}]$ and $z^{(out)} = z = g_{out}[a^{(out)}]$:
    image: 49_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_backprop_4.png
  2. Define $\delta^{(out)} = g_{out}'[a^{(out)}] \odot (z^{(out)} - y)$ and $\mathrm{grad}^{(out)} = z^{(2)} \delta^{(out)T}$:
    image: 50_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_backprop_5.png
  3. Define $\delta^{(2)} = g_2'[a^{(2)}] \odot (\hat{W}^{(out)} \delta^{(out)})$ and $\mathrm{grad}^{(2)} = z^{(1)} \delta^{(2)T}$:
    image: 51_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_backprop_6.png
  4. Define $\delta^{(1)} = g_1'[a^{(1)}] \odot (\hat{W}^{(2)} \delta^{(2)})$ and $\mathrm{grad}^{(1)} = z^{(0)} \delta^{(1)T} = a^{(1)} \delta^{(1)T}$:
    image: 52_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_Machine_Learning_LectureNotes_source_backprop_7.png
And that's it! Now we can use the gradient at each layer to train the network, updating the weights in layer $l$ according to $W^{(l)} = W^{(l)} - \gamma\, \mathrm{grad}^{(l)}$, where $\gamma$ is the learning rate.
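To make the $\delta$ recursion concrete, a compact MATLAB sketch of one forward/backward pass and gradient-descent update for a single-hidden-layer network with logistic activations and squared error (sizes and values are illustrative):

% One backpropagation step for a 1-hidden-layer MLP (logistic activations, squared error)
sigmoid = @(z) 1 ./ (1 + exp(-z));
d = 3; M = 5; K = 2; gamma = 0.1;        % illustrative sizes and learning rate
W1 = 0.1*randn(d+1, M);  W2 = 0.1*randn(M+1, K);
x = randn(d,1);  y = [1; 0];             % one training example (one-hot target)

% Forward pass (store activations for the backward pass)
a1 = W1' * [1; x];   z1 = sigmoid(a1);
a2 = W2' * [1; z1];  zout = sigmoid(a2);

% Backward pass: delta recursion, then gradients
d_out = zout .* (1 - zout) .* (zout - y);          % g'[a_out] .* (z_out - y)
grad2 = [1; z1] * d_out';                          % gradient w.r.t. W2
d_hid = z1 .* (1 - z1) .* (W2(2:end,:) * d_out);   % propagate through W2 (bias row removed)
grad1 = [1; x] * d_hid';                           % gradient w.r.t. W1

% Gradient-descent update
W2 = W2 - gamma * grad2;
W1 = W1 - gamma * grad1;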

7.5 Some activation functions

7.5.1 Logistic

The logistic function is $\sigma: \mathbb{R} \to (0,1)$, defined by $\sigma(x) = \frac{1}{1 + e^{-x}}$, with derivative $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$. In this case, for the backpropagation algorithm, we use $g'[a^{(l)}] = g[a^{(l)}] \odot (1 - g[a^{(l)}]) = z^{(l)} \odot (1 - z^{(l)})$, which is a very efficient formula, since we already have all these values computed from the forward pass.

7.5.2 Hyperbolic tangent

The hyperbolic tangent function is $\tanh: \mathbb{R} \to (-1,1)$, defined by $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, with derivative $\frac{d}{dx}\tanh(x) = 1 - (\tanh(x))^2$. Thus, we end up with another very efficient formula: $g'[a^{(l)}] = 1 - z^{(l)} \odot z^{(l)}$.
Example 7.3. Implementing a NN for digit classification in MATLAB
image: 53_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_0.png
image: 54_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_1.png
image: 55_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_2.png
image: 56_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_3.png
image: 57_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_4.png
image: 58_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_5.png
image: 59_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_6.png
image: 60_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_neuralnet_example_images_figure_7.png

A Notes on probability theory, Bayes theorem and Bayesian learning

These notes are adapted from [3].

A.1 Probability theory basics

Let Ω be a sample space, i.e., the set of possible outcomes of an event, and A Ω an event. A probability measure is a function P: P ( Ω ) R , that assigns a real number to every event, P( A ) . This represents how likely it is that the experiment's outcome is in A .
For ( Ω ,P ) to be a probability space
10
In fact, we need one more ingredient, F , which is a σ -algebra of subsets of Ω . This is just a formalization, but we can abuse notation here to simplify some things.
we have to impose three axioms:
  1. $P(A) \geq 0, \ \forall A \subseteq \Omega$.
  2. $P(\Omega) = 1$.
  3. $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
From these axioms, some consequences can be derived:

A.1.1 Joint probability

It is usual to be interested in the probability of two events happening simultaneously. This is called the joint probability of events $A$ and $B$: $P(A, B) \overset{\text{def}}{=} P(A \cap B)$. The joint probability is useful when an event can be decomposed into simpler disjoint events, i.e., we have $A \subseteq B_1 \cup \ldots \cup B_n$, with $B_i \cap B_j = \emptyset, \ i \neq j$. In this situation, we can use the sum rule: $P(A) = \sum_{i=1}^n P(A, B_i)$. This is also known as marginalization: when we know the joint probability $p(x,y)$ and we want to compute $p(x)$, we marginalize out $y$. This basically means that if we know the probabilities of all possible pairs $(x,y)$, we can know the probability of $x$ by exploring all the possibilities. Here, 'exploring' means using the sum rule: $p(x) = \sum_y p(x,y)$, or $p(x) = \int_y p(x,y)\, dy$ if $y$ is continuous.
Example A.1. We have two events:
We have some sources of information, and are able to determine that p( >100,prof ) =0.05,p( >100,seng ) =0.1,p( >100,dsci ) =0.2, then we can conclude that p( >100 ) =0.05+0.1+0.2=0.35.

A.1.2 Conditional probability

The conditional probability of B given A is the probability that B occurs, knowing that A has occurred. This means that we have to restrict the space of possible outcomes, from Ω to A
11
Note that P( A ) >0 is needed.
: $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$. If we rearrange the terms, we obtain the product rule: $P(A, B) = P(B \mid A) P(A)$. This formula can be generalized to an arbitrary number of events, the general product rule, given by
12
Note that here it is needed that all the intersections A 1 ... A i have non-zero probability, for i=1,...,n-1 .
$P(A_1, \ldots, A_n) = \prod_{i=1}^n P(A_i \mid A_1, \ldots, A_{i-1})$.
Exercise A.1. Prove that P( A,B,C ) =P( A ) P( B|A ) P( C|A,B ) =P( C ) P( B|C ) P( A|B,C ) .
Then, prove the general product rule.
Basically, we are asked to prove the general product rule by induction.
For n=2 , we have already proven it.
For $n = 3$, we have $P((A_1 \cap A_2) \cap A_3) = P(A_3 \mid A_1 \cap A_2) P(A_1 \cap A_2) = P(A_3 \mid A_1 \cap A_2) P(A_2 \mid A_1) P(A_1)$, which is what we wanted.
Now, assume it is true for $n-1$. Then, we have $P(A_1 \cap \ldots \cap A_n) = P((A_1 \cap \ldots \cap A_{n-1}) \cap A_n) = P(A_n \mid A_1 \cap \ldots \cap A_{n-1}) P(A_1 \cap \ldots \cap A_{n-1}) \overset{\text{induction}}{=} P(A_n \mid A_1 \cap \ldots \cap A_{n-1}) \prod_{i=1}^{n-1} P(A_i \mid A_1, \ldots, A_{i-1}) = \prod_{i=1}^n P(A_i \mid A_1, \ldots, A_{i-1})$.

A.1.3 Bayes rule

Bayes' theorem gives an alternative formula for the conditional probability: $P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$.
Proof. Assuming all involved probabilities are non-zero: $P(A \mid B) = \frac{P(A \cap B)}{P(B)} \overset{\text{product rule}}{=} \frac{P(B \mid A) P(A)}{P(B)}$.
This rule is known to be useful to update the probability of an event happening, when we are able to gather new information of related events. P( A ) is usually called the prior probability, and P( A|B ) is the a posteriori probability, which means that we have observed B , and want to update the probability estimate for A .
Example A.2. Example of the Bayes rule in action
An English-speaking tourist visits a city whose language is not English. A local friend tells him that 1 in 10 natives speak English, that 1 in 5 people in the streets are tourists, and that half of the tourists speak English. Our visitor stops someone in the street and finds that this person speaks English. What is the probability that this person is a tourist?
We have $P(EN \mid Tourist) = \frac{1}{2}$, $P(EN \mid Local) = \frac{1}{10}$, $P(Tourist) = \frac{1}{5}$.
We want to update our knowledge about the event of this person being a tourist. The prior probability is $\frac{1}{5}$, but since we know that this person speaks English, we have new information useful for updating the probability.
First, the total probability of someone speaking English is $P(EN) \overset{\text{sum rule + product rule}}{=} P(EN \mid Tourist) P(Tourist) + P(EN \mid Local) P(Local) = \frac{1}{2}\cdot\frac{1}{5} + \frac{1}{10}\cdot\frac{4}{5} = \frac{9}{50}$.
Now, the a posteriori probability of the person being a tourist, knowing that he speaks English, is $P(Tourist \mid EN) = \frac{P(EN \mid Tourist) P(Tourist)}{P(EN)} = \frac{\frac{1}{2}\cdot\frac{1}{5}}{\frac{9}{50}} = \frac{5}{9}$. As we can see (and as we should expect), knowing that the person speaks English, our confidence that he is a tourist increases.

A.2 Bayes rule in the context of learning

As has been explained, Bayes' rule allows us to reason about hypotheses from data: $P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis}) P(\text{hypothesis})}{P(\text{data})}$. In the jargon of parameters and datasets, this is: let $\theta$ be a random variable with support $\Theta$, and let $D$ be the data that has been observed. Then, $P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)} = \frac{P(D \mid \theta) P(\theta)}{\int_\Theta P(D \mid \theta') P(\theta')\, d\theta'}$. Here, $P(\theta)$ is the prior distribution of $\theta$: the distribution that we assume before observing $D$. $P(D \mid \theta)$ is the likelihood of $\theta$: the probability of observing $D$ if the parameters are $\theta$. $P(D)$ is the evidence or expected likelihood. $P(\theta \mid D)$ is the posterior distribution of $\theta$, our quantity of interest, expressing what we know about $\theta$ after having observed $D$.
Thus, we can continue this line of thought to tackle a new way of creating a model: given some data D , find the best possible values for the unknown parameters θ , so that the posterior or its likelihood is maximized.
There are, basically, two different approaches:

A.3 Maximum likelihood estimation

Given a sample $D = \{x_1, \ldots, x_n\}$, where the $x_i$ are independent and identically distributed observations from a random variable $X$ following a distribution $p(X; \theta)$, with $\theta$ the parameters of the distribution, our objective is to obtain the best values for $\theta$, according to our data, assuming some particular form for $p$.
For this, the likelihood can be used: since the $x_i$ are independent, the probability of having observed the sample $D$ is $p(D; \theta) = \prod_{i=1}^n p(x_i; \theta)$. The likelihood function is thus defined as $L(\theta) = p(D; \theta)$. Note that this is done with the data $D$ fixed, so it is not a probability distribution, but a function of $\theta$. This way, the maximum likelihood estimator for $\theta$ is given by $\theta_{ML} = \arg\max_\theta L(\theta)$. There is a numerical issue here, though: as we are multiplying probabilities, which are values between 0 and 1, and we are likely to be multiplying many of them, we expect to obtain values very close to 0, which can lead to underflow in computations. Thus, it is convenient to use the log-likelihood: $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta [\log L(\theta)] = \arg\min_\theta [-\log L(\theta)]$.
Example A.3. Compute the maximum likelihood estimator (MLE) for univariate Gaussian distribution.
For this, we assume the data $D = \{x_1, \ldots, x_n\}$, where each $x_i \sim \mathcal{N}(\mu, \sigma^2)$. In this situation, the parameters are $\mu$ and $\sigma^2$. The likelihood function is $L(D; \mu, \sigma^2) = \prod_{i=1}^n f(x_i; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi\sigma^2)^{n/2}} e^{-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2}$. Now, the log-likelihood is $\log L(D; \mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$. At this point, to obtain the MLE, we maximize this function with respect to $\mu$ and $\sigma^2$: $\frac{\partial}{\partial \mu}\log L = \frac{1}{\sigma^2}\sum_i (x_i - \mu) = 0 \iff \sum_i x_i - n\mu = 0 \iff \mu_{MLE} = \frac{\sum_i x_i}{n}$, and $\frac{\partial}{\partial \sigma^2}\log L = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (x_i - \mu)^2 = 0 \iff \sum_i (x_i - \mu)^2 = n\sigma^2 \iff \sigma^2 = \frac{\sum_i (x_i - \mu)^2}{n}$. Now, we substitute the value obtained for $\mu$: $\sigma^2_{MLE} = \frac{\sum_i (x_i - \mu_{MLE})^2}{n}$.
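A quick numerical sanity check of these formulas in MATLAB (sample size and parameter values are arbitrary):

rng(0);
mu = 2; sigma = 1.5; n = 1e5;
x = mu + sigma*randn(n,1);               % sample from N(mu, sigma^2)
mu_mle     = sum(x)/n;                   % = mean(x)
sigma2_mle = sum((x - mu_mle).^2)/n;     % (biased) MLE of the variance
fprintf('mu_MLE = %.3f, sigma2_MLE = %.3f\n', mu_mle, sigma2_mle);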

Example A.4. Compute the MLE for a Bernoulli distribution.
Now the observations are the results of $n$ coin tosses. We have to compute the parameter $p_{ML}$ of the Bernoulli random variable, whose probability function is given by $f(x) = p^x(1-p)^{1-x}$, $p \in (0,1)$, $x \in \{0,1\}$. Now, we have $D = \{x_1, \ldots, x_n\} \in \{0,1\}^n$. The likelihood function is $L(D; p) = \prod_{i=1}^n f(x_i) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}$. We differentiate: $\frac{\partial}{\partial p} L = \left(\sum_i x_i\right) p^{\sum_i x_i - 1}(1-p)^{n - \sum_i x_i} - p^{\sum_i x_i}\left(n - \sum_i x_i\right)(1-p)^{n - \sum_i x_i - 1} = p^{\sum_i x_i - 1}(1-p)^{n - \sum_i x_i - 1}\left[\sum_i x_i (1-p) - p\left(n - \sum_i x_i\right)\right] = p^{\sum_i x_i - 1}(1-p)^{n - \sum_i x_i - 1}\left[\sum_i x_i - pn\right]$. This derivative is zero if and only if $\sum_i x_i - pn = 0 \iff p = \frac{\sum_i x_i}{n} = \bar{x}$.

A.4 Properties of estimators

Definition A.1. If we have a dataset D={ x 1 ,..., x n } where x i X and X is a random variable, we call estimator a function h: R n R k .
Usually, we focus on estimators that tell us something about the underlying distribution X . For example, it is usual to assume that X belongs to a certain family of random variables XF( θ ) , so that we need to estimate the parameters θ =( θ 1 ,..., θ k ) .
There are some properties that are considered desirable for an estimator to have:
Example A.5. Show that MSE( θ ˆ ) =Bias [ θ ˆ ] 2 +Var[ θ ˆ ] .
Let's start from the definition, treating $\theta$ as a constant: $\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = E[\hat\theta^2] - 2\theta E[\hat\theta] + \theta^2 = \left(E[\hat\theta^2] - E[\hat\theta]^2\right) + \left(E[\hat\theta]^2 - 2\theta E[\hat\theta] + \theta^2\right) = \mathrm{Var}[\hat\theta] + (E[\hat\theta] - \theta)^2 = \mathrm{Var}[\hat\theta] + \mathrm{Bias}[\hat\theta]^2$.

Example A.6. Compute the bias and the variance of the ML estimates ( μ , σ 2 ) of an univariate Gaussian. Show that σ ML is biased and that we can correct its biasedness by using a different estimator σ ˆ 2 = n n-1 σ ML 2 . Compute the bias and the variance of this new estimator.
Let's start with μ :
Let's start with $\mu_{MLE} = \frac{1}{n}\sum_i x_i$: $\mathrm{Bias}(\mu_{MLE}) = E[\mu_{MLE}] - E[X] = \frac{1}{n}\sum_i E[x_i] - E[X] = \frac{n}{n}E[X] - E[X] = 0$, so $\mu_{MLE}$ is unbiased, and, using the independence of the $x_i$, $\mathrm{Var}(\mu_{MLE}) = \mathrm{Var}\left[\frac{1}{n}\sum_i x_i\right] = \frac{1}{n^2}\sum_i \mathrm{Var}[x_i] = \frac{\mathrm{Var}[X]}{n} = \frac{\sigma^2}{n}$. Now, for $\sigma^2_{MLE} = \frac{1}{n}\sum_i (x_i - \mu_{MLE})^2$, note that $\sum_i (x_i - \mu_{MLE})^2 = \sum_i x_i^2 - n\mu_{MLE}^2$, so $E\left[\sum_i (x_i - \mu_{MLE})^2\right] = nE[X^2] - nE[\mu_{MLE}^2] = n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right) = (n-1)\sigma^2$, where we used $E[X^2] = \mathrm{Var}[X] + E[X]^2$ and $E[\mu_{MLE}^2] = \mathrm{Var}[\mu_{MLE}] + E[\mu_{MLE}]^2 = \frac{\sigma^2}{n} + \mu^2$. Therefore $E[\sigma^2_{MLE}] = \frac{n-1}{n}\sigma^2$ and $\mathrm{Bias}(\sigma^2_{MLE}) = E[\sigma^2_{MLE}] - \sigma^2 = -\frac{\sigma^2}{n}$, so the ML estimator of the variance is biased. Finally, for $\hat\sigma^2 = \frac{n}{n-1}\sigma^2_{MLE} = \frac{1}{n-1}\sum_i (x_i - \mu_{MLE})^2$ we get $E[\hat\sigma^2] = \frac{n}{n-1}\cdot\frac{n-1}{n}\sigma^2 = \sigma^2$, so $\mathrm{Bias}(\hat\sigma^2) = 0$, and its variance is simply rescaled: $\mathrm{Var}(\hat\sigma^2) = \left(\frac{n}{n-1}\right)^2 \mathrm{Var}(\sigma^2_{MLE})$. And we see how this corrected estimator is, in fact, unbiased.

A.5 Maximum a posteriori estimation

MAP (maximum a posteriori) estimation is a method of estimating the parameters of a statistical model by finding the parameter values that maximize the posterior probability distribution of the parameters, given the observed data and a prior probability distribution over the parameters. In contexts where the amount of data is limited or noisy, incorporating prior knowledge or beliefs can help to produce more stable and accurate estimates. The prior distribution serves as a regularization term, allowing us to control the degree of influence that the prior has on the estimate: θ ˆ MAP = arg max θ P( θ |D ) = arg max θ P( D| θ ) P( θ ) P( D ) = arg max θ P( D| θ ) P( θ ) , where the denominator can be ignored because it is constant for all possible θ , so it does not change the arg max .
Example A.7. Find the MAP estimate for μ of an univariate Gaussian X N ( μ , σ 2 ) with Gaussian prior distribution for μ N ( μ 0 , σ 0 2 ) , where σ , σ 0 and μ 0 are assumed to be known.
Let D={ x 1 ,..., x n } where x i X , then P( μ |D ) = P( D| μ ) P( μ ) P( D ) = i=1 n P( x i | μ ) P( μ ) P( D ) = i=1 n 1 2 π σ 2 e - 1 2 ( x i - μ ) 2 σ 2 1 2 π σ 0 2 e - 1 2 ( μ - μ 0 ) 2 σ 0 2 P( D ) = 1 ( 2 π σ 2 ) n 2 1 2 π σ 0 2 e - 1 2 σ 2 i=1 n ( x i - μ ) 2 e - 1 2 ( μ - μ 0 ) 2 σ 0 2 P( D ) , so we want to maximize f( μ ) = e - 1 2 σ 2 i=1 n ( x i - μ ) 2 e - 1 2 ( μ - μ 0 ) 2 σ 0 2 , or, its log function g( μ ) = log f( μ ) =- 1 2 σ 2 i=1 n ( x i - μ ) 2 - 1 2 ( μ - μ 0 ) 2 σ 0 2 . Thus, d d μ g( μ ) = 1 2 σ 2 i=1 n 2( x i - μ ) - 1 2 2( μ - μ 0 ) σ 0 2 = ( x i - μ ) σ 2 - μ - μ 0 σ 0 2 = x i -n μ σ 2 - μ - μ 0 σ 0 2 = x i σ 2 - n μ σ 0 2 + σ 2 μ σ 2 σ 0 2 + μ 0 σ 0 2 = x i σ 2 - μ n σ 0 2 + σ 2 σ 2 σ 0 2 + μ 0 σ 0 2 and this is zero if and only if μ MAP = σ 2 σ 0 2 σ 2 +n σ 0 2 ( x i σ 2 + μ 0 σ 0 2 ) = σ 2 σ 0 2 σ 2 +n σ 0 2 ( σ 0 2 x i + σ 2 μ 0 σ 2 σ 0 2 ) = σ 0 2 x i + σ 2 μ 0 σ 2 +n σ 0 2 . A different way to do this is that we have P( μ |D ) = 1 ( 2 π σ 2 ) n 2 1 2 π σ 0 2 e - 1 2 σ 2 i=1 n ( x i - μ ) 2 e - 1 2 ( μ - μ 0 ) 2 σ 0 2 P( D ) e - 1 2 σ 2 i=1 n ( x i - μ ) 2 e - 1 2 ( μ - μ 0 ) 2 σ 0 2 , and we want ( μ n , σ n ) such that P( μ |D ) e - ( μ - μ n ) 2 2 σ n 2 and P( μ |D ) N( μ n , σ n 2 ) , so we can solve e - 1 2 σ 2 i=1 n ( x i - μ ) 2 e - 1 2 ( μ - μ 0 ) 2 σ 0 2 = e - ( μ - μ n ) 2 2 σ n 2 , obtaining μ n = σ 0 2 x i + σ 2 μ 0 σ 2 +n σ 0 2 and σ n = σ 2 σ 0 2 σ 2 +n σ 0 2 .

Example A.8. The maximum likelihood estimate of p for a Bernoulli r.v. X is given by p ML = x i n . If we have K>2 outcomes, then we have the categorical distribution, also known as multinoulli or generalized Bernoulli, which has support { 1,...,K } . Its parameters are p=( p 1 ,..., p K ) , representing the probability of observing each of the possible outcomes, with p i [ 0,1 ] ,i and p i =1 .
It is convenient to use the one-of-K encoding (also called one-hot encoding) for each outcome. Thus, the pmf of this distribution becomes p( x ) = i=1 K p i x i . Now, given a sample D={ x 1 ,..., x n } of possible outcomes for a multinoulli r.v. X , the maximum likelihood estimate for p is p ˆ k = 1 n x ik , for each k{ 1,...,K } . We can write this compactely as p ˆ = 1 n x i . If some category k is not present in our sample, then its corresponding ML estimate is going to be 0. These 0-estimates are problematic in predictive applications, because unseen outcomes in the training data are considered to be impossible to happen, and thus can never be predicted. To avoid this, pseudocounts are used instead. These represent prior knowledge in the form of (imagined) counts c k for each category k . The idea is to assume that the data is augmented with our pseudocounts, and then we estimate using maximum likelihood over the augmented data, namely p ˆ = c+ i x i n+ k c k , so the k -th parameter is p ˆ k = c k + i x ik n+ k c k . As an example, imagine that we obtain a sample from a die: { 1,3,5,4,4,6 } . If the vector of pseudocounts is c=( 1,1,1,1,1,1 ) , then the estimates are p ˆ 1 = 1+1 6+6 = 1 6 , p ˆ 2 = 1 6+6 = 1 12 , and so on. Notice that although 2 has not been observed in D , its probability estimate is not 0. This special case where all pseudocounts are 1 is known as Laplace smoothing.
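The die example can be verified numerically with a short MATLAB sketch:

sample = [1 3 5 4 4 6];                  % observed die rolls
c = ones(1, 6);                          % pseudocounts of 1 for each face (Laplace smoothing)
counts = histcounts(sample, 0.5:1:6.5);  % raw counts of each face 1..6
p_hat = (counts + c) / (numel(sample) + sum(c));
disp(p_hat)                              % p_hat(1) = 2/12 = 1/6, p_hat(2) = 1/12, ...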
Prove that using maximum likelihood with pseudocounts corresponds to a MAP estimate with Dirichlet prior with parameters ( c 1 +1,..., c K +1 ) .
A Dirichlet distribution, Dir( c 1 +1,..., c K +1 ) has the density function: f( x ) = Γ ( ( c i +1 ) ) Γ ( c i +1 ) x i c i = Γ ( c i +K ) c i ! x i c i = ( c i +K-1 ) ! c i ! x i c i . To compute the MAP estimate, we use P( D|p ) P( p ) = i P( x i |p ) P( p ) = i k p k x ik ( c i +K-1 ) ! c i ! k p k c k . Thus, we want to minimize f( p 1 ,..., p K ) = i,k p k x ik k p k c k , or, equivalently, its log g( p 1 ,..., p K ) = log f( p 1 ,..., p K ) = ik x ik log ( p k ) + k c k log ( p k ) . We have min g( p 1 ,..., p K ) s.a. p i =1 p i [ 0,1 ] ,i . The lagrangian is L=g- λ ( p i -1 ) , so that p k L= p k g- λ = i x ik 1 p k + c k 1 p k - λ =0 λ = i x ik + c k p k p k = i x ik + c k λ , and then 1= k p k = k i x ik + c k λ = 1 λ ( n+ k c k ) λ =n+ k c k . Finally, substituting back, we obtain p k = c k + i x ik n+ k c k , as we wanted!

A.6 Bayesian Learning

In the Bayesian Learning framework, instead of working with point estimates of our unknown parameter variables θ , we work with the whole posterior distribution P( θ |D ) . In this case, learning is the process by which starting with some prior belief about the parameters and when facing some observations in the form of a dataset D , we update our belief about the possible values for our parameters in the form of the posterior distribution. This process is iterated over newly received data, so it can be viewed as a sequential process, in which Bayes is invoked each time we need a new posterior P( θ | D 1 , D 2 ,... ) .
Example A.9. Bayesian learning in Matlab
image: 61_home_runner_work_BDMA_Notes_BDMA_Notes_UPC_M___e_scripts_bayesian_learning_images_figure_0.png

A.6.1 Predictive posterior

When doing prediction in this framework, the whole distribution $P(\theta \mid D)$ is used. We view a prediction as a weighted average of the predictions that each value of $\theta$ can make, weighted by its posterior probability (i.e., an expected value): $P(x' \mid D) = E_{\theta \mid D}[P(x' \mid \theta, D)] = \int_\Theta p(x' \mid \theta, D) P(\theta \mid D)\, d\theta$.
Example A.10. An insightful example

References

[1] Jose A. Lorencio Abril, "Apuntes de Inferencia Estadística" (2021).
[2] Marta Arias, "Machine Learning".
[3] Marta Arias, "Notes on probability theory, Bayes theorem and Bayesian learning".
[4] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain", Psychological Review 65, 6 (1958), pp. 386-408.