ML-MDS - Machine Learning
Spring 2023
Jose Antonio Lorencio Abril
Professor: Marta Arias
Student e-mail: jose.antonio.lorencio@estudiantat.upc.edu
This is a summary of the course Machine Learning taught at the Universitat Politècnica de Catalunya by Professor Marta Arias in the academic year 22/23. Most of the content of this document is adapted from the course notes by Arias [2], so I won't be citing them all the time. Other references will be provided when used.
1 Introduction
I gave ChatGPT the index of these notes and asked it for an introduction. Here is the result:
Welcome to the comprehensive study notes for your course in Machine Learning. This document is designed to provide you with a thorough overview of the primary topics in machine learning, to supplement your coursework and study.
Our journey will begin with an overview of Machine Learning and the foundational elements that build this exciting field. The aim is to provide you with a sound understanding of the core principles and techniques involved in the development and application of Machine Learning algorithms.
We will kick-off with 'Linear Regression,' a cornerstone of machine learning, going in-depth into topics like least squares method, issues encountered in linear regression, basis functions, and probabilistic approaches. It will also focus on Bias-Variance decomposition, ridge regression, LASSO regression, and finally, a full-Bayesian perspective.
Our exploration then continues with 'Clustering' where we cover important clustering techniques like k-Means, k-Means++, and ways to determine the optimal number of clusters. We delve deeper into the realm of Gaussian Mixtures and the Expectation-Maximization technique.
In the 'Linear Classifiers' chapter, we inspect the decision boundary in probabilistic models, generative classifiers, Naive Bayes, and methodologies like Perceptron and Logistic Regression.
Following this, we introduce 'Nearest Neighbor Prediction'—a different way of approaching classification problems. The section will equip you with tools to understand the concept of locality and choosing the best 'k', and how to combine outputs to make predictions.
We also explore the use of 'Trees and Random Forests' in machine learning. It will demystify decision trees and random forests, dealing with imbalanced data in classification, and providing insights into the interpretability of random forests.
Next, we journey into the world of 'Neural Networks' with the 'Multi-Layer Perceptron (MLP)', learning about the different error functions used for regression, binary classification, and multi-class classification. We also cover the critical process of training the MLP using Backpropagation and some common activation functions.
Finally, the notes conclude with an appendix section, 'Notes on probability theory, Bayes theorem and Bayesian learning.' This section is critical for understanding the probability theory basics, Bayes rule in the context of learning, maximum likelihood estimation, properties of estimators, maximum a posteriori estimation, and finally, Bayesian learning.
These notes aim to serve as a comprehensive reference guide for you as you explore machine learning, providing both a high-level understanding of the field and the mathematical and algorithmic details that underlie each concept. Happy learning!
2 Linear regression
2.1 Introduction
In Figure 1, we can observe a dataset of the population of different states plotted against the number of fatal accidents in each of the states. Here, each blue circle corresponds to a row of our data, and its coordinates are the values in that row. The red line is the linear regression model of this data. This means it is the line that 'best' approximates the data, where best refers to minimizing some kind of error: in this case, the squared error between each point and its vertical projection onto the line. This approach is called the least squares method.
2.2 Least squares method
2.2.1 Least squares in 2D
In 2D, we have a dataset $\{(x_i,y_i)\}_{i=1}^n$ and we want to find the line that best approximates $y$ as a function of $x$. As we want a line, we need to specify its slope, $\theta_1$, and its intercept, $\theta_0$. So, our estimations are $\hat{y}_i = \theta_0 + \theta_1 x_i$. The least squares linear regression method chooses $\theta_0$ and $\theta_1$ in such a way that the error function
$$E(\theta_0,\theta_1) = \sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i)^2$$
is minimized.
Note that this function only depends on the parameters $\theta_0$ and $\theta_1$, since the data is assumed to be fixed (they are observations).
To compute them, we just need to find the minimum of $E$, by taking partial derivatives and setting them to 0. Let's do this optimization. Developing the square and differentiating:
$$\frac{\partial E}{\partial \theta_0} = -2\sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i), \qquad \frac{\partial E}{\partial \theta_1} = -2\sum_{i=1}^n x_i(y_i - \theta_0 - \theta_1 x_i).$$
We have now to solve the system given by setting both derivatives to 0. We can isolate $\theta_0$ from the first equation:
$$\theta_0 = \bar{y} - \theta_1\bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the sample means, and substitute it in the second one. After dividing everything by $n$, this yields:
$$\frac{1}{n}\sum_{i=1}^n x_iy_i - \bar{x}\bar{y} = \theta_1\left(\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2\right).$$
If we now assume that the observations are equiprobable, i.e., $p_i = \frac{1}{n}$, and we call $X$ the random variable from which the observations $x_i$ are obtained, and the same for the observations $y_i$, obtained from $Y$, then the left hand side is $Cov(X,Y)$ and the parenthesis is $Var(X)$. This means that the previous equation can be rewritten as $Cov(X,Y) = \theta_1 Var(X)$. So
$$\theta_1 = \frac{Cov(X,Y)}{Var(X)}, \qquad \theta_0 = \bar{y} - \theta_1\bar{x}.$$
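To make these formulas concrete, here is a quick numeric check in Python (the data values are made up for illustration):
import numpy as np

# Hypothetical sample: x = population, y = fatal accidents
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

theta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = Cov(X,Y)/Var(X)
theta0 = y.mean() - theta1 * x.mean()               # intercept = mean(y) - slope*mean(x)
print(theta0, theta1)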
2.2.2 Least squares regression: multivariate case
Now, we can assume that we have $d$ independent variables $x_1,\dots,x_d$ which we want to use to predict the dependent variable $y$. Again, we have $n$ observations of each variable. Now, we want to construct a hyperplane in $\mathbb{R}^{d+1}$, whose predictions would be obtained as $\hat{y} = \theta^Tx = \sum_{j=0}^d \theta_jx_j$, where we define $x_0 = 1$ for all observations. We can represent the observations as the rows of a matrix $X \in \mathbb{R}^{n\times(d+1)}$, so that we can write $\hat{y} = X\theta$. The error function is defined as in the simple case, $E(\theta) = \sum_{i=1}^n(y_i - \theta^Tx_i)^2$, but now we can rewrite this as $E(\theta) = (y - X\theta)^T(y - X\theta)$. Again, to obtain $\theta$ we need to optimize this function, using matrix calculus.
Lemma 2.1.
If $x, a \in \mathbb{R}^d$, and $A \in \mathbb{R}^{d\times d}$ is a symmetric matrix, it holds:
$$\nabla_x(a^Tx) = a \qquad \text{and} \qquad \nabla_x(x^TAx) = 2Ax.$$
Proof.
First, notice that $a^Tx = \sum_j a_jx_j$, so it is $\frac{\partial(a^Tx)}{\partial x_j} = a_j$. For the second result, the procedure is the same, expanding $x^TAx = \sum_{j,k}A_{jk}x_jx_k$ and using the symmetry of $A$.
Lastly, notice that $E(\theta) = (y - X\theta)^T(y - X\theta) = y^Ty - 2\theta^TX^Ty + \theta^TX^TX\theta$, so the lemma applies to the last two terms.
Now, we can proceed and minimize $E$:
$$\nabla_\theta E = -2X^Ty + 2X^TX\theta;$$
setting this to 0, we get $X^TX\theta = X^Ty$. Thus, the 'best' linear model is given by
$$\hat{\theta} = (X^TX)^{-1}X^Ty.$$
Once we have this model, if we have an observation of $x$, and we want to make a prediction, we compute $\hat{y} = \hat{\theta}^Tx$. The approach that we have followed here is the optimization view of learning, which basically consists of the steps below (see the sketch after this list):
- Set up an error function as a function of some parameters.
- Optimize this function to find the suitable values for these parameters, assuming the data as given.
- Use incoming values to make predictions.
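A minimal NumPy sketch of these three steps for least squares, solving the normal equations rather than explicitly inverting $X^TX$:
import numpy as np

def least_squares(X, y):
    # Prepend a column of ones so theta[0] plays the role of the intercept
    Xb = np.column_stack([np.ones(len(X)), X])
    # Solve the normal equations X^T X theta = X^T y
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta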
2.2.3 Computation of least squares solution via the singular values decomposition (SVD)
Inverting $X^TX$ can entail numerical problems, so the SVD can be used instead.
Theorem 2.1.
Any matrix $X \in \mathbb{R}^{n\times d}$, $n \geq d$, can be expressed as $X = UDV^T$, where $U \in \mathbb{R}^{n\times d}$ has orthonormal columns $u_1,\dots,u_d$, $D \in \mathbb{R}^{d\times d}$ is diagonal and contains the singular values $\sigma_1 \geq \dots \geq \sigma_d \geq 0$ in its diagonal, and $V \in \mathbb{R}^{d\times d}$ is orthonormal.
Proof.
Let $A = X^TX$. $A$ is square, symmetric and positive semidefinite. Therefore, $A$ is diagonalizable, so it can be written as $A = V\Lambda V^T$, where $V$ is orthogonal and $\Lambda = diag(\lambda_1,\dots,\lambda_d)$ with $\lambda_1 \geq \dots \geq \lambda_r > \lambda_{r+1} = \dots = \lambda_d = 0$, and $r$ is $rank(X)$.
Now, define $\sigma_j = \sqrt{\lambda_j}$, and form the matrix $D = diag(\sigma_1,\dots,\sigma_d)$. Define also, for $j \leq r$, $u_j = \frac{Xv_j}{\sigma_j}$. Then, these vectors are orthonormal:
$$u_i^Tu_j = \frac{v_i^TX^TXv_j}{\sigma_i\sigma_j} \overset{(1)}{=} \frac{\lambda_jv_i^Tv_j}{\sigma_i\sigma_j} \overset{(2)}{=} \delta_{ij},$$
where $(1)$ is because $V$ is formed with the eigenvectors of $A$, and $(2)$ is because $V$ is orthonormal.
Now, we can complete the basis with $u_{r+1},\dots,u_d$ (using Gram-Schmidt) in such a way that $U = [u_1,\dots,u_d]$ is column orthonormal.
Now, if it is the case that $Xv_j = \sigma_ju_j$ for all $j$, then $XV = UD$, so $X = UDV^T$, and it is only left to see that indeed this holds. Consider two cases:
- $j \leq r$: $Xv_j = \sigma_ju_j$ by the definition of $u_j$.
- $j > r$: It is $\|Xv_j\|^2 = v_j^TX^TXv_j = \lambda_j = 0$ because $\lambda_j = 0$. As $\|Xv_j\| = 0$, it must be $Xv_j = 0$. On the other side of the equation we also have 0 because $\sigma_j = 0$.
This, added to the fact that if $X$ has full rank, $D$ is invertible and all its singular values non null, gives us:
$$\hat{\theta} = (X^TX)^{-1}X^Ty = VD^{-2}V^TVDU^Ty = VD^{-1}U^Ty.$$
Intuitive interpretation
The intuition behind the SVD is summarized in Figure 2. Basically, every linear transformation can be decomposed into a rotation, a scaling and a simpler transformation (column orthogonal).
The intuition behind SVD lies in the idea of finding a low-rank approximation of a given matrix. The rank of a matrix is the number of linearly independent rows or columns it contains. A high-rank matrix has many linearly independent rows or columns, which makes it complex and difficult to analyze. On the other hand, a low-rank matrix has fewer linearly independent rows or columns, which makes it simpler and easier to analyze.
SVD provides a way to find the best possible low-rank approximation of a given matrix by decomposing it into three components. The left singular vectors represent the direction of maximum variance in the data, while the right singular vectors represent the direction of maximum correlation between the variables. The singular values represent the magnitude of the variance or correlation in each direction.
By truncating the diagonal matrix of singular values to keep only the top-k values, we can obtain a low-rank approximation of the original matrix that retains most of the important information. This is useful for reducing the dimensionality of data, compressing images, and solving linear equations, among other applications.
Example 2.1.
How to use SVD in Python and Matlab.
import numpy as np

# Thin SVD: X = U @ np.diag(d) @ Vt
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(1/d)            # inverse of the diagonal matrix of singular values
theta = Vt.T @ D @ U.T @ y  # theta = V D^{-1} U^T y
Algorithm 1: SVD in Python.
[U, S, V] = svd(X, 'econ');   % thin SVD: X = U*S*V'
D = diag(1./diag(S));         % inverse of the singular values
theta = V*D*U'*y;             % theta = V * S^{-1} * U' * y
Algorithm 2: SVD in Matlab.
2.3 Things that could go wrong when using linear regression
2.3.1 Our independent variable is not enough
It is possible that our variable $x$ does not provide enough information to predict $y$.
2.3.2 The relationship between the variables is not linear (underfitting)
It is also possible that the variables are related in non-linear ways.
2.3.3 Outliers affect the fit
In the presence of outliers, the model obtained can be distorted, leading to bad results.
2.4 Basis Functions
In order to fix the second problem (Subsection 2.3.2), we can make use of basis functions. The idea is to apply different transformations to the data, so that we can extend the expressive power of our model.
Definition 2.1.
A feature mapping is a non-linear transformation of the inputs, $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^m$.
The resulting predictive function or model is $\hat{y}(x) = \theta^T\phi(x)$.
Example 2.2.
For example, we can consider the polynomial expansion of degree $k$, which is a commonly used feature mapping that approximates the relationship between the independent variable and the dependent variable to be polynomial of degree $k$, i.e.:
$$\hat{y} = \theta_0 + \theta_1x + \theta_2x^2 + \dots + \theta_kx^k.$$
The feature mapping is $\phi(x) = (1, x, x^2, \dots, x^k)$.
Note that the idea is to transform the data so that the fit is still linear in the parameters, even if the relationship is not. Of course, this requires applying the same transformation whenever we receive an input for which we want to make predictions. Also, the resulting model is more complex, so complexity control is necessary to avoid overfitting.
When we apply $\phi$ to the input matrix $X$, we get a new input matrix, $\Phi$, and we obtain the optimal solution as before:
$$\hat{\theta} = (\Phi^T\Phi)^{-1}\Phi^Ty.$$
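As an illustration (with synthetic data, which is an assumption of this sketch), a degree-3 polynomial fit in NumPy; note that the fit itself is still linear in $\theta$:
import numpy as np

def poly_design_matrix(x, k):
    # phi(x) = (1, x, x^2, ..., x^k), applied row-wise
    return np.vander(x, k + 1, increasing=True)

x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * np.random.randn(50)

Phi = poly_design_matrix(x, 3)
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]          # linear fit in theta
y_new = poly_design_matrix(np.array([0.5]), 3) @ theta  # same transformation at prediction time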
Example 2.3.
A MATLAB example
2.5 Probabilistic approach
A review on probability theory, Bayes theorem and Bayesian Learning is in Appendix A.
2.5.1 Least squares regression from a probabilistic perspective
We are now going to derive the linear regression estimates using the principle of maximum likelihood and the univariate Gaussian distribution, whose probability density function is given by
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
Given a sample $S = \{x_1,\dots,x_n\}$, where $x_i \sim p(x \mid \theta)$ independently, we define its likelihood as the function
$$L(\theta) = \prod_{i=1}^n p(x_i \mid \theta).$$
In Figure 6, we can see how the likelihood relates to 'how likely it is that our points have been created from a certain distribution', because the red outcomes are more likely to appear from the blue distribution than the green ones.
Now, this can be used to select the distribution that best matches our data. As an easy approach, suppose we want to decide between two distributions $p_1$ and $p_2$, and we have a dataset $S$. To decide, we can compute $L_1 = \prod_i p_1(x_i)$ and $L_2 = \prod_i p_2(x_i)$, and select the distribution whose likelihood is greater. This is visually exemplified in Figure 7, where we can see that given the three red outcomes and the two distributions (blue and green), the blue one should be preferred, because it maximizes the likelihood.
This way, we can think of the likelihood as a function of the unknown parameters of the distribution, with the dataset fixed, and we can maximize this function to obtain the parameters that best describe our data.
In the probabilistic setting of linear regression, we assume that each label we observe is normally distributed, with mean $\theta^Tx_i$ and variance $\sigma^2$:
$$y_i = \theta^Tx_i + \varepsilon_i,$$
where $\varepsilon_i \sim N(0, \sigma^2)$. This way, we seek to obtain the $\theta$ and $\sigma$ that best describe our data. Remember that $y_i \sim N(\theta^Tx_i, \sigma^2)$, so that the likelihood of the parameter vector is given by
$$L(\theta, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \theta^Tx_i)^2}{2\sigma^2}}.$$
It is usual to maximize the log-likelihood instead, basically because the likelihood tends to give values too close to zero (we may be multiplying thousands of small values), so numerical problems may arise. Thus:
$$l(\theta, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \theta^Tx_i)^2.$$
At this point, we differentiate and set equal to 0, obtaining
$$\hat{\theta} = (X^TX)^{-1}X^Ty \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{\theta}^Tx_i)^2.$$
It is noticeable that the maximum likelihood estimates coincide with the estimates we found minimizing the squared error. This is a consequence of assuming Gaussian noise; assuming other types of noise distribution corresponds to minimizing different error functions.
2.6 Bias-Variance decomposition
Let $f$ be the true function that we are trying to approximate and $D = \{(x_i, y_i)\}_{i=1}^n$ a finite training dataset, where $y_i = f(x_i) + \varepsilon_i$ and $\varepsilon_i \sim N(0, \sigma^2)$. Let $(x_0, y_0)$ be a test data point. The setup is using $D$ to train a model $\hat{y}(x; D)$ that we want to use to make predictions $\hat{y}(x_0; D)$. We are going to see how the expected squared error $E[(y_0 - \hat{y}(x_0; D))^2]$, where $y_0$ is the real value, can be decomposed as a sum of the following components:
- Irreducible error: given by the noise variance $\sigma^2$.
- Bias: the systematic limitation that the modelling assumptions impose. For example, if we choose linear approximations, we will never be able to model non-linear data well enough.
- Variance: refers to the sensitivity of the model to the training set $D$. The more the model varies when the training set is changed, the higher variance it has.
Let's do this. Writing $\bar{h}(x_0) = E_D[\hat{y}(x_0; D)]$ for the average prediction over training sets, adding and subtracting it inside the square, and using that the noise is independent of $D$:
$$E[(y_0 - \hat{y}(x_0; D))^2] = \sigma^2 + (\bar{h}(x_0) - f(x_0))^2 + E_D[(\hat{y}(x_0; D) - \bar{h}(x_0))^2].$$
And we now define
$$Bias(x_0) = \bar{h}(x_0) - f(x_0), \qquad Var(x_0) = E_D[(\hat{y}(x_0; D) - \bar{h}(x_0))^2].$$
The Bias reflects the expected difference between our assumed model and the real function, while the variance reflects the difference between the assumed model and the obtained model. In Figure 8, we can see:
- The linear model has high bias and low variance.
- The polynomial of degree 3 has low bias and moderate variance.
- The polynomial of degree 8 has low bias but high variance.
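This decomposition can be estimated empirically by re-sampling many training sets and refitting; a sketch in NumPy (the true function, noise level and sample sizes are arbitrary choices for the demo):
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # "true" function, assumed for the demo
x0, sigma = 0.3, 0.2                         # test point and noise level

preds = {deg: [] for deg in (1, 3, 8)}
for _ in range(500):                         # many training sets D
    x = rng.uniform(0, 1, 20)
    y = f(x) + sigma * rng.normal(size=20)
    for deg in preds:
        coefs = np.polyfit(x, y, deg)        # degree-8 fits may warn: they are ill-conditioned
        preds[deg].append(np.polyval(coefs, x0))

for deg, p in preds.items():
    p = np.array(p)
    bias2 = (p.mean() - f(x0)) ** 2          # squared bias at x0
    var = p.var()                            # variance across training sets
    print(deg, bias2, var)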
A summary of commonly used error functions for regression is shown in Table 1.
Name | Abbreviation | Formula
mean squared error | MSE | $\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$
root mean squared error | RMSE | $\sqrt{MSE}$
normalized root mean squared error | NRMSE | $\sqrt{MSE / \widehat{Var}(y)}$
coefficient of determination | $R^2$ | $1 - NRMSE^2$
mean absolute error | MAE | $\frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$
Table 1: Common error functions.
2.7 Ridge Regression from Gaussian prior
We are going to consider MAP estimates in this section, which are explained in Appendix A.
Assume an isotropic Gaussian prior on the $d$-dimensional $\theta$, i.e., $\theta \sim N(0, \lambda^{-1}I)$, so that the prior density is $p(\theta) \propto e^{-\frac{\lambda}{2}\theta^T\theta}$. On the other hand, the likelihood is $p(y \mid X, \theta) \propto e^{-\frac{1}{2\sigma^2}\|y - X\theta\|^2}$. To obtain the MAP, we maximize the log of the product of these expressions, which is equivalent to minimizing
$$E(\theta) = \|y - X\theta\|^2 + \lambda\sigma^2\|\theta\|^2,$$
which is the ridge regression objective with regularization parameter $\lambda\sigma^2$. We can now differentiate this expression to find its minimum:
$$\nabla_\theta E = -2X^T(y - X\theta) + 2\lambda\sigma^2\theta = 0 \implies \hat{\theta} = (X^TX + \lambda\sigma^2I)^{-1}X^Ty.$$
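In code, the resulting closed-form estimator is one line; in this sketch, `lam` stands for the whole effective regularization parameter:
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    # theta = (X^T X + lambda I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)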
2.7.1 Tuning $\lambda$
Cross-validation
Given a dataset $D$, the $k$-fold cross-validation approach starts by separating $D$ into two subsets:
- The training dataset, $D_{tr}$.
- The test dataset, $D_{te}$.
These are obtained in such a way that:
- $D = D_{tr} \cup D_{te}$.
- $D_{tr} \cap D_{te} = \emptyset$.
Now, we divide $D_{tr}$ into $k$ subsets of equal size, $F_1,\dots,F_k$, called folds, and imagine we want to decide on a set of values $\Lambda = \{\lambda_1,\dots,\lambda_m\}$ for our model's parameter $\lambda$.
- For each $\lambda \in \Lambda$:
  - For each fold $i = 1,\dots,k$:
    - Train the model on $D_{tr} \setminus F_i$.
    - Evaluate the model on $F_i$.
  - Average the evaluations to obtain an estimation of the performance of the model.
- Select the $\lambda$ that gives us the best estimation. A sketch of this procedure is shown below.
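A minimal sketch of this procedure for tuning the ridge parameter $\lambda$ (it reuses the ridge() function sketched above, which is an assumption of this sketch):
import numpy as np

def cv_ridge(X, y, lambdas, k=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    scores = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            test = folds[i]
            train = np.hstack([folds[j] for j in range(k) if j != i])
            theta = ridge(X[train], y[train], lam)   # train on all folds but one
            errs.append(np.mean((y[test] - X[test] @ theta) ** 2))
        scores.append(np.mean(errs))                 # average over folds
    return lambdas[int(np.argmin(scores))]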
Leave-one-out cross-validation (LOOCV)
LOOCV is a special case of cross-validation in which $k = |D_{tr}|$, i.e., each data point is a fold.
2.7.2 LOOCV for Ridge regression
As a particularity of linear and ridge regression, for a given value of $\lambda$, only one training is necessary for LOOCV, so we can proceed as follows (see the sketch after this list):
- For each $\lambda$:
  - Compute the optimal solution $\hat{\theta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty$.
  - Compute the hat matrix or smoothing matrix $H(\lambda) = X(X^TX + \lambda I)^{-1}X^T$.
  - Compute LOOCV directly, without folding:
$$LOOCV(\lambda) = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - H_{ii}(\lambda)}\right)^2.$$
- Return the $\lambda$ with minimum LOOCV.
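In code, this shortcut avoids any refitting; a sketch:
import numpy as np

def loocv_ridge(X, y, lam):
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)  # hat matrix H(lambda)
    y_hat = H @ y
    # LOOCV residuals without refitting: (y_i - yhat_i) / (1 - H_ii)
    return np.mean(((y - y_hat) / (1 - np.diag(H))) ** 2)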
2.7.3 Generalized Cross-Validation (GCV)
Generalized Cross-Validation (GCV) is a model selection technique used to estimate the performance of a model in terms of prediction accuracy. GCV is particularly useful for choosing the optimal parameters in regularization or smoothing methods, where the goal is to balance model complexity and goodness of fit. Examples of such methods include ridge regression, LASSO, and smoothing splines.
The main idea of GCV is to approximate the leave-one-out cross-validation (LOOCV) error without actually performing the computationally expensive process of fitting the model to all but one data point multiple times. Also, it is more computationally stable than the previous approach.
The GCV score is defined as follows:
$$GCV(\lambda) = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - tr(H(\lambda))/n}\right)^2,$$
where $tr(H(\lambda))$ is the trace of the hat matrix. That is, GCV replaces each individual leverage $H_{ii}$ of LOOCV by their average.
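A sketch mirroring the LOOCV one, with the only change being the denominator:
import numpy as np

def gcv_ridge(X, y, lam):
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    y_hat = H @ y
    # replace each 1 - H_ii of LOOCV by the average 1 - tr(H)/n
    return np.mean(((y - y_hat) / (1 - np.trace(H) / n)) ** 2)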
2.8 LASSO regression
Definition 2.2.
The $L_1$ norm of a vector $\theta$ is $\|\theta\|_1 = \sum_j|\theta_j|$.
Another very common choice is the $L_1$ norm, which leads to LASSO regression. Thus, LASSO regression minimizes the $L_1$ norm of the parameters together with the squared error:
$$\hat{\theta} = \arg\min_\theta \|y - X\theta\|^2 + \lambda\|\theta\|_1.$$
In fact, LASSO regression arises assuming a Laplace distribution prior over the parameters.
Some characteristics:
- The LASSO regularized cost function is not quadratic anymore, and it has no closed-form solution, so an approximation procedure is used: least angle regression (LARS), which provides an efficient way to compute the solutions for a list of possible values of $\lambda$, giving the regularization path.
- LASSO regression gives sparse solutions, in which many coefficients/coordinates of $\theta$ might be 0. This means that LASSO performs feature selection automatically.
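For illustration, scikit-learn provides an implementation whose alpha parameter plays the role of $\lambda$ (the synthetic data below is just for the demo):
import numpy as np
from sklearn.linear_model import Lasso

X = np.random.randn(100, 10)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(100)  # only 2 relevant features

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most coefficients come out exactly 0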
2.9 The full-Bayesian perspective
Both maximum likelihood (ML) and maximum a posteriori (MAP) produce point estimates of the parameters, while in Bayesian Learning we want the full posterior distribution of the parameters. The idea is that if we know the posterior distribution of the parameters, then we can use all the information provided by this distribution for our predictions, instead of just a single point-estimate that summarizes it. For instance, if the probability function of the posterior is $p(\theta \mid D)$ and we receive a new input point $x$, then we can compute the probability of $y$ by
$$p(y \mid x, D) = \int p(y \mid x, \theta)p(\theta \mid D)d\theta.$$
Now, when we do this for all possible values of $y$, we obtain the full distribution of the predictions, instead of just one estimation. Nonetheless, computing this integral is usually too hard, so it needs to be approximated; but in the context of linear regression all these expressions have closed-form formulas.
Technically, ML and MAP assume that $\theta$ is a fixed (if unknown) vector, so a prediction for a new test point $x$ is going to have a distribution $N(\hat{\theta}^Tx, \sigma^2)$. Note now the lack of flexibility of this approach, since the width of the normal distribution is going to be the same for any new test point, which may be a dangerous assumption.
Let $D = \{(x_i, y_i)\}_{i=1}^n$ be our dataset, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, and assume:
- The $y_i$ are independent given $\theta$.
- $y_i \sim N(\theta^Tx_i, \beta^{-1})$, with $\beta = \frac{1}{\sigma^2}$ being the precision of the noise in the observations $y_i$.
- A spherical or isotropic Gaussian for the parameter's prior, $\theta \sim N(0, \alpha^{-1}I)$.
- $\alpha$ and $\beta$ are known.
- The only parameter variables are the coefficients $\theta$.
Then, the likelihood function is
$$p(y \mid X, \theta) = \prod_{i=1}^n N(y_i \mid \theta^Tx_i, \beta^{-1}) \propto e^{-\frac{\beta}{2}\|y - X\theta\|^2}.$$
As usual, using Bayes, we can derive the posterior distribution $p(\theta \mid X, y) \propto p(y \mid X, \theta)p(\theta)$. Notice here that the exponent of this expression is quadratic in $\theta$, so the posterior is going to be a multivariate Gaussian. We need to turn the exponent into something resembling $-\frac{1}{2}(\theta - m)^TS^{-1}(\theta - m)$, so that we can derive what the mean and the precision of the posterior density are. For this, we are going to complete squares so that we can 'match the terms'; we don't mind about constant terms with respect to $\theta$, since we only care about proportionality. Doing so, the posterior probability is
$$\theta \mid X, y \sim N(m_n, S_n),$$
with
$$S_n^{-1} = \alpha I + \beta X^TX, \qquad m_n = \beta S_nX^Ty.$$
The MAP estimate can be directly obtained from here, since the maximum density is obtained at the mean in any Gaussian distribution. Additionally, in ridge regression we let $\lambda = \frac{\alpha}{\beta}$ and turn it into a parameter that we can tune to control complexity against training error.
2.9.1 Using the posterior distribution for predictions
Let's now see how to compute the predictive distribution, i.e.,
$$p(y \mid x_0, X, y) = \int p(y \mid x_0, \theta)p(\theta \mid X, y)d\theta.$$
For this, we substitute the densities, both of which are Gaussian, and we complete squares again :D The integral of a Gaussian density is 1, so what remains is itself a Gaussian in $y$. Carrying out the computation, the predictive distribution is
$$y \mid x_0, X, y \sim N\left(m_n^Tx_0,\ \frac{1}{\beta} + x_0^TS_nx_0\right).$$
Not only this, but the variance decomposes into the noise variance $\frac{1}{\beta}$ plus the term $x_0^TS_nx_0$, which comes from the posterior uncertainty about $\theta$.
We note now that the predictive distribution's mean prediction equals the point-prediction of the MAP. However, the variance of the prediction does depend on $x_0$, which is good, since the uncertainty of our predictions should depend on how far the observed samples are:
- If observed samples are near our new inputs, then we should be more certain.
- If they are far, then we should be less certain.
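A compact sketch of the posterior and predictive formulas ($\alpha$ and $\beta$ are assumed known, as in the derivation):
import numpy as np

def posterior(X, y, alpha, beta):
    d = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)  # posterior covariance
    m_N = beta * S_N @ X.T @ y                               # posterior mean (= MAP)
    return m_N, S_N

def predictive(x0, m_N, S_N, beta):
    mean = m_N @ x0
    var = 1 / beta + x0 @ S_N @ x0   # grows as x0 moves away from the data
    return mean, var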
3 Clustering
The goal of clustering is to partition a dataset into groups called clusters, in such a way that observations in the same cluster tend to be more similar than observations in different clusters. The input data is embedded in a $d$-dimensional space with a similarity/dissimilarity function defined among elements in the space, which should capture relatedness among elements in this space. Two elements are understood to be related when they are close in the space. Thus, a cluster is a compact group that is separated from other groups or elements outside the cluster.
There is a large variety of clustering algorithms, such as hierarchical bottom-up or top-down clustering, probabilistic clustering, possibilistic clustering, algorithmic clustering, spectral clustering or density-based clustering.
The problem of clustering is quite complex: if we have $n$ data points which we want to separate into $k$ clusters, then there are
$$S(n, k) = \frac{1}{k!}\sum_{j=0}^k(-1)^j\binom{k}{j}(k - j)^n$$
possibilities. This is the Stirling number of the second kind.
If in addition we don't know how many clusters we want to use, we have to add over all possible $k$, summing up to $B_n = \sum_{k=1}^n S(n, k)$ possibilities. This number is the Bell number, which is really gigantic.
3.1 k-Means
The $k$-Means clustering algorithm takes a dataset $D$ and an integer $K$ as input, and separates $D$ into $K$ disjoint clusters. It is a representative-based clustering, meaning that each cluster is represented by one single point. In the case of $k$-Means, the representative is the cluster center, $\mu_k$, i.e., the average of all the points in that cluster. Each point in $D$ is thus assigned to its closest representative point.
If $C_k$ is the $k$-th cluster, then we consider it a better cluster when the value $\sum_{x \in C_k}\|x - \mu_k\|^2$ is smaller.
Now, let's formalize all this. First, we introduce indicator variables $r_{ik} \in \{0, 1\}$, with $r_{ik} = 1$ if and only if $x_i$ is assigned to cluster $k$, and the objective function
$$J(r, \mu) = \sum_{i=1}^n\sum_{k=1}^K r_{ik}\|x_i - \mu_k\|^2,$$
which we aim to minimize by selecting appropriate $r_{ik}$ and $\mu_k$. The issue is that this problem is NP-hard, so we use a heuristic method that is only guaranteed to find local minima. This method relies on two facts:
- For fixed cluster centers $\mu_k$, it is easy to optimize the cluster assignments $r_{ik}$.
Proof.
Assume the $\mu_k$ fixed; then we assign each $x_i$ to the closest $\mu_k$, because if we assigned it to a different center, $\mu_j$, then we could decrease the sum by reassigning it to the closest one.
- For fixed cluster assignments $r_{ik}$, it is easy to optimize the cluster centers $\mu_k$.
Proof.
Assume the $r_{ik}$ fixed; then, setting $\frac{\partial J}{\partial \mu_k} = -2\sum_i r_{ik}(x_i - \mu_k) = 0$, the minimum is obtained at
$$\mu_k = \frac{\sum_i r_{ik}x_i}{\sum_i r_{ik}}.$$
This is the average of the points of the cluster.
The pseudocode is illustrated in Algorithm 3.
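Since the algorithm is short, here is a minimal NumPy sketch of the two alternating steps (empty clusters are not handled, and initialization is plain random sampling):
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]        # random initial centers
    for _ in range(n_iters):
        # assignment step: each point to its closest center
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        # update step: each center to the mean of its cluster
        new_mu = np.array([X[labels == k].mean(0) for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels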
The characteristics of -Means are:
- Pros:
- Easy implementation.
- Fast, even for large datasets.
- Cons:
- Can converge to local minimum.
- Needs the number of clusters, , as input.
- Hard cluster assignments: meaning that each point corresponds to a single cluster, which may not always be what we want.
- Sensitive to outliers and clusters of different sizes and densities.
- Sensitive to initialization, so it is usual to run it many times and keep the best run.
- Biased towards rounded clusters, because it uses the Euclidean distance.
3.2 k-Means++
$k$-Means++ is a variant of $k$-Means that uses a heuristic for initializing the cluster centers, shown in Algorithm 4.
choose first center <@$\mu_1$@> uniformly at random from all available examples
for k=2,...,K do
    choose next center <@$\mu_k$@> at random, picking example <@$x_i$@> with probability proportional to <@$||x_i-\mu_{l(i)}||^2$@>
    here, <@$\mu_{l(i)}$@> is the closest center to <@$x_i$@> picked so far
proceed with standard k-Means
3.3 Choosing the number of clusters
The number of clusters $K$ is a hyper-parameter that has to be set by the user, and there is no obvious way to choose an optimal $K$, since it may not even exist. This is due to the fact that there is no such thing as a true clustering against which to compare. Nevertheless, there are reasonable cluster quality criteria that can be used to select $K$. These criteria measure a balance between separation of clusters and their compactness. Depending on the problem, one criterion or another should be chosen.
3.3.1 Calinski-Harabasz index
The CH index uses Euclidean distances to measure cluster quality, so it is usually used with $k$-Means. It measures the ratio between:
- Separation of cluster centers: the sum of squared distances of the cluster centers to the overall mean, $B = \sum_k n_k\|\mu_k - \mu\|^2$.
- Cluster compactness: the sum of squared distances from each point to its assigned cluster center, $W = \sum_k\sum_{x \in C_k}\|x - \mu_k\|^2$.
Thus, it is:
$$CH(K) = \frac{B/(K - 1)}{W/(n - K)},$$
where $\mu$ is the overall mean of the data and $n_k = |C_k|$. Notice that the quantities are normalized by $K - 1$ and $n - K$ to avoid larger $K$ having better values.
The usual approach is to run $k$-Means with different values of $K$, and then select the $K$ that maximizes the index.
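The index is a direct transcription into code; a NumPy sketch (cluster labels are assumed to be integers 0..K-1):
import numpy as np

def calinski_harabasz(X, labels):
    n, K = len(X), len(np.unique(labels))
    overall = X.mean(0)
    B = sum((labels == k).sum() * np.sum((X[labels == k].mean(0) - overall) ** 2)
            for k in range(K))                   # between-cluster separation
    W = sum(np.sum((X[labels == k] - X[labels == k].mean(0)) ** 2)
            for k in range(K))                   # within-cluster compactness
    return (B / (K - 1)) / (W / (n - K))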
Example 3.1.
A -Means example in Matlab.
3.4 Gaussian Mixtures
A mixture of Gaussians is a distribution that is built using a convex sum of Gaussians, making it more flexible than a single Gaussian distribution. If the components of the mixture are $N(\mu_k, \Sigma_k)$, $k = 1,\dots,K$, and $\pi_k$ are the mixing coefficients, with $\sum_k\pi_k = 1$ and $\pi_k \geq 0$, then the density function of the mixture is given by
$$p(x \mid \theta) = \sum_{k=1}^K \pi_kN(x \mid \mu_k, \Sigma_k),$$
where $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ represents the parameters of the distribution.
The key assumption is that each data point has been generated from only one component; we just don't know which one.
In Figure 9, we can see different Gaussian mixtures that arise from the same components, choosing different mixing coefficients.
3.4.1 Clustering with a Gaussian mixture
The idea is to identify each component with a cluster, so we want to determine the component from which each point in the dataset is most likely to have been created, as well as the parameters of each of the distributions.
Thus, to cluster a dataset into $K$ clusters, the approach is the following:
- Use Expectation-Maximization (EM) to estimate the mixture, obtaining approximations $\hat{\pi}_k$, $\hat{\mu}_k$ and $\hat{\Sigma}_k$ for each $k$.
- Find assignments of each $x_i$ to the clusters.
In this case, the clustering is soft, as opposed to the hard clustering of $k$-Means. This means that we will obtain, for each point, the probability of belonging to each cluster.
3.4.2 A generative mixture of Gaussians
To sample from a mixture of Gaussians, we use a generative model with a latent variable $z \in \{0, 1\}^K$ whose components are all 0, except one which denotes the component from which we sample, and we do:
- Pick component $k$ with probability $\pi_k$. This means that we set $z_k = 1$ with probability $\pi_k$.
- Generate a sample $x \sim N(\mu_k, \Sigma_k)$.
The joint distribution of $x$ and $z$ is given by $p(x, z) = p(z)p(x \mid z)$, and the marginal distribution over $x$ is
$$p(x) = \sum_{k=1}^K \pi_kN(x \mid \mu_k, \Sigma_k).$$
Therefore, we can use Bayes to compute the conditional distribution of $z$ given $x$:
$$\gamma_{ik} := p(z_k = 1 \mid x_i) = \frac{\pi_kN(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_jN(x_i \mid \mu_j, \Sigma_j)}.$$
The quantity $\gamma_{ik}$ indicates how probable it is that the particular data point $x_i$ has been generated by the mixture component $k$. Or, in the context of clustering: how probable it is that $x_i$ belongs to cluster $k$. We use these quantities as the soft membership to each cluster. If a hard membership is needed, then we assign $x_i$ to cluster $k^*$, where $k^* = \arg\max_k \gamma_{ik}$.
3.4.3 Learning Gaussian mixtures with Expectation-Maximization
We have a dataset of unlabelled observations $X = \{x_1,\dots,x_n\}$ and we want to model it as a Gaussian mixture, with unknown parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}$, for a fixed $K$. For this we use the maximum likelihood approach. First, we compute the log-likelihood of $\theta$:
$$l(\theta) = \sum_{i=1}^n \log\left(\sum_{k=1}^K \pi_kN(x_i \mid \mu_k, \Sigma_k)\right).$$
This is hard to optimize... so we use the Expectation-Maximization approach. First, we can differentiate it to see what conditions must hold for local maxima: $\frac{\partial l}{\partial \mu_k} = 0$ leads to
$$\mu_k = \frac{\sum_i \gamma_{ik}x_i}{\sum_i \gamma_{ik}},$$
which is a weighted average of the points in our data, with weights being the soft assignments of each point to cluster $k$.
The problem now is that we cannot know the $\gamma_{ik}$ without $\mu_k$ and $\Sigma_k$. Next, $\frac{\partial l}{\partial \Sigma_k} = 0$ gives us
$$\Sigma_k = \frac{\sum_i \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_{ik}},$$
which is the sample covariance matrix of all the points, weighted by the soft assignments of each point to cluster $k$. We have the same problem!
Since we have the constraint $\sum_k \pi_k = 1$, we now maximize the Lagrangian with respect to $\pi_k$, obtaining
$$\pi_k = \frac{1}{n}\sum_i \gamma_{ik},$$
which is the average of all soft assignments of the points to cluster $k$. Again, the same problem is present.
Therefore, we are in a situation in which we can estimate $\pi_k$, $\mu_k$ and $\Sigma_k$ if we know the $\gamma_{ik}$, and we can compute the $\gamma_{ik}$ from the estimates of $\pi_k$, $\mu_k$ and $\Sigma_k$; and we can use this in our benefit by following the pseudocode depicted in Algorithm 5. Commonly, the initializations are done using the result of $k$-Means in the following manner:
- Run $k$-Means with $K$ clusters.
- Set $\mu_k$ to the mean of cluster $k$.
- Set $\Sigma_k$ to the sample covariance of cluster $k$.
- Set $\pi_k$ as the fraction of examples assigned to cluster $k$.
Algorithm 5: EM algorithm.
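As the pseudocode is not reproduced here, the following is a compact NumPy/SciPy sketch of the two alternating steps (random initialization is used instead of the $k$-Means-based one, for brevity):
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]
    Sigma = np.array([np.cov(X.T) for _ in range(K)])
    pi = np.full(K, 1 / K)
    for _ in range(n_iters):
        # E step: gamma_ik = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)
        gamma = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                 for k in range(K)])
        gamma /= gamma.sum(1, keepdims=True)
        # M step: re-estimate the parameters from the soft assignments
        Nk = gamma.sum(0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
        pi = Nk / n
    return pi, mu, Sigma, gamma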
Example 3.2.
An example of the EM algorithm using Matlab
3.4.4 Special cases
The shape of the Gaussians can be restricted, obtaining special cases of mixtures:
- No restrictions on $\Sigma_k$: the general case, in which each cluster can have general Gaussian shape.
- $\Sigma_k$ diagonal: each Gaussian component is forced to have no correlation among input dimensions.
- $\Sigma_k$ isotropic or spherical ($\Sigma_k = \sigma_k^2I$): each Gaussian component is forced to be spherical, so no correlation among input variables and the same scaling across each input variable.
4 Linear Classifiers
In classification, we have a labelled dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{c_1,\dots,c_K\}$ are labels. Thus, the tuple $(x_i, y_i)$ means that the vector $x_i$ is associated to the label $y_i$. With this setup, we aim at producing a classification model, meaning a function that, given a new input $x$, predicts the label that it should be associated with. Usually, we distinguish between two kinds of classification, attending to the number of labels considered:
- In binary classification, there are only two possible labels, $K = 2$.
- In multi-class classification, there are more than two labels, $K > 2$.
Now, we introduce some useful terms:
Definition 4.1.
The decision regions are a partition of the feature space, $\mathbb{R}^d$, into regions $R_1,\dots,R_K$, such that all points in $R_k$ are assigned the label $c_k$. Intuitively, they are the regions in which we divide the feature space, so that all elements inside a region have the same label.
The decision boundaries are the points in the frontier between decision regions.
A classifier can then be understood as the creation of the decision regions.
A linear classifier is a classifier in which the decision boundaries are $(d-1)$-dimensional hyperplanes.
Example 4.1.
A visualization.
In Figure 10, we can see different classification models and the regions they generate for the iris dataset. The left model corresponds to a linear classifier, and we can see how the decision boundaries are linear. The other two models are not linear.
Many useful classifiers don't just predict the input's class, but also the probabilities of the input belonging to each class. This is desirable, as it enables us to express uncertainty about the prediction that we make. For example, in binary classification it is a common approach to map the target labels to be 0 or 1, and then make predictions as a continuous value in $[0, 1]$. More explicitly, we can have labels $\{healthy, sick\}$ and encode $healthy \rightarrow 0$, $sick \rightarrow 1$, so that given a patient $x$, we obtain a prediction $\hat{y} \in [0, 1]$, indicating the probability of the patient being sick.
In classification with $K$ classes, it is common to encode the labels using one-hot encoding, meaning that we map the labels into the set $\{e_1,\dots,e_K\}$, where $e_k$ is the $k$-th standard basis vector. For example, if we have three labels, then they are encoded as $(1,0,0)$, $(0,1,0)$, $(0,0,1)$. In this scenario, predictions are usually points in the $K$-simplex, i.e., a prediction must satisfy $\hat{y} \in [0,1]^K$ and $\sum_k \hat{y}_k = 1$. Each $\hat{y}_k$ represents the probability of the input belonging to class $c_k$.
4.1 Decision boundary in probabilistic models
From a probabilistic perspective, we can think of the joint probability of examples and labels, $p(x, y)$. When we are building a classifier, we then want to minimize the expected loss, or expected error. Note, nonetheless, that loss here is a bit different from the regression loss. A natural way to think about this loss is through loss or cost matrices. A cost matrix is a table like the following, where the entry $\ell_{jk}$ is the cost of predicting $c_j$ when the real class is $c_k$:
Predicted \ Real | $c_1$ | $c_2$ | ... | $c_K$
$c_1$ | 0 | $\ell_{12}$ | ... | $\ell_{1K}$
$c_2$ | $\ell_{21}$ | 0 | ... | $\ell_{2K}$
... | ... | ... | ... | ...
$c_K$ | $\ell_{K1}$ | $\ell_{K2}$ | ... | 0
This matrix indicates the cost of each error. Note that not all errors need to have the same cost. For example, in a medical context, it has a higher cost to predict that a sick patient is healthy (this person could potentially die), than to predict that a healthy person is sick (in which case further tests would probably correct the mistake). A cost matrix for this case could look like the following:
Predicted \ Real | healthy | sick
healthy | 0 | 60
sick | 10 | 0
We will focus on the case of all errors having the same impact, which is called the 0-1 loss.
Let's now see how, given a new example $x$, we can use a 'rule' to choose the label that $x$ should have. We have random variables $X$ and $Y$ with joint distribution $p(x, y)$. We can compute the expected loss of assigning the label $c_j$ to $x$. For this, we first define the loss function as the 0-1 loss:
$$L(c_k, c_j) = \begin{cases}0 & \text{if } k = j,\\ 1 & \text{otherwise.}\end{cases}$$
Therefore:
$$E[L(y, c_j) \mid x] = \sum_{k=1}^K L(c_k, c_j)P(c_k \mid x) = 1 - P(c_j \mid x).$$
Of course, we want to minimize the expected loss, so we aim at predicting the class minimizing it:
$$\hat{y}(x) = \arg\max_{c_k} P(c_k \mid x).$$
This is called the Bayes classifier, and it is optimal when we use the 0-1 loss. Its error is given by the so-called Bayes error rate (BER):
$$BER = E_x\left[1 - \max_k P(c_k \mid x)\right].$$
Of course, we can use this classifier to partition the feature space into regions $R_1,\dots,R_K$, and we can compute the BER summing over all regions. Before, we claimed that the Bayes classifier is optimal. However, in practice we don't know the distribution $p(x, y)$, so it cannot be implemented exactly. Therefore, $P(c_k \mid x)$ is estimated from data, and these estimates are used for classification, incurring additional errors. To learn $P(c_k \mid x)$, there are two basic approaches, namely discriminative classifiers and generative classifiers.
4.2 Generative classifiers
Generative classifiers learn $P(c_k \mid x)$ through the Bayes rule: $P(c_k \mid x) \propto p(x \mid c_k)P(c_k)$.
4.2.1 Discriminant analysis
Discriminant analysis is the result of implementing a Bayes classifier assuming that the class-conditional distributions are Gaussian. This means that, having $K$ classes, it is
$$p(x \mid c_k) = N(x \mid \mu_k, \Sigma_k).$$
If we also assume that the prior distributions are $P(c_k) = \pi_k$ with $\sum_k \pi_k = 1$, then we define the discriminant functions
$$g_k(x) = \log(\pi_kN(x \mid \mu_k, \Sigma_k)) = \log\pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + const.$$
This is a quadratic discriminant function, and the corresponding classifier is implemented by predicting $\hat{y}(x) = \arg\max_k g_k(x)$. This corresponds to choosing the label with maximum probability a posteriori.
The decision boundaries in this case are those regions in which there exist $i \neq j$ with $g_i(x) = g_j(x)$.
These correspond to hyper-quadrics in the feature space, and this is a quadratic method, usually called quadratic discriminant analysis (QDA).
Of course, we can further simplify our assumptions, by assuming that all labels have the same covariance matrix, $\Sigma_k = \Sigma$ for all $k$. In this simpler case, the discriminant functions end up being
$$g_k(x) = \log\pi_k + \mu_k^T\Sigma^{-1}x - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k,$$
because now $\log|\Sigma|$ is constant for all $k$, so we can remove it. Furthermore, the term $x^T\Sigma^{-1}x$ is also constant with respect to $k$, so it will not affect the chosen label. Therefore, we end up with linear discriminant functions, in which the decision boundaries correspond to hyperplanes in the feature space. This is a linear method, usually called linear discriminant analysis (LDA).
In Figure 10, the left diagram corresponds to an LDA partitioning of the feature space for the iris dataset, while the one in the center is a QDA partitioning.
Further assumptions
Of course, we can make more simplifying assumptions for our model, such as:
- $\Sigma$ is diagonal: the features are conditionally independent given the class.
- $\Sigma$ is isotropic, i.e., $\Sigma = \sigma^2I$.
- All priors are equal, $\pi_k = \frac{1}{K}$, in which case the prior term can be dropped from the discriminant functions.
Distance-based learning perspective
In all the seen cases, we have a minimum-distance classifier in the following sense:
- The general QDA case corresponds to using a different Mahalanobis distance from $x$ to each class center $\mu_k$.
- The LDA case uses the same Mahalanobis distance from $x$ to each class center $\mu_k$.
- In the case of all covariance matrices being equal and diagonal, the distance is the weighted Euclidean distance.
- In the isotropic Gaussians case, the distance corresponds to the usual Euclidean distance.
Implementation
It is usual to use MLE and estimate the centers and covariance matrices using the training dataset. If we define the sets $C_k = \{x_i : y_i = c_k\}$ and $n_k = |C_k|$, then the estimates are
$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k}\sum_{x \in C_k}x,$$
and the covariance matrix is:
- In QDA: $\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{x \in C_k}(x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$.
- In LDA (pooled): $\hat{\Sigma} = \frac{1}{n - K}\sum_{k=1}^K\sum_{x \in C_k}(x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$.
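A compact sketch of QDA with these MLE estimates (NumPy only):
import numpy as np

def qda_fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),   # prior
                     Xc.mean(0),         # center
                     np.cov(Xc.T))       # per-class covariance (QDA)
    return params

def qda_predict(x, params):
    def g(c):
        pi, mu, S = params[c]
        diff = x - mu
        # discriminant: log prior - 1/2 log|S| - 1/2 squared Mahalanobis distance
        return np.log(pi) - 0.5 * np.linalg.slogdet(S)[1] \
               - 0.5 * diff @ np.linalg.solve(S, diff)
    return max(params, key=g)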
Final remarks on discriminant analysis
Bayesian classifiers are optimal when the class-conditional densities and priors are known. This means that, if we know the underlying distributions, then these classifiers are our best choice. Of course, this is not realistic, and estimations need to be made. However, normal distributions appear in a wide range of real scenarios, and even if we have to estimate the centers and covariance matrices, QDA and LDA are a very good choice when data resembles Gaussian distributions. In addition, they are well-principled, having a solid mathematical theory behind them, and they are fast and reliable.
Of course, if the real distribution is very far from being a Gaussian, then the model obtained will be poor, so one should take care of this.
Also, it is important to ensure that we are correctly estimating the parameters of the Gaussians, because otherwise the model will not work, not even with underlying Gaussian data.
And it is clear that once we are relying on sample estimates instead of population parameters, we lose the optimality of the method.
In practice, it is really hard to assess which assumptions hold and which ones do not, so we may be limited to a trial and error approach.
4.2.2 Regularized discriminant analysis (RDA)
When data is scarce, some problems can arise while using discriminant analysis. For example:
- If there are more dimensions than samples with some label, i.e., $n_k < d$ for some $k$, then QDA cannot be applied, because $\hat{\Sigma}_k$ is singular. The reason is that each of the samples is adding a rank-1 matrix, so to get a rank-$d$ matrix we need at least $d$ samples. Not only this: since we are subtracting the sample mean, the last sample is linearly dependent on the previous ones, so we actually need at least $d + 1$ samples to be able to construct a full-rank covariance matrix.
- If $n < d$, then we cannot use QDA nor LDA, as all covariance matrices are singular.
RDA computes the covariance matrices as
$$\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma},$$
where $\alpha \in [0, 1]$ is the regularization parameter and $\hat{\Sigma}$ is the pooled (LDA) covariance matrix. The method is QDA when $\alpha = 1$ and is LDA when $\alpha = 0$. In any other case, it is something in between.
A further way to regularize the matrices is by
$$\hat{\Sigma}_k(\alpha, \gamma) = (1 - \gamma)\hat{\Sigma}_k(\alpha) + \gamma\sigma^2I,$$
where $\gamma \in [0, 1]$, and the diagonal term improves the conditioning of the method.
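Both regularizations fit in a couple of lines; in this sketch, taking $\sigma^2 = tr(S)/d$ for the diagonal term is an assumption:
import numpy as np

def rda_cov(Sigma_k, Sigma_pooled, alpha, gamma=0.0):
    # alpha=1 -> QDA, alpha=0 -> LDA; gamma shrinks towards a scaled identity
    S = alpha * Sigma_k + (1 - alpha) * Sigma_pooled
    return (1 - gamma) * S + gamma * np.trace(S) / len(S) * np.eye(len(S))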
4.3 Naïve Bayes
The Naïve Bayes Classifier is a Bayesian classifier that assumes that the features are pair-wise independent in the class-conditional distribution. This means that the probability can be written as
$$p(x \mid c_k) = \prod_{j=1}^d p(x_j \mid c_k).$$
This assumption does not hold in general, but this approach can provide a good approximation in many cases. Also, it is practical, as the amount of parameters to estimate is small.
As before, we classify the input record in the class that maximizes the discriminant function:
$$g_k(x) = \log\pi_k + \sum_{j=1}^d \log p(x_j \mid c_k).$$
Therefore, we need to
- Estimate the class priors as the sample frequencies, $\hat{\pi}_k = \frac{n_k}{n}$.
- Estimate the class-conditional densities for each input feature independently.
Naïve Bayes can also be used in the case of categorical variables. To model binary features, we can use, for example, the Bernoulli distribution
$$p(x \mid p) = p^x(1 - p)^{1 - x},$$
where $x \in \{0, 1\}$ and $p$ is the probability of the event happening. For a binary feature, we would need to estimate $K$ parameters, one for each class. If we had all our features as binary, then the discriminant functions would be
$$g_k(x) = \log\pi_k + \sum_{j=1}^d\left[x_j\log p_{jk} + (1 - x_j)\log(1 - p_{jk})\right],$$
where $p_{jk}$ is the Bernoulli parameter for feature $j$ and class $k$. Note that this is a linear function with respect to $x$.
If instead we have categorical features with more values, we can then use the categorical distribution
$$p(x_j \mid c_k) = \prod_v p_{jvk}^{[x_j = v]},$$
where $[x_j = v]$ is 1 if $x_j = v$ is true and 0 otherwise, and $p_{jvk}$ is the categorical parameter for the value $v$ of feature $j$ and class $k$.
Now, we need to estimate these parameters, for which we can use the sample frequencies. Note how 0-frequencies can be a problem, so it is a common approach to utilize Laplace smoothing:
$$\hat{p}_{jvk} = \frac{N_{jvk} + \alpha}{N_k + \alpha V_j},$$
where $N_{jvk}$ is the number of examples of class $c_k$ whose feature $j$ takes value $v$, and $N_k$ is the number of examples of class $c_k$. Here:
- $\alpha$ is a weight parameter assigned to the prior distribution of observing the values. It is typically set to 1, just to avoid 0 values.
- $V_j$ is the number of modalities (number of distinct values) of the feature modelled.
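A sketch of categorical Naïve Bayes with this smoothing (features are assumed to be integer-coded, which is an assumption of this sketch):
import numpy as np

def nb_fit_categorical(X, y, alpha=1.0):
    # X: integer-coded categorical features, shape (n, d)
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        cond[c] = []
        for j in range(X.shape[1]):
            V = X[:, j].max() + 1                     # number of modalities of feature j
            counts = np.bincount(Xc[:, j], minlength=V)
            cond[c].append((counts + alpha) / (len(Xc) + alpha * V))  # Laplace smoothing
    return priors, cond

def nb_predict(x, priors, cond):
    def g(c):
        return np.log(priors[c]) + sum(np.log(cond[c][j][x[j]]) for j in range(len(x)))
    return max(priors, key=g)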
Example 4.2.
Naïve Bayes in MATLAB
4.3.1 Gaussian Naïve Bayes
If we have numerical features, the usual approach is to assume they follow Gaussian distributions; then we estimate their mean and variance using MLE. If all features are assumed Gaussian, this approach is equivalent to QDA with diagonal covariance matrices.
Other approaches are:
- Discretize numerical values and proceed with Categorical NB.
- Assume a different distribution and estimate its sample parameters from data.
Note that when the data is mixed, we can assume a different distribution for each feature, and then add the log-likelihoods altogether.
4.4 Perceptron and Logistic Regression
The perceptron is a mathematical model of the functioning of a neuron in our brain. In Figure 11 we can see a diagram of a neuron. Other neurons transmit signals into our neuron via the dendrites, then 'something' happens inside the neuron, and it transmits or not signals through its axon towards the neurons to which it is connected.
This behavior was mathematically modeled by Rosenblatt in his pioneering paper [4]. He did this with the concept of the perceptron: a unit that computes a weighted sum of its inputs, $S(x) = W^Tx$, and outputs an activation $F(S(x))$, where $F$ is a threshold function such as the sign function.
This is depicted in Figure 12.
This way, we can classify the input variables into two classes, 1 or -1. But... what are the weights?
Example 4.3.
Imagine we have the following classification function then it is easy to construct a perceptron that can classify the instances. We have:
- .
- .
- Therefore, we have
And we are good!
In this previous example we can grab some intuition on how the weights are chosen, but doing this by hand is not scalable at all. When we have several input variables, this process gets complicated quite a lot. Luckily, the perceptron algorithm helps us estimate the weights.
This algorithm is an on-line algorithm, meaning that it processes one training example at a time, updating $W$ incrementally. The training examples are pairs $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \{-1, 1\}$. Also, we can scale all $x$ to lie in the unit sphere, because this also scales the hyperplanes of classification and the classes remain the same. The algorithm goes as follows:
init W = 0
for epoch in range(1, e):
    for each training pair (X, y):
        prediction = <@$F(S(W^TX))$@>
        if prediction != y:
            # update W towards correctly classifying X
            W = W + lr*y*X
return W
Algorithm 6: Perceptron algorithm (input X, classes y, learning rate lr, epochs e) -> weights W
The updates are made like that because we can consider the error function $E(W) = -yW^TX$ only when $X$ is misclassified. In that case, $yW^TX < 0$, and thus the minus sign. We want to minimize this error, for which we follow a gradient descent approach (we will deepen on this later), so we update as
$$W_{new} = W - lr\cdot\nabla_WE = W + lr\cdot yX.$$
The basic idea is that if we now utilize these new weights on the same input, we would get
$$W_{new}^TX = W^TX + lr\cdot y\|X\|^2.$$
Now, say $y = 1$; then it is $W_{new}^TX > W^TX$, so we have made the input closer to being positive, and thus correctly classified. If $y = -1$, the same logic applies.
In fact, under some conditions, convergence is ensured:
Theorem 4.1.
Perceptron convergence theorem
For any finite set of linearly separable labeled examples, the Perceptron Learning Algorithm will halt after a finite number of iterations. In other words, after a finite number of iterations, the algorithm yields a vector W that classifies perfectly all the examples.
Example 4.4.
A simple perceptron in MATLAB
We can improve the approach by choosing a different activation function, ideally one that is differentiable. A usual alternative is the logistic function
$$\sigma(z) = \frac{1}{1 + e^{-z}},$$
which maps $\mathbb{R} \rightarrow (0, 1)$, so that its output can be interpreted as a probability, allowing us to represent uncertainty in the prediction. This function has the following properties:
- It verifies a pseudo-symmetry: $\sigma(-z) = 1 - \sigma(z)$.
- It is differentiable, with derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
- Its inverse is the logit function, $z = \log\frac{\sigma}{1 - \sigma}$.
When we use the logistic function, we are performing what is called logistic regression for binary classification. Again, we have a dataset of pairs $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$, associating label 1 with a positive example, and 0 with a negative example. In logistic regression we model
$$P(y = 1 \mid x) = \sigma(w^Tx),$$
where, again, $w$ is the weight vector. Therefore, we are modelling $y \mid x \sim Bernoulli(\hat{y})$, where $\hat{y} = \sigma(w^Tx)$. Notice that we don't assume anything about the distribution of $x$, though.
Why is logistic regression a linear classifier?
When we classify, we need to set a threshold above which we classify the label to be 1. Usually, this threshold is set to be $\frac{1}{2}$. Now,
$$\sigma(w^Tx) \geq \frac{1}{2} \iff w^Tx \geq 0.$$
And remember that the inverse of $\sigma$ is the logit, so, in fact, we are classifying using the same criterion as before: the hyperplane $w^Tx = 0$ separates the classes.
At this point, we can try to use maximum likelihood in the search of an optimal $w$:
$$L(w) = \prod_{i=1}^n \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i}.$$
As usual, we take the log-likelihood, but in this case we minimize the negative log-likelihood instead, which can be interpreted as an error function:
$$E(w) = -\sum_{i=1}^n\left[y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right].$$
This error expression is known as log loss or binary cross-entropy, and it is widely used in classification schemes.
Now, as we aim to find its minimum, we compute its gradient. For this, recall that $\hat{y}_i = \sigma(w^Tx_i)$ and that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so that $\nabla_w\hat{y}_i = \hat{y}_i(1 - \hat{y}_i)x_i$. Carrying out the derivative and simplifying:
$$\nabla_wE = \sum_{i=1}^n(\hat{y}_i - y_i)x_i.$$
In this case, it is not possible to find a closed-form solution, so we use iterative methods to approximate local minima. We are going to use gradient descent, which is one of the simplest approaches, but it is widely used.
4.4.1 Gradient descent
Gradient descent is a general numerical method to find local minima of a differentiable function . The idea is to use the fact that the gradient of a vector function points in the direction of maximum growth, and the negative gradient points in the direction of maximum decrease. Therefore, if we 'follow' this direction, we should approach a minimum of the function, although it might not be the global minimum.
The approach works as follows. To approximate a local minimum of :
- We start at a weight vector $w^0$ and set $t = 0$.
- Compute the gradient $\nabla E(w^t)$.
- Update the weight vector:
$$w^{t+1} = w^t - \gamma\nabla E(w^t).$$
Also update $t \leftarrow t + 1$. The parameter $\gamma$ is called the learning rate, and it quantifies 'how much' we advance in the direction of the negative gradient. It has to be chosen beforehand and it is critical, since:
  - If $\gamma$ is too big, we might jump over the minimum and the method might never converge.
  - If $\gamma$ is too small, the method might converge too slowly.
  Related to the learning rate is the importance of feature scaling, as the same learning rate impacts all features, so if they have different scales, the method might be good for some of them, but bad for others.
- Repeat until convergence or a maximum number of steps is reached.
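A minimal sketch of gradient descent on the binary cross-entropy (features are assumed standardized, since a single learning rate is shared by all of them):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, epochs=1000):
    # y in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y)   # gradient of the cross-entropy
        w -= lr * grad / len(y)
    return w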
4.4.2 Newton's algorithm
Newton's algorithm is an algorithm used to find roots of a function, and it converges faster than gradient descent and does not need a learning rate parameter. As this algorithm finds roots (i.e., zeros), we don't apply it directly to $E$, but rather to its gradient. Thus, in this case, we need $E$ to be twice differentiable and we must be able to afford computing its second derivatives, which can sometimes be costly.
In this case, the algorithm works as follows:
- We start at a weight vector $w^0$ and set $t = 0$.
- Compute the gradient $\nabla E(w^t)$ and the Hessian of $E$, $H(w^t) = \nabla^2E(w^t)$.
- Update the weight vector
$$w^{t+1} = w^t - H(w^t)^{-1}\nabla E(w^t)$$
and set $t \leftarrow t + 1$.
- Repeat until convergence or maximum number of steps is reached.
Iterated Reweighted Least Squares (IRLS)
When we apply Newton's method to the negative log-likelihood of logistic regression, it is called the IRLS method. The gradient is $\nabla E = X^T(\hat{y} - y)$ and the Hessian is $H = X^TRX$, where $R = diag(\hat{y}_i(1 - \hat{y}_i))$; each Newton step is then a weighted least squares solve, hence the name.
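A sketch of the resulting Newton/IRLS iteration:
import numpy as np

def irls(X, y, n_iters=20):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1 / (1 + np.exp(-X @ w))   # current predictions
        R = p * (1 - p)                # diagonal of the weight matrix
        grad = X.T @ (p - y)
        H = X.T @ (R[:, None] * X)     # Hessian X^T R X
        w -= np.linalg.solve(H, grad)  # Newton step
    return w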
Example 4.5.
Logistic regression example using MATLAB
4.4.3 Multi-class logistic regression
The multi-class case, $K > 2$, is handled by having a separator $w_k$ for each class $c_k$. In this case, instead of a Bernoulli distribution for the targets, we use a categorical distribution, with
$$\hat{y}_{ik} = \frac{e^{w_k^Tx_i}}{\sum_{j=1}^K e^{w_j^Tx_i}}.$$
Each target is represented with its one-hot encoding.
In this case, the likelihood function is
$$L(w_1,\dots,w_K) = \prod_{i=1}^n\prod_{k=1}^K \hat{y}_{ik}^{y_{ik}},$$
and the cross-entropy loss is
$$E = -\sum_{i=1}^n\sum_{k=1}^K y_{ik}\log\hat{y}_{ik}.$$
As in the binary case, we optimize this by using gradient descent or Newton's method.
4.4.4 Regularization
Finally, note that we can add regularization to the weights, in the same way we did for linear regression, by adding the $L_1$ or $L_2$ norm of the weights to the cross-entropy loss:
$$E_{L_2}(w) = E(w) + \lambda\|w\|_2^2 \qquad \text{and} \qquad E_{L_1}(w) = E(w) + \lambda\|w\|_1.$$
Note that this is for the binary case, but the general case is similar.
5 Nearest Neighbor Prediction
The nearest neighbor predictor uses the local neighborhood of an input point to compute a prediction. A well known family of this kind is the $k$-nearest neighbors ($k$NN) models, which predict the output of the input point by combining the known outputs of the $k$ nearest training data points, for example by voting if we are classifying the data.
The approach of these methods is quite straightforward, and is detailed in Algorithm 7.
Training
Store all training examples
Prediction
Given an input X:
1. Compute distance/similarity with all examples in the training set.
2. Locate <@$k$@> closest points.
3. Emit prediction by combining outputs of the <@$k$@> closest points.
Algorithm 7: Pseudocode.
Of course, we have to decide (a minimal sketch follows this list):
- The distance/similarity function.
- How many neighbors to choose, $k$.
- How to combine the outputs of the nearest neighbors to emit the final prediction.
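A minimal sketch of the whole predictor for classification, using the Euclidean distance and majority vote:
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distance to every training example
    dists = np.sqrt(((X_train - x) ** 2).sum(1))
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    # majority vote (classification); use votes.mean() for regression
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]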
5.1 Locality: similarities and distances
It is important to decide what distance or similarity between points means in our feature space, since this will determine the neighborhoods of the points.
Definition 5.1.
A distance function is a function $d : X \times X \rightarrow \mathbb{R}$ verifying:
- Non-negativity: $d(x, y) \geq 0$.
- Zero distance iff equal: $d(x, y) = 0 \iff x = y$.
- Symmetry: $d(x, y) = d(y, x)$.
- Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$.
A similarity function is a function $s : X \times X \rightarrow \mathbb{R}$ verifying:
- Ranged: $s(x, y) \in [0, 1]$ or $[-1, 1]$.
- Completely similar iff equal: $s(x, y) = 1 \iff x = y$.
- Symmetry: $s(x, y) = s(y, x)$.
Example 5.1.
Some examples of distances are:
- The Minkowski distance family:
$$d_p(x, y) = \left(\sum_{j=1}^d|x_j - y_j|^p\right)^{1/p},$$
which has as special cases the Euclidean distance ($p = 2$) and the Manhattan distance ($p = 1$).
- The Mahalanobis distance is an interesting distance between points that takes into account the covariance matrix of the features, $\Sigma$. Its formula is
$$d(x, y) = \sqrt{(x - y)^T\Sigma^{-1}(x - y)},$$
which might already be familiar to you, since we have used it for some probabilistic methods.
Some examples of similarities are:
- The cosine similarity function:
$$s(x, y) = \frac{x^Ty}{\|x\|\|y\|},$$
which leverages the dot product. Therefore, this captures 'how parallel' $x$ and $y$ are.
- The Pearson correlation measure is similar to cosine similarity, but centers the data:
$$s(x, y) = \frac{(x - \bar{x})^T(y - \bar{y})}{\|x - \bar{x}\|\|y - \bar{y}\|},$$
where $\bar{x}$ and $\bar{y}$ are the means of the points.
- The Hamming distance is computed between two sequences of bits, and it is just the proportion of common bits.
- The Jaccard coefficient is also computed between two sequences of bits, and it is the number of positions where both bits are 1, divided by the number of positions where at least one bit is 1.
For example, if we have $x = (1, 0, 1, 1)$ and $y = (1, 1, 0, 1)$, then the vectors agree in 2 of the 4 positions, so the Hamming similarity is $\frac{2}{4}$; and they share a 1 in 2 positions, out of 4 positions where at least one has a 1, so the Jaccard coefficient is also $\frac{2}{4}$.
5.2 Choosing $k$
Nearest neighbor methods are very sensitive to the chosen value of $k$. If it is too low, it is easy to overfit; if it is too large, we can underfit. Therefore, we need to choose it thoughtfully. This will depend on the dataset, and the typical approach is to use cross-validation, or other resampling methods, seeing $k$ as a hyperparameter that trades off bias and variance of the resulting model.
5.3 How to combine outputs to make predictions
5.3.1 Classification
- Majority vote, breaking ties randomly.
- Distance-weighted vote: a more advanced approach that weights the votes higher for closer points and lower for further points.
5.3.2 For regression
- Use the average of the outputs of the nearest neighbors.
- Use the weighted average of the outputs of the nearest neighbors. Again, the weights should be inversely proportional to distance or proportional to similarity.
When we weight the predictions, we can relax the constraint of using only $k$ examples, and we can even use the whole training set for making predictions, since this approach lowers the chances of underfitting.
5.4 Decision boundaries for nearest neighbors classifier
In 1-nearest neighbor, the decision regions correspond to the union of each example's Voronoi cell, with the appropriate class. Given a set of points $S$, the Voronoi cell of a point $x \in S$ corresponds to the set of points whose nearest neighbor in $S$ is $x$. For example, the following shows two Voronoi diagrams of 6 different points:
The decision boundaries and regions are non-linear, but they get smoother as we increase the value of . This effect can be observed in the following Figure:
5.5 Final considerations
- Making predictions can be slow, specially if the training dataset is large. To improve the prediction speed, there are several approaches:
- Use data structures like kd-trees to speed up neighbor retrieval, although this is only good when we have a moderate amount of features. A kd-tree, or k-dimensional tree, is a binary search tree data structure used to organize points in k-dimensional space. In the context of kNN models, kd-trees can help speed up neighbor retrieval by partitioning the space and reducing the search area for nearest neighbors. However, the efficiency gain of kd-trees is reduced when there are many features (i.e., high-dimensional data), as the search time approaches that of a linear search.
- Use prototypes. A protoype is a representative point or a summary of a group of points that belong to the same class. Using prototypes can reduce the size of the training dataset while still maintaining its overall structure. The idea is to replace multiple points with a single prototype, thus reducing the number of distance calculations during prediction. Examples of prototype selection methods include the nearest neighbor condensation algorithm and the edited nearest neighbor rule.
- Use approximate neighbors, for example using locality sensitive hashing. The main idea behind LSH is to hash the input items in such a way that similar items are more likely to be hashed to the same bucket. By doing this, it reduces the search space for nearest neighbors and allows for faster retrieval. LSH trades off some accuracy for improved speed, making it suitable for large-scale applications where an approximate answer is acceptable.
- kNN is prone to overfitting. To avoid this, usual approaches include:
- Remove noisy examples from the training data, for example, data points whose nearest neighbors are of a different class.
- Again, use prototypes.
- Set the appropriate value of .
- It suffers from the curse of dimensionality: as the dimension increases, all points tend to be at similar distances from each other, so the notion of 'nearest' loses contrast.
- Suffers from the presence of irrelevant features, so feature selection is an important pre-processing step.
- Standardization of features is crucial to avoid domination of features with larger values.
6 Trees and Random Forests
An ensemble method is a method that combines two or more predictors, instead of using a single prediction. These are useful when we have several models that are better than just a random guess, and that are independent of each other. The combination of the models can be done by averaging predictions, in regression, or by majority vote, in classification.
The main types of ensemble methods are:
- Stacking: involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all sub-models are trained based on the complete training set, then the meta-model is fitted based on the outputs — meta-features — of the sub-models in the ensemble. Therefore, stacking allows you to use multiple heterogeneous, possibly weak learning models and "stack" them together in a manner that allows you to use information from their predictions to make a final prediction which often has better performance.
- Bagging: is a way to decrease the variance of the prediction by generating additional training sets from the original dataset, sampling with replacement to produce multi-sets of the original data. Bagging helps to avoid overfitting by averaging or voting the predictions from the multiple models. Random Forest is a classic example of a bagging algorithm.
- Boosting: is a sequential technique in which the first algorithm is trained on the entire data set, and the following algorithms are built by fitting the residuals of the first algorithm, thereby giving higher weight to those observations that were poorly predicted by the previous model. It relies on creating a series of weak learners each of which might not be good at the entire problem, but might be good at recognizing some part of it, and then combining their predictions to get the final prediction. The idea is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far. Gradient Boosting, AdaBoost and XGBoost are examples of boosting algorithms.
6.1 Trees
A regression tree partitions the feature space into axis-parallel regions and predicts using the average of training points that fall into that region. We can think of a tree in terms of nodes and edges: each node in the tree specifies a test of some attribute, and each edge descending from that node corresponds to one of the possible outcomes for the test. The leaf nodes (or terminal nodes) of the tree contain an output value which is used to make a prediction. When a new data point is presented to the tree for prediction, it is routed down the tree based on the outcome of the tests in each node, starting from the root and ending at a leaf node. The value in the leaf node is then returned as the prediction.
A classification tree also partitions the feature space into axis-parallel regions, each of them representing a class. It is NP-hard to find the optimal tree of minimum size, so greedy approaches are usually used. A general approach is explained in Algorithm 8.
1. Feature selection: find the feature that best splits the data into two subsets, minimizing an impurity function.
2. Binary splitting: decide how to split on the feature. If it is categorical, split on each possible value. If it is continuous, find a value that divides it in two, minimizing the impurity.
3. Recursion: repeat steps 1 and 2 until the stopping criterion is met.
4. Pruning: prune the tree to avoid overfitting.
Algorithm 8: Training a tree.
We are going to use the Gini impurity metric to create trees.
Let $S$ be a subset of examples of the input data $D = \{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, K\}$. Let $p_k$ be the proportion of examples in $S$ that belong to class $k$. Then, the Gini impurity metric is computed as
$$G(S) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2.$$
Therefore, to find the node with which to expand the tree, we need to find the pair $(j, s)$, where $j$ is a feature and $s$ a split value for it, such that the weighted impurity of the induced partition $S_L = \{x \in S : x_j \le s\}$, $S_R = \{x \in S : x_j > s\}$ is minimized:
$$(j^*, s^*) = \arg\min_{j, s} \; \frac{|S_L|}{|S|} G(S_L) + \frac{|S_R|}{|S|} G(S_R).$$
In the case of a regression tree, the approach is similar, but instead we minimize the sum of squared errors relative to the average target value in each induced partition. In other words, we now seek the pair $(j^*, s^*)$ such that
$$(j^*, s^*) = \arg\min_{j, s} \; \sum_{x_i \in S_L} (y_i - \bar{y}_L)^2 + \sum_{x_i \in S_R} (y_i - \bar{y}_R)^2,$$
where $\bar{y}_L$ and $\bar{y}_R$ are the average target values in $S_L$ and $S_R$.
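To make the greedy split search concrete, here is a minimal Python sketch (my own illustration, not the course's MATLAB example); the helper names gini and best_split are invented for this sketch, and thresholds are taken as midpoints between consecutive unique values, a common but not mandatory choice.

import numpy as np

def gini(y):
    """Gini impurity 1 - sum_k p_k^2 of a label vector y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search for the (feature, threshold) pair minimizing
    the weighted Gini impurity of the induced binary partition."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        # candidate thresholds: midpoints between sorted unique values
        vals = np.unique(X[:, j])
        for s in (vals[:-1] + vals[1:]) / 2:
            left = X[:, j] <= s
            w_impurity = (left.sum() * gini(y[left])
                          + (~left).sum() * gini(y[~left])) / n
            if w_impurity < best[2]:
                best = (j, s, w_impurity)
    return best

# tiny usage example: a perfectly separable 1-feature dataset
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # splits at 2.5 with weighted impurity 0.0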
Example 6.1.
Classification trees in MATLAB
6.2 Random Forests
Let's look more closely at the bagging technique, which, as we said, can be used to reduce the variance of our estimates by averaging the estimates obtained from independent models. Assume we have models $f_1, \dots, f_B$ and an input $x$. Each model produces an estimate $\hat{y}_b = f_b(x)$ which tries to predict the real value $y$. Let's suppose these estimates are unbiased, i.e., $\mathbb{E}[\hat{y}_b] = y$. Then, the average of the estimates is
$$\bar{y} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b, \qquad \mathbb{E}[\bar{y}] = \frac{1}{B} \sum_{b=1}^{B} \mathbb{E}[\hat{y}_b] = y.$$
Therefore, this average is also an unbiased estimate of the value $y$. Let's see what happens with the variance when we perform this average. If we assume that the models are independent and that they all have the same variance $\sigma^2$, then it follows that
$$\operatorname{Var}(\bar{y}) = \frac{1}{B^2} \sum_{b=1}^{B} \operatorname{Var}(\hat{y}_b) = \frac{\sigma^2}{B}.$$
Thus, the variance is reduced by a factor of $B$, making the predictions more consistent.
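A quick numerical sanity check of this $\sigma^2 / B$ reduction, as a NumPy sketch of mine (the Gaussian noise model is only an assumption for the demo):

import numpy as np

rng = np.random.default_rng(0)
B, trials, sigma = 25, 100_000, 2.0

# B independent unbiased estimators of y = 0, each with variance sigma^2
estimates = rng.normal(loc=0.0, scale=sigma, size=(trials, B))
averaged = estimates.mean(axis=1)

print(np.var(estimates[:, 0]))  # ~ sigma^2 = 4.0
print(np.var(averaged))         # ~ sigma^2 / B = 0.16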
Trees typically suffer from high variance (overfitting), so they seem like a good candidate for this technique. The application of bagging to trees gives rise to random forests.
First, a bootstrapped dataset is created by randomly selecting samples from the original dataset with repetition. The samples that are not chosen for the bootstrapped dataset form a separate dataset, the out-of-bag (OOB) dataset.
Then, we want to generate a diverse collection of trees that are better than a random choice and as independent from each other as possible. If we construct all these trees using the usual algorithm, they will be very similar, so we have to inject stochasticity into the process. This can be done in several ways:
- The bootstrap sampling introduces stochasticity, but it is usually not enough to make the trees independent.
- We can manipulate the features or targets by leaving some of them out. This is done independently for each tree, so that different trees focus on different features, providing the independence that we seek.
- We can change the learning parameters, the impurity function, etc.
- We can combine the above methods.
The pseudocode of the random forest training algorithm is shown in Algorithm 9.
for b = 1 to n_estimators do
    Sample $D_b$ as a bootstrap of max_samples examples from $D$
    Create tree $T_b$ on $D_b$, adding nodes by:
        - Select max_features variables at random from all variables
        - Pick the best variable/split point according to Gini/SSE
        - Split the current node according to the split found
end
Output forest $\{T_1, \dots, T_B\}$
Algorithm 9: random_forest(D, n_estimators, max_samples, max_features)
To make predictions on a test example $x$:
- For classification: output class probabilities, or the majority vote among $T_1(x), \dots, T_B(x)$.
- For regression: output the average prediction $\frac{1}{B} \sum_{b=1}^{B} T_b(x)$.
Then, to estimate the generalization error, we can use the OOB dataset to compute the OOB error. This estimate can be used as a validation error to select appropriate values for the hyperparameters, so we do not need to perform cross-validation.
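As an illustrative sketch (using scikit-learn rather than the MATLAB used elsewhere in these notes), the hyperparameters of Algorithm 9 map directly onto RandomForestClassifier, and setting oob_score=True yields the OOB estimate just discussed; the dataset choice is arbitrary:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# n_estimators, max_samples and max_features as in Algorithm 9;
# oob_score=True computes the out-of-bag generalization estimate
forest = RandomForestClassifier(
    n_estimators=200,
    max_samples=0.8,      # fraction of D used in each bootstrap
    max_features="sqrt",  # variables drawn at random at each split
    oob_score=True,
    random_state=0,
).fit(X, y)

print(forest.oob_score_)  # OOB accuracy, no cross-validation needed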
6.2.1 Interpretability of random forests
If n_estimators is too big (there are many trees in the forest), it can be difficult to comprehend the decision process of the model: it is not interpretable. We can draw a variable importance plot to better interpret the results:
- Gini-based variable importance: we can add up the Gini impurity gains of each variable over all the splits of every tree in the forest, and sort the variables by their total. This approach is biased towards categorical variables with many possible splits.
- Permutation-based variable importance: for each variable, we can permute its values and compute the difference in the OOB error metrics. If a variable is important, accuracy on the permuted copy should decrease. We then sort the variables by this difference. This approach is more reliable, but slower (see the sketch below).
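A sketch of the permutation approach using scikit-learn's permutation_importance; note that, unlike the description above, this utility scores on a held-out set rather than on the OOB samples (the Gini-based importances are available as feature_importances_):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# shuffle each feature in turn and measure the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:5])  # indices of the five most important features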
6.2.2 Proximities
The idea of proximity of samples in a tree is that when two examples fall into the same leaf of a decision tree, this is evidence that the two examples are similar in some sense.
In the wider context of the forest, we can build a similarity matrix, where each entry corresponds to the fraction of trees of the forest in which the two examples end up in the same leaf. This matrix can be used in distance/similarity-based methods, or with PCA to obtain a numeric representation of the data.
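A possible NumPy sketch of this proximity matrix, assuming a scikit-learn forest, whose apply method returns the leaf index of each example in each tree:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, b] = index of the leaf of tree b into which example i falls
leaves = forest.apply(X)

# proximity[i, j] = fraction of trees in which i and j share a leaf
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
print(proximity.shape)   # (150, 150)
print(proximity[0, :5])  # similarities of example 0 to the first five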
6.2.3 Imbalanced data in classification
In random forests, there are several techniques to deal with imbalanced data:
- Use an evaluation metric that is robust to class imbalance (rather than plain accuracy) for model selection.
- Under-sampling majority class/over-sampling minority class when building the bootstrapped dataset.
- Use of weights when computing errors.
7 Multi-Layer Perceptron (Neural Networks)
In Section 4.4 we saw the Perceptron, a mathematical model of a neuron that can be used for binary classification. We noted that this model is equivalent to a linear classifier, and is thus very limited in more general scenarios. In this section, we will see how to extend this model to perform more complex tasks.
7.1 Multiclass classification
We saw how to use a single perceptron to perform binary classification, and there is a very natural way to extend this model to multi-class classification, using a one-hot encoding for the class.
For instance, let's work with a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, K\}$. We now encode each $y_i$ as a one-hot vector, i.e., $y_i \in \{0, 1\}^K$ with a 1 in position $k$ and 0 elsewhere, for class $k$. The idea is then to use $K$ perceptrons, each of them focusing on predicting one of the classes and trained independently. The idea is depicted in Figure 15. Notice that in this case it is almost compulsory to use the logistic function (or a similar function, but definitely not the step function), since this way we can interpret the outputs as probabilities, and therefore select the class with highest probability as our prediction. Also, remember that in $x$ we are adding an artificial 1 to account for the bias.
Let's formalize this idea a little bit. We have $K$ independent perceptrons, each of them predicting one output variable, so that if we call $g$ our activation function, we obtain, for each $k$,
$$\hat{y}_k = g(w_k^T x),$$
where $w_k$ is the weight vector of perceptron $k$. We can integrate all this into a single formula by introducing the notation $g(v)$, meaning that $g$ is applied to each component of the input vector $v$. This way, we can unify the previous equations as
$$\hat{y} = g(W^T x),$$
where we have arranged the weight vectors as the columns of a matrix $W = (w_1, \dots, w_K)$. Now we can define what a layer is: a layer is just a set of parallel neurons, and a layer with $K$ neurons outputs $K$ variables. Usually, all neurons in a layer use the same activation function $g$. Note that if this is not the case, then the previous explanation needs to be slightly adapted.
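Rendered in NumPy, the layer equation $\hat{y} = g(W^T x)$ is a one-liner (a sketch of mine; the logistic choice for $g$ and the shapes are assumptions of the sketch):

import numpy as np

def layer(W, x, g=lambda a: 1 / (1 + np.exp(-a))):
    """One layer of K parallel perceptrons: y_hat = g(W^T x).
    W has shape (d+1, K); x includes the artificial leading 1 for the bias."""
    return g(W.T @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # d = 3 inputs (+ bias), K = 3 classes
x = np.array([1.0, 0.2, -0.5, 0.9])   # leading 1 accounts for the bias
print(layer(W, x))                    # three per-class "probabilities"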
Since we have discussed that a good alternative in multiclass classification is to use the sigmoid function as activation function, we can rewrite our model as
$$\hat{y} = \sigma(W^T x).$$
Note, nonetheless, that here we are obtaining a collection of independent probabilities, one for each class, which could be further exploited to increase the information in our prediction. For instance, instead of taking the maximum of these independent probabilities, we can add a softmax function to the outputs, resulting in a normalized vector that can be considered a probability vector. The softmax function is
$$\operatorname{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.$$
Also, when doing this, it is not necessary to apply the activation function $g$ beforehand. This approach is shown in Figure 16.
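A minimal sketch of the softmax (the max-subtraction is a standard numerical-stability detail I add; it does not change the result):

import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) does not change
    the result but avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # raw layer outputs, no sigmoid needed
p = softmax(z)
print(p, p.sum())  # a proper probability vector summing to 1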
Note that we are still in the linear classification scenario, and the next step is to improve the flexibility of the model to be able to cope with more complex relationships.
7.2 Multi-Layer Perceptron
Notice now an interesting fact: if our activation function is $g : \mathbb{R} \to I$, where $I$ is the range of $g$, then a layer with $K$ perceptrons can be interpreted as a function
$$f : \mathbb{R}^{d+1} \to I^K.$$
For the sake of simplicity, let's say $f : \mathbb{R}^d \to \mathbb{R}^K$. Then, it is natural to think of stacking different layers, so that the output of a layer becomes the input of the following one, effectively performing the composition of the functions defined by these layers. For example, if we have two layers, which we interpret as two functions $f_1 : \mathbb{R}^d \to \mathbb{R}^m$ and $f_2 : \mathbb{R}^m \to \mathbb{R}^K$, we can take the output of the first layer and use it as input for the second one, effectively obtaining
$$f(x) = f_2(f_1(x)).$$
This is interesting mainly because a single layer, as we have seen, is only able to perform linear classification, so let's look more closely at what is happening in the second layer. The second layer uses an activation function $g_2$, which is applied in each perceptron of the second layer as
$$z^{(2)}_k = g_2\!\left(w_k^{(2)T} z^{(1)}\right).$$
This means that we are:
- Creating linear combinations of the input variables, $a^{(1)} = W^{(1)T} x$, where $x \in \mathbb{R}^{d+1}$ (the extra component is to account for the bias).
- Applying a transformation to each combination of variables. Note that this transformation can be arbitrary, but it should be non-linear if we want to leave the linear world, and differentiable if we want a smooth training process.
- For the second layer, the inputs are the transformed combinations of the input variables, which effectively become a basis of functions (just as polynomials are), and these are combined again by the weights $W^{(2)}$.
- Finally, we apply $g_2$, which is again an activation function.
The key steps for going beyond linearity are the second and third. Let's compare this to ordinary linear regression. In that case, we introduced basis functions to be able to leave the linear world, by transforming the input variables. But then we still used linear regression over this transformation. Therefore, it was the modification of the input that allowed us to model non-linear relationships, and not the algorithm itself, which was not changed at all. Here, the idea is similar. In fact, we could perform those same transformations in this new setting, and keep applying just one layer of perceptrons to obtain non-linear classifiers. But this is exactly the power of the multi-layer perceptron: we don't need to assume or guess the shape of the basis functions! The weights of the different layers will adapt to our problem, effectively auto-selecting what these functions should be. Of course, this is not free, and the cost is paid in the form of harder training.
After this explanation, we can integrate the previous formulas as
$$\hat{y} = g_2\!\left(W^{(2)T} g_1\!\left(W^{(1)T} x\right)\right).$$
We can now define the multi-layer perceptron as a composition of layers of perceptrons, with the addition of what is called the input layer, which is not really a layer of perceptrons, but rather represents and fixes the size of the input, and the output layer, which is the final layer, outputting the predictions of the model. The rest of the layers, the real perceptron layers, are called hidden layers. This is depicted in Figure 17, where the first hidden layer has $m_1$ perceptrons, the second one has $m_2$ perceptrons, and the output layer has as many perceptrons as possible labels, $K$.
More generally, a neural network is a directed graph whose nodes are perceptrons. If there are no cycles within the network, it is a feed-forward neural network (FFNN); it is called a recurrent neural network (RNN) when cycles exist within the network.
MLPs are thus a special case of feed-forward neural networks, in which:
- Neurons are arranged in layers.
- There is at least one hidden layer.
- Every layer is fully connected to the next one.
- No connections are allowed within layers.
A brief visualization of different kinds of networks is depicted in Figure 18.
7.3 Error functions
7.3.1 Regression
The error in the regression case can be the empirical mean squared error, i.e.,
$$E(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}(x_i; W)\right)^2,$$
where the dataset is $D = \{(x_i, y_i)\}_{i=1}^n$ and $\hat{y}(x_i; W)$ is the prediction for $x_i$ with the weight matrices $W$. Note that in this formula we are assuming a single output, but it can be easily generalized to $K$ outputs as
$$E(W) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \left(y_{ik} - \hat{y}_k(x_i; W)\right)^2.$$
As usual, our objective will be to minimize this error.
7.3.2 Binary classification
If the target values are $y_i \in \{0, 1\}$, then we can express
$$P(y \mid x; W) = \hat{y}(x; W)^{y} \left(1 - \hat{y}(x; W)\right)^{1 - y}.$$
If we assume independent and identically distributed examples, then we can define our likelihood function as
$$L(W) = \prod_{i=1}^{n} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}, \qquad \hat{y}_i = \hat{y}(x_i; W),$$
and the negative log-likelihood defines the cross-entropy error:
$$E(W) = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right].$$
7.3.3 Multi-class classification
The last error formula can be generalized for multi-class classification with $K$ classes, as the generalized cross-entropy error:
$$E(W) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}.$$
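Both cross-entropy errors are direct to code; in this sketch of mine the eps clipping is an implementation detail to avoid log(0), not part of the formulas:

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Negative log-likelihood of Bernoulli targets y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Generalized cross-entropy for one-hot targets Y of shape (n, K)."""
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, None)))

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, y_hat))  # ~0.685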
7.4 Training the MLP: Backpropagation
Notice that the error functions we have defined can be very complex, and there is no closed-form solution to minimize them. Therefore, some iterative approach is necessary, as we did with the perceptron. But we now encounter a new problem: if we do gradient descent on the neural network, we need to compute lots of gradients, and computing a gradient is not cheap. The idea for making this process more efficient is to use the chain rule and to notice that many of the gradients we compute are actually reused in many places along the network when computing the final gradient. This observation leads to an efficient procedure for computing gradients: the backpropagation algorithm.
7.4.1 The chain rule
The chain rule states that, given two composable differentiable functions $f$ and $g$, it holds that
$$(f \circ g)'(x) = f'(g(x)) \, g'(x).$$
In the multivariate case, when we want to compute the derivative of $f(g_1(x), \dots, g_m(x))$, it is
$$\frac{\partial f}{\partial x} = \sum_{j=1}^{m} \frac{\partial f}{\partial g_j} \frac{\partial g_j}{\partial x}.$$
This is sometimes written by naming $u_j = g_j(x)$ and $z = f(u_1, \dots, u_m)$, so that the formula becomes
$$\frac{\partial z}{\partial x} = \sum_{j=1}^{m} \frac{\partial z}{\partial u_j} \frac{\partial u_j}{\partial x}.$$
Example 7.1.
Let's do an example to understand how this can be helpful in the case of neural networks.
Imagine we have the functions: and we want to compute the gradient of in terms of . We can write this as a diagram:
We simply need to compute the gradient using the chain rule. Basically, the idea is to reuse as many values as possible (notice the coloured expressions). To do this, we can compute local derivatives in each layer and propagate them back to the previous layer, multiplying them along the way. The previous computation can be carried out on the diagram as follows:
Note that:
- We can avoid unnecessary computations by storing at each node:
- Forward values: when we evaluate the function at a node, we can keep this value stored in the node, because we will need it later. In this example we did symbolic differentiation, so this was not necessary, but in neural networks we first need to do the forward pass, and then assess the error.
- Local derivatives: when we evaluate a local derivative, we can keep its value stored in the node, because we will use it several times.
- Backwards gradients: local derivatives are sent to those nodes that will need them, so that each of these derivatives is computed only once. The gradient is then constructed little by little, backwards, in a dynamic-programming manner.
- When the gradients are flowing backwards, each node has to sum all incoming gradients, according to the multivariate chain rule (if there is only one incoming gradient, we just take that one).
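To make the bookkeeping concrete, here is a tiny reverse-mode sketch of mine (not the notes' worked diagram) for the function $z = (xy + \sin x)^2$: forward values are stored, local derivatives are propagated backwards, and the two paths through $x$ are summed, as the multivariate chain rule dictates:

import math

x, y = 1.5, -2.0

# forward pass: store every intermediate value
a = x * y            # a = x*y
b = math.sin(x)      # b = sin(x)
c = a + b            # c = a + b
z = c ** 2           # z = c^2

# backward pass: local derivatives, multiplied and summed backwards
dz_dc = 2 * c
dz_da = dz_dc * 1.0                       # dc/da = 1
dz_db = dz_dc * 1.0                       # dc/db = 1
dz_dx = dz_da * y + dz_db * math.cos(x)   # x feeds two paths: sum them
dz_dy = dz_da * x

print(dz_dx, dz_dy)
# check against the analytic gradient: dz/dx = 2(xy + sin x)(y + cos x)
print(2 * c * (y + math.cos(x)), 2 * c * x)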
7.4.2 The backpropagation algorithm
The pseudocode for backpropagation is shown in Algorithm 10. Let's break it down.
- a[layer] $= a^{(l)}$ is the input (pre-activation) vector at layer $l$.
- z[layer] $= z^{(l)} = g(a^{(l)})$ is the output vector of layer $l$, where $g$ is the activation function of layer $l$ (note that in the pseudocode we assume the same $g$ in all layers).
- W[layer] $= W^{(l)}$ is the weight matrix of layer $l$.
- w[layer] $= w^{(l)}$ is the weight matrix of layer $l$, without the weights of the bias.
- d[layer] $= \delta^{(l)}$ is the local gradient at layer $l$, computed as $\delta^{(c+1)} = g'(a^{(c+1)}) \odot (z^{(c+1)} - y)$ for the output layer (d[c+1] in the pseudocode), and as $\delta^{(l)} = g'(a^{(l)}) \odot (w^{(l+1)} \delta^{(l+1)})$ for the rest of the layers, i.e., the hidden layers. Here, $\odot$ is the Hadamard product, i.e., the component-wise product; in the pseudocode it is written '.*'.
1. Forward pass
- z[0] = x  # network input, with the artificial bias component
- for layer in {1...c+1}
    a[layer] = W[layer]^T * z[layer-1]
    z[layer] = g[a[layer]]  # apply g[.] to each component of a[layer]
2. Backward pass
- d[c+1] = dg[a[c+1]] .* (z[c+1]-y)  # .* is the component-wise product and dg is the derivative of g
- grad[c+1] = z[c]*d[c+1]^T
- for layer in {c...1}
    d[layer] = dg[a[layer]] .* (w[layer+1]*d[layer+1])  # w[k] is W[k] without the bias weight
    grad[layer] = z[layer-1]*d[layer]^T
- return grad
Algorithm 10: Backpropagation.
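A compact NumPy sketch of Algorithm 10 for a fully connected MLP with logistic activations and squared error, matching the output-layer delta above (my own rendering; the weight layout, with the bias weights in the first row of each matrix, is an assumption of this sketch):

import numpy as np

def g(a):                 # logistic activation
    return 1 / (1 + np.exp(-a))

def dg(a):                # its derivative, g(a)(1 - g(a))
    s = g(a)
    return s * (1 - s)

def backprop(Ws, x, y):
    """One forward/backward pass of Algorithm 10. Ws[l] has shape
    (1 + n_in, n_out); the first row holds the bias weights. Returns
    the gradients dE/dW[l] for E = 1/2 ||z_out - y||^2."""
    # forward pass, storing pre-activations a and outputs z
    zs, As = [np.concatenate(([1.0], x))], []
    for W in Ws:
        a = W.T @ zs[-1]
        As.append(a)
        zs.append(np.concatenate(([1.0], g(a))))  # prepend bias unit
    out = zs[-1][1:]

    # backward pass: local gradients d, then grad[l] = z[l-1] d[l]^T
    grads = [None] * len(Ws)
    d = dg(As[-1]) * (out - y)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(zs[l], d)
        if l > 0:
            d = dg(As[l - 1]) * (Ws[l][1:] @ d)  # drop the bias row of W
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(5, 2))]  # 2 inputs -> 4 hidden -> 2 outputs
grads = backprop(Ws, x=np.array([0.5, -1.0]), y=np.array([1.0, 0.0]))
print([gr.shape for gr in grads])  # same shapes as the weight matrices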
Example 7.2.
Let's perform an example. Here is our neural network:
It is a fully connected two-layer network. Let's follow the algorithm, defining each of the variables one at a time.
- We define and :
-
- Define and :
- Define and :
- Define and :
And that's it! Now we can use the gradient at each layer to train the network, updating the weights in layer $l$ according to
$$W^{(l)} \leftarrow W^{(l)} - \eta \, \nabla_{W^{(l)}} E,$$
where $\eta$ is the learning rate.
7.5 Some activation functions
7.5.1 Logistic
The logistic function is $\sigma : \mathbb{R} \to (0, 1)$, defined by
$$\sigma(a) = \frac{1}{1 + e^{-a}},$$
with derivative
$$\sigma'(a) = \sigma(a)(1 - \sigma(a)).$$
In this case, for the backpropagation algorithm, we can use
$$g'(a^{(l)}) = z^{(l)} \odot (1 - z^{(l)}),$$
which is a very efficient formula, since all these values are already computed in the forward pass.
7.5.2 Hyperbolic tangent
The hyperbolic tangent function is $\tanh : \mathbb{R} \to (-1, 1)$, defined by
$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}},$$
with derivative
$$\tanh'(a) = 1 - \tanh^2(a).$$
Thus, we end up with another very efficient formula:
$$g'(a^{(l)}) = 1 - z^{(l)} \odot z^{(l)}.$$
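Both derivative shortcuts, reusing the forward-pass outputs, in a short sketch of mine:

import numpy as np

def logistic(a):
    return 1 / (1 + np.exp(-a))

def dlogistic_from_output(z):
    """sigma'(a) reusing the forward-pass output z = sigma(a)."""
    return z * (1 - z)

def dtanh_from_output(z):
    """tanh'(a) reusing the forward-pass output z = tanh(a)."""
    return 1 - z ** 2

a = np.linspace(-2, 2, 5)
z = logistic(a)
print(np.allclose(dlogistic_from_output(z), logistic(a) * (1 - logistic(a))))  # True
z = np.tanh(a)
print(np.allclose(dtanh_from_output(z), 1 / np.cosh(a) ** 2))  # True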
Example 7.3.
Implementing a NN for digit classification in MATLAB
A Notes on probability theory, Bayes theorem and Bayesian learning
These notes are adapted from [3].
A.1 Probability theory basics
Let $\Omega$ be a sample space, i.e., the set of possible outcomes of an experiment, and let $A \subseteq \Omega$ be an event. A probability measure is a function $P$ that assigns a real number $P(A)$ to every event $A$. This number represents how likely it is that the experiment's outcome is in $A$.
For $(\Omega, P)$ to be a probability space, we have to impose three axioms:
- $P(A) \ge 0$ for every event $A$.
- $P(\Omega) = 1$.
- $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
From these axioms, some consequences can be derived:
- $P(A^c) = 1 - P(A)$.
Proof.
We can write $\Omega = A \cup A^c$ with $A \cap A^c = \emptyset$, so we have $1 = P(\Omega) = P(A) + P(A^c)$, thus $P(A^c) = 1 - P(A)$.
- $P(\emptyset) = 0$.
Proof.
$P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 0$.
- If $A \subseteq B$, then $P(A) \le P(B)$.
Proof.
In this case, we can write $B = A \cup (B \setminus A)$, with $A \cap (B \setminus A) = \emptyset$, so $P(B) = P(A) + P(B \setminus A) \ge P(A)$, and we have the inequality.
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Proof.
We have $A \cup B = A \cup (B \setminus A)$ and $B = (A \cap B) \cup (B \setminus A)$, so $P(A \cup B) = P(A) + P(B \setminus A) = P(A) + P(B) - P(A \cap B)$.
- $P(A) \le 1$.
Proof.
This is obvious from the previous results, since $A \subseteq \Omega$ and $P(\Omega) = 1$.
A.1.1 Joint probability
It is usual to be interested in the probability of two events happening simultaneously. This is called the joint probability of events $A$ and $B$:
$$P(A, B) = P(A \cap B).$$
The joint probability is useful when an event can be decomposed into simpler disjoint events, i.e., we have $B = B_1 \cup \dots \cup B_m$, with $B_i \cap B_j = \emptyset$ for $i \ne j$. In this situation, we can use the sum rule:
$$P(A) = \sum_{j=1}^{m} P(A, B_j).$$
This is also known as marginalization: when we know the joint probability $P(A, B)$ and we want to compute $P(A)$, we marginalize out $B$. This basically means that if we know the probabilities of all possible pairs $(A, B_j)$, we can know the probability of $A$ by exploring all the possibilities. Here, 'exploring' means using the sum rule:
$$P(A) = \sum_{b} P(A, B = b), \qquad \text{or} \qquad P(A) = \int P(A, B = b) \, db$$
if $B$ is continuous.
Example A.1.
We have two events:
- $S$: a person earns more than 100k, or earns less than 100k.
- $J$: the person is a professor, a software engineer, or a data scientist.
From some sources of information, we are able to determine the joint probabilities $P(S, J)$ for every pair of values; marginalizing out $J$ with the sum rule, we can then conclude the probability of each salary range on its own.
A.1.2 Conditional probability
The conditional probability of $A$ given $B$ is the probability that $A$ occurs, knowing that $B$ has occurred. This means that we have to restrict the space of possible outcomes from $\Omega$ to $B$:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}.$$
If we rearrange the terms, we obtain the product rule:
$$P(A, B) = P(A \mid B) P(B).$$
This formula can be generalized to an arbitrary number of events, giving the general product rule:
$$P(A_1, \dots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1}).$$
Exercise A.1.
Prove that $P(A, B, C) = P(A \mid B, C) P(B \mid C) P(C)$. Then, prove the general product rule.
Basically, we are asked to prove the general product rule by induction.
For $n = 2$, we have already proven it (it is the product rule).
For $n = 3$, we have
$$P(A, B, C) = P(A \mid B, C) P(B, C) = P(A \mid B, C) P(B \mid C) P(C),$$
which is what we wanted.
Now, assume it is true for $n - 1$. Then, we have
$$P(A_1, \dots, A_n) = P(A_n \mid A_1, \dots, A_{n-1}) P(A_1, \dots, A_{n-1}) = P(A_n \mid A_1, \dots, A_{n-1}) \prod_{i=1}^{n-1} P(A_i \mid A_1, \dots, A_{i-1}) = \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1}).$$
A.1.3 Bayes rule
Bayes' theorem gives an alternative formula for the conditional probability:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$
Proof.
Assuming all involved probabilities are non-zero:
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A) P(A)}{P(B)}.$$
This rule is useful to update the probability of an event happening when we are able to gather new information about related events. $P(A)$ is usually called the prior probability, and $P(A \mid B)$ is the a posteriori probability, which means that we have observed $B$ and want to update the probability estimate for $A$.
Example A.2.
Example of the Bayes rule in action
An English-speaking tourist visits a city whose language is not English. A local friend tells him that 1 in 10 natives speak English, 1 in 5 people in the streets are tourists, and that half of the tourists speak English. Our visitor stops someone in the street and finds that this person speaks English. What is the probability that this person is a tourist?
We have $P(T) = 1/5$, $P(E \mid T) = 1/2$ and $P(E \mid T^c) = 1/10$, where $T$ denotes being a tourist and $E$ speaking English.
We want to update our knowledge about the event of this person being a tourist. The prior probability is $P(T) = 0.2$, but since we know that this person speaks English, we have new information useful for updating this probability.
First, the total probability of someone speaking English is
$$P(E) = P(E \mid T) P(T) + P(E \mid T^c) P(T^c) = 0.5 \cdot 0.2 + 0.1 \cdot 0.8 = 0.18.$$
Now, the a posteriori probability of the person being a tourist, knowing that he speaks English, is
$$P(T \mid E) = \frac{P(E \mid T) P(T)}{P(E)} = \frac{0.1}{0.18} \approx 0.56.$$
As we can see (and as we should expect), knowing that the person speaks English increases our confidence that he is a tourist.
A.2 Bayes rule in the context of learning
As has been explained, Bayes' rule allows us to reason about hypotheses from data:
$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}.$$
In the jargon of parameters and datasets, this is: let $\theta$ be a random variable with support $\Theta$, and let $D$ be the data that has been observed. Then,
$$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)}.$$
Here, $P(\theta)$ is the prior distribution of $\theta$: the distribution we assume before observing $D$. $P(D \mid \theta)$ is the likelihood of $\theta$: the probability of observing $D$ if the parameters are $\theta$. $P(D)$ is the evidence or expected likelihood. $P(\theta \mid D)$ is the posterior distribution of $\theta$, our quantity of interest, expressing what we know about $\theta$ after having observed $D$.
Thus, we can continue this line of thought to tackle a new way of building a model: given some data $D$, find the best possible values for the unknown parameters $\theta$, so that the posterior or the likelihood is maximized.
There are, basically, two different approaches:
- Maximum likelihood: in this case, we choose $\theta$ so as to maximize its likelihood:
$$\hat{\theta}_{ML} = \arg\max_{\theta} P(D \mid \theta).$$
- Maximum a posteriori: in this case, we take into account a prior distribution for $\theta$ and estimate its value by maximizing its posterior distribution:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta) P(\theta).$$
A.3 Maximum likelihood estimation
Given a sample $D = \{x_1, \dots, x_n\}$, where the $x_i$ are independent and identically distributed observations from a random variable $X$ following a distribution $p(x \mid \theta)$, with $\theta$ the parameters of the distribution, our objective will be to obtain the best values for $\theta$ according to our data, assuming some particular form for $p$.
For this, the likelihood can be used: since the $x_i$ are independent, the probability of observing the sample is
$$P(D \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
The likelihood function is thus defined as
$$L(\theta) = P(D \mid \theta).$$
Note that this is computed with the data $D$ fixed, so it is not a probability distribution, but a function of $\theta$. This way, the maximum likelihood estimator for $\theta$ is given by
$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta).$$
There is a numerical issue here, though: we are multiplying probabilities, which are values between 0 and 1, and we are likely to be multiplying many of them, so we expect to obtain values very close to 0, which can lead to underflow in computations. Thus, it is convenient to use the log-likelihood:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
Example A.3.
Compute the maximum likelihood estimator (MLE) for a univariate Gaussian distribution.
For this, we assume the data $D = \{x_1, \dots, x_n\}$, where each $x_i \sim N(\mu, \sigma^2)$. In this situation, the parameters are $\mu$ and $\sigma^2$. The likelihood function is
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}.$$
Now, the log-likelihood is
$$\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.$$
At this point, to obtain the MLE, we maximize this function with respect to $\mu$ and $\sigma^2$:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 \iff \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0 \iff \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2,$$
where in the last step we substitute the value obtained for $\hat{\mu}$.
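A quick NumPy check of these closed forms on synthetic data (a sketch of mine; the true parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)  # true mu = 3, sigma = 2

mu_ml = x.mean()                      # MLE of the mean
var_ml = np.mean((x - mu_ml) ** 2)    # MLE of the variance (the 1/n, biased one)

print(mu_ml, var_ml)  # ~3.0 and ~4.0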
Example A.4.
Compute the MLE for a Bernoulli distribution.
Now the observations $x_1, \dots, x_n \in \{0, 1\}$ are the results of $n$ coin tosses. We have to compute the parameter $p$ of the Bernoulli random variable, whose probability function is given by
$$P(x \mid p) = p^{x} (1 - p)^{1 - x}.$$
The log-likelihood function is
$$\ell(p) = \sum_{i=1}^{n} \left[ x_i \log p + (1 - x_i) \log(1 - p) \right].$$
We differentiate:
$$\frac{d\ell}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p}.$$
This derivative is zero if and only if
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
A.4 Properties of estimators
Definition A.1.
If we have a dataset $D = \{x_1, \dots, x_n\}$ where the $x_i$ are drawn from a random variable $X$, we call an estimator any function $\hat{\theta} = f(x_1, \dots, x_n)$ of the sample.
Usually, we focus on estimators that tell us something about the underlying distribution of $X$. For example, it is usual to assume that $X$ belongs to a certain family of random variables, so that we need to estimate its parameters $\theta$.
There are some properties that are considered desirable for an estimator to have, such as small (or zero) bias, $\mathbb{E}[\hat{\theta}] - \theta$, and small variance.
Example A.5.
Show that
Let's start from the definition:
Example A.6.
Compute the bias and the variance of the ML estimates of a univariate Gaussian. Show that $\hat{\sigma}^2$ is biased and that we can correct its bias by using a different estimator,
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2.$$
Compute the bias and the variance of this new estimator.
Let's start with $\hat{\sigma}^2$:
Now $s^2$: and we see that this one is, in fact, unbiased.
A.5 Maximum a posteriori estimation
MAP (maximum a posteriori) estimation is a method of estimating the parameters of a statistical model by finding the parameter values that maximize the posterior probability distribution of the parameters, given the observed data and a prior probability distribution over the parameters. In contexts where the amount of data is limited or noisy, incorporating prior knowledge or beliefs can help to produce more stable and accurate estimates. The prior distribution serves as a regularization term, allowing us to control the degree of influence that the prior has on the estimate:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta) P(\theta)}{P(D)} = \arg\max_{\theta} P(D \mid \theta) P(\theta),$$
where the denominator can be ignored because it is constant for all possible $\theta$, so it does not change the $\arg\max$.
Example A.7.
Find the MAP estimate for $\mu$ of a univariate Gaussian, with Gaussian prior distribution $\mu \sim N(\mu_0, \sigma_0^2)$, where $\sigma^2$, $\mu_0$ and $\sigma_0^2$ are assumed to be known.
Let $D = \{x_1, \dots, x_n\}$, where $x_i \sim N(\mu, \sigma^2)$. We want to maximize
$$P(D \mid \mu) P(\mu) = \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right] \frac{1}{\sqrt{2\pi\sigma_0^2}} e^{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}},$$
or its log,
$$-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{(\mu - \mu_0)^2}{2\sigma_0^2} + \text{const}.$$
Thus, differentiating with respect to $\mu$,
$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) - \frac{\mu - \mu_0}{\sigma_0^2},$$
and this is zero if and only if
$$\hat{\mu}_{MAP} = \frac{\sigma_0^2 \sum_{i=1}^{n} x_i + \sigma^2 \mu_0}{n \sigma_0^2 + \sigma^2}.$$
A different way to obtain this result is to note that the posterior of $\mu$ is itself Gaussian, and to identify its mean and variance by completing the square.
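A small sketch of mine comparing the two estimates: with few, noisy observations the MAP estimate is shrunk towards the prior mean $\mu_0$, and the effect vanishes as $n$ grows (all numbers are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sigma, mu0, sigma0 = 2.0, 0.0, 1.0             # known noise scale and prior
x = rng.normal(loc=3.0, scale=sigma, size=10)  # few, noisy observations

mu_ml = x.mean()
mu_map = (sigma0**2 * x.sum() + sigma**2 * mu0) / (len(x) * sigma0**2 + sigma**2)

print(mu_ml, mu_map)  # MAP is pulled towards the prior mean mu0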
Example A.8.
The maximum likelihood estimate of $p$ for a Bernoulli r.v. is given by $\hat{p} = \frac{1}{n} \sum_i x_i$. If we have $K$ possible outcomes, then we have the categorical distribution, also known as multinoulli or generalized Bernoulli, which has support $\{1, \dots, K\}$. Its parameters are $p_1, \dots, p_K$, representing the probability of observing each of the possible outcomes, with $p_k \ge 0$ and $\sum_{k=1}^{K} p_k = 1$.
It is convenient to use the one-of-$K$ encoding (also called one-hot encoding) for each outcome, so that outcome $x$ becomes a vector with $x_k = 1$ for the observed category and 0 elsewhere. Thus, the pmf of this distribution becomes
$$P(x \mid p) = \prod_{k=1}^{K} p_k^{x_k}.$$
Now, given a sample $D = \{x_1, \dots, x_n\}$ of outcomes of a multinoulli r.v., the maximum likelihood estimate for each $p_k$ is the fraction of observations of category $k$,
$$\hat{p}_k = \frac{N_k}{n}, \qquad N_k = \sum_{i=1}^{n} x_{ik}.$$
We can write this compactly as $\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i$. If some category is not present in our sample, then its corresponding ML estimate is going to be 0. These 0-estimates are problematic in predictive applications, because unseen outcomes in the training data are considered impossible and thus can never be predicted. To avoid this, pseudocounts are used instead. These represent prior knowledge in the form of (imagined) counts $\alpha_k$ for each category $k$. The idea is to assume that the data is augmented with our pseudocounts, and then estimate using maximum likelihood over the augmented data, namely
$$\hat{p}_k = \frac{N_k + \alpha_k}{n + \sum_{j=1}^{K} \alpha_j}.$$
As an example, imagine that we obtain a sample from a die in which the outcome 2 never appears. With all pseudocounts equal to 1, the estimate for 2 becomes $\frac{0 + 1}{n + 6} > 0$: although 2 has not been observed, its probability estimate is not 0. This special case where all pseudocounts are 1 is known as Laplace smoothing.
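A sketch of pseudocount smoothing (the sample and the helper name smoothed_estimates are hypothetical; faces are encoded 0..5, so the face '2' of the text corresponds to index 1 here):

import numpy as np

def smoothed_estimates(sample, K, alpha):
    """ML estimate over data augmented with pseudocounts alpha (length K)."""
    counts = np.bincount(sample, minlength=K)
    return (counts + alpha) / (len(sample) + alpha.sum())

# hypothetical die sample in which the face '2' (index 1) never appears
sample = np.array([0, 2, 2, 5, 3, 0, 4, 5])  # faces encoded 0..5
alpha = np.ones(6)                           # Laplace smoothing
print(smoothed_estimates(sample, K=6, alpha=alpha))  # index 1 gets 1/14 > 0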
Prove that using maximum likelihood with pseudocounts $\alpha_1, \dots, \alpha_K$ corresponds to a MAP estimate with a Dirichlet prior with parameters $\alpha_1 + 1, \dots, \alpha_K + 1$.
A Dirichlet distribution, $\operatorname{Dir}(\beta_1, \dots, \beta_K)$, has the density function
$$f(p_1, \dots, p_K) = \frac{1}{B(\beta)} \prod_{k=1}^{K} p_k^{\beta_k - 1}.$$
To compute the MAP estimate, we use
$$\hat{p} = \arg\max_{p} P(D \mid p) f(p) = \arg\max_{p} \prod_{k=1}^{K} p_k^{N_k + \beta_k - 1}.$$
Thus, we want to maximize this expression or, equivalently, its log,
$$\sum_{k=1}^{K} (N_k + \beta_k - 1) \log p_k, \qquad \text{subject to } \sum_{k=1}^{K} p_k = 1.$$
The Lagrangian is
$$\mathcal{L}(p, \lambda) = \sum_{k=1}^{K} (N_k + \beta_k - 1) \log p_k + \lambda \left(1 - \sum_{k=1}^{K} p_k\right),$$
so that
$$\frac{\partial \mathcal{L}}{\partial p_k} = \frac{N_k + \beta_k - 1}{p_k} - \lambda = 0 \iff p_k = \frac{N_k + \beta_k - 1}{\lambda},$$
and then, imposing the constraint, $\lambda = \sum_{j=1}^{K} (N_j + \beta_j - 1) = n + \sum_j \beta_j - K$. Finally, substituting back with $\beta_k = \alpha_k + 1$, we obtain
$$\hat{p}_k = \frac{N_k + \alpha_k}{n + \sum_{j=1}^{K} \alpha_j},$$
as we wanted!
A.6 Bayesian Learning
In the Bayesian Learning framework, instead of working with point estimates of our unknown parameter variables $\theta$, we work with the whole posterior distribution $P(\theta \mid D)$. In this case, learning is the process by which, starting with some prior belief about the parameters and facing some observations in the form of a dataset $D$, we update our belief about the possible values of our parameters in the form of the posterior distribution. This process is iterated over newly received data, so it can be viewed as a sequential process, in which Bayes' rule is invoked each time we need a new posterior.
Example A.9.
Bayesian learning in MATLAB
A.6.1 Predictive posterior
When doing prediction in this framework, the whole posterior distribution is used. We view a prediction as a weighted average of the predictions that each value of $\theta$ can make, weighted by its posterior probability (i.e., an expected value):
$$P(y^* \mid x^*, D) = \int_{\Theta} P(y^* \mid x^*, \theta) \, P(\theta \mid D) \, d\theta.$$
Example A.10.
An insightful example
References
[1] Jose A. Lorencio Abril, "Apuntes de Inferencia Estadística" (2021).
[2] Marta Arias, "Machine Learning".
[3] Marta Arias, "Notes on probability theory, Bayes theorem and Bayesian learning".
[4] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain", Psychological Review 65, 6 (1958), pp. 386-408.