BDMA - Machine Learning

Fall 2023
Jose Antonio Lorencio Abril
image: 0_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___cision-Modeling_LectureNotes_source_CS-logo.png
Professor: Tom Dupuis
Student e-mail: jose-antonio.lorencio-abril@student-cs.fr

This is a summary of the course Machine Learning taught at the Université Paris Saclay - CentraleSupélec by Professor Tom Dupuis in the academic year 23/24. Most of the content of this document is adapted from the course notes by Dupuis, [1], so I won't be citing it all the time. Other references will be provided when used.

Table of Contents

Part I Deep Learning
Part II Reinforcement Learning

Part I Deep Learning

1 Introduction

Artificial Intelligence is a wide concept, encompassing different aspects and fields. We can understand the term AI as the multidisciplinary field of study that aims at recreating human intelligence using artificial means. This is a bit abstract, and, in fact, there is no single definition for what this means. Intelligence is not fully understood, and thus it is hard to assess whether an artificial invention has achieved intelligence, further than intuitively thinking so.
For instance, AI involves a whole variety of fields:
Leveraging all of these, people try to recreate or even surpass human performance in different tasks. For example, a computer program that can play chess better than any human could ever possibly play, such as Stockfish, or a system that is able to understand our messages and reply, based on the knowledge that it has learnt in the past, such as ChatGPT and similar tools. Other examples are self-driving cars, auto-controlled robots, etc.
Therefore, AI is a very wide term, which merges many different scientific fields. Machine Learning, on the other hand, is a narrower term, which deals with the study of the techniques that we can use to make a computer learn to perform some task. It takes concepts from Statistics, Optimization Theory, Computer Science, Algorithms, etc. A relevant subclass of Machine Learning, which has become one of the most prominent fields of research in recent years, is Neural Networks or Deep Learning, an ML technique inspired by the human brain. It powers many amazing use cases that we see everywhere, like Siri (Apple's assistant), Cortana (Windows' assistant), Amazon's recommender system, Dall-E (OpenAI's image generation system), etc. Not only this, but the trend is growing, and the interest in DL is continuously increasing.
This is partly also due to the increase in computing resources, and the continuous optimization that different techniques are constantly experiencing. For instance, for a model trained on one trillion data points, in 2021 the training process required around 16500x less compute than a model trained in 2012.
But not everything is sweet and roses when using DL. Since these systems are being involved in decision-making processes, some questions arise, like whose responsibility is it when a model fails? Moreover, data is needed to train the models, so it is relevant to address how datasets should be collected, and to respect the privacy of the people who produce the data. In addition, the recent technologies that are able to generate new content and to modify real content raise the new issue that AI can create false information, mistrust, and even violence or paranoia.
Nonetheless, let's not focus on the negative: there are lots of nice applications of DL, and it is a key component for dealing with data, achieving higher performance than traditional ML techniques for huge amounts of data.

1.1 AI History

In 1950, Alan Turing aimed to answer the question 'Can machines think?' through a test, which came to be named the Turing Test, and consists in a three-player game. First, consider a similar game: two talkers, a man and a woman, and one interrogator. The interrogator asks questions to the talkers, with the aim of determining who is the man and who is the woman. The man tries to trick the interrogator, while the woman tries to help the interrogator identify her.
Then, the Turing Test consists in replacing the man by an artificial machine. Turing thought that a machine that could trick a human interrogator, should be considered intelligent.
Later, in 1956, in the Dartmouth Workshop organized with IBM's participation, the term Artificial Intelligence was first used, based on the conjecture that every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it.
From this year on, there was a focus on researching about Symbolic AI, specially in three areas of research:
In 1958, the Perceptron was conceived, giving birth to what is called connectionism, an approach to AI based on the human brain, and a big hype that encouraged funding to support AI research. In this era, scientists experienced a certain lack of perspective, thinking that the power of AI was much higher than it was. For instance, H. A. Simon stated in 1965 that 'machines will be capable, within twenty years, of doing any work a man can do.' We can relate this to our time, with the huge hype that AI is experiencing, as well as the many apocalyptic theories that some people are making. Maybe we are again overestimating the power of AI.
The time from 1974 to 1980 is seen as the first winter of AI, in which research was slowed down and funding was reduced. This was due to several problems found at the time:
In 1980, there was a boom in expert knowledge systems that made AI recover interest. An expert system solves specific tasks following an ensemble of rules based on knowledge provided by experts. A remarkable use case was the XCON sorting system, developed for the Digital Equipment Corporation, which helped them save $40M per year. In addition, connectionism also came back on the scene, thanks to the development of backpropagation applied to neurons, by Geoffrey Hinton. All these achievements brought funding back to the field.
Nonetheless, there came a second winter of AI, from 1987 to 1994, mainly because several companies were disappointed and AI was seen as a technology that couldn't solve a wide variety of tasks. Funding was withdrawn from the field and a lot of AI companies went bankrupt.
Luckily, from 1995 onwards AI returned to industry. Moore's Law states that the speed and memory of computers double roughly every two years, so computing power and memory were rapidly increasing, making the use of AI systems more feasible each year. During this time, many new concepts were introduced, such as intelligent agents, systems that perceive their environment and take actions which maximize their chances of success; or different probabilistic reasoning tools such as Bayesian networks, hidden Markov models, information theory, SVMs,... In addition, AI researchers started to reframe their work in terms of mathematics, computer science, physics, etc., making the field more attractive for funding. A remarkable milestone during this time was the victory of Deep Blue against Garry Kasparov.
The last era of AI runs from 2011 to today, with the advent and popularization of Deep Learning (DL): deep stacks of processing layers mimicking the interactions of human neurons. This happened thanks to advances in hardware technologies, which have enabled the enormous computing requirements of DL. The huge hype comes from the spectacular results shown by this kind of system in a huge variety of tasks, such as computer vision, natural language processing, anomaly detection,...
In summary, we can see how the history of AI has been a succession of hype and disappointment cycles, with many actors involved and the industry as a very important part of the process.

2 Machine Learning Basics

In this section, we review some notation, and basic knowledge of Linear Algebra, Probability and Machine Learning.

2.1 Linear Algebra Basics

A scalar is a number, either real, usually denoted $x \in \mathbb{R}$, or natural, denoted $n \in \mathbb{N}$. A vector is an array of numbers, usually real, $x \in \mathbb{R}^n$, written $x = [x_1, x_2, \ldots, x_n]$. A matrix is a 2-dimensional array of numbers, $A \in \mathbb{R}^{n \times m}$, written $A = \begin{pmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nm} \end{pmatrix}$. A tensor is an $n$-dimensional array of numbers; for example, $A \in \mathbb{R}^{m \times k \times p}$ is a 3-dimensional tensor.
Usually, we will be working with matrices, which can be operated in different ways:
There are also special matrices:
Another important concept is that of the norm, which is basically measuring how far a point is from the origin of the space and can be used to measure distances:
Definition 2.1. A norm is a function f that measures the size of vectors, and must have the following properties:
  • $f(x) = 0 \Rightarrow x = 0$,
  • $f(x + y) \le f(x) + f(y)$ (triangle inequality), and
  • $\forall \alpha \in \mathbb{R},\ f(\alpha x) = |\alpha|\, f(x)$.
A very important family of norms is the $L^p$ norm, defined as $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$. The Euclidean norm is the $L^2$ norm, written $\|x\|$ and equal to $\sqrt{x^T x}$. In Machine Learning, it is not uncommon to use the squared Euclidean norm, $x^T x$, since it preserves the ordering and is easier to work with. The Manhattan norm is the $L^1$ norm, used when the difference between zero and nonzero elements is important. Finally, the max norm is the $L^\infty$ norm, $\|x\|_\infty = \max_i |x_i|$.
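A minimal sketch, assuming NumPy is available, comparing the $L^1$, $L^2$, squared $L^2$ and max norms of the same made-up vector, following the definitions above:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1   = np.sum(np.abs(x))          # Manhattan norm ||x||_1
l2   = np.sqrt(np.sum(x ** 2))    # Euclidean norm ||x||_2 = sqrt(x^T x)
l2sq = np.sum(x ** 2)             # squared Euclidean norm, x^T x
linf = np.max(np.abs(x))          # max norm ||x||_inf

print(l1, l2, l2sq, linf)         # 8.0, ~5.10, 26.0, 4.0
```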

2.2 Probability Basics

A random variable, X , is a variable that can take different values, x , randomly. They can be discrete, like the number drawn from a dice, or continuous, like the humidity in the air.
A probability distribution, p , is a Probability Mass Function (PMF) for discrete variables, and a Probability Density Function (PDF) for continuous random variables. It must satisfy:
It is usual to have two (or more) random variables, $X$ and $Y$, and to be interested in the probability distribution of their combination, $p(x,y)$. In this context, we define the marginal probability of the variable $X$ as $p(X=x) = \sum_{y \in Y} p(x,y)$ for all $x \in X$ (an integral in the continuous case). The conditional probability of the variable $Y$ conditioned on $X=x$ is $p(Y=y \mid X=x) = \frac{p(Y=y, X=x)}{p(X=x)}$. Finally, there is the chain rule of conditional probabilities: for $n$ random variables $X_1, \ldots, X_n$, $p(X_1=x_1, \ldots, X_n=x_n) = p(X_1=x_1) \prod_{i=2}^n p(X_i=x_i \mid X_1=x_1, \ldots, X_{i-1}=x_{i-1})$.
Example 2.1. For example, let's say X={ 1,2,3 } , Y={ 1,2 } and Z={ 1,2 } with the following probabilities:
| $x$ | $y$ | $z$ | $p(x,y,z)$ |
|---|---|---|---|
| 1 | 1 | 1 | 1/6 |
| 1 | 1 | 2 | 1/6 |
| 1 | 2 | 1 | 1/12 |
| 1 | 2 | 2 | 1/24 |
| 2 | 1 | 1 | 1/24 |
| 2 | 1 | 2 | 1/20 |
| 2 | 2 | 1 | 1/20 |
| 2 | 2 | 2 | 1/20 |
| 3 | 1 | 1 | 1/6 |
| 3 | 1 | 2 | 1/12 |
| 3 | 2 | 1 | 1/20 |
| 3 | 2 | 2 | 1/20 |
Then, the marginal probabilities for the variable $X$ are $P(X=1) = \frac16 + \frac16 + \frac1{12} + \frac1{24} = \frac{11}{24}$, $P(X=2) = \frac1{24} + \frac1{20} + \frac1{20} + \frac1{20} = \frac{23}{120}$, $P(X=3) = \frac16 + \frac1{12} + \frac1{20} + \frac1{20} = \frac{21}{60} = \frac{7}{20}$.
The conditional probability of the event $\{Y=1 \mid X=3\}$ is: $P(Y=1 \mid X=3) = \frac{P(Y=1, X=3)}{P(X=3)} = \frac{\frac16 + \frac1{12}}{\frac{7}{20}} = \frac{\frac14}{\frac{7}{20}} = \frac57$.
The conditional probability of the event $\{Z=1 \mid X=3, Y=1\}$ is: $P(Z=1 \mid X=3, Y=1) = \frac{P(X=3, Y=1, Z=1)}{P(X=3, Y=1)} = \frac{\frac16}{\frac14} = \frac23$.
The probability of the event $\{X=3, Y=1, Z=1\}$ could be computed from the conditional probabilities as follows, in case we only knew these: $P(X=3, Y=1, Z=1) = P(X=3)\, P(Y=1 \mid X=3)\, P(Z=1 \mid X=3, Y=1) = \frac{7}{20} \cdot \frac57 \cdot \frac23 = \frac{10}{60} = \frac16$.
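A small sketch reproducing Example 2.1 numerically, assuming only the Python standard library: the joint table $p(x,y,z)$ is stored in a dictionary, and the marginals and conditionals follow the formulas above.

```python
from fractions import Fraction as F

p = {  # (x, y, z): p(x, y, z), copied from the table
    (1,1,1): F(1,6),  (1,1,2): F(1,6),  (1,2,1): F(1,12), (1,2,2): F(1,24),
    (2,1,1): F(1,24), (2,1,2): F(1,20), (2,2,1): F(1,20), (2,2,2): F(1,20),
    (3,1,1): F(1,6),  (3,1,2): F(1,12), (3,2,1): F(1,20), (3,2,2): F(1,20),
}

def marginal_x(x):
    # p(X=x) = sum over y, z of p(x, y, z)
    return sum(v for (xi, _, _), v in p.items() if xi == x)

p_x3 = marginal_x(3)                                          # 7/20
p_y1_given_x3 = sum(v for (xi, yi, _), v in p.items()
                    if xi == 3 and yi == 1) / p_x3            # 5/7
p_z1_given_x3y1 = p[(3,1,1)] / (p[(3,1,1)] + p[(3,1,2)])      # 2/3

# Chain rule: p(X=3, Y=1, Z=1) = p(X=3) p(Y=1|X=3) p(Z=1|X=3,Y=1) = 1/6
print(p_x3, p_y1_given_x3, p_z1_given_x3y1,
      p_x3 * p_y1_given_x3 * p_z1_given_x3y1)
```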
When there are several variables, it is possible that the value of one of them is dependant, somehow, on the values that the other variables take; or that it is not:
Definition 2.2. Two random variables $X$ and $Y$ are independent, denoted $X \perp Y$, if $\forall x \in X, y \in Y$, $p(X=x, Y=y) = p(X=x)\, p(Y=y)$.
$X$ and $Y$ are conditionally independent given the random variable $Z$, written $X \perp Y \mid Z$, if $\forall x \in X, y \in Y, z \in Z$, $p(X=x, Y=y \mid Z=z) = p(X=x \mid Z=z)\, p(Y=y \mid Z=z)$.
In Statistics and Machine Learning, there are some measures that summarize information about random variables, and that hold great importance.
Definition 2.3. The expectation of a function $f(x)$, where $x \sim p(x)$, is the average value of $f$ over $x$: $\mathbb{E}_{x \sim p}[f(x)] = \sum_{x \in X} p(x) f(x)$ (an integral in the continuous case). The variance of $f(x)$ measures how the values of $f$ vary around their average: $\mathrm{Var}[f(x)] = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right]$, and the standard deviation is the square root of the variance.
The covariance of two random variables provides information about how much two values are linearly related. More generally, if we apply two functions $f(x)$, where $x \sim p(x)$, and $g(y)$, where $y \sim p(y)$, the covariance between them is: $\mathrm{Cov}[f(x), g(y)] = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])\,(g(y) - \mathbb{E}[g(y)])\right]$.

2.3 Machine Learning Basics

To finalize with this review chapter, we are going to remember some basic concepts of Machine Learning.
First, let's give a definition of the concept:
Definition 2.4. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P , if its performance at tasks in T , as measured by P , improves with experience E .
Mathematically, we can formalize this as having a dataset of $m$ points and $k$ features, which can be represented as a matrix $X \in \mathbb{R}^{m \times k}$. In the case of supervised learning, $X$ is associated with a vector of labels, $y$, and we aim to learn a joint distribution $p(X, y)$ to infer $p(Y=y \mid X=x) = \frac{p(x,y)}{\sum_{y'} p(x,y')}$. The goal is then to find a function $\hat{f}$ that associates each $x$ to the best approximation of $y$, and that is capable of generalizing to unseen data. Usually, $\hat{f}$ is parameterized by a set of parameters, $\theta$, which are learnt during training.
The main challenge for an ML model is generalization to unseen data, estimated on test data after training on training data. Overfitting occurs when the gap between training error and test error is too large, while underfitting occurs when the training error itself is too large. The capacity of a model is the range of functions it is able to learn, and it controls how likely the model is to overfit or underfit. This is visualized in Figure 1.
image: 1_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___achine-Learning_LectureNotes_source_pegado1.png
Figure 1: Appropriate capacity, overfitting and underfitting visualization.
When we want to train a model, we first define the parameters that characterize it, and then we need to obtain the best possible values of these parameters, according to the data. For this, we use estimators:
Definition 2.5. Given an unknown parameter θ , we estimate it through an estimator, θ ˆ . A point estimator is a function of the data, X , θ ˆ =g( X ) . The bias of an estimator is bias( θ ˆ ) = E [ θ ˆ ] - θ . An estimator is unbiased if bias( θ ˆ ) =0 .
The variance of an estimator is Var( θ ˆ ) .
There are different ways to construct estimators, but one that is frequently used and has solid mathematical foundations is the maximum likelihood estimator. Consider a dataset $X = \{x_1, \ldots, x_n\}$ and a parametric family of probability distributions $p(x; \theta)$ that approximates, for each $x$, the probability $p_{data}(x)$; this is, for each $\theta$, $p(x; \theta)$ is a probability density function. The maximum likelihood estimator is then $\theta_{ML} = \arg\max_\theta p_{model}(X; \theta) = \arg\max_\theta \prod_{i=1}^n p_{model}(x_i; \theta)$, assuming that all data instances are independent and identically distributed (i.i.d.). It is also common practice to maximize the log-likelihood instead, which turns the product into a sum and avoids floating-point issues, since for large datasets the product rapidly goes to 0. In addition, the logarithm is monotonic, so it does not change the maximizer. Therefore, we can use: $\theta_{ML} = \arg\max_\theta \sum_{i=1}^n \log p_{model}(x_i; \theta)$.
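A minimal sketch of maximum (log-)likelihood estimation, assuming NumPy and a Gaussian model with unknown mean and known unit variance; the data and the search grid are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=1000)   # i.i.d. samples from p_data

def log_likelihood(mu, x):
    # sum_i log p_model(x_i; mu) for an N(mu, 1) model, dropping the constant term
    return -0.5 * np.sum((x - mu) ** 2)

grid = np.linspace(0.0, 4.0, 401)
mu_ml = grid[np.argmax([log_likelihood(m, X) for m in grid])]

print(mu_ml, X.mean())   # the grid maximizer matches the sample mean (the closed-form MLE)
```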

3 Deep Neural Networks

3.1 Perceptron

A deeper explanation of the perceptron can be read in my notes from another course, https://lorenc1o.github.io/BDMA_Notes/universities/UPC/Machine_Learning_summary.pdf.
A perceptron is an algorithm for supervised learning of binary classifiers. That is, we have a dataset $X \in \mathbb{R}^{n \times m}$ associated with a vector of labels $y \in \{0,1\}^n$. Then, the perceptron learns a function $\hat{f}$, parametrized by a vector of weights $w \in \mathbb{R}^m$ and a bias $b$, such that: $\hat{f}(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise.} \end{cases}$
Therefore, it is a linear classifier, which divides the input space into two regions separated by a hyperplane. This means that a perceptron cannot separate non-linear data.
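A minimal perceptron sketch, assuming NumPy: the classical update rule $w \leftarrow w + \eta\,(y - \hat{y})\,x$ applied to a linearly separable toy problem (the AND function, unlike XOR). The toy data and learning rate are made up.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])            # AND is linearly separable

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(20):                    # a few passes over the data are enough here
    for xi, yi in zip(X, y):
        y_hat = 1 if xi @ w + b > 0 else 0
        w += lr * (yi - y_hat) * xi    # move the hyperplane only on mistakes
        b += lr * (yi - y_hat)

print([(1 if xi @ w + b > 0 else 0) for xi in X])   # [0, 0, 0, 1]
```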

3.2 Multi-layer perceptron

A deeper explanation of the MLP can be read in my notes from another course, https://lorenc1o.github.io/BDMA_Notes/universities/UPC/Machine_Learning_summary.pdf.
When we say 'Deep' neural network, we refer to a series of stacked perceptrons. However, stacked like this, the model is still linear. This is why activation functions are introduced. An activation function is a function applied to the output of a perceptron to make it non-linear.
For example, ReLU is a piecewise-linear function defined as $\mathrm{ReLU}(z) = \begin{cases} z & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases} = \max\{z, 0\}$. This function preserves many of the good optimization properties of a linear function: it is differentiable (except at one point), and its derivative is piecewise constant.
Example 3.1. Learn the XOR function with a 2-layer MLP.
The XOR function is represented with the table:
| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
We want to use a 2-layer MLP to learn this function:
image: 2_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___ine-Learning_LectureNotes_source_xor_drawio.png
In $h_1$, the pre-activation is $w_{11} x_1 + w_{12} x_2 + b_1$, and in $h_2$ it is $w_{21} x_1 + w_{22} x_2 + b_2$. This can be written as $h = W_h^T X + b_h$. Then, we apply ReLU, $\max(0, W_h^T X + b_h)$, and finally the output layer: $y = W_y^T \max(0, W_h^T X + b_h) + b_y$.
Let's see the different inputs:
| $x_1$ | $x_2$ | $h = W_h^T x + b_h$ | required output $y = W_y^T \max(0,h) + b_y$ |
|---|---|---|---|
| 0 | 0 | $(b_1,\ b_2)$ | $w_1^y \max(0, b_1) + w_2^y \max(0, b_2) + b_y \le 0$ |
| 0 | 1 | $(w_{12}+b_1,\ w_{22}+b_2)$ | $w_1^y \max(0, w_{12}+b_1) + w_2^y \max(0, w_{22}+b_2) + b_y > 0$ |
| 1 | 0 | $(w_{11}+b_1,\ w_{21}+b_2)$ | $w_1^y \max(0, w_{11}+b_1) + w_2^y \max(0, w_{21}+b_2) + b_y > 0$ |
| 1 | 1 | $(w_{11}+w_{12}+b_1,\ w_{21}+w_{22}+b_2)$ | $w_1^y \max(0, w_{11}+w_{12}+b_1) + w_2^y \max(0, w_{21}+w_{22}+b_2) + b_y \le 0$ |
A solution is: $W_h = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, $b_h = \begin{pmatrix} 0 \\ -1 \end{pmatrix}$, $W_y = \begin{pmatrix} 1 \\ -2 \end{pmatrix}$, $b_y = 0$. Let's check:
| $x_1$ | $x_2$ | $h$ (pre-ReLU) | $y$ |
|---|---|---|---|
| 0 | 0 | $(0,\ -1)$ | $1 \cdot 0 - 2 \cdot 0 = 0 \le 0$ |
| 0 | 1 | $(1,\ 0)$ | $1 \cdot 1 - 2 \cdot 0 = 1 > 0$ |
| 1 | 0 | $(1,\ 0)$ | $1 \cdot 1 - 2 \cdot 0 = 1 > 0$ |
| 1 | 1 | $(2,\ 1)$ | $1 \cdot 2 - 2 \cdot 1 = 0 \le 0$ |
So, it works! Note that this solution is not unique!
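A quick numerical check of this solution, assuming NumPy: $h = \mathrm{ReLU}(W_h^T x + b_h)$, $y = W_y^T h + b_y$, with the weights given above.

```python
import numpy as np

W_h = np.array([[1.0, 1.0],
                [1.0, 1.0]])
b_h = np.array([0.0, -1.0])
W_y = np.array([1.0, -2.0])
b_y = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = np.maximum(0.0, W_h.T @ np.array(x, dtype=float) + b_h)   # hidden layer + ReLU
    y = W_y @ h + b_y                                             # output layer
    print(x, int(y > 0))   # (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
```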
What happens is that the XOR problem is not linearly separable in the original input space:
image: 3_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___hine-Learning_LectureNotes_source_xor_space.png
But, the hidden layer transforms this space, making the problem linearly separable, and therefore solvable in the last layer:
image: 4_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___earning_LectureNotes_source_xor_space_trans.png

3.3 Cost Functions

The cost function is important when working with neural networks, because our goal is to ultimately train the model to solve some problem, and the cost functions will be the function that our model will aim at minimizing, thus guiding the training process.
Usually, we will need to choose a cost function L that is suitable for our problem. Then, we will minimize L with stochastic gradient descent, by iteratively updating the parameters in the direction opposite to the gradient of L.
In this process, we aim to find good local minima, rather than the global minimum. This is related to overfitting (learning only the training data, losing generalization capabilities), and to the empirical fact that deep neural networks have surprisingly good local, non-global optima.

3.3.1 Choice of cost function

In the general case, we use the maximum likelihood principle, taking the output types of the network into account. This means that we assume our dataset $\{x_1, \ldots, x_n\}$ to be independently and identically distributed (i.i.d.) from an unknown distribution, $p_{data}(x)$. We choose a parametric model family $p_{model}(x; \theta)$, represented as a neural network, which we use to estimate an approximation of the true distribution. For this, we utilize the maximum likelihood estimator, defined as $\theta_{ML} = \arg\max_\theta \prod_{i=1}^n p_{model}(x_i; \theta)$. Usually, to avoid floating-point errors, the log-likelihood is used instead: $\theta_{ML} = \arg\max_\theta \sum_{i=1}^n \log p_{model}(x_i; \theta)$.
In the maximum likelihood estimation framework, we might apply activation functions to the output layer to get a desired structure for our distribution. This choice will also influence the mathematical form of the cost function. For example, we can use linear units for regression or for Gaussian distributions, sigmoid units for binary classification or softmax units for multi-class classification.
Linear units for regression
A linear output layer is such that, given the features h , the output is y ˆ = W T h+b, where W is the weights vector and b the bias.
We can use this to predict real or vector valued variables, such as prices, biometrics,...
Linear unit for Gaussian distribution
A Gaussian output unit is such that, given features h , a linear layer produces a vector y ˆ representing the mean and the covariance matrix of a conditional Gaussian distribution: p( y|x ) = N ( y; y ˆ ,I ) .
Covariance is usually not modelled or simplified to be diagonal (in which case we need to ensure that the output is non-negative).
Binary classification
In this case, the objective is to predict a binary variable $y$: the neural network must predict $P(y=1 \mid x)$. Thus, we must ensure that the output is a probability, in the interval $[0,1]$. For this, we could take $P(y=1 \mid x) = \max\{0, \min\{1, W^T h + b\}\}$. The problem with this approach is that if $W^T h + b \notin [0,1]$ then the gradient is 0 and training stops. To solve this issue, we can use a sigmoid unit, which is $\hat{y} = \sigma(W^T h + b) = \frac{1}{1 + e^{-(W^T h + b)}}$.
Softmax unit for multi-class classification
Now our objective is to classify the input data into one among $N > 2$ classes. We want to predict $\hat{y}$ with $\hat{y}_i \approx P(y=i \mid x)$, subject to $\hat{y}_i \in [0,1]$ for all $i$ and $\sum_i \hat{y}_i = 1$.
In the output layer we can have $N$ perceptrons, each of them computing $z_i = \log \tilde{P}(y=i \mid x)$, i.e., the logits. With this, we can apply the softmax output unit to all of them, obtaining our vector of probabilities as $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
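A minimal sketch, assuming NumPy, of a numerically stable softmax over the logits $z_i$: subtracting $\max(z)$ before exponentiating does not change the result, since the constant cancels in the ratio, but it avoids overflow.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])         # made-up logits for N = 3 classes
p = softmax(z)
print(p, p.sum())                     # probabilities in [0, 1] that sum to 1
```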

3.3.2 Cross-entropy

In classification problems we want to estimate the probability of different outcomes. Let the estimated probability of outcome $i$ be $p_{model}(x=i)$, with to-be-optimized parameters $\theta$, and let the frequency of outcome $i$ in the training set be $p(x=i)$. Given $N$ conditionally independent samples in the training set, the likelihood of the parameters $\theta$ of the model $p_{model}(x=i)$ on the training set is: $\mathcal{L}(\theta) = \prod_{i \in X} p_{model}(x=i)^{N p(x=i)}$. Therefore, the log-likelihood, divided by $N$, is $\frac{1}{N} \log \mathcal{L}(\theta) = \sum_{i \in X} p(x=i) \log p_{model}(x=i)$, i.e., the negative cross-entropy between $p$ and $p_{model}$.
Cross-entropy minimization is frequently used in optimization and rare-event probability estimation. When comparing a distribution $q$ against a fixed reference distribution $p$, cross-entropy and KL divergence are identical up to an additive constant (since $p$ is fixed): $H(p, q) = H(p) + D_{KL}(p \| q)$. According to Gibbs' inequality, both take their minimal values when $p = q$, which is 0 for the KL divergence and $H(p)$ for the cross-entropy. In the engineering literature, the principle of minimizing KL divergence (Kullback's "Principle of Minimum Discrimination Information") is often called the Principle of Minimum Cross-Entropy (MCE), or Minxent.

3.4 Why deep NN?

Depth is the length of the longest path data can take from input to output. For a deep feedforward NN, the depth is the number of hidden layers plus the output layer. State-of-the-art architectures used in practice have dozens to hundreds of layers.
Theorem 3.1. Universal Approximation Theorem
Let $\phi(\cdot)$ be a nonconstant, bounded, and monotonically increasing continuous function. Let $I_{m_0}$ denote the $m_0$-dimensional unit hypercube, $[0,1]^{m_0}$. The space of continuous functions on $I_{m_0}$ is denoted by $C(I_{m_0})$.
Then, given any function $f \in C(I_{m_0})$ and $\epsilon > 0$, there exist an integer $m_1$ and sets of real constants $\alpha_i$, $b_i$ and $w_{ij} \in \mathbb{R}$, where $i = 1, \ldots, m_1$ and $j = 1, \ldots, m_0$, such that we may define $F(x) = \sum_{i=1}^{m_1} \alpha_i\, \phi\!\left(\sum_{j=1}^{m_0} w_{ij} x_j + b_i\right)$ as an approximate realization of the function $f$, that is, $|F(x) - f(x)| < \epsilon$ for all $x \in I_{m_0}$.
This theorem is very relevant, because it says that for any mapping function $f$ in supervised learning, there exists an MLP with $m_1$ neurons in the hidden layer which is able to approximate it with the desired precision.
However, it only proves the existence of a shallow MLP (just one hidden layer) with $m_1$ neurons in the hidden layer that can approximate the function; it does not tell us how to find this number.
As a rule of thumb, the generalization error is $\epsilon = \frac{VC_{\dim}(MLP)}{N}$, where $VC_{\dim}$ is the Vapnik-Chervonenkis dimension, a measure of the capacity of a model: the size of the largest set of points that the model can shatter. It is not easy to compute, but a rough upper bound for a FFNN is $O(W \log W)$, with $W$ being the total number of weights in the network.
Also, this theorem hints us that having more neurons in the hidden layers will give us better training error, but worse generalization error: overfitting.
However, for most functions m 1 is very high, and becomes quickly computationally intractable: so we need to go deeper.
Theorem 3.2. No Free Lunch Theorem
Multiple informal formulations:
  • For every learning algorithm A and B, there are as many problems where A has a better generalization error than problems where B has a better one.
  • All learning algorithms have the same generalization error if we average over all learning problems.
  • There is no universally better learning algorithm.
Depth Property
The number of polygonal regions generated by an MLP with ReLU activations, $d$ inputs, $n$ neurons per hidden layer and $l$ layers is $O\!\left(\left(\frac{n}{d}\right)^{d(l-1)} n^d\right)$.
This number grows exponentially with depth. This means that adding depth basically allows for more transformations of the input space.

3.5 Gradient-based Learning

The gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$ is $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$, defined at the point $p = (x_1, \ldots, x_n)$ as $\nabla f(p) = \left(\frac{\partial f}{\partial x_1}(p), \ldots, \frac{\partial f}{\partial x_n}(p)\right)$. That is, it is the local derivative or slope along each dimension at a certain point.
Going in the opposite direction of the gradient is a naïve but practical guess of the direction of the local minimum. This is the base for the gradient descent method.
Gradient-descent method (Cauchy, 1847)
A parametric function $f(\theta)$ can be iteratively minimized by following the direction opposite to the gradient: $\theta_{t+1} = \theta_t - \epsilon \nabla_\theta f(\theta_t)$, where $\epsilon > 0$ is the learning rate.
We stop iterating when the gradient is near to 0.
Notice that this is unnecessary if we can solve $\nabla_\theta f(\theta) = 0$ in closed form! In that case it is easier to minimize directly. Gradient descent is useful when this is not possible, which is always the case for neural networks.
In addition, there are variations to the method, for example, we can vary ϵ during training.
Stochastic gradient descent
Given a cost function f( θ ) , parameters of the network are updated with θ θ - ϵ θ f( θ ) .
For the negative log-likelihood (MLE), the cost function is $f(\theta) = \frac{1}{m} \sum_{i=1}^m L(x^{(i)}, y^{(i)}, \theta)$, so the estimated gradient is $\nabla_\theta f(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$.
The problem with this approach is that to take a single step of gradient descent, we must compute the loss over the whole dataset every time, making the method not scalable at all. This is called batch gradient descent.
One solution is to compute the gradient with 1 sample only at each step, which is very noisy and inefficient, but works. This is the stochastic gradient descent.
In the middle ground, we find mini-batch gradient descent, which divides the dataset into subsets and updates the parameters after processing each of these subsets. A batch is the collection of samples used at each iteration of SGD in DL. A bigger batch provides a better gradient estimation, and therefore faster learning, but also requires more device memory and slows down each step.
Therefore, there is a trade-off between cost and performance for companies. In practice, batch sizes between 1 and 256 per GPU are used.
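A minimal mini-batch SGD sketch on linear least squares, assuming NumPy; the model, data, learning rate and batch size are made up, and grad() is the gradient of the per-batch loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta, Xb, yb):
    # gradient of the mean squared error on the mini-batch
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta, lr, batch_size = np.zeros(5), 0.05, 64
for epoch in range(50):
    idx = rng.permutation(len(X))                      # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        theta -= lr * grad(theta, X[b], y[b])          # one SGD step per mini-batch

print(np.round(theta, 2))   # close to the true parameters
```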
But there is an even greater problem, the computation of the gradient is computationally very costly. To go around this problem, back-propagation was invented, as an efficient technique for gradient computation.

3.5.1 Back-propagation

The back-propagation algorithm is based on the chain rule for the derivative of composite functions: if we have $y = g(x)$ and $z = f(y) = f(g(x)) = (f \circ g)(x)$, then $\frac{dz}{dx}(x) = f'(g(x))\, g'(x)$, or, abusing notation, $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$. This is generalized to multivariate functions as follows: let $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g: \mathbb{R}^m \to \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}$. If $z = f(y) = f(g(x))$, then $\frac{\partial z}{\partial x_i}(x) = \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}$, or $\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^T \nabla_y z$, where $\frac{\partial y}{\partial x}$ is the Jacobian of $g$.
Now, back-propagation is a recursive application of the chain rule, starting from the cost function. The algorithm works as follows:
  1. Forward pass: a feedforward network takes as input x and produces the output y ˆ . The information flows from layer to layer.
  2. Cost function: compute the error between expected output and actual output.
  3. Back-propagate: evaluate the gradient with respect to each parameter and propagate it backwards to update the parameters. For this, we use the concept of local derivative: the derivatives of connected nodes are computed locally on the edges of the graph. For non-connected nodes, we multiply the derivatives along the edges connecting the nodes, and we sum over all incoming edges.
If we do this in a forward way, summing over all paths becomes intractable pretty quickly, while doing it backwards allows us to obtain the derivative of the output with respect to every node in one pass. This also leads to massive parallelization.
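A minimal sketch of back-propagation by hand on a 2-layer network with a squared-error loss, assuming NumPy. Shapes and data are made up; the point is the reverse application of the chain rule, layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), 1.0                 # one input sample and its target

W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=4) * 0.5, 0.0

# Forward pass
z1 = W1 @ x + b1
h  = np.maximum(0.0, z1)                       # ReLU
y  = W2 @ h + b2
L  = (y - t) ** 2                              # cost

# Backward pass (chain rule, from the cost back to each parameter)
dL_dy  = 2.0 * (y - t)
dL_dW2 = dL_dy * h
dL_db2 = dL_dy
dL_dh  = dL_dy * W2
dL_dz1 = dL_dh * (z1 > 0)                      # ReLU derivative
dL_dW1 = np.outer(dL_dz1, x)
dL_db1 = dL_dz1

# One gradient-descent step
lr = 0.1
W1 -= lr * dL_dW1; b1 -= lr * dL_db1
W2 -= lr * dL_dW2; b2 -= lr * dL_db2
```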
I did a more detailed explanation, with visualizations in my previous notes, https://lorenc1o.github.io/BDMA_Notes/universities/UPC/Machine_Learning_summary.pdf.

4 Deep Neural Networks: Optimization and Regularization

4.1 Optimization

Recall that neural networks learn by optimizing a cost function, J( θ ) . This optimization is, in practice, performed by gradient descent, so we must be able to compute θ J( θ ) . The problem is that this is computationally very costly. However, we have seen how back-propagation is an efficient gradient computation technique.
Now, learning is related to what we want to optimize: we are interested in the performance on the test set, which we cannot optimize directly. Therefore, what is done is minimizing a cost function, J, on the training set, hoping that it will improve the performance on the test set. But this relationship is also what makes learning and optimization two different things! In pure optimization, our objective is minimizing J, while in learning, the objective is the ability to generalize, i.e., to perform well on the test set. Optimizing J on the training set does not ensure good generalization, and sometimes a worse result on the pure optimization problem on the training set can yield better results on the learning problem (think of the overfitting problem).
Therefore, optimization is a crucial part of learning, and gradient descent is a “cheap” optimization technique. However, it is not exempt from problems:
In addition, there are problems that are specific to the kind of functions that arise when working with neural networks:
There are different techniques to address these problems. Let's see some of them.

4.1.1 Solving Bad Convergence

It's important to stabilize and improve the convergence of gradient descent on DNNs, and for this we need good optimization techniques.
Gradient Clipping
Gradient Clipping proposes to clip the gradient norm at a threshold, so:
  1. Compute the gradient, say g .
  2. If $\|g\| > th$, then $g \leftarrow \frac{th}{\|g\|}\, g$, where $th$ is a threshold on the maximum admissible gradient norm.
This way, we keep the direction of the gradient, while preventing overshooting, i.e., taking too large steps. Although this introduces a bias in the optimization process, it works well in practice.
A visualization is the following:
image: 5_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___chine-Learning_LectureNotes_source_pegado11.png
Here, we can observe how, if we don't clip, the gradient is too large and we miss the minimum at the right, while clipping enables us to get there easily.
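A minimal gradient-norm clipping sketch, assuming NumPy: if $\|g\|$ exceeds the threshold, rescale $g$ so its norm is exactly $th$, keeping its direction. The example gradient and threshold are made up.

```python
import numpy as np

def clip_by_norm(g, th):
    norm = np.linalg.norm(g)
    return g * (th / norm) if norm > th else g

g = np.array([30.0, -40.0])          # made-up gradient with norm 50
print(clip_by_norm(g, th=5.0))       # [3., -4.], same direction, norm 5
```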
Gradient with Momentum
Momentum represents an acceleration method for stochastic gradient descent. The idea is to smooth gradient steps with some momentum or inertia, using previous gradient steps as a “memory” of the direction.
Let's call the gradient at step $t$ $G_t(\theta)$. We define the velocity, $v_t(\theta)$, as $v_t(\theta) = \alpha v_{t-1}(\theta) + (1-\alpha) G_t(\theta)$, where $0 \le \alpha < 1$ is a parameter controlling how much of the past directions we keep, usually set to 0.9. If it is set to 0, we recover vanilla SGD.
The updates are done as θ t+1 = θ t - η v t ( θ ) , where η is the learning rate.
The momentum approach is shown below, observing a slight speedup.
image: 6_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___chine-Learning_LectureNotes_source_pegado15.png
Nesterov Momentum
This is a variant of gradient descent with momentum. It tries to address a problem of the momentum approach, namely that it tends to oscillate around the minimum. Nesterov corrects these oscillations by estimating the gradient after the momentum update: $v_t(\theta) = \alpha v_{t-1}(\theta) + (1-\alpha)\, G(\theta - \alpha v_{t-1}(\theta))$.
image: 7_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___chine-Learning_LectureNotes_source_pegado16.png
We can observe how the oscillations are less prominent. If the function was different and momentum oscillates around the minimum, Nesterov would help with this.
Dynamic Learning Rate
Stochastic gradient descent estimates the gradient with only one sample and iterates over the whole dataset: $\theta \leftarrow \theta - \eta\, G(\theta; x_i, y_i)$. It is common to use minibatches instead: $\theta \leftarrow \theta - \eta\, \nabla_\theta \sum_{i=i_0}^{i_0+B} L(\theta; x_i, y_i)$. The learning rate is very important, and in fact it should decrease during training, as we get closer to the optimum. Some reasons for this decrease are:
In practice, what we do is apply a linear decay until some iteration $\tau$, and keep the rate constant after this point: $\eta_k = \left(1 - \frac{k}{\tau}\right)\eta_0 + \frac{k}{\tau}\eta_\tau$ for $k \le \tau$, and $\eta_\tau$ after $\tau$ iterations.
Now, the question is, how to choose η 0 , η τ and τ ? There are several options:
Cyclical Learning Rate
Optimization difficulties come more from plateaus than from bad local optima, and increasing $\eta$ allows us to cross these plateaus. For this, the learning rate is varied in a cyclical manner, between a minimum bound, $\eta_{min}$, and a maximum bound, $\eta_{max}$.
Super Convergence with One-Cycle Policy
The one-cycle policy works as follows:
image: 8_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___chine-Learning_LectureNotes_source_pegado17.png
SGD with Adaptive Learning Rates
Preconditioning: $\theta_{t+1} \leftarrow \theta_t - \eta_t P_t^{-1} G(\theta_t)$, where $P_t$ can be defined in different ways. Let's see AdaGrad and RMSProp.
AdaGrad: parameters with the largest partial derivatives should have a rapidly decreasing learning rate. $P_t = \left[\operatorname{diag}\left(\sum_{j=0}^t G(\theta_j) G(\theta_j)^T\right)\right]^{\frac12}$. More precisely: $g_t \leftarrow G(\theta_t)$, $r_t \leftarrow r_{t-1} + g_t \odot g_t$, $\theta_{t+1} \leftarrow \theta_t - \frac{\lambda}{\delta + \sqrt{r_t}} \odot g_t$.
AdaGrad converges quickly on convex problems. However, keeping the entire gradient history can be detrimental, and smoothing the gradient destroys information.
RMSProp introduces momentum when computing the preconditioner. The idea is to adapt the learning rate to the curvature of the loss function, putting the brakes on when the function is steep and accelerating when the loss function is flat: $P_t = \left[\operatorname{diag}\left(\alpha P_{t-1} + (1-\alpha)\, G(\theta_t) G(\theta_t)^T\right)\right]^{\frac12}$.
More precisely: $v_t(\theta) = \alpha v_{t-1}(\theta) + (1-\alpha)\, G(\theta)^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\varepsilon + \sqrt{v_t(\theta)}}\, G(\theta)$.
Adam
Adam (Adaptive Moment Estimation) builds on RMSProp, but also uses a moving average of the gradients. It works as: $m_t(\theta) = \beta_1 m_{t-1}(\theta) + (1-\beta_1)\, G(\theta)$, $v_t(\theta) = \beta_2 v_{t-1}(\theta) + (1-\beta_2)\, G(\theta)^2$, $\theta_{t+1} = \theta_t - \eta \frac{m_t(\theta)}{\varepsilon + \sqrt{v_t(\theta)}}$. In practice, Adam is the most used optimizer. However, there are more efficient algorithms, like LARS (Layerwise Adaptive Rate Scaling) or LAMB (LARS + Adam).
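A minimal Adam update sketch, assuming NumPy and following the moving-average formulas above; the bias-correction terms, present in the original Adam paper but often omitted in compact presentations, are included here. The toy objective and hyperparameters are made up.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):                         # minimize f(theta) = ||theta||^2
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)                                    # close to the minimum at 0
```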
Comparison

4.2 Initialization and Normalization

The parameters of a deep learning model need initial values. The assignment of initial values is called initialization, and it can impact the optimization process in several ways:
Therefore, a difficult question arises: initial parameters can help optimization, but can detriment the generalization error.
Principle: Break Symmetries
Two identical parameters, connected to the same input, should be initialized differently, to incentivize different learning.
One option would be to initialize everything to 0, but this is not a good choice, because it does not break symmetry and therefore hinders learning. Instead, there are different initialization schemes, like random initialization.

4.2.1 Random Initialization

Random initialization uses a probability distribution to initialize the weights. For example, Xavier initialization uses a uniform distribution, $W_{i,j} \sim U\!\left(-\sqrt{\frac{6}{n_i + n_{i+1}}},\ \sqrt{\frac{6}{n_i + n_{i+1}}}\right)$, where $n_j$ is the number of neurons in layer $j = 1, \ldots, N$. The input layer is initialized as $W_{0,j} \sim \mathcal{N}\!\left(0, \frac{2}{n_1}\right)$.
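A minimal Xavier (Glorot) uniform initialization sketch, assuming NumPy; the layer sizes are made up.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=np.random.default_rng(0)):
    # W_{i,j} ~ U(-sqrt(6/(n_in + n_out)), +sqrt(6/(n_in + n_out)))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = xavier_uniform(256, 128)       # layer with 256 inputs and 128 outputs
print(W.shape, W.min(), W.max())   # all values lie within [-limit, limit]
```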

4.2.2 Input normalization

Gradient descent is sensitive to strong variations in the input, and, ideally, the surface of the loss function should have a uniform curvature in all directions, similar to a sphere. This can be encouraged by input normalization, i.e., normalizing all input features so that they all lie in a similar value range.
Moreover, there is the concept of batch normalization, or adaptive reparametrization. This consists in normalizing the input of each layer of a neural network. It is motivated by the following:
In this case, the idea is to normalize the distribution of each input feature in each layer, across each minibatch, to a normal $\mathcal{N}(0,1)$: $\mu \leftarrow \frac{1}{m}\sum_{i=1}^m \bar{x}_i$, $\sigma^2 \leftarrow \frac{1}{m}\sum_{i=1}^m (\bar{x}_i - \mu)^2$, $\bar{x}_i \leftarrow \frac{\bar{x}_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\epsilon$ here is a small constant added for numerical stability.
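A minimal batch-normalization sketch over a mini-batch, assuming NumPy: each feature is standardized across the batch, with eps a small constant for numerical stability. The learnable scale and shift (gamma, beta) used in practice are omitted here, and the data are made up.

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    mu = X.mean(axis=0)                        # per-feature mean over the batch
    var = X.var(axis=0)                        # per-feature variance over the batch
    return (X - mu) / np.sqrt(var + eps)

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 10))
Xn = batch_norm(X)
print(Xn.mean(axis=0).round(3), Xn.std(axis=0).round(3))   # ~0 and ~1 per feature
```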

4.3 Regularization

The loss function can be small on the training data, but large on the test data. This is the overfitting problem that we have seen before. Deep NNs tend to overfit, since more depth implies more free parameters, and therefore a higher VC dimension, which can increase $\epsilon$ according to the rule of thumb for its approximation.
A way to get around this is to put constraints on the weights, to reduce the VC dimension. This can be done through regularization.

4.3.1 L 2 Regularization (or Ridge)

$L^2$ regularization keeps the $L^2$ norm of the free parameters, $\theta$, as small as possible during learning.
The intuition is that each neuron will use all its inputs with small weights, instead of specializing on a small part with high weights.
To accomplish this, we have to minimize two things at the same time: the training loss and a penalty term representing the norm of the weights: $\mathcal{J}(\theta) = \mathbb{E}_D\left[\|t - y\|^2\right] + \lambda \|\theta\|^2$, where $\lambda$ is the regularization parameter, controlling the strength of the regularization:
The gradient of this new loss function is $\nabla_\theta \mathcal{J}(\theta) = -2(t-y)\nabla_\theta y + 2\lambda\theta$, so the parameter updates become $\Delta\theta = \eta(t-y)\nabla_\theta y - \eta\lambda\theta$. $L^2$ regularization leads to weight decay: even if there is no output error, the weights decay towards 0, forcing them to constantly learn and discouraging specialization on particular examples (overfitting), which enhances generalization.

4.3.2 L 1 regularization (or Lasso)

In this case, we penalize the absolute value of the weights instead of their Euclidean norm: $\mathcal{J}(\theta) = \mathbb{E}_D\left[\|t-y\|^2\right] + \lambda \|\theta\|_1$. This method leads to very sparse representations, where many neurons may be inactive and only a few represent the input.
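A minimal sketch of adding $L^2$ (ridge) and $L^1$ (lasso) penalties to a training loss, assuming NumPy; the data, model and lambda values are made up.

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def loss_l2(theta, X, y, lam):
    return mse(theta, X, y) + lam * np.sum(theta ** 2)      # lambda * ||theta||_2^2

def loss_l1(theta, X, y, lam):
    return mse(theta, X, y) + lam * np.sum(np.abs(theta))   # lambda * ||theta||_1

rng = np.random.default_rng(0)
X, y, theta = rng.normal(size=(100, 5)), rng.normal(size=100), rng.normal(size=5)
print(loss_l2(theta, X, y, lam=0.1), loss_l1(theta, X, y, lam=0.1))
```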

4.3.3 Early Stopping

During training, a common behavior is the following:
image: 9_home_runner_work_BDMA_Notes_BDMA_Notes_Paris-___chine-Learning_LectureNotes_source_pegado18.png
It is usual that the training error decreases steadily, while the validation error first decreases and then increases again. If we manage to find the optimal point in this process, we can stop training early, before the validation error gets larger.
This method is equivalent to $L^2$ regularization, as both limit the capacity of the model:

4.3.4 Dropout

Dropout considers all the sub-networks that can be formed by removing some units from the network. With this in mind, the method consists of:
The probability of dropout, $p$, is a hyperparameter. What we do at training time is, at each step, randomly deactivate some neurons. This encourages generalization, since relying on a few neurons that specialize heavily on some features of the training data is harder if these neurons are not always available.
Note that, at inference time, all neurons are always available.
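A minimal (inverted) dropout sketch, assuming NumPy: at training time each unit is dropped with probability $p$ and the survivors are rescaled by $1/(1-p)$, so that no rescaling is needed at inference time, when all units are kept. The example activations are made up.

```python
import numpy as np

def dropout(h, p, training, rng=np.random.default_rng(0)):
    if not training or p == 0.0:
        return h                                  # inference: all neurons available
    mask = rng.random(h.shape) >= p               # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h, p=0.5, training=True))           # roughly half the units zeroed, rest scaled
print(dropout(h, p=0.5, training=False))          # unchanged at inference
```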

4.3.5 Data Augmentation

The best way to avoid overfitting is collecting more data, but this can be hard, costly, or simply impossible. A simple trick is data augmentation, which consists in creating new, varied data from the current data by perturbing it, while not changing the associated labels.
For example:
image: 10_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado19.png
Mixup is a particular case of data augmentation, consisting of creating new training samples and labels by interpolation: $\hat{x} = \lambda x_i + (1-\lambda)x_j$, $\hat{y} = \lambda y_i + (1-\lambda)y_j$, where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, with $\alpha \in [0.1, 0.4]$ for classification tasks. This method works for structured data, and can be used to stabilize GANs (we will see this later).
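A minimal mixup sketch, assuming NumPy: new samples are convex combinations of two training samples and of their one-hot labels, with $\lambda$ drawn from $\mathrm{Beta}(\alpha, \alpha)$. The image shapes and labels are made up.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=np.random.default_rng(0)):
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x_i, y_i = np.random.rand(3, 32, 32), np.array([1.0, 0.0])   # made-up image + one-hot label
x_j, y_j = np.random.rand(3, 32, 32), np.array([0.0, 1.0])
x_new, y_new = mixup(x_i, y_i, x_j, y_j)
print(y_new)   # soft label [lam, 1 - lam]
```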

4.4 Vanishing Gradients

Contrary to what we could think, adding more layers to a DNN does not necessarily lead to better performance, on either the training or the test set. For instance, see the following graph, comparing a 20-layer NN (left) with a 56-layer NN (right):
image: 11_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado20.png
We observe that the deeper network performs worse in every sense. The main reason for this is the vanishing gradient problem. The gradient of the loss function is repeatedly multiplied by a weight matrix $W_k$ as it travels backwards in a deep NN, since $\frac{\partial h_k}{\partial h_{k-1}} = f'(W_k h_{k-1} + b_k)\, W_k$. When the gradient arrives at the first layer, the contribution of the weight matrices is comprised between $\|W_{min}\|^d$ and $\|W_{max}\|^d$, where $W_{min}$ and $W_{max}$ are the weight matrices with the lowest and highest norm, and $d$ is the depth of the network. We find:
We saw that exploding gradients can be solved by gradient clipping, but vanishing gradients are still the main limitation of deep NNs. The solutions include the use of ReLU activation functions, unsupervised pre-training, batch normalization, residual networks, etc.

4.4.1 Residual network

A Residual Neural Network, or ResNet, is an advanced type of neural network that is specifically designed to help improve the performance of deep learning models.
ResNet introduces the concept of residual learning. Instead of expecting each stack of layers to directly fit a desired underlying mapping, ResNet layers are designed to fit a residual mapping. The key component of ResNet is the introduction of "skip connections" or "shortcuts" that bypass one or more layers. A skip connection in a ResNet allows the gradient to be directly backpropagated to earlier layers.
These skip connections perform identity mapping, and their outputs are added to the outputs of the stacked layers. This design helps in training deeper networks by mitigating the vanishing gradient problem. With residual blocks, the network learns the additive residual function with respect to the layer inputs, making it easier to optimize and gain accuracy from considerably deeper networks.
image: 12_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado21.png

4.4.2 Stochastic Depth

Stochastic depth is a training technique for deep neural networks, particularly effective for very deep networks like ResNets. It was introduced as a solution to the problem of vanishing gradients and long training times in deep networks.
During training, stochastic depth randomly drops layers in the network. The idea is similar to dropout, where random neurons are turned off during training to prevent overfitting. In stochastic depth, however, it's entire layers that are dropped.
Each training iteration uses a shallower version of the network. The depth of the network varies each time, as different subsets of layers are randomly deactivated.
Notice how stochastic depth is particularly useful for ResNets, where skip connections (or residual connections) are a key feature. When a residual block is dropped, the skip connection effectively takes its place, allowing the signal to still propagate forward.
At test time, all layers are used, but their outputs are scaled appropriately to compensate for the dropout during training. This ensures that the network can benefit from its full depth during inference.
image: 13_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado22.png

4.5 Double Descent

Double descent is a phenomenon observed in the training of machine learning models, particularly in relation to the model's complexity and its performance on a given task. This concept challenges the traditional understanding of the bias-variance tradeoff and has gained attention in the field of machine learning and statistics.
The double descent curve shows that after the point where the model starts to overfit (as per the traditional U-shaped bias-variance tradeoff curve), increasing the model complexity even further can lead to a decrease in the total error again.
It shows the following phases:
The double descent phenomenon is especially noticeable in scenarios with limited training data. With more data, the peak of the curve (at the interpolation threshold) becomes less pronounced.
Double descent suggests that in some cases, choosing an even more complex model after hitting the overfitting point can improve performance.
For example, observe this phenomenon in the following graphs:
image: 14_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado23.png
A similar phenomenon is grokking, which occurs when we increase training time. It is remarkable that shallow models don't show it, which is a reason to use deep networks!
image: 15_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado24.png

5 Convolutional Neural Networks

5.1 Introduction

When working with MLP, we find several limitations:
  1. Dealing with data with a high number of features. For example, let's imagine we are working with a dataset of images at full HD resolution. The feature size is $1920 \times 1080 \times 3 = 6220800$. Learning a single layer to reduce the dimension to 1000 requires around $6220800 \times 1000 \approx 6 \times 10^9$ parameters, making the training really difficult.
  2. Use the knowledge of the input data modality to shape the network. E.g., for images, the network should be equivariant to translation. This means that a pattern should be detected independently of where it is located in the image.
Convolutional Neural Networks (CNNs) appear to mitigate these limitations.
Definition 5.1. A CNN is an NN that uses convolutions in place of general matrix multiplication in, at least, one of their layers.
CNNs deal with data arranged on a grid, which can be temporal data, images, videos, etc.

5.1.1 History of CNNs

5.2 Convolutions

Definition 5.2. Given two functions $f, g: \mathbb{R} \to \mathbb{R}$, their convolution is defined as $(f*g)(x) = \int_{-\infty}^{\infty} f(z)\, g(x-z)\, dz$. For discrete, bidimensional functions $f$ and $g$, it becomes $(f*g)(m,n) = \sum_{i,j=-\infty}^{\infty} f(i,j)\, g(m-i, n-j)$.
f is called the input signal.
g is called the kernel, or filter.
f*g , the convolution output, is the feature map.
Remark 5.1. Properties of convolutions:

Remark 5.2. Observe that a convolution can be understood as taking a moving average of $f$ around each point $x$, with weights provided by $g$. If $\int g(x)\, dx = 1$, then it represents a proper (weighted) moving average.
In practice, however, CNNs do not really use convolutions, but cross-correlations, which consist in sliding one signal (or function) over another and measuring the similarity at each position. It is a way to track how much one signal resembles another as you shift one of them over time or space. The definition is very similar to that of convolutions, but convolutions flip the kernel, while cross-correlations do not: $(f \star g)(m,n) = \sum_{i,j=-\infty}^{\infty} f(i,j)\, g(m+i, n+j)$.
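A minimal 2D cross-correlation sketch, assuming NumPy, sliding a kernel over a single-channel input without padding or stride; the input and kernel values are made up.

```python
import numpy as np

def cross_correlate2d(x, k):
    n_h, n_w = x.shape
    k_h, k_w = k.shape
    out = np.zeros((n_h - k_h + 1, n_w - k_w + 1))     # output size (n_h-k_h+1, n_w-k_w+1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # similarity between the kernel and the patch at position (i, j)
            out[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * k)
    return out

x = np.arange(16.0).reshape(4, 4)
k = np.array([[1.0, -1.0]])                            # a simple horizontal-difference kernel
print(cross_correlate2d(x, k).shape)                   # (4, 3)
```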
While MLPs make every input neuron interact with every output neuron, CNNs have sparse interactions, thanks to the kernels, which have a small number of parameters, significantly reducing the number of parameters of the network.
Moreover, in an MLP there is one weight per connection between the input and the output, while CNNs apply the same kernel to different parts of the input, also reducing the number of parameters of the network. This is called the parameter sharing property.
Definition 5.3. A function f is said to be equivariant to a transform g , if for any input x , it holds f( g( x ) ) =g( f( x ) ) .
Because of parameter sharing, CNNs are equivariant to translations. This makes them particularly useful for some modalities, such as images, for which we aim to detect objects in the same manner regardless of their position.
Images, and some other modalities, like videos, are composed of channels. For example, RGB is a 3-channel way to compose images.
Kernels are convolved with all channels of the input, as visualized in the following figure.
image: 16_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado28.png
Figure 2: Convolution of a 3-channel input with a 3-channel kernel.
Convolutions are not equivariant to rotation, zoom, etc., and we might want our model to be invariant to these transformations. To make the network invariant to them, we can transform the data with data augmentations, including zooming and rotating, and feed the network with this transformed data.
Notice how convolutions reduce the input size whenever the kernel is larger than $1 \times 1$. This is due to the necessity of fitting the whole kernel inside the image.
The output size of the convolution of an input of size $n_h \times n_w$ with a kernel of size $k_h \times k_w$ is $(n_h - k_h + 1) \times (n_w - k_w + 1)$.

5.2.1 Fixed Kernels Weights

If we are interested in certain features, and we know how to define a kernel that is good at identifying these features, we can simply define and apply this kernel.
For example, in the following illustration we observe a kernel that is good at finding vertical edges:
image: 17_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado29.png

5.2.2 Learned Kernel Weights

Learning a CNN backbone using DL means learning the weights of the kernels via backpropagation. The filters learn the representation that best fits the training objective, via the training loss.
Using different kernels allows the detection of various patterns, so CNNs usually contain thousands of kernels, grouped in hierarchical layers, to learn from low-level features to high-level features.
To compute various features, such as different detections (horizontal, vertical, diagonal, etc.), we need several filters. For an input of size $C \times n_1 \times n_2$, where $C$ is the number of channels, and $F$ filters of size $C \times m_1 \times m_2$, the feature map produced has $F$ channels and is of size $F \times (n_1 - m_1 + 1) \times (n_2 - m_2 + 1)$.

5.2.3 Receptive fields

The receptive field is the region of the input space that a particular CNN's feature is looking at.

5.2.4 1 × 1 Convolution

The 1 × 1 convolution is a linear combination of all channels for each pixel, allowing to learn to reduce or increase the number of filters.
In general, if the input is of size $F_{in} \times H \times W$ and the filters are of size $F_{out} \times 1 \times 1$, then the output is of size $F_{out} \times H \times W$.
1 × 1 convolutions can replace fully connected layers. For example, in classification, we could do F out = N classes .

5.3 Padding, Stride, Pooling

5.3.1 The Sides Problem

Convolutions are applied on fixed-size patches of the input data, and they can only be shifted until there are no pixels left. This introduces two potential problems:

5.3.2 Padding

Padding is a solution to the edge problem, consisting of adding data at the sides of the image. There are several strategies, but in general we just add zeros:
image: 18_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado30.png
There are several modes of zero padding. For a padding of k pixels:
The output size of a convolution of an $n_h \times n_w$ input, to which we add $p_h$ rows and $p_w$ columns of padding, with a kernel of size $k_h \times k_w$, becomes $(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1)$. Generally, we want $p_h = k_h - 1$ and $p_w = k_w - 1$, which is achieved by applying same zero padding.

5.3.3 Stride

So far, we slide the kernel across all locations, right and down. However, we could skip some of these locations, effectively reducing the size of the output feature map and the computational cost, at the expense of the accuracy of the representations.
If we add strides $s_h$, $s_w$ for the height and width, the feature map size becomes $\left\lfloor\frac{n_h - k_h + p_h + s_h}{s_h}\right\rfloor \times \left\lfloor\frac{n_w - k_w + p_w + s_w}{s_w}\right\rfloor$.
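A small helper computing the feature-map size along one dimension from the formula above, assuming floor division for the stride; the example values (a 224-pixel input, a 7x7 kernel, 3 pixels of padding per side, stride 2) are made up.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one dimension for input n, kernel k, total padding p, stride s."""
    return (n - k + p + s) // s

# 224-pixel input, 7x7 kernel, padding of 3 on each side (p = 6 in total), stride 2
print(conv_output_size(224, 7, p=6, s=2))   # 112
```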

5.3.4 Pooling

Pooling consists in statistically summarizing a neighbourhood. A pooling layer takes a window of values and outputs one value. There are several pooling types:
For example, the following illustrates the use of a 2x2 max pooling layer:
image: 19_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado31.png
Pooling is useful to make our representations invariant to minor translations. For deep layers, minor modifications of their input can correspond to great changes in the input image.
Also, it reduces the number of parameters and computation, and helps reduce overfitting.
It sets up a strong prior: invariance to local translation of learned kernels.
It can be degrading for some tasks, where precise locations are required.

5.4 CNNs

LeNet, by LeCun et al. in 1998, was designed after 10 years of working with handwritten bank checks. It laid down the basics of DL:
The architecture is the following:
image: 20_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado32.png
AlexNet, by Krizhevsky et al. in 2012, was one of the first DNNs, and won the 2012 ImageNet challenge, starting a new DL era. They extended LeNet to larger images, with wider convolutions, and used ReLUs and max pooling, combined with dropout.
A very interesting fact they observed was that the deeper the layer, the larger the receptive field: the first layers learn basic patterns, while deeper layers learn more complicated shapes.
image: 21_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado33.png
Visual Geometry Group (VGG), by Simonyan et al. in 2015, was the first network built with blocks. The blocks consisted of:
The architecture was the following:
image: 22_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado34.png
The problem was how to maintain enough resolution to have a lot of blocks. The solution was to introduce more convolutions at low resolution, instead of a few convolutions at high resolution.
This increases the number of non-linearities and reduces the number of parameters needed to achieve an equal receptive field.
Network in Network (NiN), by Lin et al. in 2013: previous architectures improved by increasing the number of convolutions (width) and deepening the network (depth). This presents two problems:
NiN made improvements:
Inception Blocks: Google was facing a problem. They needed fast deployment and inference, but large convolutions with lots of channels increase inference time. Their solution was to reduce the number of channels by applying several convolutions in parallel. Instead of choosing a single filter size for a given layer, Inception blocks apply multiple different-sized filters (e.g., 1x1, 3x3, 5x5 convolutions) in parallel to the same input feature map. This allows the network to capture information at various scales.
An inception block looks as follows:
image: 23_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado35.png
GoogleNet, by Szegedy et al. in 2015, was based on inception blocks. It used several auxiliary classifiers during training to enhance discriminative features in the first layers and reduce vanishing gradients.
Residual Blocks: these blocks make use of residual connections, or shortcut connections, which forward the input to the following layer. That is, if a layer is represented as f( x ) , the same layer, when used inside a residual block, would be f( x ) +x . This is illustrated below.
image: 24_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado36.png
Residual blocks make the identity function easier to learn, and reduce vanishing or exploding gradient problems.
ResNet Blocks: these blocks are a combination of VGG blocks and residual blocks. They use 3 × 3 convolutions, with batch normalization to stabilize training and residual connection concatenated to the result of applying two of these convolutions. The residual connection can be either direct, or by means of a 1 × 1 convolution. This is illustrated below:
image: 25_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado37.png
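A minimal PyTorch sketch of such a ResNet block, assuming two 3x3 convolutions with batch normalization and a shortcut that is either the identity or a 1x1 convolution (channel sizes in the example are illustrative):

```python
# ResNet block: two 3x3 conv + batch norm layers, with a residual connection
# added to the block output before the final non-linearity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 convolution on the shortcut when shapes differ, identity otherwise.
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(y + self.shortcut(x))   # f(x) + x

out = ResidualBlock(64, 128, stride=2)(torch.randn(1, 64, 32, 32))  # (1, 128, 16, 16)
```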
Residual Neural Networks (ResNet), by He et al. in 2016, is a network architecture consisting of blocks of residual blocks. There are different ResNets depending on how many blocks or layers are used. For example, the following is a ResNet-18, as it has 18 trainable layers (the first 7x7 conv, two 3x3 convs in each of the eight internal blocks, and the last fully connected layer):
image: 26_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado38.png
ResNet is still a widely used backbone today.
ResNeXt: group convolution, by Xie et al. in 2017. To add more non-linearities in ResNets we can add more layers, increase the width of convolutions, or increase the number of channels. However, increasing the number of channels adds quadratic complexity ( F in × F out ). Group convolution is a technique to reduce the computational complexity while still increasing the network's width. In group convolution, the input channels are divided into groups, and a convolution is performed independently on each group. This means that if you have g groups, each group will have c/g input channels and b/g filters applied to it, where c is the total number of input channels, and b is the total number of intermediate channels. Each filter in a group only interacts with the input channels within that group.
The following diagram shows a block of the ResNeXt architecture. The input channels are first split into g groups. Each group is processed by its own set of 1x1 convolutions (to reduce dimensionality), followed by 3x3 convolutions (to capture spatial hierarchies). The outputs of these groups are then concatenated, and a final 1x1 convolution is applied to the concatenated result to produce the final output channels.
image: 27_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado39.png
DenseNet, by Huang et al. in 2017. This architecture is based on ResNet blocks, but instead of adding the input and the result of the convolution, it concatenates both of them, plus the results of each dense block, on the channel dimension. It makes use of transition layers, which are 1 × 1 convolutions after concatenation, to reduce the feature dimension.
Some other architectures address different problems. MobileNet, by Howard et al. in 2017, was designed to be deployed on mobile devices; it separates the convolution into two operations. EfficientNet, by Tan et al. in 2019, uses a grid search, under a computational budget, to find the best trade-off between width, depth and input resolution...
CNNs are slowly being replaced in some cases by Vision Transformers.

5.4.1 Object Detection Problem

The goal of the object detection problem is to find a bounding box that contains an object of interest. It implies two objectives:
If we consider this problem as a classification problem, one approach is to test various positions and scales to propose various anchor boxes. This is scalable only if the classifier is fast enough. However, CNNs are computationally expensive. A solution is to select only a subset of the boxes, by finding blobs likely to contain objects without class consideration.
Region proposal: selective search, by Uijlings et al. in 2013
Selective search is a widely used tool to provide regions of interest in images. The regions are computed following the pipeline:
  1. Ascendant segmentation: The algorithm starts with a fine-grained segmentation of the image, where many small regions are created based on pixel similarity (color, texture, etc.). This is often done using a graph-based segmentation method.
  2. Fuse regions at different scales: The segmented regions are then hierarchically merged based on various similarity measures. This process is iterative, with smaller regions combining to form larger ones. This step is crucial as it operates over multiple scales, allowing the algorithm to be more robust in detecting objects of different sizes and aspects.
  3. Convert regions to potential boxes of interest: The resulting regions from the previous step are then used to generate bounding boxes, which are the "region proposals." These are the candidate regions that could contain objects and are subsequently used by an object detection model for classification.
This process is depicted below:
image: 28_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado40.png
Intersection over Union (IoU)
For two sets, A and B, the Jaccard index, or Intersection over Union, is the ratio of the intersection area to the union area: J(A, B) = |A ∩ B| / |A ∪ B|.
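A small Python helper for this formula, assuming axis-aligned boxes given as (x1, y1, x2, y2) corners (the function name and box format are our own convention):

```python
# IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)            # |A ∩ B|
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)                 # |A ∩ B| / |A ∪ B|

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.143
```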
Label region proposal
Region-proposal anchor boxes are associated with ground-truth bounding boxes according to their IoU (above a threshold). The class of an anchor is the same as that of its associated bounding box. If the anchor box is not associated with any bounding box, it is labeled as background.
The offsets between the central coordinates of the anchor box and those of the bounding box also form part of the region's label.
That is, proposal anchor boxes are labeled with the same label of a ground-truth bounding box if their IoU is above a given threshold.
R-CNN, by Girshick et al. in 2014
R-CNN takes region proposal as input. It computes features for each of the regions, and the classification is done using several SVMs. The regression of offsets is done by using a linear regression model.
To train R-CNN, there is a supervised pretraining of a CNN backbone, generally on ImageNet, followed by an optional fine-tuning of the backbone to learn specialised features based on a classification objective. The backbone is then fixed.
For each input: crop and warp the proposal regions computed by selective search and compute their features. Train a binary SVM for each class, and apply linear regression to correct small offsets between the prediction and the actual bounding box.
image: 29_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado41.png
Evaluating detection models
The meaning of true positive, false positive and false negative for the detection problem is not the same as for regular classification problems. True and false positives/negatives are determined by comparing the IoU against a given threshold. For example:
image: 30_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado42.png
Remember, precision = TP / (TP + FP) and recall = TP / (TP + FN).
The average precision is the area under the precision-recall curve.
The mean average precision (mAP) is the mean of the average precision over various IoU thresholds.
R-CNN is slow in the testing phase, because there are as many passes through the CNN as there are region proposals, which is around 2000. Also, SVMs and linear regression are a bit old school, so this architecture is not usable for real-world applications.
Fast R-CNN, by Girshick in 2015
This modification of R-CNN was proposed to increase its efficiency. It passes the whole image only once through the CNN backbone. The CNN is trainable, not fixed, and the proposed regions are associated with the computed features from the output feature map.
Each region of interest can have a different size, so the selective-search proposals are combined with the CNN features in the Region of Interest (RoI) Pooling Layer, which reshapes the features to feed them to fully connected layers. From these features, the classes and the offsets are predicted.
The following table compares R-CNN and Fast R-CNN:
                                            R-CNN    Fast R-CNN
Test time per image                         47 s     0.32 s
Speedup                                     1x       146x
Test time per image with selective search   50 s     2 s
Speedup                                     1x       25x
Faster R-CNN, by Ren et al. in 2017
This new modification replaces selective search by a learned region proposal network. The rest is similar to Fast R-CNN.
The region proposal network (RPN) is a network learned to propose regions of interest. It slides a window over the feature map and, at each location of the window, it makes a prediction for k anchors (propositions), which are sampled by varying scale and aspect ratio. A small network predicts whether there is an object there, and another small network predicts the offsets with respect to the bounding box.
The following table compares the three approaches:
                                        R-CNN   Fast R-CNN   Faster R-CNN
Test time per image (with proposals)    50 s    2 s          0.2 s
Speedup                                 1x      25x          250x
mAP (VOC 2007)                          66.0    66.9         66.9
You Only Look Once (YOLO), by Redmon et al. in 2016
This model is a single stage detector: one network is used to predict the bounding boxes and the class probabilities. It is the first real time detector close to Faster R-CNN performance.
Its architecture consists of 24 convolutional layers with an input dimension of 448 × 448. Each input is divided into a 7 × 7 grid of cells. Each cell computes an objectness score and anchor boxes, considering objects centred in the cell, while allowing the boxes to extend beyond the cell.
This model has been enhanced multiple times, with YOLO v2, v3, v4, v5,...

5.4.2 Semantic Segmentation

The goal in this case is to find an object class for each pixel. There are several associated problems:
These are illustrated below:
image: 31_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado43.png
Mask R-CNN by He et al. in 2017
Basically, it is Faster R-CNN with a third branch, which outputs the object mask.
It can use various backbones such as ResNet, DenseNet,...

5.5 Data and Transfer

Modern CNNs are made of millions of parameters, requiring a huge amount of data to train them. For example, ImageNet has 1.2M images for 1k classes, and is being replaced by ImageNet21k, that has 1B images.
Manually annotating such an amount of images is costly in terms of money and time, and therefore there is a need for techniques to deal with the lack of data: data augmentation and transfer learning.

5.5.1 Data Augmentation

We already saw it! It increases the diversity of the data by transforming the source data with invariant transformations.

5.5.2 Transfer Learning

Sometimes it is not possible to learn from scratch, with random initialisation, because of a lack of data. In this case, we can make use of a backbone already pre-trained on a source task for the target task.
In all cases, starting from pre-trained weights is better than starting from nothing.
Fine tuning
Fine tuning consists in training several layers of a pre-trained backbone:
image: 32_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado44.png
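As an illustration of fine-tuning, the following hedged PyTorch/torchvision sketch loads a pre-trained ResNet-18 backbone, freezes the early layers and trains only the last block plus a new classification head; the choice of which layers to unfreeze (and the 10-class head) is illustrative:

```python
# Fine-tuning sketch: freeze most of a pre-trained backbone, retrain the rest.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained backbone
for p in model.parameters():
    p.requires_grad = False                        # freeze everything
for p in model.layer4.parameters():
    p.requires_grad = True                         # fine-tune only the last block
model.fc = nn.Linear(model.fc.in_features, 10)     # new head for the target task

trainable = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.SGD(trainable, lr=1e-3)  # then train as usual
```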
Domain Adaptation
Sometimes, the target task and the source task share the same classes, but the target data has a different distribution than the source. This is an entire field in ML: how to adapt our features to the new domain, and have the best possible results.

5.6 Self-Supervised Learning (SSL)

We already know supervised learning: it consists in predicting a target associated with an input.
Self-Supervised Learning (SSL) is different: it consists in predicting a part of an input from the input itself. This means the model learns to understand and generate data by teaching itself.
A pretext task is a task designed to teach the model something about the structure of the input data without using labels. For example, the pretext task might involve predicting the next word in a sentence, or the color version of a grayscale image. The idea is to construct targets from the data itself and learn from these artificially created labels.
SSL presents several advantages:
All this is illustrated below:
image: 33_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado45.png
For images, several pretext tasks were first proposed, like rotation, patch localization, colorization, counting,...
For videos, which add the time dimension, we find the same pretext tasks as in images, plus some others specific to videos, which can be masking frames, shuffling frames,...
Contrastive Learning
Contrastive Learning is a pretext task for SSL. It aligns positive pairs of images and pushes other images away. The problem is how to define what a positive pair is, which can be solved by using data augmentation.
The following figure illustrates this concept:
image: 34_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado46.png
SimCLR, by Chen et al. in 2020
This model defines a siamese pipeline for contrastive learning. Positive pairs are formed by strong data augmentation, like color jittering, Gaussian blur, grayscale, etc. Both views go through an encoder and a projector, which consists of several stacked MLP layers. Then, a contrastive loss is applied.
image: 35_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado47.png
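A hedged sketch of a SimCLR-style contrastive (NT-Xent) loss in PyTorch, assuming z1 and z2 are the projector outputs of two augmented views of the same batch (the function name and the temperature value are illustrative):

```python
# NT-Xent contrastive loss: each view's positive is the other view of the
# same image; all other samples in the batch act as negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, unit norm
    sim = z @ z.T / tau                                  # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))           # remove self-similarity
    # Positives: row i in z1 matches row i in z2, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```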
SCE, by Denize et al. in 2021
This model tries to address the problem of how to deal with negatives that share semantic information with the positive pairs. For example, two different cats share more information than a cat and a dog. Pushing away negatives with high semantic similarity makes training unstable, so SCE predicts positive pairs and estimates relations among instances.
Deep Cluster, by Caron et al. in 2018
Clustering associates several input data points with the same pseudo-label, without supervision.
Deep Cluster applies clustering to train a backbone using several stages:

6 Recurrent Neural Networks

6.1 Introduction

Recurrent Neural Networks are a family of Neural Network architectures specially designed to deal with sequential data, such as text in NLP, speech signals in speech2text tasks, temporal data in videos, or other temporal signals, like ECGs, time series, etc.
Sequential data is data with a temporal component, which usually implies correlation in the temporal dimension. More precisely, a sequence consists of a set of vectors x^(t), where t ∈ [1, τ] is a temporal index.
Note that the temporal index is not necessarily related to time, but to order. For example, text is not temporal, but words are ordered.
Memory is essential for us to understand and interpret what we perceive. Memory can be understood as a persistent form of information. MLPs don't have this persistence: each data point is processed independently of the rest. Therefore, RNNs introduce a way to store information, by adding inner loops that enable them to preserve information at each time-step.

6.2 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is defined with a state, h ( t ) , by recurrence. This state depends on the current input, x ( t ) , and the previous state, h ( t-1 ) : h ( t ) =f( h ( t-1 ) , x ( t ) ; θ ) . The computational graph can be represented with a loop, or unfolded, making it direct and acyclic, as:
image: 36_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado50.png
An RNN is trained to predict the future, given past information, in such a way that h^(t) is a summary of the sequence up to timestep t. This summarization implies information compression, or loss of information, and training must keep the information relevant for the task.
Going back to the definition of the state, we can alternatively write it as h^(t) = f(h^(t-1), x^(t); θ) = g^(t)(x^(t), x^(t-1), ..., x^(1)), where g^(t) takes the whole sequence, of variable length, as input.
On the other hand, using the factorized form, with transition function f , the function is the same at each timestep, allowing for parameter sharing and generalization.
Theorem 6.1. RNNs are universal approximators.
We measure the performance of an RNN through the cost function L between the output o and the ground truth y. Internally, we use a softmax: ŷ_i = e^{o_i} / Σ_j e^{o_j}. The parameters of an RNN are:
That is:
image: 37_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado51.png

6.2.1 Direct Propagation

In the case of discrete variable prediction (like words), o represents the log-likelihoods of the possible output values, and the normalized probability is ŷ = softmax(o).
Direct propagation works as follows: a^(t) = b + W h^(t-1) + U x^(t), h^(t) = φ(a^(t)), o^(t) = c + V h^(t), ŷ^(t) = softmax(o^(t)), where φ is the activation function.
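A step-by-step NumPy sketch of this direct propagation, with tanh used as the activation φ (dimensions and variable names are illustrative):

```python
# Vanilla RNN forward pass over a sequence, following the equations above.
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c, h0):
    """xs: list of input vectors x^(t); returns hidden states and predictions."""
    h, hs, ys = h0, [], []
    for x in xs:
        a = b + W @ h + U @ x       # a^(t) = b + W h^(t-1) + U x^(t)
        h = np.tanh(a)              # h^(t) = φ(a^(t))
        o = c + V @ h               # o^(t) = c + V h^(t)
        hs.append(h); ys.append(softmax(o))
    return hs, ys

# Toy dimensions: input 3, hidden 4, output 2.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
hs, ys = rnn_forward([rng.normal(size=3) for _ in range(5)],
                     U, W, V, np.zeros(4), np.zeros(2), np.zeros(4))
```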

6.2.2 Recurrent Neural Networks with output recurrence

We don't link hidden layers directly in a recurrent way; instead, we link the output to the hidden layer. The cons are that the output carries less information than h^(t-1), making this a less expressive model; the pros are that it is easier to train, enabling parallelization and efficient backpropagation.
image: 38_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado52.png

6.2.3 Recurrent Neural Networks with unique output

In this case, the network needs the complete input sequence, and the output is a summary of the input. This kind of setting is usually used as a simpler module of a more complex architecture.
image: 39_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado53.png

6.2.4 More architectures

In this figure we observe different possible setups, which apply to different use cases.
image: 40_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado54.png
For instance:

6.2.5 Bi-Directional Recurrent Neural Network

Sometimes, the future of the sequence can also be helpful, for example in NLP, in character recognition and even in speech recognition. For this, there exists the Bi-Directional RNN (BRNN), which combines an RNN processing from past to future, with state h, and an RNN processing from future to past, with state g; the output combines the two of them.
image: 41_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado55.png
Moreover, BRNNs can be applied to images with 4 directions, processing them up, down, left and right. This is a somewhat outdated approach for visual recognition, with architectures such as ReNet for classification and ReSeg for segmentation.

6.2.6 Deep Recurrent Neural Network

A classical RNN has an input layer, U , a hidden layer, W , and an output layer V . However, we can add hidden layers, allowing to go higher in abstraction.
The representation improves hierarchically:
image: 42_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado56.png
Also, we can add an MLP before every recurrent layer, increasing the representation capacity of the network, at the expense of a harder training. A possibility to improve these is to use skip connections.

6.3 Recurrent Neural Networks Training

The principle is the same as with MLPs and CNNs: we choose a cost function and minimize it during training using gradient descent with backpropagation.
However, we need to take into account the recurrence of the network, and so we use what is called Backpropagation Through Time (BPTT), which consists in applying backpropagation to the unrolled graph of the RNN.
The idea is to forward through the entire sequence to compute the loss, and then go backwards through the entire sequence to compute gradient.
However, this can be very inefficient, so we also find truncated BPTT, which applies BPTT to chunks of the sequence instead of the complete sequence. The hidden states are carried forward in time until the end of the sequence, but we only backpropagate for a smaller number of steps.
image: 43_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado57.png

6.4 Long-Short Term Memory (LSTM) and Gated Recurrent Units (GRU)

Classical RNN present two main limitations:
  1. A computation limitation: they are hard to train, and prone to gradient vanishing and explosion.
  2. Long term interactions are hard to model.
Some upgraded variants of RNNs are Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU).
Gradient vanishing and exploding: the reason RNNs are prone to gradient vanishing/exploding is that the gradient needs to go through the unfolded computational graph. If the largest singular value of the hidden matrix is greater than 1, a recurrent multiplication by it would tend to increase the gradient vector, leading to gradient explosion. The opposite happens when the largest singular value is smaller than 1, leading to gradient vanishing.
The solution to gradient explosion can be using gradient clipping.
The solution to gradient vanishing is to change the RNN architecture.
Long-term dependency problem: RNNs can connect past information to present data, and, in theory, time distance of information is irrelevant. However, in practice, classical RNNs cannot model long sequences (more than 20 points).

6.4.1 LSTM

LSTM is based on a standard RNN whose neuron activates with tanh .
image: 44_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado58.png
C_t is the cell state, which flows through the entire chain and is updated with a sum instead of a product. This helps avoid memory vanishing, as well as gradient explosion and vanishing.
Then, there are three gates governed by sigmoid units, which control the flow of information in and out.
A key component is the forget gate, f_t = σ(W_f [h_{t-1}, x_t] + b_f), which can affect the previous cell state, letting it through or not (forgetting).
Then, the input gate layer, i t = σ ( W i [ h t-1 , x t ] + b i ) , which processes the input, contributing to the cell state.
Then, there is the classical neuron, activated with tanh , which also contributes to the cell state, by C ˆ t =tanh( W C [ h t-1 , x t ] + b C ) .
Finally, the cell state is updated as C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t. The output is then computed as h_t = o_t ⊙ tanh(C_t), where o_t = σ(W_o [h_{t-1}, x_t] + b_o) is the output gate.
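The following NumPy sketch performs one LSTM step following these gate equations, with the weight matrices acting on the concatenation [h_{t-1}, x_t] (shapes and names are illustrative):

```python
# One LSTM cell step: forget, input and output gates plus candidate cell state.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ hx + b_f)        # forget gate
    i_t = sigmoid(W_i @ hx + b_i)        # input gate
    C_hat = np.tanh(W_C @ hx + b_C)      # candidate cell state
    C_t = f_t * C_prev + i_t * C_hat     # cell state update (sum, not product)
    o_t = sigmoid(W_o @ hx + b_o)        # output gate
    h_t = o_t * np.tanh(C_t)             # hidden state
    return h_t, C_t
```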

6.4.2 GRU

GRU is a variant of LSTM, simpler and faster, and with similar performance. It works by applying: u_t = σ(W_u x_t + U_u h_{t-1} + b_u), r_t = σ(W_r x_t + U_r h_{t-1} + b_r), ĥ_t = tanh(W_h x_t + r_t ⊙ (U_h h_{t-1}) + b_h), h_t = u_t ⊙ ĥ_t + (1 - u_t) ⊙ h_{t-1}.
image: 45_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado59.png
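Analogously, here is a one-step NumPy sketch of the GRU update above, with ⊙ implemented as the element-wise product (weight names follow the equations; shapes are illustrative):

```python
# One GRU cell step: update gate, reset gate, candidate state, interpolation.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_u, U_u, b_u, W_r, U_r, b_r, W_h, U_h, b_h):
    u_t = sigmoid(W_u @ x_t + U_u @ h_prev + b_u)             # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)             # reset gate
    h_hat = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev) + b_h)   # candidate state
    return u_t * h_hat + (1.0 - u_t) * h_prev                 # new hidden state
```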
Even with these improvements, RNNs struggle with long sequences, and remain difficult and slow to train because of their autoregressive nature. In 2017, a Google paper introduced a novel architecture that changed everything: the Transformer.

7 Deep Generative Modelling

Recall that the objective of supervised learning is to learn a mapping, f , between a data sample, x , and its label, y : y=f( x ) . This can be formalized by learning the conditional distribution between labels, given a data point, that is, p( y|x ) .
For this task, we need to have access to the annotated pairs ( x,y ). The applications of supervised learning are classification, object detection, segmentation, etc.
On the other hand, unsupervised learning tries to learn the underlying hidden data structure, formalized by the unconditional distribution of the data points, p(x).
In this case, we don't need any annotations to the data. The applications of unsupervised learning are generative modelling, clustering, feature learning, summarization, etc.
Generative modelling consists in learning a model, M , that can approximate the data distribution, p M ( x ) p( x ) , and should enable sampling easily. That is, a generative model aims to provide:

7.1 Variational Auto-Encoders (VAE)

7.1.1 Auto-Encoders

AEs are a type of unsupervised model used to learn low-dimensional latent representations, z , of data samples, x . This means that the aim is to somehow try to summarize the input data space, X , into a smaller space, Z , so dim( Z ) <dim( X ) . Z should capture relevant characteristics of data, and the choice of its dimensionality is a compromise between interpretability and the amount of information to keep.
In Figure 3 we can see an illustration of autoencoders. The input is processed by the encoder, which is able to embed it into a smaller space, the latent space. The decoder can take the point in the latent space and reconstruct the original image.
To train them, this process is applied to the training set, and the loss function is the distance between the reconstructed images and the original ones: L(M, X) = Σ_{x ∈ X} ||M(x) - x||².
image: 46_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___ne-Learning_LectureNotes_source_autoencoder.png
Figure 3: Illustration of autoencoders. Source: https://www.v7labs.com/blog/autoencoders-guide.
The problem with AEs is that they don't allow generation, because the latent distribution p(z) is difficult to sample from.
To see this more clearly, let's discuss an example done by Joseph Rocca, Understanding VAEs.
In the following image, we observe how, when using AEs, points in the latent space can have no real meaning, making it difficult to sample and generate new objects, which is the ultimate goal of generative models.
image: 47_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado60.png
What is desirable, is that somehow, points in the latent space that are close to each other, should have a similar real meaning, once they are decoded:
image: 48_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado61.png
This cannot be achieved with AEs, and this led to the introduction of Variational Auto-Encoders (VAEs).

7.1.2 Variational Auto-Encoders

In this case, the latent representation is not directly learnt, but rather a latent distribution. In Figure 4, we observe the illustration of VAEs.
image: 49_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado62.png
Figure 4: Illustration of a variational auto-encoder. Source: Wikipedia.
In this case, training is more complex. We want to maximize the data likelihood, or more precisely its log-likelihood: L(θ, φ, x) = log p_θ(x). Notice that this does not depend on the distribution of z, q_φ, so we can write log p_θ(x) = E_{z∼q_φ(z|x)}[log p_θ(x)], and applying Bayes' rule in reverse: E_z[log p_θ(x)] = E_z[log (p_θ(x|z) p_θ(z) / p_θ(z|x))]. Here, we can multiply and divide by q_φ(z|x) and use the properties of logarithms to get E_z[log p_θ(x)] = E_z[log p_θ(x|z)] - E_z[log (q_φ(z|x) / p_θ(z))] + E_z[log (q_φ(z|x) / p_θ(z|x))]. Now, recalling the definition of the Kullback-Leibler divergence, D_KL(P ∥ Q) = E_P[log (P(x) / Q(x))], we can rewrite L(θ, φ, x) = E_z[log p_θ(x|z)] - D_KL(q_φ(z|x) ∥ p_θ(z)) + D_KL(q_φ(z|x) ∥ p_θ(z|x)). The last term is very difficult to compute, because p_θ(z|x) is intractable. Since it is non-negative, we can bound L from below: L(θ, φ, x) ≥ E_z[log p_θ(x|z)] - D_KL(q_φ(z|x) ∥ p_θ(z)); this is called the Evidence Lower Bound (ELBO). The first term is the reconstruction term, which can be approximated by sampling and serves to ensure good reconstructions of the output. The second term is the regularisation term, which makes the distribution of the encoder close to the prior.
The impact of the regularisation is that it enables for continuity in the latent representation, so that similar data samples have similar latent representations; and it helps to ensure completeness of the latent space, so that all latent points have a meaningful decoding.
Joseph Rocca illustrates this as:
image: 50_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado63.png
In addition to enabling continuity and completeness, or maybe thanks to this, VAEs enable a much easier sampling and generation. We can just sample z from a normal distribution, and process it through the decoder.
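A hedged PyTorch sketch of the resulting training objective, assuming the encoder outputs the mean and log-variance of a Gaussian q_φ(z|x), a Gaussian decoder (squared-error reconstruction) and a standard normal prior, so the KL term has a closed form (function names are ours):

```python
# VAE loss = reconstruction term + KL(q_phi(z|x) || N(0, I)),
# with z drawn using the reparameterization trick.
import torch

def sample_z(mu, log_var):
    eps = torch.randn_like(mu)                   # reparameterization trick
    return mu + torch.exp(0.5 * log_var) * eps   # z = mu + sigma * eps

def vae_loss(x, x_rec, mu, log_var):
    rec = ((x_rec - x) ** 2).sum()               # reconstruction (Gaussian decoder)
    # Closed-form KL between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum()
    return rec + kl
```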

7.2 Generative Adversarial Networks

Now, the objective is to sample from a high-dimensional and multi-modal distribution, p(x), by learning a mapping, G, between a simple distribution from which we can sample and the high-dimensional distribution. For this, Generative Adversarial Networks (GANs) consist of two components:
This is illustrated below:
image: 51_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado64.png
To train these models, we follow a minimax approach: the discriminator aims at maximizing its success rate, while the generator aims at minimizing the success rate of the discriminator. This can be expressed as J(D, G, X) = min_G max_D V(D, G, X) = min_G max_D E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 - D(G(z)))]. This can be divided into the discriminator's and the generator's parts, as in the sketch below:
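A minimal PyTorch sketch of one training step of this minimax game, assuming a generator G, a discriminator D that ends with a sigmoid, and their optimizers already exist (names, latent dimension and loss bookkeeping are illustrative):

```python
# One GAN training step: update D on real vs. fake, then update G to fool D.
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_real, z_dim=64):
    batch = x_real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch, z_dim)
    x_fake = G(z).detach()                         # do not backprop into G here
    loss_D = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: minimize log(1 - D(G(z))) (in practice, maximize log D(G(z))).
    z = torch.randn(batch, z_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```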
Once the training is done, generation is quite straightforward. We only need to sample z p z ( z ) and process it with the generator module.
An interesting property of GANs is that they show continuous representations of the latent space, as shown in the figure below:
image: 52_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado65.png
Figure 5: Interpolating in the latent space of GANs. Source: [2].
In the following table, we compare GANs and VAEs in different aspects:
        Sample Quality   Inference   Mode Coverage   Training Quality Assessment   Training Stability
VAE     Intermediate     Good        Good            Good                          Good
GAN     Very good        Bad         Intermediate    Intermediate                  Intermediate

7.3 Other Approaches

7.3.1 Normalizing Flows

The idea is to transform a simple distribution into a complex one by applying a sequence of invertible transformation functions.
The pros of this approach are that it provides good sample quality, exact inference and stable training. However, it requires invertible neural networks.

7.3.2 Pixel RNN

This approach consists in generating the image pixels one by one, starting from a corner. The dependencies between the current pixel and the previous ones are modelled using an RNN. The biggest drawback of this approach is that sequential generation is very slow.

8 Denoising Diffusion Models

To train diffusion models, there are two processes involved:
This is illustrated in this Figure:
image: 53_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado66.png
The forward process can be seen as a Markov process, with starting state x_0 and transition probability q(x_t | x_{t-1}) for t = 1, ..., T: x_0 → x_1 → x_2 → ... → x_{T-1} → x_T. The transition probability, or Markov kernel, is usually chosen to be q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) x_{t-1}, β_t I), with β_t ∈ (0, 1), but it can be any Markov diffusion kernel. The property of these kernels is that they converge to a standard distribution; in this case, they converge to a normal distribution, always the same, no matter the input x_0.
Thus, the forward trajectory can be expressed as q(x_{0:T}) = q(x_0) ∏_{t=1}^{T} q(x_t | x_{t-1}). To go from step t-1 to step t, we just need to sample from the kernel distribution.
It is also possible to advance several steps at once. For this, we can define the parameters α_t = 1 - β_t and ᾱ_t = ∏_{s=1}^{t} α_s, and use the Markov kernel q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 - ᾱ_t) I), which is equivalent to computing x_t = √ᾱ_t x_0 + √(1 - ᾱ_t) ε with ε ∼ N(0, I). This allows us to reach any node in the chain in just one step.
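A small NumPy sketch of this closed-form forward process, using a linear β schedule as an illustrative choice:

```python
# Jump directly to step t of the forward diffusion:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # ᾱ_t = prod_{s<=t} α_s

def q_sample(x0, t, rng=np.random.default_rng()):
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

x0 = np.zeros((8, 8))                     # toy "image"
x_t, eps = q_sample(x0, t=500)            # noisy sample at an intermediate step
```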
The parameter β_t is the noise scheduler, and it is chosen with increasing variance, in such a way that q(x_T) is a standard Gaussian. If the β_t are small enough, the reverse transitions are also Gaussian. However, q(x_{t-1} | x_t) is intractable; using Bayes' rule we know that q(x_{t-1} | x_t) ∝ q(x_{t-1}) q(x_t | x_{t-1}), and we can use a model to approximate this reverse process.
That is, we are going to train a model to approximate the reverse, denoising, process. This parametric reverse model:
So, how may we achieve this?
The reverse process is x_T → x_{T-1} → ... → x_1 → x_0, with transitions p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² I), where μ_θ(x_t, t) will be our trainable network. The reverse trajectory is then p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^{T} p_θ(x_{t-1} | x_t). To train this, we follow a maximum likelihood approach: J(x, θ) = -log p_θ(x_0). Now, we can marginalize by integrating over the latents: -log p_θ(x_0) = -log ∫ p_θ(x_{0:T}) dx_{1:T}, which is a shortened notation for -log ∫...∫ p_θ(x_0, x_1, ..., x_T) dx_1 ... dx_T. Here, we can multiply and divide by q(x_{1:T} | x_0): -log p_θ(x_0) = -log ∫ (p_θ(x_{0:T}) / q(x_{1:T} | x_0)) q(x_{1:T} | x_0) dx_{1:T}, which can be seen as an expectation: -log p_θ(x_0) = -log E_{x_{1:T}∼q(x_{1:T}|x_0)}[p_θ(x_{0:T}) / q(x_{1:T}|x_0)], and at this point we can use Jensen's inequality: -log E_{x_{1:T}∼q}[p_θ(x_{0:T}) / q(x_{1:T}|x_0)] ≤ E_{x_{1:T}∼q}[-log (p_θ(x_{0:T}) / q(x_{1:T}|x_0))] = L_VLB. We have reached L_VLB, the variational lower bound. After some derivations, it is possible to arrive at E_q[D_KL(q(x_T|x_0) ∥ p(x_T)) + Σ_{t>1} D_KL(q(x_{t-1}|x_t, x_0) ∥ p_θ(x_{t-1}|x_t)) - log p_θ(x_0|x_1)]. Here, the first term has no learnable parameters, so we can remove it, and we name the remaining terms L_{t-1} and L_0. Now, D_KL(P ∥ Q) is always non-negative, and is 0 when P = Q. Also, when conditioned on x_t and x_0, the reverse process becomes tractable: q(x_{t-1} | x_t, x_0) = N(x_{t-1}; μ̂_t(x_t, x_0), β̂_t I), where μ̂_t(x_t, x_0) = (√ᾱ_{t-1} β_t / (1 - ᾱ_t)) x_0 + (√α_t (1 - ᾱ_{t-1}) / (1 - ᾱ_t)) x_t and β̂_t = ((1 - ᾱ_{t-1}) / (1 - ᾱ_t)) β_t.
Of course, we don't have x_0 available at inference time, so what we want is for p_θ(x_{t-1} | x_t) to be as close as possible to q(x_{t-1} | x_t, x_0). For this, we exploit the Gaussian form and train μ_θ(x_t, t) to approximate μ̂_t(x_t, x_0) in our approximated reverse process.

9 Transformers

Credits to Lucas Beyer's slides https://acdl2023.icas.cc/dr-lucas-beyers-slides/.

9.1 Attention

Attention is a function, similar to a soft key-value dictionary lookup. Let's delve into it more precisely:
  1. The attention weights, a_{1:N}, represent query-key similarities. They are defined by â_i = q · k_i and normalized via a softmax: a_i = e^{â_i} / Σ_j e^{â_j}.
  2. The output, z, is an attention-weighted average of the values v_{1:N}. That is: z = Σ_i a_i v_i = aᵀv.
  3. Usually, k and v are derived from the same input x : k= W k x, v= W v x. The query, q , can come from a separate input y : q= W q y or from the same input x (in which case it's called self-attention): q= W q x.
In addition, it is usual to use many queries, q_{1:M}, not just one. Stacking the different queries leads to the definition of the attention matrix, A_{1:N,1:M}, and subsequently to many outputs: z_{1:M} = Attn(q_{1:M}, x) = [Attn(q_1, x) | ... | Attn(q_M, x)]. We usually use multi-head attention: the operation is repeated K times, with a different set of projection matrices per head, and the results are concatenated along the feature dimension.
The most commonly seen formulation of the process is z = softmax(QKᵀ / √d_key) V, where Q is the query matrix, K is the key matrix, and V is the value matrix. d_key is the dimension of a key in K.
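A NumPy sketch of this formulation (shapes are illustrative; a numerically stable softmax is applied over the keys for each query):

```python
# Scaled dot-product attention: z = softmax(Q Kᵀ / sqrt(d_key)) V.
import numpy as np

def attention(Q, K, V):
    """Q: (M, d), K: (N, d), V: (N, d_v) -> output of shape (M, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # query-key similarities
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)                  # softmax over the keys
    return A @ V                                           # weighted average of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
z = attention(Q, K, V)   # (4, 16): one output per query
```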

9.2 Architecture

The architecture of the transformer is the following:
image: 54_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado73.png
Let's analyze it step by step:
image: 55_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado74.png
image: 56_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado75.png image: 57_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado76.png
image: 58_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado77.png image: 59_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado78.png
image: 60_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado79.png image: 61_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado80.png
image: 62_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado81.png
image: 63_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado82.png
image: 64_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado83.png
image: 65_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado84.png
image: 66_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado85.png
image: 67_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado86.png

9.3 The first big takeover: Language Modeling

image: 68_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado87.png

9.4 The second big takeover: Computer Vision

Previous approaches used attention by:
Then came the Vision Transformer. Many prior works attempted to introduce self-attention at the pixel level. For a 224 × 224 image, this means a sequence length of roughly 50k, which is definitely too much. Therefore, most works restricted attention to local pixel neighbourhoods, or used it as a high-level mechanism on top of detections.
The key breakthrough for using the full transformer architecture, standalone, was to tokenize the images by cutting them into patches of 16x16 pixels and treating each patch as a token, i.e., embedding it into an input space.

9.5 The third big takeover: speech

The Conformer is a modified architecture, following the same story as in the vision realm, but with spectrograms of the sounds instead of images.
The Conformer adds a third type of block, using convolutions, and slightly reorders the blocks, but overall it is very similar to the Transformer.
It exists as an encoder-decoder, or as an encoder-only variant with a CTC loss.

9.6 The fourth big takeover: Reinforcement Learning

The Decision Transformer casts the supervised/online RL problem into a sequence modeling task. It can generate/decode sequences of actions with a desired return. The trick is prompting:
“The following is a trajectory of an expert player: [obs] ...”

9.7 The Transformer's unification of communities

Anything we can tokenize, we can feed to a Transformer! The idea is basically to tokenize different modalities, each one in its own way, and send them all jointly into a Transformer... and this seems to work.
This is called a multimodal model, and there is currently an explosion of works doing this.

9.8 A note on efficient transformers

The self-attention operation has complexity O(N²) for sequence length N, because we need to compute an N × N matrix product. We'd like to use larger N, for whole articles or books, full video movies, high-resolution images, etc.
Many O(N) approximations to the full self-attention have been proposed, but none provides a clear improvement yet. There seems to always be a trade-off between speed and quality.
image: 69_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado88.png

Part II Reinforcement Learning

10 Introduction

People and animals learn by interacting with the environment that surrounds them, which differs from certain other types of learning. This process is active, rather than passive: the subject needs to interact with the environment to obtain knowledge. Also, the interactions are usually sequential, with future interactions possibly depending on earlier ones.
Not only this, but we are goal oriented: we act towards an objective. And, more importantly, we can learn without examples of optimal behavior! Instead, we optimise some reward signal obtained from the outcome of our actions.
It is from these observations that Reinforcement Learning (RL) arises as a learning paradigm, based on the interaction loop: there is an agent in an environment; the agent can take actions in the environment, and get observations from it.
image: 70_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___arning_LectureNotes_source_interaction_loop.png
RL relies on the reward hypothesis:
Conjecture 10.1. Reward Hypothesis
Any goal can be formalized as the outcome of maximizing a cumulative reward.
This hypothesis basically says that every objective that an agent can have, can be stated in terms of maximizing a reward associated to the actions of the agent with respect to this objective.
For example, if the objective of the agent is to fly a helicopter from point A to point B, then the reward could be negatively affected by the distance to point B, by the time taken to reach B,...
Now, it is important to realize that there exist different reasons to learn:
Well, RL can provide algorithms for both cases! Note that the second point is not just about generalization, but also about coping with the so-called data shift, efficiently, during operation.
With all this, now we can define RL:
Definition 10.1. Reinforcement Learning is the science and framework of learning to make decisions from interaction.
This requires us to think about time, consequences of actions, experience gathering, future prediction, uncertainty,...
It has a huge potential scope and is a formalisation of the AI problem.

11 Definition and components

Definition 11.1. The environment is the world of the problem at hand, with agents in it that can perform actions, over time.
At each time step t , the agent:
  • Receives observation O t and reward R t from the environment.
  • Executes action A t .
And the environment:
  • Receives action A t .
  • Emits observation O t+1 and reward R t+1 .
But, what is a reward?
A reward, R t , is a scalar feedback signal which indicates how well the agent is doing at step t : it defines how well the goal is being accomplished!
Therefore, the agent's job is to maximize the cumulative reward of the future steps, i.e., G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... This is called the return. But, when one thinks about it carefully, one realizes that it is hard to know the future rewards with such precision. Therefore, it is also usual to use the value, which is the expected return given the current state s: v(s) = E[G_t | S_t = s]. This depends on the actions the agent takes, and the goal is to maximize it! To achieve this, the agent must pick suitable actions.
Therefore, rewards and values define the utility of states and actions, and in this setup there is no supervised feedback.
Note, also, that these values can be defined recursively as G_t = R_{t+1} + G_{t+1} and v(s) = E[R_{t+1} + v(S_{t+1}) | S_t = s].
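As a tiny illustration of this recursion, the following Python snippet computes the (undiscounted, as in the definition above) returns of a reward sequence backwards:

```python
# Returns via the recursion G_t = R_{t+1} + G_{t+1}, computed from the end.
def returns(rewards):
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + G
        out.append(G)
    return list(reversed(out))

print(returns([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```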
The environment state is the environment's internal state, which is usually invisible or partially visible to the agent. It is very important, but it can also contain lots of irrelevant information.
An environment is fully observable when the agent can see the full environment state, so every observation reveals the whole environment state. That is, the agent state could just be the observation: S t = O t .
Note that S t is the agent state, not the environment state!

11.1 Maximising the value by taking actions

As we have outlined, the goal is to select the actions that maximise the value. For this, we may need to take into account that actions may have long term consequences, delaying rewards. Thus, it may be better to sacrifice immediate reward to gain long-term reward.
The decision making process that for a given state chooses which action to take is called a policy.
To decide which action to take, we can also condition the value on actions: q(s, a) = E[G_t | S_t = s, A_t = a], so, for a given state s and a possible set of actions A_t^s, we could decide which action to take as a_t = argmax_{a ∈ A_t^s} q(s, a).
Then, the history is the full sequence of observation, actions and rewards: H t = O 0 , A 0 , R 1 , O 1 ,..., O t-1 , A t-1 , R t , O t .

11.2 Markov Decision Processes

Markov Decision Processes (MDPs) are a useful mathematical framework, defined as:
Definition 11.2. A decision process is Markov if p( r,s| S t , A t ) =p( r,s| H t , A t ) .
This means that the current state is the only information needed to make a decision; we don't need the full history. For example, think of a chess game: there are many ways to arrive at a certain position, but it really does not matter how you got to the position; the past does not affect your choice now.
In order for a process to be Markov, full observability is required. When the situation is of partial observability, the observations are not Markovian, so using the observation as state is not enough to make the decision. This is called a partially observable Markov decision process (POMDP). Note that the environment state can still be Markov, but the agent does not know it. In this case, we might be able to construct a Markov agent state.
In the general case, the agent state is a function of the history: S t+1 =u( S t , A t , R t+1 , O t+1 ) , where u is a state update function.
Usually, the agent state is much smaller than the environment state.
Example 11.1. A non-Markov process:
Consider the following maze to be the full environment:
image: 71_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___achine-Learning_LectureNotes_source_pegado2.png
And consider the following observations:
image: 72_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___achine-Learning_LectureNotes_source_pegado3.png image: 73_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___achine-Learning_LectureNotes_source_pegado4.png
They are indistinguishable! This process is not Markov, because only taking into account the current state, we cannot identify where we are.
To deal with partial observability, the agent can construct suitable state representations. Some examples of agent states are:
Constructing a fully Markovian agent state is often not feasible and, more importantly, the state should allow for good policies and value predictions.

11.3 Policies

As we saw, a policy defines the agent's behavior: it is a map from the agent state to an action. Policies can be deterministic, A= π ( S ) , or stochastic, π ( A|S ) =p( A|S ) .

11.4 Value Functions

We saw the value before, which is the expected return. However, it is usual to introduce a discount factor, γ [ 0,1 ] , which trades off importance of immediate and long-term rewards. This way, the value becomes v π ( s ) = E [ G t | S t =s, π ] = E [ R t+1 + γ R t+2 + γ 2 R t+3 +...| S t =s, π ] . The value depends on the policy, π , and can be used to evaluate the desirability of states, as well as to select between actions.
Note the role of the discount factor: the higher it is, the higher the focus on long term outcomes.
Now, using the recursive expression of the return, G_t = R_{t+1} + γG_{t+1}, we can rewrite the value as v_π(s) = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t ∼ π(s)], where A ∼ π(s) means A is chosen by policy π in state s. This is known as a Bellman equation. A similar equation holds for the optimal value, i.e., the highest possible value: v_*(s) = max_a E[R_{t+1} + γ v_*(S_{t+1}) | S_t = s, A_t = a]. Note how this does not depend on a policy; it is just the maximum achievable value from the current state.

11.4.1 Value Function Approximations

Agents often approximate value functions, and with an accurate value function approximation, the agent can behave optimally, or very well, even in intractably big domains.

11.5 Model

A model predicts what the environment will do next. For example, P predicts the next state, given the current state and an action: P ( s,a,s' ) p( S t+1 =s'| S t =s, A t =a ) . Or R predicts the next immediate reward: R ( s,a ) E [ R t+1 | S t =s, A t =a ] . Note that a model does not immediately give us a good policy! We still need to plan and see how actions and states are related.
Example 11.2. Consider the following maze, where the rewards are -1 per time-step, the actions are to go N, E, S and W, and the states are the agent's location:
(maze grid with marked Start and End cells)
The following arrows represent the policy, π ( s ) , for each state s :
image: (grid of arrows, one per cell, showing the action chosen by the policy π(s) in each state, from Start to End)
In the following one, the numbers represent the value v π ( s ) of each state s :
(grid of state values under π: each cell shows the negative number of remaining steps to the goal, from -1 in the cell next to the End down to -24 in the farthest cells)
The grid layout represents the partial transition model P ss' a , and numbers represent the immediate reward, R ss' a from each state s , which is -1 for all a and s' in this case.

11.6 Agent categories

An agent is model free when the behavior of the environment is not known. The agent needs a policy or a value function to operate and there is no model. On the other hand, it is model based when the environment is known by means of a model. In this case, a policy and a value function might be optional, since it is possible that the agent can operate just knowing the model.
Model free agents are simpler, while model based agents are more sample efficient.
Another categorization is the following:

11.7 Subproblems of RL

Prediction consists in evaluating the future, for a given policy, i.e., what are the values in each state?
Control refers to the problem of optimising the future to find the best policy, i.e., which actions to take?
These two problems are strongly related, because the best actions to take will be decided using our predictions about the future: π * ( s ) = arg max π v π ( s ) .
Two fundamental problems in RL are:
In order to learn, we need to define all components of the problem as functions:
Then, we can use, for example, neural networks and deep learning techniques to learn. But we do need to be careful, because in RL it is usual to violate assumptions made in supervised learning, such as having i.i.d. samples, or stationarity.

12 Markov Decision Processes

We saw the notion of MDP, and now we formalize it:
Definition 12.1. A Markov Decision Process (MDP) is a tuple ( S,A,p, γ ) where:
  • S is the set of all possible states with the Markov Property.
  • A is the set of all possible actions.
  • p( r,s'|s,a ) is the joint probability of a reward, r , and next state, s' , given a state s and an action a .
  • γ [ 0,1 ] is a discount factor that trades off later rewards to earlier ones.
Remark 12.1. p defines the dynamics of the problem.
Sometimes, it is useful to marginalise out the state transitions or expected rewards: p(s' | s, a) = Σ_r p(s', r | s, a), to obtain the probability of arriving at a certain state.
Also, the expected reward: E[R | s, a] = Σ_r r Σ_{s'} p(r, s' | s, a).
There is an alternative, equivalent definition, which introduces the notion of the expected reward into the concept and takes it out of the probability function:
Definition 12.2. A MDP is a tuple ( S,A,p,r, γ ) where:
  • S is the set of all possible states with the Markov Property.
  • A is the set of all possible actions.
  • p( s'|s,a ) is the probability of transitioning to s' , given a state s and an action a .
  • r:S × A R is the expected reward, achieved on a transition starting in ( s,a ) , r= E [ R|s,a ] .
  • γ [ 0,1 ] is a discount factor that trades off later rewards to earlier ones.
Now, we have to clarify what is the Markov Property:
Definition 12.3. Consider a sequence of random variables, { S t } t N , indexed by time and taken from a set of states S . Consider also the set of actions A and rewards in R .
A state s has the Markov Property when, for all s'S , p( S t+1 =s'| S t =s ) =p( S t+1 =s'| h t-1 , S t =s ) , for all possible histories h t-1 ={ S 1 ,..., S t-1 , A 1 ,..., A t-1 , R 1 ,..., R t-1 } .
Therefore, in an MDP, the current state captures all relevant information from the history, it is a sufficient statistic of the past. So, once the state is known, the history may be thrown away.
Exercise 12.1. In an MDP, which of the following statements are true?
  1. p( S t+1 =s'| S t =s, A t =a ) =p( S t+1 =s'| S 1 ,..., S t-1 , A 1 ,..., A t , S t =s ) : false, the RHS does not condition on A t =a .
  2. p( S t+1 =s'| S t =s, A t =a ) =p( S t+1 =s'| S 1 ,..., S t-1 , S t =s, A t =a ) : true.
  3. p( S t+1 =s'| S t =s, A t =a ) =p( S t+1 =s'| S 1 ,..., S t-1 , S t =s ) : false, the RHS does not condition on A t =a .
  4. p( R t+1 =r, S t+1 =s'| S t =s ) =p( R t+1 =r, S t+1 =s'| S 1 ,..., S t-1 , S t =s ) : true.
It is also worth noting that most MDPs are discounted, and there are several reasons for this:
As we outlined previously, the goal of an RL agent is to find a behavior policy that maximises the expected return G_t. Recall our definition of the value function, v_π(s) = E[G_t | S_t = s, π]. Similarly, we can define the state-action values as q_π(s, a) = E[G_t | S_t = s, A_t = a, π], and there is the following connection between them: v_π(s) = Σ_a π(a|s) q_π(s, a) = E[q_π(S_t, A_t) | S_t = s, π], ∀s ∈ S.
Also, we can define the maximum possible value functions:
Definition 12.4. The optimal state value function, v_*(s), is the maximum value function over all policies, v_*(s) = max_π v_π(s). The optimal state-action value function, q_*(s, a), is the maximum state-action value function over all policies, q_*(s, a) = max_π q_π(s, a).
The optimal value function specifies the best possible performance in the MDP. We can consider the MDP to be solved when we know the optimal value function.
In addition, value functions allow us to define a partial ordering over policies, with π ≥ π' ⟺ v_π(s) ≥ v_π'(s), ∀s ∈ S.
With this partial ordering, the following theorem states that optimal policies exist for every MDP:
Theorem 12.1. Optimal Policies Theorem
For any MDP:
  • There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \forall \pi$.
  • All optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$.
  • All optimal policies achieve the optimal state-action value function, $q_{\pi_*}(s,a) = q_*(s,a)$.
To find an optimal policy, we can maximise over $q_*(s,a)$: $\pi_*(s,a) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s,a) \\ 0 & \text{otherwise.} \end{cases}$
That is, the optimal policy is to take action $a$ in state $s$ if $a$ is the action that gives the highest state-action value in state $s$.
Remark 12.2. There is always a deterministic optimal policy for any MDP, and if we know $q_*(s,a)$, we know the optimal policy immediately.
Also, there can be multiple optimal policies, and if multiple actions maximize q * ( s, ) , we can pick any of them.
Now, recall the Bellman Equations we saw previously. The following theorem explains how to express the value functions by means of these equations:
Theorem 12.2. Bellman Expectation Equations
Given an MDP, $M = (S,A,p,r,\gamma)$, for any policy $\pi$, the value functions obey the following expectation equations: $v_\pi(s) = \sum_a \pi(a|s) \left[ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$, and $q_\pi(s,a) = r(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a' \in A} \pi(a'|s')\, q_\pi(s',a')$.
Theorem 12.3. Bellman Optimality Equations
Given an MDP, $M = (S,A,p,r,\gamma)$, the optimal value functions obey the following optimality equations: $v_*(s) = \max_{a \in A} \left[ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, v_*(s') \right]$, and $q_*(s,a) = r(s,a) + \gamma \sum_{s'} p(s'|s,a) \max_{a' \in A} q_*(s',a')$.
Remark 12.3. There cannot be a policy with a higher value than $v_*(s) = \max_\pi v_\pi(s), \forall s$.
Intuition for the proof of the Bellman Optimality Equations:
An optimal policy can be found by maximising over $q_*(s,a)$: $\pi_*(s,a) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s,a) \\ 0 & \text{otherwise.} \end{cases}$ Applying the Bellman Expectation Equation: $q_{\pi_*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a' \in A} \pi_*(a'|s')\, q_{\pi_*}(s',a') = r(s,a) + \gamma \sum_{s'} p(s'|s,a) \max_{a' \in A} q_*(s',a')$.
We can also express the Bellman expectation equations in matrix form, as $V = R^\pi + \gamma P^\pi V$, where $v_i = v(s_i)$, $R_i^\pi = \mathbb{E}[R_{t+1} \mid S_t = s_i, A_t \sim \pi(S_t)]$ and $P_{ij}^\pi = p(s_j|s_i) = \sum_a \pi(a|s_i)\, p(s_j|s_i,a)$.
This is a linear equation, which can be solved directly: $(I - \gamma P^\pi)V = R^\pi \implies V = (I - \gamma P^\pi)^{-1} R^\pi$. The computational complexity is $O(|S|^3)$, making this feasible only for small problems. For larger problems we therefore need other methods, for example iterative methods such as dynamic programming, Monte Carlo evaluation and temporal-difference learning.
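As a concrete illustration of the direct solve, here is a minimal numpy sketch on a made-up 3-state example; the transition matrix P_pi and reward vector R_pi are placeholders, not taken from the course.
import numpy as np

# Hypothetical example: P_pi[i, j] = p(s_j | s_i) under policy pi,
# R_pi[i] = expected immediate reward from state s_i under pi.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Solve (I - gamma * P_pi) V = R_pi, an O(|S|^3) operation
V = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(V)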

12.1 Solving Reinforcement Learning Problems with Bellman Equations

  1. Estimating v π or q π is called policy evaluation, or prediction:
    1. Given a policy, what is my expected return under that behavior?
    2. Given this treatment protocol/trading strategy, what is my expected return?
  2. Estimating v * or q * is sometimes called control, because these can be used for policy optimisation:
    1. What is the optimal way of behaving? What is the optimal value function?
    2. What is the optimal treatment? What is the optimal control policy to minimise time, fuel consumption, etc?
Exercise 12.2. Consider the following MDP:
image: 78_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___e-Learning_LectureNotes_source_mdp_exercise.png
The actions have a 0.9 probability of success, and with probability 0.1 we remain in the same state.
$R_t = 0$ for all transitions that end up in $S_0$, and $R_t = -1$ for all other transitions.
Discount factor: $\gamma = 0.9$.
Questions:

12.2 Dynamic Programming

Dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP).
-Sutton & Barto, 2018
We will discuss several dynamic programming methods to solve MDPs, all of which consist of two important parts:

12.2.1 Policy Evaluation

We start by discussing how to estimate $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid s, \pi]$. The idea is to turn this equality into an update rule. The process is described in the following algorithm:
def policy_eval(S, A, p, r, pi, gamma, theta=1e-6):
    # Iterative policy evaluation. Assumed representation (not fixed by the notes):
    # p[s][a][s2] = transition probability, r[s][a] = expected reward, pi[s][a] = action probability.
    v = {s: 0.0 for s in S}

    while True:  # repeat until convergence
        delta = 0.0
        v_new = {}
        for s in S:
            # v_new(s) = E[R + gamma * v(S_t+1) | S_t = s, pi]
            v_new[s] = sum(pi[s][a] * (r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in S))
                           for a in A)
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:
            return v
Note that this algorithm always converges under appropriate conditions, such as $\gamma < 1$. We will delve into this later.
Example 12.1. Policy Evaluation example.
Take the following MDP:
  T    1    2    3
  4    5    6    7
  8    9   10   11
 12   13   14    T
(T marks the red terminal cells)
The possible actions are to go up, down, left or right, and when you reach the red (terminal) cells, you finish. Each transition costs -1 point.
Let's evaluate the random policy with γ =1 .
Initialization:
 0   0   0   0
 0   0   0   0
 0   0   0   0
 0   0   0   0
Step 1: we do $v(s) = \mathbb{E}[R + \gamma v(S') \mid S = s, \pi]$, so, for example, for cell (1) it is $v(1) = 3 \cdot \frac{1}{3}(-1 + 1 \cdot 0) = -1$.
 0  -1  -1  -1
-1  -1  -1  -1
-1  -1  -1  -1
-1  -1  -1   0
Step 2: $v(1) = \frac{1}{3}(-1 + 1 \cdot 0) + \frac{2}{3}(-1 - 1) = -1.7$.
   0  -1.7  -2    -2
-1.7  -2    -2    -2
  -2  -2    -2  -1.7
  -2  -2  -1.7     0
Step 3: $v(1) = \frac{1}{3}(-1 + 0) + \frac{2}{3}(-1 - 2) \approx -2.3$.
   0  -2.3  -2.9  -3
-2.3  -2.9  -3    -2.9
-2.9  -3    -2.9  -2.3
  -3  -2.9  -2.3     0
And so on... It would converge to:
  0  -14  -20  -22
-14  -18  -20  -20
-20  -20  -18  -14
-22  -20  -14    0
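The converged table can be reproduced with a short script, assuming the standard gridworld dynamics (four equiprobable actions, moves off the grid leave the state unchanged, the two corner cells are terminal, reward -1 per step, $\gamma = 1$); this is a sketch of that setup, not code from the course.
import numpy as np

def gridworld_policy_eval(n=4, gamma=1.0, theta=1e-8):
    v = np.zeros((n, n))
    terminal = {(0, 0), (n - 1, n - 1)}         # the two terminal corner cells
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    while True:
        delta = 0.0
        v_new = np.zeros_like(v)
        for i in range(n):
            for j in range(n):
                if (i, j) in terminal:
                    continue
                total = 0.0
                for di, dj in moves:            # equiprobable random policy
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < n and 0 <= nj < n):
                        ni, nj = i, j           # bumping into the wall keeps the state
                    total += 0.25 * (-1 + gamma * v[ni, nj])
                v_new[i, j] = total
                delta = max(delta, abs(v_new[i, j] - v[i, j]))
        v = v_new
        if delta < theta:
            return np.round(v)

print(gridworld_policy_eval())  # under these assumptions, matches the table above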

12.2.2 Policy Improvement

We can use the values computed with policy evaluation to improve the policy. The simplest way to achieve this is with a greedy policy improvement approach, which is as follows:
def policy_improvement(S, A, p, r, pi, gamma):
    # Evaluate the current policy (same assumed representation as policy_eval above)
    v = policy_eval(S, A, p, r, pi, gamma)

    # Act greedily with respect to the resulting value function
    pi_new = {}
    for s in S:
        # pi_new[s] = argmax_a E[R + gamma * v(S_t+1) | S_t = s, A_t = a]
        pi_new[s] = max(A, key=lambda a: r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in S))

    return pi_new
Claim 12.1. It is possible to show that $v_{\pi_{new}}(s) \ge v_\pi(s), \forall s$.
Example 12.2. We can use this greedy approach combined with the previous example:
Initialization, with random policy:
 0   0   0   0
 0   0   0   0
 0   0   0   0
 0   0   0   0
[Policy: all actions equiprobable in every non-terminal cell]
Step 1:
 0  -1  -1  -1
-1  -1  -1  -1
-1  -1  -1  -1
-1  -1  -1   0
[Greedy policy arrows with respect to these values]
Step 2:
   0  -1.7  -2    -2
-1.7  -2    -2    -2
  -2  -2    -2  -1.7
  -2  -2  -1.7     0
[Greedy policy arrows with respect to these values]
Step 3:
   0  -2.3  -2.9  -3
-2.3  -2.9  -3    -2.9
-2.9  -3    -2.9  -2.3
  -3  -2.9  -2.3     0
[Greedy policy arrows with respect to these values]
At convergence:
  0  -14  -20  -22
-14  -18  -20  -20
-20  -20  -18  -14
-22  -20  -14    0
[Greedy policy arrows with respect to these values]
Observe how in the second iteration we already found the optimal policy!
In this example, we showed how we can use evaluation to improve our policy, and in fact we obtained the optimal policy. However, the greedy approach does not always ensure reaching the optimal policy.
This approach is called policy iteration:
It is natural to ask whether policy evaluation needs to converge to $v_\pi$, or whether we should stop the evaluation earlier. Ways to stop it are to put a threshold on the minimum change between iterations, or simply to stop after $k$ iterations. One extreme, which is in fact quite usual in practice, is to stop after $k = 1$, which is equivalent to value iteration:
def value_iter(S, A, p, r, gamma, theta=1e-6):
    # Value iteration: no policy is needed, we back up the Bellman optimality equation.
    # Same assumed representation of p and r as in policy_eval above.
    v = {s: 0.0 for s in S}

    while True:  # repeat until convergence
        delta = 0.0
        v_new = {}
        for s in S:
            # v_new(s) = max_a E[R + gamma * v(S_t+1) | S_t = s, A_t = a]
            v_new[s] = max(r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in S) for a in A)
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:
            return v
In the following table, we sum up the different approaches:
Problem    | Bellman Equation                                            | Algorithm
Prediction | Bellman Expectation Equation                                | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + (Greedy) Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                                 | Value Iteration

12.3 Extensions to Dynamic Programming

12.3.1 Asynchronous Dynamic Programming

DP methods described so far used synchronous updates, meaning all states are updated in parallel. In contrast, asynchronous DP backs up states individually, in any order. This can significantly reduce computation, and it is guaranteed to converge if all states continue to be selected.
We are going to see three approaches for ADP:
In-Place DP
Before, with synchronous value iteration, we stored two copies of the value function: $v_{new}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t=s, A_t=a]$, followed by $v \leftarrow v_{new}$.
In-place value iteration stores only one copy of the value function, doing directly: $v(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t=s, A_t=a]$.
Prioritised Sweeping
We can use the magnitude of the Bellman error to guide the state selection, for example $E = \left| \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t=s, A_t=a] - v(s) \right|$. We then back up the state with the largest remaining Bellman error, and update the Bellman errors of the affected states after each backup. This requires knowledge of the reverse dynamics (the predecessor states). It can be implemented efficiently with a priority queue.
Real-Time DP
The idea in this case is to only update states that are relevant to the agent. For example, if the agent is in state S t , we update that state value, or the states that it expects to be in soon.
Full-Width backups
Standard DP uses full-width backups. This means that, for each backup, whether synchronous or asynchronous:
DP is effective for medium-sized problems (with millions of states). For large problems DP suffers from the curse of dimensionality, since the number of states grows exponentially with the number of state variables, and even one full backup can be too expensive.
Sample Backups
This approach consists of using sample rewards and sample transitions $(s,a,r,s')$ instead of the reward function and the transition dynamics.
It presents some advantages:

13 Model-Free Prediction

13.1 Monte Carlo Algorithms

We can use experience samples to learn without a model. The direct sampling of episodes is called Monte Carlo, and this is a model-free approach, because we don't need knowledge about the MDP, only the samples.

13.1.1 Monte Carlo Policy Evaluation

We consider sequential decision problems, and our goal is to learn $v_\pi$ from episodes of experience under policy $\pi$: $S_1, A_1, R_2, \dots, S_T \sim \pi$. The return is the total discounted reward of an episode ending at time $T$: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$. The value function is the expected return, $v_\pi(s) = \mathbb{E}[G_t \mid S_t=s, \pi]$. Monte Carlo policy evaluation uses the sample average return instead of the expected return.

13.1.2 First-Visit Monte Carlo Policy Evaluation

def FV_MC_PE(pi, gamma, num_episodes):
	# First-visit Monte Carlo policy evaluation
	# Initialization
	foreach s:
		N(s) = 0	# number of first visits to s
		G(s) = 0	# sum of returns following first visits to s

	loop num_episodes times:
		# Sample an episode following pi
		e_i = s[i,1], a[i,1], r[i,1], s[i,2], a[i,2], r[i,2], ..., s[i,T_i]

		foreach time step t till the end of episode i:
			# Return from time step t onwards in episode i
			G[i,t] = r[i,t] + gamma * r[i,t+1] + ... + gamma ** (T_i - t) * r[i,T_i]

			s = e_i[t].state
			if s not visited earlier in episode i:	# first visit only
				N(s) = N(s) + 1		# Increment total visits counter
				G(s) = G(s) + G[i,t]	# Increment total return
				V_pi(s) = G(s) / N(s)	# Update estimate
Properties:

13.1.3 Every-Visit Monte Carlo Policy Evaluation

def EV_MC_PE(pi, gamma, num_episodes):
	# Every-visit Monte Carlo policy evaluation
	# Initialization
	foreach s:
		N(s) = 0
		G(s) = 0

	loop num_episodes times:
		# Sample an episode following pi
		e_i = s[i,1], a[i,1], r[i,1], s[i,2], a[i,2], r[i,2], ..., s[i,T_i]

		foreach time step t till the end of episode i:
			# Return from time step t onwards in episode i
			G[i,t] = r[i,t] + gamma * r[i,t+1] + ... + gamma ** (T_i - t) * r[i,T_i]

			s = e_i[t].state
			N(s) = N(s) + 1		# every visit counts
			G(s) = G(s) + G[i,t]	# Increment total return
			V_pi(s) = G(s) / N(s)	# Update estimate
Properties:

13.1.4 Incremental Monte Carlo Policy Evaluation

def Incremental_MC_PE(pi, gamma, alpha, num_episodes):
	# Incremental (constant step size) Monte Carlo policy evaluation
	# Initialization
	foreach s:
		V_pi(s) = 0

	loop num_episodes times:
		# Sample an episode following pi
		e_i = s[i,1], a[i,1], r[i,1], s[i,2], a[i,2], r[i,2], ..., s[i,T_i]

		foreach time step t till the end of episode i:
			# Return from time step t onwards in episode i
			G[i,t] = r[i,t] + gamma * r[i,t+1] + ... + gamma ** (T_i - t) * r[i,T_i]

			# Move the estimate a step of size alpha towards the sampled return
			V_pi(s[i,t]) = V_pi(s[i,t]) + alpha * (G[i,t] - V_pi(s[i,t]))
These algorithms can be used to learn value predictions, but when episodes are long, the learning process can be slow, because we have to wait until an episode ends before we can learn. In addition, the return can have high variance. Therefore, it would be nice to have other methods.

13.2 Temporal Difference Learning

The core of TD learning is the temporal difference error, which measures the difference between the estimated value of the current state and the estimated value of the next state, combined with the reward received for transitioning between these states. This error guides the update of the value function.
Therefore, TD is model-free and learns directly from experience. It can also learn from incomplete episodes, by bootstrapping.
TD can learn during each episode, it does not need to complete it.

13.2.1 Temporal Difference Learning by Sampling Bellman Equations

Recall the Bellman equation $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t \sim \pi(S_t)]$, which can be approximated by iterating $v_{k+1}(s) = \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t=s, A_t \sim \pi(S_t)]$. We can also sample this, as $v_{t+1}(S_t) = R_{t+1} + \gamma v_t(S_{t+1})$. However, this is likely to be very noisy, so it is better to take a small step towards the target: $v_{t+1}(S_t) = v_t(S_t) + \alpha_t \big( R_{t+1} + \gamma v_t(S_{t+1}) - v_t(S_t) \big)$.
Here $R_{t+1} + \gamma v_t(S_{t+1})$ is the target value, and $\delta_t = R_{t+1} + \gamma v_t(S_{t+1}) - v_t(S_t)$ is called the TD error.
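A minimal tabular TD(0) prediction sketch, assuming a hypothetical episodic environment with reset()/step() methods and a given policy function:
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0): v(S_t) <- v(S_t) + alpha * (R_{t+1} + gamma * v(S_{t+1}) - v(S_t))."""
    v = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)          # assumed interface
            target = r + (0.0 if done else gamma * v[s_next])
            v[s] += alpha * (target - v[s])        # move towards the TD target
            s = s_next
    return v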
We can visualize the difference between the Dynamic Programming approach, the Monte Carlo approach and the Temporal Difference approach in the following figure:
image: 79_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado25.png image: 80_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado26.png image: 81_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado27.png
Figure 6: Backups for the different methods. DP (left), MC (center), TD (right).
In the context of reinforcement learning, bootstrapping refers to a method where the current estimates are updated based on other estimated values, rather than solely on actual rewards and complete trajectories. Sampling, on the other side, refers to learning from actual experiences or interactions with the environment.
Method | Bootstrapping | Sampling
DP     | Yes           | No
MC     | No            | Yes
TD     | Yes           | Yes
Note now that the idea of TD can also be applied to action values: we update the value $q_t(S_t,A_t)$ towards the estimated return $R_{t+1} + \gamma q_t(S_{t+1},A_{t+1})$ by $q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha \big( R_{t+1} + \gamma q_t(S_{t+1},A_{t+1}) - q_t(S_t,A_t) \big)$. This approach is known as SARSA, since it uses $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

13.3 Comparing MC and TD

13.4 Batch Monte Carlo and Temporal Difference

Tabular MC and TD converge as experience goes to infinity and $\alpha_t \to 0$. But what about finite experience?
Consider a fixed batch of experience: episode 1: $S_1^1, A_1^1, R_2^1, \dots, S_{T_1}^1$; ...; episode $K$: $S_1^K, A_1^K, R_2^K, \dots, S_{T_K}^K$. We can repeatedly sample each episode $k \in [1,K]$ and apply MC or TD(0), which is equivalent to sampling from an empirical model.
MC converges to the best mean-squared fit for the observed returns, minimising $\sum_{k=1}^K \sum_{t=1}^{T_k} \big( G_t^k - v(S_t^k) \big)^2$.
TD converges to the solution of the maximum likelihood Markov model, given the data.
Example 13.1. Consider two states, $A$ and $B$, with no discounting, and the following 8 episodes of experience: $A,0,B,0$; $B,1$; $B,1$; $B,1$; $B,1$; $B,1$; $B,1$; $B,0$.
What are v( A ) and v( B ) ?
MC:
All episodes regarding A have 0 reward, therefore v( A ) =0.
Episodes regarding B have 1 reward 75% of the time, so v( B ) =0.75.
TD:
In this case, we observe that $p(S_{t+1}=B \mid S_t=A) = 1$, that is, whenever we go through state $A$ we then go to state $B$, and the observed reward on that transition is 0. Under the maximum likelihood Markov model, the value of $A$ is therefore the value of $B$: $v(A) = v(B) = 0.75$.
TD exploits the Markov property, so it can be especially helpful in fully-observable environments. MC, on the other hand, does not rely on the Markov property, so it can be useful in partially-observable environments.
When the data is finite or we are approximating the functions, the solutions may differ.

13.5 Multi-Step Temporal Difference

TD uses value estimates, which can be inaccurate, and information propagates only one step per update, so it can propagate quite slowly. MC, on the other hand, propagates information faster, but its updates are noisier. We can go in between the two methods!
The idea is to apply TD, but instead of bootstrapping after just one step, we allow the target to look $n$ steps into the future.
The returns in this case are: $G_t^{(1)} = R_{t+1} + \gamma v(S_{t+1})$ for $n=1$; $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 v(S_{t+2})$ for $n=2$; and $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ for $n=\infty$. As we can see, for $n=1$ we obtain the regular TD, and for $n=\infty$ we obtain MC.
The learning process is done by $v(S_t) \leftarrow v(S_t) + \alpha \big( G_t^{(n)} - v(S_t) \big)$.
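A small helper for the $n$-step return, assuming the convention that rewards[k] stores $R_{k+1}$ and values[k] stores the current estimate $v(S_k)$ for an episode that ends at $T = $ len(rewards); the indexing is an assumption, not fixed by the notes.
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * v(S_{t+n})."""
    T = len(rewards)
    steps = min(n, T - t)                          # truncate at the end of the episode
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:                                  # bootstrap only if the episode did not end
        g += gamma ** n * values[t + n]
    return g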

14 Model-Free Control

Model-Free Control refers to learning optimal policies directly from experience (i.e., from interaction with the environment) without needing a model of the environment. In model-based control, you know the dynamics of the environment (transition probabilities and rewards), but in model-free control, you do not.

14.1 Monte-Carlo Control

In the previous chapter, we focused on how to estimate the value function, given a policy. Now, we are going to try to discover new policies using these estimations. Monte Carlo control is based on estimating the value of actions through sampling: the agent tries out actions, observes the outcomes, and learns from these.
When we have a model, we saw that we can use a greedy policy improvement over $v(s)$, as $\pi'(s) = \arg\max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t=s, A_t=a]$. However, when we don't have the model, we cannot use this approach. Instead, we can use a greedy policy improvement over the action-value function, $\pi'(s) = \arg\max_a q_\pi(s,a)$, which makes the action-value function very convenient for this problem.
The idea, then, would be something like:
But there is also a problem with this approach! We are learning by interacting with the environment, so we cannot sample all states s and actions a : since learning is from interaction, not all state-action pairs may be visited, which makes it difficult to estimate their values accurately. This is particularly a problem in environments with a large number of states or actions.
We face a trade-off between exploitation (using the information we already have) and exploration (trying new actions to discover more about the environment). One approach that increases exploration is the $\varepsilon$-greedy approach:
This can be written, for each action $a$, as $\pi_t(a \mid s_t) = \begin{cases} (1-\varepsilon) + \frac{\varepsilon}{|A|} & \text{if } q(s_t,a) = \max_b q(s_t,b) \\ \frac{\varepsilon}{|A|} & \text{otherwise.} \end{cases}$ This approach keeps exploring and obtaining new information.
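A minimal sketch of $\varepsilon$-greedy action selection over tabular action values stored in a dictionary q keyed by (state, action):
import random

def epsilon_greedy(q, s, actions, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(s, a)])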
We can write model-free control as a loop that alternates this kind of $\varepsilon$-greedy episode generation with incremental action-value updates; a sketch is given below.
Some possibilities for the step size are $\alpha_t = \frac{1}{N(S_t,A_t)}$, the inverse of the number of times the state-action pair has been visited; or $\alpha_t = \frac{1}{k}$, the inverse of the number of episodes considered.
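A minimal sketch of such an $\varepsilon$-greedy Monte Carlo control loop, reusing the epsilon_greedy helper above, assuming the same hypothetical env.reset()/env.step() interface as before, and using $\alpha_t = 1/N(S_t,A_t)$ with the schedule $\varepsilon_k = 1/k$ discussed in the next subsection:
from collections import defaultdict

def mc_control(env, actions, num_episodes, gamma=0.9):
    q = defaultdict(float)            # q[(s, a)]
    n = defaultdict(int)              # visit counts N(s, a)
    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                 # GLIE schedule (see the next subsection)
        # Generate an episode with the current eps-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy(q, s, actions, eps)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Every-visit incremental updates, working backwards through the episode
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            n[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / n[(s, a)]   # alpha_t = 1 / N(s, a)
    return q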

14.1.1 Greedy in the Limit with Infinite Exploration (GLIE)

GLIE is a strategy to ensure that the learning algorithm both explores the environment adequately and also converges to a near-optimal policy.
Infinite Exploration ensures that every state-action pair is explored infinitely often. In practical terms, this means that the learning algorithm never stops trying out new actions, even if it has already found actions that seem to give good results. This continuous exploration is crucial for making sure that the algorithm doesn't miss out on potentially better actions that haven't been tried enough. Infinite exploration is often implemented using strategies like $\varepsilon$-greedy. In other words, all state-action pairs are explored infinitely many times: $\lim_{t \to \infty} N_t(s,a) = \infty, \forall s, a$.
Greedy in the Limit means that as the learning process continues (i.e., as the number of iterations goes to infinity), the policy becomes increasingly greedy with respect to the estimated value function. In other words, over time, the policy relies more and more on the knowledge it has gained about the environment, and less on random exploration. This is typically achieved by gradually reducing the $\varepsilon$ parameter in an $\varepsilon$-greedy policy. As $\varepsilon$ approaches zero, the policy becomes completely greedy, always choosing the action that it currently believes to be the best. This means that $\lim_{t \to \infty} \pi_t(a|s) = \mathbb{I}\big(a = \arg\max_{a'} q_t(s,a')\big)$.
An example is using $\varepsilon$-greedy with $\varepsilon_k = \frac{1}{k}$.
Theorem 14.1. GLIE Model-Free Control converges to the optimal action-value function, $q_t \to q_*$.

14.2 Temporal-Difference Learning for Control

We saw that TD learning has several advantages over Monte-Carlo, as it shows lower variance, it learns online, and from incomplete sequences. Therefore, it is natural to think about using TD instead of MC for control. Instead of estimating q( s,a ) with MC, we do it with TD.
We can also update the action-value functions with SARSA, as studied in the previous section, by $q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha_t \big( R_{t+1} + \gamma q_t(S_{t+1},A_{t+1}) - q_t(S_t,A_t) \big)$. Once the evaluation gives $q \approx q_\pi$, we improve the policy with the $\varepsilon$-greedy approach.
More concretely, the following algorithm represents the approach called tabular SARSA:
initialize q(s,a) arbitrarily

for each episode:
	initialize s
	choose a from s using policy derived from q (e.g. eps-greedy)

	for each step in the episode:
		take action a
		observe r, s'
		choose a' from s' using policy derived from q
		
		q(s,a) <- q(s,a) + alpha * [ r + gamma * q(s',a') - q(s,a) ]
		
		s <- s'
		a <- a'
	until s is terminal
Theorem 14.2. Tabular SARSA converges to the optimal action-value function, $q \to q_*$, if the policy is GLIE.
Observe why it is called Tabular SARSA: we store the action-value function as a matrix of size | S | × | A | .

14.3 Off-Policy Temporal-Difference and Q-Learning

Definition 14.1. On-Policy Learning consists of learning about the behaviour policy $\pi$ from experience sampled from $\pi$ itself.
Off-Policy Learning consists of learning about a target policy $\pi$ from experience sampled from a different behaviour policy $\mu$. It tries to learn counterfactually about other things the agent could do, by asking 'what if...?'.
Example 14.1. What if I turned left? Would I get new observations, rewards?
What if I played more defensively? Do I change the win probability?
What if I continued to go forward? How long until I bump into a wall?
The approaches seen so far are on-policy learning, while off-policy learning approaches go as follows:
Now, why is this important? It is important because it enables an agent to learn by observing humans or other agents, and to re-use experience from old policies. In addition, the agent can learn several policies, while only following one. Finally, it learns about the greedy policy while following an exploratory policy.
A common approach is Q-Learning, which estimates the value of the greedy policy: $q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha_t \big( R_{t+1} + \gamma \max_{a'} q_t(S_{t+1},a') - q_t(S_t,A_t) \big)$.
Theorem 14.3. Q-Learning control converges to the optimal action-value function, $q \to q_*$, as long as we take each action in each state infinitely often.
Remark 14.1. Note that Q-Learning achieves convergence without the need for greedy behavior. It works for any policy that eventually selects all actions sufficiently often.
It requires appropriately decaying step sizes, with $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
One possibility is using $\alpha_t = \frac{1}{t^\omega}$, with $\omega \in (0.5, 1)$.
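A tabular Q-learning sketch with this decaying step size, reusing the epsilon_greedy helper and the hypothetical environment interface from the sketches above:
from collections import defaultdict

def q_learning(env, actions, num_episodes, gamma=0.9, epsilon=0.1, omega=0.8):
    q = defaultdict(float)
    t = 0
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            t += 1
            alpha = 1.0 / t ** omega                       # alpha_t = 1 / t^omega, omega in (0.5, 1)
            a = epsilon_greedy(q, s, actions, epsilon)     # exploratory behaviour policy
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # greedy (off-policy) target
            s = s_next
    return q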
Example 14.2. In this picture we can observe how SARSA is more conservative than Q-Learning. Q-Learning finds the optimal path, but that path runs right along the cliff and is therefore dangerous under exploratory behaviour. SARSA finds a longer, less efficient path that stays far from the cliff, minimising the risk of falling off.
image: 82_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado48.png

14.4 Overestimation in Q-Learning

Classical Q-Learning has potential issues, because it uses the same values both to select and to evaluate actions; since these values are approximate, the maximisation makes it more likely to select overestimated values and less likely to select underestimated values, causing an upward bias.
Example 14.3. Roulette example
There are 171 actions: bet 1€ on one of 170 options, or stop.
Stopping ends the episode with 0€. All other actions have a high-variance reward with negative expected value, and betting actions do not end the episode; instead they enable the agent to bet again.
In the following graph, we observe how Q-Learning overestimates the expected profit in this problem:
image: 83_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado49.png
The solution to this problem is to decouple selection from evaluation: double Q-Learning. The idea is to store two action-value functions, $q$ and $q'$, and build targets using both: $R_{t+1} + \gamma q_t'\big(S_{t+1}, \arg\max_a q_t(S_{t+1},a)\big)$ and $R_{t+1} + \gamma q_t\big(S_{t+1}, \arg\max_a q_t'(S_{t+1},a)\big)$.
At each $t$, we pick either $q$ or $q'$ (randomly or by some design) and update it using the corresponding formula above.
We can also use both to act, for example by behaving greedily with respect to $\frac{q+q'}{2}$.
Double Q-Learning also converges to the optimal policy under the same conditions as Q-Learning, and solves the overestimation problem.
Moreover, this idea can be generalized to other update approaches, like double SARSA.
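A sketch of a single double Q-learning update on two tabular action-value functions, following the decoupling described above (q1 and q2 are assumed to be dictionaries keyed by (state, action), e.g. defaultdicts with default 0):
import random

def double_q_update(q1, q2, s, a, r, s_next, actions, alpha, gamma, done=False):
    """Select with one table, evaluate with the other, and update only the selecting table."""
    if random.random() < 0.5:
        q_sel, q_eval = q1, q2
    else:
        q_sel, q_eval = q2, q1
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda b: q_sel[(s_next, b)])  # selection
        target = r + gamma * q_eval[(s_next, a_star)]            # evaluation
    q_sel[(s, a)] += alpha * (target - q_sel[(s, a)])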

14.5 Importance Sampling Corrections

Off-Policy Learning is basically trying to solve the following problem:
Given some function $f$ with random inputs $X$ distributed according to $d'$ (the behaviour distribution), estimate the expectation of $f(X)$ under a different distribution $d$ (the target distribution).
This can be solved by weighting the data using the ratio $\frac{d}{d'}$: $\mathbb{E}_{x \sim d}[f(x)] = \sum_x d(x) f(x) = \sum_x d'(x) \frac{d(x)}{d'(x)} f(x) = \mathbb{E}_{x \sim d'}\left[ \frac{d(x)}{d'(x)} f(x) \right]$. The intuition is that we scale up events that are rare under $d'$ but common under $d$, and scale down events that are common under $d'$ but rare under $d$.
Example 14.4. Estimate the one-step reward, with behaviour policy $\mu(a|s)$. Then $\mathbb{E}[R_{t+1} \mid S_t=s, A_t \sim \pi] = \sum_a \pi(a|s) r(s,a) = \sum_a \mu(a|s) \frac{\pi(a|s)}{\mu(a|s)} r(s,a) = \mathbb{E}\left[ \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} R_{t+1} \,\middle|\, S_t=s, A_t \sim \mu \right]$. Therefore, when following policy $\mu$, we can use $\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} R_{t+1}$ as an unbiased sample.
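A tiny numerical sketch of the re-weighting trick on a toy 3-point distribution (all numbers are made up):
import numpy as np

xs = np.array([0, 1, 2])
d = np.array([0.7, 0.2, 0.1])             # target distribution
d_behaviour = np.array([1/3, 1/3, 1/3])   # behaviour distribution d'
f = np.array([1.0, 5.0, 10.0])

samples = np.random.choice(xs, size=100_000, p=d_behaviour)
weights = d[samples] / d_behaviour[samples]        # ratio d(x) / d'(x)
estimate = np.mean(weights * f[samples])           # importance-weighted estimate
print(estimate, "vs exact", np.dot(d, f))          # both approximately 2.7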

15 Value approximation & Deep Reinforcement Learning

15.1 Approximate Model-Free Prediction

Tabular RL does not scale to large, complex problems, because there are too many states to store in memory, and it is too slow to learn the values of each state separately. Therefore, we need to generalise what we learn across states.
We can estimate values or policies in an approximate way:
  1. Map states s onto a suitable feature representation φ ( s ) .
  2. Map features to values through a parametrised function $v_\theta(\phi(s))$.
  3. Update parameters $\theta$ so that $v_\pi(s) \approx v_\theta(\phi(s))$.
The goal is then to find $\theta$ minimising the difference between $v_\pi$ and $v_\theta$, $L(\theta) = \mathbb{E}_{S \sim d}\left[ \big( v_\pi(S) - v_\theta(S) \big)^2 \right]$, where $d$ is the state visitation distribution induced by $\pi$ and the dynamics $p$.
For this, we can use gradient descent to iteratively minimise this objective: $\Delta\theta = -\frac{1}{2}\alpha \nabla_\theta L(\theta) = \alpha \mathbb{E}_{S \sim d}\left[ \big( v_\pi(S) - v_\theta(S) \big) \nabla_\theta v_\theta(S) \right]$. Evaluating the expectation is a problem, so the solution is to use stochastic gradient descent, sampling the gradient update: $\Delta\theta = \alpha \big( G_t - v_\theta(S_t) \big) \nabla_\theta v_\theta(S_t)$, where $G_t$ is a suitable sampled estimate of the return.
For Monte Carlo prediction, $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, and for Temporal Difference prediction, $G_t = R_{t+1} + \gamma v_\theta(S_{t+1})$.
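A sketch of the resulting stochastic-gradient update for the TD case with a linear value function $v_\theta(s) = \theta^\top \phi(s)$; the feature map phi is a placeholder, not something fixed by the notes.
import numpy as np

def td_semi_gradient_update(theta, phi, s, r, s_next, alpha, gamma, done=False):
    """theta <- theta + alpha * (G_t - v_theta(S_t)) * grad v_theta(S_t),
    with TD target G_t = R_{t+1} + gamma * v_theta(S_{t+1})."""
    v_s = theta @ phi(s)
    v_next = 0.0 if done else theta @ phi(s_next)
    target = r + gamma * v_next
    # For a linear v_theta, the gradient with respect to theta is simply phi(s)
    return theta + alpha * (target - v_s) * phi(s)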
At this point, it is pretty evident that we can use Deep Learning in this setting!

15.2 Deep Reinforcement Learning

Before, the feature representation was typically fixed, and the parametrized function was just a linear mapping. We will consider more complicated non-linear mappings v θ .
A popular choice is to use DNNs to parameterize such mapping. These are known to discover useful feature representation tailored to the specific task, and we can leverage extensive research on architectures and optimisation from DL.
Therefore, we are going to parameterise $v_\theta$ using a DNN, for instance as an MLP: $v_\theta(S) = W_2 \tanh(W_1 S + b_1) + b_2$, where $\theta = \{W_1, b_1, W_2, b_2\}$.
When $v_\theta$ is linear, the gradient $\nabla_\theta v_\theta$ is trivial to compute. Now, however, we have to use automatic differentiation with back-propagation, as implemented in DL frameworks.
DeepMind developed a Deep Reinforcement Learning system that reached human-level performance on classic Atari games. They did it by using function approximation to help scale up decision making to very large domains.
To achieve this, they developed Deep Q-Networks (DQN): neural networks used to represent the value function $Q$, policies and models, optimising a loss function by SGD.
Therefore, we represent the state-action value function by a DQN with weights $w$: $\hat{Q}(s,a;w) \approx Q(s,a)$, for which we:
  1. Map the input state, s , to a h -dimensional vector.
  2. Apply a non-linear transformation, like tanh or ReLU .
  3. Map the hidden vector to the m possible action values, as a classification problem.
  4. The action with greatest approximated Q -value is selected.
It is also possible to pass each action to the network, and output its Q -value. This is useful when there are too many actions or the action space is continuous.
They performed end-to-end learning of the values $Q(s,a)$ from the states $s$, represented by the pixels of the last 4 frames of the game.
The output is the value $Q(s,a)$ for each of the 18 possible button positions.
The reward is a change in the score for that step.
The network architecture and hyperparameters were the same for all the games of the console!
image: 84_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado67.png
We saw in previous sections that Q -learning converges to the optimal Q * ( s,a ) using table lookup representation. When working with value function approximation, we can minimize the MSE loss by SGD using a target Q estimate instead of the true Q , like in linear value function approximation.
However, $Q$-learning with value function approximation can diverge from the optimal $Q_*$, mainly due to correlations between samples and the existence of non-stationary targets.
DQN addresses these challenges by leveraging experience replay and fixed Q -targets.

15.2.1 Experience Replay

To help remove correlations, we can store a dataset, called the replay buffer, D , obtained from prior experience. In this dataset, we store at each step i , the tuple ( s i , a i , r i+1 , s i+1 ) , that is:
Once we have the dataset, we would perform experience replay as follows:
  1. Sample an experience tuple from the replay buffer: $(s,a,r,s') \sim D$.
  2. Compute the target value for the sampled transition as $r + \gamma \max_{a'} \hat{Q}(s',a';w)$.
  3. Use SGD to update the network weights: $\Delta w = \alpha \big( r + \gamma \max_{a'} \hat{Q}(s',a';w) - \hat{Q}(s,a;w) \big) \nabla_w \hat{Q}(s,a;w)$, obtained from the loss function $L(w) = \frac{1}{2} \big( r + \gamma \max_{a'} \hat{Q}(s',a';w) - \hat{Q}(s,a;w) \big)^2$.
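A minimal replay buffer sketch using only the standard library; the target computation is indicated as comments, with Q_hat standing in for the approximate value function described above:
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experience is discarded automatically

    def append(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Inside the training loop (sketch):
#   for s, a, r, s_next, done in buffer.sample(32):
#       target = r if done else r + gamma * max_a' Q_hat(s_next, a'; w)
#       take an SGD step on (target - Q_hat(s, a; w))**2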

15.3 Deep Learning aware Reinforcement Learning

As always, things are not all good, and there exist issues with online deep RL. We know from DL literature that
However, in online RL, we perform an update on every new sample, and consecutive updates are strongly correlated.
We can be smart in the design of the method to improve it. For instance, Experience Replay helps with this, because we can mix online updates with updates on data sampled from the replay buffer. This approach can reduce correlation between consecutive updates, and enables mini-batch updates, instead of vanilla SGD.
Other ways to address this issue are:

15.3.1 The Deadly Triad

If we use experience replay, we are combining:
  1. Function approximation: through the use of a neural network to fit the action values.
  2. Bootstrapping: because we bootstrap on $\max_a Q_\theta(s,a)$ to construct the target.
  3. Off-policy learning: the replay buffer holds data from a mixture of past policies.
When these three elements are combined, they can create a feedback loop where errors in the function approximation are compounded through bootstrapping, and the off-policy nature of the learning algorithm can lead to poor or incorrect updates. This can cause the learning algorithm to diverge, meaning that instead of converging to a solution, it becomes increasingly unstable over time.
Empirically, it is actually rare to see unbounded divergence. The most common scenario is value explosions that recover after an initial phase, a phenomenon referred to as soft-divergence.
Soft-divergence still causes value estimates to be quite poor for extended periods... But we can address this in our RL agents using a separate target network, that is:
This way, we can break the feedback loop that sits at the heart of the deadly triad.
The general pseudocode of DQN is the following:
def DQN(C, alpha, gamma):
	D = []          # replay buffer
	Initialize w    # online network weights
	w_ = w          # target network weights
	t = 0
	Get initial state s_0

	while True:
		sample action a_t with the eps-greedy policy for the current Q(s_t, a; w)
		Observe reward r_t and next state s_t+1
		D.append((s_t, a_t, r_t, s_t+1))
		sample a random mini-batch of tuples (s_i, a_i, r_i, s_i+1) from D

		for each tuple i in the mini-batch:
			if episode terminated at step i+1:
				y_i = r_i
			else:
				y_i = r_i + gamma * max_a Q(s_i+1, a; w_)   # bootstrap with the target network w_

			do an SGD step on (y_i - Q(s_i, a_i; w))^2      # update the online network w
			update w += dw
		t += 1
		if t % C == 0:
			w_ = w      # refresh the target network every C steps

15.3.2 Deep Double Q-Learning

We've said that Q-Learning has an overestimation bias, and that this can be corrected by double Q-Learning. In deep RL we already have two networks: the current (online) network, $q_\theta$, and the target network, $q_{\theta^-}$. The current network is used to select the best action, and the target network is used to evaluate the Q-value of that action. By decoupling the action selection from its evaluation (using different networks), Double DQN (DDQN) mitigates the overestimation bias. When updating the Q-value for a state-action pair, the current network determines the action that maximises the Q-value, but the target network provides the Q-value for that action. Periodically, the weights of the target network are updated to match those of the current network; this lag helps stabilise learning. The loss function in this case becomes: $L(\theta) = \frac{1}{2} \big( R_{i+1} + \gamma q_{\theta^-}\big(S_{i+1}, \arg\max_a q_\theta(S_{i+1},a)\big) - q_\theta(S_i, A_i) \big)^2$.

15.3.3 Prioritized Replay

DQN samples uniformly from the replay buffer, so an idea is to prioritise those transitions from which the model can learn more. A basic implementation is to set the priority of sample $i$ to $|\delta_i|$, the TD error the last time this transition was sampled, and then sample transitions according to this priority.
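A sketch of proportional prioritised sampling, drawing each stored transition with probability proportional to $|\delta_i|$ plus a small constant so that no transition gets probability zero:
import numpy as np

def sample_prioritised(priorities, batch_size, eps=1e-3):
    """priorities[i] = |delta_i|, the TD error the last time transition i was used.
    Returns the indices of the transitions to replay."""
    p = np.asarray(priorities, dtype=float) + eps
    p = p / p.sum()
    return np.random.choice(len(p), size=batch_size, p=p)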

15.3.4 Multi-Step Control

In the context of Deep Q-Learning, multi-step control enhances the learning process by considering a sequence of actions and their cumulative rewards over multiple steps, rather than relying on single-step rewards and transitions. We define the $n$-step Q-Learning target as $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n q_{\theta^-}\big(S_{t+n}, \arg\max_a q_\theta(S_{t+n},a)\big)$, with the update rule $\Delta\theta = \alpha \big( G_t^{(n)} - q_\theta(S_t,A_t) \big) \nabla_\theta q_\theta(S_t,A_t)$. The return is partially on-policy, while the bootstrap is off-policy: for $n$ steps, the agent follows its current policy (on-policy) and accumulates the rewards; after these $n$ steps, it bootstraps with a value estimate from the target (greedy) policy, which is the off-policy part.
This is a well-defined target: "on-policy for $n$ steps, then act greedily".
That's okay: it is less greedy, but still a policy improvement. This method allows the agent to explore and learn effectively from its own policy while still benefiting from the stability and efficiency of a greedy policy for value estimation.

15.4 Reinforcement Learning Aware Deep Learning

15.4.1 Architectures

Much of the successes of DL have come from encoding the right inductive bias in the network structure. To name a few examples:
We should not just copy architectures designed for supervised problems, but rather design the appropriate architecture to encode inductive biases that are good for RL.
Dueling networks
We can decompose $q_\theta(s,a) = v_\xi(s) + A_\chi(s,a)$, where $\theta = \xi \cup \chi$, with $A_\chi(s,a)$ representing the advantage of taking action $a$ in state $s$. That is, we modify
image: 85_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado68.png
to be:
image: 86_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado69.png
The Dueling Network architecture splits the deep neural network into two distinct pathways at its final layers:
  • a value stream, estimating the state value $v(s)$;
  • an advantage stream, estimating the advantage $A(s,a)$ of each action.
The outputs of these two pathways are then combined to estimate the Q-value function. This is typically done by adding the state value to the action advantage, but with an adjustment to maintain a zero advantage at the chosen action. The Q-value function represents the value of taking a certain action in a given state.
The final output is a set of Q-values for each action in the given state, just like in a traditional DQN. However, the way these Q-values are computed is different due to the dueling structure.
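A sketch of the aggregation step, combining the state value and the advantages so that the advantage of the chosen (greedy) action is zero; subtracting the mean advantage instead is a common alternative.
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a), forcing a zero advantage at the greedy action."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.max()

# Example: V(s) = 2.0 and advantages for three actions
print(dueling_q_values(2.0, [0.5, -0.2, 0.1]))  # -> [2.0, 1.3, 1.6]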
The advantages of this architecture over traditional ones are:
  1. Improved Learning of State Values: Since one pathway is dedicated to learning the state value function, the network can more accurately learn the value of each state, regardless of the actions available.
  2. Efficient Learning of Action Advantages: By separately estimating the advantage of each action, the network can more effectively distinguish between the importance of the state and the importance of the individual actions within that state. This leads to more nuanced policy learning.
  3. Stability and Robustness: Dueling networks can lead to more stable learning and improved convergence properties. By disentangling the estimation of state values and action advantages, the network reduces the potential for harmful interference between these two aspects of the learning process.
  4. Better Policy Evaluation and Generalization: The architecture allows for better generalization across actions without compromising the learning of the state value. It can also provide a more accurate evaluation of the policy.
  5. Handling Environments with Many Similar Actions: In situations where many actions yield similar results (e.g., in video games where moving left or right might have similar outcomes), a dueling network can effectively learn the value of the state, which is more significant than the specific action taken.

15.4.2 Capacity

In supervised DL, we often find that the more data and compute we have, the better performance we can get, since the loss is easier to optimise, there is less interference, we can have bigger models, etc. But this phenomenon does not appear in the same way in reinforcement learning, where larger networks do typically perform better overall, but become more susceptible to the deadly triad.
image: 87_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado70.png

15.4.3 Generalisation

The deadly triad also shows that generalization in RL can be tricky... Consider the problem of value learning, in presence of sharp discontinuities of v π . Deep neural networks, used in deep reinforcement learning (like DQN or policy gradient methods), might struggle to approximate functions with sharp discontinuities accurately. This is because neural networks, by their nature, tend to interpolate smoothly between points in their training data. For example:
image: 88_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado71.png
TD learning with DL approximation leads to what's called leakage propagation, that is, the discontinuities spread their values to undesired areas of the space:
image: 89_home_runner_work_BDMA_Notes_BDMA_Notes_Paris___chine-Learning_LectureNotes_source_pegado72.png

References

[1] Tom Dupuis, "Machine Learning", course notes.
[2] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", 2015.