Professor: Nacéra Seghouani
Student e-mail: jose-antonio.lorencio-abril@student-cs.fr
This is a summary of the course Massive Graph Management and Analytics taught at the Université Paris Saclay - CentraleSupélec by Professor Nacéra Seghouani in the academic year 23/24. Most of the content of this document is adapted from the course notes by Seghouani, [4], so I won’t be citing it all the time. Other references will be provided when used.
Graph-structured data is at the heart of complex systems and plays a major role in our daily life, science and economy. Examples of such data are the cooperation between billions of individuals, communication infrastructures with billions of cell phones, computers and satellites, the interactions between thousands of genes and metabolites within our cells, and so on.
Therefore, understanding the mathematical foundations of these systems, being able to describe and predict them, and eventually being able to control them, is one of the major scientific challenges of the 21st century.
A graph is a pair G = (V, E), where V is the set of vertices and E ⊂ V × V is the set of edges. Usually, we denote |V| = n and |E| = m.
There are different types of graphs:
Undirected: (u,v) ∈ E ⇐⇒ (v,u) ∈ E. That is, the edges go in both directions.
Directed: (u,v) ∈ E does not imply (v,u) ∈ E. That is, the edges have a direction, and it is possible that an edge goes from u to v, but not the other way.
Weighted vertices: the vertices have a weight. That is, there is a function wv : V → ℝ.
Weighted edges: the edges have a weight. That is, there is a function we : E → ℝ.
Labeled vertices: the vertices have a label, Lv : V → Σ, where Σ is the set of labels.
Labeled edges: the edges have a label, Le : E → Σ.
Bipartite: a graph G = (V,E) is bipartite if there is a partition of the vertices, V = V1 ∪ V2, with V1 ∩ V2 = ∅, such that E ⊂ (V1 × V2) ∪ (V2 × V1). That is, the vertices in V1 only connect to vertices in V2, and vice versa.
k-Partite: a graph G = (V,E) is k-partite if there is a k-partition of the vertices, V = V1 ∪ V2 ∪ ... ∪ Vk, such that Vi ∩ Vj = ∅, ∀i≠j, and there is no edge e = (u,v) such that u,v ∈ Vi, for the same i.
Multigraph or multidigraph: in this case, there can be several edges between two vertices. For this, we define the edges as a separate set E, together with a function r : E → V × V that assigns to each edge the pair of vertices it relates.
Hypergraph: in this case, E ⊂ 2^V. That is, an edge can relate zero or more vertices, and it is more appropriate to interpret E as a set of classes or hierarchies, rather than edges.
Complete: a graph is complete if E = V × V .
Some examples are:
Continuing with definitions, let G = (V,E) be a graph (directed or undirected). Let di+ and di− denote the number of edges going out of and coming into vi, respectively. The degree of vi is di = di+ + di−.
Note that for undirected graphs each edge is counted twice.
Now, let Ni+ and Ni− be the sets of successors and predecessors of vi, respectively. Then, the set of neighbours of vi is Ni = Ni+ ∪ Ni−.
A path between two vertices, u,v ∈ V, denoted u ⇝ v, is a sequence of vertices (u = v0, v1, ..., vk = v), where (vi−1, vi) ∈ E, ∀i = 1,...,k. The length of a path, L, is the number of edges in the path, that is, k.
A cycle is a path from a vertex to itself, u ⇝ u.
The distance between two nodes, d(u,v), is the shortest path length between them: d(u,v) = min{L(p) : p is a path u ⇝ v}.
The eccentricity of a node, ecc(u), is the greatest distance between u and any other vertex in the graph: ecc(u) = max_{v∈V} d(u,v).
Note that this could be infinity if we cannot reach some node from u. Usually, we consider only reachable nodes, because this can give us information about the graph, but a value of infinity is not very informative.
The diameter of a graph, diam(G), is the greatest distance between any two nodes in the graph: diam(G) = max_{u,v∈V} d(u,v).
The radius of a graph, rad(G), is the minimum eccentricity over the vertices of the graph: rad(G) = min_{u∈V} ecc(u).
The center of a graph, C(G), is the set of all vertices of minimum eccentricity, i.e., of eccentricity equal to the graph radius: C(G) = {u ∈ V : ecc(u) = rad(G)}.
Example 2.1. Compute the diameter, radius and center of the following graphs:
The solution is the following:
In each node, we show its eccentricity. The diameter is 6, the radius is 3 and the center is c (in blue).
Solution:
In this case, the diameter is 5, the radius is 3 and the center is .
A partial graph of G = (V,E) is a graph G′ = (V,E′), where E′ ⊂ E.
A subgraph of G = (V,E) is a graph G′ = (V′,E′) where V′ ⊂ V and E′ ⊂ E. Note that partial graphs are also subgraphs.
A graph G = (V,E) is said to be connected if, and only if, ∀u,v ∈ V, ∃ u ⇝ v.
A (strongly) connected component of G = (V,E) is a subgraph Gcc = (Vcc,Ecc), where ∀u,v ∈ Vcc, ∃ u ⇝ v using only vertices of Vcc. That is, a connected subgraph. It is called strongly connected when the paths must follow the edge directions.
A graph G = (V,E) is a tree if, and only if, G is a connected graph without cycles. In this case, the graph has m = n − 1 edges.
A graph G = (V,E) is a forest if, and only if, all connected components of G are trees.
BFS is a method to traverse the nodes of a graph: we start at one node, visit all its neighbours, then all unvisited neighbours of those neighbours, and so on.
For this, we use a FIFO queue. The algorithm is sketched below:
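A minimal Python sketch of this procedure (my own illustration, assuming the graph is stored as a dictionary mapping each vertex to the collection of its neighbours):

from collections import deque

def bfs(graph, start):
    """Breadth-first traversal; returns the vertices in the order they are visited."""
    visited = {start}
    order = []
    queue = deque([start])          # FIFO queue
    while queue:
        u = queue.popleft()         # take the oldest discovered vertex
        order.append(u)
        for v in sorted(graph[u]):  # sorted to mimic the lexicographical choice used below
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return order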
Example 2.2. Apply BFS in the following graph, starting at node A.
Q=[A]. We visit A’s neighbours:
Q=[F,G]. Now, by lexicographical order, we visit F’s neighbours:
Q=[G,B,E]. Now, we visit G’s neighbours. Since it has no new unvisited neighbours, there is no change.
Q=[B,E]. Now, we visit B’s neighbours:
Q=[E,C]. Now, we visit E’s neighbours:
Q=[C,D]. Everything is visited, so the queue will be slowly emptied!
In the case of DFS, the objective is also to traverse the whole graph. The difference is that in this case we try to go as deep as we can in the graph before visiting more neighbours.
It can be implemented with a stack, be it an explicit stack or an implicit one (the call stack, via recursion).
The implementation with an explicit stack is the following:
The implementation with an implicit stack is recursive, and is as follows:
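Two possible Python sketches (mine, not taken from the notes), one with an explicit stack and one using recursion, i.e. the implicit call stack:

def dfs_iterative(graph, start):
    """DFS with an explicit stack (LIFO)."""
    visited, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()                            # LIFO: go as deep as possible first
        if u in visited:
            continue
        visited.add(u)
        order.append(u)
        for v in sorted(graph[u], reverse=True):   # reversed so the smallest is popped first
            if v not in visited:
                stack.append(v)
    return order

def dfs_recursive(graph, u, visited=None, order=None):
    """DFS using the call stack as an implicit stack."""
    if visited is None:
        visited, order = set(), []
    visited.add(u)
    order.append(u)
    for v in sorted(graph[u]):
        if v not in visited:
            dfs_recursive(graph, v, visited, order)
    return order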
A graph, G = (V,E), with n vertices and m edges can be encoded using different structures:
Adjacency matrix: a matrix A ∈ {0,1}^(n×n), defined by Aij = 1 if (vi,vj) ∈ E, and Aij = 0 otherwise.
The adjacency matrix is symmetric for undirected graphs.
Adjacency list: a list L of length n in which each vertex holds the list of its neighbours: L[i] = Ni.
If G is directed, the choice of the direction depends on the analytic needs.
Incidence matrix: a matrix B ∈ {0,1}^(n×m), defined by Bij = 1 if vertex vi is incident to edge ej, and Bij = 0 otherwise.
Using graph traversal algorithms, propose an algorithm that computes the number of edges between a given vertex and all other vertices.
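One possible answer (a sketch of my own): a BFS from the given vertex, recording the depth at which each vertex is discovered; that depth is exactly the number of edges of a shortest path to it.

from collections import deque

def distances_from(graph, source):
    """Number of edges on a shortest path from `source` to every reachable vertex."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in depth:          # first discovery gives the shortest depth
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth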
Given the following cycles with even and odd lengths (with the distances or depths from the grey vertex), what do you think about the case of graphs with an odd cycle (in number of edges)? Is this a characteristic property? State the general case.
Proposition: a graph contains a cycle C with an odd number of edges if, and only if, ∃(x,y) ∈ E such that depth(x) = depth(y).
Proof: first, we know that all edges connect vertices of ’neighbouring’ depths. That is, ∀(x,y) ∈ E, it holds |depth(x) − depth(y)| ≤ 1.
Seeking a contradiction (reductio ad absurdum), suppose that ∀(x,y) ∈ C it is depth(x) ≠ depth(y). This means that depth(x) = depth(y) ± 1. Therefore, along the cycle, a node of even depth is followed by a node of odd depth, and so on. When we close the cycle, the final node is the initial one, so its depth parity is the same as at the start. Therefore, we need an even number of edges to conserve the parity.
If there is an edge (x,y) ∈ E with depth(x) = depth(y), then we can consider the traversal tree that was used to annotate the depths. In this tree, x and y have a first common ancestor z, from which we can form an odd cycle of size 2·(depth(x) − depth(z)) + 1 by adding the edge (x,y) to this subtree rooted at z.
Propose an algorithm that determines if a graph contains an odd cycle.
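A possible sketch, based directly on the proposition above (assuming an undirected graph given as a neighbour dictionary): run BFS from every unvisited vertex, annotate depths, and report an odd cycle as soon as an edge joins two vertices of equal depth.

from collections import deque

def has_odd_cycle(graph):
    """True iff some connected component contains a cycle with an odd number of edges."""
    depth = {}
    for s in graph:
        if s in depth:
            continue
        depth[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
                elif depth[v] == depth[u]:   # edge between two vertices of the same depth
                    return True
    return False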
In a bipartite graph, can there be a cycle with an odd number of edges? Is this a characteristic property?
No, it is not possible!
Proposition: A graph is bipartite if, and only if, all cycles are of even size.
If the graph is bipartite, any cycle alternates between vertices of the two partitions and ends at the initial vertex; therefore, all cycles must be of even size.
Consider the partition of vertices with even depth V 1, and the partition of vertices with odd depth V 2.
Since there is no odd cycle, then, from question 2, we know that ∀(x,y) ∈ E it is depth(x) = depth(y) ± 1. Therefore, every edge joins V1 and V2, and the graph is bipartite.
Propose an algorithm that allows to determine if a graph is bipartite. Test your algorithm in the following graph. Is it bipartite? Justify your answer.
The algorithm is the same as in exercise 3, because of exercise 4.
The proposed graph is clearly not bipartite, because there are several odd cycles.
Graph coloring is a way of coloring the vertices of a graph in such a way that no two adjacent vertices share the same color. A 2-colorable graph is a graph that can be colored with only 2 colors.
What is the link with the previous exercise? Justify your answer.
Proposition: a graph is 2-colorable if, and only if, it is bipartite.
Proof: if it is 2-colorable, with colors red and blue, then we take V1 = {red vertices} and V2 = {blue vertices}. G is clearly bipartite with this partition.
If it is bipartite, with partition V 1 and V 2, then we can color all nodes in V 1 in blue, and all nodes in V 2 in red. The graph is 2-colorable.
We want to write an algorithm, inspired by DFS, which takes as input a graph, G = (V,E), and which returns a pair (result, color), where result is True if the graph is 2-colorable, False otherwise, and color is a dictionary associating a color 0 or 1 to each vertex. This algorithm should stop as soon as possible when the graph is not 2-colorable. A possible sketch is given below.
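A possible sketch of such an algorithm (my own; it uses an explicit stack instead of recursion to avoid deep call stacks, but follows the same DFS idea):

def two_color(graph):
    """Return (result, color): result is True iff the graph is 2-colorable."""
    color = {}
    for s in graph:
        if s in color:
            continue
        color[s] = 0
        stack = [s]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v not in color:
                    color[v] = 1 - color[u]   # neighbours get the opposite color
                    stack.append(v)
                elif color[v] == color[u]:    # conflict: stop as soon as possible
                    return False, color
    return True, color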
Compute the shortest path in the following graph using Dijkstra’s algorithm, starting at A:
We start with: dist =
A | B | C | D | E | F | G |
0 | ∞ | ∞ | ∞ | ∞ | ∞ | ∞ |
Now, w = A and P = {A}. We update dist =
A | B | C | D | E | F | G |
0 | ∞ | ∞ | ∞ | ∞ | 3 | 1 |
Now, w = G and P = {A,G}. We update dist =
A | B | C | D | E | F | G |
0 | ∞ | ∞ | ∞ | 3 | 2 | 1 |
Now, w = F and P = {A,G,F}. We update dist =
A | B | C | D | E | F | G |
0 | 6 | ∞ | ∞ | 3 | 2 | 1 |
Now, w = E and P = {A,G,F,E}. We update dist =
A | B | C | D | E | F | G |
0 | 6 | 4 | 8 | 3 | 2 | 1 |
Now, w = C and P = {A,G,F,E,C}. We update dist =
A | B | C | D | E | F | G |
0 | 6 | 4 | 6 | 3 | 2 | 1 |
Now, w = B and P = {A,G,F,E,C,B}. dist does not change.
Finally, w = D and P = {A,G,F,E,C,B,D}. dist does not change.
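A standard Python implementation of Dijkstra’s algorithm with a priority queue (a sketch of mine; the graph is assumed to be a dictionary mapping each vertex to a dictionary of neighbour → edge weight):

import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source`, assuming non-negative edge weights."""
    dist = {u: float("inf") for u in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # outdated queue entry, skip it
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w               # relax the edge (u, v)
                heapq.heappush(heap, (dist[v], v))
    return dist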
Given the following graphs:
Give the different representations of these graphs.
A1 = (adjacency matrix of the first graph)
Adjacency list of the first graph: 0 : …, 1 : …, 2 : …, 3 : …, 4 : …, 5 : …
B1 = (incidence matrix of the first graph)
A2 = (adjacency matrix of the second graph)
Adjacency list of the second graph: 0 : …, 1 : …, 2 : …, 3 : …, 4 : …, 5 : …
B2 = (incidence matrix of the second graph)
Compute A², A³. What does A^r_ij represent?
A^r_ij represents the number of walks of length r from node i to node j (paths in which vertices may be repeated).
What is the complexity of computing A^r? Is it possible to reduce it?
Computing A^r naively is O(r·n³), since it requires r matrix products, each of complexity O(n³).
However, we can reuse some results to reduce the complexity:
If r is even, we can do A^r = (A^(r/2))².
If r is odd, we can do A^r = A·(A^((r−1)/2))².
Therefore, we can obtain A^r in O(log(r)·n³).
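A small numpy sketch of this idea (repeated squaring; illustrative, not from the notes):

import numpy as np

def matrix_power(A, r):
    """Compute A^r with O(log r) matrix products instead of r."""
    result = np.eye(A.shape[0], dtype=A.dtype)
    base = A.copy()
    while r > 0:
        if r % 2 == 1:          # odd exponent: multiply the current square into the result
            result = result @ base
        base = base @ base      # square
        r //= 2
    return result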
A norm is a function f that measures the size of a vector. It must satisfy the following properties:
f(x) = 0 ⇐⇒ x = 0.
Linear on scale factors: f(αx) = |α|·f(x).
Triangle inequality: f(x + y) ≤ f(x) + f(y).
A widely used family of norms are the p-norms: ||x||_p = (∑_i |x_i|^p)^(1/p),
with the most common one being the Euclidean norm, for p = 2: ||x||_2 = (∑_i x_i²)^(1/2).
The determinant of a square matrix is equal to the (signed) hypervolume of the parallelotope defined by the vectors of the matrix. It is 0 if, and only if, the set of vectors is linearly dependent.
The determinant can be used for many things:
We can represent linear systems with matrices as Y = AX, and there are many methods to solve such systems efficiently.
With the determinant we can compute the characteristic polynomial of A, whose roots are the eigenvalues of A.
Some properties of the determinant are:
det(I) = 1, where I is the identity matrix.
det(A) = 0 if A is singular (not invertible).
det(Aᵀ) = det(A).
det(AB) = det(A)·det(B).
det(cA) = cⁿ·det(A), where n is the dimension of A.
A square matrix, A, is invertible (non-singular, non-degenerate), with inverse denoted A⁻¹, if ∃B such that AB = BA = I;
in this case, A⁻¹ = B.
Properties of the inverse:
(A⁻¹)⁻¹ = A.
(Aᵀ)⁻¹ = (A⁻¹)ᵀ.
(AB)⁻¹ = B⁻¹A⁻¹.
(cA)⁻¹ = (1/c)·A⁻¹ for c ≠ 0.
det(A⁻¹) = 1/det(A).
An eigenvector or characteristic vector of a linear transformation, T, is a non-zero vector that changes only by a scalar factor, λ, when transformed by T. That is, v is an eigenvector of the linear transformation T if T(v) = λv.
There is a direct correspondence between n×n matrices and linear transformations of the n-dimensional vector space into itself. That is, every linear transformation T can be represented as a matrix A_T (the matrix depends on the chosen basis). Therefore, we can say that A_T has an eigenvector v if A_T·v = λv.
The scale factors of the eigenvectors are called eigenvalues.
We can find the eigenvalues by solving a polynomial equation in λ, called the characteristic polynomial of A_T. Starting from (A_T − λI)·v = 0,
this equation has a non-zero solution if, and only if, det(A_T − λI) = 0.
Therefore, we can compute det(A_T − λI) and find all values of λ that make it equal to 0.
Once we have the eigenvalues, we can use them to find the corresponding eigenvectors.
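As a small check of this machinery, numpy can compute eigenvalues and eigenvectors directly (my own example matrix, not one from the notes):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # arbitrary symmetric example

eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)     # each column satisfies A v = lambda v
print(eigenvalues)                         # eigenvalues 3 and 1 (order may vary)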
Example 2.4. Compute the eigenvalues and eigenvectors of A = .
This has as solutions
Therefore, we have λ1 = 1,λ2 = 3.
To find the eigenvectors, we solve
For λ1 = 1, it is
so the eigenvector associated to λ1 = 1 is
For λ2 = 3, it is
so the eigenvector associated to λ2 = 3 is
The algebraic multiplicity, ti, of the eigenvalue λi is its multiplicity as a root of the characteristic polynomial.
Note that A can have at most n distinct eigenvalues, although some of them may be complex.
The eigenspace of an eigenvalue, λ, is the space generated by the eigenvectors associated to λ.
The dimension of the eigenspace of λ is the geometric multiplicity of λ. The geometric multiplicity of an eigenvalue is, at most, its algebraic multiplicity.
Example 2.5. Let’s get some eigenspaces:
A = , so
det(A − λI) = (1 − λ)²(2 − λ).
This has roots λ1 = 1, with algebraic multiplicity 2, and λ2 = 2, with algebraic multiplicity 1.
Now, we get the eigenvectors associated to them:
For λ1 this is
so vλ1 = , with dimension 1 (it could be 2).
For λ2 this is
so vλ2 = , with dimension 1 (it could not be otherwise, since the algebraic multiplicity is 1).
B = , so
det(B − λI) = (1 − λ)²(−2 − λ).
This has roots λ1 = 1, with algebraic multiplicity 2, and λ2 = -2, with algebraic multiplicity 1.
Now, we get the eigenvectors associated to them:
For λ1 = 1, we have
Therefore, the eigenspace associated to λ1 is
For λ2 = −2, we have
Thus, the eigenspace associated to λ2is
C = ,
det(C − λI) = (0 − λ)(1 − λ)(3 − λ) = −λ(1 − λ)(3 − λ).
λ1 = 0:
so
λ2 = 1:
so
λ3 = 3:
so
D = ,
det(D − λI) = −λ³ + 2λ² + 5λ − 6.
To obtain the roots, we can use Ruffini:
-1 | 2 | 5 | -6 | |
1 | -1 | 1 | 6 | |
-1 | 1 | 6 | 0 | |
So λ1 = 1 is a root and we have now -λ2 + λ + 6 = 0, obtaining
and we get λ2 = -2 and λ3 = 3.
λ1 = 1:
Then, E = .
λ2 = -2:
Then, E = .
λ3 = 3:
Then, E = .
E = ,
det(E − λI) = −λ³ + 12λ² − 36λ + 32.
Again, we can use the Ruffini rule:
-1 | 12 | -36 | 32 | |
2 | -2 | 20 | -32 | |
-1 | 10 | -16 | 0 | |
So λ1 = 2 is a root, and we now have -λ2 + 10λ - 16 = 0, which gives us
Therefore, λ1 is a double root and the other root is λ2 = 8.
λ1 = 2:
If x = 0: y = z.
If x = t≠0: -y + z = -2t, working for y = t and z = -t.
So
λ2 = 8:
Therefore,
F = ,
det(F − λI) = (1 − λ)²(2 − λ).
One root is λ1 = 1, with algebraic multiplicity 2, and the other is λ2 = 2, with algebraic multiplicity 1.
λ1 = 1:
If x = 0: y = -z.
If x = t≠0: y + z = -t. This works for y = t,z = -2t.
Therefore,
λ2 = 2:
So
Another way to represent eigenvalues and eigenvectors is A·V = V·Λ,
where V = [v1 | v2 | ... | vn] is the matrix formed by putting each eigenvector as a column, and
Λ = diag(λ1, ..., λn) is the diagonal matrix formed by all eigenvalues.
A matrix A is diagonalizable if there exist n linearly independent eigenvectors. That is, if the matrix V is invertible.
This leads naturally to the eigen-decomposition of the matrix, A = VΛV⁻¹.
A real matrix, U, is orthogonal if UTU = UUT = I.
Example 2.6. Some examples of orthogonal matrices:
Identity: I
Permutation of coordinates:
Rotation: .
Reflection: .
A matrix A is said to be positive semi-definite when it can be obtained as the product of a matrix by its transpose: A = BᵀB.
Positive semi-definite matrices are always symmetric, because Aᵀ = (BᵀB)ᵀ = BᵀB = A.
A symmetric matrix A is positive semi-definite if all its eigenvalues are non-negative.
In this case, since eigenvectors are orthogonal, it is possible to store all the eigenvectors in an orthogonal matrix.
Therefore, the eigen-decomposition of a positive semi-definite matrix, A, could be A = UΛUᵀ,
with U an orthogonal matrix.
As a consequence, the eigen-decomposition of a positive semi-definite matrix is often referred to as its diagonalization.
An alternative definition for positive semi-definite matrix is:
A is positive semi-definite if xTAx ≥ 0,∀x.
If it is xTAx > 0,∀x, then it is positive definite.
If it is xTAx ≤ 0,∀x, then it is negative semi-definite.
If it is xTAx < 0,∀x, then it is negative definite.
The rank of a matrix is the dimension of the vector space generated by its columns (or rows). This corresponds to the maximum number of linearly independent columns of A. A matrix whose rank is equal to its size is called a full rank matrix. Only full rank matrices have an inverse.
We can now define the Laplacian matrix for undirected graphs, as L = D − A,
or, equivalently, Lij = di if i = j, Lij = −1 if (vi,vj) ∈ E, and Lij = 0 otherwise,
where D is the degree matrix of G (the diagonal matrix with the vertex degrees on the diagonal), and A its adjacency matrix.
What could you say about these matrices?
A = , det = -, A is invertible. Its eigenvalues are λ1 = -1+ and λ2 = -1-, with vλ1 = and vλ2 = .
B = . The second row is equal to the first row multiplied by -. Therefore, it is not invertible.
I: its determinant is 1. It is symmetric, orthogonal, its own inverse. Triple eigenvalue 1, with eigenspace the whole space.
Show that Aⁿ = XΛⁿX⁻¹.
First, this is only true if A is diagonalizable. If that is the case, then we can proceed by induction on n:
n = 1: Obvious.
n = 2:
Suppose it is true for n − 1, i.e., A^(n−1) = XΛ^(n−1)X⁻¹.
Then, for n, we have: Aⁿ = A·A^(n−1) = (XΛX⁻¹)(XΛ^(n−1)X⁻¹) = XΛ·Λ^(n−1)·X⁻¹ = XΛⁿX⁻¹.
Find the eigenvalues and unit eigenvectors of AᵀA and AAᵀ with A the Fibonacci matrix.
First of all, notice that A is symmetric, so AᵀA = AAᵀ = A².
det(A² − λI) = (2 − λ)(1 − λ) − 1 = 2 − 3λ + λ² − 1 = λ² − 3λ + 1. The roots of this polynomial are λ = (3 ± √5)/2.
Now,
Therefore
with unit eigenvector v1 = .
And
with unit eigenvector v2 = .
Without multiplying
find the determinant, the eigenvalues and the eigenvectors. Why is S positive definite?
We have S = UΛUT with U orthogonal. Therefore, the eigenvalues of S are 2 and 5. Its determinant is 10. The eigenvectors are the eigenvectors of Λ rotated as well, that is:
S is positive definite because it is symmetric and all its eigenvalues (2 and 5) are strictly positive:
now note that, for any x ≠ 0,
xᵀSx = xᵀUΛUᵀx = (Uᵀx)ᵀΛ(Uᵀx) > 0,
because Λ has strictly positive diagonal entries and Uᵀx ≠ 0.
For what numbers c and d are the following matrices positive definite?
A = : all principal minors must be positive. That is:
c > 0.
The 2×2 leading minor: c² − 1 > 0. Combined with the previous one, this is c > 1.
The determinant: c³ − 3c + 2. Its roots can be found with Ruffini, starting with the root 1:
1 | 0 | -3 | 2 | |
1 | 1 | 1 | -2 | |
1 | 1 | -2 | 0 | |
Therefore, it is c > 1.
B = :
1 > 0.
The 2×2 leading minor: d − 4 > 0 ⇐⇒ d > 4.
The determinant: 5d + 24 + 24 − 9d − 16 − 20 = −4d + 12 > 0 ⇐⇒ −4d > −12 ⇐⇒ d < 3.
Therefore, there is no value of d for which B is positive definite.
Show that if λ1, λ2, ..., λn are the eigenvalues of a matrix A, then A^m has as eigenvalues λ1^m, λ2^m, ..., λn^m.
Induction on m.
m = 1: Obvious.
m = 2: Let vi be the eigenvector associated to λi, then A²vi = A(Avi) = A(λi·vi) = λi·(Avi) = λi²·vi,
so λi² is an eigenvalue of A², with associated eigenvector vi.
Suppose it is true for m − 1; then, for m: A^m·vi = A(A^(m−1)·vi) = A(λi^(m−1)·vi) = λi^(m−1)·(Avi) = λi^m·vi,
and we have the result.
What is the determinant of any orthogonal matrix?
If U is orthogonal, then UUᵀ = I. Then, 1 = det(I) = det(UUᵀ) = det(U)·det(Uᵀ) = det(U)².
Therefore, det(U) = ±1.
For an undirected graph, both the adjacency matrix and the Laplacian matrix are symmetric. Show that the Laplacian matrix is positive semi-definite.
Remark 3.1. Observe that it is not the same for a matrix to be positive (all its entries positive) as to be positive semi-definite.
The Perron-Frobenius theorem for non-negative matrices leads to the characterization of dominant eigenvalues and non-negative eigenvectors. This is useful for stationary distributions, such as those of Markov chains and of Google’s famous PageRank algorithm.
Theorem 3.1. Perron-Frobenius Theorem for positive matrices
If A is a positive matrix, then:
∃λ* > 0, v* > 0, ||v*||2 = 1 such that A·v* = λ*·v* (v* is a right column eigenvector).
∃λ* > 0, w* > 0, ||w*||2 = 1 such that w*·A = λ*·w* (w* is a left row eigenvector).
For any other eigenvalue, λ, it holds |λ| < λ* (λ* is a dominant eigenvalue, called the Perron eigenvalue).
λ* is unique and v* is unique (the only vector of unit length associated to λ*).
Theorem 3.2. Perron-Frobenius Theorem for non-negative matrices
If A is a non-negative matrix, then:
∃λ* > 0, v* ≥ 0, ||v*||2 = 1 such that A·v* = λ*·v* (v* is a right column eigenvector).
∃λ* > 0, w* ≥ 0, ||w*||2 = 1 such that w*·A = λ*·w* (w* is a left row eigenvector).
For any other eigenvalue, λ, it holds |λ| ≤ λ* (λ* is a dominant eigenvalue, called the Perron eigenvalue).
If A is irreducible, then the vector v* is unique and it holds v* > 0.
If A is primitive, then the eigenvalue λ* is unique.
Note now that, for a graph, G = (V,E), with adjacency matrix A: G is connected ⇐⇒ ∀1 ≤ i,j ≤ n, ∃k ∈ ℕ* such that A^k_{i,j} > 0. This means that the adjacency matrix of a connected graph is irreducible.
Now, if for some k every pair of nodes is joined by a path of length exactly k (i.e., A^k > 0), then the adjacency matrix is primitive. One sufficient condition for this is that the graph is connected and Aii > 0 for some i.
A random walk on a graph, G = (V,E), is a random process that starts from some vertex vi, and repeatedly moves to a neighbour vj chosen at random (for example with uniform distribution). The random walk, ξt, is therefore a random variable describing the position of the walk after t steps. The probability of going from node i to node j is the transition probability, Pij = P(ξ_{t+1} = j | ξ_t = i).
The sequence of nodes can be regarded as a Markov chain, i.e. a discrete time stochastic process, where the position ξ0 is the initial state, drawn according to the initial distribution, P0, and from this point the next state only depends on the current state. The t-step transition probability is P^t_ij = P(ξ_t = j | ξ_0 = i).
Some examples are the path traced by a molecule in a liquid or a gas (Brownian motion), the price of a fluctuating stock, the financial status of a gambler, etc. The term random walk was first introduced by Karl Pearson in 1905.
The following is a basic visual example of a random walk on a graph:
Note that we can express the transition probabilities Pij in a matrix P. This matrix is the transition probability matrix, and it is row-stochastic or row-Markov, meaning ∑_j Pij = 1, ∀i.
This implies that P·1 = 1, where 1 is the all-ones column vector.
This means that 1 is an eigenvector and 1 is an eigenvalue. Moreover, 1 is the largest eigenvalue: if Pv = λv and i is an index where |vi| is maximal,
then |λ|·|vi| = |∑_j Pij·vj| ≤ ∑_j Pij·|vj| ≤ |vi|,
so |λ| ≤ 1.
From the Perron-Frobenius theorem for non-negative matrices, we know that:
v* = 1 (the all-ones vector) is a right Perron eigenvector for P.
λ* = 1 is the Perron eigenvalue, and |λ| ≤ 1 for every eigenvalue λ.
There exists a left Perron eigenvector π, with π·P = π.
If P is irreducible, the vector π is unique.
If P is primitive, the eigenvalue 1 is unique (there are no complex eigenvalues with norm 1).
Let πt be the row vector giving the probability distribution of ξt, that is, πit is the probability that the random walk is at node i at time t. Therefore, we can write π^(t+1) = π^t·P,
which, applied recursively, leads to π^t = π^0·P^t.
Or, we can take limits as t → ∞.
If this limit exists, lim_{t→∞} π^t = π, then π = π·P.
Convergence is ensured if P is irreducible.
A common way to perform random walks on graphs is with the uniform probability. That is, Pij = 1/di if (vi,vj) ∈ E, and Pij = 0 otherwise,
where di is the degree of node i. Equivalently, in matrix form, P = D⁻¹A.
The random sequence of vertices ξ0, ξ1, ..., ξt, ξt+1, ... visited on G is a Markov chain with state space V and transition probability matrix P = D⁻¹A.
A probability distribution π satisfies the balance condition if πi·Pij = πj·Pji, ∀i,j.
If π satisfies the balance condition, then it is the stationary distribution for the undirected graph. To see this, notice that the balance condition can be rewritten as πi·Aij/di = πj·Aji/dj.
Since the graph is considered without direction, Aij = Aji, and then πi/di = πj/dj = c,
where c is a constant, for all i,j. Now, we know that ∑_i πi = 1, so c·∑_i di = c·2m = 1.
Therefore c = 1/2m.
Finally, it must be πi = di/2m.
In this case, π is indeed stationary: (π·P)_j = ∑_i (di/2m)·(Aij/di) = (∑_i Aij)/2m = dj/2m = πj.
Therefore, we have seen that the stationary probabilities are proportional to the degrees of the vertices.
In particular, if G is d-regular, i.e., all nodes have degree d, then πi = d/(n·d) = 1/n,
which is the uniform distribution. With this setup, a random walk moves along every edge with the same frequency.
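A quick numerical illustration (mine, not from the notes): iterating π^(t+1) = π^t·P on a small non-bipartite undirected graph converges to the degree-proportional distribution.

import numpy as np

# small undirected, non-bipartite example graph (adjacency matrix chosen for illustration)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
d = A.sum(axis=1)                   # degrees
P = A / d[:, None]                  # P = D^-1 A, row-stochastic

pi = np.full(len(A), 1 / len(A))    # start from the uniform distribution
for _ in range(1000):
    pi = pi @ P                     # pi^{t+1} = pi^t P

print(np.allclose(pi, d / d.sum())) # True: stationary distribution proportional to degrees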
The balance condition implies time-reversibility: the reversed walk is also a Markov chain.
Remark. In general, Hij≠Hji, so H is not symmetric.
The lazy random walk is a variation of the random walk, in which the walk stays at the current node with probability 1/2, and continues with the walk with the remaining probability.
In this case, the transition matrix is P = (1/2)·(I + Q).
If Q is the transition matrix for the uniform random walk, then the lazy random walk has the same stationary distribution π.
Proof. Let Q be the transition matrix for the uniform random walk; its stationary distribution π satisfies π·Q = π.
For the lazy random walk, say the stationary distribution is π′. Then: π′·P = π′ ⇐⇒ (1/2)·π′·(I + Q) = π′ ⇐⇒ π′·Q = π′.
Therefore, since Q is irreducible, the uniqueness of π implies π′ = π. □
The web is very heterogeneous by nature, and certainly huge. We cannot expect the web graph to be connected. Page and Brin proposed a way to overcome this problem, by ensuring the convergence of random walks on the web graph.
The idea is to fix a positive constant, p, between 0 and 1, called the damping factor, which represents the probability that a user leaves the current page and jumps to a random page of the web.
Therefore, the PageRank transition matrix is Pg = (1 − p)·P + p·B,
where B is the matrix with Bij = 1/n for all i,j.
p is usually chosen small, like 0.15, modelling a situation in which a surfer will, most of the time, follow the outgoing links and move on to one of the neighbours. A smaller percentage of the time, the surfer will dump the current page and choose arbitrarily a different page from the web.
Proof. We need to prove that, for all i, it holds ∑_j (Pg)ij = 1.
∑_j (Pg)ij = ∑_j [(1 − p)·Pij + p·Bij] = (1 − p)·∑_j Pij + p·∑_j Bij = (1 − p)·1 + p·n·(1/n) = 1 − p + p = 1.
Centrality Measures try to answer the question ’What characterizes an important vertex?’. They define a real-valued function on the vertices of the graph, m : V → ℝ, that serves to rank the vertices. However, there are many different ways to define such function, leading to different definitions of centrality, such as cohesiveness, ability to transfer information across the network, to influence other nodes, to control information flow, etc.
There are many centrality measures that count the number of paths through a given vertex. These differ in how relevant walks are defined and counted. For example, if we only consider paths of length one, we would be computing degree centrality, while if we allow paths of arbitrary length, we would be computing eigenvector centrality.
The more neighbours a vertex has, the higher its communication ability is, increasing its importance.
Example 4.1. Consider the following graph:
The degree centrality is
so the nodes with highest value are nodes .
One drawback of this measure is that it is very likely that several nodes present the same exact value, which makes a unique ranking of vertices difficult.
This measure corresponds to the average degree of the neighbours of each vertex. We could understand it as measuring how much a vertex is related to influential vertices.
Example 4.2. The neighbourhood centrality of the previous example graph is
so the nodes with highest value are nodes .
A natural extension of degree centrality is to consider all reachable nodes, not just neighbours. Eigenvector centrality measures a node’s importance while considering the importance of its neighbours. A high eigenvector centrality means that a node is connected to many nodes that have high scores themselves.
Formally, λ·E = A·E, which means that E is an eigenvector of A, for the eigenvalue λ.
Bonacich suggested that the eigenvector of the largest eigenvalue of A could make a good network centrality measure.
The eigenvector E must be non-negative and according to the Perron-Frobenius theorem, the largest λ enforces this property, making it a suitable value.
Example 4.3. Let’s compute E for the previous example graph. The matrix A has as largest eigenvalue λ = 2.54, and the corresponding eigenvector is
Note that it is usually unfeasible to compute the eigenvalues and eigenvectors exactly on large graphs. It is more usual to get the vector iteratively (power iteration) as E^(t+1) = A·E^t / ||A·E^t||.
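A minimal power-iteration sketch of this computation (mine, not from the notes):

import numpy as np

def eigenvector_centrality(A, iterations=100):
    """Power iteration towards the dominant eigenvector of the adjacency matrix A."""
    e = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iterations):
        e = A @ e
        e = e / np.linalg.norm(e)    # re-normalize at each step
    return e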
Google’s PageRank is a variant of the eigenvector centrality, which uses in-degree to award one centrality point for every link a node receives. As we saw, the algorithm is based on a web surfer who is randomly clicking on links, with a certain probability to go to a different place of the web (the damping factor).
Therefore, we define the matrix Pg = (1 − p)·P + p·B,
where Pij = Aij/di^out, and Bij = 1/n.
Now, we apply the eigenvector centrality to this modified matrix, as Eg·Pg = λ·Eg,
with λ = 1 because Pg is stochastic.
Or, iteratively, as Eg^(t+1) = Eg^t·Pg.
Note that in this case it is not necessary to normalize the vector at each step, because Pg is stochastic. A good initial vector is Eg^0 = (1/n, ..., 1/n).
The main problem with eigenvector centrality is that it only works well when the graph is strongly connected (so Perron-Frobenius is applicable in its stronger form). Real networks do not usually have this property, especially if they are directed. The vertices that are not in strongly connected components will have value 0.
A way to work around this problem was proposed by Leo Katz. The idea is to give each node a minimum, positive amount of centrality that it can transfer to other nodes, so: K = α·A·K + β,
where Kv is the Katz centrality of node v, β is a vector whose elements are all equal to a given positive constant and α ∈ (0, 1/λmax) is a parameter. Equivalently, this is K − α·A·K = β,
so (I − α·A)·K = β,
and K = (I − α·A)⁻¹·β.
For this to work, I − α·A must be invertible, which happens if and only if det(I − α·A) ≠ 0, i.e., 1/α must not be an eigenvalue of A. This is ensured if we take 1/α > λmax, or 0 < α < 1/λmax.
An iterative way to compute K is K^(t+1) = α·A·K^t + β.
The strength of α decreases at each iteration, acting as an attenuation factor.
Triadic closure is the property among three nodes A, B, and C (representing people, for instance), that if the connections A-B and A-C exist, there is a tendency for the new connection B-C to be formed.
The clustering coefficient of a node measures the proportion of pairs of its neighbours that are connected to each other.
The more densely connected the neighbourhood of v is, the higher is its clustering coefficient.
Closeness centrality is a measure of how close a node is, on average, to the rest of the nodes, in terms of shortest paths. It measures the average distance between a node v and all other nodes in the network. Thus, the more central a node is, the closer it is to all other nodes.
An alternative is the harmonic centrality, obtained as the sum of the inverse distances, H(v) = ∑_{r≠v} 1/dist(v,r),
with 1/dist(v,r) = 0 if there is no path from v to r.
A family of betweenness measures are defined to capture a node’s importance as a conduit of information flow in the network. This has wide applications in network theory, because in a telecommunications network, a node with higher betweenness centrality would have more control over the network, since more information will pass through that node.
The most well-known betweenness metric measures the number of times a node is on a shortest path between two nodes.
Definition 4.6. Given a graph G = (V,E), the betweenness centrality of node v is B(v) = ∑_{s≠v≠t} σs,t(v)/σs,t,
where σs,t is the number of shortest paths from source node s to target node t, and σs,t(v) is the number of shortest paths between these two nodes going through v.
This measure can be normalized by the number of ordered pairs not including v:
For directed graphs, (n − 1)(n − 2).
For undirected graphs, (n − 1)(n − 2)/2.
Example 4.5. For the undirected star graph:
The center vertex has a betweenness of (n − 1)(n − 2)/2 (or 1, if we normalize it), while the leaves have a betweenness of 0.
The advent of big data and the increasing analysis needs favoured the design of parallel algorithms, especially in the realm of big data processing pipelines, with a tradeoff between communication costs and degree of parallelism.
MapReduce is a processing paradigm that works on top of distributed environments. More precisely, it was built on top of Google File System (GFS) and Hadoop Distributed File System (HDFS), used to manage large-scale data and to be tolerant to hardware and network faults. To do this, HDFS splits files into large blocks and distributes them across nodes in a cluster, and MapReduce is the programming model used to manage many large-scale parallel computations.
Basically, the idea is that the data is first split, then some operation is applied to it, and then the results are merged to produce the final output. For this, we just need to define the Map and Reduce functions, while the system manages the parallel execution on distributed data and the coordination between tasks, dealing with the possibility that one of the tasks may fail.
Consider a text file split into partitions A, B, C, D, across different nodes. We want to count how many times each word appears in the whole document. For this, we can use MapReduce as follows:
Map: for each word, w, in each partition, generate the pair (w, 1).
Shuffle/sort: collects and groups the pairs by key (word), in order to guarantee that the same key will be processed by the same reduce task. Shuffling is the process of redistributing data from Map nodes to Reduce nodes.
In our example, we would have, for each word w, the pair (w, [1, 1, ..., 1]), with as many 1s as appearances of w.
Reduce: for each input (w, [1, ..., 1]), output (w, Nw), where Nw is the number of 1s.
The Map task will typically process many words in one or more chunks. If a word, w, appears m times among all chunks assigned to that process, there will be m key-value pairs among its output.
To perform the grouping and distribution to the Reduce tasks, the master controller merges the pairs by key and produces a sequence of (key, list-of-values) pairs. Since it knows how many reduce tasks there will be, r, it will produce r lists, putting each list in one of r local files destined to one of the Reduce tasks. Each key is assigned as input to one, and only one, Reduce task.
The Reduce task executes one or more reducers, one per key. The outputs from all reducers are merged into a single final file.
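A tiny, single-machine simulation of the word-count pipeline (my own sketch; a real MapReduce job would run the three phases on different nodes):

from itertools import groupby

def map_wordcount(partition):
    """Map: emit a (word, 1) pair for every word in the partition."""
    return [(word, 1) for word in partition.split()]

def shuffle(pairs):
    """Shuffle/sort: group the emitted pairs by key."""
    pairs = sorted(pairs)
    return [(k, [v for _, v in group]) for k, group in groupby(pairs, key=lambda kv: kv[0])]

def reduce_wordcount(key, values):
    """Reduce: sum the 1s emitted for each word."""
    return (key, sum(values))

partitions = ["the cat sat", "the dog sat", "the cat ran"]
mapped = [pair for p in partitions for pair in map_wordcount(p)]
print([reduce_wordcount(k, vs) for k, vs in shuffle(mapped)])
# [('cat', 2), ('dog', 1), ('ran', 1), ('sat', 2), ('the', 3)]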
In general, a map function can be defined as a function, mf : E1ⁿ → E2ⁿ, where Ei is the domain of the input (1) or output (2), together with a function f : E1 → E2 that is applied to each coordinate. That is: mf(x1, ..., xn) = (f(x1), ..., f(xn)).
For example:
In the MapReduce scheme, map is more restrictive, as the function f must produce a key-value pair. That is, for all i = 1, ..., n, it is f(xi) = (ki, vi).
For example, in the word counter example, f(w) = (w, 1).
The shuffle function consists in grouping the outputs of the map function by key, so all pairs with the same key k are merged into a single pair (k, [v1, v2, ...]).
Following the previous example, the shuffle produces (w, [1, 1, ..., 1]) for each word w.
Generally, a reduce function applies to a vector/row and outputs a single value, applying the aggregation function f: rf(x1, ..., xn) = f(x1, ..., xn).
In MapReduce, reduce applies to each output of the shuffle function with the same key: r(k, [v1, ..., vl]) = (k, f(v1, ..., vl)).
Following the previous example, r(w, [1, ..., 1]) = (w, Nw).
A MapReduce pipeline can be a composition of different rfr ∘ s ∘ mfm.
The process is illustrated below:
Whenever we launch the execution of a MapReduce pipeline, the following happens:
The user program forks a master controller process and some number of worker processes at different computer nodes.
The master creates some number of map tasks and some number of reduce tasks. It assigns the tasks to worker processes by taking into account the co-location.
A worker handles either map tasks (a map worker) or reduce tasks (a reduce worker), but not both.
A worker process reports to the master when it finishes a task, and a new task is scheduled by the master for that worker process.
The master keeps track of the status of each map and reduce task (idle, executing, or completed).
If the master node fails, the entire MapReduce job must be restarted.
If a worker node fails, it would be detected and managed by the master, since it periodically pings the worker processes. All the map tasks assigned to this worker have to be redone in this case.
This paradigm is not a solution to every problem, and in fact it only makes sense when files are very large, and rarely outdated. Its original purpose was to execute very large matrix-vector multiplications.
Let M be an n × n square matrix and V a vector of size n. Their product, X = M·V,
is defined by xi = ∑_j mij·vj.
We can store M and V in a file in HDFS as triples (i, j, mij) for M and pairs (j, vj) for V. Now we can perform the computation with MapReduce as:
Map: for each (i, j, mij) and (j, vj), it returns the key-value pair (i, mij·vj).
Reduce: simply sums all the values for each key i, producing the pair (i, xi).
For this to work, all the pairs from V must be available to all chunks (V cannot be stored in a distributed way).
More concretely, we can define the functions:
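For instance, a minimal sketch (mine, with hypothetical helper names), assuming the whole vector fits in memory on every map task:

def map_mv(m_triples, v_pairs):
    """Map: join each matrix entry (i, j, m_ij) with v_j and emit (i, m_ij * v_j)."""
    v = dict(v_pairs)                       # the whole vector is available to the map task
    return [(i, m_ij * v[j]) for (i, j, m_ij) in m_triples]

def reduce_mv(i, values):
    """Reduce: sum the partial products of row i, producing (i, x_i)."""
    return (i, sum(values))

# tiny example: M = [[1, 2], [3, 4]], V = (10, 20)  ->  X = (50, 110)
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
V = [(0, 10), (1, 20)]
grouped = {}
for i, val in map_mv(M, V):                 # what the shuffle step would do
    grouped.setdefault(i, []).append(val)
print([reduce_mv(i, vals) for i, vals in sorted(grouped.items())])   # [(0, 50), (1, 110)]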
Now, if n is large, V might not fit in main memory of a worker node, and a large number of disk accesses may be required. We can improve the approach by distributing V and refining the algorithm as follows:
We divide the matrix into vertical stripes of equal width, and the vector into stripes of the same size:
Here, the size of Mk is n × nk and the size of Vk is nk, so that the product Mk·Vk can be performed, outputting a vector of size n.
Each map task is assigned a chunk from one of the matrix stripes and gets the entire corresponding stripe of the vector.
The final result would be X = ∑_k Mk·Vk,
where we apply the previously explained algorithm to each sub-multiplication step.
This approach can be extended to matrix multiplication. Now, let M be a matrix of size n1 × n2 and N a matrix of size n2 × n3; the product P = MN is a matrix of size n1 × n3, where pik = ∑_j mij·njk.
The matrices are stored as triples (i, j, mij) and (j, k, njk).
Map 1: transform (i, j, mij) into (j, (M, i, mij)) and (j, k, njk) into (j, (N, k, njk)).
Reduce 1: for each key, j, and for every pair of an M-entry and an N-entry with that key, produce the key-value pair ((i, k), mij·njk).
Map 2: the identity.
Reduce 2: for each key, (i, k), produce the sum of the list of values associated to this key, pik.
In addition, M could be divided into K vertical stripes of size n1 × nk and N into K horizontal stripes of size nk × n3, where ∑_k nk = n2. In this setup, we can apply the algorithm to compute each Mk·Nk and then sum them all.
The functions can be defined more precisely as:
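A compact single-machine sketch of the two MapReduce passes (mine; the tags 'M' and 'N' and the tuple layouts are illustrative choices):

from collections import defaultdict

def map1(entries):
    """First Map: key every entry of M and N by the shared index j."""
    out = []
    for tag, a, b, value in entries:            # ('M', i, j, m_ij) or ('N', j, k, n_jk)
        if tag == 'M':
            out.append((b, ('M', a, value)))    # key j, keep the row index i
        else:
            out.append((a, ('N', b, value)))    # key j, keep the column index k
    return out

def reduce1(j, values):
    """First Reduce: pair every M-entry with every N-entry sharing the key j."""
    ms = [(i, v) for tag, i, v in values if tag == 'M']
    ns = [(k, v) for tag, k, v in values if tag == 'N']
    return [((i, k), mv * nv) for i, mv in ms for k, nv in ns]

def reduce2(grouped):
    """Second Reduce: sum the partial products for each key (i, k)."""
    return {key: sum(vals) for key, vals in grouped.items()}

# tiny example: M = [[1, 2]], N = [[3], [4]]  ->  P = [[11]]
entries = [('M', 0, 0, 1), ('M', 0, 1, 2), ('N', 0, 0, 3), ('N', 1, 0, 4)]
by_j = defaultdict(list)
for j, val in map1(entries):
    by_j[j].append(val)
pairs = [p for j, vals in by_j.items() for p in reduce1(j, vals)]
grouped = defaultdict(list)
for key, v in pairs:
    grouped[key].append(v)
print(reduce2(grouped))                         # {(0, 0): 11}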
Let R be a relation stored as a file in HDFS. The elements of this file are the tuples of R. The selection operator, σC can be defined using MapReduce as:
Map: for each tuple t in R, test if t satisfies C. If it does, produce the key-value pair (t, t).
Reduce: the identity.
For the projection, πA, we can do:
Map: for each tuple t in R, construct a tuple t′ by removing the attributes that are not in A. Output (t′, t′).
Reduce: for each key, t′, produced by the map tasks, there will be one or more key-value pairs (t′, t′). The reduce function turns (t′, [t′, ..., t′]) into (t′, t′), so it produces exactly one pair per distinct projected tuple.
The join R ⋈_B S, with A, B, C sets of attributes satisfying B ⊂ A and B ⊂ C, can be implemented with MapReduce as:
Map: for each tuple (a, b) ∈ R, produce the key-value pair (b, (R, a)). For each tuple (b, c) ∈ S, produce the key-value pair (b, (S, c)).
Reduce: for each key, b, output as many pairs as needed, (b, (a, c)), one for each combination of an a coming from R and a c coming from S.
The aggregation operator, γ_{A,θ(B)}, where A ∪ B is the set of attributes of R, and A ∩ B = ∅, can be defined with MapReduce as:
Map: for each tuple, t, produce (a, b), where a is the A part of t, and b is the B part.
Reduce: each key represents a group, so we apply θ to the list of values associated to each key a. We output (a, x), where x = θ(b1, b2, ...).
Locality: input data is stored on local disks of machines in the cluster. Each file is divided into blocks of 64MB, each of which is stored several times, as replicas, on different machines. MapReduce master node takes the location information of the input files into account, and attempts to schedule a map task on a machine that contains the needed replica. If this fails, it tries to schedule a map task in a machine that is near to one that has a replica.
Granularity: the map and reduce steps are divided into M and R pieces. M and R should be much larger than the number of workers. Each worker can perform different tasks, improving dynamic load balancing and speeding up recovery when a worker fails. Some practical bounds on how large these values should be say that the master must take M + R scheduling decisions and keep M × R states in memory.
Refinements: partitioning input data using different functions according to the problem to be solved.
Ordering guarantees: the intermediate key-value pairs are generally processed in increasing key order, to make it easy to generate a sorted output file per partition. However, this is not guaranteed.
Nowadays, data is growing faster than processing speeds, and so the only possible solution is to parallelize on large clusters.
Apache Spark is an open source implementation of a framework for large-scale data processing, providing an interface for programming clusters with implicit data parallelism and fault tolerance. It extends a programming language with read-only data structure distributed over a cluster of machines, the Resilient Distributed Datasets (RDDs), maintained in a fault-tolerant way. RDDs were developed in 2012 in response to limitations in Hadoop’s MapReduce, which forces a particular linear dataflow as a sequence of HDFS reads and writes.
Spark is up to 100 times faster than traditional Hadoop thanks to its in-memory data processing:
Spark was originally written in Scala, a high level language for JVM. There are APIs for Java, Scala, Python, R,...
The Dataframe API was released as an abstraction on top of the RDD, as well as packages like MLlib or GraphX, that can be used for machine learning and graph analytics. These APIs facilitate the implementation of both iterative algorithms and interactive or exploratory data analysis.
Spark requires a cluster manager and a distributed storage system:
Cluster management: Spark supports standalone, native Spark clusters, where we can launch a cluster either manually or using scripts, or we can use Hadoop YARN, Apache Mesos, or Kubernetes.
Distributed storage: Spark can interface with a wide variety of distributed databases, like Alluxio, HDFS, MapR-FS, Cassandra,...
Spark also supports different sources of data, in different formats and from different databases. All these relationships are shown below:
Data is split into partitions or blocks, and the driver assigns tasks to each worker, which reads an HDFS block and has a cache. Each worker processes and caches data, if necessary, sending the results to the driver when done.
RDDs are immutable and distributed collections of objects, spread across a cluster, and stored in RAM or disk (when they are persistent). They are statically typed, i.e., RDD[T] has objects of type T. The types can be of any type of Python, Java, or Scala objects, including user-defined classes.
RDDs are built via parallel transformations and computed via parallel actions on distributed datasets, executed lazily. For instance, RDDs are split into multiple partitions, which may be computed on different nodes of a cluster.
Transformation: operation on an RDD that returns a new RDD. They are computed lazily, only after an action is called.
Action: operation on an RDD that returns a final result which is not another RDD. When an action is called, all the transformations prior to it are executed in the same order they were defined. Each time a new action is called, the entire RDD must be computed from scratch, but the user can decide to persist intermediate results (caching).
Inside Apache Spark, the workflow is managed as a directed acyclic graph (DAG). Nodes represent RDDs while edges represent the operations executed on the RDDs. Spark keeps track of the set of dependencies between different RDDs. This is called the lineage graph.
Example 6.1. Python Example: first line mentioning Python
Some notes:
Once a SparkContext, sc, is set, it is used to build RDDs. The driver program manages a number of workers, and different workers on different machines might count lines in different ranges of the file.
filter does not mutate the existing input RDD. Instead, it returns a pointer to a new RDD.
Many transformations work on one element at a time, but this is not true for all transformations, like union.
map is used for different purposes, from fetching a large collection to just squaring numbers. It transforms an RDD of length N into another RDD of length N, applying a function to each element in the RDD (see the example snippets after this list).
flatMap transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results, applying a function to each element.
reduce is the most common action on basic RDDs. It operates on two elements of the type of the RDD and returns a new element of the same type.
reduceByKey operates on RDDs of key-value pairs. It runs several parallel reduce operations, one for each key, where each operation combines values that have the same key. It returns a new RDD consisting of each key and the reduced value for that key.
groupByKey and sortByKey return the RDD with the values grouped or sorted by the keys.
distinct produces a new RDD with only distinct items.
union produces a new RDD consisting of the data from both sources; note that, unlike the mathematical union, it does not remove duplicates.
RDD.subtract(otherRDD) produces a new RDD consisting of all values present in RDD but not in otherRDD.
RDD.cartesian(otherRDD) returns all possible pairs (a, b) where a ∈ RDD and b ∈ otherRDD. Note that this operation is very costly.
RDD.innerJoin(otherRDD) returns only the keys that are present in both RDDs in the output RDD.
RDD.leftOuterJoin(otherRDD) and RDD.rightOuterJoin(otherRDD) join the two RDDs together by key, allowing one of them to miss the key (left or right).
collect is used to retrieve the entire RDD. It is useful if we have filtered the RDD down to a very small size and want to deal with it locally at the driver. The retrieved dataset must fit in the memory of a single machine.
take is used to retrieve a small number of elements of the RDD at the driver program and then iterate over them locally.
top is used to extract the top elements.
takeSample allows us to take a sample of the data, either with or without replacement.
foreach performs computations on each element in the RDD without bringing it back locally.
count returns a count of the elements.
countByValue returns a map of each unique value and its count.
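A few illustrative PySpark one-liners for the operations above (my own minimal examples; they assume an already created SparkContext sc):

nums = sc.parallelize([1, 2, 3, 4])
print(nums.map(lambda x: x * x).collect())              # [1, 4, 9, 16]

lines = sc.parallelize(["hello world", "hi"])
print(lines.flatMap(lambda l: l.split()).collect())     # ['hello', 'world', 'hi']

print(nums.reduce(lambda a, b: a + b))                  # 10

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())  # [('a', 4), ('b', 2)] (order may vary)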
Spark RDDs are lazily evaluated, and sometimes we use the same RDD multiple times. Naively, Spark will recompute the RDD and all of its dependencies each time we call an action on the RDD. To avoid this, we can ask Spark to persist data using persist.
Notice that calling persist does not force the evaluation of the RDD.
If we cache too much data, Spark will automatically delete old partitions. For the memory-only storage levels, it will recompute these partitions the next time they are accessed. This means that caching unnecessary data can lead to eviction and increased re-computation time.
Example 7.1. In the following picture, we observe the brute-force process to find all 4-cliques in this 7-node graph. For this, we need to check C(7,4) = 35 combinations.
Most versions of clique problems are hard, and a common problem is that of finding maximal cliques, i.e., cliques that cannot be extended by adding another vertex. For this problem, we have different algorithms, like:
Bron-Kerbosch algorithm, with worst-case complexity O(3^(n/3)).
Tarjan and Trojanowski algorithm.
Janez Konc algorithm.
There are also algorithms with better theoretical complexity, but Bron-Kerbosch and some variants that improve it are more efficient in practice. The basic form of the algorithm is as follows:
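A sketch of this basic form in Python (mine; the graph is assumed to be a dictionary mapping each vertex to the set of its neighbours):

def bron_kerbosch(R, P, X, graph, cliques):
    """Basic Bron-Kerbosch: R = current clique, P = candidates, X = already explored."""
    if not P and not X:
        cliques.append(set(R))                   # R cannot be extended: it is maximal
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & graph[v], X & graph[v], graph, cliques)
        P = P - {v}
        X = X | {v}

# usage example: a triangle {1, 2, 3} plus an isolated vertex 4
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set()}
cliques = []
bron_kerbosch(set(), set(graph), set(), graph, cliques)
print(cliques)                                   # [{1, 2, 3}, {4}] (order may vary)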
For example, the following tree represents the execution of the algorithm for the shown graph:
k-Cliques are interesting as a concept, but they are very restrictive, and so some relaxations have been proposed. A very well-known one is k-cores [1]. The basic idea is that networks may present a core-periphery structure, where in the core there are a lot of connected nodes, while in the periphery we find a more sparse structure, with peripheral nodes called whiskers.
The approach to find k-cores is k-core decomposition. The idea is that, given a graph G = (V,E), we delete recursively all vertices of degree less than k, together with their incident edges, to extract the k-core, such that:
Each vi in a k-core graph has di ≥ k.
A (k+1)-core is a subgraph of a k-core graph.
The algorithm goes as follows:
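A simple O(n²) Python sketch of the peeling idea (mine; the notes’ example below uses degree-indexed bucket lists D, which gives the linear-time version of the same algorithm):

def core_numbers(graph):
    """L[v] = core number of v, i.e. the largest k such that v belongs to the k-core."""
    d = {v: len(graph[v]) for v in graph}        # current degrees
    remaining = set(graph)
    L, k = {}, 0
    while remaining:
        v = min(remaining, key=lambda u: d[u])   # peel a vertex of minimum remaining degree
        k = max(k, d[v])                         # the core level never decreases
        L[v] = k
        remaining.remove(v)
        for w in graph[v]:
            if w in remaining:
                d[w] -= 1                        # removing v lowers its neighbours' degrees
    return L

# usage example: a triangle {1, 2, 3} with a pendant vertex 4
graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(core_numbers(graph))                       # {4: 1, 1: 2, 2: 2, 3: 2} (order may vary)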
Example 7.2. Let’s go through an example:
First, we initialize d and D:
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
d[v_i] | 4 | 6 | 5 | 7 | 4 | 6 | 5 | 5 | 2 | 2 | 2 | 3 | 4 | 4 | 4 | 3 | 3 | 5 | 3 | 2 | 3 | 1 | 1 | 1 | 1 |
d[v_i] | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
D[d[v_i]] | [4] | [2,6] | [3,7,8,18] | [1,5,13,14,15] | [12,16,17,19,21] | [9,10,11,20] | [22,23,24,25] |
Now, we start the procedure. For k = 1, the first node to be taken out would be 22, so:
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
d[v_i] | 4 | 6 | 5 | 7 | 4 | 6 | 5 | 5 | 2 | 2 | 2 | 3 | 4 | 4 | 4 | 3 | 3 | 5 | 3 | 2 | 3 2 | 1 | 1 | 1 | 1 |
d[v_i] | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
D[d[v_i]] | [4] | [2,6] | [3,7,8,18] | [1,5,13,14,15] | [12,16,17,19,21] | [9,10,11,20,21] | [22,23,24,25] |
v_i | 22 |
L[v_i] | 1 |
Now, with 23:
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
d[v_i] | 4 | 6 | 5 | 7 | 4 | 6 | 5 | 5 | 2 | 2 | 2 | 3 | 4 | 4 | 4 | 3 | 3 | 5 | 3 | 2 | 2 1 | 1 | 1 | 1 | 1 |
d[v_i] | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
D[d[v_i]] | [4] | [2,6] | [3,7,8,18] | [1,5,13,14,15] | [12,16,17,19] | [9,10,11,20,21] | [21,23,24,25] |
v_i | 22 | 23 |
L[v_i] | 1 | 1 |
Now, with 24:
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
d[v_i] | 4 | 6 | 5 | 7 | 4 | 6 | 5 | 5 | 2 | 2 | 2 | 3 | 4 | 4 3 | 4 | 3 | 3 | 5 | 3 | 2 | 1 | 1 | 1 | 1 | 1 |
d[v_i] | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
D[d[v_i]] | [4] | [2,6] | [3,7,8,18] | [1,5,13,14,15] | [12,16,17,19,14] | [9,10,11,20] | [21,24,25] |
v_i | 22 | 23 | 24 |
L[v_i] | 1 | 1 | 1 |
And so on...
A community in a network refers to the occurrence of clusters or groups of nodes in a network, that are more densely connected internally than with the nodes outside the community. There are mainly two cases:
Non-overlapping communities: divides into clusters of nodes with dense connections internally, and sparser connections between clusters. Each node is assigned to one community.
Overlapping communities: assumes that pairs of nodes are more likely to be connected if they are both members of the same communities, and less likely to be connected if they do not share any community.
There are several considerations or properties of communities that can be used to obtain them:
Mutuality of ties: every node in the group has ties to one another.
Compactness: closeness or reachability of group nodes.
Density of edges: high number of edges within a community.
Separation: higher frequency of ties among group members compared to non-members.
Usual metrics: graph density, internal and external density, graph cut, modularity score.
This problem is a combinatorial optimisation problem, and it is NP-hard if we want an exact solution. However, there are many approaches that leverage greedy techniques or approximate heuristics, which we are going to see now.
To find k-clique communities, we can use the Clique Percolation Method [2], which consists in:
Find all maximal cliques, sorting them based on degrees.
Create clique overlap matrix, where M[i,j] = 1 if cliques i and j overlap in at least k - 1 nodes.
Communities are the connected components that arise from this.
The most popular community detection algorithm is the Louvain algorithm, which is based on the concept of modularity:
In network science, the configuration model is a method for generating random networks from a given degree sequence. It is widely used as a reference model for social networks, allowing the modeler to incorporate arbitrary degree distributions.
The idea is to assign a degree dv to each node v. Each degree is a half-link, or stub, and the sum of stubs must be even to be able to build a graph, i.e., ∑ v∈V dv = 2m. Then, we choose two stubs uniformly at random and connect them by a link, then choose another pair from the remaining 2m - 2 stubs, and connect them. We repeat this process until there are no more stubs. The resulting network keeps the same degrees but randomly pairs up nodes.
A realization might include cycles, self-loops or multi-links. The uniform distribution of the matching must be kept, so these are not excluded. However, their expected number goes to zero for large networks, because the probability of v being connected to one of the stubs of w is dw/(2m − 1). Since node v has dv stubs, the probability of v being connected to w is dv·dw/(2m − 1), which is almost the same as dv·dw/2m for large m.
Now, modularity measures the relative density of links inside communities with respect to links outside communities. There are different methods to compute the modularity, but in the most common version, the randomization of the edges is done to preserve the degree of each vertex.
The basic idea is to compare the number of links within communities with the number expected on the basis of chance. The randomly generated network has fewer links between nodes of the same community, and more between nodes of different communities.
It can be computed as Q = (1/2m)·∑_{ij} [Aij − di·dj/2m]·δ(ci,cj),
or, summing over communities, Q = ∑_c [lc/m − (dc/2m)²], with lc the number of links inside community c and dc the sum of the degrees of its nodes,
where di is the degree of vertex i, and ci is the community of vertex i. δ is the Kronecker function, so δ(ci,cj) is 1 if ci = cj, and 0 otherwise.
The modularity satisfies −1 ≤ Q ≤ 1, and it is positive when the number of links within communities exceeds the expected number of such links in a randomly rewired network.
This is the modularity of a graph, but we can also compute the modularity of a community, c, as
Qc = Σin/2m − (Σin+out/2m)²,
where Σin is the sum of edge weights between nodes within the community c (each edge is considered twice), and Σin+out is the sum of all edge weights for nodes within the community, including edges incident from the other communities.
Note that if there is only one community, then Q = 0.
Now, finding the optimal value for modularity is impractical, because it needs to go through all possible assignments of the nodes into communities, and so the Louvain method is a heuristic method for greedy modularity maximisation in 2 phases:
Modularity is optimized by allowing only local changes to node communities memberships.
Initially, each node is assigned its own community.
The identified communities are aggregated into super-nodes, to build a new network.
This phase is repeated until there is only one super node, the whole graph.
This algorithm is widely used for large networks, because it is efficient and produces highly dense communities. It has a time complexity of O(n·log n), and can be used on weighted graphs. In addition, it provides hierarchical communities, allowing us to define the level of detail of the communities, and also to obtain subcommunities within communities. A sample visualization is the following:
In more precise terms, phase 1 goes as follows: what is ΔQ if we move a node v from community CJ to community CI?
But how can we derive ΔQ_{CI∪{v}}? The key here is that the gain in modularity does not depend on the original community of v. It can easily be computed by moving an isolated node v into CI:
ΔQ_{CI∪{v}} = [(Σin + 2·dv,in)/2m − ((Σin+out + dv)/2m)²] − [Σin/2m − (Σin+out/2m)² − (dv/2m)²],
where dv,in is the sum of the weights of the edges between v and the nodes of CI. On the other side, expanding the squares, most of the terms cancel,
so ΔQ_{CI∪{v}} = dv,in/m − Σin+out·dv/(2m²).
Therefore, the gain only depends on dv,in, on dv and on the total degree Σin+out of CI.
The complete algorithm is:
For each node v:
Compute modularity gain from removing v from its community and placing it in the community of its neighbours.
Place v in the community that maximizes ΔQ.
One iteration is achieved for all the nodes in a sequential manner.
Repeat the procedure sequentially to all nodes until no more improvement (local maximum of modularity).
The output of this phase depends on the order in which nodes are considered. Research shows that this does not significantly affect the overall modularity.
Nodes from communities are grouped into super nodes.
Links between nodes of the same community c are represented by a self-loop on the super-node of c, weighted by adding up the links between these nodes: c has a loop edge with weight wcc = ∑_{i,j∈c} Aij.
Links between two communities c and c′ are weighted by adding up the links between their nodes: each pair (c, c′) has a link with weight wcc′ = ∑_{i∈c, j∈c′} Aij.
This other approach is based on random walks. The idea is to consider a random walk on a graph in which, at each time step, we move to a neighbour chosen uniformly at random: Pij = Aij/di,
and P^t_ij represents the probability of getting from i to j in t steps.
We would expect two nodes i and j that lie in the same community to have a high P^t_ij, as well as similar profiles P^t_ik ≈ P^t_jk for the other nodes k. Therefore, we can define a distance (or similarity) between nodes, rij, built from the differences between these probability profiles.
In addition, we can approximate the computation by P^t_ik ≈ Nik/Ni,
where Nik is the number of sampled walks starting at i and passing through k, and Ni is the number of walks starting from i.
The approach goes as follows:
Assign each node to its own community.
Compute the distance between adjacent nodes, rij.
Choose the two closest communities and merge them.
Update the distance between communities, as
where
Finish when there is only one community.
The edge betweenness of e is the fraction of shortest paths between all pairs of nodes s and t that go through edge e: eb(e) = ∑_{s≠t} σs,t(e)/σs,t.
In this case, we focus on edges that connect communities, and we uncover the communities by progressively removing the edges with the highest betweenness value. It goes as follows:
For all e ∈ E
Compute eb(e).
Remove the edge e with the largest eb(e).
Repeat until all edges are gone, or stop when the graph splits into 2 components. The output is a dendrogram.
B. Ryan and N. Gross published The Diffusion of Hybrid Seed Corn in Two Iowa Communities [3] in 1943. In this paper, they studied the information effect and the adoption of this technology between 1924 and 1940. They found that:
The diffusion pattern was made up of three periods:
Long period of slow initial growth
Rapid rise in adoption
Brief decline as the most resistant adopters accepted the new technology
They classified the different subjects as innovators, early adopters, early majority, late majority, and laggards.
This information is shown here:
This idea has been studied multiple times in different areas:
Studying and modelling the spread of beliefs, ideas, information (rumours, news) or viruses. This represents an active research area in fields like economics, epidemiology, social science,...
Influence models have been studied for years:
Original mathematical models: Schelling and Granovetter
Viral marketing strategies modelled by Domingos and Richardson
Network coordination games
Studying diseases or contagions. The most commonly used epidemic models are the SIR (Susceptible-Infected-Recovered) and the SIS (Susceptible-Infected-Susceptible).
Influence of the social environment on health: behaviours such as eating, practicing physical activities, drug use and seeking medical follow-up.
General models for infectious diseases spreading from human to human have been developed since the Spanish flu in 1918. Relevant applications include AIDS and, more recently, COVID-19.
The epidemiological models are based on two concepts:
There are compartments into which the individuals are divided.
Rules specify the rate of transition between compartments, like the force of infection.
There are variants with or without the dynamics of birth and death, immunity periods, etc.
SIR is the simplest predictive model; recovery confers lasting immunity, and death is considered negligible. The three compartments are:
S: number of susceptible individuals. When a susceptible and an infectious individual come into contact, the susceptible individual contracts the disease and transitions to the infectious compartment.
I: number of infectious individuals.
R: number of removed and immune individuals. Also called recovered or resistant.
Now, S, I and R are functions of time, defined to study the dynamics over a short infectious period. Let's define the rest of the variables:
N is the total population.
β is the average number of contacts per person per unit time.
Therefore, β∕N is the transmission parameter.
The transition rate between I and R, γ, represents the rate at which individuals recover from the disease.
βSI∕N is the incidence, that is, the number of new infections per unit time.
We impose that the population remains constant: S + I + R = N.
The SIR system can be expressed using the following system of differential equations:
dS∕dt = −βSI∕N,  dI∕dt = βSI∕N − γI,  dR∕dt = γI.
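As a numerical illustration, the system can be integrated with an off-the-shelf solver. This is a minimal sketch assuming numpy and scipy are available; the parameter values and initial conditions are arbitrary:

# Minimal sketch: numerical integration of the SIR system
#   dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
# with arbitrary parameter values and initial conditions.
import numpy as np
from scipy.integrate import solve_ivp

N, beta, gamma = 1000.0, 0.3, 0.1

def sir(t, y):
    S, I, R = y
    incidence = beta * S * I / N           # new infections per unit time
    return [-incidence, incidence - gamma * I, gamma * I]

sol = solve_ivp(sir, (0, 160), [N - 1, 1, 0], dense_output=True)
S, I, R = sol.sol(np.linspace(0, 160, 5))  # sample a few time points
print(np.round(I, 1))                      # infectious individuals over time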
The Gillespie algorithm is used to simulate chemical or biochemical systems. It is a stochastic simulation algorithm that generates a statistically correct trajectory of a stochastic equation system whose reaction rates are known.
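A minimal sketch of a Gillespie-style simulation of the SIR dynamics is shown below; the event rates follow the standard SIR form above, and the parameter values are illustrative assumptions:

# Minimal sketch: Gillespie-style stochastic simulation of SIR dynamics.
# Event rates follow the standard SIR form; parameter values are arbitrary.
import random

def gillespie_sir(N=1000, I0=1, beta=0.3, gamma=0.1, t_max=200.0):
    S, I, R, t = N - I0, I0, 0, 0.0
    history = [(t, S, I, R)]
    while I > 0 and t < t_max:
        infection_rate = beta * S * I / N      # rate of S -> I events
        recovery_rate = gamma * I              # rate of I -> R events
        total = infection_rate + recovery_rate
        t += random.expovariate(total)         # waiting time to the next event
        if random.random() < infection_rate / total:
            S, I = S - 1, I + 1                # an infection occurs
        else:
            I, R = I - 1, R + 1                # a recovery occurs
        history.append((t, S, I, R))
    return history

print(gillespie_sir()[-1])  # final state of one random trajectory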
The SIR model has many variants, that have been developed over the years:
Models without immunity after recovery (SIS). This tries to model illnesses like the common cold, which does not confer any long-lasting immunity.
Immunity lasts only for a short period of time (SIRS).
There exists a latent period of the disease (Exposed), where the person is not yet infectious (SEIS or SEIR).
Infants can be born with maternally derived immunity (MSIR).
Model differentiates between Recovered and Deceased (SIRD).
Vaccinated susceptible population (SIRV).
Diffusion models are used to identify the way information is transmitted in a network. The main questions are:
How to model the information diffusion process in a social network?
How to identify the influencers (nodes which transmit more information)? Which kind of graph-based measures can be used?
How to maximise the influence?
How to minimise or stop the influence (which links to remove)?
From the computer science perspective, we need to develop fast and efficient algorithms on large networks.
These can be probabilistic models. The probability that someone becomes activated, given n activated neighbours, is 1 − (1 − p)^n,
where p is the activation probability of a neighbour and n is the number of activated neighbours. The intuition here is that the more neighbours are activated, the higher the probability that you will become activated.
But they can also be threshold models. In this case, nothing happens until the threshold is reached:
In both cases, the main idea is to define a diffusion process on the network, originating from a seed set S. The expected number of activated nodes at the end of the process is the influence of S, σ(S).
The network is represented as a directed graph, G = (V,E), where individual nodes are either active or inactive. The process is thus defined as:
Start with initial set of active seed nodes, S.
Run t steps and finish when there are no more possible activations.
In the cascade model, when node u becomes active, it is given a single chance to activate each currently inactive neighbour v, with a probability of success of pu,v.
This probability is independent of the history. If u succeeds, then v will be active in step t + 1, but whether or not u succeeds, it cannot make any further attempts to activate v in subsequent rounds.
If v has multiple newly activated neighbours, their attempts are sequenced in an arbitrary order.
For example, see the following diffusion process:
Let now Dt be the set of active nodes at step t. For v ∈ N, its probability of being active at t + 1 is
The sets It and St follow the following iterative definition:
where It+1 is obtained from St and It as explained before.
Then, the set of all infected nodes throughout the contagion process originating at S is
and the expectation is taken over the random infection attempts from the infected nodes, as
However, there is still something that we need to sort out: how to define pu,v.
A commonly used approach is to assign to each edge the probability pu,v = 1∕dv-, i.e., the inverse of the in-degree of v (note that we consider directed edges).
Some studies propose to learn influence probabilities from data, for example studying the propagation actions in social networks, such as replies, forwards, etc.
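A minimal Python sketch of one run of the Independent Cascade process is shown below, assuming networkx is available; it uses the 1∕dv- edge probabilities mentioned above, and the random example graph and seed set are arbitrary:

# Minimal sketch: one run of the Independent Cascade model on a directed
# graph, with edge probabilities p_uv = 1/indegree(v) as mentioned above.
import random
import networkx as nx

def independent_cascade(G, seeds, p):
    active = set(seeds)
    frontier = set(seeds)              # nodes activated in the last step
    while frontier:
        new = set()
        for u in frontier:
            for v in G.successors(u):  # u gets one chance per inactive neighbour
                if v not in active and random.random() < p(u, v):
                    new.add(v)
        active |= new
        frontier = new
    return active

G = nx.gnp_random_graph(50, 0.08, directed=True, seed=1)
p = lambda u, v: 1.0 / max(G.in_degree(v), 1)
print(len(independent_cascade(G, {0, 1}, p)))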
In this case, we consider, for each node u, the proportion p of active neighbours and the proportion 1 − p of inactive ones. To become activated, node u needs to verify ρ·p ≥ γ·(1 − p),
so the threshold to accept is p ≥ γ∕(ρ + γ),
where ρ is the reward (weight) of active nodes and γ is the reward of inactive nodes. The model applies this rule in cascade until there are no more possible activations.
For example, if ρ = 3 and γ = 2, the threshold is 2∕5 = 0.4. It would go as follows:
The Linear Threshold Model. In this case, each node v is influenced by each incoming active neighbour u according to a weight wu,v, such that the total incoming weight satisfies ∑u wu,v ≤ 1.
Each v has a random acceptance threshold θv ~ U[0,1], which represents the fraction of v's neighbours that must become active in order for v to become active.
Given the random thresholds and an initial set of active nodes, S0, with all the other nodes inactive, the diffusion process unfolds in discrete steps: in step t, all nodes that were active in step t − 1 remain active, and we activate any node v such that the total weight of its active neighbours is at least θv.
A possible way to define wu,v is by taking into account the in-degree of v, as wu,v = 1∕dv-.
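Similarly, a minimal sketch of the Linear Threshold process, with uniform random thresholds and weights wu,v = 1∕dv- as above (the example graph and seed set are arbitrary assumptions), could be:

# Minimal sketch: the Linear Threshold model with thresholds theta_v ~ U(0,1)
# and weights w_uv = 1/indegree(v); the example graph and seeds are arbitrary.
import random
import networkx as nx

def linear_threshold(G, seeds):
    theta = {v: random.random() for v in G}        # random acceptance thresholds
    w = lambda u, v: 1.0 / G.in_degree(v)          # weight of edge u -> v
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in G:
            if v in active:
                continue
            total = sum(w(u, v) for u in G.predecessors(v) if u in active)
            if total >= theta[v]:                  # enough active in-neighbours
                active.add(v)
                changed = True
    return active

G = nx.gnp_random_graph(50, 0.08, directed=True, seed=2)
print(len(linear_threshold(G, {0, 1, 2})))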
Homophily can often serve as a barrier to diffusion, by making it hard for innovations to arrive from outside communities. This means that we can use the structure of a network to analyze the success or failure of a cascade.
A cluster of density δ is a set of nodes such that each node in the set has at least a δ fraction of its network neighbours in the set. For the cascade process to enter the cluster, the threshold should be smaller than 1 - δ.
Given G = (V,E), let σ : 2V → ℝ be the function that maps a set of nodes S ⊆ V to its influence value σ(S), i.e., the expected number of activated nodes in a diffusion process starting from S.
The influence maximization problem asks:
For a given budget, k, find a set of k nodes, S, with objective max|S|=k σ(S).
Solving this problem is NP-hard, so we need to leverage approximate methods.
The approximation by greedy algorithm consists in iteratively adding to S the node that maximizes the marginal gain at each step: S ← S ∪ {argmaxv∉S (σ(S ∪ {v}) − σ(S))}.
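A minimal sketch of this greedy procedure is given below; σ(S) is estimated by Monte Carlo simulation of the Independent Cascade model (the cascade function, the number of runs, and the example graph are illustrative assumptions, and this brute-force version ignores speed-ups such as lazy evaluation):

# Minimal sketch: greedy influence maximisation, estimating sigma(S) by
# Monte Carlo runs of the Independent Cascade model (illustrative settings).
import random
import networkx as nx

def cascade_size(G, seeds, p):
    active, frontier = set(seeds), set(seeds)
    while frontier:
        new = set()
        for u in frontier:
            for v in G.successors(u):
                if v not in active and random.random() < p(u, v):
                    new.add(v)
        active |= new
        frontier = new
    return len(active)

def sigma(G, S, p, runs=200):
    # Monte Carlo estimate of the expected number of activated nodes
    return sum(cascade_size(G, S, p) for _ in range(runs)) / runs

def greedy_influence_max(G, k, p):
    S = set()
    for _ in range(k):
        base = sigma(G, S, p)
        gains = {v: sigma(G, S | {v}, p) - base for v in G if v not in S}
        S.add(max(gains, key=gains.get))           # largest marginal gain
    return S

G = nx.gnp_random_graph(50, 0.08, directed=True, seed=3)
p = lambda u, v: 1.0 / max(G.in_degree(v), 1)
print(greedy_influence_max(G, k=3, p=p))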
A set function f is submodular if, for any sets R ⊂ T and for all v ∉ T, it holds that f(R ∪ {v}) − f(R) ≥ f(T ∪ {v}) − f(T).
Such functions capture diminishing returns: the marginal gain from adding an element to a set R is at least as high as the marginal gain from adding the same element to a superset T of R.
In addition, the theorem below requires f to be monotone: f(R) ≤ f(T) whenever R ⊆ T.
Theorem 8.1. For a non-negative, monotone, submodular function f, let S be a set of size k obtained by selecting elements one at a time, each time choosing the element that provides the largest marginal increase in the function value.
Let S* be a set that maximizes the value of f over all k-element sets.
Then f(S) ≥ (1 − 1∕e)·f(S*).
In other words, S provides a (1 − 1∕e) = limk→∞ (1 − (1 − 1∕k)^k) ≈ 0.63 approximation of the optimal value.
It can be proven that σ is a submodular function, and therefore, when we use the greedy approach to obtain a set S, it must hold that σ(S) ≥ (1 − 1∕e)·σ(S*).
[1] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, 2008.
[2] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
[3] Bryce Ryan and Neal C. Gross. The diffusion of hybrid seed corn in two Iowa communities. Rural Sociology, 8(1):15, 1943.
[4] Nacéra Seghouani. Massive graph management and analytics. Lecture Notes.