INFOH423 - Data Mining

Fall 2022
Jose Antonio Lorencio Abril
image: 0_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Da___x-Universit___libre_de_Bruxelles__logo__svg.png
Professor: Mahmoud Sakr
Student e-mail: jose.lorencio.abril@ulb.be

This is a summary of the course Data Mining, taught at the Université Libre de Bruxelles by Professor Mahmoud Sakr in the academic year 22/23. Most of the content of this document is adapted from the course notes by Sakr, [3], and the basic bibliographic source of the course, the book of Aggarwal, [1], so I won't be citing it all the time. Other references will be provided when used.

Table of Contents

Part I Introduction
Part II Classification
Part III Model validation and data preparation
Part IV Clustering
Part V Frequent pattern and association rule mining
Part VI Stream Data Mining
Part VII Outlier mining

List of Figures

Figure 1: Holdout visualization.
Figure 2: Cross-Validation visualization. m=3.
Figure 3: A global outlier.
Figure 4: A context outlier.
Figure 5: Collective outliers.
Figure 6: Basic diagram of a one-class model.

List of Tables

Table 1: Records and multidimensional data set example
Table 2: The weather data. Source: [4].

List of Algorithms

Algorithm 1: GenericDecisionTree(Dataset: D)
Algorithm 2: ID3(I, O, T) : Decision Tree
Algorithm 3: DP_Edit(s1, s2)
Algorithm 4: Generic k-representative approach (Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Algorithm 5: k-medoids(Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Algorithm 6: GenericGrid(Data D, Ranges p, Density tau) : clusters C
Algorithm 7: DBSCAN(Data D, Radius eps, Density tau) : clusters C
Algorithm 8: Apriori(Transactions T, Minimum Support minsup)
Algorithm 9: Apriori_improved(Transactions T, Minimum support minsup)
Algorithm 10: FP-Growth(FP-Tree FPT, Minimum Support minsup, Current Suffix P)
Algorithm 11: BloomConstruct(Stream S, Size m, Number of hash w)
Algorithm 12: CountMinConstruct(Stream S, Width w, Height m)

Part I Introduction

1 What is data mining?

Although there is no formal definition of data mining accepted by everyone, most definitions agree that data mining is a field of study focused on the collection, cleaning, processing and analysis of data, and on gaining useful insights from the data we have access to.
The data mining domain is as wide as these topics, and it is nowadays a really hot topic both in academia and industry.
In academia, there exist many goals to achieve in this field:
In industry, the focus is mainly on two areas (disregarding research companies, whose objectives are similar to those of academia):
These uses aim at improving the business processes and, ultimately, maximizing profit.

1.1 Why is data mining important today, if it was not yesterday?

Because computing power has increased enormously: we can now run algorithms that train models in a reasonable time, a task that was practically infeasible only a few years ago.
Not only that, but data technology is rapidly evolving, too. We produce more data and store more data, so... we have more data! This data is potential knowledge, and we know many tools to extract this knowledge from it.
In fact, the amount of data is so vast that we have to develop techniques not only to analyze it, but also to manage such huge amounts of data.

2 The Data Mining Process

The data mining process is a pipeline constructed around the basic steps:
  1. Data collection: obtaining data from real-world sources.
  2. Feature extraction and data cleaning: among all the data retrieved from the real world, we have to select the characteristics or features that are relevant for our purposes (feature extraction) and decide what to do with erroneous/noisy/missing data (data cleaning). Also, as we might be collecting data from different sources, it is important to decide how to aggregate the data into a unified format for later processing.
  3. Analytical processing and algorithms: we now must develop suitable methods for analyzing our data. That is, we should decide which mathematical models to use to describe our data, and which algorithms to use in order to train the desired model. (We will see precisely what training means; for now, think of it as the way a general model is adapted to our particular data.)
  4. After these steps, we enter a phase of analyzing the obtained results to extract knowledge from the data. The process is also iterative: we can return to any previous step and try to improve different parts of the pipeline. (Since data is usually constantly updated, at the very least correctness checks and updates will be necessary.)

3 Data Types

Not all data have the same nature or characteristics, so it is important to understand the differences between them and which techniques are applicable to which types of data.
We can characterize data in different levels of detail, and the most basic classification would distinguish between nondependency-oriented data and dependency-oriented data:
Dependency-oriented data are normally more complex to study because of the need to study not only the data itself, but also the relationships between different data items.
We will now define different subtypes of data that we can find.

3.1 Nondependency-oriented data

The term nondependency-oriented data is interchangeable with the term multidimensional data:
Definition 3.1. We call source space, $\mathcal{S}$, the set of all possible values that our data can take. This set does not necessarily have any particular form.
If S is a product space, then each component is called a feature.
Example 3.1. If we are measuring the name, age, height and gender of the students of a school, then we will have $\mathcal{S} = T \times \mathbb{N}_{[0,150]} \times (0,3) \times \{M,F\}$, where $T$ represents the set of possible names and $\{M,F\}$ are the two possible values of the gender.
Definition 3.2. A record (data point, instance, tuple) is just a point $X = (x_i)_{i=1}^{d} \in \mathcal{S}$ that we measure and store in some form.
Example 3.2. Following the previous example, some records of S are represented in the following table:
index | Name  | Age | Height | Gender
1     | Josh  | 23  | 1.77   | M
2     | Mary  | 28  | 1.62   | F
3     | Larry | 12  | 1.58   | M
Table 1: Records and multidimensional data set example
Note that we added an index column, because it is a common practice.
Definition 3.3. A multidimensional data set, $\mathcal{D}$, is a set of $n$ records, $\{X_j\}_{j=1}^{n}$.
Example 3.3. Table 1 is also an example of a data set.
As we can see, Table 1 contains attributes of different types (which is obvious from the definition of the source space $\mathcal{S}$). Thus, we also have to take into account the type of each attribute of our data:

3.2 Dependency-oriented data

As outlined before, we can find implicit or explicit dependencies between instances:
As before, let us look a bit deeper into some types of this kind of data:

Part II Classification

A classification problem consists in learning the structure of a dataset of examples that is already partitioned into groups, referred to as classes. This learning is typically achieved with a model, which is used to estimate the class labels of unseen data examples with unknown labels. Thus, one of the inputs to the classification problem is the example dataset with known labels, $\mathcal{D}$, called the training data, while the unseen data points to be classified are the test data. The learnt model is referred to as the training model. The algorithm used to create the model is the learner.
The output of the classification algorithm can be of two types:

4 Decision Trees

Decision trees are a classification methodology that uses a tree structure to partition the feature space. Each node of the tree represents a decision made according to the data, called the split criterion, which is a condition on one or more feature variables of the training data.
The goal is to identify a split criterion such that the level of mixing of the class variables in each branch of the tree is reduced as much as possible.
The splits can be univariate, if they use a single attribute in the condition, or multivariate, if more than one attribute is used in the condition.
The nodes can be of two types:
The general algorithm for constructing a decision tree is as follows:
begin
	Create root node containing D;
	repeat
		Select an eligible node in tree;
		Split the selected node into two or more nodes based on the split criterion;   (*)
	until no more eligible nodes for split;
	Prune overfitting nodes from tree;   (*)
	Label each leaf node with its dominant class;
end
Algorithm 1: GenericDecisionTree(Dataset: D)
The lines marked (*) indicate what changes across the different algorithms to produce a specific decision tree.

4.1 Split criteria

The split criterion aims to maximize the separation of the different classes among the children nodes. Its design depends on the attributes of the data:
These methods require determining the best split among a set of candidate splits, so we need a way to measure whether one split is better than another.
To this end, we use the entropy.

4.1.1 Entropy

Definition 4.1. Let $p_j$ be the fraction of data points belonging to class $j$ among the data points taking the attribute value $v_i$. Then, the class-based entropy, $E(v_i)$, for the attribute value $v_i$ is $E(v_i) = -\sum_{j=1}^{k} p_j \log_2(p_j)$.
Remark 4.1. When $p_j = 0$, it is assumed that $p_j \log_2(p_j) = 0$.

Remark 4.2. $E(v_i) \in [0, \log_2 k]$.

Remark 4.3. Higher values of the entropy imply greater mixing of different classes, while a value of 0 implies perfect separation.
Definition 4.2. The overall entropy of an attribute, $E$, is defined as the weighted average over its $r$ different attribute values: $E = \sum_{i=1}^{r} \frac{n_i}{n} E(v_i)$, where $n_i$ is the frequency of attribute value $v_i$ and $n = \sum_{i=1}^{r} n_i$.
The entropy is used in the ID3 algorithm for constructing decision trees.
The overall entropy for an $r$-way split of a set $S$ into sets $S_1,\dots,S_r$ may be computed as the weighted average of the entropy values of the sets $S_k$, where the weight of $S_k$ is $\frac{|S_k|}{|S|}$. This is called the entropy-split: $\text{Entropy-Split}(S \Rightarrow S_1,\dots,S_r) = \sum_{k=1}^{r} \frac{|S_k|}{|S|} E(S_k)$.
In relation to this, the information gain is defined as the reduction of entropy due to the split: $IG(S \Rightarrow S_1,\dots,S_r) = E(S) - \text{Entropy-Split}(S \Rightarrow S_1,\dots,S_r)$. Note that lower values of the entropy-split and higher values of the information gain are more desirable.
Sometimes an attribute has a lot of distinct values, so splitting on it reduces the entropy a lot, yet it is not very useful for prediction (think, for example, of an ID attribute; see Subsubsection 4.2.1). To account for this, we can divide the overall information gain by the normalization factor $-\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2\left(\frac{|S_i|}{|S|}\right)$, which adjusts for the varying number of categorical values.
Example 4.1. Entropy of the dataset 'Weather data'
Consider the following dataset:
outlook  | temperature | humidity | windy | play
sunny    | hot         | high     | false | no
sunny    | hot         | high     | true  | no
overcast | hot         | high     | false | yes
rainy    | mild        | high     | false | yes
rainy    | cool        | normal   | false | yes
rainy    | cool        | normal   | true  | no
overcast | cool        | normal   | true  | yes
sunny    | mild        | high     | false | no
sunny    | cool        | normal   | false | yes
rainy    | mild        | normal   | false | yes
sunny    | mild        | normal   | true  | yes
overcast | mild        | high     | true  | yes
overcast | hot         | normal   | false | yes
rainy    | mild        | high     | true  | no
Table 2: The weather data. Source: [4].
If the class attribute is play, what is the entropy of this source?
$E(\text{play}) = -p_{yes}\log_2(p_{yes}) - p_{no}\log_2(p_{no}) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.94$. What is the Entropy-Split of the attribute humidity? $ES(S \Rightarrow S_{hum=high}, S_{hum=normal}) = \frac{|S_{hum=high}|}{|S|}E(play|hum=high) + \frac{|S_{hum=normal}|}{|S|}E(play|hum=normal) = \frac{7}{14}\left[-\frac{3}{7}\log_2\left(\frac{3}{7}\right) - \frac{4}{7}\log_2\left(\frac{4}{7}\right)\right] + \frac{7}{14}\left[-\frac{6}{7}\log_2\left(\frac{6}{7}\right) - \frac{1}{7}\log_2\left(\frac{1}{7}\right)\right] = 0.7885$. And the information gain? $IG(S \Rightarrow S_{hum=high}, S_{hum=normal}) = E(play) - ES(S \Rightarrow S_{hum=high}, S_{hum=normal}) = 0.94 - 0.79 = 0.15$.
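These computations are easy to script. Below is a minimal Python sketch that reproduces E(play), the Entropy-Split of humidity and the corresponding information gain on the weather data; the helper names are ours, not part of any library.
from collections import Counter
from math import log2

# rows of Table 2: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", "false", "no"), ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"), ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
CLASS = 4  # index of the class attribute 'play'

def entropy(rows):
    # class-based entropy of a set of records
    n = len(rows)
    counts = Counter(r[CLASS] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def entropy_split(rows, attr):
    # weighted average of the entropies of the subsets induced by attribute 'attr'
    n = len(rows)
    values = {r[attr] for r in rows}
    return sum(len(sub) / n * entropy(sub)
               for v in values
               for sub in [[r for r in rows if r[attr] == v]])

def information_gain(rows, attr):
    return entropy(rows) - entropy_split(rows, attr)

print(round(entropy(data), 4))              # 0.9403
print(round(entropy_split(data, 2), 4))     # humidity: 0.7885
print(round(information_gain(data, 2), 4))  # 0.1518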

4.2 ID3 Tree Induction Algorithm

ID3 is an algorithm to construct decision trees in which the split criterion is the maximization of the information gain and the stopping criterion is that all the records in a node belong to the same class. The algorithm is detailed in Algorithm 2.
Algorithm 2: ID3(I, O, T) : Decision Tree
# I is the set of input attributes
# O is the output attribute
# T is a set of training data

if (T is empty) then
	return a single node with value “Failure”

if (all records  in T have the same value for O) then
	return a single node with that value

if (I is empty) then
	return a single node with the most frequent value of O in T

# else
compute IG for each attribute in I using data in T

Let X = argmax{IG(attr) for attr in I}
Let {x_j for j=1,...,m} be the values of X
Let {T_j for j=1,...,m} be the subsets of T when partitioned according to the value of X

return a tree with:
	root node labelled X
	arcs labelled x_1,...,x_m
	connected to 
	ID3(I-{X}, O, T_1),...,ID3(I-{X}, O, T_m)
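As a complement, here is a compact Python sketch of this pseudocode. It reuses the data, CLASS and information_gain helpers from the snippet in Subsubsection 4.1.1, assumes categorical attributes, and returns either a class label (leaf) or a pair (split attribute index, branches); it is a didactic sketch, not a production implementation.
from collections import Counter

def id3(rows, attrs):
    # rows: list of tuples; attrs: indices of the input attributes still available
    if not rows:
        return "Failure"
    classes = [r[CLASS] for r in rows]
    if len(set(classes)) == 1:            # all records share the same class
        return classes[0]
    if not attrs:                         # no attributes left: most frequent class
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a))   # maximize IG
    branches = {}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best])
    return (best, branches)

print(id3(data, [0, 1, 2, 3]))   # the root splits on attribute 0 (outlook)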
Example 4.2. Compute the decision tree of the data in Table 2 using the ID3 algorithm.
Step 1: Root
We start by computing IG for each attribute. To simplify notation, let P=play,O=outlook,T=temperature,H=humidity and W=windy .
We already know that $E(P) = 0.94$ and $IG(S \Rightarrow S_{H=high}, S_{H=normal}) = 0.15$. Let's compute the rest of the values: $ES(S \Rightarrow S_{O=sunny}, S_{O=overcast}, S_{O=rainy}) = 2\cdot\left[-\frac{5}{14}\left(\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right)\right] + \frac{4}{14}\cdot 0 = 0.69$, where the factor 2 accounts for the sunny and rainy subsets (which have the same class distribution) and the 0 is the entropy of the pure overcast subset. This implies that $IG(S \Rightarrow S_{O=sunny}, S_{O=overcast}, S_{O=rainy}) = 0.25$.
Repeating this process with temperature, we get $IG(S \Rightarrow S_{T=hot}, S_{T=mild}, S_{T=cool}) = 0.03$, and with windy, we get $IG(S \Rightarrow S_{W=true}, S_{W=false}) = 0.05$.
This means that we label the root node with X=O and we create three arcs, each of them with one of the values from O . So we have the following Tree:
image: 1_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado1.png
Step 2: Outlook=sunny
Now, we are going to do the same thing, restricting ourselves to the records for which Outlook=sunny. Now, we have to recompute the entropy and the gain for each of the rest of the attributes.
Let's start with the entropy: $E(P|O=sunny) = -\left[\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right] = 0.97$. Now, the entropy-split of humidity: $ES(S_{O=sunny} \Rightarrow S_{O=sunny,H=high}, S_{O=sunny,H=normal}) = \frac{3}{5}E(P|O=sunny,H=high) + \frac{2}{5}E(P|O=sunny,H=normal) = -\frac{3}{5}\left[\frac{3}{3}\log_2\frac{3}{3} + 0\right] - \frac{2}{5}\left[\frac{2}{2}\log_2\frac{2}{2} + 0\right] = 0$, which means that $IG(S_{O=sunny} \Rightarrow S_{O=sunny,H=high}, S_{O=sunny,H=normal}) = 0.97$. As this cannot be improved, we can safely skip computing the rest of the values. This way, the tree will now look as follows:
image: 2_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado2.png
Step 3: Outlook=Sunny, Humidity=Normal
Note that we are proceeding depth-first, but doing this breadth-first is also possible.
This time, all the values of P are Yes, so we enter the second if of the algorithm and label the node as Yes.
Step 4: Outlook=Sunny, Humidity=High
Same, now P=No .
Step 5: Outlook=Overcast
Same, now P=Yes .
So, now we have the following tree:
image: 3_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado3.png
Step 6: Outlook=rainy
$E(P|O=rainy) = 0.97$ and $IG(S_{O=rainy} \Rightarrow S_{O=rainy,W=true}, S_{O=rainy,W=false}) = 0.97$, so, again, the gain is maximal and we split on windy:
image: 4_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado4.png
Step 7: Outlook=rainy, Windy=False
All records have P=Yes .
Step 8: Outlook=rainy, Windy=True
All records have P=No.
So, we have the tree
image: 5_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado5.png
And as there are no more nodes to analyze, this is the final decision tree.

4.2.1 The problem of UID

In general, attributes with very many values have very high gain, but can lead to useless decision trees. Quinlan suggests choosing the attribute with the highest $\text{GainRatio}(X,S) = \frac{\text{Gain}(S \Rightarrow S_1,\dots,S_r)}{-\sum_{i=1}^{r}\frac{|S_i|}{|S|}\log_2\left(\frac{|S_i|}{|S|}\right)}$, where $X$ is the candidate split attribute and the denominator is the entropy of the distribution of the values of $X$ over $S$ (the normalization factor introduced above).
The GainRatio favors attributes with higher gain, and punishes attributes with high entropy (many values).
Example 4.3. Repeat the decision tree ID3 algorithm, but use GainRatio instead.
The same tree is obtained.

5 Bayesian classification

Probabilistic classifiers construct a model that quantifies the relationships between the feature variables and the target variable as a probability. We are going to study a well-known kind of probabilistic classifier, namely the Bayesian classifier or Naive Bayes classifier.

5.1 Naive Bayes classifier

The Bayes classifier is based on the Bayes' theorem for conditional probabilities.
Theorem 5.1. Bayes' Theorem
If A and B are probabilistic events and $P(B) \neq 0$, then $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
Proof. On one side, $P(A|B) = \frac{P(A \cap B)}{P(B)}$.
On the other side, if $P(A) \neq 0$, then $P(B|A) = \frac{P(B \cap A)}{P(A)}$, so $P(B \cap A) = P(B|A)P(A)$. Substituting this value in the previous equation, we get $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$, which is the result we wanted to prove.
If $P(A) = 0$, then $P(A \cap B) = 0$ and thus $P(A|B) = 0$ for every $B$, so the formula also holds.
This theorem quantifies the conditional probability of a random variable (the class variable), given known observations about other variables (the features). Let $C$ be the class variable and $X$ an unseen feature tuple. The goal of the method is to estimate $P(C=c \mid X=(a_1,\dots,a_d))$. Let the random variables for the individual dimensions of $X$ be denoted by $X=(x_1,\dots,x_d)$, so we want to estimate $P(C=c \mid x_1=a_1,\dots,x_d=a_d) \overset{\text{Bayes}}{=} \frac{P(C=c)\,P(x_1=a_1,\dots,x_d=a_d \mid C=c)}{P(x_1=a_1,\dots,x_d=a_d)}$. We are interested in maximizing this value over $c$. Since the denominator is the same independently of the class $c$, we can focus on maximizing the numerator $P(C=c)\,P(x_1=a_1,\dots,x_d=a_d \mid C=c)$. The value $P(C=c)$ is the prior probability of the class identifier $c$ and can be estimated as the fraction of points in the data whose class is $c$. Thus, it remains to approximate the other factor. In the Naive Bayes approach, it is assumed that the feature values are independent of one another conditionally on a fixed value of $C$. Then $P(x_1=a_1,\dots,x_d=a_d \mid C=c) = \prod_{j=1}^{d} P(x_j=a_j \mid C=c)$, so $P(C=c)\,P(x_1=a_1,\dots,x_d=a_d \mid C=c) = P(C=c)\prod_{j=1}^{d} P(x_j=a_j \mid C=c)$. These terms are much easier to estimate: to estimate $P(x_j=a_j \mid C=c)$ we just take the training records of class $c$ and compute which fraction of them verifies $x_j=a_j$. This is usually written as $P(x_j=a_j \mid C=c) = \frac{q(a_j,c)}{r(c)}$.
Remark 5.1. When there are not enough training samples to produce reliable estimates, we can use Laplacian smoothing, in which a small value $\alpha$ is added to the numerator and $\alpha m_j$ is added to the denominator, where $m_j$ is the number of distinct values of the $j$-th attribute: $P(x_j=a_j \mid C=c) = \frac{q(a_j,c)+\alpha}{r(c)+\alpha m_j}$. $\alpha$ is called the Laplacian smoothing parameter.

Remark 5.2. If a feature is continuous, then the likelihood is computed using a Gaussian distribution with mean $\mu_c$ and standard deviation $\sigma_c$: $P(x_j=a_j \mid C=c) = g(a_j,\mu_c,\sigma_c) = \frac{1}{\sqrt{2\pi}\,\sigma_c} e^{-\frac{(a_j-\mu_c)^2}{2\sigma_c^2}}$.
This model is sometimes referred to as the Bernoulli model for Bayes classification when it is applied to categorical data in which each feature attribute has only two outcomes.
In cases where more than two outcomes are possible for a feature variable, the model is referred to as the generalized Bernoulli model.
Example 5.1. With the following training data:
Age    | Income | Student | Credit_Rating | Buys
≤30    | high   | no      | fair          | no
≤30    | high   | no      | excellent     | no
31..40 | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31..40 | low    | yes     | excellent     | yes
≤30    | medium | no      | fair          | no
≤30    | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
≤30    | medium | yes     | excellent     | yes
31..40 | medium | no      | excellent     | yes
31..40 | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
Classify with the Naive Bayes model the unseen record $X = (\leq 30, medium, yes, fair)$. In this case, we have $C \in \{yes, no\}$, so $P(C=yes) = \frac{9}{14}$ and $P(C=no) = \frac{5}{14}$. For the feature variables, we have to compute $P(x_i=a_i \mid C=yes)$ and $P(x_i=a_i \mid C=no)$:
Then, we have $P(C=yes)\,P(\leq 30, medium, yes, fair \mid C=yes) = P(C=yes)\prod_i P(x_i=a_i \mid C=yes) = \frac{9}{14}\cdot\frac{2}{9}\cdot\frac{4}{9}\cdot\frac{6}{9}\cdot\frac{6}{9} = \frac{16}{567} \approx 0.028$ and $P(C=no)\,P(\leq 30, medium, yes, fair \mid C=no) = \frac{5}{14}\cdot\frac{3}{5}\cdot\frac{2}{5}\cdot\frac{1}{5}\cdot\frac{2}{5} = \frac{6}{875} \approx 0.007$. Thus, the model classifies $X$ with class 'yes'.
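A minimal Python sketch of this computation, with the table above hard-coded ('≤30' written as '<=30'); no smoothing is applied and all attributes are treated as categorical.
# rows: (age, income, student, credit_rating, buys)
rows = [
    ("<=30", "high", "no", "fair", "no"),   ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),   (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),  (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_score(x, c):
    # P(C=c) * prod_j P(x_j = a_j | C=c), both estimated by counting
    of_class = [r for r in rows if r[-1] == c]
    score = len(of_class) / len(rows)                       # prior P(C=c)
    for j, a in enumerate(x):
        score *= sum(1 for r in of_class if r[j] == a) / len(of_class)
    return score

x = ("<=30", "medium", "yes", "fair")
for c in ("yes", "no"):
    print(c, round(naive_bayes_score(x, c), 4))   # yes ~0.0282, no ~0.0069 -> 'yes'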

Some final comments

The advantages of the Bayes classifier are that it is easy to implement and can lead to fairly good results.
The drawback is that the main assumption, class-conditional independence, is often unrealistic, which may lead to a loss of accuracy, because dependencies do exist among variables. To deal with this issue there is a more complex model: Bayesian Belief Networks.

6 Model evaluation and selection

With a given dataset, we can train multiple classification models and each of them will behave differently, many times without an intuitive explanation of the differences observed. Thus, it becomes crucial to establish means of comparison between different models, so it is possible to choose between several models trained with the same data.
The simplest classification measure is the accuracy, which indicates the proportion of correctly classified records. Nonetheless, if we compute the accuracy using the training data, we could be encouraging overfitting to the training data, and we might choose models that do not work well with unseen data. This is why it is usual to use a validation/test set to compute comparison measures: given a dataset, we divide it into training data and test data. Models are then trained using the training data and evaluated with the test data, which has not been seen before. This makes the measures more reliable and the comparisons fairer.
There are several ways to extract training and test data from a dataset, which are detailed in Section 9.

6.1 Confusion Matrix

A confusion matrix is a visual and intuitive way to assess a classification algorithm. In its simplest form it is used to assess a binary classification model and it shows the following metrics:
In this case, the confusion matrix has the form:
             | Predicted True | Predicted False
Actual True  | TP             | FN
Actual False | FP             | TN
In the more general case in which we have an $N$-class classifier, each cell $M_{ij}$ contains the count of records classified as class $j$ that are of class $i$ in reality:
            | Predicted C_1 | Predicted C_2 | ... | Predicted C_N
Actual C_1  | TC_1          | FC_{2,1}      | ... | FC_{N,1}
Actual C_2  | FC_{1,2}      | TC_2          | ... | FC_{N,2}
...         | ...           | ...           | ... | ...
Actual C_N  | FC_{1,N}      | FC_{2,N}      | ... | TC_N
From these confusion matrices, we can derive some interesting measures:
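Typical measures derived from the binary matrix are the accuracy, precision, recall (sensitivity), specificity and the F1 score; a minimal Python sketch (the function name is ours):
def binary_metrics(tp, fn, fp, tn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0    # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

print(binary_metrics(tp=40, fn=10, fp=5, tn=45))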

7 Ensemble methods: increasing accuracy

An ensemble method for classification is a composite model, made up of a combination of classifiers. The idea is that, given an unseen record, it can be classified using several models, and a consensus among all of them then decides the final class of the record. Usually a voting scheme is used, in which the most voted class is the one chosen. This approach usually improves the accuracy of each individual component.
Let's analyze why this works! There are three primary components to the error of a classifier:
  1. Bias: every classifier has its own assumptions about the nature of the decision boundary between classes. When a classifier has high bias, it will make consistently incorrect predictions over records that lie near the incorrectly-modeled decision boundary.
    Example 7.1. Bias is shown below.
    image: 6_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado11.png
    The real model has been generated using the blue line. The classification model assumes that the data can be classified using a straight line. As we can see, points near the boundary would be misclassified.
  2. Variance: random variations in the choice of the training data will lead to different models. This is closely related to overfitting. When a classifier has an overfitting tendency, it will make inconsistent predictions for the same test instance over different training data sets.
    Example 7.2. Variance is shown below.
    image: 7_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado12.png
    In this case, the yellowish model happened to use few red points near the boundary, so it became a perfect blue classifier but a bad red classifier. The opposite happened to the bluish model. Note that these variations are only due to the random selection of the training points.
  3. Noise: the intrinsic errors in the target class labeling. As this is intrinsic, there is not much one can do about it. Therefore, we will focus on the former two sources of error.
In addition, bias and variance are often in a trade-off relationship: improving bias worsens variance, and vice-versa. Generally speaking, simplified assumptions about the decision boundary lead to greater bias but lower variance, while complex assumptions reduce bias but are harder to robustly estimate with limited data.
Ensemble analysis can often be used to reduce both the bias and variance of the classification process, because a combination of different simple models with high bias and little variance will reduce the bias as the assumptions will be combined to model more complex scenarios. On the other hand, a combination of complex models with low bias and high variance, will reduce the variance because decisions made by multiple models tend to be more consistent than those made by individual models.

Looking at the accuracy

Now, let's look at the accuracy of the ensemble model in comparison to its component models in the case of binary classification (for ease). Say the ensemble model is composed of $N$ models, each of them with accuracy $acc_i$. For the ensemble model to classify a new record correctly, at least half of the models need to classify the tuple correctly. Defining the random variable $X$ as 'the number of classifiers that classify the tuple correctly', we have $P(\text{correct}) = P\left(X \geq \frac{N}{2}\right) = 1 - P\left(X < \frac{N}{2}\right)$, and we can decompose $P\left(X < \frac{N}{2}\right) = P(X=1) + P(X=2) + \dots + P\left(X = \left\lceil\frac{N}{2}\right\rceil - 1\right)$.
Here, assuming independence of the different models, $P(X=1) = \sum_i acc_i \prod_{j\neq i}(1-acc_j)$, $P(X=2) = \sum_{i<j} acc_i\, acc_j \prod_{k\neq i,j}(1-acc_k)$, and so on. This yields a complex formula, but in the simplest case, in which $A = acc_i$ for all $i$, $X$ follows a binomial distribution and $P(X=k) = \binom{N}{k} A^k (1-A)^{N-k}$.
Example 7.3. Suppose an ensemble model with $N=3$ and equal accuracy $A>0$ for the three models. Then, following the decomposition above, the accuracy of the ensemble model is $Acc = P(\text{correct}) = 1 - P(X=1) = 1 - 3A(1-A)^2$, which can be compared with the individual accuracies: $Acc - A = 1 - 3A(1-A)^2 - A = -3A(1-A)^2 + 1 - A = (1-A)\left[1 - 3A(1-A)\right]$. At this point, the accuracy will be increased whenever $1 - 3A(1-A) > 0 \Leftrightarrow 3A^2 - 3A + 1 > 0$. The discriminant of this polynomial is $b^2 - 4ac = 9 - 12 = -3 < 0$, so all its roots are complex and it does not cut the X axis. As the leading coefficient is positive, this polynomial is always positive, and we conclude that the 3-ensemble method always improves the accuracy of the individual models.

Example 7.4. Suppose an ensemble model with $N=5$ and equal accuracy $A>0$ for the five models. Then, the accuracy of the ensemble model is $Acc = 1 - P(X=1) - P(X=2) = 1 - 5A(1-A)^4 - \frac{5!}{2!\,3!}A^2(1-A)^3 = 1 - 5A(1-A)^4 - 10A^2(1-A)^3$, and when compared to the individual values we obtain $Acc - A = 1 - 5A(1-A)^4 - 10A^2(1-A)^3 - A = (1-A)\left[1 - 5A(1-A)^3 - 10A^2(1-A)^2\right]$. This is an increased accuracy whenever $1 - 5A(1-A)^3 - 10A^2(1-A)^2 > 0$, which corresponds to the polynomial $-5A^4 + 5A^3 + 5A^2 - 5A + 1 > 0$. The graph of this polynomial is
image: 8_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado13.png
So for $A \in (0,1)$ it is positive everywhere except on the interval $(0.354, 0.423)$, in which it is negative. This means that the accuracy is improved in all cases outside this interval.
For example, if A=0.7 , then Acc=0.839 , which is indeed an improvement.
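Under the same independence and equal-accuracy assumptions, the majority-vote accuracy can also be computed directly from the binomial distribution. The sketch below sums over all strict majorities (so it also accounts for the $P(X=0)$ term); for $A=0.7$ it gives approximately 0.784 for $N=3$ and 0.837 for $N=5$, in line with the values above.
from math import comb

def ensemble_accuracy(A, N):
    # P(a strict majority of the N independent classifiers, each with accuracy A, is correct)
    return sum(comb(N, k) * A**k * (1 - A)**(N - k)
               for k in range(N // 2 + 1, N + 1))

print(round(ensemble_accuracy(0.7, 3), 3))   # 0.784
print(round(ensemble_accuracy(0.7, 5), 3))   # 0.837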
There are several ways to make ensemble methods:
  1. Bagging: the data is bootstrapped N times, collecting for each model a training data set of approximately the same size as the original data set. Then, each model is trained with a different bootstrap sample.
    This approach reduces the variance, because the differences due to the sampling are reduced by performing this sampling N times. However, bagging does not improve the bias, because all the models share the same assumptions.
  2. Boosting: a weight is associated with each training instance and the N classifiers are trained with the use of these weights. When classifier M_i has finished its training, the records that it misclassifies have their weights increased, so the next model M_{i+1} will be trained paying more attention to those records.
    With this approach, the overall bias is reduced because each model focuses on those places where the past models tend to fail.
    1. Adaboost: a particular algorithm approach based on the boosting idea. Given a dataset, D , of k records with label y i , ( X 1 , y 1 ) ,...,( X k , y k ) :
      1. Set initial weights to w i = 1 k ,i=1,...,k .
      2. j = 1
      3. While j<N
        1. Bootstrap D to get a training set T j , select tuples with probability w i .
        2. Train model M j with T j .
        3. Compute the error rate of M j using D .
        4. Update the weigths
          - If a tuple is misclassified: increase its weight
          - If it is correctly classified: decrease its weight
        5. j = j+1
      The error rate with weights is computed as $ER(M_j) = \sum_{i=1}^{k} w_i \cdot error(M_j, X_i)$, where $error(M,X) = 0$ if $M$ classifies $X$ correctly and $1$ if $M$ misclassifies $X$.
      When the voting is performed, the votes are also weighted, with $w(M_i) = \log\frac{1 - ER(M_i)}{ER(M_i)}$. (A minimal sketch of this weight-update loop is given after this list.)
  3. Random Forest: bagging does not work very well combined with decision trees, because the ID3 algorithm tends to generate similar/correlated trees. The idea here is to add randomness to the tree induction algorithm itself, as follows:
    1. Before each split, L attributes are randomly selected out of the available K attributes.
    2. The split attribute is selected from this group of L attributes.
    When L is selected to be much smaller than K, the trees in the forest are largely independent (decorrelated), so the method is able to improve the accuracy of the individual trees. Note, nonetheless, that in this case the individual trees will perform worse than normally trained trees, because they have been weakened on purpose.
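As announced in the boosting item above, here is a minimal Python sketch of one boosting round in the AdaBoost style: the correctly classified weights are scaled by ER/(1-ER) and all weights are then renormalized, which is one common way to realise "increase/decrease the weights"; train_model and predict are placeholders, not a real library API.
import math
import random

def boosting_round(D, weights, train_model, predict):
    # D: list of (X, y) pairs; weights: current instance weights (summing to 1)
    T = random.choices(D, weights=weights, k=len(D))        # weighted bootstrap
    model = train_model(T)                                   # placeholder learner
    errors = [0 if predict(model, X) == y else 1 for X, y in D]
    er = sum(w * e for w, e in zip(weights, errors))         # weighted error rate
    beta = er / (1 - er)                                     # < 1 if better than chance
    new_w = [w * (1.0 if e == 1 else beta) for w, e in zip(weights, errors)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]                       # normalize to sum to 1
    vote = math.log((1 - er) / er)                           # weight of this model's vote
    return model, new_w, vote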

Part III Model validation and data preparation

8 Data preparation

The data preparation phase is a multistage process that comprises several individual steps, some or all of which may be used in a given application. These steps are:
  1. Feature extraction and portability: a feature is a characteristic of the data or something derived from the data. For example, if we have a sensor measuring humidity, the level of humidity is a feature directly present in the data; the difference between the humidity level at each measurement and the average humidity level is a derived feature.
    Features with good semantic interpretability are more desirable, because they make it easier for the analyst to understand the results. The process of selecting which features to take into account for further analysis is called feature extraction.
    Data type portability refers to the process of transforming data into different formats. There can be several reasons for this: we may have several sources of data that we want to unify, or we may use an internal data type that is not compatible with what the training algorithms expect.
  2. Data cleaning: missing, erroneous and inconsistent entries are treated. We can either remove them or estimate them via the process of imputation.
  3. Data reduction, selection and transformation: the size of the data is reduced through data subset selection, feature subset selection, or data transformation. This helps in two ways:
    1. The algorithms perform more efficiently in smaller datasets.
    2. The removal of irrelevant features or records improves the quality of the data mining process.

8.1 Feature extraction

Example 8.1. Image feature extraction
Image data are represented as pixels. Nonetheless, we know that pixels are related to each other and that combinations of pixels carry information about what we are seeing in the image. This is not straightforward for a computer to understand, as the computer only 'sees' a matrix of triplets.
At a higher level, color histograms can be used to represent the features in different segments of an image.
Also, visual words are used to extract features from images. A visual word is a semantically rich representation of parts of an image.

Example 8.2. Document feature extraction
Document data is often available in raw and unstructured form, and the data may contain rich linguistic relations between different entities.
One approach is to remove stop words, stem the data, and use bag-of-words representation.
Other methods use entity extraction to determine linguistic relationships.
Named-entity recognition is an important subtask of information extraction. It consists in locating and classifying atomic elements in text into predefined categories such as names of persons, organizations, ...

8.2 Data Type Portability

8.3 Data Cleaning

Data in the real world is

8.3.1 Handling Missing and Inconsistent entries

We can:

8.3.2 Handling Noisy entries

We can:

8.4 Exploratory analysis

Exploratory analysis is the task of understanding the data, from the meaning of the features to their range of values or even their statistical distributions. There are many actions we can take to explore the data, such as counting nulls, searching for repetitions, computing some statistics of the data (the maximum, the minimum, the mean, ...), and many more.

8.4.1 Central tendency measures

Central tendency measures are one-number summaries that can be helpful:

8.4.2 Symmetric and Skewed data

Using the mean, median and mode, we can get a rough idea of the distribution of the data:

8.4.3 Measuring the dispersion

8.4.4 Comparing with the normal distribution

We can compute the mean μ and the standard deviation σ and check if the data behaves as a normal distribution:
We can also perform the Kolmogorov-Smirnov test or the Shapiro-Wilk test, which are statistical tests that assess whether a distribution is normal.

8.5 Similarity and Distance

There are many data mining algorithms that use the notions of similarity or distance between two points. Usually, the selection of the distance function is an important decision before using an algorithm, because it will ultimately influence the results and their implications.
Definition 8.1. The $L_p$-norm is a distance function defined by $Dist(X,Y) = \left(\sum_{i=1}^{d} |X_i - Y_i|^p\right)^{\frac{1}{p}}$.
For p=2 it is the well-known Euclidean distance.
For p=1 it is called the Manhattan distance, because it is like traversing a grid made of rectangles, similar to the streetmap of Manhattan.
Definition 8.2. The generalized Minkowski distance is defined by $Dist(X,Y) = \left(\sum_{i=1}^{d} a_i |X_i - Y_i|^p\right)^{\frac{1}{p}}$.
Remark 8.1. As we can see, the Minkowski distance is a weighted $L_p$-norm. This is useful in contexts where some features are more important than others, so they can have a higher weight in the distance measure.

Remark 8.2. Generally, p is set to d , the number of dimensions of the data.

Remark 8.3. Multidimensional data normally has different scales for the different dimensions, resulting in some features dominating others in distance computations. To solve this issue we can apply normalization and scaling.
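A small Python sketch of these distance functions, together with the min-max normalization mentioned in the last remark (function names are ours):
def lp_distance(x, y, p=2):
    # L_p distance between two equal-length numeric vectors
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def minkowski(x, y, w, p=2):
    # generalized Minkowski distance with per-dimension weights a_i (here w)
    return sum(wi * abs(a - b) ** p for a, b, wi in zip(x, y, w)) ** (1 / p)

def min_max_normalize(column):
    # rescale a numeric column to [0,1] so that no dimension dominates the distance
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

print(lp_distance((0, 0), (3, 4)))                  # 5.0 (Euclidean)
print(lp_distance((0, 0), (3, 4), p=1))             # 7.0 (Manhattan)
print(round(minkowski((0, 0), (3, 4), (2, 1)), 3))  # 5.831 (weighted)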
Definition 8.3. The edit distance is a distance defined over strings.
We have the operators:
  • r: replace one character by another.
  • i: insert one character.
  • d: delete one character.
The edit distance between two strings s 1 and s 2 is the minimum amount of operations needed to convert s 1 to s 2 .
The formula is $\text{edit}(s_1,s_2) = \begin{cases} |s_1| & \text{if } |s_2| = 0 \\ |s_2| & \text{if } |s_1| = 0 \\ \text{edit}(\text{tail}(s_1),\text{tail}(s_2)) & \text{if } s_1[0] = s_2[0] \\ 1 + \min\{\text{edit}(\text{tail}(s_1),s_2),\ \text{edit}(s_1,\text{tail}(s_2)),\ \text{edit}(\text{tail}(s_1),\text{tail}(s_2))\} & \text{otherwise} \end{cases}$ where $\text{tail}(s)$ is the string $s$ minus its first character.
Remark 8.4. If the edit distance is computed recursively with this formula, the running time is exponential in the length of the strings, since each call can branch into three recursive calls.

Remark 8.5. If we compute it with dynamic programming, with the algorithm in Algorithm 3, the time and space complexity are both $O(|s_1| \cdot |s_2|)$:
m[0,0] = 0
for i=1 to |s1| do m[i,0] = i
for j=1 to |s2| do m[0,j] = j
for i=1 to |s1| do
	for j=1 to |s2| do
		m[i,j] = min{m[i-1,j-1] + (if s1[i] = s2[j] then 0 else 1),
					 m[i-1,j] + 1,
					 m[i,j-1] + 1
					}
return m[|s1|,|s2|]
Algorithm 3: DP_Edit(s1, s2)
Remark 8.6. There is a further improvement that can be made. If we fill the dynamic programming table row by row (or column by column), we only need to store two rows: the current one and the previous one. In this case the time remains $O(|s_1|\cdot|s_2|)$, but the extra space drops to $O(\min(|s_1|,|s_2|))$.
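A Python sketch of this two-row version (0-based indexing; the roles of s1 and s2 can be swapped so that the shorter string determines the stored row length):
def edit_distance(s1, s2):
    # dynamic programming keeping only the previous and the current row
    prev = list(range(len(s2) + 1))          # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # deleting the first i characters of s1
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j - 1] + (0 if c1 == c2 else 1),   # replace / match
                            prev[j] + 1,                            # delete
                            curr[j - 1] + 1))                       # insert
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))    # 3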

9 Model evaluation

Once we have trained a model, we want to assess how well it performs. This way, we can compare different models and discuss, quantitatively, which of them is preferable for our purposes. Nonetheless, this task is not easy, and there are both methodological and quantification issues to take into account:

9.1 Holdout

The labeled data is randomly divided into two disjoint sets, corresponding to the training and test data. The training data is used to feed the training algorithm and produce a model, whose performance is assessed using the test data.
The approach can be repeated several times with multiple samples to provide a final estimate.
image: 9_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado6.png
Figure 1: Holdout visualization.

9.2 Cross-Validation

The labeled data is divided into m disjoint subsets of equal size n/m. A typical choice of m is around 10. One of the m segments is used for testing, and the other (m-1) segments are used for training. This approach is repeated by selecting each of the m different segments in the data as the test set.
The average accuracy over the different test sets is then reported.
The overall accuracy of the cross-validation procedure tends to be a highly representative, but pessimistic, estimate of model accuracy.
When m is chosen to be m=n , n-1 examples are used for training, and one example is used for testing. This is called leave-one-out cross-validation. This approach is very expensive for large datasets.
image: 10_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado7.png
Figure 2: Cross-Validation visualization. m=3 .
Stratified cross-validation uses proportional representation of each class in the different folds and usually provides less pessimistic results.
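A small sketch of the m-fold index split (plain Python; libraries such as scikit-learn offer ready-made, including stratified, versions):
import random

def cross_validation_folds(n, m, seed=0):
    # yield (train_indices, test_indices) for each of the m folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::m] for k in range(m)]
    for k in range(m):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

for train, test in cross_validation_folds(n=9, m=3):
    print(len(train), len(test))   # 6 3, printed three times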

9.3 Bootstrap

The labeled data is sampled uniformly with replacement, to create a training dataset, which can contain duplicates. The labeled data of size n is sampled n times with replacement.
The probability that a particular point is not included in a sample is $p_1 = 1 - \frac{1}{n}$. Therefore, the probability that the point is not included in any of the $n$ samples is $p_n = \left(1 - \frac{1}{n}\right)^n$. For large values of $n$, this approaches $\frac{1}{e} \approx 0.368$. Thus, the fraction of the labeled data points included at least once in the training dataset is about $1 - \frac{1}{e} \approx 0.632$.
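A quick numeric check of this limit:
for n in (10, 100, 10_000):
    p_excluded = (1 - 1 / n) ** n            # probability of never being sampled
    print(n, round(p_excluded, 4), round(1 - p_excluded, 4))
# p_excluded approaches 1/e ~ 0.3679, so the included fraction approaches ~0.632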
The overall accuracy is computed using the original set of full labeled data as the test examples.
The estimate is highly optimistic of the true classifier accuracy because of the large overlap between the training and test examples.
A better strategy is the leave-one-out bootstrap, in which the accuracy for each labeled instance is computed using the classifier performance on only those bootstrapped samples that do not contain the instance.
This approach provides a pessimistic accuracy estimate, A l , given by the mean value of the accuracy computed for each labeled instance.
The 0.632-bootstrap improves the accuracy estimate with a compromise approach. The average training-data accuracy A t over b bootstrapped samples is computed. This is a highly optimistic estimate. The overall accuracy is a weighted average of the leave-one-out accuracy and the training-data accuracy: A=( 0.632 ) A l +( 0.368 ) A t .

Part IV Clustering

Some applications require dividing the data into different groups that share some characteristics. The problem, many times, is that we don't know which characteristics are useful to characterize the data, or to what extent. The general (unsupervised) approach to tackle this problem is clustering.
Definition 9.1. Clustering problem (Informal)
Given a set of data points, partition them into groups containing similar data points.
This definition is informal and general, but gives enough information to understand the problem, as well as enough freedom to tackle it from different perspectives.

10 Representative-Based Algorithms

These are the simplest of all clustering algorithms, as they directly use distances or similarities to cluster the data. They do not capture hierarchical relationships and use a set of representatives to cluster the data. The main insight is that the discovery of good clusters equates to the discovery of good representatives.
Definition 10.1. Representative-Based general clustering problem
Given a data set $\mathcal{D}$ with $n$ data points $X_1,\dots,X_n$ in a $d$-dimensional space and a specified number of clusters $k$, the goal of a representative-based algorithm is to determine $k$ representatives $Y_1,\dots,Y_k$ such that the objective function $O = \sum_{i=1}^{n}\left[\min_j Dist(X_i,Y_j)\right]$ is minimized, i.e., the sum of the distances of the different data points to their closest representative needs to be minimized.
Remark 10.1. The representatives $Y_1,\dots,Y_k$ and the optimal assignment of data points to representatives are unknown a priori, but they depend on each other in a circular way. This fact allows us to develop an iterative approach to solve the problem.
The generic k-representative approach is shown in Algorithm 4.
Algorithm 4: Generic k -representative approach (Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Initialize Y = {Y_1,...,Y_k}   # using heuristics
Initialize clusters C_1 = {},... C_k = {}
do
# Assign step
	for(X in D):
		assign X to Y_j such that Dist(X,Y_j) = min_i Dist(X,Y_i)
		C_j.add(X)
	
# Optimize step
	for all Clusters C_j:
		determine Y_j' such that 
			sum_{X_i in C_j} Dist(X_i,Y_j')
		is minimized
		Y_j = Y_j'

while O = sum_{i=1}^{n} [min_j Dist(X_i,Y_j)] > eps
return {C_1, Y_1},...,{C_k, Y_k}
	
Remark 10.2. The idea is to improve the objective function over multiple iterations. The improvement is usually greater in early iterations, and decreases rapidly afterwards.

Remark 10.3. The main computational bottleneck is the assignment step, where distances need to be computed between every point and each of the representatives.

10.1 The k-Means algorithm

The k-means algorithm is a representative clustering method in which the distance used is the squared Euclidean distance (or squared $L_2$-norm): $Dist(X_i,Y_j) = \|X_i - Y_j\|_2^2$. Thus, the objective function minimizes the sum of squared errors over the data points; this is called the SSE (Sum of Squared Errors).
Proposition 10.1. The optimal representative Y j for each of the optimize iterative steps is the mean of the data points in cluster C j .
Proof. In the current step, we have a fixed clustering assignment from the last step, $C_1,\dots,C_k$. The overall clustering objective function is $O(X,Y) = \sum_{j=1}^{k}\sum_{X_i\in C_j} \|X_i - Y_j\|_2^2$, so its gradient with respect to each $Y_j$ is $\frac{\partial}{\partial Y_j} O(X,Y) = -2\sum_{X_i\in C_j}(X_i - Y_j)$. Setting the gradient equal to 0 (for optimization purposes), we get $\sum_{X_i\in C_j}(X_i - Y_j) = 0$, or, equivalently, $\sum_{X_i\in C_j} X_i = |C_j|\, Y_j \Rightarrow Y_j = \frac{\sum_{X_i\in C_j} X_i}{|C_j|} = \text{mean}(C_j)$.
Remark 10.4. Note that the obtained representative may be a point that is not present in the data. This is sometimes undesirable.

Remark 10.5. Note also that the proof assumes numerical attributes. Computing the mean of (for example) different texts does not seem easy (think, for example, of the words classification and regression: they could certainly be clustered together under data mining techniques, but the latter is hardly the mean of the two former words).

Remark 10.6. Regarding time complexity:

Remark 10.7. Disadvantages:
Example 10.1. Apply the k-means algorithm with k=3 to the dataset D={ A 1 =( 2,10 ) , A 2 =( 2,5 ) , A 3 =( 8,4 ) , A 4 =( 5,8 ) , A 5 =( 7,5 ) , A 6 =( 6,4 ) , A 7 =( 1,2 ) , A 8 =( 4,9 ) } and using the seed Y 1 = A 5 , Y 2 = A 6 and Y 3 = A 8 .
(Figures showing the successive k-means iterations are omitted.)
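A minimal NumPy sketch of the k-means loop, run on the data and seeds of Example 10.1 (a fixed number of iterations is used instead of the eps-based stopping test):
import numpy as np

def k_means(points, seeds, n_iter=10):
    Y = np.array(seeds, dtype=float)                   # current representatives
    for _ in range(n_iter):
        # assign step: closest representative under squared Euclidean distance
        d = ((points[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # optimize step: each representative becomes the mean of its cluster
        Y = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else Y[j]
                      for j in range(len(Y))])
    return Y, labels

D = np.array([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)
centers, labels = k_means(D, seeds=[D[4], D[5], D[7]])   # Y1 = A5, Y2 = A6, Y3 = A8
print(centers)
print(labels)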

10.2 The k-Medians Algorithm

In this case the Manhattan distance ($L_1$-norm) is used: $Dist(X_i,Y_j) = \|X_i - Y_j\|_1$.
Proposition 10.2. The optimal representative Y j for each of the optimize iterative steps is the median of the data points along each dimension in cluster C j .
Proof. In this case, the objective function is $O(X,Y) = \sum_{j=1}^{k}\sum_{X_i\in C_j}\|X_i - Y_j\|_1$. Now, the $L_1$-norm is obtained by summing the absolute values in each dimension. The problem is that this function is not differentiable; nonetheless, it is differentiable almost everywhere, and we can obtain a subgradient of $O$ with respect to $Y_j$ as $\frac{\partial}{\partial Y_j} O(X,Y) = -\sum_{X_i\in C_j}\text{sign}(X_i - Y_j)$ (componentwise). For this to equal 0, we need, in each dimension, as many negative signs as positive signs: the median in each dimension achieves exactly this, as it has as many values to its left as to its right.
Example 10.2. Repeat the example using the k-Medians algorithm.
(Figures showing the k-medians iterations are omitted.)

10.3 The k-Medoids Algorithm

In this algorithm the representatives are always selected from the database D , and this makes the structure of the algorithm different from the one we have seen before.
Reasons:
The algorithm is as in Algorithm 5.
Algorithm 5: k -medoids(Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Initialize Y = {Y_1,...,Y_k}   # using heuristics
Initialize clusters C_1 = {},... C_k = {}
do
# Assign step
	for(X in D):
		assign X to Y_j such that Dist(X,Y_j) = min_i Dist(X,Y_i)
		C_j.add(X)
	
# Optimize step
	Determine a pair X_i in D and Y_j in Y such that 
		replacing Y_j with X_i leads to the 
		greatest possible improvement in the objective function

	Perform the exchange between X_i and Y_j only if improvement is positive.

while O = sum_{i=1}^{n} [min_j Dist(X_i,Y_j)] > eps 
		or no improvement in current iteration
return {C_1, Y_1},...,{C_k, Y_k}
	
Remark 10.8. In this algorithm we use a hill climbing strategy to obtain the best representatives.

Remark 10.9. We can try all possible changes or sample points from the database to try with. The latter approach is often more desirable for time issues.

10.4 Practical issues

11 Grid and Density based Algorithms

One of the major problems with distance-based algorithms is that the shape of the clusters is implicitly enforced by the distance function. Thus, it can be hard to detect natural clusters of arbitrary shape.
Density-based algorithms are useful for this. The idea is to identify dense regions in the data, and use the positions of the different regions to determine the clusters.

11.1 Grid-based methods

The data is discretized into p intervals, typically equi-width. If the data has d dimensions, we will obtain p d hyper-cubes. These are the building blocks for the clusters.
A density threshold τ is used to determine the dense hyper-cubes. In most real data-sets, an arbitrarily shaped cluster will result in multiple dense regions connected together by a side or a corner.
Two hyper-cubes are said to be adjacently connected if they share a side (sometimes corners are also considered).
Two hyper-cubes are said to be density connected if a path can be found from one to another containing only a sequence of adjacently connected grid regions.
The goal is to determine the density-connected regions. Using a graph representation, the problem is equivalent to finding the connected components of the graph in which the dense hyper-cubes are the nodes and an edge is defined between every pair of adjacent cubes.
Advantages
The number of clusters is not pre-defined, so we don't need to bother with the estimation of k .
Disadvantages
We have to define p and τ , which is not easy.
Also, if the clusters present different densities, it is even more difficult to determine τ and p because each cluster is 'asking' for different values.
The generic algorithm is as follows:
Discretize each dimension into p ranges
Determine grid cells at density level tau
Create graph in which dense grids are connected if they are adjacent
Determine connected components of the graph
return points in each connected component as a cluster
Algorithm 6: GenericGrid(Data D, Ranges p, Density tau) : clusters C

11.2 DBSCAN

The idea behind DBSCAN is similar to the one we have seen, but density is considered at a pointwise level:
The density of a data point is defined as the number of points that lie within a radius eps of it, i.e., in its neighbourhood of radius eps. The densities are used to classify the points:
And we define some relations between points:
After the points have been classified, a connectivity graph is constructed, and a cluster is a maximal set of points that are all reachable from one another under any of these definitions.
Now, we identify the connected components of the graph, which are the clusters.
The detailed algorithm is as follows:
Clusters = {}
for each unvisited point P in D
	Neighbourhood = regionQuery(P, eps)
	if sizeof(Neighbourhood) < tau
		mark P as visited
	else
		C = next cluster
		expandCluster(P, Neighbourhood, C, eps, tau)
		if C not in Clusters
			Clusters.add(C)
return Clusters

function expandCluster(P, Neighbourhood, C, eps, tau)
	mark P as visited
	C.add(P)
	for Q in Neighbourhood
		if Q not visited
			mark Q as visited
			Neighbourhood_Q = regionQuery(Q, eps)
			if sizeof(Neighbourhood_Q) >= tau
				Neighbourhood.addAll(Neighbourhood_Q)
			if Q is not in any cluster
				C.add(Q)
Algorithm 7: DBSCAN(Data D, Radius eps, Density tau) : clusters C
Advantages
This method is not very different from the graph method, and it can also discover clusters of any shape, without the need of knowing the number of clusters in advance.
Disadvantages
Again, determining the correct values for eps and τ is a complex task. Also, the existence of clusters with different densities makes it even harder.
The dominant time cost is finding the neighbours: $O(n^2)$.
In some special cases, a spatial index can reduce it to $O(n \log n)$.
Remark 11.1. Usually, grid based methods are more efficient because they partition the space, which makes the procedure less computationally expensive.

11.2.1 Progressive DBSCAN

The idea is to keep the same value of τ and apply DBSCAN in a progressive way, increasing the value of eps:
  1. Start with a small eps to find dense clusters.
  2. Iteratively relax the eps value to find less dense clusters.
  3. After every iteration, the points that already belong to a cluster are removed from the dataset.

11.3 DENCLUE

The DENCLUE algorithm is based on kernel-density estimation, which can be used to create a smooth profile of the density distribution, by defining the density $f(X)$ at coordinate $X$ as $f(X) = \frac{1}{n}\sum_{i=1}^{n} K(X - X_i)$, where $K$ is the kernel function and $X_1,\dots,X_n$ are the $n$ data points. A commonly used kernel function is the Gaussian kernel: $K(X - X_i) = \left(\frac{1}{h\sqrt{2\pi}}\right)^{d} e^{-\frac{\|X - X_i\|^2}{2h^2}}$. The effect of this operation is to replace each discrete data point with a smooth bump, and the density at each point is the sum of all these bumps.
Example 11.1. A visual example of a kernel smoothing:
image: 18_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado8.png
Once the density has been smoothed, the goal is to determine clusters by using a density threshold τ that intersects the density profile. Two examples showing how the choice of τ affects the result are shown in Example 11.2.
Example 11.2. In the previous example, if we select τ =0.1 , we obtain the following:
image: 19_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado9.png
In this case, only one cluster is obtained. In contrast, if we choose τ =0.13 , two clusters are obtained:
image: 20_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado10.png
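A one-dimensional sketch of this kernel smoothing (the data points, the bandwidth h and the threshold tau below are made up for illustration):
import numpy as np

def kernel_density(x, data, h):
    # Gaussian kernel density estimate f(x) = (1/n) * sum_i K(x - X_i)
    data = np.asarray(data, dtype=float)
    k = np.exp(-((x - data) ** 2) / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    return k.mean()

data = [1.0, 1.2, 1.5, 1.7, 5.0, 5.2, 5.4]          # two groups of points
grid = np.linspace(0, 7, 141)
density = np.array([kernel_density(x, data, h=0.5) for x in grid])
tau = 0.13
print((density >= tau).sum(), "grid points lie above the density threshold")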

12 Probabilistic Model-Based Algorithms

Until now, all the models described have been hard clustering algorithms, meaning each data point is assigned to exactly one cluster. Probabilistic model-based algorithms are soft algorithms, in which each data point may have a nonzero assignment probability to more than one cluster.

12.1 Fuzzy sets and clusters

A fuzzy cluster is a fuzzy set $F_S: X \to [0,1]$. For each data point $X_i \in X$, $F_S(X_i)$ represents the probability that $X_i$ is in cluster $S$. $F_S(X_i)$ can be called the degree of membership of object $X_i$ to cluster $S$.
Formally, given a set of objects $X_1,\dots,X_n$, a fuzzy clustering of $k$ fuzzy clusters $C_1,\dots,C_k$ can be represented using a partition matrix, $M = [w_{ij}]$, where $w_{ij} = F_{C_j}(X_i)$. $M$ should satisfy three conditions:
  1. $w_{ij} \in [0,1]$ for all $i=1,\dots,n$ and $j=1,\dots,k$. This follows from the definition of a fuzzy cluster.
  2. $\sum_{j=1}^{k} w_{ij} = 1$ for all $i=1,\dots,n$. The sum of the membership probabilities of each object is 1.
  3. $0 < \sum_{i=1}^{n} w_{ij} < n$ for all $j=1,\dots,k$. There is no empty cluster.

12.2 Mixture model

The underlying assumption of a mixture-based generative model is to assume that the data was generated from a mixture of k distributions with probability distributions G 1 ,..., G k . Each of them represents a cluster and is called mixture component. The data points, X i , are generated by this model as follows:
  1. Select a mixture component with prior probability α i =P( G i ) . Say G r is selected.
  2. Generate a data point from G r .
This generative model is denoted by $\mathcal{M}$. We don't know the $\mathcal{G}_i$ nor the $\alpha_i$ in advance. The $\mathcal{G}_i$ distributions are often assumed to be Gaussian (although any other distribution might be assumed), so we need to estimate the parameters of the distributions in such a way that the overall data has the maximum likelihood of being generated by the model.
Consider a set $\mathcal{C}$ of $k$ probabilistic clusters $C_1,\dots,C_k$ with probability density functions $f_1,\dots,f_k$, respectively, and probabilities $p_1,\dots,p_k$. The probability of an object $X$ being generated by the cluster $C_j$ is $P(X|C_j) = p_j f_j(X)$, and the probability of $X$ being generated by the set $\mathcal{C}$ is $P(X|\mathcal{C}) = \sum_{j=1}^{k} p_j f_j(X)$. As objects are assumed to be independently generated, for a data set $\mathcal{D} = \{X_1,\dots,X_n\}$, the probability that $\mathcal{D}$ is generated by $\mathcal{C}$ is $P(\mathcal{D}|\mathcal{C}) = \prod_{i=1}^{n} P(X_i|\mathcal{C}) = \prod_{i=1}^{n}\sum_{j=1}^{k} p_j f_j(X_i)$. Now, we want to estimate $\mathcal{C}$ from $\mathcal{D}$ so that $P(\mathcal{D}|\mathcal{C})$ is maximized.
If we use the assumption that the underlying distributions are Gaussian, $\mathcal{G}(\mu_j,\sigma_j)$, then the probability density function of each cluster, centered at $\mu_j$ with standard deviation $\sigma_j$, is $P(X_i|\Theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(X_i-\mu_j)^2}{2\sigma_j^2}}$. If we further assume that all clusters have the same probability $p_j$, then $P(X_i|\Theta) = \sum_{j=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(X_i-\mu_j)^2}{2\sigma_j^2}}$. Thus, our objective is to maximize $P(\mathcal{D}|\Theta) = \prod_{i=1}^{n}\sum_{j=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(X_i-\mu_j)^2}{2\sigma_j^2}}$.
This is achieved with the expectation-maximization (EM) algorithm. The EM algorithm is a framework to approach maximum likelihood estimates of parameters in statistical models. It consists of two steps:
  1. E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters.
  2. M-step: finds the new clustering or parameters that minimize the SSE or maximize the expected likelihood.
Example 12.1. EM algorithm example on a one-dimensional Gaussian mixture (the original figure is omitted; a small sketch of the two steps is given below).
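A compact NumPy sketch of the two EM steps for a one-dimensional mixture of k Gaussians; the responsibilities w play the role of the fuzzy membership degrees of Subsection 12.1, and the initialization is deliberately crude.
import numpy as np

def em_gaussian_mixture(x, k, n_iter=50):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu = np.random.choice(x, k, replace=False)          # crude initialization
    sigma = np.full(k, x.std() + 1e-6)
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = P(cluster j | x_i)
        dens = p * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / (np.sqrt(2 * np.pi) * sigma)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameters that maximize the expected likelihood
        nj = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nj) + 1e-6
        p = nj / n
    return mu, sigma, p

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gaussian_mixture(x, k=2))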

12.3 Evaluating fuzzy clusters

If $c_1,\dots,c_k$ are the centers of the $k$ clusters, we define the sum of squared errors (SSE) for a point $X_i$ as $SSE(X_i) = \sum_{j=1}^{k} w_{ij}^{p}\, dist(X_i,c_j)^2$.
For a cluster $C_j$, its SSE is $SSE(C_j) = \sum_{i=1}^{n} w_{ij}^{p}\, dist(X_i,c_j)^2$.
Finally, the SSE of the whole clustering is $SSE(\mathcal{C}) = \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{p}\, dist(X_i,c_j)^2$.

12.4 Cluster quality measures

A good clustering method will produce high-quality clusters, i.e., clusters with the following characteristics:
The quality of the clustering method depends on the similarity measure used by the method, its implementation and its ability to discover the hidden patterns in the data.
Some examples of quality measures are:

Part V Frequent pattern and association rule mining

13 Frequent Itemset Mining

Association pattern mining is usually defined in the context of supermarket data containing sets of items bought by customers, which are referred to as transactions. The goal is to determine associations between groups of items bought by customers. The discovered sets of items are referred to as frequent itemsets.
These frequent itemsets can then be used to generate association rules of the form $X\Rightarrow Y$, where $X$ and $Y$ are sets of items. The meaning is that we discovered that when a customer buys $X$, the same customer is likely to also buy $Y$.
We have to be careful, nonetheless, because the raw frequency of a pattern is not the same as the statistical significance of the underlying correlations. This is why numerous models for frequent pattern mining have been proposed that are based on statistical significance.
Example 13.1. An intuitive example is that when someone buys bread, cheese and yogurt, it is probably the case that they will also buy milk and eggs: $\{Bread, Cheese, Yogurt\}\Rightarrow\{Milk, Eggs\}$.

13.1 The model

The problem of association pattern mining is defined on unordered set-wise data.
The database T contains n transactions, T 1 ,..., T n .
Each transaction $T_i$ is a subset of the universe of items, $T_i\subseteq U$. A transaction can be represented as a multidimensional binary record of dimensionality $d=|U|$, where $T_i(item)=1$ if $item\in T_i$ and $T_i(item)=0$ otherwise.
The universe of items is very large compared to the typical number of items in each transaction.
Definition 13.1. An itemset is a set of items, $I\subseteq U$.
A $k$-itemset is an itemset with $k$ items: $I\subseteq U$ with $|I|=k$.
The support of an itemset I , sup( I ) , is the fraction of the transactions in the database T that contain I as a subset.
Definition 13.2. The frequent itemset mining problem is defined as follows:
Given a set of transactions T ={ T 1 ,..., T n } , where each transaction T i is a subset of items from U , determine all itemsets I that occur as a subset of at least a predefined fraction minsup of the transactions in T .
The predefined fraction minsup is called minimum support.
The unique identifier of a transaction is referred to as transaction identifier (tid).
Remark 13.1. The number of frequent itemsets is generally very sensitive to the minimum support level: lowering minsup can make the number of frequent itemsets explode, while raising it too much may leave only trivial patterns.
Therefore, an appropriate choice of the support level is crucial for discovering a set of frequent patterns with meaningful size.
Example 13.2. A very simple example:
Tid | Transaction | Binary Representation
1 | {Shirt, Trousers} | 110
2 | {Shirt, Jacket} | 101
3 | {Shirt, Trousers, Jacket} | 111
In this case, the possible itemsets and their respective support is:
Itemset | Support
{Shirt} | 1
{Trousers} | 2/3
{Jacket} | 2/3
{Shirt, Trousers} | 2/3
{Shirt, Jacket} | 2/3
{Trousers, Jacket} | 1/3
{Shirt, Trousers, Jacket} | 1/3
If we select minsup = 1/3, all possible itemsets would be selected.
If we select minsup=1 , only {Shirt} would be selected.
Now, let's think about how many possible itemsets there are: since an itemset is a subset of $U$, there are $2^{|U|}$ possible itemsets. This means that computing all their supports as in the previous example would take time exponential in the number of items. Not only this, but the database also needs to be accessed for counting and comparing, so it is easy to see how this problem rapidly becomes intractable. It is therefore essential to find better ways to perform this counting and comparing, or to be able to discard itemsets even before counting. For this, there are some very convenient properties of itemsets:
Property 1: Support Monotonicity Property
The support of every subset $J\subseteq I$ is greater than or equal to the support of $I$: $sup(J)\ge sup(I),\ \forall J\subseteq I$.
This is because every transaction that contains the itemset $I$ also contains all of its subsets, so each subset appears at least in the same $sup(I)$ fraction of the transactions.
Property 2: Downward Closure Property
Every subset of a frequent itemset is also frequent.
This is a natural implication of the previous property, and it is very useful: when we discover a frequent itemset, we don't need to check its subsets because it is already assured that they are frequent, too.
Definition 13.3. A frequent itemset is maximal at a given minimum support level minsup if it is frequent, and no superset of it is also frequent.
The possible itemsets given a set of items can be conceptually arranged in the form of a lattice of itemsets, which contains one node for each of the $2^{|U|}$ sets drawn from the universe of items. An edge exists between a pair of nodes if the corresponding sets differ by exactly one item. The lattice represents the search space of frequent patterns, and it is separated into frequent and infrequent itemsets by a border.
Example 13.3. A lattice of itemset with 4 elements.
image: 22_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_lattice_itemset.png

13.2 Association rule generation framework

Definition 13.4. Let $X, Y$ be two itemsets. The confidence of the rule $X\Rightarrow Y$, $conf(X\Rightarrow Y)$, is the conditional probability of $X\cup Y$ occurring in a transaction, given that the transaction contains $X$: $conf(X\Rightarrow Y)=Pr(X\cup Y\mid X)=\frac{sup(X\cup Y)}{sup(X)}$. $X$ is called the antecedent of the rule and $Y$ is the consequent.
Definition 13.5. Let $X, Y$ be two itemsets. The rule $X\Rightarrow Y$ is said to be an association rule at a minimum support of minsup and minimum confidence of minconf if it satisfies:
  1. The support of the itemset $X\cup Y$ is at least minsup.
  2. The confidence of the rule $X\Rightarrow Y$ is at least minconf.
Here, the first criterion ensures that there are enough transactions to believe that the rule has statistical relevance. The second criterion ensures that the rule is strong enough in terms of conditional probabilities.
The overall procedure for association rule generation uses two phases:
  1. The frequent itemsets are generated at the minimum support of minsup .
  2. The association rules are generated from the frequent itemsets at the minimum confidence level of minconf .
The first phase is more computationally intensive, so we are focusing on it from now on.

13.3 Frequent itemset mining algorithms

13.3.1 Brute Force Algorithms

For a universe of items $U$, there are $2^{|U|}-1$ distinct subsets, excluding the empty set. A naïve idea would be to generate all these candidate itemsets and count their support against the transaction database $T$.
Definition 13.6. A candidate itemset is an itemset that could be frequent, and therefore needs to be checked.
Now, we can verify the candidates against the transaction database by support counting, i.e., checking whether a given itemset $I$ is a subset of each transaction $T_i\in T$.
This approach is likely to be impractical when the universe of items is large.
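A minimal Python sketch of this brute-force enumeration (itemsets as frozensets, transactions as sets); it only makes the counting step explicit and is not meant to be efficient.

from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    # minsup is a fraction of the number of transactions.
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    # Enumerate every non-empty subset of the universe of items and count it.
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cand = set(candidate)
            support = sum(cand <= t for t in transactions) / n
            if support >= minsup:
                frequent[frozenset(cand)] = support
    return frequent

# Example 13.2: three transactions over {Shirt, Trousers, Jacket}.
T = [{"Shirt", "Trousers"}, {"Shirt", "Jacket"}, {"Shirt", "Trousers", "Jacket"}]
print(brute_force_frequent_itemsets(T, minsup=2/3))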

A little tweak

The brute-force approach can be made faster by observing that no $(k+1)$-patterns are frequent if no $k$-patterns are frequent (this follows from the downward closure property). Thus, we can enumerate and count the patterns in increasing order of length, and stop as soon as no frequent patterns of the current length are found.
For sparse transaction databases, the length $l$ of the longest frequent itemset is usually small compared to $|U|$, so the enumeration terminates early. This approach is orders of magnitude faster, but its computational complexity is still not satisfactory for large values of $|U|$.

Ideas to apply

Better algorithms can be developed by using one or more of the following approaches:
  1. Reduce the size of the explored search space by pruning candidate itemsets using tricks.
  2. Counting the support of each candidate more efficiently by pruning transactions that are known to be irrelevant.
  3. Using compact data structures to represent either the candidates or the transaction database, so as to support efficient counting.

13.3.2 The Apriori Algorithm

The Apriori algorithm uses the downward closure property to prune candidates. If an itemset is infrequent, then all its supersets are also infrequent, so we don't need to count them.
The Apriori algorithm works as follows:
  1. Count the support of the individual items to generate frequent 1-itemsets.
  2. Combine the frequent 1-itemsets to generate candidate 2-itemsets.
  3. Count the support of the candidate 2-itemsets to generate frequent 2-itemsets.
  4. ...
In general:
  1. Combine the frequent ( k-1 ) -itemsets to generate candidate k -itemsets.
  2. Prune candidate $k$-itemsets which have some subset that is not frequent.
  3. Count the support of the candidate k -itemsets to generate frequent k -itemsets.
  4. Repeat until there are no bigger frequent itemsets.
The algorithm is detailed in Algorithm 8.
k = 1
F1 = {Frequent 1-itemsets}

while Fk is not empty do
	Generate C(k+1) by joining itemset-pairs of Fk
	Prune itemsets from C(k+1) that violate the downward closure property
	Determine F(k+1) by support counting (C(k+1),T)
	k = k+1
end

return Union(F(i) for all i=1..k)
Algorithm 8: Apriori(Transactions T, Minimum Support minsup)
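The following is a minimal Python sketch of Algorithm 8, assuming the transactions are given as frozensets and minsup is an absolute count (as in Example 13.4 below); the names and data layout are illustrative, not part of the course material.

from itertools import combinations

def apriori(transactions, minsup):
    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Fk = {i for i, c in counts.items() if c >= minsup}
    all_frequent = {i: counts[i] for i in Fk}
    k = 1
    while Fk:
        # Join step: combine frequent k-itemsets that share k-1 items.
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # Prune step: discard candidates having an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Fk for s in combinations(c, k))}
        # Support counting with one pass over the database.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Fk = {c for c, n in counts.items() if n >= minsup}
        all_frequent.update((c, counts[c]) for c in Fk)
        k += 1
    return all_frequent

# Example 13.4 database, minsup = 2 (as a count).
T = [frozenset(s) for s in ["ABE", "BD", "BC", "ABD", "AC", "BC", "AC", "ABCE", "ABC"]]
print(apriori(T, 2))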
Remark 13.2. The downward closure property ensures that the candidate set generated does not miss any itemset that is frequent. This non-repetitive and exhaustive way of generating candidates can be interpreted in the context of a conceptual hierarchy of the patterns known as enumeration tree.

Remark 13.3. To do the pruning, we check generated elements against non-frequent itemsets already generated.

Remark 13.4. The support counting process is the most expensive part because it depends on the size of T . The level-wise approach ensures that the algorithm is relatively efficient from a disk-access perspective: each set of candidates in C k can be counted in a single pass over the data without the need for random disk accesses.
Nonetheless, the counting procedure is still expensive.
Example 13.4. We are going to manually run an Apriori algorithm for a simple database using minsup=2 (the minsup can be also indicated as a count, instead of as a frequency). The database is
Database
TID | Transaction
1 | A,B,E
2 | B,D
3 | B,C
4 | A,B,D
5 | A,C
6 | B,C
7 | A,C
8 | A,B,C,E
9 | A,B,C
Let's compute F 1 directly:
F1
Itemset | Count
A | 6
B | 7
C | 6
D | 2
E | 2
From this, we see that all 1-itemsets are frequent, so we are generating all possible 2-itemsets as C2, and counting them:
C2
Itemset | Count
A,B | 4
A,C | 4
A,D | 1
A,E | 2
B,C | 4
B,D | 2
B,E | 2
C,D | 0
C,E | 1
D,E | 0
F2
Itemset | Count
A,B | 4
A,C | 4
A,E | 2
B,C | 4
B,D | 2
B,E | 2
The infrequent itemsets (count below 2) are discarded to obtain F2. Now we can generate C3 by combining itemsets from F2. For this, we search for itemsets that share all values but one. For example, {A,B} and {A,C} are combined to obtain {A,B,C}. After getting all possibilities, we find C3 and count:
C3
Itemset | Count
A,B,C | 2
A,B,D | (pruned)
A,B,E | 2
A,C,D | (pruned)
A,C,E | (pruned)
B,C,D | (pruned)
B,C,E | (pruned)
B,D,E | (pruned)
F3
Itemset | Count
A,B,C | 2
A,B,E | 2
The itemsets marked as pruned are removed because they contain some of the infrequent itemsets of C2. The rest are still frequent, so we continue with the process. We combine the only two itemsets left to get {A,B,C,E} as C4. Note, nonetheless, that {C,E} is an infrequent itemset of C2, so {A,B,C,E} is pruned and we are in fact done.
The frequent itemsets with minsup=2 are F1, F2 and F3.

Limits of Apriori

Apriori improved with tricks

Algorithm 9 details an improved version of the Apriori algorithm.
	k = 1
	F1 = {Frequent 1-itemsets}
	
	while Fk is not empty do
		C(k+1) = generate(Fk)
		for tran in T do
			C(tran) = subset(C(k+1), tran)   // candidates contained in tran
			for cand in C(tran)
				cand.count++
			end
		end
		F(k+1) = {cand in C(k+1) | cand.count >= minsup}
		k = k+1
	end
	return Union(F(i) for i=1..k)

procedure generate(Fk)
	foreach itemset1 in Fk
		foreach itemset2 in Fk
			if itemset1[1]=itemset2[1] and ... and itemset1[k-1]=itemset2[k-1] and itemset1[k] < itemset2[k]
				c = itemset1 join itemset2
				if has_infreq_subsets(c, Fk)
					continue
				else
					add c to C(k+1)
				end
			end
		end
	end
	return C(k+1)

procedure has_infreq_subsets(c, Fk)
	foreach k-subset s of c
		if s not in Fk
			return TRUE
		end
	end
	return FALSE
Algorithm 9: Apriori_improved(Transactions T, Minimum support minsup)

More Apriori Tricks

13.3.3 FP-Growth

Even though Apriori greatly improves the efficiency of the solution of the association pattern mining problem in comparison to the brute force approach, we have seen that it can still suffer from inefficiencies when the database is big. In particular, counting is very costly, and when there are many items we need to count many times. There is an improved solution for the problem, Frequent Pattern Growth (FP-Growth), which:
  1. Transforms the database into a compressed data structure called FP-Tree, which retains the frequent itemset information.
  2. Mines the FP-Tree for frequent itemsets by:
    1. Dividing it into a set of conditional databases, the conditional pattern bases, each associated with one frequent item.
    2. Mining each of these conditional pattern bases separately, without the need to re-count the original database.
Definition 13.7. An FP-Tree is a trie (prefix tree) data structure, which acts as a compressed representation of a conditional database.
The algorithm for FP-Growth is detailed in Algorithm 10.
if FPT is a single path
	determine all combinations C of nodes on the path, report Union(C,P) as frequent
else
	foreach item i in FPT do
		report Pi = Union(i,P) as frequent
		
		use pointers to extract conditional prefix paths from FPT containing i
	
		readjust counts of prefix paths
		remove i
		
		remove infrequent items from prefix paths
		
		reconstruct FPTi
		
		if FPTi not empty
			FP-Growth(FPTi, minsup, Pi)
	end
end
Algorithm 10: FP-Growth(FP-Tree FPT, Minimum Support minsup, Current Suffix P)
Basically, what FP-Growth does is recursively find all frequent itemsets ending with a particular suffix by splitting the problem into smaller subproblems.
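FP-Growth is easier to follow once the FP-Tree construction is clear. Below is a minimal Python sketch of building an FP-Tree (nodes with an item, a count, children and per-item header links), assuming minsup is an absolute count; it covers only the construction step used in Example 13.5, not the recursive mining, and all names are illustrative.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup):
    # 1) Count items and keep the frequent ones, ordered by descending count.
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    order = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1]) if c >= minsup]
    rank = {i: r for r, i in enumerate(order)}
    root = FPNode(None, None)
    header = {i: [] for i in order}   # links to all tree nodes holding each item
    # 2) Insert each transaction, reordered by descending frequency, reusing prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

T = [set(s) for s in ["ABE", "BD", "BC", "ABD", "AC", "BC", "AC", "ABCE", "ABC"]]
root, header = build_fp_tree(T, minsup=2)
print({item: [n.count for n in nodes] for item, nodes in header.items()})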
Example 13.5. Let's use FP-Growth on the previous example database:
Database
TID | Transaction
1 | A,B,E
2 | B,D
3 | B,C
4 | A,B,D
5 | A,C
6 | B,C
7 | A,C
8 | A,B,C,E
9 | A,B,C
First, we need to construct the FP-Tree. For this purpose, we compute the frequent 1-itemsets and reorder them in the count order:
F1
Itemset | Count
A | 6
B | 7
C | 6
D | 2
E | 2
In descending order of count:
F1
Itemset | Count
B | 7
A | 6
C | 6
D | 2
E | 2
Now we reorder the transactions in the database following this same ordering:
Database (items reordered)
TID | Transaction
1 | B,A,E
2 | B,D
3 | B,C
4 | B,A,D
5 | A,C
6 | B,C
7 | A,C
8 | B,A,C,E
9 | B,A,C
And now we construct the FP-Tree. We start with the root labeled as NULL, and then add the transactions one by one, reusing the prefixes when we can, and keeping a counter for each node. Every time we traverse a node, we increase the counter. The first transaction would be entered as:
image: 23_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_0.png
The second one:
image: 24_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_1.png
The third one:
image: 25_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_2.png
And so on... Until all transactions are entered in the tree:
image: 26_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_3.png
Now, we link each item in the ordered F1 (the header table) to one node in the tree corresponding to that item, and all nodes holding the same item are chained together, too. Like this:
image: 27_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_4.png
And now we proceed with the algorithm. We start with the least frequent item, E .
And that's it!

Why does FP-Growth outperform Apriori?

13.4 Mining Association Rules

Once the frequent itemsets have been found, it is time to obtain the association rules, which are what we have been aiming at since the beginning. A common approach is the following:
  1. For each frequent itemset $I$:
    1. Generate the subsets $S\subset I$.
    2. For every such $S$:
      1. Output the rule $S\Rightarrow I\setminus S$ if $conf(S\Rightarrow I\setminus S)\ge minconf$.
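A minimal sketch of this rule-generation step, assuming we already have a dictionary mapping each frequent itemset (a frozenset) to its support, e.g., the output of the Apriori sketch given after Algorithm 8; supports can be counts or fractions, since only their ratio is used.

from itertools import combinations

def generate_rules(frequent, minconf):
    # frequent: dict {frozenset(itemset): support}
    rules = []
    for I, sup_I in frequent.items():
        if len(I) < 2:
            continue
        for r in range(1, len(I)):
            for antecedent in combinations(I, r):
                S = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent (downward closure),
                # so its support is guaranteed to be available.
                conf = sup_I / frequent[S]   # conf(S => I \ S) = sup(I) / sup(S)
                if conf >= minconf:
                    rules.append((set(S), set(I - S), conf))
    return rules

# Usage sketch, with the Apriori output from the earlier example:
# rules = generate_rules(apriori(T, 2), minconf=0.7)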

13.4.1 Evaluating association rules

In Definition 13.4 we saw the concept of confidence, but there are more measures to assess how 'good' a rule is, such as the lift or correlation analysis.
Definition 13.8. The lift is $lift(X\Rightarrow Y)=\frac{sup(X\cup Y)}{sup(X)\,sup(Y)}$.
Remark 13.5. If $lift=1$, then $X$ and $Y$ are independent.
If $lift>1$, then $X$ and $Y$ have a positive dependency that is proportional to the lift value.
If $lift<1$, then $X$ and $Y$ are negatively correlated, i.e., the presence of $X$ discourages $Y$ and vice versa.
Definition 13.9. The correlation coefficient is $\Phi=\frac{TT\cdot FF - TF\cdot FT}{\sqrt{T\_\cdot F\_\cdot \_T\cdot \_F}}$, where we are assuming a rule $A\Rightarrow B$ and:
  • TT is how many times A is True and B is True
  • FF is how many times A is False and B is False
  • TF is how many times A is True and B is False
  • FT is how many times A is False and B is True
  • T_/F_ is how many times A is True/False
  • _T/_F is how many times B is True/False
Remark 13.6. In this case, if Φ =-1 there is a perfect negative correlation.
If Φ =1 there is a perfect positive correlation.
If Φ =0 the two itemsets are statistically independent.

Remark 13.7. The definitions of TT, FT, ... can be summarized as in the following table:
 | B | ¬B | Total
A | TT | TF | T_
¬A | FT | FF | F_
Total | _T | _F | 
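A small sketch computing confidence, lift and the Φ coefficient directly from the four counts of such a contingency table; the helper name is illustrative, and degenerate tables (a zero margin) are not handled.

import math

def rule_measures(TT, TF, FT, FF):
    n = TT + TF + FT + FF
    sup_A, sup_B, sup_AB = (TT + TF) / n, (TT + FT) / n, TT / n
    conf = sup_AB / sup_A
    lift = sup_AB / (sup_A * sup_B)
    phi = (TT * FF - TF * FT) / math.sqrt((TT + TF) * (FT + FF) * (TT + FT) * (TF + FF))
    return conf, lift, phi

# Example 13.6 below (Tea/Coffee): conf = 0.75, lift = 0.9375, phi is negative.
print(rule_measures(TT=150, TF=50, FT=650, FF=150))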
Example 13.6. Imagine we have the rule $\{Tea\}\Rightarrow\{Coffee\}$ with the following data:
 | Coffee | ¬Coffee | Total
Tea | 150 | 50 | 200
¬Tea | 650 | 150 | 800
Total | 800 | 200 | 1000
In this case, the confidence is $conf(\{Tea\}\Rightarrow\{Coffee\})=\frac{150}{200}=0.75$, which is a high confidence... but $sup(Coffee)=0.8$, which means that drinking tea in fact decreases the probability of drinking coffee!
Now, the lift is $lift(\{Tea\}\Rightarrow\{Coffee\})=\frac{0.15}{0.2\cdot 0.8}=0.9375<1$, which means that Tea and Coffee are negatively correlated. This insight is better than the one obtained by only looking at the value of the confidence.

Example 13.7. But let's now look at this example:
 | p | ¬p | Total
q | 880 | 50 | 930
¬q | 50 | 20 | 70
Total | 930 | 70 | 1000
 | r | ¬r | Total
s | 20 | 50 | 70
¬s | 50 | 880 | 930
Total | 70 | 930 | 1000
In this case $lift(\{p\}\Rightarrow\{q\})=1.02$ and $lift(\{r\}\Rightarrow\{s\})=4.08$, but $(p,q)$ appear together 88% of the time, while $(r,s)$ appear together only 2% of the time. Confidence is a better indicator in this case. The problem here is that $r,s$ appear in only a small portion of the records in the data.

13.4.2 How to choose a measure?

We have seen that we can obtain different conclusions by looking at different measures, so the choice of the measure is important because it will affect the results. A good choice must be based on a clear understanding of the measure and its properties, so we are aware of the flaws it entails and can leverage them for good.
We can also assess rules using interactive visualizations, subjective measures based on domain experience, etc.

14 Sequential pattern mining

In the last section, we were looking for patterns in a set-wise approach, but there are also situations in which the order in which elements are encountered is important. For example, it is a well-known fact that seeing something can make people want to buy it. This fact is used by supermarkets to strategically place the products so that customers face as many temptations as possible. From the side of the supermarket, it is interesting to know which zones of the supermarket are related in such a way that customers tend to go from one to another in a particular order. Imagine we detect that people who go to the desserts hall usually go to the fruit hall afterwards. Then, the supermarket is interested in placing the fruits as far from the desserts as possible, to maximize the time the customer spends looking at products that they did not want before, but might want now. It can also be used to detect sequences like:
With this objective in mind, sequential pattern mining was developed.
Definition 14.1. A sequence is an ordered list of elements. Each element is an unordered set of items.
A subsequence $s$ of a sequence $S$ is a sequence such that:
  • If $e\in s$ then there exists $E\in S$ such that $e\subseteq E$.
  • If $e_1 e_2 \dots e_k$ is the full sequence $s$, then there exist $k$ different elements $E_1,\dots,E_k\in S$ such that $e_i\subseteq E_i$, $i=1,\dots,k$, and $E_1$ goes before $E_2$ (maybe not directly), $E_2$ goes before $E_3$, and so on.
In other words, $s=[e_1,\dots,e_k]$ is a subsequence of $S=[E_1,\dots,E_m]$ if there are $k$ elements $E_{s_1},\dots,E_{s_k}\in S$ such that $e_i\subseteq E_{s_i}$, $i=1,\dots,k$, and $s_1<s_2<\dots<s_k$.
Definition 14.2. A sequence database is a database of sequences.
The support of a sequence $s$ is the fraction of sequences in the database $\mathcal{S}=\{S_1,\dots,S_N\}$ that contain $s$ as a subsequence.
Definition 14.3. The sequential pattern mining problem consists in, given a sequence database, finding all subsequences whose support is at least the specified minsup.
This problem is obviously similar and related to that of frequent pattern mining, but it is more complex. A naive algorithm that generates all possible patterns and counts them in the database would be highly inefficient, even more so than in the set-wise case. There are several methods that work much better. First, let's see how to generate candidates.
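The core operation of every method below is testing whether a candidate sequence s is a subsequence of a database sequence S in the sense of Definition 14.1. A minimal sketch, representing a sequence as a list of sets:

def is_subsequence(s, S):
    # Greedily match each element of s to a later element of S that contains it.
    pos = 0
    for e in s:
        while pos < len(S) and not set(e) <= set(S[pos]):
            pos += 1
        if pos == len(S):
            return False
        pos += 1
    return True

def support(s, database):
    # Fraction of sequences in the database that contain s as a subsequence.
    return sum(is_subsequence(s, S) for S in database) / len(database)

db = [[{"A", "B"}, {"C"}, {"A"}], [{"A", "B"}, {"B"}, {"C"}],
      [{"B"}, {"C"}, {"D"}], [{"B"}, {"A", "B"}, {"C"}]]
print(support([{"A", "B"}, {"C"}], db))   # 3 of the 4 sequences contain it: 0.75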

14.1 Candidate generation

The candidate generation follows the steps:
  1. Base case, $k=2$: merging two frequent 1-sequences $[\{i_1\}]$ and $[\{i_2\}]$ produces two candidate 2-sequences: $[\{i_1,i_2\}]$ and $[\{i_1\}\{i_2\}]$.
  2. General case, $k>2$: two frequent $(k-1)$-sequences $w_1$, $w_2$ are merged if the subsequence obtained by removing the first event in $w_1$ is the same as the subsequence obtained by removing the last event in $w_2$.
    1. The resulting candidate is the sequence $w_1$ extended with the last event of $w_2$:
      1. If the last two events in $w_2$ belong to the same element, then the last event in $w_2$ becomes part of the last element in $w_1$.
      2. Otherwise, the last event in $w_2$ becomes a separate element appended to the end of $w_1$.
      3. If both $w_1$ and $w_2$ consist of only one element, we generate the two variants (as in the base case).
Example 14.1. If we merge w 1 =[ { 1 } { 2,3 } { 4 } ] and w 2 =[ { 2,3 } { 4,5 } ] we will get the sequence w 3 =[ { 1 } { 2,3 } { 4,5 } ] , because the last two events in w 2 belong to the same element.
If we merge w 1 =[ { 1 } { 2,3 } { 4 } ] and w 2 =[ { 2,3 } { 4 } { 5 } ] we will get the sequence w 3 =[ { 1 } { 2,3 } { 4 } { 5 } ] .

14.2 Generalized Sequential Pattern (GSP) Algorithm

  1. Make the first pass over the database to obtain all the 1-element frequent sequences.
  2. Repeat until no new frequent sequences are found:
    1. Candidate generation: merge pairs of frequent subsequences found in the ( k-1 ) th pass to generate candidate sequences that contain k items.
    2. Candidate pruning: prune candidate k -sequences that contain infrequent ( k-1 ) -subsequences.
    3. Support counting: make a new pass over the sequence database to find the support for these candidate sequences.
    4. Candidate elimination: eliminate candidate k -sequences whose actual support is less than minsup .
Example 14.2. Let's perform GSP with the following database with minsup=2 .
Database
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{B} {A,B} {C}]
First, we compute C1 and F1:
C1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
[{D}] | 1
F1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
Now, we generate C2 by combining these subsequences. For example, merging [ { A } ] and [ { B } ] gives [ { A } { B } ] and [ { A,B } ] , but we also need to combine in the other possible order, obtaining [ { B } { A } ] in addition to these. Thus, we obtain:
C2
Subsequence | Count
[{A} {A}] | 1
[{A} {B}] | 1
[{A,B}] | 3
[{B} {A}] | 2
[{B} {B}] | 2
[{A} {C}] | 3
[{A,C}] | 0
[{C} {A}] | 1
[{C} {C}] | 0
[{B} {C}] | 4
[{B,C}] | 0
[{C} {B}] | 0
F2
Subsequence | Count
[{A,B}] | 3
[{B} {A}] | 2
[{B} {B}] | 2
[{A} {C}] | 3
[{B} {C}] | 4
Now, from F2 we can combine [ { A,B } ] with [ { B } { A } ] ,[ { B } { B } ] and [ { B } { C } ] , [ { B } { A } ] with [ { A,B } ] and [ { A } { C } ] , [ { B } { B } ] with [ { B } { A } ] and [ { B } { C } ] , [ { A } { C } ] with none and [ { B } { C } ] with none. Thus:
C3
Subsequence | Count
[{A,B} {A}] | 1
[{A,B} {B}] | 1
[{A,B} {C}] | 3
[{B} {A,B}] | 1
[{B} {A} {C}] | 1
[{B} {B} {A}] | 0
[{B} {B} {C}] | 2
F3
Subsequence | Count
[{A,B} {C}] | 3
[{B} {B} {C}] | 2
And that's it because we cannot combine anything else. Thus, the result is F1, F2 and F3.

14.3 Sequential PAttern Discovery using Equivalence classes (SPADE) Algorithm

  1. Transform the database into its vertical format, i.e., with the SeqID, the ElemID inside the sequence, and the element itself.
  2. Construct the ID-list of 1-sequences, i.e., build a table in which the elements are the columns and each cell contains all pairs (SeqID:ElemID) in which the element of the column appears.
  3. Count the distinct SeqIDs for each element. Those having at least minsup are kept and the rest are discarded.
  4. For $k>1$:
    1. Construct new $k$-candidates using the frequent $(k-1)$-subsequences (as in GSP). Prune when possible.
    2. Construct the ID-list of $k$-sequences as (SeqID: ElemID_1, ..., ElemID_k), where ElemID_i is the element of sequence SeqID in which the $i$-th event of the current subsequence appears.
    3. Count the distinct SeqIDs for each candidate. Those having at least minsup are kept and the rest are discarded.
Example 14.3. Let's repeat the example with SPADE. First, we transform the database to its vertical form:
Database
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{B} {A,B} {C}]
Vertical Database
SeqID | ElemID | Element
1 | 1 | {A,B}
1 | 2 | {C}
1 | 3 | {A}
2 | 1 | {A,B}
2 | 2 | {B}
2 | 3 | {C}
3 | 1 | {B}
3 | 2 | {C}
3 | 3 | {D}
4 | 1 | {B}
4 | 2 | {A,B}
4 | 3 | {C}
Now, we construct the ID-list of 1-sequences:
1-ID-List (SeqID:ElemID)
[{A}]: 1:1, 1:3, 2:1, 4:2 (count of distinct SeqIDs: 3)
[{B}]: 1:1, 2:1, 2:2, 3:1, 4:1, 4:2 (count: 4)
[{C}]: 1:2, 2:3, 3:2, 4:3 (count: 4)
[{D}]: 3:3 (count: 1)
F1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
Now the ID-list of 2-sequences, combining the 1-sequences:
2-ID-List (SeqID: ElemID_1, ElemID_2)
[{A} {A}]: 1:1,3 (count: 1)
[{A} {B}]: 2:1,2 (count: 1)
[{A,B}]: 1:1,1; 2:1,1; 4:2,2 (count: 3)
[{B} {A}]: 1:1,3; 4:1,2 (count: 2)
[{B} {B}]: 2:1,2; 4:1,2 (count: 2)
[{A} {C}]: 1:1,2; 2:1,3; 4:2,3 (count: 3)
[{A,C}]: (count: 0)
[{C} {A}]: 1:2,3 (count: 1)
[{C} {C}]: (count: 0)
[{B} {C}]: 1:1,2; 2:1,3; 2:2,3; 3:1,2; 4:1,3; 4:2,3 (count: 4)
[{B,C}]: (count: 0)
[{C} {B}]: (count: 0)
So
F2
Subsequence | Count
[{A,B}] | 3
[{B} {A}] | 2
[{B} {B}] | 2
[{A} {C}] | 3
[{B} {C}] | 4
And now we do the 3-ID-List combining these:
3-ID-List (SeqID: ElemID_1, ElemID_2, ElemID_3)
[{A,B} {A}]: 1:1,1,3 (count: 1)
[{A,B} {B}]: 2:1,1,2 (count: 1)
[{A,B} {C}]: 1:1,1,2; 2:1,1,3; 4:2,2,3 (count: 3)
[{B} {A,B}]: 4:1,2,2 (count: 1)
[{B} {A} {C}]: 4:1,2,3 (count: 1)
[{B} {B} {A}]: (count: 0)
[{B} {B} {C}]: 2:1,2,3; 4:1,2,3 (count: 2)
So
F3
Subsequence | Count
[{A,B} {C}] | 3
[{B} {B} {C}] | 2
And the result is (obviously) the same we got with GSP.

14.4 PrefixSpan

  1. Start by counting frequent 1-sequences, as in GSP.
  2. PrefixSpan extends each frequent sequence recursively.
  3. For each frequent 1-sequence $S$:
    1. Project the database with $S$ as a prefix, i.e., for each sequence, remove everything before the first occurrence of $S$.
    2. Count the possible expansions of $S$ with one additional event at the end.
    3. Repeat the same procedure for the resulting frequent 2-sequences, and so on, until no new frequent $k$-sequences are discovered.
Example 14.4. We are going to do the same example again. The database is
Database
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{B} {A,B} {C}]
with frequent 1-sequences
F1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
So we can start with A as a prefix. We project the database:
Database projected to {A}
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{A,B} {C}]
And we count possible 2-sequences starting with {A}:
C2 projected to {A}
Subsequence | Count
[{A} {A}] | 1
[{A} {B}] | 1
[{A,B}] | 3
[{A} {C}] | 3
[{A,C}] | 0
Now we select { A,B } as prefix and count its possible extensions:
Database projected to {A,B}
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
4 | [{A,B} {C}]
C3 projected to {A,B}
Subsequence | Count
[{A,B} {A}] | 1
[{A,B} {B}] | 1
[{A,B} {C}] | 3
Now we select {A,B} {C} as prefix and again count its possible extensions (in this case the projected database only has one record, so there is no need to count):
Database projected to {A,B} {C}
Id | Sequence
1 | [{A,B} {C} {A}]
And with { A } { C } :
Database projected to {A} {C}
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
4 | [{A,B} {C}]
C3 projected to {A} {C}
Subsequence | Count
[{A} {C,A}] | 0
[{A} {C} {A}] | 1
[{A} {C,B}] | 0
[{A} {C} {B}] | 0
[{A} {C} {C}] | 0
We would repeat the same with prefixes { B } and { C } , and the result has to be the same as with the two previous algorithms.

14.5 Some comments

Algorithm | How | Time | Space
GSP | Candidate generation | Counting against the DB | Space allocated for candidate generation
SPADE | Candidate generation | Counting against the ID-lists | Space allocated for the ID-lists
PrefixSpan | No candidate generation (prefix-based growth) | Counting against the projected DBs | Space allocated for the projected DBs (one possible optimization uses pointers)

Part VI Stream Data Mining

15 Stream data mining

A data stream is a constant flow of data, which usually carries too much information for it to be feasible to store. Somehow, we need to be able to apply data mining algorithms to datasets whose records we are only able to see once. This might seem esoteric, but there are plenty of applications with this kind of need: for instance, credit card transactions, wearable sensor information, connected vehicles, the Internet of Things, etc.
More precisely, the challenges that data streams pose are:

15.1 Bloom filter

The Bloom filter is an idea designed to answer the question:
'Has this incoming element ever occurred in the data stream before?'
The idea is to summarize the past of the data stream so that we can answer this question with a reasonable level of confidence. A Bloom filter is a data structure that consists of a binary array B of m bits (initialized to False) and a set of w independent hash functions h_1, ..., h_w, each mapping elements to positions in the array.
The algorithm developed to initialize this data structure is described in Algorithm 11. Basically, each element is hashed using all the hash functions, and the corresponding positions in the array are set to True.
B = array(length=m, type=bool, default=False)
repeat
	receive next element x in S
	for i = 1..w 
		idx = hi(x) 
		B[idx] = True
until S ends
return B
Algorithm 11: BloomConstruct(Stream S, Size m, Number of hash w)
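A minimal Python sketch of a Bloom filter with a combined query-and-insert operation; the two hash functions mirror Example 15.1 and are assumptions for illustration (real deployments use stronger, independent hash functions).

def make_bloom(m, hash_fns):
    return {"bits": [False] * m, "h": hash_fns}

def bloom_seen_and_add(bf, x):
    # Returns True if x was (probably) seen before; always records x afterwards.
    idxs = [h(x) for h in bf["h"]]
    seen = all(bf["bits"][i] for i in idxs)
    for i in idxs:
        bf["bits"][i] = True
    return seen

bf = make_bloom(10, [lambda k: k % 10, lambda k: (k + 7) % 10])
for x in [3, 8, 15, 3, 5]:
    print(x, bloom_seen_and_add(bf, x))
# 3, 8, 15 -> False (new); the second 3 -> True; 5 -> True (a false positive, as in the example)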
Now, when a new element $x\in S$ comes from the stream, we compute $h_1(x),\dots,h_w(x)$. If all the corresponding positions in the Bloom filter are True, then we are confident (if the hash functions are well designed) that the element has already been seen before. If any of them is False, we know for sure that the element has not been seen before, and we update the filter.
Example 15.1. Suppose we have the stream [ 3,8,15,3,5 ] , m=10 , h 1 ( k ) =k mod 10 and h 2 ( k ) =( k+7 ) mod 10 . The procedure would be:
  1. Start with an all-False Bloom filter: [F, F, F, F, F, F, F, F, F, F]
  2. 3 enters: $h_1(3)=3$ and $h_2(3)=0$. Both F, so 3 has never been seen before. Update B: [T, F, F, T, F, F, F, F, F, F]
  3. 8 enters: $h_1(8)=8$ and $h_2(8)=5$. Both F, so 8 has never been seen before. Update B: [T, F, F, T, F, T, F, F, T, F]
  4. 15 enters: $h_1(15)=5$ and $h_2(15)=2$. The first is T, but the second is F, so 15 has never been seen before. Update B: [T, F, T, T, F, T, F, F, T, F]
  5. 3 enters: $h_1(3)=3$ and $h_2(3)=0$. Both T, so we assume 3 has indeed been seen before. In this case we are right! B is not updated.
  6. 5 enters: $h_1(5)=5$ and $h_2(5)=2$. Both T, so we assume 5 has indeed been seen before. In this case we are wrong... it is a false positive.
Remark 15.1. Note how little by little the array gets populated by True values, making false positives increasingly likely.

15.2 Count-Min Sketch

In this case, we are interested in answering the question:
'How many times has this incoming element appeared in the data stream before?'
Again, the idea is to build a summary of the past data that enables us to answer the question. A count-min sketch is a data structure which consists of a $w\times m$ matrix CM of integer counters (initialized to 0) and $w$ hash functions $h_1,\dots,h_w$, one per row, each mapping elements to one of the $m$ columns.
The algorithm developed to initialize this data structure is described in Algorithm 12. Basically, each element is hashed using all the hash functions, and the corresponding counters are incremented by 1. There are $w$ rows to make collisions less harmful: the count of $x$ is overestimated only if, in every row $i$, some other element collides with $x$ under $h_i$. To estimate the number of times an element has appeared, we take the minimum of the accessed cells, because every cell is an upper bound on the true count and the minimum is the tightest one.
CM = Matrix(nrows=w, ncols=m, type=int, default=0)
repeat
	receive next element x in S
	for i=1..w
		idx = hi(x)
		CM[i,idx] += 1
until S ends
return CM
Algorithm 12: CountMinConstruct(Stream S, Width w, Height m)
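A minimal sketch of a count-min sketch with an update and a point query; the dimensions and hash functions follow Example 15.2 below and are purely illustrative.

def make_cm(w, m, hash_fns):
    return {"CM": [[0] * m for _ in range(w)], "h": hash_fns}

def cm_update(cm, x):
    for i, h in enumerate(cm["h"]):
        cm["CM"][i][h(x)] += 1

def cm_estimate(cm, x):
    # The minimum over the rows is the tightest upper bound on the true count.
    return min(cm["CM"][i][h(x)] for i, h in enumerate(cm["h"]))

cm = make_cm(2, 10, [lambda k: k % 10, lambda k: (2 * k) % 10])
for x in [3, 8, 15, 3, 5]:
    print(x, "seen", cm_estimate(cm, x), "times before")
    cm_update(cm, x)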
Example 15.2. Suppose we have the stream [ 3,8,15,3,5 ] , m=10 , h 1 ( k ) =k mod 10 and h 2 ( k ) =( 2k ) mod 10 . The procedure would be:
  1. Start with an all-0 count-min sketch:
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  2. 3 enters: $h_1(3)=3$ and $h_2(3)=6$. Both 0, so 3 has been seen 0 times before. Update CM:
     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
     [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
  3. 8 enters: $h_1(8)=8$ and $h_2(8)=6$. One cell is 1 and the other is 0, so 8 has never been seen before. Update CM:
     [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
     [0, 0, 0, 0, 0, 0, 2, 0, 0, 0]
  4. 15 enters: $h_1(15)=5$ and $h_2(15)=0$. Both 0, so 15 has never been seen before. Update CM:
     [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
     [1, 0, 0, 0, 0, 0, 2, 0, 0, 0]
  5. 3 enters: $h_1(3)=3$ and $h_2(3)=6$. One is 1 and the other is 2, so the minimum tells us it has been seen once before. Update CM:
     [0, 0, 0, 2, 0, 1, 0, 0, 1, 0]
     [1, 0, 0, 0, 0, 0, 3, 0, 0, 0]
  6. 5 enters: $h_1(5)=5$ and $h_2(5)=0$. Both 1, so we assume 5 has been seen once before. In this case we are wrong... it is a false positive. The table is updated again:
     [0, 0, 0, 2, 0, 2, 0, 0, 1, 0]
     [2, 0, 0, 0, 0, 0, 3, 0, 0, 0]
Remark 15.2. Note that if 15 went in again, we would fail again, and in fact from this point on we will always fail with the count of 15s and 5s seen.

15.3 Flajolet-Martin algorithm

The Flajolet-Martin algorithm aims at answering the question
'How many distinct elements appeared in the data stream before?'
The intuitive idea is as follows: if the hash function distributes the values evenly over its range, then the probability of getting any particular value after hashing is $\frac{1}{|range|}$. Thus, if we take the binary representation of the output, the last digit is 0 with probability $\frac{1}{2}$, because it has the same chance of being 0 or 1. Likewise, the output has its last two digits equal to 0 approximately 1 out of 4 times, and so on: the probability that the last $k$ digits are 0 is $\frac{1}{2^k}$, so if we see a value with $k$ trailing zeros, the number of distinct elements seen is expected to be about $2^k$.
The algorithm works as follows:
  1. Given a data stream $S$, use a hash function $h: S\to[0,2^L-1]\subset\mathbb{Z}$, where $L$ is chosen such that $2^L$ is larger than the number of distinct elements. $L$ is usually chosen to be 64.
  2. For each incoming record $x$, take the binary representation of $h(x)$.
  3. Count the number of trailing zeros in the binary representation of $h(x)$.
  4. Keep a variable $R_{max}$ with the maximum number of trailing zeros found so far.
  5. The expected maximum number of trailing zeros over all stream elements is $E[R_{max}]\approx\log_2(0.77351\,n)$, where $n$ is the number of distinct values. So, if we want to know how many distinct values we have seen so far, we estimate $n\approx\frac{2^{R_{max}}}{0.77351}$.
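A minimal sketch of this estimator; the hash function used below is the one from Example 15.3 and is only for illustration (in practice L = 64 and a much stronger hash is used). Returning 0 trailing zeros for a hash value of 0 is an assumption of this sketch.

def trailing_zeros(v):
    # Number of trailing zero bits in the binary representation of v.
    if v == 0:
        return 0
    tz = 0
    while v % 2 == 0:
        v //= 2
        tz += 1
    return tz

def flajolet_martin(stream, h):
    r_max = 0
    for x in stream:
        r_max = max(r_max, trailing_zeros(h(x)))
    return (2 ** r_max) / 0.77351

print(flajolet_martin([3, 8, 15, 3, 5], lambda x: (7 * x + 5) % 32))   # about 10.3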
Example 15.3. Suppose we have the stream [ 3,8,15,3,5 ] , h( x ) =( 7x+5 ) mod 32 .
  1. 3 enters: $h(3)=26=11010_2$. One trailing zero, so $R_{max}=1$.
  2. 8 enters: $h(8)=61\bmod 32=29=11101_2$. No trailing zeros, so $R_{max}=1$.
  3. 15 enters: $h(15)=14=01110_2$. One trailing zero, so $R_{max}=1$.
  4. 3 enters. $R_{max}=1$.
  5. 5 enters: $h(5)=8=01000_2$. Three trailing zeros, so $R_{max}=3$.
Thus, the estimation is $n\approx\frac{2^3}{0.77351}=10.3$, while the real value is 4. Note, nonetheless, that as we are working with probabilities, the results are not expected to be accurate when there are very few values.

Improvements to Flajolet-Martin

  1. It can happen that one of the first values has lots of trailing zeros, ruining the algorithm. A solution to avoid this is using several hash functions h 1 ,..., h w and keep the maximum number of trailing zeros for each of them R max,1 ,..., R max,w . At the end, we would obtain R max as the average of all these.
  2. It is also possible to have several points in which the records are taken, and thus we would like to synchronize the results from every of them. The solution is to treat all these points as if they were only one. Let's explain this a little bit. Imagine we have m measure points, where m streams are measured, one stream per measure point. Then, we would have m values R max,i for i=1,...,m and we want to obtain the combined R max . The idea is that, if we think of all the m streams as a single stream which has been divided, then we would just take R max as the maximum number of trailing zeros observed in the stream. Thus, when it is divided, if we know the maximum number of trailing zeros in each of the substreams, we also know the maximum number of trailing zeros in the whole stream: the maximum among the substreams! Thus, we would take R max = max i=1,...,m { R i } . Each of the substreams would work exactly as explained before (maybe with the Improvement 1 implemented).

15.4 Hyperloglog

Hyperloglog is a generalization of Flajolet-Martin, which tries to improve the predictions to answer the same question of how many different values have been seen in the stream before.
The idea is that we use $k=2^b$ buckets to track the trailing zeros observed in the incoming records. For an incoming record $x$, we compute $h(x)$ and take its binary representation; the bucket it falls into is the one numbered with the first $b$ bits of that representation. In each bucket, we keep the maximum number of trailing zeros, $R_{max,i}$, over the records classified into it. At the end, as explained by Flajolet et al. in [2], the estimation is computed as $n\approx\frac{k\cdot HM_{i=1}^{k}(2^{R_{max,i}})}{0.77351}$, where $HM_{i=1}^{k}(x_i)=\frac{k}{\frac{1}{x_1}+\dots+\frac{1}{x_k}}$ is the harmonic mean.
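A minimal sketch of this bucketed estimate, reusing the trailing_zeros helper from the Flajolet-Martin sketch above; taking the bucket index from the top b bits of an L-bit hash value follows the description above, and the parameter names are illustrative.

def hyperloglog_estimate(stream, h, b, L):
    k = 2 ** b
    R = [0] * k
    for x in stream:
        v = h(x)
        bucket = v >> (L - b)            # first b bits of the L-bit hash value
        R[bucket] = max(R[bucket], trailing_zeros(v))
    harmonic_mean = k / sum(1.0 / (2 ** r) for r in R)
    return k * harmonic_mean / 0.77351

# Example 15.4: 5-bit hash values, 2 buckets.
print(hyperloglog_estimate([3, 8, 15, 3, 5], lambda x: (7 * x + 5) % 32, b=1, L=5))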
Example 15.4. Suppose we have the stream [ 3,8,15,3,5 ] , h( x ) =( 7x+5 ) mod 32 and there are two buckets.
  1. 3 enters: $h(3)=26=11010_2$. Bucket 1, $R_1=1$.
  2. 8 enters: $h(8)=29=11101_2$. Bucket 1, $R_1=1$.
  3. 15 enters: $h(15)=14=01110_2$. Bucket 0, $R_0=1$.
  4. 3 enters. Bucket 1, $R_1=1$.
  5. 5 enters: $h(5)=8=01000_2$. Bucket 0, $R_0=3$.
Thus, the estimation is $n\approx\frac{2\cdot HM(2^3,2^1)}{0.77351}=\frac{2\cdot 3.2}{0.77351}\approx 8.3$.

Part VII Outlier mining

16 Outlier Mining

Definition 16.1. An outlier is a data object that deviates significantly from the normal objects as if it were generated by a different underlying mechanism.
Remark 16.1. Note that an outlier is not the same as noise in the data. Noise is due to random errors or the variance in a measured variable and for a good outlier analysis, it is required that noisy records are removed first.
Outliers are far more interesting than noise, because they are generated by a different mechanism than the usual data.
As we saw in stream data mining, data can vary over time, and in the early stages of a change process, the new records would be seen as outliers from the past ones. But little by little we would notice that they are not outliers, they are the result of a change in the underlying process. Thus, outlier detection is also a part of the novelty detection process.

Remark 16.2. Some use cases are:

16.1 Types of outliers

16.2 Challenges of outlier detection

16.3 Supervised methods for outlier detection

We can try to model outlier detection as a classification problem, in which the samples are examined and classified by domain experts and are later used for training and testing models.

16.3.1 One-Class model

The idea of the one-class model is to train a classification model that can distinguish normal data from outliers. It works by learning the decision boundary of the normal class, which requires a sample of normal data that is as representative as possible. Then, all samples that lie outside this boundary are labeled as outliers.
It has the advantage that it can detect new outliers that are different from past outliers.
It is possible to extend the model to a multi-class model, in which the normal objects might be also classified into multiple classes.
image: 35_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_outliers_3.png
Figure 6: Basic diagram of a one-class model.

16.4 Unsupervised methods for outlier detection

If we assume that the normal objects are somehow clustered into multiple groups, each of them having some distinctive features, we can expect outliers to be far away from any group of normal records.
This approach has the drawback that it is very hard to detect collective outliers effectively.

16.4.1 Proximity-based methods

In this case, a record is considered an outlier if the nearest neighbors of the object are far away. The effectiveness of these methods relies on the proximity measure used, and it is often hard to find groups of outliers that are close to each other.

Distance-based outlier detection

Let $r>0$ be a distance threshold and $\tau\in(0,1]$ be a fraction threshold. An object $o$ is a $DB(r,\tau)$-outlier if $\frac{card\{o'\mid dist(o,o')\le r\}}{card(D)}\le\tau$. So we count how many objects are within distance $r$ of $o$, and if these account for less than the predefined fraction $\tau$, then $o$ is considered an outlier.
If we do it just like this, we test each object against the whole data set one by one. This can be very costly for big datasets, and it can be improved with the CELL method, which is a grid-based method. In this case, the data space is partitioned into a multidimensional grid, in which each cell is a hypercube with diagonal length $\frac{r}{2}$. This allows us to improve the efficiency of the method by using two pruning rules:
  1. Definitions: Let $C$ be the cell whose objects we want to assess:
    1. The level-1 cells are the cells adjacent to $C$.
    2. The level-2 cells are the further cells that still have some point within distance $r$ of $C$.
  2. Cell $C$: since the diagonal is $\frac{r}{2}$, all objects in cell $C$ are within distance $r$ of each other. Call $a=card(C)$.
  3. Level-1 cell pruning rule: let $b_1$ be the number of objects in the level-1 cells. If $a+b_1>\tau n$, then no object $o$ in $C$ is a $DB(r,\tau)$-outlier, because all objects in $C\cup\text{level-1}$ are in the $r$-neighborhood of $o$ and there are more than $\tau n$ of them.
  4. Level-2 cell pruning rule: let $b_2$ be the number of objects in the level-2 cells. If $a+b_1+b_2<\tau n+1$, then all objects in $C$ are $DB(r,\tau)$-outliers.
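For reference, here is a naive sketch of the DB(r, τ) test itself (without the CELL pruning), assuming a small in-memory dataset and Euclidean distance; note that each object is counted in its own neighborhood here.

import math

def db_outliers(D, r, tau):
    # An object o is a DB(r, tau)-outlier if the fraction of objects within
    # distance r of o is at most tau.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    n = len(D)
    return [o for o in D if sum(dist(o, p) <= r for p in D) / n <= tau]

D = [(0, 0), (0.5, 0.2), (0.3, 0.4), (10, 10)]
print(db_outliers(D, r=1.0, tau=0.3))   # the isolated point (10, 10)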

16.4.2 Density-based methods

The idea in this case is that the density around an outlier object is significantly different from the density around its neighbors. Thus, we can use the relative density of an object against its neighbors as the indicator of the degree of the object being an outlier.
For this, we define the $k$-distance of an object $o$, $dist_k(o)$, as the distance between $o$ and its $k$-th nearest neighbor, and the $k$-distance neighborhood of $o$ as $N_k(o)=\{o'\mid o'\in D,\ dist(o,o')\le dist_k(o)\}$. Note that this set can have more than $k$ elements, because multiple elements can be at identical distance from $o$. The ratio $\frac{k}{card(N_k(o))}$ is then an indicator of the density around the point $o$, which can be compared with that of its neighbors to decide whether $o$ is an outlier or not.

16.4.3 Clustering-based methods

The idea now is to cluster the data and classify an object as an outlier if:
  1. It does not belong to any cluster.
  2. There is a large distance between the object and its closest cluster.
  3. It belongs to a small or sparse cluster.
This approach presents several advantages:
But it also has some drawbacks:

References

1Charu C. Aggarwal, Data Mining (Springer International Publishing, 2015).
2Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier, "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm", in Discrete Mathematics and Theoretical Computer Science (2007), pp. 137--156.
3Mahmoud Sakr, "INFOH423 Data Mining".
4Ian Witten, Eibe Frank, and Mark Hall, Data Mining: Practical Machine Learning Tools and Techniques (Elsevier, 2011).