INFOH423 - Data Mining

Fall 2022
Jose Antonio Lorencio Abril
image: 0_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Da___x-Universit___libre_de_Bruxelles__logo__svg.png
Professor: Mahmoud Sakr
Student e-mail: jose.lorencio.abril@ulb.be

This is a summary of the course Data Mining, taught at the Université Libre de Bruxelles by Professor Mahmoud Sakr in the academic year 22/23. Most of the content of this document is adapted from the course notes by Sakr, [3], and the basic bibliographic source of the course, the book of Aggarwal, [1], so I won't be citing it all the time. Other references will be provided when used.

Table of Contents

Part I Introduction
Part II Classification
Part III Model validation and data preparation
Part IV Clustering
Part V Frequent pattern and association rule mining
Part VI Stream Data Mining
Part VII Outlier mining

List of Figures

Figure 1: Holdout visualization.
Figure 2: Cross-Validation visualization. m=3.
Figure 3: A global outlier.
Figure 4: A context outlier.
Figure 5: Collective outliers.
Figure 6: Basic diagram of a one-class model.

List of Tables

Table 1: Records and multidimensional data set example
Table 2: The weather data. Source: [4].

List of Algorithms

Algorithm 1: GenericDecisionTree(Dataset: D)
Algorithm 2: ID3(I, O, T) : Decision Tree
Algorithm 3: DP_Edit(s1, s2)
Algorithm 4: Generic k-representative approach (Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Algorithm 5: k-medoids(Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Algorithm 6: GenericGrid(Data D, Ranges p, Density tau) : clusters C
Algorithm 7: DBSCAN(Data D, Radius eps, Density tau) : clusters C
Algorithm 8: Apriori(Transactions T, Minimum Support minsup)
Algorithm 9: Apriori_improved(Transactions T, Minimum support minsup)
Algorithm 10: FP-Growth(FP-Tree FPT, Minimum Support minsup, Current Suffix P)
Algorithm 11: BloomConstruct(Stream S, Size m, Number of hash w)
Algorithm 12: CountMinConstruct(Stream S, Width w, Height m)

Part I Introduction

1 What is data mining?

Although there is no formal definition of data mining accepted by everyone, most definitions agree that data mining is a field of study focused on the collection, cleaning, processing and analysis of data, and on gaining useful insights from the data we have access to.
The data mining domain is as wide as these topics, and it is nowadays a really hot topic both in academia and industry.
In academia, there exist many goals to achieve in this field:
In industry, the focus is mainly on two areas (disregarding research companies, whose objectives are similar to those of academia):
These uses aim at improving the business processes and, ultimately, maximizing profit.

1.1 Why is data mining important today, if it was not yesterday?

Because computing power has increased enormously: we can now run algorithms that train models in a reasonable time, a task that was practically infeasible only a few years ago.
Not only that, but data technology is rapidly evolving, too. We produce more data and store more data, so... we have more data! This data is potential knowledge, and we know many tools to extract this knowledge from it.
In fact, the amount of data is so vast that we have to develop techniques not only to analyze it, but also to manage such huge amounts of data.

2 The Data Mining Process

The data mining process is a pipeline constructed around the basic steps:
  1. Data collection: obtaining data from real-world sources.
  2. Feature extraction and data cleaning: among all the data retrieved from the real world, we have to select the characteristics or features that are relevant for our purposes (feature extraction) and decide what to do with erroneous/noisy/missing data (data cleaning). Also, as we might be collecting data from different sources, it is important to decide how to aggregate the data into a unified format for later processing.
  3. Analytical processing and algorithms: we now must develop suitable methods for analyzing our data. That is, we should decide which mathematical models to use to describe our data, and which algorithms to use in order to train the desired model. (We will see precisely what training means; for now, think of it as the way a general model is adapted to our particular data.)
  4. After these steps, we enter a phase of analyzing the obtained results to extract knowledge from the data. The process is also iterative: we can return to any previous step and try to improve different parts of the pipeline. (Since data is usually constantly updated, at the very least correctness checks and updates will be necessary.)

3 Data Types

Not all data have the same nature or characteristics, so it is important to understand the differences between them and which techniques are applicable to which types of data.
We can characterize data in different levels of detail, and the most basic classification would distinguish between nondependency-oriented data and dependency-oriented data:
Dependency-oriented data are normally more complex to study because of the need to study not only the data itself, but also the relationships between different data items.
We will now define different subtypes of data that we can find.

3.1 Nondependency-oriented data

The term nondependency-oriented data is interchangeable with the term multidimensional data:
Definition 3.1. We call source space, $\mathcal{S}$, the set of all possible values that our data can take. This set does not necessarily have any particular form.
If S is a product space, then each component is called a feature.
Example 3.1. If we are measuring the name, age, height and gender of the students of a school, then we will have $\mathcal{S} = T \times \mathbb{N}_{[0,150]} \times (0,3) \times \{M,F\}$, where $T$ represents the set of possible names and $\{M,F\}$ are the two possible values of the gender.
Definition 3.2. A record (data point, instance, tuple) is just a point $X = (x_i)_{i=1}^{d} \in \mathcal{S}$ that we measure and store in some form.
Example 3.2. Following the previous example, some records of S are represented in the following table:
index | Name  | Age | Height | Gender
1     | Josh  | 23  | 1.77   | M
2     | Mary  | 28  | 1.62   | F
3     | Larry | 12  | 1.58   | M
Table 1: Records and multidimensional data set example
Note that we added an index column, because it is a common practice.
Definition 3.3. A multidimensional data set, $\mathcal{D}$, is a set of $n$ records, $\{X_j\}_{j=1}^{n}$.
Example 3.3. Table 1 is also an example of a data set.
As we can see, Table 1 contains attributes of different types (which is obvious from the definition of the source space $\mathcal{S}$). Thus, we also have to take into account the type of each attribute of our data:

3.2 Dependency-oriented data

As outlined before, we can find implicit or explicit dependencies between instances:
As before, let us look a bit deeper into some types of this kind of data:

Part II Classification

A classification problem consists in learning the structure of a dataset of examples that is already partitioned into groups, referred to as classes. This learning is typically achieved with a model, which is used to estimate the class labels of unseen data examples with unknown labels. Thus, one of the inputs to the classification problem is the example dataset with known labels, $\mathcal{D}$, called the training data, while the unseen data points to be classified are the test data. The learnt model is referred to as the training model. The algorithm used to create the model is the learner.
The output of the classification algorithm can be of two types:

4 Decision Trees

Decision trees are a classification methodology that uses a tree structure to partition the feature space. Each node of the tree represents a decision made according to the data, called the split criterion, which is a condition on one or more feature variables of the training data.
The goal is to identify a split criterion such that the level of mixing of the class variables in each branch of the tree is reduced as much as possible.
The splits can be univariate, if they use a single attribute in the condition, or multivariate, if more than one attribute is used in the condition.
The nodes can be of two types:
The general algorithm for constructing a decision tree is as follows:
begin
	Create root node containing D;
	repeat
		Select an eligible node in tree;
		Split the selected node into two or more nodes based on the split criterion;   (*)
	until no more eligible nodes for split;
	Prune overfitting nodes from tree;   (*)
	Label each leaf node with its dominant class;
end
Algorithm 1: GenericDecisionTree(Dataset: D)
The lines marked (*) indicate what changes across the different algorithms to produce a specific decision tree.

4.1 Split criteria

The split criterion aims to maximize the separation of the different classes among the children nodes. Its design depends on the attributes of the data:
These methods require determining the best split among a set of candidate splits, so we need a way to measure whether one split is better than another.
To this end, we use the entropy.

4.1.1 Entropy

Definition 4.1. Let $p_j$ be the fraction of data points belonging to class $j$ among the data points taking the attribute value $v_i$. Then, the class-based entropy, $E(v_i)$, for the attribute value $v_i$ is $E(v_i) = -\sum_{j=1}^{k} p_j \log_2(p_j)$.
Remark 4.1. When $p_j = 0$, it is assumed that $p_j \log_2(p_j) = 0$.

Remark 4.2. $E(v_i) \in [0, \log_2 k]$.

Remark 4.3. Higher values of the entropy imply greater mixing of different classes, while a value of 0 implies perfect separation.
Definition 4.2. The overall entropy of an attribute, $E$, is defined as the weighted average over its $r$ different attribute values: $E = \sum_{i=1}^{r} \frac{n_i}{n} E(v_i)$, where $n_i$ is the frequency of attribute value $v_i$ and $n = \sum_{i=1}^{r} n_i$.
The entropy is used in the ID3 algorithm for constructing decision trees.
The overall entropy for an $r$-way split of a set $S$ into sets $S_1,\dots,S_r$ may be computed as the weighted average of the entropy values of the sets $S_k$, where the weight of $S_k$ is $\frac{|S_k|}{|S|}$. This is called the entropy-split: $\text{Entropy-Split}(S \Rightarrow S_1,\dots,S_r) = \sum_{k=1}^{r} \frac{|S_k|}{|S|} E(S_k)$.
In relation to this, the information gain is defined as the reduction of entropy due to the split: $IG(S \Rightarrow S_1,\dots,S_r) = E(S) - \text{Entropy-Split}(S \Rightarrow S_1,\dots,S_r)$. Note that lower values of the entropy-split and higher values of the information gain are more desirable.
Sometimes an attribute has a lot of distinct values, so splitting on it reduces the entropy a lot, yet it is not very useful for prediction (think, for example, of an ID attribute; see Subsubsection 4.2.1). To account for this, we can divide the overall information gain by the normalization factor $-\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2\left(\frac{|S_i|}{|S|}\right)$, which adjusts for the varying number of categorical values.
Example 4.1. Entropy of the dataset 'Weather data'
Consider the following dataset:
outlook  | temperature | humidity | windy | play
sunny    | hot         | high     | false | no
sunny    | hot         | high     | true  | no
overcast | hot         | high     | false | yes
rainy    | mild        | high     | false | yes
rainy    | cool        | normal   | false | yes
rainy    | cool        | normal   | true  | no
overcast | cool        | normal   | true  | yes
sunny    | mild        | high     | false | no
sunny    | cool        | normal   | false | yes
rainy    | mild        | normal   | false | yes
sunny    | mild        | normal   | true  | yes
overcast | mild        | high     | true  | yes
overcast | hot         | normal   | false | yes
rainy    | mild        | high     | true  | no
Table 2: The weather data. Source: [4].
If the class attribute is play, what is the entropy of this source?
$E(\text{play}) = -p_{yes}\log_2(p_{yes}) - p_{no}\log_2(p_{no}) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.94$. What is the Entropy-Split of the attribute humidity? $ES(S \Rightarrow S_{hum=high}, S_{hum=normal}) = \frac{|S_{hum=high}|}{|S|}E(play|hum=high) + \frac{|S_{hum=normal}|}{|S|}E(play|hum=normal) = \frac{7}{14}\left[-\frac{3}{7}\log_2\left(\frac{3}{7}\right) - \frac{4}{7}\log_2\left(\frac{4}{7}\right)\right] + \frac{7}{14}\left[-\frac{6}{7}\log_2\left(\frac{6}{7}\right) - \frac{1}{7}\log_2\left(\frac{1}{7}\right)\right] = 0.7885$. And the information gain? $IG(S \Rightarrow S_{hum=high}, S_{hum=normal}) = E(play) - ES(S \Rightarrow S_{hum=high}, S_{hum=normal}) = 0.94 - 0.79 = 0.15$.
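These computations are easy to script. Below is a minimal Python sketch that reproduces E(play), the Entropy-Split of humidity and the corresponding information gain on the weather data; the helper names are ours, not part of any library.
from collections import Counter
from math import log2

# rows of Table 2: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", "false", "no"), ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"), ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
CLASS = 4  # index of the class attribute 'play'

def entropy(rows):
    # class-based entropy of a set of records
    n = len(rows)
    counts = Counter(r[CLASS] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def entropy_split(rows, attr):
    # weighted average of the entropies of the subsets induced by attribute 'attr'
    n = len(rows)
    values = {r[attr] for r in rows}
    return sum(len(sub) / n * entropy(sub)
               for v in values
               for sub in [[r for r in rows if r[attr] == v]])

def information_gain(rows, attr):
    return entropy(rows) - entropy_split(rows, attr)

print(round(entropy(data), 4))              # 0.9403
print(round(entropy_split(data, 2), 4))     # humidity: 0.7885
print(round(information_gain(data, 2), 4))  # 0.1518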

4.2 ID3 Tree Induction Algorithm

ID3 is an algorithm to construct decision trees in which the split criterion is the maximization of the information gain and the stopping criterion is that all the records in a node belong to the same class. The algorithm is detailed in Algorithm 2.
Algorithm 2: ID3(I, O, T) : Decision Tree
# I is the set of input attributes
# O is the output attribute
# T is a set of training data

if (T is empty) then
	return a single node with value “Failure”

if (all records  in T have the same value for O) then
	return a single node with that value

if (I is empty) then
	return a single node with the most frequent value of O in T

# else
compute IG for each attribute in I using data in T

Let X = argmax{IG(attr) for attr in I}
Let {x_j for j=1,...,m} be the values of X
Let {T_j for j=1,...,m} be the subsets of T when partitioned according to the value of X

return a tree with:
	root node labelled X
	arcs labelled x_1,...,x_m
	connected to 
	ID3(I-{X}, O, T_1),...,ID3(I-{X}, O, T_m)
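As a complement, here is a compact Python sketch of this pseudocode. It reuses the data, CLASS and information_gain helpers from the snippet in Subsubsection 4.1.1, assumes categorical attributes, and returns either a class label (leaf) or a pair (split attribute index, branches); it is a didactic sketch, not a production implementation.
from collections import Counter

def id3(rows, attrs):
    # rows: list of tuples; attrs: indices of the input attributes still available
    if not rows:
        return "Failure"
    classes = [r[CLASS] for r in rows]
    if len(set(classes)) == 1:            # all records share the same class
        return classes[0]
    if not attrs:                         # no attributes left: most frequent class
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a))   # maximize IG
    branches = {}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best])
    return (best, branches)

print(id3(data, [0, 1, 2, 3]))   # the root splits on attribute 0 (outlook)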
Example 4.2. Compute the decision tree of the data in Table 2 using the ID3 algorithm.
Step 1: Root
We start by computing IG for each attribute. To simplify notation, let P=play,O=outlook,T=temperature,H=humidity and W=windy .
We already know that $E(P) = 0.94$ and $IG(S \Rightarrow S_{H=high}, S_{H=normal}) = 0.15$. Let's compute the rest of the values: $ES(S \Rightarrow S_{O=sunny}, S_{O=overcast}, S_{O=rainy}) = 2\cdot\left[-\frac{5}{14}\left(\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right)\right] + \frac{4}{14}\cdot 0 = 0.69$, where the factor 2 accounts for the sunny and rainy subsets (which have the same class distribution) and the 0 is the entropy of the pure overcast subset. This implies that $IG(S \Rightarrow S_{O=sunny}, S_{O=overcast}, S_{O=rainy}) = 0.25$.
Repeating this process with temperature, we get $IG(S \Rightarrow S_{T=hot}, S_{T=mild}, S_{T=cool}) = 0.03$, and with windy, we get $IG(S \Rightarrow S_{W=true}, S_{W=false}) = 0.05$.
This means that we label the root node with X=O and we create three arcs, each of them with one of the values from O . So we have the following Tree:
image: 1_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado1.png
Step 2: Outlook=sunny
Now, we are going to do the same thing, restricting ourselves to the records for which Outlook=sunny. Now, we have to recompute the entropy and the gain for each of the rest of the attributes.
Let's start with the entropy: $E(P|O=sunny) = -\left[\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right] = 0.97$. Now, the entropy-split of humidity: $ES(S_{O=sunny} \Rightarrow S_{O=sunny,H=high}, S_{O=sunny,H=normal}) = \frac{3}{5}E(P|O=sunny,H=high) + \frac{2}{5}E(P|O=sunny,H=normal) = -\frac{3}{5}\left[\frac{3}{3}\log_2\frac{3}{3} + 0\right] - \frac{2}{5}\left[\frac{2}{2}\log_2\frac{2}{2} + 0\right] = 0$, which means that $IG(S_{O=sunny} \Rightarrow S_{O=sunny,H=high}, S_{O=sunny,H=normal}) = 0.97$. As this cannot be improved, we can safely skip computing the rest of the values. This way, the tree will now look as follows:
image: 2_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado2.png
Step 3: Outlook=Sunny, Humidity=Normal
Note that we are proceeding depth-first, but doing this breadth-first is also possible.
This time, all the values of P are Yes, so we enter the second if of the algorithm and label the node as Yes.
Step 4: Outlook=Sunny, Humidity=High
Same, now P=No .
Step 5: Outlook=Overcast
Same, now P=Yes .
So, now we have the following tree:
image: 3_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado3.png
Step 6: Outlook=rainy
$E(P|O=rainy) = 0.97$ and $IG(S_{O=rainy} \Rightarrow S_{O=rainy,W=true}, S_{O=rainy,W=false}) = 0.97$, so, again, the gain is maximal and we split on windy:
image: 4_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado4.png
Step 7: Outlook=rainy, Windy=False
All records have P=Yes .
Step 8: Outlook=rainy, Windy=True
All records have P=No.
So, we have the tree
image: 5_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado5.png
And as there are no more nodes to analyze, this is the final decision tree.

4.2.1 The problem of UID

In general, attributes with very many values have very high gain, but can lead to useless decision trees. Quinlan suggests choosing the attribute with the highest $\text{GainRatio}(X,S) = \frac{\text{Gain}(S \Rightarrow S_1,\dots,S_r)}{-\sum_{i=1}^{r}\frac{|S_i|}{|S|}\log_2\left(\frac{|S_i|}{|S|}\right)}$, where $X$ is the candidate split attribute and the denominator is the entropy of the distribution of the values of $X$ over $S$ (the normalization factor introduced above).
The GainRatio favors attributes with higher gain, and punishes attributes with high entropy (many values).
Example 4.3. Repeat the decision tree ID3 algorithm, but use GainRatio instead.
The same tree is obtained.

5 Bayesian classification

Probabilistic classifiers construct a model that quantifies the relationships between the feature variables and the target variable as a probability. We are going to study a well-known kind of probabilistic classifier, namely the Bayesian classifier or Naive Bayes classifier.

5.1 Naive Bayes classifier

The Bayes classifier is based on the Bayes' theorem for conditional probabilities.
Theorem 5.1. Bayes' Theorem
If A and B are probabilistic events and $P(B) \neq 0$, then $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
Proof. On one side, $P(A|B) = \frac{P(A \cap B)}{P(B)}$.
On the other side, if $P(A) \neq 0$, then $P(B|A) = \frac{P(B \cap A)}{P(A)}$, so $P(B \cap A) = P(B|A)P(A)$. Substituting this value in the previous equation, we get $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$, which is the result we wanted to prove.
If $P(A) = 0$, then $P(A \cap B) = 0$ and thus $P(A|B) = 0$ for every $B$, so the formula also holds.
This theorem quantifies the conditional probability of a random variable (the class variable), given known observations about other variables (the features). Let $C$ be the class variable and $X$ an unseen feature tuple. The goal of the method is to estimate $P(C=c \mid X=(a_1,\dots,a_d))$. Let the random variables for the individual dimensions of $X$ be denoted by $X=(x_1,\dots,x_d)$, so we want to estimate $P(C=c \mid x_1=a_1,\dots,x_d=a_d) \overset{\text{Bayes}}{=} \frac{P(C=c)\,P(x_1=a_1,\dots,x_d=a_d \mid C=c)}{P(x_1=a_1,\dots,x_d=a_d)}$. We are interested in maximizing this value over $c$. Since the denominator is the same independently of the class $c$, we can focus on maximizing the numerator $P(C=c)\,P(x_1=a_1,\dots,x_d=a_d \mid C=c)$. The value $P(C=c)$ is the prior probability of the class identifier $c$ and can be estimated as the fraction of points in the data whose class is $c$. Thus, it remains to approximate the other factor. In the Naive Bayes approach, it is assumed that the feature values are independent of one another conditionally on a fixed value of $C$. Then $P(x_1=a_1,\dots,x_d=a_d \mid C=c) = \prod_{j=1}^{d} P(x_j=a_j \mid C=c)$, so $P(C=c)\,P(x_1=a_1,\dots,x_d=a_d \mid C=c) = P(C=c)\prod_{j=1}^{d} P(x_j=a_j \mid C=c)$. These terms are much easier to estimate: to estimate $P(x_j=a_j \mid C=c)$ we just take the training records of class $c$ and compute which fraction of them verifies $x_j=a_j$. This is usually written as $P(x_j=a_j \mid C=c) = \frac{q(a_j,c)}{r(c)}$.
Remark 5.1. When there are not enough training samples to produce reliable estimates, we can use Laplacian smoothing, in which a small value $\alpha$ is added to the numerator and $\alpha m_j$ is added to the denominator, where $m_j$ is the number of distinct values of the $j$-th attribute: $P(x_j=a_j \mid C=c) = \frac{q(a_j,c)+\alpha}{r(c)+\alpha m_j}$. $\alpha$ is called the Laplacian smoothing parameter.

Remark 5.2. If a feature is continuous, then the likelihood is computed using a Gaussian distribution with mean $\mu_c$ and standard deviation $\sigma_c$: $P(x_j=a_j \mid C=c) = g(a_j,\mu_c,\sigma_c) = \frac{1}{\sqrt{2\pi}\,\sigma_c} e^{-\frac{(a_j-\mu_c)^2}{2\sigma_c^2}}$.
This model is sometimes referred to as the Bernoulli model for Bayes classification when it is applied to categorical data in which each feature attribute has only two outcomes.
In cases where more than two outcomes are possible for a feature variable, the model is referred to as the generalized Bernoulli model.
Example 5.1. With the following training data:
Age    | Income | Student | Credit_Rating | Buys
≤30    | high   | no      | fair          | no
≤30    | high   | no      | excellent     | no
31..40 | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31..40 | low    | yes     | excellent     | yes
≤30    | medium | no      | fair          | no
≤30    | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
≤30    | medium | yes     | excellent     | yes
31..40 | medium | no      | excellent     | yes
31..40 | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
Classify with the Naive Bayes model the unseen record $X = (\leq 30, medium, yes, fair)$. In this case, we have $C \in \{yes, no\}$, so $P(C=yes) = \frac{9}{14}$ and $P(C=no) = \frac{5}{14}$. For the feature variables, we have to compute $P(x_i=a_i \mid C=yes)$ and $P(x_i=a_i \mid C=no)$:
Then, we have $P(C=yes)\,P(\leq 30, medium, yes, fair \mid C=yes) = P(C=yes)\prod_i P(x_i=a_i \mid C=yes) = \frac{9}{14}\cdot\frac{2}{9}\cdot\frac{4}{9}\cdot\frac{6}{9}\cdot\frac{6}{9} = \frac{16}{567} \approx 0.028$ and $P(C=no)\,P(\leq 30, medium, yes, fair \mid C=no) = \frac{5}{14}\cdot\frac{3}{5}\cdot\frac{2}{5}\cdot\frac{1}{5}\cdot\frac{2}{5} = \frac{6}{875} \approx 0.007$. Thus, the model classifies $X$ with class 'yes'.
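A minimal Python sketch of this computation, with the table above hard-coded ('≤30' written as '<=30'); no smoothing is applied and all attributes are treated as categorical.
# rows: (age, income, student, credit_rating, buys)
rows = [
    ("<=30", "high", "no", "fair", "no"),   ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),   (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),  (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_score(x, c):
    # P(C=c) * prod_j P(x_j = a_j | C=c), both estimated by counting
    of_class = [r for r in rows if r[-1] == c]
    score = len(of_class) / len(rows)                       # prior P(C=c)
    for j, a in enumerate(x):
        score *= sum(1 for r in of_class if r[j] == a) / len(of_class)
    return score

x = ("<=30", "medium", "yes", "fair")
for c in ("yes", "no"):
    print(c, round(naive_bayes_score(x, c), 4))   # yes ~0.0282, no ~0.0069 -> 'yes'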

Some final comments

The advantages of the Bayes classifier are that it is easy to implement and can lead to fairly good results.
The drawback is that the main assumption, class-conditional independence, is often unrealistic, which may lead to a loss of accuracy, because dependencies do exist among variables. To deal with this issue there is a more complex model: Bayesian Belief Networks.

6 Model evaluation and selection

With a given dataset, we can train multiple classification models and each of them will behave differently, many times without an intuitive explanation of the differences observed. Thus, it becomes crucial to establish means of comparison between different models, so it is possible to choose between several models trained with the same data.
The simplest classification measure is the accuracy, which indicates the proportion of correctly classified records. Nonetheless, if we compute the accuracy using the training data, we could be encouraging overfitting to the training data, and we might choose models that do not work well with unseen data. This is why it is usual to use a validation/test set to compute comparison measures: given a dataset, we divide it into training data and test data. Models are then trained using the training data and evaluated with the test data, which has not been seen before. This makes the measures more reliable and the comparisons fairer.
There are several ways to extract training and test data from a dataset, which are detailed in Section 9.

6.1 Confusion Matrix

A confusion matrix is a visual and intuitive way to assess a classification algorithm. In its simplest form it is used to assess a binary classification model and it shows the following metrics:
In this case, the confusion matrix has the form:
             | Predicted True | Predicted False
Actual True  | TP             | FN
Actual False | FP             | TN
In the more general case in which we have an $N$-class classifier, each cell $M_{ij}$ contains the count of records classified as class $j$ that are of class $i$ in reality:
            | Predicted C_1 | Predicted C_2 | ... | Predicted C_N
Actual C_1  | TC_1          | FC_{2,1}      | ... | FC_{N,1}
Actual C_2  | FC_{1,2}      | TC_2          | ... | FC_{N,2}
...         | ...           | ...           | ... | ...
Actual C_N  | FC_{1,N}      | FC_{2,N}      | ... | TC_N
From these confusion matrices, we can derive some interesting measures:
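Typical measures derived from the binary matrix are the accuracy, precision, recall (sensitivity), specificity and the F1 score; a minimal Python sketch (the function name is ours):
def binary_metrics(tp, fn, fp, tn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0    # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

print(binary_metrics(tp=40, fn=10, fp=5, tn=45))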

7 Ensemble methods: increasing accuracy

An ensemble method for classification is a composite model, made up of a combination of classifiers. The idea is that, given an unseen record, it can be classified using several models, and a consensus among all of them then decides the final class of the record. Usually a voting scheme is used, in which the most voted class is the one chosen. This approach usually improves the accuracy of each individual component.
Let's analyze why this works! There are three primary components to the error of a classifier:
  1. Bias: every classifier has its own assumptions about the nature of the decision boundary between classes. When a classifier has high bias, it will make consistently incorrect predictions over records that lie near the incorrectly-modeled decision boundary.
    Example 7.1. Bias is shown below.
    image: 6_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado11.png
    The real model has been generated using the blue line. The classification model assumes that the data can be classified using a straight line. As we can see, points near the boundary would be misclassified.
  2. Variance: random variations in the choice of the training data will lead to different models. This is closely related to overfitting. When a classifier has an overfitting tendency, it will make inconsistent predictions for the same test instance over different training data sets.
    Example 7.2. Variance is shown below.
    image: 7_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado12.png
    In this case, the yellowish model happened to use few red points near the boundary, so it became a perfect blue classifier but a bad red classifier. The opposite happened to the bluish model. Note that these variations are only due to the random selection of the training points.
  3. Noise: the intrinsic errors in the target class labeling. As this is intrinsic, there is not much one can do about it. Therefore, we will focus on the former two sources of error.
In addition, bias and variance are often in a trade-off relationship: improving bias worsens variance, and vice-versa. Generally speaking, simplified assumptions about the decision boundary lead to greater bias but lower variance, while complex assumptions reduce bias but are harder to robustly estimate with limited data.
Ensemble analysis can often be used to reduce both the bias and variance of the classification process, because a combination of different simple models with high bias and little variance will reduce the bias as the assumptions will be combined to model more complex scenarios. On the other hand, a combination of complex models with low bias and high variance, will reduce the variance because decisions made by multiple models tend to be more consistent than those made by individual models.

Looking at the accuracy

Now, let's look at the accuracy of the ensemble model in comparison to its component models in the case of binary classification (for ease). Say the ensemble model is composed of $N$ models, each of them with accuracy $acc_i$. For the ensemble model to classify a new record correctly, at least half of the models need to classify the tuple correctly. Defining the random variable $X$ as 'the number of classifiers that classify the tuple correctly', we have $P(\text{correct}) = P\left(X \geq \frac{N}{2}\right) = 1 - P\left(X < \frac{N}{2}\right)$, and we can decompose $P\left(X < \frac{N}{2}\right) = P(X=1) + P(X=2) + \dots + P\left(X = \left\lceil\frac{N}{2}\right\rceil - 1\right)$.
Here, assuming independence of the different models, $P(X=1) = \sum_i acc_i \prod_{j\neq i}(1-acc_j)$, $P(X=2) = \sum_{i<j} acc_i\, acc_j \prod_{k\neq i,j}(1-acc_k)$, and so on. This yields a complex formula, but in the simplest case, in which $A = acc_i$ for all $i$, $X$ follows a binomial distribution and $P(X=k) = \binom{N}{k} A^k (1-A)^{N-k}$.
Example 7.3. Suppose an ensemble model with $N=3$ and equal accuracy $A>0$ for the three models. Then, following the decomposition above, the accuracy of the ensemble model is $Acc = P(\text{correct}) = 1 - P(X=1) = 1 - 3A(1-A)^2$, which can be compared with the individual accuracies: $Acc - A = 1 - 3A(1-A)^2 - A = -3A(1-A)^2 + 1 - A = (1-A)\left[1 - 3A(1-A)\right]$. At this point, the accuracy will be increased whenever $1 - 3A(1-A) > 0 \Leftrightarrow 3A^2 - 3A + 1 > 0$. The discriminant of this polynomial is $b^2 - 4ac = 9 - 12 = -3 < 0$, so all its roots are complex and it does not cut the X axis. As the leading coefficient is positive, this polynomial is always positive, and we conclude that the 3-ensemble method always improves the accuracy of the individual models.

Example 7.4. Suppose an ensemble model with $N=5$ and equal accuracy $A>0$ for the five models. Then, the accuracy of the ensemble model is $Acc = 1 - P(X=1) - P(X=2) = 1 - 5A(1-A)^4 - \frac{5!}{2!\,3!}A^2(1-A)^3 = 1 - 5A(1-A)^4 - 10A^2(1-A)^3$, and when compared to the individual values we obtain $Acc - A = 1 - 5A(1-A)^4 - 10A^2(1-A)^3 - A = (1-A)\left[1 - 5A(1-A)^3 - 10A^2(1-A)^2\right]$. This is an increased accuracy whenever $1 - 5A(1-A)^3 - 10A^2(1-A)^2 > 0$, which corresponds to the polynomial $-5A^4 + 5A^3 + 5A^2 - 5A + 1 > 0$. The graph of this polynomial is
image: 8_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado13.png
So for $A \in (0,1)$ it is positive everywhere except on the interval $(0.354, 0.423)$, in which it is negative. This means that the accuracy is improved in all cases outside this interval.
For example, if A=0.7 , then Acc=0.839 , which is indeed an improvement.
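Under the same independence and equal-accuracy assumptions, the majority-vote accuracy can also be computed directly from the binomial distribution. The sketch below sums over all strict majorities (so it also accounts for the $P(X=0)$ term); for $A=0.7$ it gives approximately 0.784 for $N=3$ and 0.837 for $N=5$, in line with the values above.
from math import comb

def ensemble_accuracy(A, N):
    # P(a strict majority of the N independent classifiers, each with accuracy A, is correct)
    return sum(comb(N, k) * A**k * (1 - A)**(N - k)
               for k in range(N // 2 + 1, N + 1))

print(round(ensemble_accuracy(0.7, 3), 3))   # 0.784
print(round(ensemble_accuracy(0.7, 5), 3))   # 0.837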
There are several ways to make ensemble methods:
  1. Bagging: the data is bootstrapped N times, collecting for each model a training data set of approximately the same size as the original data set. Then, each model is trained with a different bootstrap sample.
    This approach reduces the variance, because the differences due to the sampling are reduced by performing this sampling N times. However, bagging does not improve the bias, because all the models share the same assumptions.
  2. Boosting: a weight is associated with each training instance and the N classifiers are trained with the use of these weights. When classifier M_i has finished its training, the records that it misclassifies have their weights increased, so the next model M_{i+1} will be trained paying more attention to those records.
    With this approach, the overall bias is reduced because each model focuses on those places where the past models tend to fail.
    1. Adaboost: a particular algorithm approach based on the boosting idea. Given a dataset, D , of k records with label y i , ( X 1 , y 1 ) ,...,( X k , y k ) :
      1. Set initial weights to w i = 1 k ,i=1,...,k .
      2. j = 1
      3. While j<N
        1. Bootstrap D to get a training set T j , select tuples with probability w i .
        2. Train model M j with T j .
        3. Compute the error rate of M j using D .
        4. Update the weigths
          - If a tuple is misclassified: increase its weight
          - If it is correctly classified: decrease its weight
        5. j = j+1
      The error rate with weights is computed as $ER(M_j) = \sum_{i=1}^{k} w_i \cdot error(M_j, X_i)$, where $error(M,X) = 0$ if $M$ classifies $X$ correctly and $1$ if $M$ misclassifies $X$.
      When the voting is performed, the votes are also weighted, with $w(M_i) = \log\frac{1 - ER(M_i)}{ER(M_i)}$. (A minimal sketch of this weight-update loop is given after this list.)
  3. Random Forest: bagging does not work very well combined with decision trees, because the ID3 algorithm tends to generate similar/correlated trees. The idea here is to add randomness to the tree induction algorithm itself, as follows:
    1. Before each split, L attributes are randomly selected out of the available K attributes.
    2. The split attribute is selected from this group of L attributes.
    When L is selected to be much smaller than K, the trees in the forest are largely independent (decorrelated), so the method is able to improve the accuracy of the individual trees. Note, nonetheless, that in this case the individual trees will perform worse than normally trained trees, because they have been weakened on purpose.
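As announced in the boosting item above, here is a minimal Python sketch of one boosting round in the AdaBoost style: the correctly classified weights are scaled by ER/(1-ER) and all weights are then renormalized, which is one common way to realise "increase/decrease the weights"; train_model and predict are placeholders, not a real library API.
import math
import random

def boosting_round(D, weights, train_model, predict):
    # D: list of (X, y) pairs; weights: current instance weights (summing to 1)
    T = random.choices(D, weights=weights, k=len(D))        # weighted bootstrap
    model = train_model(T)                                   # placeholder learner
    errors = [0 if predict(model, X) == y else 1 for X, y in D]
    er = sum(w * e for w, e in zip(weights, errors))         # weighted error rate
    beta = er / (1 - er)                                     # < 1 if better than chance
    new_w = [w * (1.0 if e == 1 else beta) for w, e in zip(weights, errors)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]                       # normalize to sum to 1
    vote = math.log((1 - er) / er)                           # weight of this model's vote
    return model, new_w, vote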

Part III Model validation and data preparation

8 Data preparation

The data preparation phase is a multistage process that comprises several individual steps, some or all of which may be used in a given application. These steps are:
  1. Feature extraction and portability: a feature is a characteristic of the data or something derived from the data. For example, if we have a sensor measuring humidity, the level of humidity is a feature directly present in the data; the difference between the humidity level at each measurement and the average humidity level is a derived feature.
    Features with good semantic interpretability are more desirable, because they make it easier for the analyst to understand the results. The process of selecting which features to take into account for further analysis is called feature extraction.
    Data type portability refers to the process of transforming data into different formats. There can be several reasons for this: we may have several sources of data that we want to unify, or we may use an internal data type that is not compatible with what the training algorithms expect.
  2. Data cleaning: missing, erroneous and inconsistent entries are treated. We can either remove them or estimate them via the process of imputation.
  3. Data reduction, selection and transformation: the size of the data is reduced through data subset selection, feature subset selection, or data transformation. This helps in two ways:
    1. The algorithms perform more efficiently in smaller datasets.
    2. The removal of irrelevant features or records improves the quality of the data mining process.

8.1 Feature extraction

Example 8.1. Image feature extraction
Image data are represented as pixels. Nonetheless, we know that pixels are related to each other and that combinations of pixels carry information about what we are seeing in the image. This is not straightforward for a computer to understand, as the computer only 'sees' a matrix of triplets.
At a higher level, color histograms can be used to represent the features in different segments of an image.
Also, visual words are used to extract features from images. A visual word is a semantically rich representation of parts of an image.

Example 8.2. Document feature extraction
Document data is often available in raw and unstructured form, and the data may contain rich linguistic relations between different entities.
One approach is to remove stop words, stem the data, and use bag-of-words representation.
Other methods use entity extraction to determine linguistic relationships.
Named-entity recognition is an important subtask of information extraction. It consists in locating and classifying atomic elements in text into predefined categories such as names of persons, organizations, ...

8.2 Data Type Portability

8.3 Data Cleaning

Data in the real world is

8.3.1 Handling Missing and Inconsistent entries

We can:

8.3.2 Handling Noisy entries

We can:

8.4 Exploratory analysis

Exploratory analysis is the task of understanding the data, from the meaning of the features to their range of values or even their statistical distributions. There are many actions we can take to explore the data, such as counting nulls, searching for repetitions, computing some statistics of the data (the maximum, the minimum, the mean, ...), and many more.

8.4.1 Central tendency measures

Central tendency measures are one-number summaries that can be helpful:

8.4.2 Symmetric and Skewed data

Using the mean, median and mode, we can get a rough idea of the distribution of the data:

8.4.3 Measuring the dispersion

8.4.4 Comparing with the normal distribution

We can compute the mean μ and the standard deviation σ and check if the data behaves as a normal distribution:
We can also perform the Kolmogorov-Smirnov test or the Shapiro-Wilk test, which are statistical tests that assess whether a distribution is normal.

8.5 Similarity and Distance

There are many data mining algorithms that use the notions of similarity or distance between two points. Usually, the selection of the distance function is an important decision before using an algorithm, because it will ultimately influence the results and their implications.
Definition 8.1. The $L_p$-norm is a distance function defined by $Dist(X,Y) = \left(\sum_{i=1}^{d} |X_i - Y_i|^p\right)^{\frac{1}{p}}$.
For p=2 it is the well-known Euclidean distance.
For p=1 it is called the Manhattan distance, because it is like traversing a grid made of rectangles, similar to the streetmap of Manhattan.
Definition 8.2. The generalized Minkowski distance is defined by $Dist(X,Y) = \left(\sum_{i=1}^{d} a_i |X_i - Y_i|^p\right)^{\frac{1}{p}}$.
Remark 8.1. As we can see, the Minkowski distance is a weighted $L_p$-norm. This is useful in contexts where some features are more important than others, so they can have a higher weight in the distance measure.

Remark 8.2. Generally, p is set to d , the number of dimensions of the data.

Remark 8.3. Multidimensional data normally has different scales for the different dimensions, resulting in some features dominating others in distance computations. To solve this issue we can apply normalization and scaling.
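A small Python sketch of these distance functions, together with the min-max normalization mentioned in the last remark (function names are ours):
def lp_distance(x, y, p=2):
    # L_p distance between two equal-length numeric vectors
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def minkowski(x, y, w, p=2):
    # generalized Minkowski distance with per-dimension weights a_i (here w)
    return sum(wi * abs(a - b) ** p for a, b, wi in zip(x, y, w)) ** (1 / p)

def min_max_normalize(column):
    # rescale a numeric column to [0,1] so that no dimension dominates the distance
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

print(lp_distance((0, 0), (3, 4)))                  # 5.0 (Euclidean)
print(lp_distance((0, 0), (3, 4), p=1))             # 7.0 (Manhattan)
print(round(minkowski((0, 0), (3, 4), (2, 1)), 3))  # 5.831 (weighted)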
Definition 8.3. The edit distance is a distance defined over strings.
We have the operators:
  • r: replace one character by another.
  • i: insert one character.
  • d: delete one character.
The edit distance between two strings s 1 and s 2 is the minimum amount of operations needed to convert s 1 to s 2 .
The formula is $\text{edit}(s_1,s_2) = \begin{cases} |s_1| & \text{if } |s_2| = 0 \\ |s_2| & \text{if } |s_1| = 0 \\ \text{edit}(\text{tail}(s_1),\text{tail}(s_2)) & \text{if } s_1[0] = s_2[0] \\ 1 + \min\{\text{edit}(\text{tail}(s_1),s_2),\ \text{edit}(s_1,\text{tail}(s_2)),\ \text{edit}(\text{tail}(s_1),\text{tail}(s_2))\} & \text{otherwise} \end{cases}$ where $\text{tail}(s)$ is the string $s$ minus its first character.
Remark 8.4. If the edit distance is computed recursively with this formula, the running time is exponential in the length of the strings, since each call can branch into three recursive calls.

Remark 8.5. If we compute it with dynamic programming, with the algorithm in Algorithm 3, the time and space complexity are both $O(|s_1| \cdot |s_2|)$:
m[0,0] = 0
for i=1 to |s1| do m[i,0] = i
for j=1 to |s2| do m[0,j] = j
for i=1 to |s1| do
	for j=1 to |s2| do
		m[i,j] = min{m[i-1,j-1] + (if s1[i] = s2[j] then 0 else 1),
					 m[i-1,j] + 1,
					 m[i,j-1] + 1
					}
return m[|s1|,|s2|]
Algorithm 3: DP_Edit(s1, s2)
Remark 8.6. There is a further improvement that can be made. If we fill the dynamic programming table row by row (or column by column), we only need to store two rows: the current one and the previous one. In this case the time remains $O(|s_1|\cdot|s_2|)$, but the extra space drops to $O(\min(|s_1|,|s_2|))$.
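A Python sketch of this two-row version (0-based indexing; the roles of s1 and s2 can be swapped so that the shorter string determines the stored row length):
def edit_distance(s1, s2):
    # dynamic programming keeping only the previous and the current row
    prev = list(range(len(s2) + 1))          # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # deleting the first i characters of s1
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j - 1] + (0 if c1 == c2 else 1),   # replace / match
                            prev[j] + 1,                            # delete
                            curr[j - 1] + 1))                       # insert
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))    # 3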

9 Model evaluation

Once we have trained a model, we want to assess how well it performs. This way, we can compare different models and discuss, quantitatively, which of them is preferable for our purposes. Nonetheless, this task is not easy, and there are both methodological and quantification issues to take into account:

9.1 Holdout

The labeled data is randomly divided into two disjoint sets, corresponding to the training and test data. The training data is used to feed the training algorithm and produce a model, whose performance is assessed using the test data.
The approach can be repeated several times with multiple samples to provide a final estimate.
image: 9_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado6.png
Figure 1: Holdout visualization.

9.2 Cross-Validation

The labeled data is divided into m disjoint subsets of equal size n/m. A typical choice of m is around 10. One of the m segments is used for testing, and the other (m-1) segments are used for training. This approach is repeated by selecting each of the m different segments in the data as the test set.
The average accuracy over the different test sets is then reported.
The overall accuracy of the cross-validation procedure tends to be a highly representative, but pessimistic, estimate of model accuracy.
When m is chosen to be m=n , n-1 examples are used for training, and one example is used for testing. This is called leave-one-out cross-validation. This approach is very expensive for large datasets.
image: 10_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado7.png
Figure 2: Cross-Validation visualization. m=3 .
Stratified cross-validation uses proportional representation of each class in the different folds and usually provides less pessimistic results.
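A small sketch of the m-fold index split (plain Python; libraries such as scikit-learn offer ready-made, including stratified, versions):
import random

def cross_validation_folds(n, m, seed=0):
    # yield (train_indices, test_indices) for each of the m folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::m] for k in range(m)]
    for k in range(m):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

for train, test in cross_validation_folds(n=9, m=3):
    print(len(train), len(test))   # 6 3, printed three times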

9.3 Bootstrap

The labeled data is sampled uniformly with replacement, to create a training dataset, which can contain duplicates. The labeled data of size n is sampled n times with replacement.
The probability that a particular point is not included in a sample is $p_1 = 1 - \frac{1}{n}$. Therefore, the probability that the point is not included in any of the $n$ samples is $p_n = \left(1 - \frac{1}{n}\right)^n$. For large values of $n$, this approaches $\frac{1}{e} \approx 0.368$. Thus, the fraction of the labeled data points included at least once in the training dataset is about $1 - \frac{1}{e} \approx 0.632$.
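A quick numeric check of this limit:
for n in (10, 100, 10_000):
    p_excluded = (1 - 1 / n) ** n            # probability of never being sampled
    print(n, round(p_excluded, 4), round(1 - p_excluded, 4))
# p_excluded approaches 1/e ~ 0.3679, so the included fraction approaches ~0.632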
The overall accuracy is computed using the original set of full labeled data as the test examples.
The estimate is highly optimistic of the true classifier accuracy because of the large overlap between the training and test examples.
A better strategy is the leave-one-out bootstrap, in which the accuracy for each labeled instance is computed using the classifier performance on only those bootstrapped samples that do not contain the instance.
This approach provides a pessimistic accuracy estimate, A l , given by the mean value of the accuracy computed for each labeled instance.
The 0.632-bootstrap improves the accuracy estimate with a compromise approach. The average training-data accuracy A t over b bootstrapped samples is computed. This is a highly optimistic estimate. The overall accuracy is a weighted average of the leave-one-out accuracy and the training-data accuracy: A=( 0.632 ) A l +( 0.368 ) A t .

Part IV Clustering

Some applications require dividing the data into different groups that share some characteristics. The problem, many times, is that we don't know which characteristics are useful to characterize the data, or to what extent. The general (unsupervised) approach to tackle this problem is clustering.
Definition 9.1. Clustering problem (Informal)
Given a set of data points, partition them into groups containing similar data points.
This definition is informal and general, but gives enough information to understand the problem, as well as enough freedom to tackle it from different perspectives.

10 Representative-Based Algorithms

These are the simplest of all clustering algorithms, as they directly use distances or similarities to cluster the data. They do not capture hierarchical relationships and use a set of representatives to cluster the data. The main insight is that the discovery of good clusters equates to the discovery of good representatives.
Definition 10.1. Representative-Based general clustering problem
Given a data set $\mathcal{D}$ with $n$ data points $X_1,\dots,X_n$ in a $d$-dimensional space and a specified number of clusters $k$, the goal of a representative-based algorithm is to determine $k$ representatives $Y_1,\dots,Y_k$ such that the objective function $O = \sum_{i=1}^{n}\left[\min_j Dist(X_i,Y_j)\right]$ is minimized, i.e., the sum of the distances of the different data points to their closest representative needs to be minimized.
Remark 10.1. The representatives $Y_1,\dots,Y_k$ and the optimal assignment of data points to representatives are unknown a priori, but they depend on each other in a circular way. This fact allows us to develop an iterative approach to solve the problem.
The generic k-representative approach is shown in Algorithm 4.
Algorithm 4: Generic k -representative approach (Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Initialize Y = {Y_1,...,Y_k}   # using heuristics
Initialize clusters C_1 = {},... C_k = {}
do
# Assign step
	for(X in D):
		assign X to Y_j such that Dist(X,Y_j) = min_i Dist(X,Y_i)
		C_j.add(X)
	
# Optimize step
	for all Clusters C_j:
		determine Y_j' such that 
			sum_{X_i in C_j} Dist(X_i,Y_j')
		is minimized
		Y_j = Y_j'

while O = sum_{i=1}^{n} [min_j Dist(X_i,Y_j)] > eps
return {C_1, Y_1},...,{C_k, Y_k}
	
Remark 10.2. The idea is to improve the objective function over multiple iterations. The improvement is usually greater in early iterations, and decreases rapidly afterwards.

Remark 10.3. The main computational bottleneck is the assignment step, where distances need to be computed between every point and each of the representatives.

10.1 The k-Means algorithm

The k-means algorithm is a representative clustering method in which the distance used is the squared Euclidean distance (or squared $L_2$-norm): $Dist(X_i,Y_j) = \|X_i - Y_j\|_2^2$. Thus, the objective function minimizes the sum of squared errors over the data points; this is called the SSE (Sum of Squared Errors).
Proposition 10.1. The optimal representative Y j for each of the optimize iterative steps is the mean of the data points in cluster C j .
Proof. In the current step, we have a fixed clustering assignment from the last step, $C_1,\dots,C_k$. The overall clustering objective function is $O(X,Y) = \sum_{j=1}^{k}\sum_{X_i\in C_j} \|X_i - Y_j\|_2^2$, so its gradient with respect to each $Y_j$ is $\frac{\partial}{\partial Y_j} O(X,Y) = -2\sum_{X_i\in C_j}(X_i - Y_j)$. Setting the gradient equal to 0 (for optimization purposes), we get $\sum_{X_i\in C_j}(X_i - Y_j) = 0$, or, equivalently, $\sum_{X_i\in C_j} X_i = |C_j|\, Y_j \Rightarrow Y_j = \frac{\sum_{X_i\in C_j} X_i}{|C_j|} = \text{mean}(C_j)$.
Remark 10.4. Note that the obtained representative may be a point that is not present in the data. This is sometimes undesirable.

Remark 10.5. Note also that the proof assumes numerical attributes. Computing the mean of (for example) different texts does not seem easy (think, for example, of the words classification and regression: they could certainly be clustered together under data mining techniques, but the latter is hardly the mean of the two former words).

Remark 10.6. Regarding time complexity:

Remark 10.7. Disadvantages:
Example 10.1. Apply the k-means algorithm with k=3 to the dataset D={ A 1 =( 2,10 ) , A 2 =( 2,5 ) , A 3 =( 8,4 ) , A 4 =( 5,8 ) , A 5 =( 7,5 ) , A 6 =( 6,4 ) , A 7 =( 1,2 ) , A 8 =( 4,9 ) } and using the seed Y 1 = A 5 , Y 2 = A 6 and Y 3 = A 8 .
(Figures showing the successive k-means iterations are omitted.)
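A minimal NumPy sketch of the k-means loop, run on the data and seeds of Example 10.1 (a fixed number of iterations is used instead of the eps-based stopping test):
import numpy as np

def k_means(points, seeds, n_iter=10):
    Y = np.array(seeds, dtype=float)                   # current representatives
    for _ in range(n_iter):
        # assign step: closest representative under squared Euclidean distance
        d = ((points[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # optimize step: each representative becomes the mean of its cluster
        Y = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else Y[j]
                      for j in range(len(Y))])
    return Y, labels

D = np.array([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)
centers, labels = k_means(D, seeds=[D[4], D[5], D[7]])   # Y1 = A5, Y2 = A6, Y3 = A8
print(centers)
print(labels)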

10.2 The k-Medians Algorithm

In this case the Manhattan distance ($L_1$-norm) is used: $Dist(X_i,Y_j) = \|X_i - Y_j\|_1$.
Proposition 10.2. The optimal representative Y j for each of the optimize iterative steps is the median of the data points along each dimension in cluster C j .
Proof. In this case, the objective function is $O(X,Y) = \sum_{j=1}^{k}\sum_{X_i\in C_j}\|X_i - Y_j\|_1$. Now, the $L_1$-norm is obtained by summing the absolute values in each dimension. The problem is that this function is not differentiable; nonetheless, it is differentiable almost everywhere, and we can obtain a subgradient of $O$ with respect to $Y_j$ as $\frac{\partial}{\partial Y_j} O(X,Y) = -\sum_{X_i\in C_j}\text{sign}(X_i - Y_j)$ (componentwise). For this to equal 0, we need, in each dimension, as many negative signs as positive signs: the median in each dimension achieves exactly this, as it has as many values to its left as to its right.
Example 10.2. Repeat the example using the k-Medians algorithm.
(Figures showing the k-medians iterations are omitted.)

10.3 The k-Medoids Algorithm

In this algorithm the representatives are always selected from the database D , and this makes the structure of the algorithm different from the one we have seen before.
Reasons:
The algorithm is as in Algorithm 5.
Algorithm 5: k -medoids(Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Initialize Y = {Y_1,...,Y_k}   # using heuristics
Initialize clusters C_1 = {},... C_k = {}
do
# Assign step
	for(X in D):
		assign X to Y_j such that Dist(X,Y_j) = min_i Dist(X,Y_i)
		C_j.add(X)
	
# Optimize step
	Determine a pair X_i in D and Y_j in Y such that 
		replacing Y_j with X_i leads to the 
		greatest possible improvement in the objective function

	Perform the exchange between X_i and Y_j only if improvement is positive.

while O = sum_{i=1}^{n} [min_j Dist(X_i,Y_j)] > eps 
		or no improvement in current iteration
return {C_1, Y_1},...,{C_k, Y_k}
	
Remark 10.8. In this algorithm we use a hill climbing strategy to obtain the best representatives.

Remark 10.9. We can try all possible changes or sample points from the database to try with. The latter approach is often more desirable for time issues.

10.4 Practical issues

11 Grid and Density based Algorithms

One of the major problems with distance-based algorithms is that the shape of the clusters is implicitly enforced by the distance function. Thus, it can be hard to detect natural clusters of arbitrary shape.
Density-based algorithms are useful for this. The idea is to identify dense regions in the data, and use the positions of the different regions to determine the clusters.

11.1 Grid-based methods

The data is discretized into p intervals, typically equi-width. If the data has d dimensions, we will obtain p d hyper-cubes. These are the building blocks for the clusters.
A density threshold τ is used to determine the dense hyper-cubes. In most real data-sets, an arbitrarily shaped cluster will result in multiple dense regions connected together by a side or a corner.
Two hyper-cubes are said to be adjacently connected if they share a side (sometimes corners are also considered).
Two hyper-cubes are said to be density connected if a path can be found from one to another containing only a sequence of adjacently connected grid regions.
The goal is to determine the density-connected regions. Using a graph representation, the problem is equivalent to finding the connected components of the graph in which the dense hyper-cubes are the nodes and an edge is defined between every pair of adjacent cubes.
Advantages
The number of clusters is not pre-defined, so we don't need to bother with the estimation of k .
Disadvantages
We have to define p and τ , which is not easy.
Also, if the clusters present different densities, it is even more difficult to determine τ and p because each cluster is 'asking' for different values.
The generic algorithm is as follows:
Discretize each dimension into p ranges
Determine grid cells at density level tau
Create graph in which dense grids are connected if they are adjacent
Determine connected components of the graph
return points in each connected component as a cluster
Algorithm 6: GenericGrid(Data D, Ranges p, Density tau) : clusters C

11.2 DBSCAN

The idea behind DBSCAN is similar to the one we have seen, but density is considered at a pointwise level:
The density of a data point is defined as the number of points that lie within a radius eps of it, i.e., in its neighbourhood of radius eps. The densities are used to classify the points:
And we define some relations between points:
After the points have been classified, a connectivity graph is constructed, and a cluster is a maximal set of points that are all reachable from one another under any of these definitions.
Now, we identify the connected components of the graph, which are the clusters.
The detailed algorithm is as follows:
Clusters = {}
for each unvisited point P in D
	Neighbourhood = regionQuery(P, eps)
	if sizeof(Neighbourhood) < tau
		mark P as visited
	else
		C = next cluster
		expandCluster(P, Neighbourhood, C, eps, tau)
		if C not in Clusters
			Clusters.add(C)
return Clusters

function expandCluster(P, Neighbourhood, C, eps, tau)
	mark P as visited
	C.add(P)
	for Q in Neighbourhood
		if Q not visited
			mark Q as visited
			Neighbourhood_Q = regionQuery(Q, eps)
			if sizeof(Neighbourhood_Q) >= tau
				Neighbourhood.addAll(Neighbourhood_Q)
			if Q is not in any cluster
				C.add(Q)
Algorithm 7: DBSCAN(Data D, Radius eps, Density tau) : clusters C
Advantages
This method is not very different from the graph method, and it can also discover clusters of any shape, without the need of knowing the number of clusters in advance.
Disadvantages
Again, determining the correct values for eps and τ is a complex task. Also, the existence of clusters with different densities makes it even harder.
The dominant time cost is finding the neighbours: $O(n^2)$.
In some special cases, a spatial index can reduce it to $O(n \log n)$.
Remark 11.1. Usually, grid based methods are more efficient because they partition the space, which makes the procedure less computationally expensive.

11.2.1 Progressive DBSCAN

The idea is to keep the same value of τ and apply DBSCAN in a progressive way, increasing the value of eps:
  1. Start with a small eps to find dense clusters.
  2. Iteratively relax the eps value to find less dense clusters.
  3. After every iteration, the points that already belong to a cluster are removed from the dataset.

11.3 DENCLUE

The DENCLUE algorithm is based on kernel-density estimation, which can be used to create a smooth profile of the density distribution, by defining the density $f(X)$ at coordinate $X$ as $f(X) = \frac{1}{n}\sum_{i=1}^{n} K(X - X_i)$, where $K$ is the kernel function and $X_1,\dots,X_n$ are the $n$ data points. A commonly used kernel function is the Gaussian kernel: $K(X - X_i) = \left(\frac{1}{h\sqrt{2\pi}}\right)^{d} e^{-\frac{\|X - X_i\|^2}{2h^2}}$. The effect of this operation is to replace each discrete data point with a smooth bump, and the density at each point is the sum of all these bumps.
Example 11.1. A visual example of a kernel smoothing:
image: 18_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado8.png
Once the density has been smoothed, the goal is to determine clusters by using a density threshold τ that intersects the density profile. Two examples showing how the choice of τ affects the result are shown in Example 11.2.
Example 11.2. In the previous example, if we select τ =0.1 , we obtain the following:
image: 19_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado9.png
In this case, only one cluster is obtained. In contrast, if we choose τ =0.13 , two clusters are obtained:
image: 20_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_pegado10.png
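A one-dimensional sketch of this kernel smoothing (the data points, the bandwidth h and the threshold tau below are made up for illustration):
import numpy as np

def kernel_density(x, data, h):
    # Gaussian kernel density estimate f(x) = (1/n) * sum_i K(x - X_i)
    data = np.asarray(data, dtype=float)
    k = np.exp(-((x - data) ** 2) / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    return k.mean()

data = [1.0, 1.2, 1.5, 1.7, 5.0, 5.2, 5.4]          # two groups of points
grid = np.linspace(0, 7, 141)
density = np.array([kernel_density(x, data, h=0.5) for x in grid])
tau = 0.13
print((density >= tau).sum(), "grid points lie above the density threshold")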

12 Probabilistic Model-Based Algorithms

Until now, all the models described have been hard clustering algorithms, meaning each data point is assigned to exactly one cluster. Probabilistic model-based algorithms are soft algorithms, in which each data point may have a nonzero assignment probability to more than one cluster.

12.1 Fuzzy sets and clusters

A fuzzy cluster is a fuzzy set $F_S: X \to [0,1]$. For each data point $X_i \in X$, $F_S(X_i)$ represents the probability that $X_i$ is in cluster $S$. $F_S(X_i)$ can be called the degree of membership of object $X_i$ to cluster $S$.
Formally, given a set of objects $X_1,\dots,X_n$, a fuzzy clustering of $k$ fuzzy clusters $C_1,\dots,C_k$ can be represented using a partition matrix, $M = [w_{ij}]$, where $w_{ij} = F_{C_j}(X_i)$. $M$ should satisfy three conditions:
  1. $w_{ij} \in [0,1]$ for all $i=1,\dots,n$ and $j=1,\dots,k$. This follows from the definition of a fuzzy cluster.
  2. $\sum_{j=1}^{k} w_{ij} = 1$ for all $i=1,\dots,n$. The sum of the membership probabilities of each object is 1.
  3. $0 < \sum_{i=1}^{n} w_{ij} < n$ for all $j=1,\dots,k$. There is no empty cluster.

12.2 Mixture model

The underlying assumption of a mixture-based generative model is to assume that the data was generated from a mixture of k distributions with probability distributions G 1 ,..., G k . Each of them represents a cluster and is called mixture component. The data points, X i , are generated by this model as follows:
  1. Select a mixture component with prior probability α i =P( G i ) . Say G r is selected.
  2. Generate a data point from G r .
This generative model is denoted by $\mathcal{M}$. We don't know the $\mathcal{G}_i$ nor the $\alpha_i$ in advance. The $\mathcal{G}_i$ distributions are often assumed to be Gaussian (although any other distribution might be assumed), so we need to estimate the parameters of the distributions in such a way that the overall data has the maximum likelihood of being generated by the model.
Consider a set $\mathcal{C}$ of $k$ probabilistic clusters $C_1,\dots,C_k$ with probability density functions $f_1,\dots,f_k$, respectively, and probabilities $p_1,\dots,p_k$. The probability of an object $X$ being generated by the cluster $C_j$ is $P(X|C_j) = p_j f_j(X)$, and the probability of $X$ being generated by the set $\mathcal{C}$ is $P(X|\mathcal{C}) = \sum_{j=1}^{k} p_j f_j(X)$. As objects are assumed to be independently generated, for a data set $\mathcal{D} = \{X_1,\dots,X_n\}$, the probability that $\mathcal{D}$ is generated by $\mathcal{C}$ is $P(\mathcal{D}|\mathcal{C}) = \prod_{i=1}^{n} P(X_i|\mathcal{C}) = \prod_{i=1}^{n}\sum_{j=1}^{k} p_j f_j(X_i)$. Now, we want to estimate $\mathcal{C}$ from $\mathcal{D}$ so that $P(\mathcal{D}|\mathcal{C})$ is maximized.
If we use the assumption that the underlying distributions are Gaussian, $\mathcal{G}(\mu_j,\sigma_j)$, then the probability density function of each cluster, centered at $\mu_j$ with standard deviation $\sigma_j$, is $P(X_i|\Theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(X_i-\mu_j)^2}{2\sigma_j^2}}$. If we further assume that all clusters have the same probability $p_j$, then $P(X_i|\Theta) = \sum_{j=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(X_i-\mu_j)^2}{2\sigma_j^2}}$. Thus, our objective is to maximize $P(\mathcal{D}|\Theta) = \prod_{i=1}^{n}\sum_{j=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(X_i-\mu_j)^2}{2\sigma_j^2}}$.
This is achieved with the expectation-maximization (EM) algorithm. The EM algorithm is a framework to approach maximum likelihood estimates of parameters in statistical models. It consists of two steps:
  1. E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters.
  2. M-step: finds the new clustering or parameters that minimize the SSE or maximize the expected likelihood.
Example 12.1. EM algorithm example on a one-dimensional Gaussian mixture (the original figure is omitted; a small sketch of the two steps is given below).
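A compact NumPy sketch of the two EM steps for a one-dimensional mixture of k Gaussians; the responsibilities w play the role of the fuzzy membership degrees of Subsection 12.1, and the initialization is deliberately crude.
import numpy as np

def em_gaussian_mixture(x, k, n_iter=50):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu = np.random.choice(x, k, replace=False)          # crude initialization
    sigma = np.full(k, x.std() + 1e-6)
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = P(cluster j | x_i)
        dens = p * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / (np.sqrt(2 * np.pi) * sigma)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameters that maximize the expected likelihood
        nj = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nj) + 1e-6
        p = nj / n
    return mu, sigma, p

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gaussian_mixture(x, k=2))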

12.3 Evaluating fuzzy clusters

If $c_1,\dots,c_k$ are the centers of the $k$ clusters, we define the sum of squared errors (SSE) for a point $X_i$ as $SSE(X_i) = \sum_{j=1}^{k} w_{ij}^{p}\, dist(X_i,c_j)^2$.
For a cluster $C_j$, its SSE is $SSE(C_j) = \sum_{i=1}^{n} w_{ij}^{p}\, dist(X_i,c_j)^2$.
Finally, the SSE of the whole clustering is $SSE(\mathcal{C}) = \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{p}\, dist(X_i,c_j)^2$.

12.4 Cluster quality measures

A good clustering method will produce high-quality clusters, i.e., clusters with the following characteristics:
The quality of the clustering method depends on the similarity measure used by the method, its implementation and its ability to discover the hidden patterns in the data.
Some examples of quality measures are:

Part V Frequent pattern and association rule mining

13 Frequent Itemset Mining

Association pattern mining is usually defined in the context of supermarket data containing sets of items bought by customers, which are referred to as transactions. The goal is to determine associations between groups of items bought by customers. The discovered sets of items are referred to as frequent itemsets.
These frequent itemsets can then be used to generate association rules of the form $X\Rightarrow Y$, where $X$ and $Y$ are sets of items. The meaning is that we discovered that when a customer buys $X$, the same customer is likely to also buy $Y$.
We have to be careful, nonetheless, because the raw frequency of a pattern is not the same as the statistical significance of the underlying correlations. This is why numerous models for frequent pattern mining have been proposed that are based on statistical significance.
Example 13.1. An intuitive example is that when someone buys bread, cheese and yogurt, it is probably the case that they will also buy milk and eggs: $\{Bread, Cheese, Yogurt\}\Rightarrow\{Milk, Eggs\}$.

13.1 The model

The problem of association pattern mining is defined on unordered set-wise data.
The database T contains n transactions, T 1 ,..., T n .
Each transaction $T_i$ is a subset of the universe of items, $T_i\subseteq U$. A transaction can be represented as a multidimensional binary record of dimensionality $d=|U|$, where $T_i(item)=1$ if $item\in T_i$ and $T_i(item)=0$ otherwise.
The universe of items is very large compared to the typical number of items in each transaction.
Definition 13.1. An itemset is a set of items, $I\subseteq U$.
A $k$-itemset is an itemset with $k$ items: $I\subseteq U$ with $|I|=k$.
The support of an itemset I , sup( I ) , is the fraction of the transactions in the database T that contain I as a subset.
Definition 13.2. The frequent itemset mining problem is defined as follows:
Given a set of transactions T ={ T 1 ,..., T n } , where each transaction T i is a subset of items from U , determine all itemsets I that occur as a subset of at least a predefined fraction minsup of the transactions in T .
The predefined fraction minsup is called minimum support.
The unique identifier of a transaction is referred to as transaction identifier (tid).
Remark 13.1. The number of frequent itemsets is generally very sensitive to the minimum support level: lowering minsup can make the number of frequent itemsets explode, while raising it too much may leave only trivial patterns.
Therefore, an appropriate choice of the support level is crucial for discovering a set of frequent patterns with meaningful size.
Example 13.2. A very simple example:
Tid | Transaction | Binary Representation
1 | {Shirt, Trousers} | 110
2 | {Shirt, Jacket} | 101
3 | {Shirt, Trousers, Jacket} | 111
In this case, the possible itemsets and their respective support is:
Itemset | Support
{Shirt} | 1
{Trousers} | 2/3
{Jacket} | 2/3
{Shirt, Trousers} | 2/3
{Shirt, Jacket} | 2/3
{Trousers, Jacket} | 1/3
{Shirt, Trousers, Jacket} | 1/3
If we select minsup = 1/3, all possible itemsets would be selected.
If we select minsup=1 , only {Shirt} would be selected.
Now, let's think about how many possible itemsets there are: since an itemset is a subset of $U$, there are $2^{|U|}$ possible itemsets. This means that computing all their supports as in the previous example would take time exponential in the number of items. Not only this, but the database also needs to be accessed for counting and comparing, so it is easy to see how this problem rapidly becomes intractable. It is therefore essential to find better ways to perform this counting and comparing, or to be able to discard itemsets even before counting. For this, there are some very convenient properties of itemsets:
Property 1: Support Monotonicity Property
The support of every subset $J\subseteq I$ is greater than or equal to the support of $I$: $sup(J)\ge sup(I),\ \forall J\subseteq I$.
This is because every transaction that contains the itemset $I$ also contains all of its subsets, so each subset appears at least in the same $sup(I)$ fraction of the transactions.
Property 2: Downward Closure Property
Every subset of a frequent itemset is also frequent.
This is a natural implication of the previous property, and it is very useful: when we discover a frequent itemset, we don't need to check its subsets because it is already assured that they are frequent, too.
Definition 13.3. A frequent itemset is maximal at a given minimum support level minsup if it is frequent, and no superset of it is also frequent.
The possible itemsets given a set of items can be conceptually arranged in the form of a lattice of itemsets, which contains one node for each of the $2^{|U|}$ sets drawn from the universe of items. An edge exists between a pair of nodes if the corresponding sets differ by exactly one item. The lattice represents the search space of frequent patterns, and it is separated into frequent and infrequent itemsets by a border.
Example 13.3. A lattice of itemset with 4 elements.
image: 22_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_lattice_itemset.png

13.2 Association rule generation framework

Definition 13.4. Let $X, Y$ be two itemsets. The confidence of the rule $X\Rightarrow Y$, $conf(X\Rightarrow Y)$, is the conditional probability of $X\cup Y$ occurring in a transaction, given that the transaction contains $X$: $conf(X\Rightarrow Y)=Pr(X\cup Y\mid X)=\frac{sup(X\cup Y)}{sup(X)}$. $X$ is called the antecedent of the rule and $Y$ is the consequent.
Definition 13.5. Let $X, Y$ be two itemsets. The rule $X\Rightarrow Y$ is said to be an association rule at a minimum support of minsup and minimum confidence of minconf if it satisfies:
  1. The support of the itemset $X\cup Y$ is at least minsup.
  2. The confidence of the rule $X\Rightarrow Y$ is at least minconf.
Here, the first criterion ensures that there are enough transactions to believe that the rule has statistical relevance. The second criterion ensures that the rule is strong enough in terms of conditional probabilities.
The overall procedure for association rule generation uses two phases:
  1. The frequent itemsets are generated at the minimum support of minsup .
  2. The association rules are generated from the frequent itemsets at the minimum confidence level of minconf .
The first phase is more computationally intensive, so we are focusing on it from now on.

13.3 Frequent itemset mining algorithms

13.3.1 Brute Force Algorithms

For a universe of items $U$, there are $2^{|U|}-1$ distinct subsets, excluding the empty set. A naïve idea would be to generate all these candidate itemsets and count their support against the transaction database $T$.
Definition 13.6. A candidate itemset is an itemset that could be frequent, and therefore needs to be checked.
Now, we can verify the candidates against the transaction database by support counting, i.e., checking whether a given itemset $I$ is a subset of each transaction $T_i\in T$.
This approach is likely to be impractical when the universe of items is large.
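A minimal Python sketch of this brute-force enumeration (itemsets as frozensets, transactions as sets); it only makes the counting step explicit and is not meant to be efficient.

from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    # minsup is a fraction of the number of transactions.
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    # Enumerate every non-empty subset of the universe of items and count it.
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cand = set(candidate)
            support = sum(cand <= t for t in transactions) / n
            if support >= minsup:
                frequent[frozenset(cand)] = support
    return frequent

# Example 13.2: three transactions over {Shirt, Trousers, Jacket}.
T = [{"Shirt", "Trousers"}, {"Shirt", "Jacket"}, {"Shirt", "Trousers", "Jacket"}]
print(brute_force_frequent_itemsets(T, minsup=2/3))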

A little tweak

The brute-force approach can be made faster by observing that no $(k+1)$-patterns are frequent if no $k$-patterns are frequent (this follows from the downward closure property). Thus, we can enumerate and count the patterns in increasing order of length, and stop as soon as no frequent patterns of the current length are found.
For sparse transaction databases, the length $l$ of the longest frequent itemset is usually small compared to $|U|$, so the enumeration terminates early. This approach is orders of magnitude faster, but its computational complexity is still not satisfactory for large values of $|U|$.

Ideas to apply

Better algorithms can be developed by using one or more of the following approaches:
  1. Reduce the size of the explored search space by pruning candidate itemsets using tricks.
  2. Counting the support of each candidate more efficiently by pruning transactions that are known to be irrelevant.
  3. Using compact data structures to represent either the candidates or the transaction database, so as to support efficient counting.

13.3.2 The Apriori Algorithm

The Apriori algorithm uses the downward closure property to prune candidates. If an itemset is infrequent, then all its supersets are also infrequent, so we don't need to count them.
The Apriori algorithm works as follows:
  1. Count the support of the individual items to generate frequent 1-itemsets.
  2. Combine the frequent 1-itemsets to generate candidate 2-itemsets.
  3. Count the support of the candidate 2-itemsets to generate frequent 2-itemsets.
  4. ...
In general:
  1. Combine the frequent ( k-1 ) -itemsets to generate candidate k -itemsets.
  2. Prune candidate $k$-itemsets which have some subset that is not frequent.
  3. Count the support of the candidate k -itemsets to generate frequent k -itemsets.
  4. Repeat until there are no bigger frequent itemsets.
The algorithm is detailed in Algorithm 8.
k = 1
F1 = {Frequent 1-itemsets}

while Fk is not empty do
	Generate C(k+1) by joining itemset-pairs of Fk
	Prune itemsets from C(k+1) that violate the downward closure property
	Determine F(k+1) by support counting (C(k+1),T)
	k = k+1
end

return Union(F(i) for all i=1..k)
Algorithm 8: Apriori(Transactions T, Minimum Support minsup)
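The following is a minimal Python sketch of Algorithm 8, assuming the transactions are given as frozensets and minsup is an absolute count (as in Example 13.4 below); the names and data layout are illustrative, not part of the course material.

from itertools import combinations

def apriori(transactions, minsup):
    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Fk = {i for i, c in counts.items() if c >= minsup}
    all_frequent = {i: counts[i] for i in Fk}
    k = 1
    while Fk:
        # Join step: combine frequent k-itemsets that share k-1 items.
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # Prune step: discard candidates having an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Fk for s in combinations(c, k))}
        # Support counting with one pass over the database.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Fk = {c for c, n in counts.items() if n >= minsup}
        all_frequent.update((c, counts[c]) for c in Fk)
        k += 1
    return all_frequent

# Example 13.4 database, minsup = 2 (as a count).
T = [frozenset(s) for s in ["ABE", "BD", "BC", "ABD", "AC", "BC", "AC", "ABCE", "ABC"]]
print(apriori(T, 2))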
Remark 13.2. The downward closure property ensures that the candidate set generated does not miss any itemset that is frequent. This non-repetitive and exhaustive way of generating candidates can be interpreted in the context of a conceptual hierarchy of the patterns known as enumeration tree.

Remark 13.3. To do the pruning, we check generated elements against non-frequent itemsets already generated.

Remark 13.4. The support counting process is the most expensive part because it depends on the size of T . The level-wise approach ensures that the algorithm is relatively efficient from a disk-access perspective: each set of candidates in C k can be counted in a single pass over the data without the need for random disk accesses.
Nonetheless, the counting procedure is still expensive.
Example 13.4. We are going to manually run an Apriori algorithm for a simple database using minsup=2 (the minsup can be also indicated as a count, instead of as a frequency). The database is
Database
TID | Transaction
1 | A,B,E
2 | B,D
3 | B,C
4 | A,B,D
5 | A,C
6 | B,C
7 | A,C
8 | A,B,C,E
9 | A,B,C
Let's compute F 1 directly:
F1
Itemset | Count
A | 6
B | 7
C | 6
D | 2
E | 2
From this, we see that all 1-itemsets are frequent, so we are generating all possible 2-itemsets as C2, and counting them:
C2
Itemset | Count
A,B | 4
A,C | 4
A,D | 1
A,E | 2
B,C | 4
B,D | 2
B,E | 2
C,D | 0
C,E | 1
D,E | 0
F2
Itemset | Count
A,B | 4
A,C | 4
A,E | 2
B,C | 4
B,D | 2
B,E | 2
The infrequent itemsets (count below 2) are discarded to obtain F2. Now we can generate C3 by combining itemsets from F2. For this, we search for itemsets that share all values but one. For example, {A,B} and {A,C} are combined to obtain {A,B,C}. After getting all possibilities, we find C3 and count:
C3
Itemset | Count
A,B,C | 2
A,B,D | (pruned)
A,B,E | 2
A,C,D | (pruned)
A,C,E | (pruned)
B,C,D | (pruned)
B,C,E | (pruned)
B,D,E | (pruned)
F3
Itemset | Count
A,B,C | 2
A,B,E | 2
The itemsets marked as pruned are removed because they contain some of the infrequent itemsets of C2. The rest are still frequent, so we continue with the process. We combine the only two itemsets left to get {A,B,C,E} as C4. Note, nonetheless, that {C,E} is an infrequent itemset of C2, so {A,B,C,E} is pruned and we are in fact done.
The frequent itemsets with minsup=2 are F1, F2 and F3.

Limits of Apriori

Apriori improved with tricks

Algorithm 9 details an improved version of the Apriori algorithm.
	k = 1
	F1 = {Frequent 1-itemsets}
	
	while Fk is not empty do
		C(k+1) = generate(Fk)
		for tran in T do
			C(tran) = subset(C(k+1), tran)   // candidates contained in tran
			for cand in C(tran)
				cand.count++
			end
		end
		F(k+1) = {cand in C(k+1) | cand.count >= minsup}
		k = k+1
	end
	return Union(F(i) for i=1..k)

procedure generate(Fk)
	foreach itemset1 in Fk
		foreach itemset2 in Fk
			if itemset1[1]=itemset2[1] and ... and itemset1[k-1]=itemset2[k-1] and itemset1[k] < itemset2[k]
				c = itemset1 join itemset2
				if has_infreq_subsets(c, Fk)
					continue
				else
					add c to C(k+1)
				end
			end
		end
	end
	return C(k+1)

procedure has_infreq_subsets(c, Fk)
	foreach k-subset s of c
		if s not in Fk
			return TRUE
		end
	end
	return FALSE
Algorithm 9: Apriori_improved(Transactions T, Minimum support minsup)

More Apriori Tricks

13.3.3 FP-Growth

Even though Apriori greatly improves the efficiency of the solution of the association pattern mining problem in comparison to the brute force approach, we have seen that it can still suffer from inefficiencies when the database is big. In particular, counting is very costly, and when there are many items we need to count many times. There is an improved solution for the problem, Frequent Pattern Growth (FP-Growth), which:
  1. Transforms the database into a compressed data structure called FP-Tree, which retains the frequent itemset information.
  2. Mines the FP-Tree for frequent itemsets by:
    1. Dividing it into a set of conditional databases, the conditional pattern bases, each associated with one frequent item.
    2. Mining each of these conditional pattern bases separately, without the need to re-count the original database.
Definition 13.7. An FP-Tree is a trie (prefix tree) data structure, which acts as a compressed representation of a conditional database.
The algorithm for FP-Growth is detailed in Algorithm 10.
if FPT is a single path
	determine all combinations C of nodes on the path, report Union(C,P) as frequent
else
	foreach item i in FPT do
		report Pi = Union(i,P) as frequent
		
		use pointers to extract conditional prefix paths from FPT containing i
	
		readjust counts of prefix paths
		remove i
		
		remove infrequent items from prefix paths
		
		reconstruct FPTi
		
		if FPTi not empty
			FP-Growth(FPTi, minsup, Pi)
	end
end
Algorithm 10: FP-Growth(FP-Tree FPT, Minimum Support minsup, Current Suffix P)
Basically, what FP-Growth does is recursively find all frequent itemsets ending with a particular suffix by splitting the problem into smaller subproblems.
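FP-Growth is easier to follow once the FP-Tree construction is clear. Below is a minimal Python sketch of building an FP-Tree (nodes with an item, a count, children and per-item header links), assuming minsup is an absolute count; it covers only the construction step used in Example 13.5, not the recursive mining, and all names are illustrative.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup):
    # 1) Count items and keep the frequent ones, ordered by descending count.
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    order = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1]) if c >= minsup]
    rank = {i: r for r, i in enumerate(order)}
    root = FPNode(None, None)
    header = {i: [] for i in order}   # links to all tree nodes holding each item
    # 2) Insert each transaction, reordered by descending frequency, reusing prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

T = [set(s) for s in ["ABE", "BD", "BC", "ABD", "AC", "BC", "AC", "ABCE", "ABC"]]
root, header = build_fp_tree(T, minsup=2)
print({item: [n.count for n in nodes] for item, nodes in header.items()})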
Example 13.5. Let's use FP-Growth on the previous example database:
Database
TID | Transaction
1 | A,B,E
2 | B,D
3 | B,C
4 | A,B,D
5 | A,C
6 | B,C
7 | A,C
8 | A,B,C,E
9 | A,B,C
First, we need to construct the FP-Tree. For this purpose, we compute the frequent 1-itemsets and reorder them in the count order:
F1
Itemset | Count
A | 6
B | 7
C | 6
D | 2
E | 2
In descending order of count:
F1
Itemset | Count
B | 7
A | 6
C | 6
D | 2
E | 2
Now we reorder the transactions in the database following this same ordering:
Database (items reordered)
TID | Transaction
1 | B,A,E
2 | B,D
3 | B,C
4 | B,A,D
5 | A,C
6 | B,C
7 | A,C
8 | B,A,C,E
9 | B,A,C
And now we construct the FP-Tree. We start with the root labeled as NULL, and then add the transactions one by one, reusing the prefixes when we can, and keeping a counter for each node. Every time we traverse a node, we increase the counter. The first transaction would be entered as:
image: 23_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_0.png
The second one:
image: 24_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_1.png
The third one:
image: 25_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_2.png
And so on... Until all transactions are entered in the tree:
image: 26_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_3.png
Now, we link each item in the ordered F1 (the header table) to one node in the tree corresponding to that item, and all nodes holding the same item are chained together, too. Like this:
image: 27_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_fpgrowth_4.png
And now we proceed with the algorithm. We start with the least frequent item, E .
And that's it!

Why does FP-Growth outperform Apriori?

13.4 Mining Association Rules

Once the frequent itemsets have been found, it is time to obtain the association rules, which are what we have been aiming at since the beginning. A common approach is the following:
  1. For each frequent itemset $I$:
    1. Generate the subsets $S\subset I$.
    2. For every such $S$:
      1. Output the rule $S\Rightarrow I\setminus S$ if $conf(S\Rightarrow I\setminus S)\ge minconf$.
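A minimal sketch of this rule-generation step, assuming we already have a dictionary mapping each frequent itemset (a frozenset) to its support, e.g., the output of the Apriori sketch given after Algorithm 8; supports can be counts or fractions, since only their ratio is used.

from itertools import combinations

def generate_rules(frequent, minconf):
    # frequent: dict {frozenset(itemset): support}
    rules = []
    for I, sup_I in frequent.items():
        if len(I) < 2:
            continue
        for r in range(1, len(I)):
            for antecedent in combinations(I, r):
                S = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent (downward closure),
                # so its support is guaranteed to be available.
                conf = sup_I / frequent[S]   # conf(S => I \ S) = sup(I) / sup(S)
                if conf >= minconf:
                    rules.append((set(S), set(I - S), conf))
    return rules

# Usage sketch, with the Apriori output from the earlier example:
# rules = generate_rules(apriori(T, 2), minconf=0.7)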

13.4.1 Evaluating association rules

In Definition 13.4 we saw the concept of confidence, but there are more measures to assess how 'good' a rule is, such as the lift or correlation analysis.
Definition 13.8. The lift is $lift(X\Rightarrow Y)=\frac{sup(X\cup Y)}{sup(X)\,sup(Y)}$.
Remark 13.5. If $lift=1$, then $X$ and $Y$ are independent.
If $lift>1$, then $X$ and $Y$ have a positive dependency that is proportional to the lift value.
If $lift<1$, then $X$ and $Y$ are negatively correlated, i.e., the presence of $X$ discourages $Y$ and vice versa.
Definition 13.9. The correlation coefficient is $\Phi=\frac{TT\cdot FF - TF\cdot FT}{\sqrt{T\_\cdot F\_\cdot \_T\cdot \_F}}$, where we are assuming a rule $A\Rightarrow B$ and:
  • TT is how many times A is True and B is True
  • FF is how many times A is False and B is False
  • TF is how many times A is True and B is False
  • FT is how many times A is False and B is True
  • T_/F_ is how many times A is True/False
  • _T/_F is how many times B is True/False
Remark 13.6. In this case, if Φ =-1 there is a perfect negative correlation.
If Φ =1 there is a perfect positive correlation.
If Φ =0 the two itemsets are statistically independent.

Remark 13.7. The definitions of TT, FT, ... can be summarized as in the following table:
 | B | ¬B | Total
A | TT | TF | T_
¬A | FT | FF | F_
Total | _T | _F | 
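A small sketch computing confidence, lift and the Φ coefficient directly from the four counts of such a contingency table; the helper name is illustrative, and degenerate tables (a zero margin) are not handled.

import math

def rule_measures(TT, TF, FT, FF):
    n = TT + TF + FT + FF
    sup_A, sup_B, sup_AB = (TT + TF) / n, (TT + FT) / n, TT / n
    conf = sup_AB / sup_A
    lift = sup_AB / (sup_A * sup_B)
    phi = (TT * FF - TF * FT) / math.sqrt((TT + TF) * (FT + FF) * (TT + FT) * (TF + FF))
    return conf, lift, phi

# Example 13.6 below (Tea/Coffee): conf = 0.75, lift = 0.9375, phi is negative.
print(rule_measures(TT=150, TF=50, FT=650, FF=150))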
Example 13.6. Imagine we have the rule $\{Tea\}\Rightarrow\{Coffee\}$ with the following data:
 | Coffee | ¬Coffee | Total
Tea | 150 | 50 | 200
¬Tea | 650 | 150 | 800
Total | 800 | 200 | 1000
In this case, the confidence is $conf(\{Tea\}\Rightarrow\{Coffee\})=\frac{150}{200}=0.75$, which is a high confidence... but $sup(Coffee)=0.8$, which means that drinking tea in fact decreases the probability of drinking coffee!
Now, the lift is $lift(\{Tea\}\Rightarrow\{Coffee\})=\frac{0.15}{0.2\cdot 0.8}=0.9375<1$, which means that Tea and Coffee are negatively correlated. This insight is better than the one obtained by only looking at the value of the confidence.

Example 13.7. But let's now look at this example:
 | p | ¬p | Total
q | 880 | 50 | 930
¬q | 50 | 20 | 70
Total | 930 | 70 | 1000
 | r | ¬r | Total
s | 20 | 50 | 70
¬s | 50 | 880 | 930
Total | 70 | 930 | 1000
In this case $lift(\{p\}\Rightarrow\{q\})=1.02$ and $lift(\{r\}\Rightarrow\{s\})=4.08$, but $(p,q)$ appear together 88% of the time, while $(r,s)$ appear together only 2% of the time. Confidence is a better indicator in this case. The problem here is that $r,s$ appear in only a small portion of the records in the data.

13.4.2 How to choose a measure?

We have seen that we can obtain different conclusions by looking at different measures, so the choice of the measure is important because it will affect the results. A good choice must be based on a clear understanding of the measure and its properties, so we are aware of the flaws it entails and can leverage them for good.
We can also assess rules using interactive visualizations, subjective measures based on domain experience, etc.

14 Sequential pattern mining

In the last section, we were looking for patterns in a set-wise approach, but there are also situations in which the order in which elements are encountered is important. For example, it is a well-known fact that seeing something can make people want to buy it. This fact is used by supermarkets to strategically place the products so that customers face as many temptations as possible. From the side of the supermarket, it is interesting to know which zones of the supermarket are related in such a way that customers tend to go from one to another in a particular order. Imagine we detect that people who go to the desserts hall usually go to the fruit hall afterwards. Then, the supermarket is interested in placing the fruits as far from the desserts as possible, to maximize the time the customer spends looking at products that they did not want before, but might want now. It can also be used to detect sequences like:
With this objective in mind, sequential pattern mining was developed.
Definition 14.1. A sequence is an ordered list of elements. Each element is an unordered set of items.
A subsequence $s$ of a sequence $S$ is a sequence such that:
  • If $e\in s$ then there exists $E\in S$ such that $e\subseteq E$.
  • If $e_1 e_2 \dots e_k$ is the full sequence $s$, then there exist $k$ different elements $E_1,\dots,E_k\in S$ such that $e_i\subseteq E_i$, $i=1,\dots,k$, and $E_1$ goes before $E_2$ (maybe not directly), $E_2$ goes before $E_3$, and so on.
In other words, $s=[e_1,\dots,e_k]$ is a subsequence of $S=[E_1,\dots,E_m]$ if there are $k$ elements $E_{s_1},\dots,E_{s_k}\in S$ such that $e_i\subseteq E_{s_i}$, $i=1,\dots,k$, and $s_1<s_2<\dots<s_k$.
Definition 14.2. A sequence database is a database of sequences.
The support of a sequence $s$ is the fraction of sequences in the database $\mathcal{S}=\{S_1,\dots,S_N\}$ that contain $s$ as a subsequence.
Definition 14.3. The sequential pattern mining problem consists in, given a sequence database, finding all subsequences whose support is at least the specified minsup.
This problem is obviously similar and related to that of frequent pattern mining, but it is more complex. A naive algorithm that generates all possible patterns and counts them in the database would be highly inefficient, even more so than in the set-wise case. There are several methods that work much better. First, let's see how to generate candidates.
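The core operation of every method below is testing whether a candidate sequence s is a subsequence of a database sequence S in the sense of Definition 14.1. A minimal sketch, representing a sequence as a list of sets:

def is_subsequence(s, S):
    # Greedily match each element of s to a later element of S that contains it.
    pos = 0
    for e in s:
        while pos < len(S) and not set(e) <= set(S[pos]):
            pos += 1
        if pos == len(S):
            return False
        pos += 1
    return True

def support(s, database):
    # Fraction of sequences in the database that contain s as a subsequence.
    return sum(is_subsequence(s, S) for S in database) / len(database)

db = [[{"A", "B"}, {"C"}, {"A"}], [{"A", "B"}, {"B"}, {"C"}],
      [{"B"}, {"C"}, {"D"}], [{"B"}, {"A", "B"}, {"C"}]]
print(support([{"A", "B"}, {"C"}], db))   # 3 of the 4 sequences contain it: 0.75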

14.1 Candidate generation

The candidate generation follows the steps:
  1. Base case, $k=2$: merging two frequent 1-sequences $[\{i_1\}]$ and $[\{i_2\}]$ produces two candidate 2-sequences: $[\{i_1,i_2\}]$ and $[\{i_1\}\{i_2\}]$.
  2. General case, $k>2$: two frequent $(k-1)$-sequences $w_1$, $w_2$ are merged if the subsequence obtained by removing the first event in $w_1$ is the same as the subsequence obtained by removing the last event in $w_2$.
    1. The resulting candidate is the sequence $w_1$ extended with the last event of $w_2$:
      1. If the last two events in $w_2$ belong to the same element, then the last event in $w_2$ becomes part of the last element in $w_1$.
      2. Otherwise, the last event in $w_2$ becomes a separate element appended to the end of $w_1$.
      3. If both $w_1$ and $w_2$ consist of only one element, we generate the two variants (as in the base case).
Example 14.1. If we merge w 1 =[ { 1 } { 2,3 } { 4 } ] and w 2 =[ { 2,3 } { 4,5 } ] we will get the sequence w 3 =[ { 1 } { 2,3 } { 4,5 } ] , because the last two events in w 2 belong to the same element.
If we merge w 1 =[ { 1 } { 2,3 } { 4 } ] and w 2 =[ { 2,3 } { 4 } { 5 } ] we will get the sequence w 3 =[ { 1 } { 2,3 } { 4 } { 5 } ] .

14.2 Generalized Sequential Pattern (GSP) Algorithm

  1. Make the first pass over the database to obtain all the 1-element frequent sequences.
  2. Repeat until no new frequent sequences are found:
    1. Candidate generation: merge pairs of frequent subsequences found in the ( k-1 ) th pass to generate candidate sequences that contain k items.
    2. Candidate pruning: prune candidate k -sequences that contain infrequent ( k-1 ) -subsequences.
    3. Support counting: make a new pass over the sequence database to find the support for these candidate sequences.
    4. Candidate elimination: eliminate candidate k -sequences whose actual support is less than minsup .
Example 14.2. Let's perform GSP with the following database with minsup=2 .
Database
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{B} {A,B} {C}]
First, we compute C1 and F1:
C1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
[{D}] | 1
F1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
Now, we generate C2 by combining these subsequences. For example, merging [ { A } ] and [ { B } ] gives [ { A } { B } ] and [ { A,B } ] , but we also need to combine in the other possible order, obtaining [ { B } { A } ] in addition to these. Thus, we obtain:
C2
Subsequence | Count
[{A} {A}] | 1
[{A} {B}] | 1
[{A,B}] | 3
[{B} {A}] | 2
[{B} {B}] | 2
[{A} {C}] | 3
[{A,C}] | 0
[{C} {A}] | 1
[{C} {C}] | 0
[{B} {C}] | 4
[{B,C}] | 0
[{C} {B}] | 0
F2
Subsequence | Count
[{A,B}] | 3
[{B} {A}] | 2
[{B} {B}] | 2
[{A} {C}] | 3
[{B} {C}] | 4
Now, from F2 we can combine [ { A,B } ] with [ { B } { A } ] ,[ { B } { B } ] and [ { B } { C } ] , [ { B } { A } ] with [ { A,B } ] and [ { A } { C } ] , [ { B } { B } ] with [ { B } { A } ] and [ { B } { C } ] , [ { A } { C } ] with none and [ { B } { C } ] with none. Thus:
C3
Subsequence | Count
[{A,B} {A}] | 1
[{A,B} {B}] | 1
[{A,B} {C}] | 3
[{B} {A,B}] | 1
[{B} {A} {C}] | 1
[{B} {B} {A}] | 0
[{B} {B} {C}] | 2
F3
Subsequence | Count
[{A,B} {C}] | 3
[{B} {B} {C}] | 2
And that's it because we cannot combine anything else. Thus, the result is F1, F2 and F3.

14.3 Sequential PAttern Discovery using Equivalence classes (SPADE) Algorithm

  1. Transform the database into its vertical format, i.e., with the SeqID, the ElemID inside the sequence, and the element itself.
  2. Construct the ID-list of 1-sequences, i.e., build a table in which the elements are the columns and each cell contains all pairs (SeqID:ElemID) in which the element of the column appears.
  3. Count the distinct SeqIDs for each element. Those having at least minsup are kept and the rest are discarded.
  4. For $k>1$:
    1. Construct new $k$-candidates using the frequent $(k-1)$-subsequences (as in GSP). Prune when possible.
    2. Construct the ID-list of $k$-sequences as (SeqID: ElemID_1, ..., ElemID_k), where ElemID_i is the element of sequence SeqID in which the $i$-th event of the current subsequence appears.
    3. Count the distinct SeqIDs for each candidate. Those having at least minsup are kept and the rest are discarded.
Example 14.3. Let's repeat the example with SPADE. First, we transform the database to its vertical form:
Database
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{B} {A,B} {C}]
Vertical Database
SeqID | ElemID | Element
1 | 1 | {A,B}
1 | 2 | {C}
1 | 3 | {A}
2 | 1 | {A,B}
2 | 2 | {B}
2 | 3 | {C}
3 | 1 | {B}
3 | 2 | {C}
3 | 3 | {D}
4 | 1 | {B}
4 | 2 | {A,B}
4 | 3 | {C}
Now, we construct the ID-list of 1-sequences:
1-ID-List (SeqID:ElemID)
[{A}]: 1:1, 1:3, 2:1, 4:2 (count of distinct SeqIDs: 3)
[{B}]: 1:1, 2:1, 2:2, 3:1, 4:1, 4:2 (count: 4)
[{C}]: 1:2, 2:3, 3:2, 4:3 (count: 4)
[{D}]: 3:3 (count: 1)
F1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
Now the ID-list of 2-sequences, combining the 1-sequences:
2-ID-List (SeqID: ElemID_1, ElemID_2)
[{A} {A}]: 1:1,3 (count: 1)
[{A} {B}]: 2:1,2 (count: 1)
[{A,B}]: 1:1,1; 2:1,1; 4:2,2 (count: 3)
[{B} {A}]: 1:1,3; 4:1,2 (count: 2)
[{B} {B}]: 2:1,2; 4:1,2 (count: 2)
[{A} {C}]: 1:1,2; 2:1,3; 4:2,3 (count: 3)
[{A,C}]: (count: 0)
[{C} {A}]: 1:2,3 (count: 1)
[{C} {C}]: (count: 0)
[{B} {C}]: 1:1,2; 2:1,3; 2:2,3; 3:1,2; 4:1,3; 4:2,3 (count: 4)
[{B,C}]: (count: 0)
[{C} {B}]: (count: 0)
So
F2
Subsequence | Count
[{A,B}] | 3
[{B} {A}] | 2
[{B} {B}] | 2
[{A} {C}] | 3
[{B} {C}] | 4
And now we do the 3-ID-List combining these:
3-ID-List (SeqID: ElemID_1, ElemID_2, ElemID_3)
[{A,B} {A}]: 1:1,1,3 (count: 1)
[{A,B} {B}]: 2:1,1,2 (count: 1)
[{A,B} {C}]: 1:1,1,2; 2:1,1,3; 4:2,2,3 (count: 3)
[{B} {A,B}]: 4:1,2,2 (count: 1)
[{B} {A} {C}]: 4:1,2,3 (count: 1)
[{B} {B} {A}]: (count: 0)
[{B} {B} {C}]: 2:1,2,3; 4:1,2,3 (count: 2)
So
F3
Subsequence | Count
[{A,B} {C}] | 3
[{B} {B} {C}] | 2
And the result is (obviously) the same we got with GSP.

14.4 PrefixSpan

  1. Start by counting frequent 1-sequences, as in GSP.
  2. PrefixSpan extends each frequent sequence recursively.
  3. For each frequent 1-sequence $S$:
    1. Project the database with $S$ as a prefix, i.e., for each sequence, remove everything before the first occurrence of $S$.
    2. Count the possible expansions of $S$ with one additional event at the end.
    3. Repeat the same procedure for the resulting frequent 2-sequences, and so on, until no new frequent $k$-sequences are discovered.
Example 14.4. We are going to do the same example again. The database is
Database
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{B} {A,B} {C}]
with frequent 1-sequences
F1
Subsequence | Count
[{A}] | 3
[{B}] | 4
[{C}] | 4
So we can start with A as a prefix. We project the database:
Database projected to {A}
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
3 | [{B} {C} {D}]
4 | [{A,B} {C}]
And we count possible 2-sequences starting with {A}:
C2 projected to {A}
Subsequence | Count
[{A} {A}] | 1
[{A} {B}] | 1
[{A,B}] | 3
[{A} {C}] | 3
[{A,C}] | 0
Now we select { A,B } as prefix and count its possible extensions:
Database projected to {A,B}
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
4 | [{A,B} {C}]
C3 projected to {A,B}
Subsequence | Count
[{A,B} {A}] | 1
[{A,B} {B}] | 1
[{A,B} {C}] | 3
Now we select {A,B} {C} as prefix and again count its possible extensions (in this case the projected database only has one record, so there is no need to count):
Database projected to {A,B} {C}
Id | Sequence
1 | [{A,B} {C} {A}]
And with { A } { C } :
Database projected to {A} {C}
Id | Sequence
1 | [{A,B} {C} {A}]
2 | [{A,B} {B} {C}]
4 | [{A,B} {C}]
C3 projected to {A} {C}
Subsequence | Count
[{A} {C,A}] | 0
[{A} {C} {A}] | 1
[{A} {C,B}] | 0
[{A} {C} {B}] | 0
[{A} {C} {C}] | 0
We would repeat the same with prefixes { B } and { C } , and the result has to be the same as with the two previous algorithms.

14.5 Some comments

Algorithm | How | Time | Space
GSP | Candidate generation | Counting against the DB | Space allocated for candidate generation
SPADE | Candidate generation | Counting against the ID-lists | Space allocated for the ID-lists
PrefixSpan | No candidate generation (prefix-based growth) | Counting against the projected DBs | Space allocated for the projected DBs (one possible optimization uses pointers)

Part VI Stream Data Mining

15 Stream data mining

A data stream is a constant flow of data, which usually carries too much information for it to be feasible to store. Somehow, we need to be able to apply data mining algorithms to datasets whose records we are only able to see once. This might seem esoteric, but there are plenty of applications with this kind of need: for instance, credit card transactions, wearable sensor information, connected vehicles, the Internet of Things, etc.
More precisely, the challenges that data streams pose are:

15.1 Bloom filter

The Bloom filter is an idea designed to answer the question:
'Has this incoming element ever occurred in the data stream before?'
The idea is to summarize the past of the data stream so that we can answer this question with a reasonable level of confidence. A Bloom filter is a data structure that consists of a binary array B of m bits (initialized to False) and a set of w independent hash functions h_1, ..., h_w, each mapping elements to positions in the array.
The algorithm developed to initialize this data structure is described in Algorithm 11. Basically, each element is hashed using all the hash functions, and the corresponding positions in the array are set to True.
B = array(length=m, type=bool, default=False)
repeat
	receive next element x in S
	for i = 1..w 
		idx = hi(x) 
		B[idx] = True
until S ends
return B
Algorithm 11: BloomConstruct(Stream S, Size m, Number of hash w)
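A minimal Python sketch of a Bloom filter with a combined query-and-insert operation; the two hash functions mirror Example 15.1 and are assumptions for illustration (real deployments use stronger, independent hash functions).

def make_bloom(m, hash_fns):
    return {"bits": [False] * m, "h": hash_fns}

def bloom_seen_and_add(bf, x):
    # Returns True if x was (probably) seen before; always records x afterwards.
    idxs = [h(x) for h in bf["h"]]
    seen = all(bf["bits"][i] for i in idxs)
    for i in idxs:
        bf["bits"][i] = True
    return seen

bf = make_bloom(10, [lambda k: k % 10, lambda k: (k + 7) % 10])
for x in [3, 8, 15, 3, 5]:
    print(x, bloom_seen_and_add(bf, x))
# 3, 8, 15 -> False (new); the second 3 -> True; 5 -> True (a false positive, as in the example)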
Now, when a new element $x\in S$ comes from the stream, we compute $h_1(x),\dots,h_w(x)$. If all the corresponding positions in the Bloom filter are True, then we are confident (if the hash functions are well designed) that the element has already been seen before. If any of them is False, we know for sure that the element has not been seen before, and we update the filter.
Example 15.1. Suppose we have the stream [ 3,8,15,3,5 ] , m=10 , h 1 ( k ) =k mod 10 and h 2 ( k ) =( k+7 ) mod 10 . The procedure would be:
  1. Start with an all-False Bloom filter: [F, F, F, F, F, F, F, F, F, F]
  2. 3 enters: $h_1(3)=3$ and $h_2(3)=0$. Both F, so 3 has never been seen before. Update B: [T, F, F, T, F, F, F, F, F, F]
  3. 8 enters: $h_1(8)=8$ and $h_2(8)=5$. Both F, so 8 has never been seen before. Update B: [T, F, F, T, F, T, F, F, T, F]
  4. 15 enters: $h_1(15)=5$ and $h_2(15)=2$. The first is T, but the second is F, so 15 has never been seen before. Update B: [T, F, T, T, F, T, F, F, T, F]
  5. 3 enters: $h_1(3)=3$ and $h_2(3)=0$. Both T, so we assume 3 has indeed been seen before. In this case we are right! B is not updated.
  6. 5 enters: $h_1(5)=5$ and $h_2(5)=2$. Both T, so we assume 5 has indeed been seen before. In this case we are wrong... it is a false positive.
Remark 15.1. Note how little by little the array gets populated by True values, making false positives increasingly likely.

15.2 Count-Min Sketch

In this case, we are interested in answering the question:
'How many times has this incoming element appeared in the data stream before?'
Again, the idea is to build a summary of the past data that enables us to answer the question. A count-min sketch is a data structure which consists of a $w\times m$ matrix CM of integer counters (initialized to 0) and $w$ hash functions $h_1,\dots,h_w$, one per row, each mapping elements to one of the $m$ columns.
The algorithm developed to initialize this data structure is described in Algorithm 12. Basically, each element is hashed using all the hash functions, and the corresponding counters are incremented by 1. There are $w$ rows to make collisions less harmful: the count of $x$ is overestimated only if, in every row $i$, some other element collides with $x$ under $h_i$. To estimate the number of times an element has appeared, we take the minimum of the accessed cells, because every cell is an upper bound on the true count and the minimum is the tightest one.
CM = Matrix(nrows=w, ncols=m, type=int, default=0)
repeat
	receive next element x in S
	for i=1..w
		idx = hi(x)
		CM[i,idx] += 1
until S ends
return CM
Algorithm 12: CountMinConstruct(Stream S, Width w, Height m)
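A minimal sketch of a count-min sketch with an update and a point query; the dimensions and hash functions follow Example 15.2 below and are purely illustrative.

def make_cm(w, m, hash_fns):
    return {"CM": [[0] * m for _ in range(w)], "h": hash_fns}

def cm_update(cm, x):
    for i, h in enumerate(cm["h"]):
        cm["CM"][i][h(x)] += 1

def cm_estimate(cm, x):
    # The minimum over the rows is the tightest upper bound on the true count.
    return min(cm["CM"][i][h(x)] for i, h in enumerate(cm["h"]))

cm = make_cm(2, 10, [lambda k: k % 10, lambda k: (2 * k) % 10])
for x in [3, 8, 15, 3, 5]:
    print(x, "seen", cm_estimate(cm, x), "times before")
    cm_update(cm, x)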
Example 15.2. Suppose we have the stream [ 3,8,15,3,5 ] , m=10 , h 1 ( k ) =k mod 10 and h 2 ( k ) =( 2k ) mod 10 . The procedure would be:
  1. Start with an all-0 count-min sketch:
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  2. 3 enters: $h_1(3)=3$ and $h_2(3)=6$. Both 0, so 3 has been seen 0 times before. Update CM:
     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
     [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
  3. 8 enters: $h_1(8)=8$ and $h_2(8)=6$. One cell is 1 and the other is 0, so 8 has never been seen before. Update CM:
     [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
     [0, 0, 0, 0, 0, 0, 2, 0, 0, 0]
  4. 15 enters: $h_1(15)=5$ and $h_2(15)=0$. Both 0, so 15 has never been seen before. Update CM:
     [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
     [1, 0, 0, 0, 0, 0, 2, 0, 0, 0]
  5. 3 enters: $h_1(3)=3$ and $h_2(3)=6$. One is 1 and the other is 2, so the minimum tells us it has been seen once before. Update CM:
     [0, 0, 0, 2, 0, 1, 0, 0, 1, 0]
     [1, 0, 0, 0, 0, 0, 3, 0, 0, 0]
  6. 5 enters: $h_1(5)=5$ and $h_2(5)=0$. Both 1, so we assume 5 has been seen once before. In this case we are wrong... it is a false positive. The table is updated again:
     [0, 0, 0, 2, 0, 2, 0, 0, 1, 0]
     [2, 0, 0, 0, 0, 0, 3, 0, 0, 0]
Remark 15.2. Note that if 15 went in again, we would fail again, and in fact from this point on we will always fail with the count of 15s and 5s seen.

15.3 Flajolet-Martin algorithm

The Flajolet-Martin algorithm aims at answering the question
'How many distinct elements appeared in the data stream before?'
The intuitive idea is as follows: if the hash function distributes the values evenly over its range, then the probability of getting any particular value after hashing is $\frac{1}{|range|}$. Thus, if we take the binary representation of the output, the last digit is 0 with probability $\frac{1}{2}$, because it has the same chance of being 0 or 1. Likewise, the output has its last two digits equal to 0 approximately 1 out of 4 times, and so on: the probability that the last $k$ digits are 0 is $\frac{1}{2^k}$, so if we see a value with $k$ trailing zeros, the number of distinct elements seen is expected to be about $2^k$.
The algorithm works as follows:
  1. Given a data stream $S$, use a hash function $h: S\to[0,2^L-1]\subset\mathbb{Z}$, where $L$ is chosen such that $2^L$ is larger than the number of distinct elements. $L$ is usually chosen to be 64.
  2. For each incoming record $x$, take the binary representation of $h(x)$.
  3. Count the number of trailing zeros in the binary representation of $h(x)$.
  4. Keep a variable $R_{max}$ with the maximum number of trailing zeros found so far.
  5. The expected maximum number of trailing zeros over all stream elements is $E[R_{max}]\approx\log_2(0.77351\,n)$, where $n$ is the number of distinct values. So, if we want to know how many distinct values we have seen so far, we estimate $n\approx\frac{2^{R_{max}}}{0.77351}$.
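A minimal sketch of this estimator; the hash function used below is the one from Example 15.3 and is only for illustration (in practice L = 64 and a much stronger hash is used). Returning 0 trailing zeros for a hash value of 0 is an assumption of this sketch.

def trailing_zeros(v):
    # Number of trailing zero bits in the binary representation of v.
    if v == 0:
        return 0
    tz = 0
    while v % 2 == 0:
        v //= 2
        tz += 1
    return tz

def flajolet_martin(stream, h):
    r_max = 0
    for x in stream:
        r_max = max(r_max, trailing_zeros(h(x)))
    return (2 ** r_max) / 0.77351

print(flajolet_martin([3, 8, 15, 3, 5], lambda x: (7 * x + 5) % 32))   # about 10.3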
Example 15.3. Suppose we have the stream [ 3,8,15,3,5 ] , h( x ) =( 7x+5 ) mod 32 .
  1. 3 enters: $h(3)=26=11010_2$. One trailing zero, so $R_{max}=1$.
  2. 8 enters: $h(8)=61\bmod 32=29=11101_2$. No trailing zeros, so $R_{max}=1$.
  3. 15 enters: $h(15)=14=01110_2$. One trailing zero, so $R_{max}=1$.
  4. 3 enters. $R_{max}=1$.
  5. 5 enters: $h(5)=8=01000_2$. Three trailing zeros, so $R_{max}=3$.
Thus, the estimation is $n\approx\frac{2^3}{0.77351}=10.3$, while the real value is 4. Note, nonetheless, that as we are working with probabilities, the results are not expected to be accurate when there are very few values.

Improvements to Flajolet-Martin

  1. It can happen that one of the first values has lots of trailing zeros, ruining the algorithm. A solution to avoid this is using several hash functions h 1 ,..., h w and keep the maximum number of trailing zeros for each of them R max,1 ,..., R max,w . At the end, we would obtain R max as the average of all these.
  2. It is also possible to have several points in which the records are taken, and thus we would like to synchronize the results from every of them. The solution is to treat all these points as if they were only one. Let's explain this a little bit. Imagine we have m measure points, where m streams are measured, one stream per measure point. Then, we would have m values R max,i for i=1,...,m and we want to obtain the combined R max . The idea is that, if we think of all the m streams as a single stream which has been divided, then we would just take R max as the maximum number of trailing zeros observed in the stream. Thus, when it is divided, if we know the maximum number of trailing zeros in each of the substreams, we also know the maximum number of trailing zeros in the whole stream: the maximum among the substreams! Thus, we would take R max = max i=1,...,m { R i } . Each of the substreams would work exactly as explained before (maybe with the Improvement 1 implemented).

15.4 Hyperloglog

Hyperloglog is a generalization of Flajolet-Martin, which tries to improve the predictions to answer the same question of how many different values have been seen in the stream before.
The idea is that we use $k=2^b$ buckets to track the trailing zeros observed in the incoming records. For an incoming record $x$, we compute $h(x)$ and take its binary representation; the bucket it falls into is the one numbered with the first $b$ bits of that representation. In each bucket, we keep the maximum number of trailing zeros, $R_{max,i}$, over the records classified into it. At the end, as explained by Flajolet et al. in [2], the estimation is computed as $n\approx\frac{k\cdot HM_{i=1}^{k}(2^{R_{max,i}})}{0.77351}$, where $HM_{i=1}^{k}(x_i)=\frac{k}{\frac{1}{x_1}+\dots+\frac{1}{x_k}}$ is the harmonic mean.
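A minimal sketch of this bucketed estimate, reusing the trailing_zeros helper from the Flajolet-Martin sketch above; taking the bucket index from the top b bits of an L-bit hash value follows the description above, and the parameter names are illustrative.

def hyperloglog_estimate(stream, h, b, L):
    k = 2 ** b
    R = [0] * k
    for x in stream:
        v = h(x)
        bucket = v >> (L - b)            # first b bits of the L-bit hash value
        R[bucket] = max(R[bucket], trailing_zeros(v))
    harmonic_mean = k / sum(1.0 / (2 ** r) for r in R)
    return k * harmonic_mean / 0.77351

# Example 15.4: 5-bit hash values, 2 buckets.
print(hyperloglog_estimate([3, 8, 15, 3, 5], lambda x: (7 * x + 5) % 32, b=1, L=5))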
Example 15.4. Suppose we have the stream [ 3,8,15,3,5 ] , h( x ) =( 7x+5 ) mod 32 and there are two buckets.
  1. 3 enters: $h(3)=26=11010_2$. Bucket 1, $R_1=1$.
  2. 8 enters: $h(8)=29=11101_2$. Bucket 1, $R_1=1$.
  3. 15 enters: $h(15)=14=01110_2$. Bucket 0, $R_0=1$.
  4. 3 enters. Bucket 1, $R_1=1$.
  5. 5 enters: $h(5)=8=01000_2$. Bucket 0, $R_0=3$.
Thus, the estimation is $n\approx\frac{2\cdot HM(2^3,2^1)}{0.77351}=\frac{2\cdot 3.2}{0.77351}\approx 8.3$.

Part VII Outlier mining

16 Outlier Mining

Definition 16.1. An outlier is a data object that deviates significantly from the normal objects as if it were generated by a different underlying mechanism.
Remark 16.1. Note that an outlier is not the same as noise in the data. Noise is due to random errors or the variance in a measured variable and for a good outlier analysis, it is required that noisy records are removed first.
Outliers are far more interesting than noise, because they are generated by a different mechanism than the usual data.
As we saw in stream data mining, data can vary over time, and in the early stages of a change process, the new records would be seen as outliers from the past ones. But little by little we would notice that they are not outliers, they are the result of a change in the underlying process. Thus, outlier detection is also a part of the novelty detection process.

Remark 16.2. Some use cases are:

16.1 Types of outliers

16.2 Challenges of outlier detection

16.3 Supervised methods for outlier detection

We can try to model outlier detection as a classification problem, in which the samples are examined and classified by domain experts and are later used for training and testing models.

16.3.1 One-Class model

The idea of the one-class model is to train a classification model that can distinguish normal data from outliers. It works by learning the decision boundary of the normal class, which requires a sample of normal data that is as representative as possible. Then, all samples that lie outside this boundary are labeled as outliers.
It has the advantage that it can detect new outliers that are different from past outliers.
It is possible to extend the model to a multi-class model, in which the normal objects might be also classified into multiple classes.
image: 35_home_runner_work_BDMA_Notes_BDMA_Notes_ULB_Data_Mining_LectureNotes_source_outliers_3.png
Figure 6: Basic diagram of a one-class model.

16.4 Unsupervised methods for outlier detection

If we assume that the normal objects are somehow clustered into multiple groups, each of them having some distinctive features, we can expect outliers to be far away from any group of normal records.
This approach has the drawback that it is very hard to detect collective outliers effectively.

16.4.1 Proximity-based methods

In this case, a record is considered an outlier if the nearest neighbors of the object are far away. The effectiveness of these methods relies on the proximity measure used, and it is often hard to find groups of outliers that are close to each other.

Distance-based outlier detection

Let $r>0$ be a distance threshold and $\tau\in(0,1]$ be a fraction threshold. An object $o$ is a $DB(r,\tau)$-outlier if $\frac{card\{o'\mid dist(o,o')\le r\}}{card(D)}\le\tau$. So we count how many objects are within distance $r$ of $o$, and if these account for less than the predefined fraction $\tau$, then $o$ is considered an outlier.
If we do it just like this, we test each object against the whole data set one by one. This can be very costly for big datasets, and it can be improved with the CELL method, which is a grid-based method. In this case, the data space is partitioned into a multidimensional grid, in which each cell is a hypercube with diagonal length $\frac{r}{2}$. This allows us to improve the efficiency of the method by using two pruning rules:
  1. Definitions: Let $C$ be the cell whose objects we want to assess:
    1. The level-1 cells are the cells adjacent to $C$.
    2. The level-2 cells are the further cells that still have some point within distance $r$ of $C$.
  2. Cell $C$: since the diagonal is $\frac{r}{2}$, all objects in cell $C$ are within distance $r$ of each other. Call $a=card(C)$.
  3. Level-1 cell pruning rule: let $b_1$ be the number of objects in the level-1 cells. If $a+b_1>\tau n$, then no object $o$ in $C$ is a $DB(r,\tau)$-outlier, because all objects in $C\cup\text{level-1}$ are in the $r$-neighborhood of $o$ and there are more than $\tau n$ of them.
  4. Level-2 cell pruning rule: let $b_2$ be the number of objects in the level-2 cells. If $a+b_1+b_2<\tau n+1$, then all objects in $C$ are $DB(r,\tau)$-outliers.
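For reference, here is a naive sketch of the DB(r, τ) test itself (without the CELL pruning), assuming a small in-memory dataset and Euclidean distance; note that each object is counted in its own neighborhood here.

import math

def db_outliers(D, r, tau):
    # An object o is a DB(r, tau)-outlier if the fraction of objects within
    # distance r of o is at most tau.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    n = len(D)
    return [o for o in D if sum(dist(o, p) <= r for p in D) / n <= tau]

D = [(0, 0), (0.5, 0.2), (0.3, 0.4), (10, 10)]
print(db_outliers(D, r=1.0, tau=0.3))   # the isolated point (10, 10)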

16.4.2 Density-based methods

The idea in this case is that the density around an outlier object is significantly different from the density around its neighbors. Thus, we can use the relative density of an object against its neighbors as the indicator of the degree of the object being an outlier.
For this, we define the $k$-distance of an object $o$, $dist_k(o)$, as the distance between $o$ and its $k$-th nearest neighbor, and the $k$-distance neighborhood of $o$ as $N_k(o)=\{o'\mid o'\in D,\ dist(o,o')\le dist_k(o)\}$. Note that this set can have more than $k$ elements, because multiple elements can be at identical distance from $o$. The ratio $\frac{k}{card(N_k(o))}$ is then an indicator of the density around the point $o$, which can be compared with that of its neighbors to decide whether $o$ is an outlier or not.

16.4.3 Clustering-based methods

The idea now is to cluster the data and classify an object as an outlier if:
  1. It does not belong to any cluster.
  2. There is a large distance between the object and its closest cluster.
  3. It belongs to a small or sparse cluster.
This approach presents several advantages:
But it also has some drawbacks:

References

1Charu C. Aggarwal, Data Mining (Springer International Publishing, 2015).
2Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier, "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm", in Discrete Mathematics and Theoretical Computer Science (2007), pp. 137--156.
3Mahmoud Sakr, "INFOH423 Data Mining".
4Ian Witten, Eibe Frank, and Mark Hall, Data Mining: Practical Machine Learning Tools and Techniques (Elsevier, 2011).