INFOH423 - Data Mining
Fall 2022
Jose Antonio Lorencio Abril
Professor: Mahmoud Sakr
Student e-mail: jose.lorencio.abril@ulb.be
This is a summary of the course Data Mining, taught at the Université Libre de Bruxelles by Professor Mahmoud Sakr in the academic year 22/23. Most of the content of this document is adapted from the course notes by Sakr [3] and from the basic bibliographic source of the course, the book by Aggarwal [1], so I won't be citing them all the time. Other references will be provided when used.
Part I Introduction
1 What is data mining?
Even though there is no universally accepted formal definition of data mining, most definitions agree that data mining is a field of study that focuses on the collection, cleaning, processing and analysis of data, and on gaining useful insights from the data that we have access to.
As wide as these topics are, so is the data mining domain, which is nowadays a really hot topic both in academia and in industry.
In academia, there exist many goals to achieve in this field:
- Developing new models.
- Developing new ways to deal with real world data.
- Understanding what some complex models are actually doing.
- Developing the mathematical tools to better describe the models that are made by computer scientists in a less formalized way (it is not rare that scientific models come before their mathematical formalization).
- Using data mining to gain a better understanding of medical, biological, chemical, economic,... data than we are able to gain using just traditional statistical techniques.
In industry, the focus is mainly on two uses (disregarding research companies, which have objectives similar to academia):
- Using well known models to understand what happened in the past.
- Using well known models to try to predict what will happen in the future.
These usages are made with the aim of improving business processes and, ultimately, maximizing profit.
1.1 Why is data mining important today, if it was not yesterday?
Because computing power has increased enormously, so we are now able to run algorithms that train models in a reasonable time, a task that was practically infeasible a few years ago.
Not only that, but the data technology is rapidly evolving, too. We produce more data and store more data, so... we have more data! This data is potential knowledge and we know many tools to get this knowledge from it.
In fact, the amount of data is so vast that we not only have to develop techniques to analyze data, but also to manage huge amounts of it.
2 The Data Mining Process
The data mining process is a pipeline constructed around the basic steps:
- Data collection: obtaining data from real world sources.
- Feature extraction and data cleaning: among all data retrieved from the real world, we have to select those characteristics or features that are relevant for our purposes (feature extraction) and to decide what to do with mistaken/noisy/lost data (data cleaning). Also, as we might be collecting data from different sources, it is important to decide how to aggregate the data into a unified format for later processing.
- Analytical processing and algorithms: we now must develop suitable methods for analyzing our data. That is, we should decide what mathematical models to use to describe our data, and what algorithms to use in order to train the desired model.
- After these steps, we enter a phase of analyzing the obtained results to extract knowledge from the data, as well as an iterative procedure in which we can return to any previous step and try to improve different parts of the process.
3 Data Types
Not all data have the same nature or characteristics, so it is important to understand the differences between them and what techniques are applicable to which types of data.
We can characterize data in different levels of detail, and the most basic classification would distinguish between nondependency-oriented data and dependency-oriented data:
- Nondependency-oriented data: simple data types with no specified dependencies between the data items or the attributes.
- Dependency-oriented data: implicit or explicit relationships may exist between data items.
Dependency-oriented data are normally more complex to study because of the need to study not only the data itself, but also the relationships between different data items.
We will now define different subtypes of data that we can find.
3.1 Nondependency-oriented data
The term nondependency-oriented data is interchangeable with the term multidimensional data:
Definition 3.1.
We will call the source space the set of all possible values that our data can take. This set does not necessarily have to take any particular form.
If the source space is a product space, then each component is called a feature.
Example 3.1.
If we are measuring the name, age, height and gender of the students of a school, then the source space has the form $\text{Names} \times \mathbb{N} \times \mathbb{R}^{+} \times \{M, F\}$, where Names represents the set of possible names and $\{M, F\}$ are the two possible values of the gender.
Definition 3.2.
A record (data point, instance, tuple) is a point of the source space that we measure and store in some form.
Example 3.2.
Following the previous example, some records of this source space are represented in the following table:
index | Name | Age | Height | Gender
1 | Josh | 23 | 1.77 | M
2 | Mary | 28 | 1.62 | F
3 | Larry | 12 | 1.58 | M
Table 1: Records and multidimensional data set example
Note that we added an index column, because it is a common practice.
Definition 3.3.
A multidimensional data set is a set of records.
Example 3.3.
Table 1 (from Example 3.2) is also an example of a data set.
As we can see, Table 1 contains attributes of different types (as is clear from the definition of its source space). Thus, we also have to take into account the type of each attribute of our data:
- Quantitative multidimensional data: numerical data features, such as age or height. If a data set is wholly composed of this kind of features, it is said to be a quantitative multidimensional data set.
This type of data is the easiest to analyze, as mathematical tools are directly applicable and most algorithms are developed assuming this type of data. For this reason, it is common to try to transform all non-quantitative data into quantitative data.
- Categorical and Mixed Attribute data: a categorical feature is one that can only take values from a finite (unordered) set of options, such as the gender.
If we encounter a data set composed of categorical data, we say it is a categorical multidimensional data set.
A data set with both quantitative and categorical features is called a mixed multidimensional data set.
- Binary and Set data: binary data take values in $\{0,1\}$ and can be considered a special case of both numerical data (obviously) and categorical data (if we have a categorical feature which can only take two values, then it is easy to map these values to the set $\{0,1\}$). For example, the gender is an obvious case of binary data.
Moreover, binary data can be seen as setwise data, where 1 indicates that the instance is in the set, while 0 indicates it is not.
- Text data: usually a string, such as the name.
3.2 Dependency-oriented data
As outlined before, we can find implicit or explicit dependencies between instances:
- Implicit dependencies: the dependencies between instances are not explicitly specified but are known to exist. For example, if we measure the number of students in the library every 5 minutes, we will find that different instances are related via the temporal dimension, so measurements with a small time delay between them will tend to be similar.
- Explicit dependencies: this term usually refers to graph or network data, in which edges represents relationships between nodes.
As before, let us look a bit deeper into some types of this kind of data:
- Time-Series data: it contains data that are generated by continuous measurement over time. This means that our source space has a temporal component. This dependency is implicit.
Formally, a time series of length $n$ and dimensionality $d$ contains $d$ numeric features at each of $n$ time stamps $t_1, \dots, t_n$. Each time stamp contains a component for each of the $d$ series. Therefore, the set of values received at time stamp $t_i$ is $\overline{Y_i} = (y_i^1, \dots, y_i^d)$.
- Discrete sequences: these are the categorical analog of time-series data. We will now have categorical or text features along the temporal dimension. This dependency is implicit.
- Spatial data: in this type of data, the dependence of the instances is given by their proximity in space. For example, if we measure the temperature at different locations in a room, we will find that points that are nearby show more similar temperatures than points that are far away from each other. This dependency is implicit.
- Spatiotemporal data: this data captures both spatial and temporal dimension, so we have to deal with both relationships. As before, this dependency is implicit.
- Network and graph data: now, data values may correspond to nodes in the network, and the relationships between them would correspond to the edges between the nodes.
Formally, a network is a pair $G = (N, A)$, where $N$ is a set of nodes and $A$ is a set of edges that represent the relationships between the nodes. There can be attributes associated with both nodes and edges.
Edges may be directed or undirected, depending on whether the link is one-directional or bidirectional.
Part II Classification
A classification problem consists in learning the structure of a dataset of examples, already partitioned into groups referred to as classes. This learning is typically achieved with a model, which is used to estimate the class labels of unseen data examples with unknown labels. Thus, one of the inputs to the classification problem is the example dataset with known labels, called the training data, while the unseen data points to be classified are the test data. The model learnt is referred to as the training model. The algorithm used to create the model is the learner.
The output of the classification algorithm can be of two types:
- Label prediction: a label is predicted for each test instance.
- Numerical score: the learner assigns a score to each instance-label possible combination. This score measures the propensity of the instance to belong to a particular class.
4 Decision Trees
Decision trees are a classification methodology which uses a tree structure to partition the feature space. Each node of the tree represents a decision to make according to the data, called the split criterion, which is a condition on one or more feature variables of the training data.
The goal is to identify a split criterion such that the level of mixing of the class variables in each branch of the tree is reduced as much as possible.
The splits can be univariate, if they use a single attribute in the condition, or multivariate, if more than one attribute are used in the condition.
The nodes can be of two types:
- Internal node: each internal node represents a partition of the space according to a certain condition.
- Leaf node: they are labeled with the dominant class of the remaining partition of the training set at that node.
The general algorithm for constructing a decision tree is as follows:
begin
Create root node containing D;
repeat
Select an eligible node in tree;
Split the selected node into two or more nodes based on the split criterion;
until no more eligible nodes for split;
Prune overfitting nodes from tree;
Label each leaf node with its dominant class;
end
Algorithm 1: GenericDecisionTree(Dataset: D)
The split and prune steps are what change across the different algorithms to produce a specific decision tree.
4.1 Split criteria
The split criterion aims to maximize the separation of the different classes among the children nodes. Its design depends on the attributes of the data:
- Binary attributes: produce a binary tree.
- Categorical attribute: there are several approaches:
- $r$-way split: the node is split into as many branches as there are distinct values of the attribute.
- Binary split: testing each of the possible groupings of the categorical values into two sets, and selecting the best one.
- Numeric attribute: we have, again, several possibilities:
- If it contains a small number of ordered values, we can treat it as a categorical attribute and apply an $r$-way split.
- For continuous numeric attributes, the split is performed using a binary condition, such as $x \le a$ for a certain constant $a$.
These methods require determining the best split among a set of candidate splits, so we need a way to measure which one is better than another.
To this end, we use the entropy.
4.1.1 Entropy
Definition 4.1.
Let $p_j$ be the fraction of data points belonging to class $j$ for the attribute value $v_i$. Then, the class-based entropy, $E(v_i)$, for the attribute value $v_i$ is
$$E(v_i) = -\sum_{j=1}^{k} p_j \log_2(p_j).$$
Definition 4.2.
The overall entropy of an attribute, $E(A)$, is defined as the weighted average over its $r$ different attribute values:
$$E(A) = \sum_{i=1}^{r} f_i\, E(v_i),$$
where $f_i$ is the frequency (fraction of records) of attribute value $v_i$.
The entropy is used in the ID3 algorithm for constructing decision trees.
The overall entropy for an $r$-way split of a set $S$ into sets $S_1, \dots, S_r$ may be computed as the weighted average of the entropy values $E(S_i)$ of each $S_i$, the weight of $S_i$ being $|S_i|/|S|$. This is called the entropy-split:
$$\text{Entropy-Split}(S \Rightarrow S_1, \dots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, E(S_i).$$
In relation to this, the information gain is defined as the reduction in entropy due to the split:
$$IG(S \Rightarrow S_1, \dots, S_r) = E(S) - \text{Entropy-Split}(S \Rightarrow S_1, \dots, S_r).$$
Note that lower values of the entropy-split and higher values of the information gain are more desirable.
Sometimes, there are attributes with many distinct values, so using them to split the data reduces the entropy a lot, but they are not very useful for prediction. To account for this, we can divide the overall information gain by the normalization factor $-\sum_{i=1}^{r} \frac{|S_i|}{|S|}\log_2\left(\frac{|S_i|}{|S|}\right)$, which helps adjust for the varying number of categorical values.
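As a small illustration of these formulas, here is a Python sketch that computes the class-based entropy of a list of labels and the information gain of splitting on an attribute; representing the records as a list of dicts is only an illustrative choice.

import math
from collections import Counter

def entropy(labels):
    # Class-based entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, class_attr):
    # Entropy of the class minus the entropy-split induced by `attribute`.
    n = len(records)
    base = entropy([r[class_attr] for r in records])
    split_entropy = 0.0
    for value in set(r[attribute] for r in records):
        subset = [r[class_attr] for r in records if r[attribute] == value]
        split_entropy += len(subset) / n * entropy(subset)
    return base - split_entropy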
Example 4.1.
Entropy of the dataset 'Weather data'
Consider the following dataset:
outlook | temperature | humidity | windy | play
sunny | hot | high | false | no
sunny | hot | high | true | no
overcast | hot | high | false | yes
rainy | mild | high | false | yes
rainy | cool | normal | false | yes
rainy | cool | normal | true | no
overcast | cool | normal | true | yes
sunny | mild | high | false | no
sunny | cool | normal | false | yes
rainy | mild | normal | false | yes
sunny | mild | normal | true | yes
overcast | mild | high | true | yes
overcast | hot | normal | false | yes
rainy | mild | high | true | no
Table 2: The weather data. Source: [4].
If the class attribute is play, what is the entropy of this source?
What is the Entropy-Split of the attribute humidity? And the information gain?
4.2 ID3 Tree Induction Algorithm
ID3 is an algorithm to construct decision trees, in which the split criterion is the maximization of the information gain and the pruning strategy is to stop when all the records in a node are of the same class. The algorithm is detailed in Algorithm 2.
Algorithm 2:
ID3(I, O, T) : Decision Tree
# I is the set of input attributes
# O is the output attribute
# T is a set of training data
if (T is empty) then
return a single node with value “Failure”
if (all records in T have the same value for O) then
return a single node with that value
if (I is empty) then
return a single node with the most frequent value of O in T
# else
compute IG for each attribute in I using data in T
Let X = argmax{IG(attr) for attr in I}
Let {x_j for j=1,...,m} be the distinct values of X
Let {T_j for j=1,...,m} be the subsets of T when partitioned by the values of X
return a tree with:
root node labelled X
arcs labelled x_1,...,x_m
connected to
ID3(I-{X}, O, T_1),...,ID3(I-{X}, O, T_m)
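For concreteness, a compact Python sketch of this recursion, reusing the information_gain helper from the sketch in Section 4.1.1; the nested-dict tree representation is only an illustrative choice.

from collections import Counter

def id3(records, attributes, class_attr):
    # Returns a nested dict {attribute: {value: subtree}} or a class label at the leaves.
    if not records:
        return "Failure"
    labels = [r[class_attr] for r in records]
    if len(set(labels)) == 1:              # all records share the same class
        return labels[0]
    if not attributes:                     # no attributes left: most frequent class
        return Counter(labels).most_common(1)[0][0]
    # split on the attribute with maximum information gain
    best = max(attributes, key=lambda a: information_gain(records, a, class_attr))
    tree = {best: {}}
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], class_attr)
    return tree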
Example 4.2.
Compute the decision tree of the data in Table
3.2 using ID3 algorithm.
Step 1: Root
We start by computing the information gain of each attribute with respect to the class play.
From the previous example we already know the entropy of the class, $E(\text{play}) \approx 0.94$, and $IG(\text{humidity}) \approx 0.15$. Let's compute the rest of the values: $IG(\text{outlook}) \approx 0.25$. Repeating this process with temperature, we get $IG(\text{temperature}) \approx 0.03$, and with windy, we get $IG(\text{windy}) \approx 0.05$.
This means that outlook has the largest gain, so we label the root node with outlook and we create three arcs, each of them with one of the values sunny, overcast and rainy. So we have the following tree:
Step 2: Outlook=sunny
Now, we are going to do the same thing, restricting ourselves to the records for which Outlook=sunny. Now, we have to recompute the entropy and the gain for each of the rest of the attributes.
Let's start with the entropy of this subset, which is about 0.97. Now, the information gain of humidity on this subset is also about 0.97, i.e., it equals the entropy of the node. As this cannot be improved, we can safely skip computing the rest of the values. This way, the tree will now look as follows:
Step 3: Outlook=Sunny, Humidity=Normal
Note that we are proceeding heightwise, but doing this breadthwise is also possible.
This time, all the values for play are yes, so we enter the corresponding base case of the algorithm and label the node as yes.
Step 4: Outlook=Sunny, Humidity=High
Same, now the label is no.
Step 5: Outlook=Overcast
Same, now the label is yes.
So, now we have the following tree:
Step 6: Outlook=rainy
Here the information gain of windy again equals the entropy of the node, so it is maximal and we can use windy as the split attribute:
Step 7: Outlook=rainy, Windy=False
All records have play = yes.
Step 8: Outlook=rainy, Windy=True
All records have play = no.
So, we have the tree
And as there are no more nodes to analyze, this is the final decision tree.
4.2.1 The problem of UIDs
In general, attributes that have very many distinct values (such as unique identifiers) have very high gain, but can lead to useless decision trees. Quinlan suggests choosing the attribute with the highest gain ratio, i.e., the information gain with respect to the label attribute divided by the entropy of the attribute itself.
The GainRatio favors attributes with higher gain, and penalizes attributes with high entropy (many values).
Example 4.3.
Repeat the decision tree ID3 algorithm, but use GainRatio instead.
The same tree is obtained.
5 Bayesian classification
Probabilistic classifiers construct a model that quantifies the relationships between the feature variables and the target variable as a probability. We are going to study a well-known kind of probabilistic classifier, namely the Bayesian classifier or Naive Bayes classifier.
5.1 Naive Bayes classifier
The Bayes classifier is based on the Bayes' theorem for conditional probabilities.
Theorem 5.1.
Bayes' Theorem
If $A$ and $B$ are probabilistic events and $P(B) \neq 0$, then
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$$
Proof.
On one side, $P(A \cap B) = P(A \mid B)\, P(B)$.
On the other side, if $P(A) \neq 0$, then $P(A \cap B) = P(B \mid A)\, P(A)$. Substituting this value in the previous equation, we get $P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$, which is the result we wanted to prove.
If $P(A) = 0$, then $P(A \cap B) = 0$, so $P(A \mid B) = 0$ and the formula also holds (with the convention $P(B \mid A)\, P(A) = 0$).
This theorem quantifies the conditional probability of a random variable (the class variable), given known observations about other variables (the features). Let $C$ be the class variable and $\overline{X} = (a_1, \dots, a_d)$ an unseen feature tuple. The goal of the method is to estimate $P(C = c \mid \overline{X} = (a_1, \dots, a_d))$. Let the random variables for the individual dimensions of $\overline{X}$ be denoted by $x_1, \dots, x_d$, so we want to estimate
$$P(C = c \mid x_1 = a_1, \dots, x_d = a_d) = \frac{P(C = c)\, P(x_1 = a_1, \dots, x_d = a_d \mid C = c)}{P(x_1 = a_1, \dots, x_d = a_d)}.$$
Here, we are interested in maximizing this value. Since the denominator is the same independently of the class $c$, we can focus on maximizing the numerator
$$P(C = c)\, P(x_1 = a_1, \dots, x_d = a_d \mid C = c).$$
The value $P(C = c)$ is the prior probability of the class identifier and can be estimated as the fraction of points in the data whose class is $c$. Thus, we now want to approximate the right factor. In the Naive Bayes approach, it is assumed that the feature values are independent of one another conditional on a fixed value of $C$. Then
$$P(x_1 = a_1, \dots, x_d = a_d \mid C = c) = \prod_{j=1}^{d} P(x_j = a_j \mid C = c),$$
so
$$P(C = c \mid \overline{X}) \propto P(C = c) \prod_{j=1}^{d} P(x_j = a_j \mid C = c).$$
And these terms are much easier to estimate: to estimate $P(x_j = a_j \mid C = c)$ we just take the fraction of training records with class $c$ and compute which fraction of them verifies $x_j = a_j$. This estimate is usually written as $\hat{P}(x_j = a_j \mid C = c)$.
This model is sometimes referred to as the Bernoulli model for Bayes classification when it is applied to categorical data with only two outcomes for each feature attribute.
In cases where more than two outcomes are possible for a feature variable, the model is referred to as the generalized Bernoulli model.
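A minimal sketch of the estimation and classification steps for categorical features, following the formulas above (plain counting with no smoothing; the function and variable names are only illustrative):

from collections import Counter, defaultdict

def train_naive_bayes(records, class_attr):
    # Estimate the priors P(C=c) and the conditionals P(x_j = a_j | C=c) by counting.
    priors = Counter(r[class_attr] for r in records)
    cond = defaultdict(Counter)                  # (class, attribute) -> counts of values
    for r in records:
        c = r[class_attr]
        for attr, val in r.items():
            if attr != class_attr:
                cond[(c, attr)][val] += 1
    return priors, cond, len(records)

def classify(x, priors, cond, n):
    # Return the class maximizing P(C=c) * prod_j P(x_j = a_j | C=c).
    best, best_score = None, -1.0
    for c, count_c in priors.items():
        score = count_c / n
        for attr, val in x.items():
            score *= cond[(c, attr)][val] / count_c
        if score > best_score:
            best, best_score = c, score
    return best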
Example 5.1.
With the following training data:
Age | Income | Student | Credit_Rating | Buys
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31...40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31...40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31...40 | medium | no | excellent | yes
31...40 | high | yes | fair | yes
>40 | medium | no | excellent | no
Classify with the Naive Bayes model the unseen record $X = (\text{Age} \leq 30, \text{Income} = \text{medium}, \text{Student} = \text{yes}, \text{Credit\_Rating} = \text{fair})$. In this case, we have $P(\text{yes}) = 9/14 \approx 0.643$ and $P(\text{no}) = 5/14 \approx 0.357$. And for the feature variables, we have to compute $P(x_j = a_j \mid \text{yes})$ and $P(x_j = a_j \mid \text{no})$:
- Age $\leq 30$: $2/9$ for yes, $3/5$ for no.
- Income medium: $4/9$ for yes, $2/5$ for no.
- Student yes: $6/9$ for yes, $1/5$ for no.
- Credit_rating fair: $6/9$ for yes, $2/5$ for no.
Then, we have $P(X \mid \text{yes})\, P(\text{yes}) \approx 0.044 \cdot 0.643 \approx 0.028$ and $P(X \mid \text{no})\, P(\text{no}) \approx 0.019 \cdot 0.357 \approx 0.007$. Thus, the model classifies $X$ with class 'yes'.
Some final comments
The advantages of the Bayes classifier are that it is easy to implement and can lead to fairly good results.
The drawback is that the main assumption of class-conditional independence is not very realistic, which may lead to a loss of accuracy, because dependencies exist among variables. To deal with this issue there is a more complex model: Bayesian Belief Networks.
6 Model evaluation and selection
With a given dataset, we can train multiple classification models and each of them will behave differently, many times without an intuitive explanation of the differences observed. Thus, it becomes crucial to establish means of comparison between different models, so it is possible to choose between several models trained with the same data.
The simplest classification measure is the accuracy, which indicates the portion of well classified records. Nonetheless, if we compute the accuracy using the training data, we could be encouraging overfitting to the training data, and maybe choose models that do not work well with unseen data. This is why it is usual to use a separate test set to compute comparison measures: the idea is that, given a dataset, we can divide it into training data and test data. Then, models are trained using the training data and evaluated with the test data, which has not been seen before. This makes the measures more reliable and the comparisons fairer.
There are several ways to extract training and test data from a dataset, which are detailed in Section 9.
6.1 Confusion Matrix
A confusion matrix is a visual and intuitive way to assess a classification algorithm. In its simplest form it is used to assess a binary classification model and it shows the following metrics:
- True Positives: count of records classified as True that are actually True.
- False Positives: count of records classified as True that were False in reality.
- True Negatives: count of records classified as False that were actually False.
- False Negatives: count of records classified as False that were True in reality.
In this case, the confusion matrix has the form:
 | Predicted True | Predicted False
Actual True | TP | FN
Actual False | FP | TN
In the more general case in which we have a $k$-class classifier, each cell $(i, j)$ contains the count $n_{ij}$ of records of actual class $c_i$ that were classified as class $c_j$:
 | Predicted $c_1$ | Predicted $c_2$ | ... | Predicted $c_k$
Actual $c_1$ | $n_{11}$ | $n_{12}$ | ... | $n_{1k}$
Actual $c_2$ | $n_{21}$ | $n_{22}$ | ... | $n_{2k}$
... | ... | ... | ... | ...
Actual $c_k$ | $n_{k1}$ | $n_{k2}$ | ... | $n_{kk}$
From these confusion matrices, we can derive some interesting measures, such as the accuracy, precision, recall, and F1 score.
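For the binary case, these measures can be computed directly from the four counts; a small sketch with the standard definitions (the variable names are illustrative):

def derived_measures(tp, fp, tn, fn):
    # Standard measures derived from a binary confusion matrix.
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of the predicted positives, how many are real
    recall    = tp / (tp + fn) if tp + fn else 0.0   # of the real positives, how many are found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}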
7 Ensemble methods: increasing accuracy
An ensemble method for classification is a composite model, made up of a combination of classifiers. The idea is that, given an unseen record, it can be classified using several models, which then reach a consensus to decide the final class of the record. Usually, a voting scheme is used, in which the most voted class is the one chosen for classification. This approach usually improves the accuracy of each component.
Let's analyze why this works! There are three primary components to the error of a classifier:
- Bias: every classifier has its own assumptions about the nature of the decision boundary between classes. When a classifier has high bias, it will make consistently incorrect predictions over records that lie near the incorrectly-modeled decision boundary.
Example 7.1.
Bias is shown below.
The real model has been generated using the blue line. The classification model has the assumption that the data can be classified using a straight line. As we can see, points near the boundary would be misclassified.
- Variance: random variations in the choices of the training data will lead to different models. This is closely related to overfitting. When a classifier has an overfitting tendency, it will make inconsistent predictions for the same test instance over different training data sets.
Example 7.2.
Variance is shown below.
In this case, the yellow-ish model happened to use few red points near the boundary, so it became a perfect blue classifier but a bad red classifier. The opposite happened to the blue-ish model. Note that these variations are only due to the random selection of the training points.
- Noise: the intrinsic errors in the target class labeling. As this is intrinsic, there is not much one can do about it. Therefore, we will focus on the former two sources of error.
In addition, bias and variance are often in a trade-off relationship: improving bias worsens variance, and vice-versa. Generally speaking, simplified assumptions about the decision boundary lead to greater bias but lower variance, while complex assumptions reduce bias but are harder to robustly estimate with limited data.
Ensemble analysis can often be used to reduce both the bias and variance of the classification process, because a combination of different simple models with high bias and little variance will reduce the bias as the assumptions will be combined to model more complex scenarios. On the other hand, a combination of complex models with low bias and high variance, will reduce the variance because decisions made by multiple models tend to be more consistent than those made by individual models.
Looking at the accuracy
Now, let's look at the accuracy of the ensemble model in comparison to its component models in the case of binary classification (for ease). Say the ensemble model is composed of $N$ models, each of them with accuracy $p_i$. Then, for the ensemble model to classify a new record correctly, at least half of the models need to classify the tuple correctly. Let's define the random variable $X$ as the 'number of classifiers that classify the tuple correctly'; then the ensemble accuracy is $P(X \geq \lceil N/2 \rceil)$, which we can decompose as a sum over the possible values of $X$.
Here, assuming independence of the different models, each term of this sum is a product of individual accuracies $p_i$ and errors $(1 - p_i)$, and so on. This yields a complex formula, but in the simplest case, in which all the accuracies are equal to $p$, $X$ follows a binomial distribution and we have
$$P\left(X \geq \lceil N/2 \rceil\right) = \sum_{k = \lceil N/2 \rceil}^{N} \binom{N}{k} p^{k} (1 - p)^{N - k}.$$
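Under these independence and equal-accuracy assumptions, the majority-vote accuracy can be evaluated numerically; a small sketch:

from math import comb

def ensemble_accuracy(p, n):
    # P(at least a strict majority of n independent classifiers, each with accuracy p, is correct).
    need = n // 2 + 1        # ties (possible only for even n) are counted as wrong
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

# For instance, ensemble_accuracy(0.8, 3) = 0.896, larger than the individual accuracy 0.8.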
Example 7.3.
Suppose an ensemble model with three components of equal accuracy $p$. Then, the accuracy of the ensemble model can be computed and compared with the individual accuracies: the accuracy is increased whenever the resulting polynomial in $p$ is positive. The discriminant of this polynomial is negative, so all its roots are complex and it does not cut the X axis. As the leading coefficient is positive, the polynomial is always positive, and we conclude that this 3-ensemble method always improves the accuracy of the individual models.
Example 7.4.
Suppose an ensemble model of $N = 3$ models with equal accuracy $p$ and majority voting. Then the accuracy of the ensemble model is $p_{ens} = p^{3} + 3p^{2}(1 - p) = 3p^{2} - 2p^{3}$, and when compared to the individual value $p$ we obtain the difference $3p^{2} - 2p^{3} - p = -p(2p - 1)(p - 1)$. The accuracy is increased whenever this polynomial is positive.
For $p \in [0, 1]$ it is positive everywhere except in the interval $(0, \frac{1}{2})$, in which it is negative. This means that the accuracy is improved whenever the individual accuracy is larger than $\frac{1}{2}$.
For example, if $p = 0.8$, then $p_{ens} = 0.896$, which is indeed an improvement.
There are several ways to make ensemble methods:
- Bagging: the data is bootstrapped several times, collecting for each model a training data set of approximately the same size as the original data set. Then, each model is trained with a different bootstrap sample.
This approach reduces the variance, because the differences owed to the sampling are averaged out by repeating the sampling. However, bagging does not improve the bias, because all the models used have the same assumptions.
- Boosting: a weight is associated with each training instance and the classifiers are trained with the use of these weights. When a classifier has finished its training, the weights of the records that the model misclassifies are increased, so the next model will be trained paying more attention to those records.
With this approach, the overall bias is reduced because each model focuses on those places where the past models tend to fail.
- Adaboost: a particular algorithm based on the boosting idea. Given a dataset $D$ of $d$ records with labels, and a number of rounds $N$:
- Set the initial weights to $w_i = 1/d$.
- j = 1
- While j <= N:
- Bootstrap $D$ to get a training set $D_j$, selecting tuples with probability proportional to their weights $w_i$.
- Train model $M_j$ with $D_j$.
- Compute the error rate $error(M_j)$ of $M_j$ using $D$.
- Update the weights:
- If a tuple is misclassified: increase its weight.
- If it is correctly classified: decrease its weight.
- j = j+1
The error rate with weights is computed as $error(M_j) = \sum_{i=1}^{d} w_i \cdot err(X_i)$, where $err(X_i)$ is 1 if the tuple $X_i$ is misclassified and 0 otherwise.
When the voting is performed, the votes are also weighted, the weight of the vote of model $M_j$ being $\log\frac{1 - error(M_j)}{error(M_j)}$.
- Random Forest: bagging does not work very well combined with decision trees, because the ID3 algorithm tends to generate similar/correlated trees. The idea here is to add randomness to the tree induction algorithm itself, as follows:
- Before each split, $m$ attributes are randomly selected out of the $d$ available attributes.
- The split attribute is selected from this group of $m$ attributes.
When $m$ is selected to be much smaller than $d$, the trees in the forest are highly independent, so the method is able to improve the accuracy of the individual trees. Note nonetheless that in this case the individual trees will perform worse than normally trained trees, because they have been weakened on purpose.
Part III Model validation and data preparation
8 Data preparation
The data preparation phase is a multistage process that comprises several individual steps, some or all of which may be used in a given application. These steps are:
- Feature extraction and portability: a feature is a characteristic of the data or derived from the data. For example, if we have a sensor measuring humidity, the level of humidity will be a feature directly present in the data; the difference between the humidity level at each measurement and the average humidity level is a derived feature.
Features with good semantic interpretability are more desirable because this makes things easier for the analyst to understand results. So, the process of selecting which features to take into account for further analysis is called feature extraction.
Data type portability refers to the process of transforming data into different formats. There can be several reasons for this: we may have several sources of data which we want to unify, or we may use an internal datatype that is not compatible with what the training algorithms expect.
- Data cleaning: missing, erroneous and inconsistent entries are treated. We can either remove them or estimate them via the process of imputation.
- Data reduction, selection and transformation: the size of the data is reduced through data subset selection, feature subset selection, or data transformation. This helps in two ways:
- The algorithms perform more efficiently in smaller datasets.
- The removal of irrelevant features or records improves the quality of the data mining process.
8.1 Feature extraction
Example 8.1.
Image feature extraction
Image data are represented as pixels. Nonetheless, we know that pixels are related to each other and that combinations of pixels carry information about what we are seeing in the image. This is not straightforward for a computer to understand, as the computer only 'sees' a matrix of triplets.
At a higher level, color histograms can be used to represent the features in different segments of an image.
Also, visual words are used to extract features from images. A visual word is a semantically rich representation of parts of an image.
Example 8.2.
Document feature extraction
Document data is often available in raw and unstructured form, and the data may contain rich linguistic relations between different entities.
One approach is to remove stop words, stem the data, and use bag-of-words representation.
Other methods use entity extraction to determine linguistic relationships.
Named-entity recognition is an important subtask of information extraction. It consists in locating and classifying atomic elements in text into predefined expressions of names of persons, organizations,...
8.2 Data Type Portability
- Numerical to categorical data: discretization
The process of discretization divides the range of the numeric attribute into $p$ ranges. Then, the attribute is assumed to contain $p$ different categorical labeled values, from 1 to $p$.
Variations within a range are not distinguishable after discretization (some information is lost).
One challenge is that data may be nonuniformly distributed across the different intervals. Thus, there are several ways to perform the division (a small sketch of these conversions is given after this list):
- Equi-width ranges: the interval is divided into subintervals of equal length.
- Equi-log ranges: the interval is divided in such a way that the log-length is constant: if we want to divide an interval into equi-log ranges $[a_i, b_i]$, we have to ensure that $\log(b_i) - \log(a_i)$ is the same for every range.
- Equi-depth ranges: the ranges are selected so that each range has an equal number of records.
- Categorical to numeric data: binarization
Suppose a categorical attribute with $r$ different values. Then, we can binarize it by creating $r$ binary attributes; a record will have all of them set to 0, except the one corresponding to its value, which will be set to 1. For example:
Name | Role:CEO | Role:Employee | Role:Director
Pedro | 1 | 0 | 0
- Text to numeric data: Latent Semantic Analysis (LSA)
LSA transforms the text collection to a nonsparse representation with lower dimensionality. After transformation, each document needs to be scaled to unit norm. This is necessary to ensure that documents of varying length are treated in a uniform way.
- Time series to Discrete Sequence Data: Symbolic Aggregate Approximation (SAX)
Two steps:
- Window-based averaging: the series is divided into windows of length $w$, and the average time-series value over each window is computed.
- Value-based discretization: the averaged time-series values are discretized into a smaller number of equi-depth intervals. Idea: ensure that each symbol has an approximately equal frequency in the time-series.
- Time series to Numeric Data:
- Discrete Wavelet Transform (DWT): converts the time series data to multidimensional data, as a set of coefficients that represent averaged differences between different portions of the series.
- Discrete Fourier Transform (DFT): similar, using Fourier series' theory.
- Discrete Sequence to Numeric Data:
Two steps:
- Convert the discrete sequence to a set of binary time series, with as many time series as the number of values the discrete sequence can take.
- Convert each of these time series into a multidimensional vector using the DWT. This combines the features from the different series, creating a single multidimensional record.
- Spatial to Numeric Data:
The approach is the same as the one used for time-series data, but now there are two contextual attributes instead of one, so the DWT has to be modified to be two-dimensional.
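As a small illustration of the numeric-to-categorical and categorical-to-numeric conversions above, here is a sketch of equi-width and equi-depth discretization and of binarization (one-hot encoding); the NumPy-based helper names are only illustrative.

import numpy as np

def equi_width_bins(values, p):
    # Discretize numeric values into p equal-width ranges, labeled 0..p-1.
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), p + 1)
    return np.digitize(values, edges[1:-1])

def equi_depth_bins(values, p):
    # Discretize numeric values into p ranges containing (roughly) equal numbers of records.
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, p + 1))
    return np.digitize(values, edges[1:-1])

def binarize(values):
    # One-hot encode a categorical column: one binary attribute per distinct value.
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]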
8.3 Data Cleaning
Data in the real world is
- incomplete: values for some attributes are missing. This can happen because some measurements were not always taken or because of human/computer errors.
- noisy: there are errors. This can happen because the instruments used to collect data are not working properly, because errors in the data transmission occur or because of human/computer errors.
- inconsistent: there are discrepancies between different attributes that are related. This can happen when data from different sources needs to be combined or when some calculated values are not updated after changing their source values.
- duplicate: there are duplicate values.
8.3.1 Handling Missing and Inconsistent entries
We can:
- Delete the entire record: this is a safe option, because we don't introduce bias, but it is usually not practical, because we might delete too many entries.
- Impute/estimate the missing values: we use the rest of the data to estimate the values that are missing or choose some constant based on some assumption. The problem with this approach is that one way or another we are introducing bias in the data.
- Change the mining algorithm: there are some algorithms that are developed to deal with missing values.
8.3.2 Handling Noisy entries
We can:
- Perform kernel smoothing: for numerical data, we can use a kernel function to smooth the data values.
- kNN smoother: replaces each value with the average of itself and its k-nearest neighbors.
- kernel average smoother: replace a value with the weighted average of itself and its neighbors in a fixed size window.
- Binning: it is also possible to sort the data and partition it into equally sized bins. Then, the data can be smoothed by the bin mean, median or boundary values (a small sketch is given after this list).
- Regression: smooth by fitting the data to a regression function.
- Change the mining algorithm: there are algorithms designed to tolerate noise.
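A small sketch of smoothing by bin means (sort the values, split them into equal-size bins, and replace each value by its bin mean); the helper name is illustrative.

import numpy as np

def smooth_by_bin_means(values, bin_size):
    # Sort the values, partition them into equal-size bins, replace each value by its bin mean.
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)                 # positions of the values in sorted order
    smoothed = np.empty_like(values)
    for start in range(0, len(values), bin_size):
        idx = order[start:start + bin_size]    # one bin of sorted positions
        smoothed[idx] = values[idx].mean()
    return smoothed

# smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3) replaces the bins by their means 7, 19, 25.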
8.4 Exploratory analysis
Exploratory analysis is the task of understanding the data, from the meaning of the features to their range of values or even their statistical distributions. There are many actions we can take to explore the data, such as counting nulls, searching for duplicates, computing some statistics such as the maximum, the minimum, the mean,... of the data, and many more.
8.4.1 Central tendency measures
Central tendency measures are one-number summaries that can be helpful:
- Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- Weighted mean: $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$.
- Trimmed mean: a mean calculated disregarding extreme values.
- Median: the middle value of the data.
- Mode: value that occurs most frequently in the data.
8.4.2 Symmetric and Skewed data
Using the mean, median and mode, we can understand in a soft way the distribution of the data:
- If the three values are very similar, then the distribution is quite symmetric, having these values in the center.
- If the order is mean < median < mode, then the distribution is skewed to the left.
- If the order is mode < median < mean, then the distribution is skewed to the right.
8.4.3 Measuring the dispersion
- Quartiles: Q1 (the 25th percentile) and Q3 (the 75th percentile).
- Inter-quartile range: $IQR = Q3 - Q1$.
- Five number summary: minimum, Q1, median, Q3, maximum.
- Boxplot: the median is marked and there is a box around it which ends at the quartiles. Outliers are plotted individually.
- Outlier: a value that lies outside the range $[Q1 - 1.5 \cdot IQR,\ Q3 + 1.5 \cdot IQR]$.
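These dispersion measures are easy to compute with NumPy; a small sketch, using the 1.5-IQR fences of the outlier definition above:

import numpy as np

def five_number_summary(values):
    # Minimum, Q1, median, Q3 and maximum, plus the values flagged as outliers by the 1.5-IQR rule.
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = values[(values < low) | (values > high)]
    return (values.min(), q1, median, q3, values.max()), outliers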
8.4.4 Comparing with the normal distribution
We can compute the mean and the standard deviation and check if the data behaves as a normal distribution:
- In $[\mu - \sigma, \mu + \sigma]$ there is about 68% of the data.
- In $[\mu - 2\sigma, \mu + 2\sigma]$ there is about 95% of the data.
- In $[\mu - 3\sigma, \mu + 3\sigma]$ there is about 99.7% of the data.
We can also perform the Kolmogorov-Smirnov test or the Shapiro-Wilk test, which are statistical tests that try to assess whether a distribution is normal.
8.5 Similarity and Distance
There are many data mining algorithms which use the notions of similarity or distance between two points. Usually, the selection of the distance function is an important decision before using an algorithm, because it will ultimately influence the results and their implications.
Definition 8.1.
The $L_p$-norm is a distance function defined by
$$Dist(\overline{X}, \overline{Y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^{p}\right)^{1/p}.$$
For $p = 2$ it is the well-known Euclidean distance.
For $p = 1$ it is called the Manhattan distance, because it is like traversing a grid made of rectangles, similar to the street map of Manhattan.
Definition 8.2.
The generalized Minkowski distance is defined by
$$Dist(\overline{X}, \overline{Y}) = \left(\sum_{i=1}^{d} \lambda_i\, |x_i - y_i|^{p}\right)^{1/p},$$
where the $\lambda_i$ are weights associated with the different dimensions.
Definition 8.3.
The edit distance is a distance defined over strings.
We have the operators:
- r: replace one character by another.
- i: insert one character.
- d: delete one character.
The edit distance between two strings $s_1$ and $s_2$ is the minimum number of operations needed to convert $s_1$ into $s_2$.
The recursive formula is
$$Edit(s_1, s_2) = \min\begin{cases} Edit(rest(s_1), rest(s_2)) + [s_1[1] \neq s_2[1]] \\ Edit(rest(s_1), s_2) + 1 \\ Edit(s_1, rest(s_2)) + 1 \end{cases}$$
with $Edit(s, \varepsilon) = Edit(\varepsilon, s) = |s|$, where $rest(s)$ is the string $s$ minus its first character.
m[0,0] = 0
for i=1 to |s1| do m[i,0] = i
for j=1 to |s2| do m[0,j] = j
for i=1 to |s1| do
    for j=1 to |s2| do
        m[i,j] = min{m[i-1,j-1] + (if s1[i] = s2[j] then 0 else 1),
                     m[i-1,j] + 1,
                     m[i,j-1] + 1
                    }
return m[|s1|,|s2|]
Algorithm 3:
DP_Edit(s1, s2)
9 Model evaluation
Once we have trained a model, we want to assess how well it performs. This way, we can compare different models and discuss, quantitatively, which of them is preferrable for our purposes. Nonetheless, this task is not easy, and there are both methodological and quantification issues to take into account:
- Methodological issues: associated with dividing the labeled data appropriately into training and test segments for evaluation. The choice of methodology has a direct impact on the evaluation process, such as underestimation or overestimation of classifier accuracy. Several approaches are possible: holdout, bootstrap and cross-validation.
- Quantification issues: associated with providing a numerical measure for the quality of the method after a specific methodology for evaluation has been selected.
9.1 Holdout
The labeled data is randomly divided into two disjoint sets, corresponding to the training and test data. The training data is used to feed the training algorithm and produce a model, whose performance is assessed using the test data.
The approach can be repeated several times with multiple samples to provide a final estimate.
- Problem: classes that are overrepresented in the training data are underrepresented in the test data. This can have a significant impact when the original class distribution is imbalanced. The error estimates are pessimistic.
9.2 Cross-Validation
The labeled data is divided into $m$ disjoint subsets of equal size. A typical choice of $m$ is around 10. One of the $m$ segments is used for testing, and the other $m - 1$ segments are used for training. This approach is repeated by selecting each of the $m$ different segments in the data as the test set.
The average accuracy over the different test sets is then reported.
The overall accuracy of the cross-validation procedure tends to be a highly representative, but pessimistic, estimate of model accuracy.
When $m$ is chosen to be $n$ (the number of labeled examples), $n - 1$ examples are used for training and one example is used for testing. This is called leave-one-out cross-validation. This approach is very expensive for large datasets.
Stratified cross-validation uses proportional representation of each class in the different folds and usually provides less pessimistic results.
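A minimal sketch of the m-fold cross-validation loop; the train and evaluate callables are placeholders for any learner and accuracy measure, and the NumPy-array data representation is an assumption of the example.

import numpy as np

def cross_validation_accuracy(X, y, m, train, evaluate, seed=0):
    # Average accuracy over m folds; train(X, y) returns a model,
    # evaluate(model, X, y) returns its accuracy on the given data.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), m)
    accuracies = []
    for i in range(m):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(m) if j != i])
        model = train(X[train_idx], y[train_idx])
        accuracies.append(evaluate(model, X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))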
9.3 Bootstrap
The labeled data is sampled uniformly with replacement to create a training dataset, which can contain duplicates. The labeled data of size $n$ is sampled $n$ times with replacement.
The probability that a particular data point is not included in a single sample is $1 - \frac{1}{n}$. Therefore, the probability that the point is not included in any of the $n$ samples is $\left(1 - \frac{1}{n}\right)^{n}$. For large values of $n$, this approximates $e^{-1} \approx 0.368$. Thus, the fraction of the labeled data points included at least once in the training dataset is approximately $1 - e^{-1} \approx 0.632$.
The overall accuracy is computed using the original set of full labeled data as the test examples.
The estimate is highly optimistic of the true classifier accuracy because of the large overlap between the training and test examples.
A better strategy is the leave-one-out bootstrap, in which the accuracy of each labeled instance is computed using the classifier performance on only the subset of the bootstrapped samples that do not contain the instance.
This approach provides a pessimistic accuracy estimate, $A_{loo}$, given by the mean value of the accuracy computed for each labeled instance.
The 0.632-bootstrap improves the accuracy estimate with a compromise approach. The average training-data accuracy $A_{train}$ over the bootstrapped samples is computed. This is a highly optimistic estimate. The overall accuracy is a weighted average of the leave-one-out accuracy and the training-data accuracy:
$$A = 0.632 \cdot A_{loo} + 0.368 \cdot A_{train}.$$
Part IV Clustering
Some applications require dividing the data into different groups that share some characteristics. The problem, many times, is that we don't know which characteristics are useful to characterize the data, or to what extent. The general (unsupervised) approach to tackle this problem is clustering.
Definition 9.1.
Clustering problem (Informal)
Given a set of data points, partition them into groups containing similar data points.
This definition is informal and general, but gives enough information to understand the problem, as well as enough freedom to tackle it from different perspectives.
10 Representative-Based Algorithms
These are the simplest of all clustering algorithms, as they directly use distances or similarities to cluster the data. They do not capture hierarchical relationships and use a set of representatives to cluster the data. The main insight is that the discovery of good clusters equates to the discovery of good representatives.
Definition 10.1.
Representative-Based general clustering problem
Given a data set $\mathcal{D}$ with $n$ data points in a $d$-dimensional space and a specified number of clusters $k$, the goal of a representative-based algorithm is to determine $k$ representatives $Y_1, \dots, Y_k$ such that the objective function
$$O = \sum_{i=1}^{n}\left[\min_{j}\ Dist(X_i, Y_j)\right]$$
is minimized, i.e., the sum of the distances of the different data points to their closest representatives needs to be minimized.
The generic $k$-representative approach is shown in Algorithm 4.
Algorithm 4:
Generic k-representative approach(Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Initialize Y = {Y_1,...,Y_k}  # Using heuristics
Initialize clusters C_1 = {},... C_k = {}
do
    # Assign step
    for(X in D):
        assign X to Y_j such that Dist(X,Y_j) = min_i Dist(X,Y_i)
        C_j.add(X)
    # Optimize step
    for all Clusters C_j:
        determine Y_j' such that sum_{X_i in C_j} Dist(X_i,Y_j') is minimized
        Y_j = Y_j'
while O = sum_{i=1}^{n} [min_j Dist(X_i,Y_j)] > eps
return {C_1, Y_1},...,{C_k, Y_k}
10.1 The k-Means algorithm
The $k$-means algorithm is a representative clustering method in which the distance used is the squared Euclidean distance (squared $L_2$-norm):
$$Dist(X_i, Y_j) = \lVert X_i - Y_j \rVert_2^2.$$
Thus, the objective function minimizes the sum of squared errors over the data points; this is called the SSE (Sum of Squared Errors).
Proposition 10.1.
The optimal representative $Y_j$ for each of the optimize iterative steps is the mean of the data points in cluster $C_j$.
Proof.
In the current step, we have a fixed clustering assignment from the last step, $C_1, \dots, C_k$. The overall clustering objective function is
$$O = \sum_{j=1}^{k}\sum_{X_i \in C_j} \lVert X_i - Y_j \rVert^2,$$
so its gradient with respect to each $Y_j$ is
$$\nabla_{Y_j} O = \sum_{X_i \in C_j} 2\,(Y_j - X_i).$$
Setting the gradient equal to 0 (for optimization purposes), we get $\sum_{X_i \in C_j}(Y_j - X_i) = 0$ or, equivalently,
$$Y_j = \frac{1}{|C_j|}\sum_{X_i \in C_j} X_i.$$
Example 10.1.
Apply the algorithm with to the dataset and using the seed and .
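A compact NumPy sketch of the k-means loop (assign step and optimize step as in Algorithm 4, with the squared Euclidean distance); the random initialization from the data points is one of the usual heuristics.

import numpy as np

def k_means(X, k, iters=100, seed=0):
    # X is an (n, d) array of data points; returns the representatives and the assignment.
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=k, replace=False)]          # initial representatives
    assignment = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign step: closest representative for every point (squared L2 distance)
        dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        assignment = dists.argmin(axis=1)
        # optimize step: each representative becomes the mean of its cluster
        new_Y = np.array([X[assignment == j].mean(axis=0) if np.any(assignment == j) else Y[j]
                          for j in range(k)])
        if np.allclose(new_Y, Y):
            break
        Y = new_Y
    return Y, assignment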
10.2 The k-Medians Algorithm
In this case the Manhattan distance ($L_1$-norm) is used:
$$Dist(X_i, Y_j) = \lVert X_i - Y_j \rVert_1.$$
Proposition 10.2.
The optimal representative $Y_j$ for each of the optimize iterative steps is the median of the data points along each dimension in cluster $C_j$.
Proof.
In this case, the objective function is
$$O = \sum_{j=1}^{k}\sum_{X_i \in C_j} \lVert X_i - Y_j \rVert_1.$$
Now, the $L_1$-norm is obtained by summing the absolute values in each dimension. The problem is that this function is not differentiable; nonetheless, it is differentiable almost everywhere. We can obtain the sub-gradient of $O$ with respect to the $r$-th component $y_{jr}$ of $Y_j$ as
$$\partial_{y_{jr}} O = \sum_{X_i \in C_j} \operatorname{sign}(y_{jr} - x_{ir}).$$
For this to equal 0, we need as many negative signs as positive signs: the median in each dimension achieves exactly this, as it has as many values to its left as to its right.
Example 10.2.
Repeat the example using the k-Medians algorithm.
10.3 The k-Medoids Algorithm
In this algorithm the representatives are always selected from the database $\mathcal{D}$, and this makes the structure of the algorithm different from the one we have seen before.
Reasons:
- This approach makes outlier handling easier than with $k$-means.
- It is sometimes difficult to compute the central representative of a complex data type (text or categorical data). The $k$-medoids algorithm can be defined for any data type in which we are able to define a proper distance function.
The algorithm is shown in Algorithm 5.
Algorithm 5:
k-medoids(Data D, int k, threshold eps) : Set of representatives Y and Clusters C
Initialize Y = {Y_1,...,Y_k}  # Using heuristics
Initialize clusters C_1 = {},... C_k = {}
do
    # Assign step
    for(X in D):
        assign X to Y_j such that Dist(X,Y_j) = min_i Dist(X,Y_i)
        C_j.add(X)
    # Optimize step
    Determine a pair X_i in D and Y_j in Y such that
        replacing Y_j with X_i leads to the
        greatest possible improvement in the objective function
    Perform the exchange between X_i and Y_j only if the improvement is positive
while O = sum_{i=1}^{n} [min_j Dist(X_i,Y_j)] > eps
      and there was an improvement in the current iteration
return {C_1, Y_1},...,{C_k, Y_k}
10.4 Practical issues
- The initialization of the representatives is not a trivial task. Normally they are selected randomly among the points of the dataset, and most times the initialization step does not change the outcome of the algorithm: representative-based algorithms are quite robust to this selection. Nonetheless, sometimes suboptimal clusters arise because of a bad choice of initial representatives.
- Outliers can make a detrimental impact on the algorithms. If one outlier is selected as initial representative it is possible to obtain a singleton cluster with no meaning for the application.
- The number of clusters, $k$, is difficult to determine using automated methods. As it is not known a priori, a common approach is to start with a larger value of $k$ than the one we think is correct. Some natural clusters may split, but we can merge some of them as a postprocessing step.
11 Grid and Density based Algorithms
One of the major problems with distance-based algorithms is that the shape of the clusters is implicitly enforced by the distance function. Thus, it can be hard to detect natural clusters of arbitrary shape.
Density-based algorithms are useful for this. The idea is to identify dense regions in the data, and use the positions of the different regions to determine the clusters.
11.1 Grid-based methods
The data is discretized into $p$ intervals per dimension, typically equi-width. If the data has $d$ dimensions, we will obtain $p^{d}$ hyper-cubes. These are the building blocks for the clusters.
A density threshold $\tau$ is used to determine the dense hyper-cubes. In most real data sets, an arbitrarily shaped cluster will result in multiple dense regions connected together by a side or a corner.
Two hyper-cubes are said to be adjacently connected if they share a side (sometimes corners are also considered).
Two hyper-cubes are said to be density connected if a path can be found from one to another containing only a sequence of adjacently connected grid regions.
The goal is to determine the density connected regions. Using a graph representation, the problem is equivalent to finding the connected components of the graph whose nodes are the dense hyper-cubes and in which an edge is defined between every pair of adjacent dense cubes.
Advantages
The number of clusters is not pre-defined, so we don't need to bother with the estimation of $k$.
Disadvantages
We have to define $p$ and $\tau$, which is not easy.
- If $p$ (the number of ranges) is too small, the data points from multiple clusters will be present in the same hyper-cube. We will obtain undesired merged clusters.
- If $p$ is too large, there will be many empty grid cells, so we may split a natural cluster. It will also be computationally expensive.
- The choice of $\tau$ has similar consequences.
Also, if the clusters have different densities, it is even more difficult to determine $p$ and $\tau$, because each cluster 'asks' for different values.
The generic algorithm is as follows:
Discretize each dimension into p ranges
Determine grid cells at density level tau
Create graph in which dense grids are connected if they are adjacent
Determine connected components of the graph
return points in each connected component as a cluster
Algorithm 6: GenericGrid(Data D, Ranges p, Density tau) : clusters C
11.2 DBSCAN
The idea behind DBSCAN is similar to the one we have seen, but density is considered at a pointwise level:
The density of a data point is defined as the number of points that lie within a radius $\epsilon$ of it, i.e., its neighbourhood of radius $\epsilon$. The densities are used to classify the points:
- Core point: its neighbourhood contains at least $\tau$ points.
- Border point: its neighbourhood contains fewer than $\tau$ points, but it contains one or more core points.
- Noise point: any other case.
And we define some relations between points:
- $X$ and $Y$ are directly density reachable if $X$ is a core point and $Y$ is in the neighbourhood of $X$.
- $X$ and $Y$ are density reachable if $X$ is a core point and there exists a chain of core points $X = X_1, \dots, X_m$ such that $X_{i+1}$ is directly density reachable from $X_i$ for $i = 1, \dots, m-1$, and $Y$ is in the neighbourhood of $X_m$.
- $X$ and $Y$ are density connected if both $X$ and $Y$ are density reachable from some point $Z$.
After the points have been classified, a connectivity graph is constructed, and each cluster is a maximal set of points that are all reachable from one another under any of these definitions.
Now, we identify the connected components of the graph, which are the clusters.
The detailed algorithm is as follows:
Clusters = {}
for each unvisited point P in D
    Neighbourhood = regionQuery(P, eps)
    if sizeof(Neighbourhood) < tau
        mark P as visited          # P is (for now) treated as noise
    else
        C = next cluster
        expandCluster(P, Neighbourhood, C, eps, tau)
        if C not in Clusters
            Clusters.add(C)
return Clusters

function expandCluster(P, Neighbourhood, C, eps, tau)
    mark P as visited
    C.add(P)
    for Q in Neighbourhood
        if Q not visited
            mark Q as visited
            Neighbourhood_Q = regionQuery(Q, eps)
            if sizeof(Neighbourhood_Q) >= tau
                Neighbourhood.addAll(Neighbourhood_Q)
        if Q is not in any cluster
            C.add(Q)
Advantages
This method is not very different from the grid-based graph method, and it can also discover clusters of any shape, without needing to know the number of clusters in advance.
Disadvantages
Again, determining the correct values for $\epsilon$ and $\tau$ is a complex task. Also, the existence of clusters with different densities makes it even harder.
The major contribution to the time complexity is finding the neighbours of each point: $O(n^2)$.
In some special cases, a spatial index can reduce it to $O(n \log n)$.
11.2.1 Progressive DBSCAN
The idea is, keeping the same value for $\tau$, to apply DBSCAN in a progressive way, increasing the value of $\epsilon$.
- Start with a small $\epsilon$ to find dense clusters.
- Iteratively relax the $\epsilon$ value to find less dense clusters.
- After every iteration, the points that already belong to a cluster are removed from the dataset.
11.3 DENCLUE
The DENCLUE algorithm is based on kernel-density estimation, which can be used to create a smooth profile of the density distribution, by defining the density at coordinate $X$ as
$$f(X) = \frac{1}{n}\sum_{i=1}^{n} K(X - X_i),$$
where $K(\cdot)$ is the kernel function and $X_1, \dots, X_n$ are the different data points. A commonly used kernel function is the Gaussian kernel:
$$K(X - X_i) \propto e^{-\frac{\lVert X - X_i \rVert^2}{2h^2}}.$$
The effect of this operation is to replace each discrete data point with a smooth bump, and the density at each point is the sum of all these bumps.
Example 11.1.
A visual example of a kernel smoothing:
Once the density has been smoothed, the goal is to determine clusters by using a density threshold $\tau$ that intersects the density profile. Two examples showing how the choice of $\tau$ affects the result are shown in Example 11.2.
Example 11.2.
In the previous example, if we select a low density threshold $\tau$, only one cluster is obtained. In contrast, if we choose a higher value of $\tau$, two clusters are obtained.
12 Probabilistic Model-Based Algorithms
Until now, all the models described are hard clustering algorithms, meaning each data point is assigned to a particular cluster. Probabilistic model-based algorithms are soft algorithms, in which each data point may have a nonzero assignment probability to more than one cluster.
12.1 Fuzzy sets and clusters
A fuzzy cluster $C_j$ is a fuzzy set over the data points. For each data point $o_i$, $w_{ij}$ represents the probability that $o_i$ is in cluster $C_j$; $w_{ij}$ can be called the degree of membership of object $o_i$ to cluster $C_j$.
Formally, given a set of $n$ objects $o_1, \dots, o_n$, a fuzzy clustering of $k$ fuzzy clusters $C_1, \dots, C_k$ can be represented using a partition matrix $M = [w_{ij}]$, whose entries should satisfy three conditions:
- $w_{ij} \in [0, 1]$ for all $i, j$. From the definition of a fuzzy cluster.
- $\sum_{j=1}^{k} w_{ij} = 1$ for each object $o_i$. The sum of all membership probabilities of an object is 1.
- $0 < \sum_{i=1}^{n} w_{ij} < n$ for each cluster $C_j$. There is no empty cluster.
12.2 Mixture model
The underlying assumption of a mixture-based generative model is that the data was generated from a mixture of $k$ distributions $G_1, \dots, G_k$. Each of them represents a cluster and is called a mixture component. The data points $X_i$ are generated by this model as follows:
- Select a mixture component with prior probability $\alpha_j = P(G_j)$. Say $G_r$ is selected.
- Generate a data point from $G_r$.
This generative model is denoted by $\mathcal{M}$. We don't know the prior probabilities $\alpha_j$ nor the parameters of the distributions $G_j$ in advance. The distributions are often assumed to be Gaussian, so we need to estimate the parameters of the distributions in such a way that the overall data has a maximum likelihood of being generated by the model.
Consider a set of probabilistic clusters $C_1, \dots, C_k$ with probability density functions $f_1, \dots, f_k$, respectively, and prior probabilities $\alpha_1, \dots, \alpha_k$. The probability of an object $o$ being generated by cluster $C_j$ is $\alpha_j f_j(o)$, and the probability of $o$ being generated by the set of clusters is $P(o) = \sum_{j=1}^{k} \alpha_j f_j(o)$. As objects are assumed to be independently generated, for a data set $D = \{o_1, \dots, o_n\}$ the probability that $D$ is generated by the set of clusters $\mathcal{C}$ is
$$P(D \mid \mathcal{C}) = \prod_{i=1}^{n} \sum_{j=1}^{k} \alpha_j f_j(o_i).$$
Now, we want to estimate the parameters of $\mathcal{C}$ from $D$ so that this likelihood is maximized.
If we use the assumption that the underlying distributions are Gaussian, then the probability density function of each cluster $C_j$, centered at $\mu_j$ with standard deviation $\sigma_j$, is
$$f_j(o) = \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(o - \mu_j)^2}{2\sigma_j^2}}.$$
And if we assume all clusters have the same prior probability $\alpha_j = 1/k$, then $P(o_i \mid \mathcal{C}) = \frac{1}{k}\sum_{j=1}^{k} f_j(o_i)$. Thus, our objective is to maximize
$$P(D \mid \mathcal{C}) = \prod_{i=1}^{n} \frac{1}{k}\sum_{j=1}^{k} f_j(o_i).$$
This is achieved with the expectation-maximization (EM) algorithm. The EM algorithm is a framework to approach maximum likelihood estimates of parameters in statistical models. It consists of two steps:
- E-step: assigns objects to clusters according to the current fuzzy clustering or the current parameters of the probabilistic clusters.
- M-step: finds the new clustering or parameters that minimize the SSE or maximize the expected likelihood.
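The following is a minimal numpy sketch of EM for a 1-D Gaussian mixture with equal priors 1/k, matching the simplifying assumptions above. The initialization and the fixed number of iterations are arbitrary illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def em_gaussian_mixture(data, k=2, iters=50):
    data = np.asarray(data, dtype=float)
    centers = np.random.choice(data, size=k, replace=False)    # initial means
    sigmas = np.full(k, data.std() + 1e-9)                      # initial std devs
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point (soft assignment)
        dens = np.array([np.exp(-(data - c) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
                         for c, s in zip(centers, sigmas)])     # shape (k, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate means and std devs from the soft assignments
        weights = resp.sum(axis=1)
        centers = (resp @ data) / weights
        sigmas = np.sqrt((resp * (data - centers[:, None]) ** 2).sum(axis=1) / weights) + 1e-9
    return centers, sigmas

# Example usage on synthetic data drawn from two Gaussians:
# data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
# print(em_gaussian_mixture(data, k=2))
```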
Example 12.1.
EM algorithm example.
12.3 Evaluating fuzzy clusters
If $c_1,\dots,c_k$ are the centers of the $k$ clusters, we define the sum of squared errors (SSE) for a point $o_i$ as
\[ SSE(o_i) = \sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2, \]
where $p\ge 1$ controls the influence of the degrees of membership. For a cluster $C_j$, we have its SSE as
\[ SSE(C_j) = \sum_{i=1}^{n} w_{ij}^{p}\, dist(o_i, c_j)^2. \]
Finally, the SSE of the whole clustering is
\[ SSE(\mathcal{C}) = \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2. \]
12.4 Cluster quality measures
A good clustering method will produce high-quality clusters, i.e. clusters with the following characteristics:
- High intra-class similarity: cohesive within clusters. This means that the objects inside the clusters are similar to each other.
- Low inter-class similarity: distinctive between clusters. This means that the objects from different clusters are different.
The quality of the clustering method depends on the similarity measure used by the method, its implementation and its ability to discover the hidden patterns in the data.
Some examples of quality measures are:
- Sum of squared distances to centroids: the squared distance between the representative $\bar{Y}_j$ of each cluster and every point in that cluster, summed over all clusters: $\sum_{j}\sum_{X_i\in C_j} dist(X_i,\bar{Y}_j)^2$. This measure is suitable for representative-based methods, but it favors clusters that suit the underlying distance function used.
- Intracluster to intercluster distance ratio: we sample $r$ pairs of points from the dataset. Let $P$ be the pairs such that both points are in the same cluster, and $Q$ the pairs whose points are in different clusters. Then the ratio is $Intra/Inter$, where
\[ Intra = \frac{1}{|P|}\sum_{(X_i,X_j)\in P} dist(X_i,X_j), \qquad Inter = \frac{1}{|Q|}\sum_{(X_i,X_j)\in Q} dist(X_i,X_j). \]
Small values of this ratio indicate a better clustering.
- Silhouette coefficient: let $Davg^{in}_i$ denote the average distance between the point $X_i$ and the rest of the points in its own cluster. Let $Dmin^{out}_i$ denote the minimum, over the other clusters, of the average distance between $X_i$ and the points of that cluster. Then:
\[ S_i = \frac{Dmin^{out}_i - Davg^{in}_i}{\max\{Dmin^{out}_i,\, Davg^{in}_i\}}, \]
and the overall coefficient is the average of the $S_i$. It follows that $S_i\in[-1,1]$, where large positive values indicate highly separated clusters (the distance to other clusters is high) and negative values are indicative of some level of mixing of data points from different clusters.
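A minimal numpy sketch of the silhouette coefficient as defined above is shown next. It assumes numeric features, Euclidean distance and at least two clusters; the brute-force pairwise distance matrix is only meant for small datasets, and the function name is hypothetical.

```python
import numpy as np

def silhouette(X, labels):
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False                                            # exclude the point itself
        d_in = dist[i, same].mean() if same.any() else 0.0         # Davg_in
        d_out = min(dist[i, labels == c].mean()                    # Dmin_out (needs >= 2 clusters)
                    for c in set(labels.tolist()) if c != labels[i])
        scores.append((d_out - d_in) / max(d_out, d_in))
    return float(np.mean(scores))

# Example (hypothetical data and labels):
# print(silhouette(np.random.rand(20, 2), np.random.randint(0, 2, 20)))
```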
Part V Frequent pattern and association rule mining
13 Frequent Itemset Mining
Association pattern mining is usually defined in the context of supermarket data containing sets of items bought by customers, which are referred to as transactions. The goal is to determine associations between groups of items bought by customers. The discovered sets of items are referred to as frequent itemsets.
These frequent itemsets can then be used to generate association rules of the form $X \Rightarrow Y$, where $X$ and $Y$ are sets of items. The meaning of this is that we discovered that when some customer buys $X$, it is likely that the same customer is going to/would like to buy $Y$.
We have to be careful, nonetheless, because the raw frequency of a pattern is not the same as the statistical significance of the underlying correlations. This is why numerous models for frequent pattern mining have been proposed that are based on statistical significance.
Example 13.1.
An intuitive example is that when someone buys bread, cheese and yogurt, it is probably the case that they will also buy milk and eggs:
13.1 The model
The problem of association pattern mining is defined on unordered set-wise data.
The database $\mathcal{T}$ contains $n$ transactions, $T_1,\dots,T_n$.
Each transaction $T_i$ is a subset of the set of all items $U$. A transaction can be represented as a multidimensional record of dimensionality $d=|U|$, where the values are binary: position $j$ is 1 if item $j$ belongs to the transaction and 0 otherwise.
The universe of items $U$ is very large compared to the typical number of items in each transaction.
Definition 13.1.
An itemset is a set of items, $I\subseteq U$.
A $k$-itemset is an itemset with $k$ items.
The support of an itemset $I$, $sup(I)$, is the fraction of the transactions in the database that contain $I$ as a subset.
Definition 13.2.
The frequent itemset mining problem is defined as follows:
Given a set of transactions $\mathcal{T}=\{T_1,\dots,T_n\}$, where each transaction $T_i$ is a subset of items from $U$, determine all itemsets $I$ that occur as a subset of at least a predefined fraction minsup of the transactions in $\mathcal{T}$.
The predefined fraction minsup is called minimum support.
The unique identifier of a transaction is referred to as transaction identifier (tid).
Example 13.2.
A very simple example:
Tid | Transaction | Binary Representation
1 | {Shirt, Trousers} | 110
2 | {Shirt, Jacket} | 101
3 | {Shirt, Trousers, Jacket} | 111
In this case, the possible itemsets and their respective supports are:
Itemset | Support
{Shirt} | 1
{Trousers} | 2/3
{Jacket} | 2/3
{Shirt, Trousers} | 2/3
{Shirt, Jacket} | 2/3
{Trousers, Jacket} | 1/3
{Shirt, Trousers, Jacket} | 1/3
If we select a minsup of at most 1/3, all possible itemsets would be selected.
If we select a minsup greater than 2/3, only {Shirt} would be selected.
Now, let's think about how many possible itemsets there are: if an itemset is a subset of $U$, then there are $2^{|U|}$ possible itemsets. This means that computing all their supports as in the previous example would take an amount of time exponential in the number of items. Not only this, but the database also needs to be accessed for counting, comparing,... So it is easy to see how this problem rapidly becomes intractable. Then, it is compulsory to find better ways to perform this counting and comparing, or to be able to discard itemsets even before counting. For this, there are some very convenient properties of itemsets:
Property 1: Support Monotonicity Property
The support of every subset $J\subseteq I$ is greater than or equal to the support of $I$:
\[ sup(J) \ge sup(I), \qquad \forall J\subseteq I. \]
This is because every transaction that contains the itemset $I$ also contains all of its subsets. As the itemset $I$ appears in a fraction $sup(I)$ of the total transactions, each of its subsets also appears at least in those same transactions.
Property 2: Downward Closure Property
Every subset of a frequent itemset is also frequent.
This is a natural implication of the previous property, and it is very useful: when we discover a frequent itemset, we don't need to check its subsets because they are already guaranteed to be frequent, too.
Definition 13.3.
A frequent itemset is maximal at a given minimum support level if it is frequent, and no superset of it is also frequent.
The possible itemsets given a set of items can be conceptually arranged in the form of a lattice of itemsets, which contains one node for each of the sets drawn from the universe of items. An edge exists between a pair of nodes if the corresponding sets differ by exactly one item. The lattice represents the search space of frequent patterns and it is separated into frequent and infrequent itemsets by a border.
Example 13.3.
A lattice of itemsets over 4 items.
13.2 Association rule generation framework
Definition 13.4.
Let $X, Y$ be two itemsets. The confidence of the rule $X \Rightarrow Y$, $conf(X \Rightarrow Y)$, is the conditional probability of $Y$ occurring in a transaction, given that the transaction contains $X$:
\[ conf(X \Rightarrow Y) = \frac{sup(X\cup Y)}{sup(X)}. \]
$X$ is called the antecedent of the rule and $Y$ is the consequent.
Definition 13.5.
Let $X, Y$ be two itemsets. The rule $X \Rightarrow Y$ is said to be an association rule at a minimum support of minsup and a minimum confidence of minconf if it satisfies:
- The support of the itemset $X\cup Y$ is at least minsup.
- The confidence of the rule $X \Rightarrow Y$ is at least minconf.
Here, the first criterion ensures that there are enough transactions to believe that the rule has statistical relevance. The second criterion ensures that the rule is strong enough in terms of conditional probabilities.
The overall procedure for association rule generation uses two phases:
- The frequent itemsets are generated at the minimum support of minsup.
- The association rules are generated from the frequent itemsets at the minimum confidence level of minconf.
The first phase is more computationally intensive, so we are focusing on it from now on.
13.3 Frequent itemset mining algorithms
13.3.1 Brute Force Algorithms
For a universe of items $U$, there are $2^{|U|}-1$ distinct subsets, excluding the empty set. A naïve idea would be to generate all these candidate itemsets and count their support against the transaction database $\mathcal{T}$.
Definition 13.6.
A candidate itemset is an itemset that may be frequent, so it needs to be checked.
Now, we can verify the candidates against the transaction database by support counting, checking whether a given itemset is a subset of each transaction $T_i\in\mathcal{T}$.
This approach is likely to be impractical when the universe of items $U$ is large.
A little tweak
The brute-force approach can be made faster by observing that no $(k+1)$-patterns are frequent if no $k$-patterns are frequent (this follows from the downward closure property). Thus, we can enumerate and count the patterns in increasing length.
For sparse transaction databases, the length $l$ of the largest frequent itemset is usually small compared to $|U|$; at that point one can terminate. This approach is orders of magnitude faster, but its computational complexity is still not satisfactory for large values of $l$.
Ideas to apply
Better algorithms can be developed by using one or more of the following approaches:
- Reduce the size of the explored search space by pruning candidate itemsets using tricks.
- Count the support of each candidate more efficiently by pruning transactions that are known to be irrelevant.
- Use compact data structures to represent either the candidates or the transaction database, supporting efficient counting.
13.3.2 The Apriori Algorithm
The Apriori algorithm uses the downward closure property to prune candidates. If an itemset is infrequent, then all its supersets are also infrequent, so we don't need to count them.
The Apriori algorithm works as follows:
- Count the support of the individual items to generate frequent 1-itemsets.
- Combine the frequent 1-itemsets to generate candidate 2-itemsets.
- Count the support of the candidate 2-itemsets to generate frequent 2-itemsets.
- ...
In general:
- Combine the frequent $k$-itemsets to generate candidate $(k+1)$-itemsets.
- Prune candidate $(k+1)$-itemsets which have some subset that is not frequent.
- Count the support of the candidate $(k+1)$-itemsets to generate the frequent $(k+1)$-itemsets.
- Repeat until there are no bigger frequent itemsets.
The algorithm is detailed in Algorithm 8.
k = 1
F1 = {Frequent 1-itemsets}
while Fk is not empty do
Generate C(k+1) by joining itemset-pairs of Fk
Prune itemsets from C(k+1) that violate the downward closure property
Determine F(k+1) by support counting (C(k+1),T)
k = k+1
end
return Union(F(i) for all i=1..k)
Algorithm 8:
Apriori(Transactions T, Minimum Support minsup)
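The following is a minimal Python sketch of Algorithm 8. It is an illustrative implementation under simple assumptions (transactions given as sets, minsup given as an absolute count), not the course's reference code.

```python
from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    # F1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    support = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    frequent = {s: c for s, c in support.items() if c >= minsup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates
        keys = list(frequent)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Support counting against the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Example on the database of the example below, with minsup = 2 (as a count):
# db = [{'A','B','E'}, {'B','D'}, {'B','C'}, {'A','B','D'}, {'A','C'},
#       {'B','C'}, {'A','C'}, {'A','B','C','E'}, {'A','B','C'}]
# print(apriori(db, minsup=2))
```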
Example 13.4.
We are going to manually run the Apriori algorithm for a simple database using minsup = 2 as a count (the minsup can also be indicated as a fraction, instead of as a count). The database is
Database:
TID | Transaction
1 | A,B,E
2 | B,D
3 | B,C
4 | A,B,D
5 | A,C
6 | B,C
7 | A,C
8 | A,B,C,E
9 | A,B,C
Let's compute F1 directly:
F1:
Itemset | Count
A | 6
B | 7
C | 6
D | 2
E | 2
From this, we see that all 1-itemsets are frequent, so we generate all possible 2-itemsets as C2 and count them:
C2:
Itemset | Count
A,B | 4
A,C | 4
A,D | 1
A,E | 2
B,C | 4
B,D | 2
B,E | 2
C,D | 0
C,E | 1
D,E | 0
F2:
Itemset | Count
A,B | 4
A,C | 4
A,E | 2
B,C | 4
B,D | 2
B,E | 2
We have removed the infrequent 2-itemsets ({A,D}, {C,D}, {C,E} and {D,E}) and generated F2. Now we can generate C3 by combining itemsets from F2. For this, we need to search for itemsets that share all values but one. For example: {A,B} and {A,C} are combined to obtain {A,B,C}. After getting all possibilities, we find C3 and count:
C3:
Itemset | Count
A,B,C | 2
A,B,D | pruned
A,B,E | 2
A,C,D | pruned
A,C,E | pruned
B,C,D | pruned
B,C,E | pruned
B,D,E | pruned
F3:
Itemset | Count
A,B,C | 2
A,B,E | 2
The pruned itemsets are discarded without counting because they contain some of the infrequent 2-itemsets found in C2. The rest are still frequent, so we continue with the process. We combine the only two itemsets left to get {A,B,C,E} as C4. Note, nonetheless, that {C,E} is an infrequent itemset of C2, so we prune it and we are in fact done.
The frequent itemsets at minsup = 2 are those in F1, F2 and F3.
Limits of Apriori
- Apriori makes use of the downward closure property (sometimes called apriori property), which is nice.
- The lattice is traversed in a general-to-specific way; a bi-directional traversal would also be possible.
- It is done in a breadth-first manner; it might be done in a depth-first way.
- The generate-and-test strategy:
- Generate is for the $(k+1)$-candidates.
- Test is the counting, in which we traverse the whole database $\mathcal{T}$.
Apriori improved with tricks
Algorithm 9 details an improved version of the Apriori algorithm.
k = 1
F1 = {Frequent 1-itemsets}
while Fk is not empty do
    C(k+1) = generate(Fk)
    for tran in T do
        C(tran) = subset(C(k+1), tran)
        for cand in C(tran) do
            cand.count++
        end
    end
    F(k+1) = {cand in C(k+1) | cand.count >= minsup}
    k = k+1
end
return Union(F(i) for all i=1..k)

procedure generate(Fk)
    foreach itemset1 in Fk
        foreach itemset2 in Fk
            if itemset1[1]=itemset2[1] and ... and itemset1[k-1]=itemset2[k-1] and itemset1[k] < itemset2[k]
                c = itemset1 join itemset2
                if has_infreq_subsets(c, Fk)
                    continue
                else
                    add c to C(k+1)
                end
            end
        end
    end
    return C(k+1)

procedure has_infreq_subsets(c, Fk)
    foreach k-subset s of c
        if s not in Fk
            return TRUE
        end
    end
    return FALSE
Algorithm 9: Apriori_improved(Transactions T, Minimum support minsup)
More Apriori Tricks
- Transaction reduction: a transaction that does not contain any frequent $k$-itemset cannot contain any frequent $(k+1)$-itemset either, so it can be removed to make later counting faster.
- Partitioning: divide the database into a set of disjoint partitions. Find all frequent itemsets in every partition. A locally frequent itemset may not be frequent for the whole database, but a globally frequent itemset must be frequent in at least one partition. The best thing about this approach is that it allows for parallel processing, and the partition size can be selected so that each partition fits in memory.
- Sampling: perform Apriori on a random sample of the data. Rule accuracy is harmed, but efficiency is improved.
13.3.3 FP-Growth
Even though Apriori greatly improves the efficiency of the solution of the association pattern mining problem in comparison to the brute-force approach, we have seen that it can still suffer from inefficiencies when the database is big. Particularly, counting is very costly, and when there are lots of items we will need to count many times. There is an improved solution for the problem: Frequent Pattern Growth (FP-Growth), which:
- Transforms the database into a compressed data structure called FP-Tree, which retains the frequent itemset information.
- Mines the FP-Tree for frequent itemsets by:
- Dividing it into a set of conditional databases, the conditional pattern bases, each associated with one frequent item.
- Mining each of these pattern bases separately to generate the frequent itemsets, without the need to re-count the original database.
Definition 13.7.
An FP-Tree is a trie (prefix tree) data structure, which acts as a compressed representation of a conditional database.
- The path from the root to a leaf represents a repeated sub-transaction (frequent pattern) in the database.
- The path from the root to an internal node represents either a frequent pattern or a prefix.
- Each node is associated with a count, which is the number of transactions in the database containing the path to this node.
- The items are sorted from the most frequent to the least frequent to leverage prefix-based compression.
The algorithm for FP-Growth is detailed in Algorithm 10.
if FPT is a single path
determine all combinations C of nodes on the path, report Union(C,P) as frequent
else
foreach item i in FPT do
report Pi = Union(i,P) as frequent
use pointers to extract conditional prefix paths from FPT containing i
readjust counts of prefix paths
remove i
remove infrequent items from prefix paths
reconstruct FPTi
if FPTi not empty
FP-Growth(FPTi, minsup, Pi)
end
end
Algorithm 10:
FP-Growth(FP-Tree FPT, Minimum Support minsup, Current Suffix P)
Basically, what FP-Growth does is recursively find all frequent itemsets ending with a particular suffix by splitting the problem into smaller subproblems.
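A minimal sketch of the FP-Tree construction (the first phase of FP-Growth) follows. The node structure and field names are illustrative assumptions, and the recursive mining phase is omitted for brevity.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                       # item -> FPNode

def build_fptree(transactions, minsup):
    # 1. Count items and keep only the frequent ones, ordered by descending count
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    order = {i: c for i, c in counts.items() if c >= minsup}
    root = FPNode(None, None)
    header = defaultdict(list)                   # item -> list of nodes (node links)
    # 2. Insert each transaction, reordered, reusing shared prefixes
    for t in transactions:
        items = sorted((i for i in t if i in order), key=lambda i: (-order[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Example with the database of the example below and minsup = 2:
# db = [{'A','B','E'}, {'B','D'}, {'B','C'}, {'A','B','D'}, {'A','C'},
#       {'B','C'}, {'A','C'}, {'A','B','C','E'}, {'A','B','C'}]
# tree, header = build_fptree(db, minsup=2)
```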
Example 13.5.
Let's use FP-Growth on the previous example database:
Database:
TID | Transaction
1 | A,B,E
2 | B,D
3 | B,C
4 | A,B,D
5 | A,C
6 | B,C
7 | A,C
8 | A,B,C,E
9 | A,B,C
First, we need to construct the FP-Tree. For this purpose, we compute the frequent 1-itemsets and reorder them in the count order:
F1:
Itemset | Count
A | 6
B | 7
C | 6
D | 2
E | 2
F1 (reordered by count):
Itemset | Count
B | 7
A | 6
C | 6
D | 2
E | 2
Now we reorder the items within each transaction of the database following this same ordering:
Database (reordered):
TID | Transaction
1 | B,A,E
2 | B,D
3 | B,C
4 | B,A,D
5 | A,C
6 | B,C
7 | A,C
8 | B,A,C,E
9 | B,A,C
And now we construct the FP-Tree. We start with the root labeled as NULL, and then add the transactions one by one, reusing the prefixes when we can, and keeping a counter for each node. Every time we traverse a node, we increase the counter. The first transaction would be entered as:
The second one:
The third one:
And so on... Until all transactions are entered in the tree:
Now, we connect each item in the ordered F1 to one node in the tree corresponding to the same item, and all nodes carrying the same item are linked together, too. Like this:
And now we proceed with the algorithm. We start with the item that comes last in the ordering, E.
- For item E, the conditional pattern base contains two itemsets: {B,A : 1} and {B,A,C : 1}. Thus, it follows that we have B:2 and A:2 (C is infrequent within this base), so the conditional FP-Tree is the single path B:2 → A:2.
Thus, the generated frequent patterns are: {B,E : 2}, {A,E : 2} and {B,A,E : 2}.
- For item D, the conditional pattern base also contains two itemsets: {B,A : 1} and {B : 1}. So now we have B:2 (A is infrequent within this base) and the conditional FP-Tree is the single node B:2.
And the generated frequent patterns are {B,D : 2}.
- For item A, the conditional pattern base contains one itemset: {B : 4}. The conditional FP-Tree ends up being the single node B:4.
And the generated frequent patterns are {B,A : 4}.
- For item C, the conditional pattern base contains three itemsets: {B,A : 2}, {B : 2} and {A : 2}. So now we have B:4 and A:4, but this time the conditional FP-Tree is not a single path: it ends up being the branch B:4 → A:2 together with a separate branch A:2.
And the generated frequent patterns are {B,C : 4}, {A,C : 4} and {B,A,C : 2}.
And that's it!
Why does FP-Growth outperform Apriori?
- FP-Growth scans the database only twice (once to count the items and once to build the tree). This fact by itself is already a huge improvement, because Apriori needs one pass over the database per candidate length.
- Also, Apriori indeed needs to create the candidates, which can also be expensive. FP-Growth, on the other hand, does not have a generation step, because the frequent itemsets are obtained naturally.
- In relation to the last point, Apriori creates candidates which end up being infrequent, while this does not happen with FP-Growth.
- The space complexity of FP-Growth is smaller because we don't need to store the candidates, only the FP-Tree, whose size is bounded by the total number of item occurrences in the transactions. Also, the database is not needed after the FP-Tree is computed, so we have more memory to use for the tree itself.
13.4 Mining Association Rules
Once the frequent itemsets have been found, it is time to obtain the association rules, which are the goal we were aiming at from the beginning. A common approach is the following:
- For each frequent itemset $I$:
- Generate all non-empty proper subsets $S\subset I$.
- For every such $S$:
- Output the rule $S \Rightarrow (I\setminus S)$ if $conf(S \Rightarrow I\setminus S) = \frac{sup(I)}{sup(S)} \ge minconf$.
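A minimal sketch of this rule-generation procedure is shown below. It assumes the output format of the Apriori sketch given earlier (a dict mapping frozenset itemsets to their support counts); by downward closure, every subset of a frequent itemset is present in that dict.

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """frequent: dict {frozenset(itemset): support_count}; returns (S, I-S, confidence) triples."""
    rules = []
    for itemset, sup_i in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup_i / frequent[antecedent]          # sup(I) / sup(S)
                if conf >= minconf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# Example usage, chained with the earlier sketch:
# rules = generate_rules(apriori(db, minsup=2), minconf=0.6)
```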
13.4.1 Evaluating association rules
In Definition 13.4 we saw the concept of confidence, but there are more measures to assess how 'good' a rule is, such as the lift or the correlation analysis.
Definition 13.8.
The lift of a rule $X \Rightarrow Y$ is
\[ lift(X \Rightarrow Y) = \frac{conf(X \Rightarrow Y)}{sup(Y)} = \frac{sup(X\cup Y)}{sup(X)\,sup(Y)}. \]
A lift greater than 1 indicates a positive correlation between $X$ and $Y$, and a lift smaller than 1 a negative correlation.
Definition 13.9.
The correlation coefficient is
\[ \phi = \frac{f_{11}f_{00} - f_{10}f_{01}}{\sqrt{f_{1+}\,f_{+1}\,f_{0+}\,f_{+0}}}, \]
where we are assuming a rule $X \Rightarrow Y$ and:
- $f_{11}$ is how many times $X$ is True and $Y$ is True,
- $f_{00}$ is how many times $X$ is False and $Y$ is False,
- $f_{10}$ is how many times $X$ is True and $Y$ is False,
- $f_{01}$ is how many times $X$ is False and $Y$ is True,
- $f_{1+}$ and $f_{0+}$ are how many times $X$ is True/False,
- $f_{+1}$ and $f_{+0}$ are how many times $Y$ is True/False.
Example 13.6.
Imagine we have the rule Tea ⇒ Coffee with the following data:
      | Coffee | ¬Coffee | Total
Tea   | 150    | 50      | 200
¬Tea  | 650    | 150     | 800
Total | 800    | 200     | 1000
In this case, the confidence is $conf(Tea \Rightarrow Coffee) = 150/200 = 0.75$, which is a high confidence... but $P(Coffee) = 800/1000 = 0.8$, which means that drinking tea in fact decreases the probability of drinking coffee!
Now, the lift is $lift(Tea \Rightarrow Coffee) = 0.75/0.8 \approx 0.94 < 1$, which means that Tea and Coffee are negatively correlated. This insight is better than the one obtained by only looking at the value of the confidence.
Example 13.7.
But let's now look at this example:
      | p   | ¬p | Total
q     | 880 | 50 | 930
¬q    | 50  | 20 | 70
Total | 930 | 70 | 1000

      | r  | ¬r  | Total
s     | 20 | 50  | 70
¬s    | 50 | 880 | 930
Total | 70 | 930 | 1000
In this case $lift(p \Rightarrow q) = 0.88/(0.93\cdot 0.93) \approx 1.02$ and $lift(r \Rightarrow s) = 0.02/(0.07\cdot 0.07) \approx 4.08$, so the lift suggests that $r$ and $s$ are much more strongly associated; but $p$ and $q$ appear together 88% of the time, while $r$ and $s$ appear together only 2% of the time. Confidence ($conf(p \Rightarrow q) \approx 0.95$ versus $conf(r \Rightarrow s) \approx 0.29$) is a better indicator in this case. The problem here is that $r$ and $s$ appear in a small portion of the records in the data.
13.4.2 How to choose a measure?
We have seen that we can obtain different conclusions by looking at different measures, so the choice of the measure is important because it will affect the results. A good choice must be based on a clear understanding of the measure and its properties, so that we are aware of its flaws and can take them into account.
We can also assess rules using interactive visualizations, subjective measures based on domain experience, etc.
14 Sequential pattern mining
In the last section, we were looking for patterns in a set-wise approach, but there also exist situations in which the order in which elements are encountered is important. For example, it is a well-known fact that seeing something can make people want to buy it. This fact is used by supermarkets to strategically place the products so that customers face as many temptations as possible. From the side of the supermarket, it is interesting to know which zones of the supermarket are related in such a way that customers tend to go from one to another in a particular order. Imagine we detect that people who go to the desserts hall usually go to the fruit hall afterwards. Then, the supermarket is interested in placing the fruits as far from the desserts as possible, to maximize the time the customer spends looking at products that they did not want before, but might want now. It can also be used to detect sequences like:
- Shopping sequences: .
- Website sessions: .
- Tourist routes: .
With this objective in mind, sequential pattern mining was developed.
Definition 14.1.
A sequence is an ordered list of elements, $s=\langle e_1\, e_2\, \dots\, e_n\rangle$. Each element $e_i$ is an unordered set of items.
A subsequence of a sequence $s=\langle e_1\, e_2\, \dots\, e_n\rangle$ is a sequence $t=\langle f_1\, f_2\, \dots\, f_m\rangle$, with $m\le n$, such that there exist indices $1\le i_1 < i_2 < \dots < i_m \le n$ with $f_1\subseteq e_{i_1},\, f_2\subseteq e_{i_2},\,\dots,\, f_m\subseteq e_{i_m}$.
In other words, $t$ is a subsequence of $s$ if each element of $t$ is contained in a distinct element of $s$, and these containing elements appear in $s$ in the same order as in $t$ (maybe not consecutively).
Definition 14.2.
A sequence database is a database of sequences.
The support of a sequence $t$ is the fraction of sequences in the database that contain $t$ as a subsequence.
Definition 14.3.
The sequential pattern mining problem consists in, given a sequence database, finding all subsequences whose support is at least the specified minsup.
This problem is obviously similar and related to that of frequent pattern mining, but it is more complex. A naive algorithm that generates all possible patterns and counts them in the database would be highly inefficient, even more so than in the set-wise case. There are several methods that work much better. First, let's see how to generate candidates.
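A minimal sketch of the subsequence test defined above is given next; it is also the core of support counting in the algorithms that follow. Sequences are represented as lists of sets of items, and the function names are hypothetical.

```python
def is_subsequence(t, s):
    """True if sequence t (list of sets) is a subsequence of sequence s (list of sets)."""
    i = 0                                    # current position in s
    for element in t:
        # advance in s until an element containing the current element of t is found
        while i < len(s) and not set(element) <= set(s[i]):
            i += 1
        if i == len(s):
            return False
        i += 1                               # each element of t must match a distinct element of s
    return True

def support(t, database):
    """Fraction of sequences in the database containing t as a subsequence."""
    return sum(is_subsequence(t, s) for s in database) / len(database)

# Example: is_subsequence([{'a'}, {'b','c'}], [{'a','d'}, {'b'}, {'b','c','e'}]) -> True
```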
14.1 Candidate generation
The candidate generation follows these steps:
- Base case ($k=2$): merging two frequent 1-sequences $\langle\{i_1\}\rangle$ and $\langle\{i_2\}\rangle$ produces two candidate 2-sequences: $\langle\{i_1\}\{i_2\}\rangle$ and $\langle\{i_1,i_2\}\rangle$.
- General case ($k>2$): two frequent $(k-1)$-sequences $s_1$ and $s_2$ are merged if the subsequence obtained by removing the first event in $s_1$ is the same as the subsequence obtained by removing the last event in $s_2$.
- The resulting candidate after merging is given by the sequence $s_1$ extended with the last event of $s_2$:
- If the last two events in $s_2$ belong to the same element, then the last event in $s_2$ becomes part of the last element in $s_1$.
- Otherwise, the last event in $s_2$ becomes a separate element appended to the end of $s_1$.
- If both $s_1$ and $s_2$ consist of only one element, we generate both variants.
Example 14.1.
If we merge and we will get the sequence , because the last two events in belong to the same element.
If we merge and we will get the sequence .
14.2 Generalized Sequential Pattern (GSP) Algorithm
- Make the first pass over the database to obtain all the frequent 1-sequences.
- Repeat until no new frequent sequences are found:
- Candidate generation: merge pairs of frequent $(k-1)$-sequences found in the previous pass to generate candidate $k$-sequences.
- Candidate pruning: prune candidate $k$-sequences that contain infrequent $(k-1)$-subsequences.
- Support counting: make a new pass over the sequence database to find the support of these candidate sequences.
- Candidate elimination: eliminate candidate $k$-sequences whose actual support is less than minsup.
Example 14.2.
Let's perform GSP on the following database with minsup = 2 (as a count).
Database:
Id | Sequence
1 |
2 |
3 |
4 |
First, we compute C1 and F1:
C1:
Subsequence | Count
| 3
| 4
| 4
| 1
F1:
Subsequence | Count
| 3
| 4
| 4
Now, we generate C2 by combining these subsequences. For example, merging and gives and , but we also need to combine in the other possible order, obtaining in addition to these. Thus, we obtain:
C2:
Subsequence | Count
| 1
| 1
| 3
| 2
| 2
| 3
| 0
| 1
| 0
| 4
| 0
| 0
F2:
Subsequence | Count
| 3
| 2
| 2
| 3
| 4
Now, from F2 we can combine with and , with and , with and , with none and with none. Thus:
C3:
Subsequence | Count
| 1
| 1
| 3
| 1
| 1
| 0
| 2
F3:
Subsequence | Count
| 3
| 2
And that's it because we cannot combine anything else. Thus, the result is F1, F2 and F3.
14.3 Sequential PAttern Discovery using Equivalence classes (SPADE) Algorithm
- Transform the database into its vertical format, i.e., with the SeqID, the ElemID inside the sequence, and the element itself.
- Construct the ID-list of 1-sequences, i.e., construct a table in which the elements are in the columns and in the cells we write all SeqID:ElemID pairs in which the element of the column appears.
- Count the distinct SeqIDs for each element. Those reaching minsup are kept and the rest are discarded.
- For $k\ge 2$:
- Construct new $k$-candidates using the frequent $(k-1)$-subsequences (as in GSP). Prune when possible.
- Construct the ID-list of $k$-sequences by joining the ID-lists of the merged $(k-1)$-sequences: for each SeqID we record the ElemIDs in which the events of the current subsequence appear.
- Count the distinct SeqIDs for each candidate. Those reaching minsup are kept and the rest are discarded.
Example 14.3.
Let's repeat the example with SPADE. First, we transform the database into its vertical form:
Database:
Id | Sequence
1 |
2 |
3 |
4 |
Vertical Database:
SeqID | ElemID | Element
1 | 1 |
1 | 2 |
1 | 3 |
2 | 1 |
2 | 2 |
2 | 3 |
3 | 1 |
3 | 2 |
3 | 3 |
4 | 1 |
4 | 2 |
4 | 3 |
Now, we construct the ID-list of 1-sequences:
1-ID-List (one column per item; cells are SeqID:ElemID pairs):
1:1 | 1:1 | 1:2 | 3:3
1:3 | 2:1 | 2:3 |
2:1 | 2:2 | 3:2 |
4:2 | 3:1 | 4:3 |
    | 4:1 |     |
    | 4:2 |     |
Count: 3 | 4 | 4 | 1
F1:
Subsequence | Count
| 3
| 4
| 4
Now the ID-list of 2-sequences, combining the 1-sequences:
2-ID-List: for each candidate 2-sequence, the ID-lists of its two 1-sequences are joined, storing SeqID:ElemID,ElemID entries, and the distinct SeqIDs per candidate are counted. The resulting counts for the twelve candidates are:
Count | 1 | 1 | 3 | 2 | 2 | 3 | 0 | 1 | 0 | 4 | 0 | 0
So
F2:
Subsequence | Count
| 3
| 2
| 2
| 3
| 4
And now we do the 3-ID-List combining these:
3-ID-List: constructed analogously by joining the ID-lists of the frequent 2-sequences. The resulting counts for the seven candidates are:
Count | 1 | 1 | 3 | 1 | 1 | 0 | 2
So
F3:
Subsequence | Count
| 3
| 2
And the result is (obviously) the same we got with GSP.
14.4 PrefixSpan
- Start by counting frequent 1-sequences, as in GSP.
- PrefixSpan extends each frequent subsequence recursively.
- For each frequent 1-sequence:
- Project the database with it as a prefix, i.e., for each sequence, remove everything until the first occurrence of the prefix is found.
- Count the possible expansions of the prefix with one additional event at the end.
- For the frequent 2-sequences, repeat the same procedure, and so on, until no new frequent $k$-sequences are discovered.
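A minimal sketch of the projection step for a single-item prefix is given below. It is a simplified illustration: the general case of multi-element prefixes (and the handling of items that share an element with the prefix) is omitted, and the function name is hypothetical.

```python
def project(database, item):
    """Return the suffixes of the sequences after the first occurrence of `item`."""
    projected = []
    for seq in database:                        # seq is a list of sets of items
        for pos, element in enumerate(seq):
            if item in element:
                # keep the remaining items of this element plus all later elements
                rest = set(element) - {item}
                suffix = ([rest] if rest else []) + [set(e) for e in seq[pos + 1:]]
                if suffix:
                    projected.append(suffix)
                break
    return projected

# Example: project([[{'a'}, {'a','b'}, {'c'}], [{'b'}, {'a'}]], 'a')
# -> [[{'a','b'}, {'c'}]]   (the second sequence's suffix is empty and is skipped)
```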
Example 14.4.
We are going to do the same example again. The database is
Database:
Id | Sequence
1 |
2 |
3 |
4 |
with frequent 1-sequences:
F1:
Subsequence | Count
| 3
| 4
| 4
So we can start with as a prefix. We project the database:
Database projected to this prefix:
Id | Sequence
1 |
2 |
3 |
4 |
And we count possible 2-sequences starting with :
C2 projected to this prefix:
Subsequence | Count
| 1
| 1
| 3
| 3
| 0
Now we select as prefix and count its possible extensions:
Database projected to this prefix:
Id | Sequence
1 |
2 |
4 |
C3 projected to this prefix:
Subsequence | Count
| 1
| 1
| 3
Now we select as prefix and again count its possible extensions (in this case the projected database only has one record, so there is no need to count):
Database projected to this prefix:
Id | Sequence
1 |
And with :
Database projected to this prefix:
Id | Sequence
1 |
2 |
4 |
C3 projected to this prefix:
Subsequence | Count
| 0
| 1
| 0
| 0
| 0
We would repeat the same with prefixes and , and the result has to be the same as with the two previous algorithms.
14.5 Some comments
Algorithm | How | Time | Space
GSP | Candidate generation | Counting against the DB | Space allocated for candidate generation
SPADE | Candidate generation | Counting against the ID-Lists | Space allocated for the ID-Lists
PrefixSpan | No candidate generation (prefix-based generation) | Counting on the projected DB | Space allocated for the projected DB, but there is a possible optimization using pointers
Part VI Stream Data Mining
15 Stream data mining
A data stream is a constant flow of data, which usually carries too much information for it to be feasibly stored. Somehow, we need to be able to apply data mining algorithms to datasets whose records we are only able to see once. This might seem esoteric, but there are plenty of applications in which we find this kind of need. For instance, credit card transactions, wearable sensor data, connected vehicles, the Internet of Things,...
More precisely, the challenges that data streams pose are:
- One pass constraint: the data size is assumed to be infinite. We cannot store everything to perform second passes through it. Say we want to cluster this type of data: we cannot apply an iterative approach like that of k-means.
- Drift: data evolves over time, as well as its statistical properties. This means that a model that works well today could work poorly tomorrow. We need to be able to adapt to these changes.
- Resource constraints: we are constrained by the arrival rate of the data, which can be variable and sometimes has huge peaks where data comes faster than expected. This implies that the algorithms need to be fast enough not to lose valuable data because of processing.
- Massive domain: some data attributes might have a large number of distinct values.
15.1 Bloom filter
The bloom filter is an idea thought to answer the question:
'Has this incoming element ever ocurred in the data stream before?'
The idea is to summarize the past of the data stream so that we can answer this question with a reasonable level of confidence. A Bloom filter is a data structure that consists of:
- A binary array B of length m.
- w independent hash functions h1, ..., hw, each mapping an element to a position in {0, ..., m-1}.
The algorithm developed to initialize this data structure is described in Algorithm 11. Basically, each element is hashed using all the hash functions, and then the corresponding cells in the array are set to True.
B = array(length=m, type=bool, default=False)
repeat
receive next element x in S
for i = 1..w
idx = hi(x)
B[idx] = True
until S ends
return B
Algorithm 11:
BloomConstruct(Stream S, Size m, Number of hash w)
Now, when a new element x comes from the stream, we compute h1(x), ..., hw(x). If all the corresponding cells in the Bloom filter are True, then we are confident (if the hash functions are well designed) that the element has already been seen before. If some of them are False, we know for sure that the element has not been seen before, and we update the filter.
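A minimal Bloom filter sketch following Algorithm 11 and this querying rule is shown next. Deriving the w hash functions by salting Python's built-in hash is an illustrative assumption, not a recommendation for real use, and the class and method names are hypothetical.

```python
class BloomFilter:
    def __init__(self, m, w):
        self.m, self.w = m, w
        self.bits = [False] * m

    def _positions(self, x):
        # w "independent" hash functions derived by salting the key
        return [hash((i, x)) % self.m for i in range(self.w)]

    def seen_before(self, x):
        """Query and insert: returns True if x was (probably) seen before."""
        pos = self._positions(x)
        already = all(self.bits[p] for p in pos)   # all cells True -> probably seen
        for p in pos:                              # update the filter in any case
            self.bits[p] = True
        return already

# Example:
# bf = BloomFilter(m=1000, w=3)
# print(bf.seen_before(42))   # False (first time)
# print(bf.seen_before(42))   # True
```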
Example 15.1.
Suppose we have the stream , , and . The procedure would be:
- Start with all-False bloom filter:
- 3 enters: and . Both F, so 3 has never been seen before. Update B:
- 8 enters: and . Both F, so 8 has never been seen before. Update B:
- 15 enters: and . The first is T, but the second is F, so 15 has never been seen before. Update B:
- 3 enters: and . Both T, so we assume 3 has indeed been seen before. In this case we are right! B is not updated.
- 5 enters: and . Both T, so we assume 5 has indeed been seen before. In this case we are wrong... it is a false positive.
15.2 Count-Min Sketch
In this case, we are interested in answering the question:
'How many times has this incoming element appeared in the data stream before?'
Again, the idea is to make a summary of the past data that enables us to answer the question. A count-min sketch is a data structure which consists of:
- A set of w different numeric arrays, each of length m.
- w independent hash functions h1, ..., hw, one per array, each mapping an element to a position in {0, ..., m-1}.
The algorithm developed to initialize this data structure is described in Algorithm 12. Basically, each element is hashed using all the hash functions, and then the corresponding cells (one per array) are incremented by 1. There are w rows to make collisions less harmful: the estimate for an element is inflated only if, in every row, it collides with some other element. To get the number of times an element has come, we take the minimum of the w cells it hashes to, because each of these cells is an upper bound for the number of times the element has been seen, and the minimum is the tightest such bound.
CM = Matrix(nrows=w, ncols=m, type=int, default=0)
repeat
receive next element x in S
for i=1..w
idx = hi(x)
CM[i,idx] += 1
until S ends
return CM
Algorithm 12: CountMinConstruct(Stream S, Width w, Height m)
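A minimal Count-Min sketch implementation following Algorithm 12 is given below, with the same salted-hash assumption as the Bloom filter sketch above; the class and method names are hypothetical.

```python
class CountMinSketch:
    def __init__(self, w, m):
        self.w, self.m = w, m
        self.table = [[0] * m for _ in range(w)]        # w rows of m counters

    def _positions(self, x):
        return [hash((i, x)) % self.m for i in range(self.w)]

    def add(self, x):
        for i, p in enumerate(self._positions(x)):
            self.table[i][p] += 1

    def estimate(self, x):
        """Upper bound on the number of times x has been added (minimum over the rows)."""
        return min(self.table[i][p] for i, p in enumerate(self._positions(x)))

# Example:
# cm = CountMinSketch(w=3, m=100)
# for item in [3, 8, 15, 3, 5]:
#     cm.add(item)
# print(cm.estimate(3))   # 2 (possibly more if collisions occurred)
```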
Example 15.2.
Suppose we have the stream , , and . The procedure would be:
- Start with all-0 count-min sketch:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
- 3 enters: and . Both 0, so 3 has been seen 0 times before. Update CM:
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
- 8 enters: and . One cell is 1 and the other is 0, so 8 has never been seen before. Update CM:
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0
0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0
- 15 enters: and . Both 0, so 15 has never been seen before. Update CM:
0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0
1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0
- 3 enters: and . One cell is 2 and the other is 1, so the estimate is the minimum, 1: 3 has been seen once before. Update CM:
0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0
1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0
- 5 enters: and . Both 1, so we assume 5 has indeed been seen before once. In this case we are wrong... it is a false positive. The table would be updated again:
0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 1 | 0
2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0
15.3 Flajolet-Martin algorithm
The Flajolet-Martin algorithm aims at answering the question
'How many distinct elements appeared in the data stream before?'
The intuitive idea is as follows: if the hash function distributes the values evenly over its range, then each bit of the output can be assumed to be 0 or 1 with probability 1/2. Thus, if we take the binary representation of the output, the last digit is 0 with probability 1/2. Now, the output having its last two digits equal to 0 would occur approximately 1 out of 4 times, and so on... The probability that the last r digits are 0 is 1/2^r, so if we see a value with r trailing zeros, the expected number of distinct records seen so far is around 2^r.
The algorithm works as follows:
- Given a data stream S, use a hash function h that maps each record to an L-bit integer; L is usually chosen to be 64.
- For each incoming record x, take the binary representation of h(x).
- Count the number of trailing zeros in it.
- Keep a variable R with the maximum number of trailing zeros found until now.
- The expected maximum number of trailing zeros over all stream elements is log2(d), where d is the number of distinct values. So, if we want to know how many distinct values we have seen until now, we estimate d ≈ 2^R.
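A minimal sketch of this estimator, using a single salted hash function as an illustrative assumption, is shown below; the function names are hypothetical.

```python
L = 64                                    # number of bits kept from the hash output

def trailing_zeros(value):
    if value == 0:
        return L                          # convention for the all-zero hash
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count

def flajolet_martin(stream):
    r_max = 0
    for x in stream:
        h = hash(('fm-salt', x)) & ((1 << L) - 1)   # keep L bits
        r_max = max(r_max, trailing_zeros(h))
    return 2 ** r_max                     # estimated number of distinct elements

# Example: flajolet_martin([3, 8, 15, 3, 5]) -> a rough estimate of the 4 distinct values
```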
Example 15.3.
Suppose we have the stream , .
- 3 enters: . .
- 8 enters: . .
- 15 enters: . .
- 3 enters. .
- 5 enters: . .
Thus, the estimation is and the real value is 4. Note, nonetheless, that as we are working with probabilities, the results are not expected to be accurate when there are very few values.
Improvements to Flajolet-Martin
- It can happen that one of the first values has lots of trailing zeros, ruining the estimate. A solution to avoid this is to use several hash functions and keep the maximum number of trailing zeros R1, ..., Rk for each of them. At the end, we would obtain R as the average of all these.
- It is also possible to have several points at which the records are collected, and thus we would like to synchronize the results from all of them. The solution is to treat all these points as if they were only one. Let's explain this a little bit. Imagine we have k measure points, where k streams are measured, one stream per measure point. Then, we would have k values R1, ..., Rk and we want to obtain the combined R. The idea is that, if we think of all the streams as a single stream which has been divided, then we would just take R as the maximum number of trailing zeros observed in the whole stream. Thus, when it is divided, if we know the maximum number of trailing zeros in each of the substreams, we also know the maximum number of trailing zeros in the whole stream: the maximum among the substreams. Thus, we would take R = max(R1, ..., Rk). Each of the substreams would work exactly as explained before (maybe with Improvement 1 implemented).
15.4 Hyperloglog
Hyperloglog is a generalization of Flajolet-Martin, which tries to improve the predictions to answer the same question of how many different values have been seen in the stream before.
The idea is that we use m buckets to count the trailing zeros observed in the incoming records. For an incoming record x, we compute h(x) and translate it to binary form. Then, the bucket in which it will count is the bucket numbered with the first b bits of h(x), where m = 2^b. In each bucket, we keep the maximum number of trailing zeros among the records classified into that bucket. At the end, as explained by Flajolet et al. in [2], the estimation is computed as a bias-corrected harmonic mean of the per-bucket estimates $2^{R_j}$,
\[ \hat{d} = \alpha_m\, m\, H\!\left(2^{R_1},\dots,2^{R_m}\right) = \alpha_m \frac{m^2}{\sum_{j=1}^{m} 2^{-R_j}}, \]
where m is the number of buckets, H is the harmonic mean and $\alpha_m$ is a correction constant.
Example 15.4.
Suppose we have the stream , and there are two buckets.
- 3 enters: . .
- 8 enters: . .
- 15 enters: . .
- 3 enters. .
- 5 enters: . .
Thus, the estimation is
Part VII Outlier mining
16 Outlier Mining
Definition 16.1.
An outlier is a data object that deviates significantly from the normal objects as if it were generated by a different underlying mechanism.
16.1 Types of outliers
- Global outlier: a record that significantly deviates from the rest of the data set. It can be found by clustering the data and defining an appropriate measurement of deviation.
- Contextual outlier: a record that deviates significantly within a selected context, i.e., a subset of the data. Usually, attributes can be categorized into:
- Contextual attributes: which define the context of the record, e.g. the time and location.
- Behavioral attributes: these are the proper measures of the records, which are used in the outlier evaluation, e.g. the temperature.
This way, if the context is Brussels in December, a temperature of 0 degrees would probably not be an outlier. But the same temperature in the Dominican Republic in June would be an outlier.
The main problem in this case is how to define a meaningful context.
- Collective outliers: a subset of the data records collectively deviates significantly from the whole data set, but the individual data records might not be outliers themselves. For example, in the lottery, someone has to win, so when we look at individuals winning a lottery prize we would not see anything strange. But if the same person wins several editions of the lottery, then it would be an outlier.
16.2 Challenges of outlier detection
- It is hard to enumerate all possible normal behaviors in an application and the boundary that separates normal data from outlier data is often in a gray area which is hard to identify.
- The choice of distance measures among objects and the relationship between objects is application-dependent, so there is no one-fits-all solution.
- As mentioned before, noise can blur the distinction between normal objects and outliers, because the algorithms can sometimes confuse noise with outliers.
- The understandability of the outliers is also hard to provide. It is not easy to explain how the detected outliers are produced in contrast to normal records. Usually, one thing we can do is estimate the unlikelihood of the record being generated by a normal mechanism (hypothesis testing).
16.3 Supervised methods for outlier detection
We can try to model outlier detection as a classification problem, in which the samples are examined and classified by domain experts and are later used for training and testing models.
16.3.1 One-Class model
The idea of the one-class model is to train a classification model that can distinguish normal data from outliers. For this, it requires many abnormal samples, as many as possible. The functioning is to learn the decision boundary of the normal class. Then, all samples that lie outside this boundary are labeled as outliers.
It has the advantage that it can detect new outliers that are different from past outliers.
It is possible to extend the model to a multi-class setting, in which the normal objects may also be classified into multiple classes.
16.4 Unsupervised methods for outlier detection
If we assume that the normal objects are somehow clustered into multiple groups, each of them having some distinctive features, we can expect outliers to be far away from any group of normal records.
This approach has the drawback that it is very hard to detect collective outliers effectively.
16.4.1 Proximity-based methods
In this case, a record is considered an outlier if its nearest neighbors are far away. The effectiveness of these methods relies on the proximity measure used, and it is often hard to find groups of outliers that are close to each other.
Distance-based outlier detection
Let $r$ be a distance threshold and $\pi$ a fraction threshold. An object $o$ is a $DB(r,\pi)$-outlier if
\[ \frac{\lVert \{o' \mid dist(o,o') \le r\}\rVert}{\lVert D\rVert} \le \pi. \]
So we count how many objects are within distance $r$ of $o$, and if these account for less than the predefined fraction $\pi$ of the dataset, then $o$ is considered an outlier.
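A minimal numpy sketch of this definition is shown below (a naive O(n^2) version, without the grid-based CELL optimization discussed next); the function name is hypothetical.

```python
import numpy as np

def db_outliers(X, r, pi):
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    neighbour_fraction = (dist <= r).mean(axis=1)                  # includes the point itself
    return np.where(neighbour_fraction <= pi)[0]                   # indices of the outliers

# Example (hypothetical data): db_outliers(np.random.rand(100, 2), r=0.1, pi=0.05)
```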
If we do it just like this, we would test each object against the whole data set one by one. This can be very costly for big datasets, and it can be improved with the CELL method, which is a grid-based method. In this case, the data space is partitioned into a multidimensional grid, in which each cell is a hypercube with diagonal length $r/2$. This allows us to improve the efficiency of the method by using two pruning rules:
- Definitions: let $C$ be the cell whose objects we want to assess:
- The level-1 cells are the cells adjacent to cell $C$.
- The level-2 cells are the cells (beyond level 1) which have some point at less than distance $r$ from $C$.
- Cell $C$: it is obvious that all objects in cell $C$ are at less than distance $r$ from each other, since the diagonal of a cell is at most $r/2$. Call $a$ the number of objects in $C$.
- Level-1 cell pruning rule: let $b_1$ be $a$ plus the number of objects in the level-1 cells of $C$. If $b_1 > \pi n$, then no object in $C$ is a $DB(r,\pi)$-outlier, because all these objects are in the $r$-neighborhood of every object of $C$ and there are more than $\pi n$ of them.
- Level-2 cell pruning rule: let $b_2$ be $b_1$ plus the number of objects in the level-2 cells of $C$. If $b_2 \le \pi n$, then all objects in $C$ are $DB(r,\pi)$-outliers, because even counting every object that could possibly be within distance $r$, we do not reach the fraction $\pi$.
16.4.2 Density-based methods
The idea in this case is that the density around an outlier object is significantly different from the density around its neighbors. Thus, we can use the relative density of an object against its neighbors as the indicator of the degree of the object being an outlier.
For this, we define the $k$-distance of an object $o$, $dist_k(o)$, as the distance between $o$ and its $k$-th nearest neighbor, and the $k$-distance neighbourhood of $o$ as
\[ N_k(o) = \{o' \mid dist(o,o') \le dist_k(o)\}. \]
Note that this set can have more than $k$ elements because multiple elements can have identical distance to $o$. The resulting density measure around the point $o$ (based on the distances to the objects in $N_k(o)$) can then be compared with that of its neighbors to decide whether $o$ is an outlier or not.
16.4.3 Clustering-based methods
The idea now is to cluster the data and classify an object as an outlier if:
- It does not belong to any cluster.
- There is a large distance between the object and its closest cluster.
- It belongs to a small or sparse cluster.
This approach presents several advantages:
- It can detect outliers without the need to label the data.
- It works for many datatypes.
- Clusters can be regarded as summaries of the data.
- The detection is fast once the clusters are obtained, because we only need to compare any object against the clusters.
But it also has some drawbacks:
- The effectiveness depends on the cluster method used.
- The cost of the clustering can be high. This can be partially fixed by using fixed-width clustering:
- A point is assigned to a cluster if the center of the cluster is within a predefined distance threshold from the point.
- If the point cannot be assigned to any existing cluster, a new cluster is created and the distance threshold may be learned from the training data under certain conditions.
References
1Charu C. Aggarwal, Data Mining (Springer International Publishing, 2015).
2Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier, "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm", in Discrete Mathematics and Theoretical Computer Science (2007), pp. 137--156.
3Mahmoud Sakr, "INFOH423 Data Mining".
4Ian Witten, Eibe Frank, and Mark Hall, Data Mining: Practical Machine Learning Tools and Techniques (Elsevier, 2011).