INFOH417 - Database Systems Architecture

Jose Antonio Lorencio Abril

Fall 2022


Professor: Mahmoud Sakr

Student e-mail: jose.lorencio.abril@ulb.be

This is a summary of the course Database Systems Architecture, taught at the Université Libre de Bruxelles by Professor Mahmoud Sakr in the academic year 22/23. Most of the content of this document is adapted from the course notes by Sakr, [3], so I won’t be citing it all the time. Other references will be provided when used.

Contents

I  Query Planning: Translating SQL into Relational Algebra
 1 Relational Algebra
  1.1 The extended relational algebra
  1.2 Relational algebra expressions
 2 Translating SQL into Relational Algebra
  2.1 SELECT-FROM-WHERE statements without subqueries
  2.2 Normalizing WHERE-subqueries into EXISTS and NOT EXISTS form
  2.3 Translating SELECT-FROM-WHERE subqueries
  2.4 De-correlation of subqueries appearing in a conjunctive WHERE condition
  2.5 Flattening subqueries in bag-based relations
II  Query Optimization
 3 System-R
  3.1 Architecture Components
  3.2 Query language
  3.3 Catalogues
  3.4 Cursors
  3.5 Clustering images
  3.6 Optimizer
  3.7 PostgreSQL in relation to System R
 4 Query Optimization
  4.1 Cost-based query optimization
  4.2 Viewing query evaluation plans
  4.3 Generating equivalent expressions
  4.4 Enumeration of equivalent expressions
  4.5 Cost estimation
  4.6 Choice of execution plan
 5 Statistics for cost estimation
  5.1 Histograms
  5.2 Estimation of selection size
  5.3 Estimation of the size of joins
III  Indexing
 6 Conventional indexes
  6.1 Sparse second level index
  6.2 How to deal with duplicate keys.
  6.3 How to delete records
  6.4 How to insert records
  6.5 Secondary indexes
 7 B-Trees
  7.1 Lookup in BTree
  7.2 Range queries
  7.3 Insertion into a BTree
  7.4 Deletion from a BTree
IV  Physical Query Plans
 8 Physical Query Plans
  8.1 Computing joins
  8.2 Factors that affect performance
V  Extensibility
 9 Extensible databases: PostgreSQL
  9.1 Types
  9.2 Functions
  9.3 Procedures
  9.4 Interfacing extensions to indexes
  9.5 Steps to create a PostgreSQL extension
VI  Failure Recovery and concurrency control
 10 Failure recovery
  10.1 Key problem: unfinished transactions
  10.2 Logging
 11 Concurrency control
  11.1 Schedules: serial, serializable and conflict-serializable
  11.2 How to enforce serializability: locking
  11.3 Shared locks
  11.4 More types of locks
  11.5 Lock granularity
VII  Distributed Databases
 12 Distributed databases
  12.1 Data distribution
  12.2 Distributed data access: distributed SQL
  12.3 Distributed transactions
  12.4 Replication
  12.5 CAP theorem
  12.6 PACELC theorem
  12.7 More trade-offs

List of Figures

Architecture of System R
Result of the program. Stored data in EMP (left). Active Set (right).
A left-deep join tree (top) and not a left-deep join tree (bottom).
A BTree. Source: [2].

List of Algorithms

1 procedure findbestplan(S)
2 Iteration Join
3 Merge Join
4 Index Join
5 Hash Join (k buckets G1...Gk, H1...Hk)
6 Undo logging: recovery rules
7 Redo logging: recovery rules

Part I
Query Planning: Translating SQL into Relational Algebra

1 Relational Algebra

We are going to start with some definitions:

Definition 1.1. A relation is a table whose columns have names, called attributes. The set of all attributes is called the schema of the relation. The rows of the table are tuples of values for each of the attributes, and are called simply tuples. We are going to denote a relation by R, and we will write R ~ [A1,...,An] to indicate the schema of the relation, where Ai, i = 1,...,n, are the attributes of the schema. If two relations, R1 and R2, share the same schema, we will simply write R1 ≃ R2.

A relation is set-based if there are no duplicate tuples in it. If this is not the case, the relation is bag-based.

A relational algebra operator takes as input one or more relations and produces as output a new relation. More formally, if we have a set of relations Σ = {R1,...,Rn} ⊆ U, where U denotes the set of all possible relations, a relational algebra operator is a function

Op : P(Σ) → U,

where P(Σ) is the power set of Σ.

Example 1.1. As an example, we can take Σ = {StarsIn, MovieStar}, with StarsIn ~ [starName, filmName] and MovieStar ~ [name, birthDate]. In this case, an operator Op could be one that produces the relation containing all names of films in which some movie star of MovieStar born in 1960 participated.

In this example, we have explained what we would like our operator to do, but we need some way to actually compute this. For this, there are some basic operators that can be combined to create complex operators.

1.1 The extended relational algebra

Let’s define a set of operators that are useful:

The union of two relations with the same schema returns another relation with the same schema and all tuples in any of the two input relations: Let Ri, Rj ∈ Σ such that Ri ≃ Rj, then

Ri ∪ Rj = {x|x ∈ Ri ∨ x ∈ Rj} ≃ Ri ≃ Rj.

Note, nonetheless, that the result of the operator is different in set-based relations than in bag-based relations.

Example 1.2. An example of the operator ∪. Let R1 ~ [A, B] contain the tuples (1,2), (3,4), (5,6) and R2 ~ [A, B] contain (3,4), (1,5). Then

set-based: R1 ∪ R2 = {(1,2), (3,4), (5,6), (1,5)},

bag-based: R1 ∪ R2 = {(1,2), (3,4), (5,6), (3,4), (1,5)}.
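To make the set/bag distinction concrete, here is a minimal Python sketch (our own illustration, with a relation modelled simply as a list of tuples):

def bag_union(r1, r2):
    # Bag union keeps every occurrence from both inputs.
    return r1 + r2

def set_union(r1, r2):
    # Set union removes duplicates; sorted() is only used for a stable display order.
    return sorted(set(r1) | set(r2))

R1 = [(1, 2), (3, 4), (5, 6)]
R2 = [(3, 4), (1, 5)]
print(set_union(R1, R2))  # [(1, 2), (1, 5), (3, 4), (5, 6)]
print(bag_union(R1, R2))  # [(1, 2), (3, 4), (5, 6), (3, 4), (1, 5)]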

The intersection of two relations with the same schema returns another relation with the same schema and all tuples in both of the two input relations: Let Ri, Rj ∈ Σ such that Ri ≃ Rj, then

Ri ∩ Rj = {x|x ∈ Ri ∧ x ∈ Rj} ≃ Ri ≃ Rj.

Example 1.3. An example of the operator ∩, over relations with schema [A, B]:

{(1,2), (3,4), (5,6)} ∩ {(3,4), (1,5)} = {(3,4)}.

The difference of two relations with the same schema returns another relation with the same schema and all tuples in the first input relation which don't appear in the second input relation: Let Ri, Rj ∈ Σ such that Ri ≃ Rj, then

Ri - Rj = {x | x ∈ Ri ∧ x ∉ Rj} ≃ Ri ≃ Rj.

Example 1.4. An example of the operator -, over relations with schema [A, B]:

{(1,2), (3,4), (5,6)} - {(3,4), (1,5)} = {(1,2), (5,6)}.

The selection operator applies a condition on the values of the tuples of the input relation and returns only those tuples that fulfil the condition: Let R ∈ Σ and P a condition, then

σP(R) = {x | x ∈ R ∧ P(x) = true}.

Example 1.5. An example of the operator σP, over a relation with schema [A, B]:

σA≥3({(1,2), (3,4), (5,6)}) = {(3,4), (5,6)}.

In this case, the condition P is: 'the value of A is greater than or equal to 3'.

The projection operator returns all tuples of the input relation, but deleting all unspecified attributes: Let R ∈ Σ and {Aj1,...,Ajk} ⊆ [A1,...,An] ~ R, then

ΠAj1,...,Ajk(R) = R′[Aj1,...,Ajk] = {y | ∃x ∈ R s.t. x(Aj1,...,Ajk) = y}.

The result of this operation also depends on the type of relations used.

Example 1.6. An example of the operator ΠAj1,...,Ajk, over a relation with schema [A, B, C, D] and tuples (1,2,3,5), (3,4,3,6), (5,6,5,9), (1,6,3,5):

set-based: ΠA,C = {(1,3), (3,3), (5,5)},

bag-based: ΠA,C = {(1,3), (3,3), (5,5), (1,3)}.

The cartesian product of two relations with disjoint schemas returns a relation with the schema resulting of combining both schemas and with all possible tuples made out of tuples from the first relation and tuples from the second relation: Let Ri, Rj ∈ Σ such that their schemas are disjoint, then

Ri × Rj = {z = (x,y)|x ∈ Ri ∧ y ∈ Rj }.

Example 1.7. An example of the operator ×:

With R1 ~ [A, B] = {(1,2), (3,4)} and R2 ~ [C, D] = {(2,6), (3,7), (4,9)}:

R1 × R2 = {(1,2,2,6), (1,2,3,7), (1,2,4,9), (3,4,2,6), (3,4,3,7), (3,4,4,9)}.

The natural join of two relations whose schemas share at most one attribute returns a relation with the schema resulting of combining both schemas and with all possible tuples made out of tuples from the first relation and tuples from the second relation, with the condition that they have the same value for the shared attribute: Let Ri, Rj ∈ Σ such that their schemas share at most one attribute, A, then

Ri ⊳⊲ Rj = {z = (x,y)|x ∈ Ri ∧ y ∈ Rj ∧ x(A) = y (A)}.

Note that if the schemas are disjoint, the natural join gives the same result as the cartesian product.

Example 1.8. An example of the operator ⊳⊲:

With R1 ~ [A, B] = {(1,2), (3,4)} and R2 ~ [B, D] = {(2,6), (3,7), (4,9)}:

R1 ⊳⊲ R2 = {(1,2,6), (3,4,9)}.

The theta join of two relations given a condition P returns all the tuples in the cartesian product of the two relations that fulfil the condition P: Let Ri, Rj ∈ Σ and P a condition, then

Ri ⊳⊲P Rj = {x | x ∈ Ri × Rj ∧ P(x) = true} = σP(Ri × Rj).

Example 1.9. An example of the operator ⊳⊲P:

With R1 ~ [A, B] = {(1,2), (3,4)} and R2 ~ [C, D] = {(2,6), (3,7), (4,9)}:

R1 ⊳⊲B=C R2 = {(1,2,2,6), (3,4,4,9)}.

The left/right/full outer join operators are similar to the theta join, but for each tuple of the left/right/both relation(s) that does not find a match in the other relation, they return an additional tuple with the values of that tuple and the remaining attributes left empty.

Example 1.10. An example of the left, right and full outer joins, with R1 ~ [A, B] = {(1,2), (3,4), (5,5)} and R2 ~ [C, D] = {(2,6), (3,7), (4,9)}:

left outer join: R1 =⊳⊲B=C R2 = {(1,2,2,6), (3,4,4,9), (5,5,-,-)},

right outer join: R1 ⊳⊲=B=C R2 = {(1,2,2,6), (3,4,4,9), (-,-,3,7)},

full outer join: R1 =⊳⊲=B=C R2 = {(1,2,2,6), (3,4,4,9), (5,5,-,-), (-,-,3,7)}.

The renaming operator changes the name of a relation, ρR′(R) = R′. The feature rename operator changes the name of an attribute in a relation, A → A′.

The aggregation operator of a relation returns another relation in which the tuples that share the value of the aggregating attribute are merged using aggregate functions: Let R ∈ Σ, with R ~ [A1,...,An,A], A the aggregating attribute and f1,...,fn the aggregating functions for the rest of the attributes, then

γA,f1(A1),...,fn(An)(R) = {z = (vA, f1(A1^vA), ..., fn(An^vA)) | vA ∈ R(A)},

where Aj^vA is a short notation for

Aj^vA = ΠAj(σA=vA(R)),

i.e. all values of Aj in R that come from a tuple whose value for the attribute A is vA.

Example 1.11. An example of the operator γA,f1(A1),...,fn(An)(R):

With R ~ [A, B] = {(1,2), (1,1), (3,7), (3,9), (4,4)}:

γA,min(B)(R) = {(1,1), (3,7), (4,4)}, with schema [A, min(B)].

1.2 Relational algebra expressions

Now, we can build expressions in relational algebra to get new relations from current ones. Returning to Example 1.1, we can define the operator Op that produces the relation containing all names of films in which some movie star of MovieStar born in 1960 participated as:

R′ = Op(MovieStar, StarsIn) = ΠfilmName(σbirthDate.year=1960(MovieStar ⊳⊲name=starName StarsIn)).

Relational algebra is the theoretical basis of the SQL language, meaning SQL is designed as an implementation of the relational algebra operators that we have seen so far. The SQL sentence equivalent to the last RA operator, Op, is

SELECT filmName 
FROM StarsIn 
JOIN MovieStar ON name=starName 
WHERE birthDate.year = 1960;

As explained in [4], translating an arbitrary SQL query into a logical query plan, or, equivalently, a relational algebra expression, is a complex task. Let’s first give some examples.

Example 1.12. We are going to work with some examples now, on a database with the relations StarsIn and MovieStar.

SQL:

SELECT movieTitle, count(S.starName) AS numStars 
FROM StarsIn S, MovieStar M 
WHERE S.starName = M.name 
GROUP BY movieTitle;

RA:

γM.movieTitle, count(S.starName)→numStars(ρS(StarsIn) ⊳⊲S.starName=M.name ρM(MovieStar)).

SQL:

SELECT movieTitle, count(S.starName) AS numStars 
FROM StarsIn S, MovieStar M 
WHERE S.starName = M.name 
GROUP BY movieTitle 
HAVING count(S.starName) > 5;

RA:

σnumStars>5(γM.movieTitle, count(S.starName)→numStars(ρS(StarsIn) ⊳⊲S.starName=M.name ρM(MovieStar))).

At this point, one can understand that it is not easy at all to automate this procedure of translating from SQL to RA. Not only is the process non-trivial as is, but we also need to take into account that one SQL sentence can be translated into several equivalent RA expressions, which will ultimately be executed in a computer, and the choice of which translation to execute will affect the efficiency of the program. Let's review the paper [4], explaining each of the translations, assuming set-based relations.

2 Translating SQL into Relational Algebra

2.1 SELECT-FROM-WHERE statements without subqueries

A query of the form:

SELECT select-list 
FROM R1 T1,..., Rn Tn 
WHERE condition;

in which the condition does not involve subqueries, can be translated as

Πselect-list(σcondition(ρT1(R1)× ...× ρTn(Rn ))).
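As a quick illustration (our own, reusing the relations of Example 1.1), the query

SELECT name 
FROM MovieStar M 
WHERE birthDate = 1960;

is translated as Πname(σbirthDate=1960(ρM(MovieStar))).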

2.2 Normalizing WHERE-subqueries into EXISTS and NOT EXISTS form

In general, queries in which there are subqueries in the WHERE clause can arise, and they need to be translated, too. The property used in these cases is that subqueries occurring in the WHERE clause that use the operators =,<,>,<=,>=,<>,EXISTS,IN,NOT EXISTS,NOT IN or the quantifiers ANY or ALL can all be rewritten to use the operators EXISTS and NOT EXISTS.

Proposition 2.1. All conditions using a subquery can be rewritten using only EXISTS and NOT EXISTS.

Proof. Let's prove some of the results. Consider first a query of the form:

SELECT select-list 
FROM R1 
WHERE R1.A = ANY (SELECT B 
                       FROM R2 
                       WHERE cond);

An equivalent query is:

SELECT select-list 
FROM R1 
WHERE EXISTS(SELECT B 
                FROM R2 
                WHERE cond AND R2.B = R1.A);

Now consider:

SELECT select-list 
FROM R1 
WHERE R1.A = ALL (SELECT B 
                       FROM R2 
                       WHERE cond);

An equivalent query is:

SELECT select-list 
FROM R1 
WHERE NOT EXISTS(SELECT B 
                       FROM R2 
                       WHERE cond AND R2.B <> R1.A);

The rest of the cases binaryOp + ANY|ALL are similar.

SELECT select-list 
FROM R1 
WHERE R1.A IN (SELECT B 
                  FROM R2 
                  WHERE cond);

An equivalent query is:

SELECT select-list 
FROM R1 
WHERE EXISTS(SELECT B 
                FROM R2 
                WHERE cond AND R2.B = R1.A);

The case NOT IN is analogous. □

Example 2.1. Let’s see some examples from the paper:

The query

SELECT movieTitle FROM StarsIn 
WHERE starName IN (SELECT name 
                  FROM MovieStar 
                  WHERE birthdate = 1960);

is equivalent to:

SELECT movieTitle FROM StarsIn 
WHERE EXISTS (SELECT name 
         FROM MovieStar 
         WHERE birthdate = 1960 AND name = starName);

The query

SELECT name FROM MovieExec 
WHERE netWorth >= ALL (SELECT E.networth 
                       FROM MovieExec E);

is equivalent to:

SELECT name FROM MovieExec 
WHERE NOT EXISTS (SELECT E.networth 
                 FROM MovieExec E 
                 WHERE netWorth < E.netWorth);

Without loss of generality, we can now assume that all subqueries in the where clause are of the form EXISTS or NOT EXISTS.

Now, to translate a query with subqueries, in which an arbitrary number of subqueries nested inside subqueries may arise, it seems logical to proceed recursively. The idea is to translate into RA from the inner queries to the outer queries. Subqueries that do not contain further subqueries could be translated as in Section 2.1. The problem is that subqueries can refer to attributes of relations appearing in the FROM clause of the outer queries. Such queries are known as correlated queries.

Example 2.2. A correlated query.

SELECT movieTitle 
FROM StarsIn 
WHERE EXISTS (SELECT name 
               FROM MovieStar 
               WHERE birthdate = 1960 AND name = starName);

The outer relations from which a correlated subquery uses certain attributes are called context relations. The attributes of the context relations are the parameters of the subquery.

2.3 Translating SELECT-FROM-WHERE subqueries

To translate a SELECT-FROM-WHERE statement that is used as a subquery, we must make the following modifications to the method from Section 2.1: the context relations are added to the cartesian product in the FROM part, and the projection keeps, besides the select-list, all the attributes of the context relations (the parameters), so that the outer query can still refer to them.

Example 2.3. The subquery from Example 2.2:

SELECT name 
FROM MovieStar 
WHERE birthdate = 1960 AND name = starName

is translated into

ΠmovieTitle,movieYear,starName,name(σbirthdate=1960 ∧ name=starName(StarsIn × MovieStar)).

2.4 De-correlation of subqueries appearing in a conjunctive WHERE condition

Now, let’s focus on a particular case:

Suppose we have a query of the general form:

SELECT Select-list 
FROM from-list 
WHERE condition;

Assume, further, that the condition is a conjunction (AND) of SELECT-FROM-WHERE subqueries, possibly with an additional condition that does not contain subqueries, i.e., the condition is of the form

ϕ AND EXISTS(Q1) AND EXISTS(Q2) AND ... AND NOT EXISTS(P1) AND ...

where ϕ denotes the subquery-free condition and Q1,...,Qn,P1,...,Pm are select statements. The translation is done in four steps:

  1. Translate ϕ.

  2. De-correlate the EXISTS subqueries.

  3. De-correlate the NOT EXISTS subqueries.

  4. Apply the projection ΠSelect-list.

2.4.1 Translating ϕ

It is translated using the method of Section 2.1, but the following context relations must be included in the cartesian product: those from which ϕ uses some parameter, and those that appear only in NOT EXISTS subqueries.

We will obtain an expression of the form

σϕ(E),

where E is a cartesian product of all the context relations involved. From now on, we are going to adapt and refine E gradually when de-correlating the subqueries.

Example 2.4. Consider the following query, with relations R(A,B ) and S(C ):

SELECT R1.A, R1.B 
FROM R R1, S 
WHERE EXISTS 
       (SELECT R2.A, R2.B 
       FROM R R2 
       WHERE R2.A = R1.B AND EXISTS 
                       (SELECT R3.A, R3.B 
                       FROM R R3 
                       WHERE R3.A = R2.B AND R3.B = S.C));

Let’s denote the queries, from outer to inner: Q1,Q2 and Q3. Q1 does not have a subquery-free part, so we continue with Q2. The subquery-free part of Q2 is:

SELECT * 
FROM R R2 
WHERE R2.A = R1.B;

So it can be translated as

σR2.A=R1.B(ρR2(R) ×ρR1 (R)).

Note that S is a context relation for this subquery-free part, but no parameter from it is needed and it does not appear only in NOT EXISTS clauses, so it is not added.

2.4.2 De-correlating EXISTS subqueries

After translating the subquery-free part, we translate all the subqueries EXISTS(Qi) as explained in Section 2.3, obtaining an algebra expression EQi.

Let A1,...,Ap be the list of parameters of context relations of Qi. We can translate EXISTS(Qi) by joining E with the space of parameters for EQi, namely ΠA1,...,Ap(EQi ) :

E := E ⊳⊲ πA1,...,Ap (EQi).

Example 2.5. Let’s continue the translation of Q2 from Example 2.4. Now, we have to translate Q3 as:

σR3.A=R2.B ∧ R3.B=S.C (ρR3(R) ×ρR2 (R)× S).

At this point, we have

E = ρR2 (R )× ρR1 (R),

EQ3 = σR3.A=R2.B ∧ R3.B=S.C (ρR3(R) ×ρR2 (R)× S),

and by joining E and EQ3 on the parameters of Q3 we ensure that we are taking the correct tuples from E and EQ3. In particular, we are taking the tuples in R1 for which tuples in R2,R3 and S exist that satisfy the requirements of Q2:

ρR2(R) × ρR1(R) ⊳⊲ ΠR2.A,R2.B,S.C(σR3.A=R2.B ∧ R3.B=S.C(ρR3(R) × ρR2(R) × S)).

Note that this expression can be simplified:

E := ρR1(R) ⊳⊲ ΠR2.A,R2.B,S.C(σR3.A=R2.B ∧ R3.B=S.C(ρR3(R) × ρR2(R) × S)),

because we are joining R2 with a subset of itself, so we would obtain exactly that subset anyway.

Remark 2.1. This simplification can always be done. Before joining with ΠA1,...,Ap(EQi), we can remove from E all context relations for Qi, because they are already present in the parameter space. This way, denoting by Ê the adapted E, we can replace the rule given above by

E := Ê ⊳⊲ ΠA1,...,Ap(EQi).

Example 2.6. Now we can translate Q2 as follows:

E2 := σR2.A=R1.B (E) =

σR2.A=R1.B(ρR1(R) ⊳⊲ ΠR2.A,R2.B,S.C(σR3.A=R2.B ∧ R3.B=S.C(ρR3(R) × ρR2(R) × S))).

Notice how R2 has been removed from the cartesian product of the subquery-free part of Q2 that we translated in the first of the examples.

Finally, the translation of the entire Q1 is

ΠR1.A,R1.B(E2),

where ρR1(R) and S have been removed from the cartesian product originating from the translation of the subquery-free part of Q1 (the FROM clause).

2.4.3 De-correlating NOT EXISTS subqueries

Now we can de-correlate the NOT EXISTS(Pj) subqueries. We start translating Pj into a RA expression EPj. Again, we consider the parameters A1,...,Ap of the context relations of Pj. The difference now is that we don’t join E and EPj, but we perform an anti-join:

E := E ▷ ΠA1,...,Ap(EPj),

where ▷ denotes the anti-join,

R ▷ S = R - (R ⊳⊲ S).

For this anti-join, it is necessary that R contains all attributes of S, and this is the reason why all context relations appearing only in NOT EXISTS clauses need to be added to the cartesian product of the subquery-free part of the query.
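The following minimal Python sketch (our own, with relations modelled as lists of dictionaries) implements the anti-join just defined: it keeps the tuples of R whose projection on the attributes of S does not appear in S.

def anti_join(r, s, s_attrs):
    # R is assumed to contain all attributes of S; keep the tuples of R whose
    # projection on S's attributes does not appear in S, i.e. R - (R join S).
    s_proj = {tuple(t[a] for a in s_attrs) for t in s}
    return [t for t in r if tuple(t[a] for a in s_attrs) not in s_proj]

E = [{"A": 1, "B": 2}, {"A": 3, "B": 4}, {"A": 5, "B": 6}]
P = [{"B": 4}]
print(anti_join(E, P, ["B"]))  # [{'A': 1, 'B': 2}, {'A': 5, 'B': 6}]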

2.4.4 Translating the Select-list

Finally, we apply the projection ΠSelect-list.

2.5 Flattening subqueries in bag-based relations

Until now, we have supposed that all relations involved are set-based, but this is not the case in real databases, where duplicates can occur. In this case, the requirements for flattening into a normal join are:

Part II
Query Optimization

3 System-R

In this section, we are going to explain System R, a pioneering SQL system developed by IBM Research and released in 1976 with an accompanying paper, [1].

System R was an experimental prototype database management system with full capability, including application programming, query capability, concurrent access support, system recovery, etc.

3.1 Architecture Components

System R is composed of several parts, the main ones being the Relational Data System (RDS), which provides the Relational Data Interface (RDI) and contains the optimizer, and the Relational Storage System (RSS), which provides the Relational Storage Interface (RSI) and manages the physical storage.

A logical diagram of this architecture is depicted in Figure 1.


PIC

Figure 1: Architecture of System R


3.2 Query language

3.2.1 Data manipulation

The RDI interfaces SQL to a host programming language by means of a concept called a cursor, which is a name used at the RDI to identify a set of tuples called its active set, and to maintain a position on one tuple of the set. The cursor is associated with a set of tuples by means of the RDI operator SEQUEL; the tuples may then be retrieved, one at a time, by the RDI operator FETCH. The program must first give the system the addresses of the program variables to be used, by means of the RDI operator BIND.

Example 3.1. Here, the host program identifies variables X and Y to the system and then issues a query whose results are to be placed in these variables:

CALL BIND(X, ADDR(X)); 
CALL BIND(Y, ADDR(Y)); 
CALL SEQUEL(C1, SELECT NAME:X, SAL:Y 
               FROM EMP 
               WHERE JOB = "PROGRAMMER"); 
CALL FETCH(C1);

The SEQUEL operator associates the cursor C1 with the set of tuples which satisfy the query and positions it just before the first such tuple. The optimizer is invoked to choose an access path whereby the tuples may be materialized, but no tuples are actually materialized in response to the SEQUEL call. The materialization is done as they are called for, one at a time, by the FETCH operator. Each call to FETCH delivers the next tuple of the active set into the program variables X and Y. In Figure 2 we can see an example of the data stored in the relation EMP and the resulting active set. In this case, after calling FETCH(C1), the values of the variables would be X=”Mike” and Y=800. If another call to FETCH(C1) were made, then the variables would be overwritten with X=”Sarah” and Y=810.





EMP:

NAME     SAL   JOB
“John”   1000  “CEO”
“Mike”   800   “PROGRAMMER”
“Sarah”  810   “PROGRAMMER”

Active Set:

NAME     SAL
“Mike”   800
“Sarah”  810

Figure 2: Result of the program. Stored data in EMP (left). Active Set (right).


The DESCRIBE operator returns the degree and the data types of the active set. The degree is the number of attributes. It is useful when this information is not known in advance, so that it can be supplied to the FETCH operator.

The operator OPEN is used to associate a cursor with an entire relation.

Each cursor remains active until an RDI operator CLOSE or KEEP is issued on it. CLOSE deactivates the cursor, while KEEP causes the tuples identified by a cursor to be copied to form a new permanent relation in the database.

The operator FETCH_HOLD behaves like FETCH, but it also acquires a hold on the tuple returned, which prevents other users from updating or deleting it until it is explicitly released by the RELEASE operator or until the holding transaction has ended.

3.2.2 Data definition

The SQL statement CREATE TABLE is used to create a new base relation. For each field, the field name and data type are specified. When a relation is no longer useful, it may be deleted by issuing a DROP TABLE statement.

Access paths include images and binary links. Images are value orderings maintained on base relations by the RSS, using multilevel index structures, associating a value with one or more tuple identifiers (TIDs), which are internal addresses allowing rapid access to a tuple. One image per relation can have the clustering property, which causes tuples whose sort field values are close to be physically stored near each other. Binary links are access paths in the RSS which link tuples in one relation to related tuples of another relation through pointer chains. They are employed in a value-dependent manner: the user specifies that each tuple of Relation 1 is to be linked to the tuples in Relation 2 which have matching values in some field(s), and that the tuples on the link are to be ordered in some way. A link may be declared to have the clustering property.

A view is a relation derived from one or more relations, and can be used in the same way as a base table. It can be defined using the DEFINE VIEW statement. Views are updated automatically when changes are made to the base tables on which they are defined. When the statement DROP VIEW is issued, the indicated view and all other views defined in terms of it disappear from the system. Modifications to views are only allowed if the tuples of the view are associated one-to-one with tuples of an underlying base relation.

The statement KEEP TABLE causes a temporary table to become permanent.

The statement EXPAND TABLE is used to add a new field to an existing table.

3.2.3 Data Control

A transaction is a series of RDI calls which the user wishes to be processed as an atomic act. A transaction starts when the user issues a BEGIN_TRANS statement and ends when END_TRANS is called. Save points may be specified by means of the operator SAVE. When a transaction is active, the user may go back to the beginning of it, or to any save point using RESTORE.

Regarding authorization, System R does not require a particular individual to be the DB administrator, but allows each user to create his own data objects by executing the create statements. The creator of an object has full authorization on it. The user can grant selected capabilities for his objects to other users with the statement GRANT.

About integrity assertions, any SQL predicate may be stated as an assertion about the integrity of data in a base table or view. When an assertion is made by an ASSERT statement, its truth is checked. If true, the assertion is automatically enforced until it is explicitly dropped by a DROP ASSERTION statement. Assertions may describe the permissible states of the database or the permissible transitions in the database. For this latter purpose, the keywords OLD and NEW are used in SQL to denote data values before and after modification.

If an assertion is IMMEDIATE, it cannot be suspended within a transaction, but is enforced after each data modification. Also, integrity points may be established by the SQL ENFORCE INTEGRITY.

Triggers are a generalization of the concept of assertion, causing a prespecified sequence of SQL statements to be executed when some triggering event occurs.

3.3 Catalogues

Catalogues are maintained by the RDS, and they describe the information of the relations, views, images, links, assertions and triggers known to the system. Each user may access a set of views of the system catalogs which contain information pertinent to him. Users cannot modify a catalog directly; it is modified indirectly when tables are created, an image is dropped, etc. A user can enter comments into his various catalog entries by means of the COMMENT statement.

3.4 Cursors

As we have seen, cursors are pointers to specific tuples of the table resulting from a query. They can be used to retrieve the values of the tuples individually or to store the tables into the database as permanent relations. Cursors are still used today, although they are often a low-level feature that is not directly used by users, but is used by the DBMS to provide higher-level features to the user.
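As a rough modern analogue (a sketch using Python's built-in sqlite3 module, not System R itself), the DB-API cursor plays essentially the same role as the RDI cursor: a query is associated with the cursor, and tuples are delivered one at a time on demand.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (NAME TEXT, SAL INT, JOB TEXT)")
conn.executemany("INSERT INTO EMP VALUES (?, ?, ?)",
                 [("John", 1000, "CEO"), ("Mike", 800, "PROGRAMMER"),
                  ("Sarah", 810, "PROGRAMMER")])

# The cursor is associated with the active set of the query (cf. SEQUEL)...
cur = conn.execute("SELECT NAME, SAL FROM EMP WHERE JOB = 'PROGRAMMER'")
# ...and tuples are fetched one at a time (cf. FETCH).
print(cur.fetchone())  # ('Mike', 800)
print(cur.fetchone())  # ('Sarah', 810)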

In addition, SQL can be used to manipulate either one tuple at a time or a set of tuples with a single command. The current tuple of a particular cursor may be selected for some operation using the predicate CURRENT TUPLE OF CURSOR.

Example 3.2. Give a 10% raise to all employees in Dept. 50.

CALL SEQUEL(UPDATE EMP 
               SET SAL = SAL*1.1 
               WHERE DNO = 50);

Example 3.3. Individual update.

CALL BIND(NEWSAL, ADDR(NEWSAL)); 
CALL SEQUEL(UPDATE EMP 
               SET SAL=NEWSAL 
               WHERE CURRENT TUPLE OF CURSOR C);

3.5 Clustering images

Clustering images, as we have explained, are images (indexes) that can be used to physically store the data in the same order as it is indexed. At most one image per relation can have the clustering property. The reason is simple: it is not possible to store the same data physically in two different orders.

3.6 Optimizer

The objective of the optimizer is to find a low-cost means of executing a SQL statement, given the data structures and access paths available. For this, it attempts to minimize the expected number of pages to be fetched from disk into the RSS buffers. The cost of CPU instructions is also taken into account by means of an adjustable coefficient, H, which is multiplied by the number of tuple comparison operations to convert them into equivalent page accesses. H is useful to adjust the metric for compute-bound or disk-access-bound systems.

The optimizer follows these steps when it receives a SQL statement:

  1. Classify the SQL statement into one of several statement types.

  2. Examine the system catalogs to find the set of images and links which are pertinent to the given statement.

  3. A rough decision procedure is executed to find the set of reasonable methods of executing the statement.

  4. If there is more than one reasonable method, the expected cost formula is evaluated for each method, and the minimizing method is chosen.

The following parameters, available in the system catalogues, are taken into account:

  1. relation cardinality: number of tuples

  2. number of pages occupied by the relation

  3. average number of tuples per page: T = R / D, where R is the relation cardinality and D the number of pages occupied by the relation.

  4. image cardinality: number of distinct sort field values in a given image.

  5. coefficient of CPU cost: H, where 1/H is the number of tuple comparisons which are considered equivalent in cost to one disk page access.

An image matches a predicate if the sort field of the image is the field which is tested by the predicate.

3.6.1 Simple query optimization

In the case of a simple query on a single relation, the optimizer compares the available images with the predicates of the query, in order to determine which of the following eight methods are available:

  1. Use a clustering image which matches a predicate whose comparison operator is '='. The expected cost, C, is

     C = R / (T × I),

     that is: out of I distinct values we want one, so we need to retrieve R/I tuples on average, and these fit in R/(T × I) pages.

  2. Use a clustering image which matches a predicate whose comparison operator is not '='. Assuming half the tuples satisfy the predicate, we have

     C = R / (T × 2).

     The idea is the same as before, but now we assume we retrieve R/2 tuples on average.

  3. Use a non-clustering image which matches a predicate whose comparison operator is '='. In this case, we have

     C = R / I,

     because now we might find only one matching tuple per page.

  4. Use a non-clustering image which matches a predicate whose comparison operator is not '='. It is

     C = R / 2.

  5. Use a clustering image which does not match any predicate. We would scan the image and test each tuple against all predicates. The expected cost is

     C = R / T + H × R × N,

     where N is the number of predicates. So, we recover R tuples, distributed over R/T pages. In addition to this, we need to perform R × N comparisons (N predicates per tuple), which are weighted by the coefficient of CPU cost, H.

  6. Use a non-clustering image which does not match any predicate:

     C = R + H × R × N.

  7. Use a relation scan, where this relation is the only one in its segment, and test each tuple against all predicates:

     C = R / T + H × R × N.

  8. Use a relation scan, where there are other relations sharing the segment. The cost is unknown, but is greater than R/T + H × R × N.

The optimizer then chooses a method from this set, according to the following rules:

  1. If Method 1 is available, it is chosen.

  2. If exactly one among Methods 2, 3, 5 and 7 is available, it is chosen. If more than one method is available in this class, the expected cost formulas for these methods are evaluated and the method of minimum cost is chosen.

  3. If none of the above methods are available, the optimizer chooses Method 4, if available.

  4. Else, Method 6, if available.

  5. Else, Method 8.
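As a small illustration (our own sketch, not System R code), the cost formulas above can be evaluated directly; R, D, I, H and N are the catalogue parameters defined earlier, and the snippet simply returns the numeric cost of each applicable method.

def method_costs(R, D, I, H, N):
    # R: relation cardinality, D: pages occupied, I: image cardinality,
    # H: CPU coefficient, N: number of predicates. T = tuples per page.
    T = R / D
    return {
        1: R / (T * I),        # clustering image, '=' predicate
        2: R / (T * 2),        # clustering image, other comparison
        3: R / I,              # non-clustering image, '=' predicate
        4: R / 2,              # non-clustering image, other comparison
        5: R / T + H * R * N,  # clustering image, no matching predicate
        6: R + H * R * N,      # non-clustering image, no matching predicate
        7: R / T + H * R * N,  # relation scan, relation alone in its segment
    }

costs = method_costs(R=10000, D=500, I=100, H=0.01, N=2)
print(min(costs, key=costs.get), costs)  # Method 1 is cheapest when its image exists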

3.6.2 Join query optimization

In the release paper, only 4 methods are explained, although they say the system takes more methods into account.

  1. Use images on join fields: a simultaneous scan of the image on R1.A and the image on R2.A. The idea is to keep two pointers and advance them in a coordinated way, using the fact that images are ordered, to find matches.

  2. Sort both relations: R1 and R2 are read in the order of their clustering images and two files, F1 and F2, are created, sorted on field A. The resulting sorted files are scanned simultaneously and the join is performed.

  3. Multiple passes: R1 is scanned, storing the pertinent fields into a main memory data structure, W. If space in main memory is available to insert a subtuple, S, it is inserted. If there is no space and S.A is less than the current highest value of A in W, S is discarded. After completing the scan of R1, R2 is scanned using its clustering image and a tuple S′ of R2 is obtained. Then, W is checked for the presence of S′.A. If present, S′ is joined to the appropriate subtuple in W. This process continues until all tuples of R2 have been examined. If any R1 subtuples were discarded, another scan of R1 is made to form a new W consisting of subtuples with A value greater than the current highest. R2 is scanned again and the process is repeated.

  4. The TID algorithm. Basically, it works as follows:

    1. Obtain the TIDs of the tuples from R1 which satisfy the additional restrictions to the join. Sort them and store the TIDs in a file F1. Do the same with R2, storing the TIDs in F2.

    2. Perform a simultaneous scan over the images on R1.A and R2.A, finding the TID pairs of tuples whose values for A match.

    3. Check each pair (TID1, TID2) to see if TID1 is present in F1 and TID2 is present in F2. If they are, the tuples are fetched and joined.

A method cannot be applied unless the appropriate access paths are available. The performance of a method depends strongly on the clustering of the relations with respect to the access paths. In the paper, four situations are presented in which the optimizer would decide between the four methods, but the authors defer the detailed cost formulas to a later paper:

  1. There are clustering images on both R1.A and R2.A, but no images on R1.B or R2.C, which are additional conditions. Method 1 is always chosen.

  2. There are non-clustering images on R1.A and R2.A, but no images on R1.B or R2.C. Method 3 is chosen if W fits into the main memory buffer at once. Otherwise, Method 2 is chosen.

  3. There are clustering images on R1.A and R2.A and non-clustering images on R1.B or R2.C. Method 4 is always chosen.

  4. There are non-clustering images on R1.A, R2.A, R1.B and R2.C. Method 3 is chosen if W fits into the main memory buffer. Otherwise, Method 2 is chosen if more than one tuple per disk page is expected to satisfy the restriction predicates. In other cases, Method 4 is chosen.

3.6.3 Optimized Packages

After analyzing a SQL statement, the optimizer produces an Optimized Package (OP) containing the parse tree and a plan for executing the statement.

3.7 PostgreSQL in relation to System R

Here, we are going to examine how PostgreSQL is similar to or differs from the characteristics of System R:

4 Query Optimization

As we have seen until now, there are alternative ways to evaluate a given query: there are equivalent RA expressions for the same query, and also there are different methods that can physically execute a given query.

An evaluation plan defines exactly what algorithm is used for each operation and how the execution of the operations is coordinated.

4.1 Cost-based query optimization

Cost differences between evaluation plans for a query can be enormous. The general steps in cost-based query optimization are as in System R:

  1. Generate logically equivalent expressions using equivalence rules.

  2. Annotate resultant expressions to get alternative query plans.

  3. Choose the cheapest plan based on the estimated cost.

The estimation of the cost is based on statistical information about the input relations (such as the number of tuples and the sizes of the tuples) and on cost formulae for each operator, as discussed in Section 4.5.

4.2 Viewing query evaluation plans

Most databases support the EXPLAIN <QUERY> statement, which displays the plan chosen by the optimizer, along with the cost estimates it used to make the decision.

Some databases also support EXPLAIN ANALYSE <QUERY>, which shows actual runtime statistics found by running the query, in addition to showing the plan.

Some databases show the cost as

f..l

where f is the cost of delivering the first tuple and l is the cost of delivering all results.

4.3 Generating equivalent expressions

Definition 4.1. Two relational algebra expressions are equivalent if the two expressions generate the same set/bag of tuples on every legal database instance.

An equivalence rule between two expressions ensures that both expressions are equivalent.

Now, we are going to list some equivalence rules:

  1. Conjunctive selection can be deconstructed into a sequence of individual selections:

    σP1∧P2 (E) ≡ σP1 (σP2 (E)).

  2. Selection is commutative:

    σP1 (σP2 (E )) ≡ σP2 (σP1 (E)).

  3. In a sequence of projections, where L1 ⊆ L2 ⊆ ... ⊆ Ln, only the outermost one is needed:

    ΠL1(ΠL2(...(ΠLn(E)))) ≡ ΠL1(E).

  4. Selections can be combined with cartesian products and theta joins:

    σP(E1 × E2) ≡ E1 ⊳⊲P E2,

    σP1 (E1 ⊳⊲P2 E2 ) ≡ E1 ⊳⊲P1∧P2 E2.

  5. Theta join operations are commutative

    E1 ⊳⊲ E2 ≡ E2 ⊳⊲ E1.

  6. And they are associative, in a soft manner:

    1. The natural join is associative:

      (E1 ⊳⊲ E2 ) ⊳⊲ E3 ≡ E1 ⊳⊲ (E2 ⊳⊲ E3 ).

    2. The theta join is associative in a weaker sense:

      (E1 ⊳⊲P1 E2) ⊳⊲P2∧P3 E3 ≡ E1 ⊳⊲P1∧P3 (E2 ⊳⊲P2 E3),

      where P2 involves only attributes of E2 and E3, while P3 may involve attributes of all three relations.

    When we can decide the order of the joins, we would choose to perform the smaller join first, so that we compute and store a smaller temporary relation.

  7. The selection operation distributes over the theta join operation in the following two situations:

    1. When all the attributes in P0 involve only the attributes of one of the expressions being joined:

      σP0 (E1 ⊳⊲P E2 ) ≡ σP0 (E1) ⊳⊲P E2.

    2. When P1 involves only the attributes of E1 and P2 involves only the attributes of E2:

      σP1∧P2 (E1 ⊳⊲P E2) ≡ σP1 (E1) ⊳⊲P σP2 (E2).

  8. The projection operation distributes over the theta join operation as follows: if P involves only attributes from L1 ∪ L2:

    ΠL1∪L2 (E1 ⊳⊲P E2) ≡ ΠL1 (E1) ⊳⊲P ΠL2 (E2) .

    Similar equivalences hold for outerjoin operations.

  9. Union and intersection are commutative:

    E1 ∪ E2 ≡ E2 ∪E1,

    E1 ∩ E2 ≡ E2 ∩E1.

  10. Union and intersection are associative:

    (E1 ∪ E2)∪ E3 ≡ E1 ∪ (E2 ∪ E3) ,

    (E1 ∩ E2)∩ E3 ≡ E1 ∩ (E2 ∩ E3) .

  11. The selection operation distributes over ∪, ∩ and -:

    σP (E1 ∪ E2) ≡ σP (E1)∪ σP (E2),

    σP (E1 ∩ E2) ≡ σP (E1)∩ σP (E2),

    σP (E1 - E2) ≡ σP (E1)- σP (E2),

    σP (E1 ∩E2 ) ≡ σP (E1 )∩E2,

    σP (E1 - E2 ) ≡ σP (E1 )- E2.

  12. The projection operation distributes over union:

    ΠL (E1 ∪ E2) ≡ ΠL(E1) ∪ΠL (E2).

Example 4.1. Pushing selections.

Query: find the names of all instructors in the Music department, along with the title of the courses that they teach.

A first RA expression could be the following:

Πname,title(σdpt-name= ′Music′(Instructor ⊳⊲ (Teaches ⊳⊲ Πcourse id,title(Course)))).

It can be transformed using rule 7a:

Πname,title((σdpt-name=′Music′(Instructor)) ⊳⊲ (Teaches) ⊳⊲ Πcourse id,title(Course)).

The advantage of doing this is that by performing the selection as early as possible we are reducing the size of the relation to be joined.

Example 4.2. Pushing projections.

We start with the RA expression:

Πname,title((σdpt name=′Music′(Instructor) ⊳⊲ Teaches) ⊳⊲ Πcourse id,title(Course)).

When we compute

σdpt name=′Music′ (Instructor) ⊳⊲ Teaches,

we obtain a relation with schema (ID, name, dpt-name, salary, course-id, sec-id, semester, year). Equivalence rule 8 allows us to push projections, eliminating unneeded attributes from intermediate results, to get:

Πname,title((Πname,course id(σdpt-name= ′Music′ (Instructor) ⊳⊲ Teaches)) ⊳⊲ Πcourse id,title (Course )).

This is useful because performing the projection as early as possible reduces the size of the relation to be joined. Note that course_id needs to be projected because it is needed for the join.

Example 4.3. Join ordering.

Consider the expression

Πname,title((σdpt name=′Music′(Instructor) ⊳⊲ Teaches) ⊳⊲ Πcourse id,title(Course)).

In this case, we could compute

Teaches ⊳⊲ Πcourse-id,title(Course)

first, and then join the result with the left relation. The problem with this approach is that the result of this join is likely to be large, as only a small fraction of the university's instructors are from the Music department. So it is better to leave the query as is.

4.4 Enumeration of equivalent expressions

Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression, which is a first translation of the query.

All the equivalent expressions can be generated with the following approach:

REPEAT 
       APPLY all applicable equivalence rules 
               ON every subexpression of every equivalent expression found so far 
       ADD newly generated expressions 
               TO the set of equivalent expressions 
UNTIL no new equivalent expressions are generated
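A sketch of this fixpoint loop in Python (our own illustration, with expressions represented as plain strings and a single toy rewrite rule standing in for the full rule set):

def enumerate_equivalent(start, rules):
    # Repeatedly apply every rule to every known expression until no new ones appear.
    found = {start}
    while True:
        new = {rule(e) for e in found for rule in rules} - found
        if not new:
            return found
        found |= new

# Toy rule: commutativity of a single binary join written as 'A x B'.
def commute(expr):
    left, _, right = expr.partition(" x ")
    return f"{right} x {left}" if right else expr

print(enumerate_equivalent("R1 x R2", [commute]))  # {'R1 x R2', 'R2 x R1'}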

4.5 Cost estimation

The optimizer takes into account the cost of each operator and the statistics of the input relations, such as the number of tuples and the sizes of the tuples. Also, inputs can be results of sub-expressions, so we need to estimate statistics of these results. For this purpose, more statistics, such as the number of distinct values for an attribute, are used.

4.6 Choice of execution plan

Once we have generated different equivalent expressions, we need to decide which one to use to execute the query and get the results. For this, we must consider the interaction of evaluation techniques, because choosing the cheapest algorithm for each operation while disregarding the others may not yield the best overall plan. For example, a merge join may be costlier than a hash join, but it may provide a sorted output, which could reduce the cost of an outer-level aggregation.

Practical query optimizers incorporate elements of two broad approaches:

  1. Search all the plans and choose the best plan in a cost-based fashion.

  2. Use heuristics to choose a plan.

4.6.1 Best join-order problem

Problem: find the best join-order for

R1 ⊳⊲ R2 ⊳⊲ ... ⊳⊲ Rn.

A first idea could be to check all possibilities and choose the cheapest one. But...

Proposition 4.1. For the best join-order problem, with n relations involved, there are

(2(n - 1))! / (n - 1)!

different possible join orders.

Proof. First, we need to count all possible orderings, i.e. the number of permutations, which is known to be n!. Now, we have to count all possible ways to place the n - 1 needed pairs of parentheses. This is known to be the Catalan number

Cn-1 = (2(n - 1))! / (n!(n - 1)!).

Thus, the total amount is

n!·Cn-1 = n! × (2(n - 1))! / (n!(n - 1)!) = (2(n - 1))! / (n - 1)!. □
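To get a feel for how fast this count grows, the formula can be evaluated directly (a quick check in Python):

from math import factorial

def join_orders(n):
    # Number of complete join orders for n relations: (2(n-1))! / (n-1)!
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (3, 5, 10):
    print(n, join_orders(n))  # 3 -> 12, 5 -> 1680, 10 -> 17643225600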

This number is huge, and it is unfeasible to check the whole search space. Thus, a different approach is needed.

Dynamic programming approach

Using dynamic programming, the least-cost join for any subset of {R1,...,Rn } is computed only once and stored for future use. The algorithm works as follows:

The pseudocode is shown in Algorithm 1.


if (bestplan[S].cost != infty) 
       return bestplan[S] 
// else it has not been computed yet 
if (S contains only 1 relation) 
       set bestplan[S].plan and bestplan[S].cost 
               based on the best way to access S 
else for each non-empty proper subset S1 of S 
       P1 = findbestplan(S1) 
       P2 = findbestplan(S - S1) 
       A = best algorithm for joining P1 and P2 
       cost = P1.cost + P2.cost + A.cost 
 
       if cost < bestplan[S].cost 
               bestplan[S].cost = cost 
               bestplan[S].plan = "execute P1.plan; execute P2.plan; join the results using A" 
return bestplan[S]
Algorithm 1: procedure findbestplan(S)

The time complexity of this algorithm is O(3^n) and the space complexity is O(2^n). This is a huge gain with respect to checking the whole search space, but it is still a very high cost.
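A compact Python sketch of this memoized search (our own; it assumes user-supplied scan_cost and join_cost estimators, adds costs up, and ignores physical details such as interesting sort orders):

from functools import lru_cache

def find_best_plan(relations, scan_cost, join_cost):
    # relations: iterable of relation names; *_cost: cost estimators supplied by the caller.
    @lru_cache(maxsize=None)
    def best(subset):
        if len(subset) == 1:
            return scan_cost(subset[0]), subset[0]
        best_cost, best_plan = float("inf"), None
        # Try every non-empty proper subset S1 of S, joined with S - S1.
        for mask in range(1, 2 ** len(subset) - 1):
            s1 = tuple(r for i, r in enumerate(subset) if mask >> i & 1)
            s2 = tuple(r for i, r in enumerate(subset) if not mask >> i & 1)
            c1, p1 = best(s1)
            c2, p2 = best(s2)
            cost = c1 + c2 + join_cost(s1, s2)
            if cost < best_cost:
                best_cost, best_plan = cost, (p1, p2)
        return best_cost, best_plan
    return best(tuple(sorted(relations)))

# Toy cost model: every scan costs 10, a join costs the product of the input sizes.
sizes = {"R1": 100, "R2": 10, "R3": 1000}
size = lambda rels: sum(sizes[r] for r in rels)
print(find_best_plan(sizes, lambda r: 10.0, lambda a, b: size(a) * size(b)))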

Left-deep join trees

In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join. In Figure 3 we can see an example of what is a left-deep join tree, and what is not.


Figure 3: A left-deep join tree, ((((R1 ⊳⊲ R2) ⊳⊲ R3) ⊳⊲ R4) ⊳⊲ R5) (top), and a join tree that is not left-deep, ((R1 ⊳⊲ R2) ⊳⊲ R3) ⊳⊲ (R4 ⊳⊲ R5) (bottom).


With this structure, we can reduce the cost of the optimization problem.

For a set of n relations, we can consider n alternatives with one relation as right-hand-side input and the other relations as left-hand-side input.

The time complexity of finding the best join order is in this case O(n·2^n), and the space complexity remains the same.

At this point, one might think that it is of no use to bother optimizing the join order if it is so costly but, in reality, typical queries have a small n, usually less than 10, and a good ordering can change a query from being unfeasible to being executed in an acceptable time.

Heuristic optimization

As we have seen, cost-based optimization is expensive, even using dynamic programming.

Heuristic optimization transforms the query-tree by using a set of rules that typically improve execution performance, such as performing selections as early as possible, performing projections early, and performing the most restrictive selections and joins before other similar operations.

Some systems use only heuristics, while others combine the two approaches. A frequently used approach is the following:

  1. Heuristic rewriting of nested block structure and aggregation.

  2. A cost-based join-order optimization for each block.

There is usually an optimization cost budget to stop optimization early if the cost of the plan is less than the cost of the optimizations to be made.

Also, it can be useful to implement plan caching to reuse previously computed plans if queries are resubmitted.

It is worth noting that, even with the use of heuristics, cost-based query optimization imposes a substantial overhead on the computation, but it is worthwhile for expensive queries. For this reason, optimizers often use simple heuristics for cheap queries, and perform a more exhaustive enumeration for more expensive queries.

5 Statistics for cost estimation

Statistics of relations are of great importance for improving the performance of the system, because they allow the optimizer to estimate costs more accurately.

Some statistical information that is used: nr, the number of tuples of a relation r; lr, the size of a tuple of r; and V(A, r), the number of distinct values of attribute A appearing in r.

5.1 Histograms

Histograms are useful for cost estimation. Histograms can be of two types: equi-width, where the range of values is divided into subranges of equal size, and equi-depth, where the subranges are chosen so that each contains (roughly) the same number of values.

Many databases store the n most frequent values and their counts, and construct a histogram for the remaining values. Usually, histograms are computed not on all the actual values, but on a sample of them.

This sampling approach makes it possible for the statistics to become outdated, so they need to be recomputed from time to time.
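As an illustration (our own sketch), an equi-width histogram can be built from a sample of attribute values and then used to estimate how many tuples satisfy an equality predicate, assuming values inside a bucket are uniformly distributed:

def equi_width_histogram(values, buckets, lo, hi):
    # Count how many of the sampled values fall into each equal-width bucket.
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        counts[min(int((v - lo) / width), buckets - 1)] += 1
    return counts, width

def estimate_equal(counts, width, lo, value, distinct_per_bucket):
    # Uniformity assumption: the bucket count is spread evenly over its distinct values.
    b = min(int((value - lo) / width), len(counts) - 1)
    return counts[b] / distinct_per_bucket

counts, width = equi_width_histogram([1, 2, 2, 3, 8, 9, 9, 9], buckets=2, lo=0, hi=10)
print(counts)                                  # [4, 4]
print(estimate_equal(counts, width, 0, 9, 5))  # 0.8 tuples expected with value 9 (from the sample)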

5.2 Estimation of selection size

We want to estimate the size of a selection, but this depends on the condition to be fulfilled:

For more complex conditions, we need a new definition:

Definition 5.1. The selectivity of a condition P is the probability that a tuple in the relation r satisfies P. If sP is the number of tuples satisfying the condition, then the selectivity is given by

SP = sP / nr.

5.3 Estimation of the size of joins

The cartesian product of two relations r1 and r2 contains exactly nr1 ×nr2 tuples, and each tuple occupies lr1 + lr2 bytes.

About the join: if the join attribute A is a key of r1, each tuple of r2 joins with at most one tuple of r1, so the result has at most nr2 tuples; if A is moreover a foreign key of r2 referencing r1, the result has exactly nr2 tuples; in the general case, the size can be estimated as nr1·nr2 / max(V(A, r1), V(A, r2)).

Example 5.1. Estimating the size of a join.

Let's perform

student ⊳⊲ takes

with the following information: nstudent = 5000, ntakes = 10000, V(ID, student) = 5000 and V(ID, takes) = 2500.

The most accurate estimation in this case is the one using the fact that ID is a foreign key of takes referencing student, which implies that the number of tuples in the result is the same as the number of tuples of the referencing relation, takes, so

c = ntakes = 10000.

Let’s nonetheless compute an estimate disregarding this:

c1 = (5000 · 10000) / 2500 = 20000,

c2 = (5000 · 10000) / 5000 = 10000.

We choose the lower estimate, c2, which in this case is the same as the one we chose before! But this is no surprise, this will always happen with foreign keys, because V (A,r1) = nr1 if A is key, so we would have

c2 = (nr1 · nr2) / nr1 = nr2,

which is the same value that we get using the other estimation. Also, V (A,r2) will be at most nr1, because it is a foreign key, so it cannot have more values than the referenced attribute!
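The same computation as a small Python helper (our own), using the estimate nr1·nr2 / max(V(A, r1), V(A, r2)), i.e. taking the lower of the two candidate estimates:

def join_size_estimate(n1, n2, v1, v2):
    # n1, n2: cardinalities; v1, v2: distinct values of the join attribute in each relation.
    # Dividing by the larger V gives the lower (and usually safer) estimate.
    return (n1 * n2) // max(v1, v2)

# student |><| takes, with V(ID, student) = 5000 and V(ID, takes) = 2500.
print(join_size_estimate(5000, 10000, 5000, 2500))  # 10000, matching the foreign-key estimate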

Part III
Indexing

6 Conventional indexes

Definition 6.1. An index is a data structure that facilitates the retrieval of data. The idea is to maintain pointers to the specific locations where some data is stored.

As we know, the disk can be logically seen as a sequence of pages of a certain size. Every file that we store in a computer must be stored in one or more pages. Now, imagine we want to retrieve a file’s content. For this, we need to fetch the data from where it is stored. If we don’t use indexes, we would need to sequentially traverse the disk until we find the desired file.

Example 6.1. Imagine a simplified setup with a disk of N pages and with files that occupy one page. If we need to recover a specific file that is stored on disk, without further information, it would take an average of N/2 page fetches.

An index can be used to mitigate this cost. There are multiple types of indexes, but the simplest form of an index is just a map in which each file identifier is associated with the address of its first byte on disk. This way, knowing only which file we want to recover, we can access it directly using the index.

Example 6.2. In the previous setup, imagine we store an index in the first page. In this scenario, to recover a specific file we need to fetch the first page, look up the index to get the address of our file, and directly fetch the correct page. In total, we would fetch 2 pages.

Definition 6.2. A dense index is an index that maintains one pointer per key.

Properties of dense indexes:

Example 6.3. A dense index.

A dense index on a sequential file looks like this:

PIC

And on a non-sequential file like this:

PIC

If we want to retrieve the segment with key=30, we would do a binary search in the index and then get the disk address of this segment.

If we are asked to retrieve a segment with a key that is not in the index, we would directly return an error after searching for it and not finding it, because we are sure there is no segment with that key in the disk. For example, there is no segment with key 25, and so there is no entry in the index with key 25.

Definition 6.3. A sparse index maintains a pointer per page/block. This means that only the first key of each block is indexed.

Properties of sparse index:

Example 6.4. Sparse index.

A sparse index looks like this:

PIC

If we want to retrieve the segment with key=30, the procedure is exactly as with a dense index.

If we want to retrieve the segment with key=40, then we would search for 30 (the biggest key smaller than 40), we would go to where it is. Then, we would advance until we found the segment with key=40.

If we want to retrieve the segment with key=25, we would search for 10, we would go to where it is. Then, we would advance until we reached the end of the page, and we would return an error, because now we are sure there is no segment with key=25.
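A minimal Python sketch (our own) of the sparse-index lookups just described, with the index kept as a sorted list of (first key in page, page number) pairs and each page as a sorted list of keys:

from bisect import bisect_right

def sparse_lookup(index, pages, key):
    # Binary-search the largest indexed key <= key, then scan that page.
    pos = bisect_right([k for k, _ in index], key) - 1
    if pos < 0:
        return None
    page_no = index[pos][1]
    return page_no if key in pages[page_no] else None  # None means 'no such key'

pages = {0: [10, 20], 1: [30, 40], 2: [50, 60]}
index = [(10, 0), (30, 1), (50, 2)]
print(sparse_lookup(index, pages, 40))  # 1: found on page 1
print(sparse_lookup(index, pages, 25))  # None: key 25 does not exist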

6.1 Sparse second level index

When an index becomes very big, searches start to slow down, because the binary search needs to be done over a bigger index, which can even occupy several pages that would need to be fetched.

When this happens, a possible way to speed up the search is to add a sparse index that points to the already defined index.

Example 6.5. A second level sparse index.

In this diagram we can see a second level sparse index on a sparse index:

PIC

If we wanted to retrieve the segment with key=30, we would check the second-level sparse index to get the page where the first-level index should have this key stored. We would go to this page, find key=30 in the index, and follow the indicated address.

Question: does it make sense to use a second level dense index?

No, it would be a copy of the first level index, so we would not gain anything. Second level indexes are sparse. Note, nonetheless, that the first level index can indeed be of any kind.

Question: what is the tradeoff between sparse and dense indexes?

Sparse indexes need less space to be stored, which also allows a bigger part of the index to be kept in memory when we need it. On the other hand, dense indexes can tell whether a record exists without accessing the file.

6.2 How to deal with duplicate keys.

Imagine we have a disk with the following data:

PIC

A naïve solution would be to just use a dense index, where all the keys are listed repeatedly:

PIC

In this case, we are solving the problem of duplicate keys... but we are probably using more space than we wanted. It would be better to have unique keys in the index. Thus, a second approach could be to only store the first appearance of each key, and use it to fetch the data, scanning sequentially until all records with the same key have been retrieved. This is illustrated below:

PIC

Note, nonetheless, that this solution requires the file to be stored sequentially, because after retrieving the second segment with key=10, the only way to reach the third one is to continue the sequential scan.

Now, a third approach is an intermediate approach: we can use a sparse index with duplicate keys, meaning we index the first key in each page:

PIC

In this case, when we search for a key, we always need to go to the biggest indexed key that is smaller than the requested one, even if the requested key is in the index keyset. For example, if in the example above we followed the indexed address for key=20, we would miss the first record with this key.

An improved version of this solution is to index only the first new key in each page:

PIC

Note that in this case all keys are listed once, so we don't need to access the file for nonexistent keys, but sequential I/O is still required.

6.3 How to delete records

6.3.1 Deletion from sparse index with no duplicates

If we want to delete a record, we need to make sure that the index is updated if needed. The steps to delete a record are the following:

  1. To delete key K, do a binary search in the index.

  2. Depending if K is in the index or not:

    1. If K is not in the index, we visit the address of the biggest indexed key that is smaller than K. We advance until we find the record with key K and delete it.

    2. If K is in the index, we visit its address and delete the record. Now, two possibilities arise:

      1. If there are more records in the same page, we shift them up and update the index key.

      2. If there are no more records in the same page, we delete the key from the index and shift the rest of the index keys up.

Example 6.6. Case 2.(a): DELETE 40.

PIC (initial state and final state; red = delete)


Example 6.7. Case 2.(b).i.: DELETE 30.

PIC (initial, second and final state; red = delete, yellow = update)



Example 6.8. Case 2.(b).ii.: DELETE 30 and 40.

PIC (initial, second and final state; red = delete, yellow = update)



6.3.2 Deletion from dense index

In this case, we will always find the keys to delete in the index, so the steps are easier:

  1. To delete key K, do a binary search on the index.

  2. Delete the record in the corresponding page, shifting up the rest of the records so as not to leave holes.

  3. Update the index entries of all shifted records.

Example 6.9. Deletion from dense index: DELETE 30.

PIC (initial, second and final state; red = delete, yellow = update)



6.4 How to insert records

We need to follow these steps:

  1. We want to insert record with key K. First, we do a binary search to see where it should be located.

  2. Now, in the first page where it can be located according to the index, two things can happen:

    1. If there is space for the record, we insert it.

    2. If there is no space for the record, we need to shift the following records down, updating the necessary index entries.

Example 6.10. INSERT 15.

PIC (initial, second and final state; red = delete, yellow = update, green = insert)



6.5 Secondary indexes

Imagine we have an unordered file on disk, which we would like to be able to traverse in order without incurring great costs. If we tried to do this by sequentially scanning the disk, we would need to fetch each page several times, which would be highly inefficient, so we can think of using indexes to solve this problem.

As we have seen, sparse indexes cannot be used with unordered files (some records would be lost), so our only option here is an ordered dense index that enables us to retrieve each record in the desired order. Now, as we are indexing the whole file with a dense index, the index is likely to be huge, so it seems convenient to add a second-level sparse index to speed things up even more.

This is a secondary index:

Definition 6.4. A secondary index is an N-level index structure composed of a first-level dense index, with all remaining levels sparse. All these indexes are ordered to allow binary search, so that non-sequentially stored records can be retrieved from disk efficiently.

6.5.1 Duplicate values and secondary indexes

Again, duplicate keys pose a problem for secondary indexes. Consider the following setup:

PIC

Here, the solutions we proposed before don't work, because they required sequentially stored files. In this case, again, the naïve solution is a dense index:

PIC

The problem with this solution is that it causes excessive overhead, both in disk space (we store the same keys repeatedly) and in search time (because the index keyset needs to be accessed several times per key). An alternative is to store each key only once and associate it with a list of pointers:

PIC

But this has the problem that the index entries can have different sizes, which makes the search more difficult.

Another idea is to use buckets of pointers:

PIC

This structure is very helpful in some situations, for example when we want to get data satisfying conditions that involve several indexed fields.

Example 6.11. Imagine the relation EMP(name,dept,floor), with a primary index on name and two secondary indexes with bucket structure on dept and floor.

Now, let's say we want to retrieve all employees in the department 'Toy' and on floor 2. Our structure makes this query very easy:

PIC

As we can see, it is possible to use both indexes and then fetch only those records that are returned by both of them!
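
A minimal Python sketch of how the two bucket-based secondary indexes could be combined; the relation contents, bucket entries and record ids below are made up for illustration.

# Hypothetical EMP records addressed by record id (rid).
emp = {
    1: ("Anna",  "Toy",   1),
    2: ("Bob",   "Toy",   2),
    3: ("Carol", "Sales", 2),
    4: ("Dave",  "Toy",   2),
}

# Bucket-based secondary indexes: key -> set of rids.
dept_index  = {"Toy": {1, 2, 4}, "Sales": {3}}
floor_index = {1: {1}, 2: {2, 3, 4}}

# Employees in dept 'Toy' AND on floor 2: intersect the two buckets,
# then fetch only the records returned by both indexes.
rids = dept_index["Toy"] & floor_index[2]
print(sorted(emp[rid] for rid in rids))
# [('Bob', 'Toy', 2), ('Dave', 'Toy', 2)]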

7 B-Trees

This section is adapted from [2].

B-trees automatically maintain as many levels of index as is appropriate for the size of the file being indexed and manage the space on the blocks they use so that every block is between half used and completely full.

A B-tree organizes its blocks into a tree that is balanced, meaning that all paths from the root to a leaf have the same length. A BTree can be visualized in Figure 4.


PIC

Figure 4: A BTree. Source: [2].


There is a parameter n associated with each B-tree index, and this parameter determines the layout of all blocks of the B-tree. Each block will have space for n search-key values and n + 1 pointers.

We pick n to be as large as will allow n + 1 pointers and n keys to fit in one block.

Example 7.1. Suppose our blocks are 4096 bytes. Also let keys be integers of 4 bytes and let pointers be 8 bytes. If there is no header information kept on the blocks, then we want to find the largest integer value of n such that 4n + 8(n + 1) < 4096. That value is n = 340.
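
The same computation, as a short Python sketch (the block, key and pointer sizes are the ones assumed in the example):

block_size, key_size, ptr_size = 4096, 4, 8

# Largest n such that n keys and n+1 pointers fit in one block:
# key_size*n + ptr_size*(n+1) <= block_size
n = (block_size - ptr_size) // (key_size + ptr_size)
print(n)  # 340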

There are several important rules about what can appear in the blocks of a B-tree:

In Figure 4, the chosen n is 3.

7.1 Lookup in BTree

Suppose we want to find a record with key K. The procedure is:

Example 7.2. Search K = 29.

PIC

The path to be followed is colored green:

  1. In the root node, K1 < K and there are no more keys, so choose i = 2.

  2. In the next node, we find K1 < K < K2, so choose i = 2.

  3. In the leaf node, we find K2 = K, so choose i = 2.
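
The descent in Example 7.2 can be sketched in Python as follows; the node structure is illustrative, not the course's implementation, and it assumes pointer i of an interior node leads to keys K with keys[i-1] <= K < keys[i].

import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys              # sorted keys in this node
        self.children = children      # None for leaf nodes

def lookup(node, key):
    """Return True if key is in the subtree rooted at node."""
    if node.children is None:                    # leaf: look for the key itself
        return key in node.keys
    # Interior node: follow the pointer chosen by the routing keys.
    i = bisect.bisect_right(node.keys, key)
    return lookup(node.children[i], key)

leaves = [Node([7, 11]), Node([23, 29]), Node([31, 37])]
root = Node([23, 31], children=leaves)
print(lookup(root, 29))  # True
print(lookup(root, 25))  # False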

7.2 Range queries

BTrees allow for a very efficient way to process range queries, that is, to retrieve all records with keys lying in a given range [Kmin, Kmax]. The procedure is:

  1. Perform a lookup for Kmin; whether it is found or not, we will reach the correct leaf node.

  2. We traverse the leaf nodes from there until we find a key bigger than Kmax.

Example 7.3. Search for range [12,40].

PIC

All the nodes that would be accepted into the range query are colored green. The order is top-down and, at the leaf level, left to right.

7.3 Insertion into a BTree

The insertion is, in principle, recursive:

Example 7.4. Insert K = 40.

First, we lookup for the place where the record should be inserted.

PIC

As there is not enough space, we need to split the node.

PIC

Now, the key of the new node needs to be inserted into the parent node. But the parent is also full, so it needs to be split, too.

PIC

In the root node, the smallest key reachable from the new node needs to be inserted. In this case, this is the newly inserted key.

7.4 Deletion from a BTree

The steps to delete record with key K are:

  1. Lookup for the record.

  2. Delete the record from the data.

  3. Delete the key-pointer pair from the BTree.

  4. If the node from which we deleted still has the minimum number of pointers, we are done. But it is possible that the node is less occupied than the minimum required after the deletion. In that case, we need to do one of two things:

    1. If one of the adjacent siblings of node N has more than the minimum number of keys and pointers, then one key-pointer pair can be moved to N, keeping the order of keys intact. Possibly, the keys at the parent of N must be adjusted to reflect the new situation.

    2. The hard case is when neither adjacent sibling can be used to provide an extra key for N. However, in that case, we have two adjacent nodes, N and a sibling M; the latter has the minimum number of keys and the former has fewer than the minimum. Therefore, together they have no more keys and pointers than are allowed in a single node. We merge these two nodes, effectively deleting one of them. We need to adjust the keys at the parent, and then delete a key and pointer at the parent. If the parent is still full enough, then we are done. If not, then we recursively apply the deletion algorithm at the parent. This process is called coalesce siblings.

Example 7.5. Delete K = 7.

PIC

First, we find the correct node and delete the record.

Now, the node we deleted from only has one pointer left, so we need to fix this. As its left sibling has 3 pointers, we can transfer the biggest one.

PIC

Also, we need to update the parent node.

Example 7.6. Now, delete K = 11.

PIC

First, we locate the correct node and delete the record. Again, the node ends up with fewer pointers than it should, but now the left sibling does not have more than the minimum number of pointers (it has exactly the minimum) and the node has no right sibling under the same parent (the node to its right belongs to another parent), so we need to merge the two siblings.

PIC

Part IV
Physical Query Plans

8 Physical Query Plans

We saw the steps to process a query: first, the query is translated into an RA expression, which can then be modified using equivalence rules to get different expressions that lead to the same result. Then, we need to estimate the cost of each expression and take the one with the minimum expected cost. For this, we also need to decide between several ways to access the data, e.g., whether to use an index or not, whether to sort the data or not, etc. This is called physical query planning.

There are several ways to measure cost, but we are going to be using the number of disk blocks that must be read or written to execute a query plan.

We will also use different parameters: T(R), the number of tuples in R; B(R), the number of blocks R occupies; S(R), the size of a tuple of R (in blocks); and M, the number of main-memory buffer blocks available.

8.1 Computing joins

A join operation can be computed in several ways, depending on the options available.

The simplest and most costly option is an iteration join, which just performs a double loop over the two relations. The pseudocode can be read in Algorithm 2.


for each r in R1 do 
       for each s in R2 do 
               if r.A = s.A then 
                       output (r,s)
Algorithm 2: Iteration Join

The merge join consists in first sorting the relations if they are not sorted, and then scanning them making use of the fact that they are ordered using the same attribute. The pseudocode can be read in Algorithm 3.


if R1 not sorted on attribute A then 
       sort R1 
if R2 not sorted on attribute A then 
       sort R2 
 
i=1, j=1 
while i<=T(R1) and j<=T(R2) do 
       if R1[i].A = R2[j].A then 
               k=j 
               while R1[i].A = R2[k].A do 
                       output (R1[i],R2[k]) 
                       k += 1 
               i += 1 
       else if R1[i].A > R2[j].A then 
               j += 1 
       else if R1[i].A < R2[j].A then 
               i += 1
Algorithm 3: Merge Join
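
A minimal, runnable Python sketch of the same idea; lists of tuples already sorted on their first field stand in for the relations, and all names are illustrative.

def merge_join(r1, r2):
    """Merge join of two relations sorted on their first attribute."""
    out, i, j = [], 0, 0
    while i < len(r1) and j < len(r2):
        a1, a2 = r1[i][0], r2[j][0]
        if a1 == a2:
            # Output r1[i] combined with every r2 tuple sharing the same value.
            k = j
            while k < len(r2) and r2[k][0] == a1:
                out.append(r1[i] + r2[k])
                k += 1
            i += 1            # keep j fixed: the next r1 tuple may match too
        elif a1 > a2:
            j += 1
        else:
            i += 1
    return out

r1 = [(1, "a"), (2, "b"), (2, "c")]
r2 = [(2, "x"), (3, "y")]
print(merge_join(r1, r2))   # [(2, 'b', 2, 'x'), (2, 'c', 2, 'x')]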

The index join uses an index defined on the joining attribute on one of the relations. The pseudocode can be read in Algorithm 4.


for each r in R1 do 
       X <- index(R2.A, r.A) # search the index on R2.A for tuples with value r.A 
       for each s in X do 
               output (r,s)
Algorithm 4: Index Join

The hash join uses a hash function on the joining attribute. The pseudocode can be read in Algorithm 5.


hash R1 tuples into k buckets G1...Gk 
hash R2 tuples into k buckets H1...Hk 
 
for i=1 to k 
       match tuples that lie in G[i] and H[i]
Algorithm 5: Hash Join (k buckets G1...Gk, H1...Hk)
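
A minimal in-memory Python sketch of the hash-join idea (k buckets per relation, then matching bucket pairs); the bucket count and the data are illustrative assumptions.

def hash_join(r1, r2, k=4):
    """Hash join on the first attribute using k buckets per relation."""
    g = [[] for _ in range(k)]        # buckets G1..Gk for R1
    h = [[] for _ in range(k)]        # buckets H1..Hk for R2
    for t in r1:
        g[hash(t[0]) % k].append(t)
    for t in r2:
        h[hash(t[0]) % k].append(t)
    out = []
    for i in range(k):                # only tuples in matching buckets can join
        for t1 in g[i]:
            for t2 in h[i]:
                if t1[0] == t2[0]:
                    out.append(t1 + t2)
    return out

r1 = [(1, "a"), (2, "b")]
r2 = [(2, "x"), (2, "y"), (3, "z")]
print(hash_join(r1, r2))   # [(2, 'b', 2, 'x'), (2, 'b', 2, 'y')]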

8.2 Factors that affect performance

Are the tuples of the relation stored physically together?

The more compactly the relations are stored, the fewer pages need to be fetched, and so the better the performance.

Are relations sorted by join attribute?

If the relations are already sorted by the join attribute, the merge join is a great option, because the costly part is the sorting (O(n log n)), while the join itself is O(n + m), where n, m are the sizes of the two relations.

Indexes exist?

If there are no indexes, the index join is not even an option. And when indexes do exist, they are not always the best choice: if the index lookups do not filter much (low selectivity), we mostly introduce overhead into the operation.

Example 8.1. Iteration join R1⊳⊲R2 where the relations are not contiguous, T(R1) = 10000 tuples, T(R2) = 5000 tuples, S(R1) = S(R2) = 1/10 block and M = 101 blocks (so we can work with at most 1010 tuples in memory at once). In this case, B(R1) = 10000 and B(R2) = 5000, because each scattered tuple requires its own disk access.

For each tuple in R1, we need to read the tuple itself (1 IO) and then read all of R2 (5000 IOs). So, the total cost is

C = 10000 × (1 + 5000) = 50 010 000 IOs.

Can this be improved? Yes!

If we instead read 1000 tuples of R1 at a time and process each of these chunks, then for each chunk of R1 we read the 1000 tuples of the chunk (1000 IOs) and all of R2 (5000 IOs). So

C = 10 × (1000 + 5000) = 60 000 IOs.

Can this be improved? Yes!

If we reverse the order of the join, R2⊳⊲R1, then for each chunk of 1000 tuples of R2 we read the chunk (1000 IOs) and all of R1 (10000 IOs), and there are only 5 chunks. So

C = 5 × (1000 + 10000) = 55 000 IOs.

In fact, the greater the size difference between R1 and R2, the greater the gain obtained by putting the smaller relation in the outer loop.

Example 8.2. Iteration join R1⊳⊲R2 where relations are contiguous (same parameters). In this case, B(R1) = 1000 and B(R2) = 500.

For each chunk of 100 blocks of R2, we read the chunk (100 IOs) and scan all of R1 (1000 IOs). Thus,

C = 5 × (100 + 1000) = 5 500 IOs.

We can see how contiguous storage greatly increases the performance of joins.

Theorem 8.1. In general, for an iteration join R1⊳⊲R2 with sizes (in blocks) B(R1) and B(R2), and a memory capacity of M blocks, the formula for the cost is

C = (B(R1) / (M - 1)) × (M - 1 + B(R2)).

Proof. We want to take as many blocks of the first relation at a time as we can, but we need to leave one buffer for scanning the second relation, so we use M - 1 blocks to hold the tuples of the first relation. This has to be done B(R1)/(M - 1) times. In each of these iterations, we read the M - 1 blocks from R1 and all the blocks of R2, so the formula arises. __

Example 8.3. Merge join R1⊳⊲R2 where both relations are already ordered by the joining attribute and are contiguous. In this case, we need to read all blocks containing R1 and all blocks containing R2, once. So

C = 1000 + 500 = 1 500 IOs.

So we can see how good a merge join where the relations are already ordered is. Let’s see how the other case performs:

Example 8.4. Merge join R1⊳⊲R2 where R1,R2 are not ordered, but are contiguous.

In this case, we first need to sort the relations. There are different ways to do this; we use a two-pass merge sort: read the relation in memory-sized chunks, sort each chunk and write it back (2 IOs per block), then read the sorted chunks, merge them and write the result (another 2 IOs per block), for a total of 4B(R) IOs per relation.

Now, the cost of a merge join in which the relations are not ordered is the cost of the sorting plus the cost of the join, so

C = 4000 + 2000 + 1500 = 7 500 IOs.

Remember that the iteration cost was 5500 IOs, so in this case the merge join is not the best option.

Theorem 8.2. In general, for a merge join R1⊳⊲R2 with sizes (in blocks) B(R1) and B(R2), where the relations are contiguously stored but unsorted, and a memory capacity of M blocks, the formula for the cost is

C = 5 × (B(R1) + B(R2)).

Proof. The cost is

C = Corder(R1) + Corder(R2) + Cjoin.

We have seen that Corder(R) = 4B(R) and Cjoin(R1,R2) = B(R1) + B(R2), and so the formula arises. __

Example 8.5. Let R1, R2 be contiguously stored but unordered, with B(R1) = 10000 blocks and B(R2) = 5000 blocks. In this case, the iteration join (with R2 in the outer loop) has a cost of

CIJ = (5000/100) × (100 + 10000) = 505 000 IOs,

and the merge join

CMJ = 5 × (10000 + 5000) = 75 000 IOs.

So, in this case the merge join is better, even without the relations being previously ordered.
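
These numbers can be reproduced with a small Python sketch of the two cost formulas (Theorems 8.1 and 8.2); block counts and memory size are the ones used above.

from math import ceil

def iteration_join_cost(b_outer, b_inner, m):
    """IO cost of an iteration join (Theorem 8.1): the outer relation is read
    in chunks of M-1 blocks; the inner relation is scanned once per chunk."""
    return ceil(b_outer / (m - 1)) * (m - 1 + b_inner)

def merge_join_cost(b1, b2):
    """IO cost of a merge join of contiguous, unsorted relations (Theorem 8.2):
    4 IOs per block for the sorting plus one final scan of each relation."""
    return 5 * (b1 + b2)

# Example 8.2 / 8.4: B(R1)=1000, B(R2)=500, M=101 (R2 in the outer loop).
print(iteration_join_cost(500, 1000, 101))    # 5500
print(merge_join_cost(1000, 500))             # 7500
# Example 8.5: B(R1)=10000, B(R2)=5000, M=101.
print(iteration_join_cost(5000, 10000, 101))  # 505000
print(merge_join_cost(10000, 5000))           # 75000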

Theorem 8.3. For a join R1⊳⊲R2 where the relations are contiguously stored and unordered, with sizes (in blocks) B(R1) and B(R2), a memory capacity of M blocks, and assuming that B(R2)/(M - 1) > 4, a merge join is preferred to an iteration join if, and only if,

B(R1) > 5B(R2) / (B(R2)/(M - 1) - 4).

Proof. The merge join is preferred to the iteration join if, and only if,

5(B(R1) + B(R2)) < (B(R1)/(M - 1)) × (M - 1 + B(R2)) = B(R1) + B(R1)B(R2)/(M - 1)

⟺ 4B(R1) + 5B(R2) < B(R1)B(R2)/(M - 1)

⟺ B(R1) [4 - B(R2)/(M - 1)] < -5B(R2)

⟺ B(R1) [B(R2)/(M - 1) - 4] > 5B(R2)

⟺ B(R1) > 5B(R2) / (B(R2)/(M - 1) - 4). __

Example 8.6. Let's apply the theorem to our two previous examples, taking the smaller relation as the one read in chunks (so it plays the role of R1 in the theorem):

  1. R2⊳⊲R1 with B(R1) = 1000, B(R2) = 500, M = 101. The threshold is

     5B(R1) / (B(R1)/(M - 1) - 4) = 5000 / (10 - 4) ≈ 833 > 500 = B(R2),

     so in this case the iteration join is preferred.

  2. R2⊳⊲R1 with B(R1) = 10000, B(R2) = 5000, M = 101. The threshold is

     5B(R1) / (B(R1)/(M - 1) - 4) = 50000 / (100 - 4) ≈ 521 < 5000 = B(R2),

     so in this case the merge join is preferred.

How much memory do we need for the merge sort?

Until now, we have disregarded the memory needed to perform the merge sort, but this is a crucial aspect. If the relation does not fit entirely in memory, it is not straightforward to merge all the ordered chunks to obtain a fully ordered relation.

In general, if we have M blocks of memory and B blocks to sort, we take sorted chunks of size M, so we obtain B/M chunks. To merge them in one pass, this number needs to be at most the memory size:

B/M ≤ M,

or, equivalently,

M² ≥ B, i.e., M ≥ √B.

Example 8.7. Following our examples: R1 is 1000 blocks, so we need M ≥ √1000 ≈ 31.62, and R2 is 500 blocks, so M ≥ √500 ≈ 22.36. In this case, we need M ≥ 32 blocks, which is fine because the memory was M = 101 blocks.

Can the merge join be improved?

Yes, we are imposing that the whole relation needs to be sorted, but maybe we can join the sorted chunks without merging them.

If we did this, we would only need to read, sort and write back the chunks of each relation (2 IOs per block), and then perform the join by reading the sorted chunks of both relations simultaneously (1 more IO per block). So the total cost would be

C = 2B(R1) + 2B(R2) + [B(R1) + B(R2)] = 3[B(R1) + B(R2)].

Example 8.8. Index join R2⊳⊲R1 with an index on R1.A of two levels, R2 contiguously stored and unordered, and assuming the index fits in memory. Then, we need to read all blocks of R2 (500 IOs) and, for each tuple of R2, probe the in-memory index and read the matching R1 tuples. So

C = 500 + matches.

Thus, we need to estimate how many matches there will be. We can treat several cases:

  1. If R1.A is a key attribute and R2.A is a foreign key, each R2 tuple matches exactly one R1 tuple:

     matches(R2, R1) = T(R2).

     In this case the cost is

     C = 500 + 5000 = 5 500 IOs.

  2. If we know V(R1,A) (the number of distinct values of attribute A in R1) and T(R1), we can assume uniformity and obtain

     matches(R2, R1) = (T(R1) / V(R1,A)) × T(R2).

     In this case, assuming V(R1,A) = 5000, the cost is

     C = 500 + (10000/5000) × 5000 = 10 500 IOs.

  3. If we know size(Dom(R1,A)) (the number of distinct values that attribute A can take) and T(R1), we can assume uniformity and obtain

     matches(R2, R1) = (T(R1) / size(Dom(R1,A))) × T(R2).

     In this case, assuming size(Dom(R1,A)) = 1 000 000, the cost is

     C = 500 + (10000/1000000) × 5000 = 550 IOs.
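
A hedged Python sketch of the three estimates; the statistics T(R1), T(R2), V(R1,A) and size(Dom(R1,A)) are the ones assumed in the example, and index lookups are treated as free because the index is in memory.

def index_join_cost(b_r2, t_r2, matches_per_tuple):
    """Estimated IOs: scan R2 once, then read the matching R1 tuples."""
    return b_r2 + t_r2 * matches_per_tuple

t_r1, t_r2, b_r2 = 10000, 5000, 500

# Case 1: R1.A is a key, R2.A a foreign key -> exactly one match per R2 tuple.
print(index_join_cost(b_r2, t_r2, 1))                   # 5500
# Case 2: V(R1,A) = 5000 -> T(R1)/V(R1,A) matches per tuple.
print(index_join_cost(b_r2, t_r2, t_r1 / 5000))         # 10500.0
# Case 3: size(Dom(R1,A)) = 1000000 -> T(R1)/size(Dom) matches per tuple.
print(index_join_cost(b_r2, t_r2, t_r1 / 1000000))      # 550.0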

Example 8.9. Let's see what happens if the index does not fit in memory.

Let the R1.A index occupy 201 blocks (1 root and 200 leaves), so it cannot be fully kept in memory (M = 101). We can store the root node and 99 leaf nodes in memory. Then, for each value to check, there is a 99/200 chance that the corresponding leaf is in memory, and a 101/200 chance that it is not and has to be read. The expected cost of checking a value in the index is therefore

Cindex = 0 × (99/200) + 1 × (101/200) ≈ 0.5 IOs.

Thus, the total cost is

C = 500 + 5000 × [0.5 + 2] = 13 000 IOs.

Here we have assumed case 2 above, so each R2 tuple has 2 matching R1 tuples to read, in addition to the expected 0.5 IOs for probing the index.

Theorem 8.4. In general, for an index join R1⊳⊲A R2 with sizes (in blocks) B(R1) and B(R2) and sizes (in tuples) T(R1) and T(R2), where:

  • There is an index on R1.A whose leaf level occupies Si blocks.

  • R2 is contiguously stored.

  • There is a memory capacity of M blocks.

If the index does not fit in memory, the estimated cost is

C = B(R2) + T(R2) × [(1 - (M - 2)/Si) + matched],

where matched depends on the assumptions about how the values of the index are distributed.

If the index fits in memory, the estimated cost is

C = B(R2) + T(R2) × matched.

Proof. We need to read all blocks of R2, which sums up to B(R2).

Then, for each tuple in R2, we need to check the index and retrieve all matches. Thus, if the index fits in memory, the result is immediate.

If the index does not fit in memory, we keep the root and M - 2 leaf nodes in memory. Thus

Prob(value in memory) = (M - 2)/Si,

so

Prob(value not in memory) = 1 - Prob(value in memory) = 1 - (M - 2)/Si,

and each such miss costs one extra IO. This gives the desired formula. __

Let’s now continue with the hash join:

Example 8.10. Hash join R1⊳⊲R2 where R1,R2 are contiguously stored and unordered. According to [2], we may hash each relation to 100 buckets, so the average size of a bucket is 10 blocks for R1 and 5 blocks for R2. Since the smaller number, 5, is much less than the number of available buffers, we expect to have no trouble performing a one-pass join on each pair of buckets. The number of disk IOs is 1500 to read each of R1 and R2 while hashing into buckets, another 1500 to write all the buckets to disk, and a third 1500 to read each pair of buckets into main memory again while taking the one-pass join of corresponding buckets. Thus, the total cost is

C = 3 (B (R1)+ B (R2)) = 4500 IOs.

About the memory requirements, we need each bucket to fit in memory. We use M - 1 buckets, so the buckets have size B(R1)/(M - 1) blocks for R1 and B(R2)/(M - 1) blocks for R2. It is enough for the smaller one to fit, say B/(M - 1) blocks. Then, we need to ensure that

B/(M - 1) < M - 1,

so we need to fulfill

B < (M - 1)²

or

√B < M - 1.

Part V
Extensibility

9 Extensible databases: PostgreSQL

This section is adapted from the course slides and PostgreSQL: Extensibility.

PostgreSQL is extensible because its operation is catalog-driven. Relational DB systems store information about databases, tables, columns, etc., in what are commonly known as system catalogs.

The catalogs appear to the user as tables like any other, but the DBMS stores its internal bookkeeping in them. One key difference between PostgreSQL and standard relational database systems is that PostgreSQL stores much more information in its catalogs: not only information about tables and columns, but also information about data types, functions, access methods, and so on. These tables can be modified by the user, and since PostgreSQL bases its operation on these tables, this means that PostgreSQL can be extended by users. By comparison, conventional database systems can only be extended by changing hardcoded procedures in the source code or by loading modules specially written by the DBMS vendor.

The PostgreSQL server can moreover incorporate user-written code into itself through dynamic loading. That is, the user can specify an object code file (e.g., a shared library) that implements a new type or function, and PostgreSQL will load it as required. Code written in SQL is even more trivial to add to the server. This ability to modify its operation “on the fly” makes PostgreSQL uniquely suited for rapid prototyping of new applications and storage structures.

9.1 Types

9.1.1 Base types

Base types are those, like integer, that are implemented below the level of the SQL language. PostgreSQL can only operate on such types through functions provided by the user and only understands the behavior of such types to the extent that the user describes them.

9.1.2 Container types

Container types can be arrays, composites and ranges:

9.1.3 Domains

A domain is based on a particular underlying type and for many purposes is interchangeable with its underlying type. However, a domain can have constraints that restrict its valid values to a subset of what the underlying type would allow. Domains are created using the SQL command CREATE DOMAIN.

9.1.4 Pseudo-types

There are a few “pseudo-types” for special purposes. Pseudo-types cannot appear as columns of tables or components of container types, but they can be used to declare the argument and result types of functions. This provides a mechanism within the type system to identify special classes of functions.

9.1.5 Polymorphic types

Some pseudo-types of special interest are the polymorphic types, which are used to declare polymorphic functions. This powerful feature allows a single function definition to operate on many different data types, with the specific data type(s) being determined by the data types actually passed to it in a particular call.

9.2 Functions

PostgreSQL provides four kinds of functions: query language (SQL) functions, procedural language functions, internal functions, and C-language functions.

Every kind of function can take base types, composite types, or combinations of these as arguments (parameters). In addition, every kind of function can return a base type or a composite type. Functions can also be defined to return sets of base or composite values.

9.2.1 SQL functions

SQL functions execute an arbitrary list of SQL statements, returning the result of the last query in the list. In the simple (non-set) case, the first row of the last query’s result will be returned. (Bear in mind that “the first row” of a multirow result is not well-defined unless you use ORDER BY). If the last query happens to return no rows at all, the null value will be returned.

Alternatively, an SQL function can be declared to return a set (that is, multiple rows) by specifying the function’s return type as SETOF sometype, or equivalently by declaring it as RETURNS TABLE(columns). In this case all rows of the last query’s result are returned.

Any collection of commands in the SQL language can be packaged together and defined as a function. However, the final command must be a SELECT or have a RETURNING clause that returns whatever is specified as the function’s return type. Alternatively, if you want to define an SQL function that performs actions but has no useful value to return, you can define it as returning void.

Example 9.1. A SQL function defined by the user.

CREATE FUNCTION clean_emp() RETURNS void AS ' 
    DELETE FROM emp 
        WHERE salary < 0; 
' LANGUAGE SQL;

Remark 9.1. The entire body of an SQL function is parsed before any of it is executed. While an SQL function can contain commands that alter the system catalogs (e.g., CREATE TABLE), the effects of such commands will not be visible during parse analysis of later commands in the function. Thus, for example, CREATE TABLE foo (...); INSERT INTO foo VALUES(...); will not work as desired if packaged up into a single SQL function, since foo won’t exist yet when the INSERT command is parsed. It’s recommended to use PL/pgSQL instead of an SQL function in this type of situation.

Remark 9.2. More than one function can be defined with the same SQL name, so long as the arguments they take are different. In other words, function names can be overloaded. This is called function overloading.

9.2.2 Procedural functions

PostgreSQL allows user-defined functions to be written in other languages besides SQL and C. These other languages are generically called procedural languages (PLs). Procedural languages aren’t built into the PostgreSQL server; they are offered by loadable modules.

9.2.3 Internal functions

Internal functions are functions written in C that have been statically linked into the PostgreSQL server. The “body” of the function definition specifies the C-language name of the function, which need not be the same as the name being declared for SQL use.

Normally, all internal functions present in the server are declared during the initialization of the database cluster, but a user could use CREATE FUNCTION to create additional alias names for an internal function. Internal functions are declared in CREATE FUNCTION with language name internal.

Example 9.2. An internal function.

CREATE FUNCTION square_root(double precision) RETURNS double precision 
    AS 'dsqrt' 
    LANGUAGE internal 
    STRICT;

9.2.4 C-Language functions

User-defined functions can be written in C (or a language that can be made compatible with C, such as C++). Such functions are compiled into dynamically loadable objects (also called shared libraries) and are loaded by the server on demand. The dynamic loading feature is what distinguishes “C language” functions from “internal” functions — the actual coding conventions are essentially the same for both. (Hence, the standard internal function library is a rich source of coding examples for user-defined C functions.)

Currently only one calling convention is used for C functions (“version 1”). Support for that calling convention is indicated by writing a PG_FUNCTION_INFO_V1() macro call for the function.

9.2.5 Function volatility categories

Every function has a volatility classification, with the possibilities being VOLATILE, STABLE, or IMMUTABLE. VOLATILE is the default if the CREATE FUNCTION command does not specify a category. The volatility category is a promise to the optimizer about the behavior of the function: a VOLATILE function can do anything and its value can change even within a single table scan, so no optimizations can be made; a STABLE function cannot modify the database and returns the same result for the same arguments within a single statement; an IMMUTABLE function cannot modify the database and always returns the same result for the same arguments.

For best optimization results, you should label your functions with the strictest volatility category that is valid for them.

9.3 Procedures

A procedure is a database object similar to a function. The key differences are that a procedure does not return a value, that it is invoked with the CALL command rather than inside a query, and that, unlike a function, a procedure can commit or roll back transactions during its execution.

Collectively, functions and procedures are also known as routines. There are commands such as ALTER ROUTINE and DROP ROUTINE that can operate on functions and procedures without having to know which kind it is. Note, however, that there is no CREATE ROUTINE command.

9.4 Interfacing extensions to indexes

The procedures described thus far let you define new types, new functions, and new operators. However, we cannot yet define an index on a column of a new data type. To do this, we must define an operator class for the new data type. Operator classes can be grouped into operator families to show the relationships between semantically compatible classes. When only a single data type is involved, an operator class is sufficient.

The operators associated with an operator class are identified by “strategy numbers”, which serve to identify the semantics of each operator within the context of its operator class. For example, B-trees impose a strict ordering on keys, lesser to greater, and so operators like “less than” and “greater than or equal to” are interesting with respect to a B-tree. Because PostgreSQL allows the user to define operators, PostgreSQL cannot look at the name of an operator (e.g., < or >=) and tell what kind of comparison it is. Instead, the index method defines a set of “strategies”, which can be thought of as generalized operators. Each operator class specifies which actual operator corresponds to each strategy for a particular data type and interpretation of the index semantics.

Example 9.3. The B-tree index method defines five strategies, shown in the next table.

Operation              Strategy Number
less than              1
less than or equal     2
equal                  3
greater than or equal  4
greater than           5


9.5 Steps to create a PostgreSQL extension

  1. Create the appropriate file structure: extension-version.sql, extension.c, Makefile, extension.control.

  2. Create the data types.

  3. Create I/O functions.

  4. Create constructors, getters, setters.

  5. Create needed functions.

  6. Create operators =, <, <=, >, >=, ...

  7. Define operator classes for indexes.

Part VI
Failure Recovery and concurrency control

10 Failure recovery

Definition 10.1. Integrity constraints are predicates that all data in the database must satisfy.

A database is said to be in a consistent state if it satisfies all constraints defined on it. In such case, the database itself is said to be a consistent DB.

Remark 10.1. Databases cannot be consistent at all times, because when some operations are being done, it is possible to be in intermediate non-consistent states.

A transaction is a collection of actions that preserve consistency. Thus, a transaction should be the smallest unit of processing in the database.

A fundamental assumption about transactions is the correctness principle:

Correctness principle: If a transaction executes in the absence of any other transactions or system errors, and it starts with the database in a consistent state, then the database is also in a consistent state when the transaction ends.

There is a converse to the correctness principle that forms the motivation for both the logging techniques that we are going to see. This converse involves two points:

  1. A transaction is atomic, that is, it must be executed as a whole or not at all. If only part of a transaction executes, then the resulting database state may not be consistent. For example, if the system crashes in the middle of a transaction, if there is a media failure,...

  2. Transactions that execute simultaneously are likely to lead to an inconsistent state unless we take steps to control their interactions (refer to Section 11).

In order to study the details of logging algorithms and other transaction-management algorithms, we need a notation that describes all the operations that move data between address spaces. The primitives we shall use are Input(X) (copy the disk block containing database element X to a memory buffer), Read(X,t) (copy X to the transaction's local variable t, issuing an Input(X) first if needed), Write(X,t) (copy the value of the local variable t to the memory copy of X), and Output(X) (copy the buffer containing X back to disk).

10.1 Key problem: unfinished transactions

Unfinished transactions are a great problem when dealing with consistency. If we assume the correctness principle and all transactions execute completely (and in isolation), then the database would always be consistent and we would not be studying this; but there are reasons why transactions may not finish completely, leading to inconsistent states.

Example 10.1. Imagine we impose the constraint A = B and we want to execute the transaction

T1 : A ← A × 2
     B ← B × 2

It is obvious that if the database is consistent at the beginning of the transaction, it will also be consistent at the end, because both values start being the same, and they are modified in the same way.

Let’s see how things can go wrong.

Imagine the following plan:

T1 : Read(A,t); t ← t × 2
     Write(A,t);
     Read(B,t); t ← t × 2
     Write(B,t);
     Output(A);
     Output(B);

The evolution of the memory and disk contents is the following:

Action             Memory              Disk
(initially)        -                   A=8, B=8
Read(A,t)          A=8, t=8            A=8, B=8
t ← t × 2          A=8, t=16           A=8, B=8
Write(A,t)         A=16, t=16          A=8, B=8
Read(B,t)          A=16, B=8, t=8      A=8, B=8
t ← t × 2          A=16, B=8, t=16     A=8, B=8
Write(B,t)         A=16, B=16, t=16    A=8, B=8
Output(A)          A=16, B=16, t=16    A=16, B=8
Output(B)          A=16, B=16, t=16    A=16, B=16




If all actions execute, as we can see, the final state is consistent. Nonetheless, there is one point in the procedure where a failure of the system can leave it in an inconsistent state: after Output(A) and before Output(B) the database is inconsistent!

We need to be able to ensure atomicity of transactions: either all their actions are executed, or none of them. For this purpose, logging is a useful technique. Basically, the idea is to record every action in a file; if the system crashes, we can consult this file and roll back unfinished transactions, or continue from the point where we left off.

10.2 Logging

Definition 10.2. A log is a file of log records, each telling something about what some transaction has done. If log records appear in nonvolatile storage, we can use them to restore the database to a consistent state after a system crash.

There are several forms of log record that are used with each of the types of logging we discuss in this chapter. These are:

  1. <START T>: This record indicates that transaction T has begun.

  2. <COMMIT T>: Transaction T has completed successfully and will make no more changes to database elements. Any changes to the database made by T should appear on disk. However, because we cannot control when the buffer manager chooses to copy blocks from memory to disk, we cannot in general be sure that the changes are already on disk when we see the <COMMIT T> log record. If we insist that the changes already be on disk, this requirement must be enforced by the log manager (as is the case for undo logging).

  3. <ABORT T>: Transaction T could not complete successfully. If transaction T aborts, no changes it made can have been copied to disk, and it is the job of the transaction manager to make sure that such changes never appear on disk, or that their effect on disk is cancelled if they do.

10.2.1 Undo logging

Undo logging makes repairs to the database state by undoing the effects of transactions that may not have completed before the crash.

For an undo log, the only other kind of log record we need is an update record, which is a triple <T, X, v>. The meaning of this record is: transaction T has changed database element X, and its former value was v. The change reflected by an update record normally occurs in memory, not on disk.

An undo log does not record the new value of a database element, only the old value. If recovery is necessary in a system using undo logging, the only thing the recovery manager will do is cancel the possible effect of a transaction on disk by restoring the old value.

Example 10.2. Let’s repeat the previous example with an undo log added to the scheme.

The evolution of memory, disk and the log is the following:

Action                 Memory              Disk         Log (in memory)
(initially)            -                   A=8, B=8     -
T1 starts              -                   A=8, B=8     <T1,start>
Read(A,t); t ← t × 2   A=8, t=16           A=8, B=8     <T1,start>
Write(A,t)             A=16, t=16          A=8, B=8     <T1,start> <T1,A,8>
Read(B,t); t ← t × 2   A=16, B=8, t=16     A=8, B=8     <T1,start> <T1,A,8>
Write(B,t)             A=16, B=16, t=16    A=8, B=8     <T1,start> <T1,A,8> <T1,B,8>
Output(A)              A=16, B=16, t=16    A=16, B=8    <T1,start> <T1,A,8> <T1,B,8>
Output(B)              A=16, B=16, t=16    A=16, B=16   <T1,start> <T1,A,8> <T1,B,8> <T1,commit>

Imagine the system crashes after Output(A) and before Output(B ). When we switch on the system again, the database system manager would check the log and see that T1 is unfinished, so it would set the values of A and B to be 8, and consistency would be recovered.

Complications

Another aspect to take into account is that the log is first written in memory and not flushed to disk on every action, so some problems can arise.

Example 10.3. First bad state: DB modified before log is written



Memory: A=16, B=16, t=16; memory log: <T1,start>, <T1,A,8>, <T1,B,8>.

Disk: A=16, B=8; disk log: (empty).






If the system crashes now, we lose the log information in memory, and we don’t have it on disk, so we would not be able to recover to the previous consistent state.

Example 10.4. Second bad state: log written before DB modified



Memory: A=16, B=16, t=16; memory log: <T1,start>, <T1,A,8>, <T1,B,8>, <T1,commit>.

Disk: A=16, B=8; disk log: <T1,start>, <T1,A,8>, <T1,B,8>, <T1,commit>.

If the system fails now, we would think that B has been correctly changed, because the commit record is already on disk, but the buffer manager had not yet issued the Output(B) operation.

An undo log is sufficient to allow recovery from a system failure, provided transactions and the buffer manager obey two rules:

  1. If transaction T modifies database element X, then the log record of the form <T, X, v> must be written to disk before the new value of X is written to disk.

  2. If a transaction commits, then its COMMIT log record must be written to disk only after all database elements changed by the transaction have been written to disk, but as soon thereafter as possible.

In order to force log records to disk, the log manager needs a flush-log command that tells the buffer manager to copy to disk any log blocks that have not previously been copied to disk, or that have been changed since they were last copied. In sequences of actions, we shall show FLUSH LOG explicitly.

Example 10.5. Let's repeat the example following all these rules:










Step  Action        t    MemA  MemB  DiskA  DiskB  MemLog       DiskLog
1     <Start T>                                    <Start T>
2     Read(A,t)     8    8           8      8
3     t ← t × 2     16   8           8      8
4     Write(A,t)    16   16          8      8      <T,A,8>
5     Read(B,t)     8    16    8     8      8
6     t ← t × 2     16   16    8     8      8
7     Write(B,t)    16   16    16    8      8      <T,B,8>
8     FlushLog                                                  <Start T>;<T,A,8>;<T,B,8>
9     Output(A)     16   16    16    16     8
10    Output(B)     16   16    16    16     16
11    <Commit T>                                   <Commit T>
12    FlushLog                                                  <Commit T>









In this case, if there is a failure at any point in the process, we are able to roll back to a previous consistent state using the log written on disk.

When an error occurs, there is a procedure to follow, which is detailed in Algorithm 6. Note that if a failure occurs during recovery, nothing goes wrong: the recovery procedure simply starts over when the system is switched on again, and the rollback operations proceed in the same manner, since restoring the old values is idempotent.


Let S = set of transactions with <Ti,start> in the log, but no 
               <Ti,commit> nor <Ti,abort> in the log 
 
for each <Ti,X,v> in log in reverse order do 
       if Ti in S then 
               write(X,v) 
               output(X) 
 
for each Ti in S do 
       write <Ti,abort> to log
Algorithm 6: Undo logging: recovery rules
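
To make the recovery rules concrete, here is a minimal Python sketch of Algorithm 6 operating on an in-memory list of log records; the record encoding is an assumption for illustration, mirroring the <T,X,v> records above.

def undo_recover(log, disk):
    """Apply the undo recovery rules to a list of log records (newest last).
    Records are tuples: ('start', T), ('update', T, X, old_value),
    ('commit', T) or ('abort', T). `disk` maps element names to values."""
    started    = {r[1] for r in log if r[0] == "start"}
    finished   = {r[1] for r in log if r[0] in ("commit", "abort")}
    unfinished = started - finished
    for rec in reversed(log):                 # scan the log backwards
        if rec[0] == "update" and rec[1] in unfinished:
            _, t, x, old = rec
            disk[x] = old                     # restore the old value
    for t in unfinished:
        log.append(("abort", t))              # close the unfinished transactions
    return disk

log = [("start", "T1"), ("update", "T1", "A", 8), ("update", "T1", "B", 8)]
disk = {"A": 16, "B": 8}                      # crash happened after Output(A)
print(undo_recover(log, disk))                # {'A': 8, 'B': 8}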

10.2.2 Redo logging

Undo logging has the potential problem that we cannot commit a transaction without first writing all its changed data to disk. Sometimes we can save disk I/Os if we let changes to the database reside only in main memory for a while. As long as there is a log to fix things up in the event of a crash, it is safe to do so. The requirement for immediate backup of database elements to disk can be avoided if we use a logging mechanism called redo logging. The principal differences between redo and undo logging are:

  1. While undo logging cancels the effect of incomplete transactions and ignores committed ones during recovery, redo logging ignores incomplete transactions and repeats the changes made by committed transactions.

  2. While undo logging requires us to write changed database elements to disk before the COMMIT log record reaches disk, redo logging requires that the COMMIT record appear on disk before any changed values reach disk.

  3. While the old values of changed database elements are exactly what we need to recover when the undo rules U1 and U2 are followed, to recover using redo logging, we need the new values instead.

In redo logging the meaning of a log record <T, X, v> is “transaction T wrote new value v for database element X”. There is no indication of the old value of X in this record. Every time a transaction T modifies a database element X, a record of the form <T, X, v> must be written to the log.

For redo logging, the order in which data and log entries reach disk can be described by a single “redo rule,” called the write-ahead logging rule.

  1. Before modifying any database element X on disk, it is necessary that all log records pertaining to this modification of X, including both the update record <T, X, v> and the <COMMIT T> record, appear on disk.

Example 10.6. Let’s repeat the same example with this new logic.










Step  Action        t    MemA  MemB  DiskA  DiskB  MemLog       DiskLog
1     <Start T>                                    <Start T>
2     Read(A,t)     8    8           8      8
3     t ← t × 2     16   8           8      8
4     Write(A,t)    16   16          8      8      <T,A,16>
5     Read(B,t)     8    16    8     8      8
6     t ← t × 2     16   16    8     8      8
7     Write(B,t)    16   16    16    8      8      <T,B,16>
8     <Commit T>                                   <Commit T>
9     FlushLog                                                  <Start T>;<T,A,16>;<T,B,16>;<Commit T>
10    Output(A)     16   16    16    16     8
11    Output(B)     16   16    16    16     16









The procedure to recover from a failure is different from that of undo logging. In undo logging we discard uncommitted changes, because we are unsure whether they reached the disk. In redo logging we re-apply the changes of committed transactions, because these are now the ones we are unsure about, while we know for sure that uncommitted changes have not reached the disk. The recovery rules for redo logging are illustrated in Algorithm 7.


Let S = set of transactions with <Ti,commit> in the log, 
               but no <Ti,end> in the log 
 
for each <Ti,X,v> in log in forward order do 
       if Ti in S then 
               write(X,v) 
               output(X) 
 
for each Ti in S do 
       write <Ti,end> to log  
Algorithm 7: Redo logging: recovery rules
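
A matching Python sketch of the redo rules, under the same assumed record encoding as before, except that update records now carry the new value.

def redo_recover(log, disk):
    """Scan the log forward and repeat the updates of committed transactions;
    update records are ('update', T, X, new_value)."""
    committed = {r[1] for r in log if r[0] == "commit"}
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, t, x, new = rec
            disk[x] = new                      # re-apply the new value
    for t in committed:
        log.append(("end", t))                 # mark their changes as on disk
    return disk

log = [("start", "T1"), ("update", "T1", "A", 16),
       ("update", "T1", "B", 16), ("commit", "T1")]
disk = {"A": 8, "B": 8}                        # crash before Output(A)/Output(B)
print(redo_recover(log, disk))                 # {'A': 16, 'B': 16}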

Combining <Ti,end> records

There exist objects that are accessed very often, usually called hot objects. Since they are accessed so frequently, they would require lots of I/O operations, and many times the intermediate updated values would not even need to reach the disk. We can therefore delay writing them to disk as long as we can work with their values in memory: this way, we only write their latest value and perform fewer I/O operations.

Example 10.7. Imagine we have four transactions, T1, T2, T3 and T4, each of which updates the same element X. A naive execution would be:

Write(X); Output(X)
Write(X); Output(X)
Write(X); Output(X)
Write(X); Output(X)

And this way we would be writing X to disk unnecessarily. A better way to handle this is to skip the intermediate outputs and only flush the final value:

Write(X)
Write(X)
Write(X)
Write(X)
Output(X)

followed by a combined <end> record for all four transactions.

Nonetheless, there is an even better way to tackle this problem: checkpointing.

10.2.3 Checkpointing with undo logging

As we observed, recovery requires that the entire log be examined, in principle. When logging follows the undo style, once a transaction has its COMMIT log record written to disk, the log records of that transaction are no longer needed during recovery. We might imagine that we could delete the log prior to a COMMIT, but sometimes we cannot. The reason is that often many transactions execute at once. If we truncated the log after one transaction committed, log records pertaining to some other active transaction T might be lost and could not be used to undo T if recovery were necessary. The simplest way to untangle potential problems is to checkpoint the log periodically. In a simple checkpoint, we:

  1. Stop accepting new transactions.

  2. Wait until all currently active transactions commit or abort and have written a COMMIT or ABORT record on the log.

  3. Flush the log to disk.

  4. Write a log record <CKPT>, and flush the log again.

  5. Resume accepting transactions.

Example 10.8. An undo log with checkpointing:

<Start T1>
<T1,A,5>
<Start T2>
<T2,B,10>
------  (checkpoint requested)
<T2,C,15>
<T1,D,20>
<Commit T1>
<Commit T2>
<CKPT>
<Start T3>
<T3,E,25>
...

At the dotted line, a checkpoint is requested, so no more transactions are accepted until T1 and T2 finish. When both transactions commit, we can write the checkpoint record and accept new transactions, such as T3.

10.2.4 Checkpointing with redo logging

The steps to perform a checkpoint of a redo log are as follows:

  1. Write a log record <START CKPT (Ti,... ,Tk)>, where Ti,... ,Tk are all the active (uncommitted) transactions, and flush the log.

  2. Write to disk all database elements that were written to buffers but not yet to disk by transactions that had already committed when the START CKPT record was written to the log.

  3. Write an <END CKPT> record to the log and flush the log.

Example 10.9. A redo log with checkpointing:

When we start the checkpoint, only T2 is active, but the value of A written by T1 may have reached disk. If not, then we must copy A to disk before the checkpoint can end.

<Start T1>
<T1,A,5>
<Start T2>
<Commit T1>
<T2,B,10>
<Start CKPT(T2)>
<T2,C,15>
<Start T3>
<T3,D,20>
<End CKPT>
<Commit T2>
<Commit T3>

10.2.5 Undo/Redo logging

We have seen two different approaches to logging, differentiated by whether the log holds old values or new values when a database element is updated. Each has certain drawbacks: undo logging requires that changed data be written to disk before the transaction commits, possibly increasing the number of disk I/Os, while redo logging requires keeping all modified blocks buffered in memory until the transaction commits and the log has been flushed, possibly increasing the number of buffers needed.

An undo/redo log has the same sorts of log records as the other kinds of log, with one exception: the update log record that we write when a database element changes value has four components. Record <T, X, v, w> means that transaction T changed the value of database element X; its former value was v, and its new value is w. The constraints that an undo/redo logging system must follow are summarized by the following rule:

  1. Before modifying any database element X on disk because of changes made by some transaction T, it is necessary that the update record <T, X, v, w> appear on disk.

Example 10.10. An undo/redo log. Let’s redo our typical example:










Step  Action        t    MemA  MemB  DiskA  DiskB  MemLog       DiskLog
1     <Start T>                                    <Start T>
2     Read(A,t)     8    8           8      8
3     t ← t × 2     16   8           8      8
4     Write(A,t)    16   16          8      8      <T,A,8,16>
5     Read(B,t)     8    16    8     8      8
6     t ← t × 2     16   16    8     8      8
7     Write(B,t)    16   16    16    8      8      <T,B,8,16>
8     FlushLog                                                  <Start T>;<T,A,8,16>;<T,B,8,16>
9     Output(A)     16   16    16    16     8
10    <Commit T>                                   <Commit T>
11    Output(B)     16   16    16    16     16









Note that, in this case, the last three steps could’ve appeared in any order.

The undo/redo recovery policy is:

  1. Redo all the committed transactions in the order earliest-first.

  2. Undo all the uncommitted transactions in the order latest-first.

11 Concurrency control

Interactions among concurrently executing transactions can cause the database state to become inconsistent, even when the transactions individually preserve correctness of the state, and there is no system failure. Thus, the timing of individual steps of different transactions needs to be regulated in some manner. This regulation is the job of the scheduler component of the DBMS, and the general process of assuring that transactions preserve consistency when executing simultaneously is called concurrency control.

11.1 Schedules: serial, serializable and conflict-serializable

Definition 11.1. A schedule is a sequence of the important actions taken by one or more transactions.

Example 11.1. Imagine we have the constraint A = B and the following two transactions:

T1 : Read(A)          T2 : Read(A)
     A ← A + 100           A ← A × 2
     Write(A)              Write(A)
     Read(B)               Read(B)
     B ← B + 100           B ← B × 2
     Write(B)              Write(B)

Note how each of the transactions individually preserves the consistency of the database. Nonetheless, there exist some schedules that can make things go wrong!

Schedule A:





T1                     T2                     A    B
                                              25   25
Read(A); A <- A+100
Write(A)                                      125
Read(B); B <- B+100
Write(B)                                           125
                       Read(A); A <- A*2
                       Write(A)               250
                       Read(B); B <- B*2
                       Write(B)                    250
                                              250  250




This schedule poses no problems.

Schedule B:





T1                     T2                     A    B
                                              25   25
                       Read(A); A <- A*2
                       Write(A)               50
                       Read(B); B <- B*2
                       Write(B)                    50
Read(A); A <- A+100
Write(A)                                      150
Read(B); B <- B+100
Write(B)                                           150
                                              150  150




This schedule poses no problems. Note, nonetheless, how the different orderings affect the final result.

Schedule C:





T1                     T2                     A    B
                                              25   25
Read(A); A <- A+100
Write(A)                                      125
                       Read(A); A <- A*2
                       Write(A)               250
Read(B); B <- B+100
Write(B)                                           125
                       Read(B); B <- B*2
                       Write(B)                    250
                                              250  250




This schedule poses no problems.

Schedule D:





T1                     T2                     A    B
                                              25   25
Read(A); A <- A+100
Write(A)                                      125
                       Read(A); A <- A*2
                       Write(A)               250
                       Read(B); B <- B*2
                       Write(B)                    50
Read(B); B <- B+100
Write(B)                                           150
                                              250  150




But this schedule is problematic! The final state of the database is inconsistent.

We want schedules that are ’good’, in the sense that they ensure the final state of the database to be consistent, independent of the transactions’ semantics and the initial (consistent) state of the database.

For this, we should only look at the order of reads and writes.

A schedule can be represented as its sequence of actions, where ri(X) means that X is read by transaction i and wi(X) means that X is written by transaction i.

For example, schedule C can be represented as

SC = r1(A) w1(A) r2(A) w2(A) r1(B) w1(B) r2(B) w2(B).

Definition 11.2. A schedule is serial if its actions consist of all the actions of one transaction, then all actions of another transaction, and so on.

For example, Schedules A and B are serial.

Remark 11.1. Note that all serial schedules work: they leave a consistent database, because of the correctness principle.

Definition 11.3. A schedule S is serializable if there is a serial schedule S' such that, for every initial database state, the effects of S and S' are the same.

For example, Schedule C is serializable (it has the same effect as the serial schedule A), but schedule D is not serializable, because it leads to an inconsistent state.

The transaction game

It is a way to visually check for serializability (only for simple schedules).

Example 11.2. For example, Schedule C is:










      r1(A)  w1(A)  r2(A)  w2(A)  r1(B)  w1(B)  r2(B)  w2(B)
A      r      w      r      w
B                                  r      w      r      w
T1     r      w                    r      w
T2                   r      w                    r      w

Here, each column is one step of the schedule: we mark which transaction takes the step and which element it affects, in the order of the schedule. Now, two steps can be exchanged if they are next to one another and their columns can slide past each other without colliding in any row.










PIC (the same grid, with two adjacent steps that both touch A highlighted in red)

Here, the red colored steps cannot be exchanged, because they collide in the first row (they both act on A).










PIC (the same grid, with two adjacent steps from different transactions and different rows highlighted in green)

But now, the green colored steps can be exchanged, because they are next to each other and they do not collide if we slide the columns:










      r1(A)  w1(A)  r1(B)  w1(B)  r2(A)  w2(A)  r2(B)  w2(B)
A      r      w                    r      w
B                    r      w                    r      w
T1     r      w      r      w
T2                                 r      w      r      w









And this is schedule A!

Example 11.3. Now, for schedule D:










      r1(A)  w1(A)  r2(A)  w2(A)  r2(B)  w2(B)  r1(B)  w1(B)
A      r      w      r      w
B                                  r      w      r      w
T1     r      w                                  r      w
T2                   r      w      r      w









Everything is red!

Let’s continue the discussion to try to understand why D is different from C.

Definition 11.4. We define conflicting actions as those that cannot be reordered in a schedule. Two actions conflict if they belong to the same transaction, or if they act on the same database element, come from different transactions, and at least one of them is a write (write-write, read-write or write-read).

Definition 11.5. A schedule S is conflict serializable if it can be transformed into a serial schedule S' by swapping adjacent non-conflicting actions. In that case, S and S' are said to be conflict equivalent.

These definitions give us a mathematical tool for determining 'good' schedules, understanding by that conflict-serializable schedules. This goal is achieved by the precedence graph:

Definition 11.6. The precedence graph of a schedule S, P(S), is a directed graph P(S) = (N, E) where:

  • The nodes, N, are the transactions in S.

  • There is an arc (Ti, Tj) ∈ E, with i ≠ j, whenever:

    • pi(A), qj(A) are actions in S,

    • pi(A) <_S qj(A), i.e., pi(A) precedes qj(A) in S,

    • at least one of pi, qj is a write.

Example 11.4. Let’s compute P(S) for

S = w3(A) w2(C) r1(A) w1(B) r1(C) w2(A) r4(A) w4(D)

PIC

How to use precedence graphs to determine conflict-serializability?

To tell whether a schedule S is conflict-serializable, construct the precedence graph for S and ask if there are any cycles. If so, then S is not conflict-serializable. But if the graph is acyclic, then S is conflict-serializable.

Example 11.5. The schedule in Example 11.4 is not conflict serializable, because there is a cycle between nodes 1 and 2.

Example 11.6. Is S = w1(A) r2(A) r3(A) w4(A) conflict serializable?

First, construct the precedence graph:

PIC

In this case, there are no cycles, so S is conflict serializable.
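To make this test concrete, the following minimal sketch (my own illustration in Python, not course code; the function names are invented) builds P(S) from a schedule given as (transaction, action, object) triples and searches it for cycles. It reproduces the verdict of Example 11.4.

from collections import defaultdict

def precedence_graph(schedule):
    # schedule: list of (transaction, action, object), e.g. (1, 'r', 'A')
    edges = defaultdict(set)
    for i, (ti, pi, a) in enumerate(schedule):
        for tj, qj, b in schedule[i + 1:]:
            # conflicting pair: same object, different transactions, at least one write
            if a == b and ti != tj and 'w' in (pi, qj):
                edges[ti].add(tj)
    return edges

def has_cycle(edges):
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def dfs(u):
        color[u] = GREY
        for v in edges[u]:
            if color[v] == GREY or (color[v] == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and dfs(u) for u in list(edges))

# Example 11.4: w3(A) w2(C) r1(A) w1(B) r1(C) w2(A) r4(A) w4(D)
S = [(3,'w','A'), (2,'w','C'), (1,'r','A'), (1,'w','B'),
     (1,'r','C'), (2,'w','A'), (4,'r','A'), (4,'w','D')]
print(has_cycle(precedence_graph(S)))   # True -> not conflict-serializable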

11.2 How to enforce serializability: locking

11.2.1 Option 1: let luck be our friend

Run the system, recording the precedence graph of the executed schedule. At the end of the day (or at some other chosen point in time), check P(S) for cycles and declare whether the execution was good.

Of course, this is not a very good option, because we are letting luck decide whether we are losing time and energy or not.

11.2.2 Option 2: a locking protocol

Prevent cycles from occurring!

Now, we define two new actions: li(A), with which transaction i locks the object A, and ui(A), with which transaction i unlocks it.

Definition 11.7. Rule 1: A transaction is well-formed if, before performing any operation on an object A, it locks A, and it unlocks A afterwards.

Definition 11.8. Rule 2: A schedule is legal if a lock is only taken on objects that are currently unlocked, i.e., any previous lock on the object has already been released (or the object was never locked before).

Example 11.7. Let’s analyze some schedules in terms of well-formedness and legality:

Definition 11.9. Rule 3: The two-phase locking (2PL) scheme is a scheduling strategy in which no unlock is performed for a transaction Ti until all locks for Ti have been acquired, i.e., all of Ti’s locks are acquired before any of its unlocks is performed.

Theorem 11.1. All schedules satisfying rules 1, 2 and 3 are conflict-serializable. That is, if all the transactions of a schedule are well-formed, the schedule is legal and it follows the 2PL scheme, then the schedule is conflict-serializable.

Remark 11.2. Note that the converse is not true: there are conflict-serializable schedules that are not 2PL.
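As a small illustration of Rule 3 (an assumed mechanization of mine, not the course's code), the two-phase property of a schedule with explicit lock/unlock actions can be checked by verifying that no transaction requests a lock after its first unlock:

def is_two_phase(schedule):
    # schedule: list of (transaction, action, object) with actions 'l', 'u', 'r', 'w'
    unlocked = set()                      # transactions that already unlocked something
    for t, action, _obj in schedule:
        if action == 'u':
            unlocked.add(t)
        elif action == 'l' and t in unlocked:
            return False                  # a lock after an unlock violates 2PL
    return True

# l1(A) r1(A) w1(A) u1(A) l1(B): not two-phase, T1 locks B after unlocking A
S = [(1,'l','A'), (1,'r','A'), (1,'w','A'), (1,'u','A'), (1,'l','B')]
print(is_two_phase(S))   # False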

Beyond the 2PL protocol, it is all a matter of improving performance and allowing more concurrency; to this end, there exist further mechanisms, such as shared locks, multiple-granularity locks, etc.

11.3 Shared locks

Until now, we have been locking an object before any action can be applied to it, but sometimes it is possible to grant several transactions access to the same object, if the actions they need to perform do not conflict. For example, if two transactions only want to read the same variable, they can do so without any problem, but our current scheme does not allow this.

A way to amend this is to define shared locks which are locks that can be shared by several transactions, provided they only want to read the object.

We define new actions: sli(A), with which transaction i acquires a shared lock on A, and xli(A), with which it acquires an exclusive lock on A, together with the corresponding unlock actions.

Note that it is usual to just represent both unlocks by ui(A), assuming the computer will execute the correct action depending on the state of the object.

Now, we have to redefine our rules:

Definition 11.10. Rule 1: a transaction is well-formed if:

  • Before performing a read action on an object, it has been previously locked, exclusively or shared (but better if it is shared locked, allowing for increased performance).

  • Before performing a write action on an object, it has been previously exclusively locked.

Rule 2: a schedule is legal if:

  • No exclusive lock is performed on a locked (exclusive or shared) object, until it has been unlocked.

  • No lock (exclusive or shared) is performed on an exclusively locked object, until it has been unlocked.

This rule can be summarized in a compatibility matrix, which shows which lock requests can be granted:

                Tj asks for
 Ti holds     S        X
 S            True     False
 X            False    False
Rule 3: a schedule respects the 2PL protocol if for any two-phase locked transaction Ti, no action xli(A) or sli(A) is preceded by an action ui(B) for any object B.

Notice that there are transactions that read and write the same object, so what should we do about this? There are two main options:

  1. We can just request an exclusive lock from the beginning.

  2. We can use an upgrade scheme, in which a shared lock is acquired if we are unsure whether we will need to write the object later. When we do need to write it, we ’upgrade’ the shared lock to an exclusive lock. This can be achieved technically by allowing transactions to hold two locks on the same object, one shared and one exclusive, or by releasing the shared lock and acquiring the exclusive lock.

11.4 More types of locks

11.4.1 Increment lock

We can define a new action, which is an atomic increment action, as

INi(A, k) ≡ {Read(A); A ← A + k; Write(A)}.

A property of these actions is that they commute with one another, so they do not conflict, even though they write the object. This allows for more flexibility, because we can define a new lock, the increment lock, with its associated action ili(A).
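As a tiny illustrative check (my own, under the assumption that each increment is applied atomically), applying two increment actions in either order leaves A with the same value:

def increment(db, obj, k):
    value = db[obj]          # Read(A)
    value = value + k        # A <- A + k
    db[obj] = value          # Write(A)

db1 = {'A': 10}
increment(db1, 'A', 3); increment(db1, 'A', 7)    # IN1 then IN2

db2 = {'A': 10}
increment(db2, 'A', 7); increment(db2, 'A', 3)    # IN2 then IN1

print(db1['A'] == db2['A'] == 20)   # True: the order does not matter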

And we have to, again, redefine the rules:

Definition 11.11. Rule 1: a transaction is well-formed if:

  • Before performing a read action on an object, it has been previously locked, exclusively or shared (but better if it is shared locked, allowing for increased performance).

  • Before performing a write action on an object, it has been previously exclusively locked.

  • Before performing an increment action on an object, it has been previously increment-locked.

Rule 2: a schedule is legal if:

  • No exclusive lock is performed on a locked (exclusive, increment or shared) object, until it has been unlocked.

  • No lock (exclusive, increment or shared) is performed on an exclusively locked object, until it has been unlocked.

  • An increment lock can only be performed on unlocked objects or increment-locked objects.

This rule can be summarized in a compatibility matrix, which shows which lock requests can be granted:

                Tj asks for
 Ti holds     S        X        I
 S            True     False    False
 X            False    False    False
 I            False    False    True
Rule 3: a schedule respects the 2PL protocol if for any two-phase locked transaction Ti, no action xli(A), sli(A) or ili(A) is preceded by an action ui(B) for any object B.

11.4.2 Update lock

A common deadlock problem that arises when we use upgrading shared locks is depicted below:

   T1         T2
  sl1(A)
              sl2(A)
  xl1(A)
              xl2(A)
In this case, both transactions are waiting for the other to release the object A to be able to lock it exclusively, so there is a deadlock.

The solution implies decreasing the level of concurrency, but it is worth it to avoid such problematic cases. We define a new lock, the update lock: a transaction that is unsure about whether it will need to write an object acquires an update lock instead of a shared lock, and upgrades to an exclusive lock can only be made from this lock:

Note that if an object is shared-locked, it can still be update-locked, but not the other way round (if we allowed this, the behavior would not change).

Let’s redefine our three rules for this case:

Definition 11.12. Rule 1: a transaction is well-formed if:

  • Before performing a read action on an object, it has been previously locked, exclusively, update or shared (but better if it is shared locked, allowing for increased performance).

  • Before performing a write action on an object, it has been previously exclusively locked.

  • Before upgrading a lock, it has been previously update-locked.

Rule 2: a schedule is legal if:

  • No exclusive lock is performed on a locked (exclusive, update or shared) object, until it has been unlocked.

  • No lock (exclusive, update or shared) is performed on an exclusively locked object, until it has been unlocked.

  • An update lock can only be performed on unlocked objects or shared-locked objects.

This rule can be summarized in a compatibility matrix, which shows which lock requests can be granted:

                Tj asks for
 Ti holds     S        X        U
 S            True     False    True
 X            False    False    False
 U            False    False    False
Note that in this case, an object can be locked in several modes at once (for instance, it can be locked in shared and update mode at the same time), so whether a new request is granted depends on the most restrictive lock mode currently held.

Rule 3: a schedule respects the 2PL protocol if for any two-phase locked transaction Ti, no action xli(A) or sli(A) is preceded by an action ui(B) for any object B.

11.5 Lock granularity

We have been talking about locking objects, but we have not specified what these objects are. They can be tuples, pages, relations, etc. The scheme works in all cases, but choosing the appropriate granularity is obviously going to affect performance: locking whole relations allows very little concurrency, while locking individual tuples forces the system to manage a very large number of locks.

We can define multi-granular locks, which can specify at what level they want to lock the objects involved.

For this, we define intentional locks, which can be of any of the types we have seen, but with a slightly different meaning: an intentional lock on a node (IS or IX) declares the intention to lock some descendant of that node in shared or exclusive mode, respectively, while SIX combines a shared lock on the node with an intentional exclusive lock.

Example 11.8. Imagine we have the relation R1 which has four tuples. If transaction 1, T1, wants to shared-lock the second tuple, t2, we need to obtain an intentional shared-lock on R1 and then a shared-lock on t2. This is depicted below:

PIC

Now, imagine T2 wants a shared lock on the whole relation: it can be acquired, because the compatibility matrix allows it. But if it wanted an exclusive lock on the whole relation, it would need to wait until the locked tuple is unlocked.

Example 11.9. Another example is starting with the previous one:

PIC

Now, imagine T2 wants to exclusively lock t4: this can be done, because t4 is unlocked. In this case, T2 needs to ask for an intentional exclusive lock on R1, which would be granted because the relation R1 is not fully locked. Then, at the tuple level, T2 would ask for an exclusive lock on t4, and it would be granted because t4 is unlocked:

PIC

Again, we can build the compatibility matrix for these new kinds of locks:

                      Tj asks for
 Ti holds     IS       IX       S        SIX      X
 IS           True     True     True     True     False
 IX           True     True     False    False    False
 S            True     False    True     False    False
 SIX          True     False    False    False    False
 X            False    False    False    False    False

Also, there are restrictions on which states the subobjects can be in, depending on the state of the parent object:

 Parent state    Child possible states
 IS              IS, S
 IX              IS, S, IX, X, SIX
 S               none
 SIX             X, IX
 X               none
The rules to follow are the following (a small sketch in code follows the list):

  1. Follow the multiple granularity compatibility matrix.

  2. Lock root of tree first.

  3. Node Q can be locked by Ti in S or IS only if parent(Q) is locked by Ti in IX or IS.

  4. Node Q can be locked by Ti in X,SIX,IX only if parent(Q) is locked by Ti in IX,SIX.

  5. Ti is two-phase.

  6. Ti can unlock node Q only if none of Q’s children are locked by Ti.
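The sketch below (an illustration with invented helper names, not course code) encodes the multiple-granularity compatibility matrix together with rules 3 and 4 on the parent's lock mode:

COMPAT = {   # COMPAT[held][asked]: can Tj acquire `asked` while Ti holds `held`?
    'IS':  {'IS': True,  'IX': True,  'S': True,  'SIX': True,  'X': False},
    'IX':  {'IS': True,  'IX': True,  'S': False, 'SIX': False, 'X': False},
    'S':   {'IS': True,  'IX': False, 'S': True,  'SIX': False, 'X': False},
    'SIX': {'IS': True,  'IX': False, 'S': False, 'SIX': False, 'X': False},
    'X':   {'IS': False, 'IX': False, 'S': False, 'SIX': False, 'X': False},
}

def compatible(held_modes, asked):
    # Rule 1: the requested mode must be compatible with every lock held by others
    return all(COMPAT[h][asked] for h in held_modes)

def parent_allows(parent_mode, asked):
    # Rules 3-4: S/IS need the parent in IS or IX; X/SIX/IX need it in IX or SIX
    if asked in ('S', 'IS'):
        return parent_mode in ('IS', 'IX')
    return parent_mode in ('IX', 'SIX')

# T1 holds IX on relation R1; can T2 also take IX on R1 and then X on a tuple?
print(compatible({'IX'}, 'IX'))        # True: IX is compatible with IX
print(parent_allows('IX', 'X'))        # True: X on the tuple is allowed under IX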

Example 11.10. Let’s do some practice.

1) Start with this setup:

PIC

Can T2 access object f2,2 in mode X? If so, what locks would T2 get?

Yes, it can, because all the locks in the sequence are compatible with another IX lock and f2,2 is unlocked:

PIC

2) Start with this setup:

PIC

Can T2 access object f2,2 in mode X? If so, what locks would T2 get?

No, it cannot, because parent t2 is in X mode, so it cannot be locked in IX mode by T2.

3) From the last setup: can T2 access object f3,1 in X mode? What locks would T2 get?

Yes, it can, because f3,1 is unlocked, its parent t3 is unlocked, and t3's parent R1 is in IX state, which is compatible with another IX lock:

PIC

4) Start with this setup:

PIC

Can T2 access object f2,2 in S mode? What locks would T2 get?

Yes, because SIX and IX are compatible with IS:

PIC

5) In the previous setup: can T2 access object f2,2 in X mode? What locks would T2 get?

No, because SIX is not compatible with IX.

Part VII
Distributed Databases

12 Distributed databases

Distributed database management systems distribute and replicate data over multiple machines to try to meet the availability, durability, performance, regulatory and scale requirements of large organizations, subject to physics.

A distributed database does two things: it distributes (partitions) the data over multiple nodes, and it replicates data across nodes.

The goal is to offer the same functionality and transactional semantics as an RDBMS, with distributed features.

The reality is that concessions need to be made in terms of functionality, transactional semantics and performance.

There are three main challenges in distributed databases, which are how to distribute the data, how to access the data and how to perform distributed transactions.

Definition 12.1. A shard is a horizontal partition of data in a database.

12.1 Data distribution

There are several ways to distribute the data:

In the context of data distribution, it is usually desirable to maintain approximately the same amount of data in each node. To achieve this, rebalancing is used. Rebalancing encompasses:

Another important concept is that of co-location, which makes use of the fact that some tables share some attributes. If a shared attribute is related to the distribution key, then we can store different tables in a distributed manner, in such a way that if we perform a join between these tables using this attribute, we minimize the interaction between tables from different nodes.

Other times, for the same purpose of increasing the efficiency of distributed joins it is useful to replicate small tables to enable fast joins, foreign keys and other operations. This technique is called reference tables.

There are more ways to tackle data distribution:

12.2 Distributed data access: distributed SQL

To scale query throughput linearly with the number of nodes, queries should only access one node. The techniques of co-location and reference tables enable relatively complex queries under this restriction. This idea of using the distribution information inside the queries to access only the desired nodes is called routing queries.

Example 12.2. A routing query trying to access only the node where the distribution key is 36:

INSERT INTO dist1(dist_key,value) VALUES(36, 11); 
 
SELECT * 
FROM dist1 
WHERE dist_key=36 AND value<11; 
 
UPDATE dist1 
SET value=3 
WHERE dist_key=36 AND value<11;

Nonetheless, the relational algebra can be extended to work in distributed systems: the result is called the multi-relational algebra, and it adds the following operations to the usual relational algebra:

As in standard SQL, distributed SQL requires logical planning of the queries before executing them, with the requirement that the final result is the same as if all the data were stored in a single node.

Example 12.3. Imagine we want to compute the mean of an attribute A of a table T which is distributed over different nodes. The steps to follow would be (a small simulation follows the list):

  1. In each node: SUM(A) and COUNT(A)

  2. In requesting node:

    1. COLLECT all data: sum_ = SUM(SUM(N->T.A), N in Nodes), count_ = SUM(COUNT(N->T.A), N in Nodes)

    2. mean = sum_ / count_
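A minimal in-memory simulation of this plan (the node names and data are invented; this is not a real distributed engine):

nodes = {                      # hypothetical per-node shards of column T.A
    'node1': [10, 20, 30],
    'node2': [40, 50],
    'node3': [60],
}

# Step 1: per-node partial aggregates SUM(A) and COUNT(A)
partials = [(sum(rows), len(rows)) for rows in nodes.values()]

# Step 2: the requesting node collects and combines
sum_   = sum(s for s, _ in partials)
count_ = sum(c for _, c in partials)
mean   = sum_ / count_
print(mean)        # 35.0, identical to computing the mean over all the data at once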

Also, different plans can be defined and the best one should be chosen following some optimization criteria.

Co-located joins

Imagine we have the query:

SELECT dist1.dist_key, count(*)
FROM dist1
JOIN dist2 ON (dist1.dist_key = dist2.dist_key)
WHERE dist2.value < 44
GROUP BY dist1.dist_key;

One way to do this is:

  1. In each node:

    1. Scan dist1

    2. Scan dist2 and filter dist2.value < 44

  2. In requesting node:

    1. Collect all dist1

    2. Collect all filtered dist2

    3. Perform the join

    4. Aggregate

But as we are using the distribution key for joining, we know that the rows that will join are stored together, so we can make use of the co-location by changing the plan to the following (a small simulation appears after the plan):

  1. In each node:

    1. Scan dist1

    2. Scan dist2 and filter dist2.value < 44

    3. Perform the join

    4. Aggregate

  2. In requesting node:

    1. Collect the values

For these changes to work, we use the facts that the filter commutes with collect, that GROUP BY dist_key commutes with collect, and that the join is co-located, so it also commutes with collect.

When we do this, we are working with much smaller tables, thus reducing the response time.
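The following sketch (an in-memory simulation with invented shard contents) mimics the co-located plan: the filter, the join and the GROUP BY run inside each node, and the requesting node only merges small per-key counts:

from collections import Counter

shards = {   # shard -> (dist1 rows, dist2 rows), co-located on dist_key
    'node1': ([{'dist_key': 1}, {'dist_key': 2}],
              [{'dist_key': 1, 'value': 10}, {'dist_key': 2, 'value': 99}]),
    'node2': ([{'dist_key': 3}],
              [{'dist_key': 3, 'value': 7}, {'dist_key': 3, 'value': 8}]),
}

collected = Counter()
for d1, d2 in shards.values():                       # "in each node"
    filtered = [r for r in d2 if r['value'] < 44]    # filter dist2.value < 44
    for a in d1:                                     # local join on dist_key
        matches = sum(1 for b in filtered if b['dist_key'] == a['dist_key'])
        if matches:                                  # inner join: drop empty groups
            collected[a['dist_key']] += matches      # keys are disjoint across nodes
print(dict(collected))   # the requesting node just collects: {1: 1, 3: 2}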

Re-partition joins

Now imagine the query:

SELECT dist1.dist_key, count(*) 
FROM dist1 
JOIN dist2 ON (dist1.dist_key = dist2.other_key)
WHERE dist2.value < 44 
GROUP BY dist1.dist_key;

In this case, the initial plan also works, but again we can do better by repartitioning the filtered dist2 tuples to the nodes where they will be needed. This might seem like heavy work, but if the tables are large enough, the whole relations might not even fit in a single node, so this can be the only way to produce the answer. The alternative plan is the following (a small simulation follows the plan):

  1. In each node:

    1. Scan dist2

    2. Filter dist2.value < 44

  2. In each node:

    1. Repartition filtered tuples from dist2 with corresponding other_key to dist_key of this node

    2. Scan dist1

    3. Perform the join

    4. Aggregate

  3. In requesting node:

    1. Collect the values
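A similar sketch (invented data and a hypothetical placement function owner()) of the re-partition plan, in which the filtered dist2 tuples are shuffled to the node that owns the matching dist_key:

from collections import Counter, defaultdict

def owner(key, num_nodes=2):            # hypothetical placement function
    return key % num_nodes

dist1 = {0: [{'dist_key': 2}], 1: [{'dist_key': 1}, {'dist_key': 3}]}   # per node
dist2 = {0: [{'other_key': 1, 'value': 5}, {'other_key': 2, 'value': 50}],
         1: [{'other_key': 3, 'value': 7}, {'other_key': 3, 'value': 9}]}

# Step 1 and repartition: filter, then send each dist2 tuple to owner(other_key)
shuffled = defaultdict(list)
for rows in dist2.values():
    for r in rows:
        if r['value'] < 44:
            shuffled[owner(r['other_key'])].append(r)

# Step 2: local join with dist1 and aggregation; Step 3: collect
collected = Counter()
for node, d1_rows in dist1.items():
    for a in d1_rows:
        matches = sum(1 for b in shuffled[node] if b['other_key'] == a['dist_key'])
        if matches:
            collected[a['dist_key']] += matches
print(dict(collected))    # {1: 1, 3: 2}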

Broadcast joins

Now, imagine the query:

WITH top10 AS ( 
       SELECT dist_key, count(*) 
       FROM dist1 
       GROUP BY 1 
       ORDER BY 2 
       LIMIT 10 
       ) 
 
SELECT * 
FROM dist2 
WHERE other_key IN (SELECT dist_key FROM top10);

The naïve plan is:

  1. In each node:

    1. Scan dist1

    2. Scan dist2

  2. In requesting node

    1. Collect dist1

    2. Aggregate dist1

    3. Sort/limit dist1

    4. Collect dist2

    5. Perform the join

But this can be improved by creating a subplan that handles the order/limit under the join, and broadcasting the result of the subplan so that the collect is pulled above the join. The idea is that the TOP10 among all the data is the same as the TOP10 among the per-node TOP10s, because dist1 is distributed by dist_key, so each group lives entirely on one node (a small simulation follows the plan):

  1. In each node:

    1. Scan dist1

    2. Preaggregate: get TOP10 of the node

  2. In requesting node:

    1. Collect TOP10 of each node

    2. Merge all TOP10

    3. Sort/limit and get final TOP10

    4. Broadcast this TOP10 to all nodes

  3. In each node:

    1. Scan dist2

    2. Join dist2 with broadcasted TOP10

  4. In requesting node:

    1. Collect the values
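A small simulation of why the per-node pre-aggregation is valid (invented data, TOP3 instead of TOP10, and descending counts for illustration; the same argument works for either sort direction): merging and re-limiting the per-node results gives the same answer as the global computation, because every dist_key group sits on exactly one node.

from collections import Counter

node_keys = {                      # dist1.dist_key values held by each node
    'node1': [1, 1, 1, 2, 2, 5],
    'node2': [3, 3, 3, 3, 4],
}
LIMIT = 3                          # TOP3 instead of TOP10 to keep it small

# per-node pre-aggregation: TOPk of each node (count per dist_key)
per_node_top = [Counter(keys).most_common(LIMIT) for keys in node_keys.values()]

# requesting node: merge, sort/limit again, then broadcast the result
merged = sorted((kv for top in per_node_top for kv in top),
                key=lambda kv: kv[1], reverse=True)[:LIMIT]
global_top = Counter([k for keys in node_keys.values() for k in keys]).most_common(LIMIT)
print(merged == global_top)        # True: same keys and counts either way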

Observations

As we have seen, the query plans depend heavily on the distribution key.

Runtime also depends on the query, data, the data size, network speed,...

This means distributed databases require adjusting the distribution keys and queries to each other to achieve high performance.

12.3 Distributed transactions

Ideally, we have ACID transactions: Atomicity, Consistency, Isolation and Durability.

The main distribution challenges are:

Additionally, it is important to have a mechanism for distributed deadlock detection.

12.3.1 Atomicity

Atomicity is achieved through 2PC (two-phase commit), as follows (a minimal sketch in code follows the list):

  1. Prepare: store the transaction on all participating nodes.

  2. Store the final commit decision, and:

    1. If success: commit all prepared transactions.

    2. If error: abort all prepared transactions.

  3. After a system failure, commit/abort any remaining prepared transactions according to the stored decision.
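A minimal coordinator-side sketch of this flow (in-memory objects, not a real protocol implementation; a real system would persist the prepare records and the decision in its logs):

class Node:
    def __init__(self, name, fail_prepare=False):
        self.name, self.fail_prepare = name, fail_prepare
        self.prepared, self.committed = False, False
    def prepare(self, txn):
        self.prepared = not self.fail_prepare     # in reality: durably store the txn
        return self.prepared
    def commit(self):  self.committed, self.prepared = True, False
    def abort(self):   self.prepared = False

def two_phase_commit(nodes, txn):
    votes = [n.prepare(txn) for n in nodes]       # phase 1: prepare everywhere
    decision = all(votes)                         # store the commit decision (log it)
    for n in nodes:                               # phase 2: apply the decision
        n.commit() if decision else n.abort()
    return decision

print(two_phase_commit([Node('n1'), Node('n2')], 'T1'))                      # True
print(two_phase_commit([Node('n1'), Node('n2', fail_prepare=True)], 'T2'))   # False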

12.3.2 Isolation

If we query different nodes at different times, we may see a concurrent transaction as committed on one node, but not yet committed on another one.

Distributed snapshot isolation means that all nodes share the same view of what is committed and what is not.

We must also ensure consistency: any preceding write is seen as committed.

A common approach is to add a timestamp to each query, so queries see all commits with lower timestamps.

There are different ways of dealing with clock synchronization:

12.3.3 Considerations

Fully resolving a 2PC might take time in case of system failure.

Distributed deadlock detection is essential to stability, but not always implemented.

Snapshot isolation avoids seeing partially committed transactions, but at a cost, and read-your-writes consistency can be at risk.

12.4 Replication

Replication means storing the same data on different nodes. It can be useful for:

12.4.1 Quorums

The basic idea is to read from R nodes and write to W nodes, with R + W > N, where N is the total number of replicas. This guarantees that every read quorum overlaps every write quorum.

The challenge is to apply events in the same order everywhere.
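A tiny illustrative check (my own, with invented values of N, R and W) that the condition R + W > N forces every read quorum to intersect every write quorum:

from itertools import combinations

N, R, W = 5, 3, 3                 # R + W = 6 > N = 5
nodes = set(range(N))
overlap_always = all(set(r) & set(w)
                     for r in combinations(nodes, R)
                     for w in combinations(nodes, W))
print(overlap_always)             # True: any read quorum meets any write quorum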

12.4.2 Follow the leader

We can also assign a temporary leader to serialize writes efficiently. This leader would then feed the rest of the nodes with the new data.

If one node fails (standby failure), the leader continues writing to the other replicas.

If the leader fails (primary failure), a failover is initiated: a replica is promoted to leader and the rest of the replicas follow the new leader.

12.4.3 N-directional

All nodes accept writes and then decide how to reconcile conflicting changes.

12.5 CAP theorem

Basically, the CAP theorem states that a database can only have two out of the three properties: Consistency, Availability and Partition tolerance. In the case of distributed databases, partition tolerance is a must, so one has to decide between C and A (note, nonetheless, that even though the theorem ensures that one cannot get all three properties in their strictest form, it is possible to have them in relaxed, yet useful, forms). So, according to this theorem, we need to decide between:

But this is an incomplete picture of the trade-offs that appear in a distributed database.

12.6 PACELC theorem

This is an improved version of the CAP theorem, although it is still oversimplified: in case of a network Partition, the system must trade off Availability against Consistency; Else (during normal operation), it must trade off Latency against Consistency.

12.7 More trade-offs

References

[1]   M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson. System R: Relational approach to database management. ACM Transactions on Database Systems.

[2]   Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Database Systems: The Complete Book. 2002.

[3]   Mahmoud Sakr. INFOH417 Database Systems Architecture. Lecture notes.

[4]   Jan Van den Bussche and Stijn Vansummeren. Translating SQL into the relational algebra.