DBAzine.com: Trees in SQL: Nested Sets and Materialized Path

Trees in SQL: Nested Sets and Materialized Path


by Vadim Tropashko
Relational databases are universally regarded as an advance over their predecessors, the network and hierarchical models. Superior in every querying respect, they turned out to be surprisingly incomplete when modeling transitive dependencies. Almost every couple of months, a question about how to model a tree in the database pops up in the comp.databases.theory newsgroup. In this article I'll investigate two of the four well-known approaches to accomplishing this and show a connection between them. We'll discover a new method that could be considered a "mix-in" between materialized path and nested sets.
Adjacency List
A tree is a special case of a directed acyclic graph (DAG). One way to represent a DAG is:
create table emp (
ename   varchar2(100),
mgrname varchar2(100)
);
Each record of the emp table, identified by ename, refers to its parent mgrname. For example, if JONES reports to KING, then the emp table contains the record (ename = 'JONES', mgrname = 'KING'). Suppose the emp table also includes (ename = 'SCOTT', mgrname = 'JONES'). Then, if the emp table doesn't contain the record (ename = 'SCOTT', mgrname = 'KING'), and the same is true for every pair of adjoined records, it is called an adjacency list. If the opposite is true, then the emp table is a transitively closed relation.
A typical hierarchical query would ask whether SCOTT indirectly reports to KING. Since we don't know the number of levels between the two, we can't tell how many times to self-join emp, so the task can't be solved in traditional SQL. If the transitive closure tcemp of the emp table is known, then the query is trivial:
select 'TRUE' from tcemp
where ename = 'SCOTT' and mgrname = 'KING'
The ease of querying comes at the expense of transitive closure maintenance.
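To make the maintenance cost concrete, here is a minimal Python sketch that derives the transitive closure from an adjacency list by following manager links. The `emp` dictionary and the function name are assumptions made for illustration; they are not part of the original schema.

```python
def transitive_closure(emp):
    """Return the set of (ename, ancestor) pairs for an adjacency list.

    emp maps each employee name to the manager name (None for the root).
    """
    closure = set()
    for ename in emp:
        mgr = emp[ename]
        while mgr is not None:          # climb until we pass the root
            closure.add((ename, mgr))
            mgr = emp.get(mgr)
    return closure


emp = {"KING": None, "JONES": "KING", "SCOTT": "JONES", "ADAMS": "SCOTT"}
tcemp = transitive_closure(emp)
print(("SCOTT", "KING") in tcemp)   # True
print(len(tcemp))                   # 6
```

Four adjacency records already produce six closure records here, and every one of them has to be kept in sync when the hierarchy changes.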
Alternatively, hierarchical queries can be answered with SQL extensions: either SQL3/DB2 recursive query
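with tcemp (ename, mgrname) as (
-- anchor member: the direct reports stored in emp
select ename, mgrname from emp
union all
-- recursive member: extend each known path by one more level
select tcemp.ename, emp.mgrname from tcemp, emp
where tcemp.mgrname = emp.ename
) select 'TRUE' from tcemp
where ename = 'SCOTT' and mgrname = 'KING';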
that calculates tcemp as an intermediate relation, or Oracle proprietary connect-by syntax
select 'TRUE' from (
select ename from emp
connect by prior mgrname = ename
start with ename = 'SCOTT'
) where ename = 'KING';
in which the inner query "chases the pointers" from the SCOTT node to the root of the tree, and then the outer query checks whether the KING node is on the path.
The adjacency list is arguably the most intuitive tree model. Our main focus, however, will be the following two methods.
Materialized Path
In this approach each record stores the whole path to the root. In our previous example, let's assume that KING is the root node. Then the record with ename = 'SCOTT' is connected to the root via the path SCOTT->JONES->KING. Modern databases allow representing a list of nodes as a single value, but since the materialized path was invented long before that, the convention stuck to a plain character string of nodes concatenated with some separator, most often '.' or '/'. In the latter case, the analogy to pathnames in the UNIX file system is especially pronounced.
In a more compact variation of the method, we use sibling numbers instead of the nodes' primary keys within the path string. Extending our example:
ENAME PATH
KING   1
JONES 1.1
SCOTT 1.1.1
ADAMS 1.1.1.1
FORD 1.1.2
SMITH 1.1.2.1
BLAKE 1.2
ALLEN 1.2.1
WARD 1.2.2
CLARK 1.3
MILLER 1.3.1
Path 1.1.2 indicates that FORD is the second child of the parent JONES.
Let's write some queries.
1. An employee FORD and chain of his supervisors:
select e1.ename from emp e1, emp e2
where e2.path like e1.path || '%'
and e2.ename = 'FORD'
2. An employee JONES and all his (indirect) subordinates:
select e1.ename from emp e1, emp e2
where e1.path like e2.path || '%'
and e2.ename = 'JONES'
Although both queries look symmetrical, there is a fundamental difference in their respective performance. If the subtree of subordinates is small compared to the size of the whole hierarchy, then an execution plan in which the database fetches the e2 record by the ename key and then performs a range scan on e1.path is guaranteed to be quick.
On the other hand, the "supervisors" query is roughly equivalent to
select e1.ename from emp e1, emp e2
where e2.path > e1.path and e2.path < e1.path || 'Z'
and e2.ename = 'FORD'
Or, noticing that we essentially know e2.path, it can further be reduced to
select e1.ename from emp e1
where e2path > e1.path and e2path < e1.path || 'Z'
Here, it is clear that indexing on path doesn't work (except for "accidental" cases in which e2path happens to be near the domain boundary, so that predicate e2path > e1.path is selective).
The obvious solution is that we don't have to refer to the database to figure out all the supervisor paths! For example, the supervisors of 1.1.2 are 1.1 and 1. A simple recursive string parsing function can extract those paths, and then the supervisor names can be retrieved by
select e1.ename from emp e1 where e1.path in ('1.1','1')
which should be executed as a fast concatenated plan.
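The recursive string parsing function mentioned above could look roughly like the following Python sketch. The function name and the `sep` parameter are illustrative assumptions; the article does not give an implementation.

```python
def supervisor_paths(path, sep="."):
    """Return the paths of all ancestors of `path`, nearest first."""
    head, found, _ = path.rpartition(sep)
    if not found:                 # no separator left: `path` is a top node
        return []
    return [head] + supervisor_paths(head, sep)


print(supervisor_paths("1.1.2"))   # ['1.1', '1']
```

The resulting list can be bound directly into the IN predicate, so the database only performs a handful of index lookups on path.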
Nested Sets
Both the materialized path and Joe Celko's nested sets provide the capability to answer hierarchical queries with standard SQL syntax. In both models, the global position of the node in the hierarchy is "encoded", as opposed to an adjacency list, in which each link is a local connection between immediate neighbors only. Similar to the materialized path, the nested sets model suffers from the supervisors query performance problem:
select p2.emp from Personnel p1, Personnel p2
where p1.lft between p2.lft and p2.rgt
and p1.emp = 'Chuck'
(Note: This query is borrowed from the previously cited Celko article.) Here, the problem is even more explicit than in the case of a materialized path: we need to find all the intervals that cover a given point. This problem is known to be difficult. Although there are specialized indexing schemes like R-Tree, none of them is as universally accepted as B-Tree. For example, if the supervisor's path contains just 10 nodes and the size of the whole tree is 1000000, none of the indexing techniques could provide a 1000000/10 = 100000 times performance increase. (Such a performance improvement factor is typically associated with an index range scan in a similar, very selective, data volume condition.)
Unlike a materialized path, the trick by which we computed all the nodes without querying the database doesn't work for nested sets.
Another, more fundamental, disadvantage of nested sets is that the nested sets coding is volatile. If we insert a node into the middle of the hierarchy, all the intervals with boundaries above the insertion point have to be recomputed. In other words, when we insert a record into the database, roughly half of the other records need to be updated. This is why the nested sets model received only limited acceptance, namely for static hierarchies.
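The volatility is easy to observe on a toy tree. The Python sketch below numbers the nodes with a depth-first walk in the usual nested sets manner, inserts one node, renumbers, and reports which existing nodes changed. The `children` dictionary layout and the function name are assumptions for illustration, and the sketch renumbers from scratch rather than issuing the incremental updates a database would.

```python
def nested_sets(children, root):
    """Assign (lft, rgt) integers by a depth-first walk."""
    bounds, counter = {}, 0

    def visit(node):
        nonlocal counter
        counter += 1
        lft = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (lft, counter)

    visit(root)
    return bounds


tree = {"KING": ["JONES", "BLAKE", "CLARK"], "JONES": ["SCOTT", "FORD"]}
before = nested_sets(tree, "KING")
tree["JONES"] = ["ADAMS", "SCOTT", "FORD"]   # insert a new first child
after = nested_sets(tree, "KING")
changed = sorted(n for n in before if before[n] != after[n])
print(changed)   # ['BLAKE', 'CLARK', 'FORD', 'JONES', 'KING', 'SCOTT']
```

In this worst case, where the new node lands near the left edge of the tree, every pre-existing node gets at least one new boundary.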
Nested sets are intervals of integers. In an attempt to make the nested sets model more tolerant of insertions, Celko suggested we give up the property that each node always has (rgt-lft+1)/2 children. In my opinion, this is a half-step towards a solution: any gap in a nested set model with large gaps and spreads in the numbering could still be covered with intervals, leaving no space for adding more children, if those intervals are allowed to have boundaries at discrete points (i.e., integers) only. One needs to use a dense domain, like rational or real numbers, instead.
Nested Intervals
Nested intervals generalize nested sets. A node [clft, crgt] is an (indirect) descendant of [plft, prgt] if:
plft <= clft and crgt <= prgt
The domain for interval boundaries is not limited to integers anymore: we admit rational or even real numbers, if necessary. Now, with a reasonable policy, adding a child node is never a problem. One example of such a policy would be finding an unoccupied segment [lft1, rgt1] within a parent interval [plft, prgt] and inserting a child node [(2*lft1+rgt1)/3, (lft1+2*rgt1)/3]:

After insertion, we still have two more unoccupied segments, [lft1, (2*lft1+rgt1)/3] and [(lft1+2*rgt1)/3, rgt1], to add more children to the parent node.
We are going to amend this naive policy in the following sections.
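As a small illustration of the naive policy, the following Python sketch places a child in the middle third of a free segment using exact rational arithmetic. The function name and return shape are assumptions made for this example only.

```python
from fractions import Fraction


def insert_child(lft1, rgt1):
    """Place a child in the middle third of the free segment [lft1, rgt1].

    Returns the child interval and the two free segments left around it.
    """
    lft1, rgt1 = Fraction(lft1), Fraction(rgt1)
    clft = (2 * lft1 + rgt1) / 3
    crgt = (lft1 + 2 * rgt1) / 3
    return (clft, crgt), [(lft1, clft), (crgt, rgt1)]


child, free = insert_child(0, 1)
print(child)                      # (Fraction(1, 3), Fraction(2, 3))
second, _ = insert_child(*free[0])
print(second)                     # (Fraction(1, 9), Fraction(2, 9))
```

Because the rationals are dense, the free segments can be subdivided indefinitely, although the denominators grow by a factor of three with each insertion.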
Partial Order
Let's look at a two-dimensional picture of nested intervals. Let's assume that rgt is the horizontal axis x, and lft is the vertical one, y. Then, the nested intervals tree looks like this:

Each node [lft, rgt] has its descendants bounded within the two-dimensional cone y >= lft & x <= rgt. Since the left interval boundary is always less than the right one, none of the nodes are allowed above the diagonal y = x.
The other way to look at this picture is to notice that a child node is a descendant of the parent node whenever the set of all points defined by the child cone y >= clft & x <= crgt is a subset of the parent cone y >= plft & x <= prgt. A subset relationship between the cones on the plane is a partial order.
Now that we know the two constraints to which tree nodes conform, I'll describe exactly how to place them at the xy plane.
The Mapping
Tree root choice is completely arbitrary: we'll assume the interval [0,1] to be the root node. In our geometrical interpretation, all the tree nodes belong to the lower triangle of the unit square in the xy plane.
We'll describe further details of the mapping by induction. For each node of the tree, let's first define two important points in the xy plane. The depth-first convergence point is the intersection between the diagonal and the vertical line through the node. For example, the depth-first convergence point for the node at x=1, y=1/2 is x=1, y=1. The breadth-first convergence point is the intersection between the diagonal and the horizontal line through the node. For example, the breadth-first convergence point for the node at x=1, y=1/2 is x=1/2, y=1/2.
Now, for each parent node, we define the position of the first child as the midpoint between the parent point and the parent's depth-first convergence point. Then, each subsequent sibling is defined as the midpoint between the previous sibling point and the parent's breadth-first convergence point:

For example, node 2.1 is positioned at x=1/2, y=3/8.
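The inductive definition can be followed step by step in a short Python sketch. The function name is an assumption; the midpoint rules are the ones described above, with the root taken as the point x=1, y=0 (the interval [0,1]).

```python
from fractions import Fraction


def node_point(path):
    """Map a materialized path such as '2.1' to its (x, y) point."""
    x, y = Fraction(1), Fraction(0)            # root interval [0, 1]
    for sibling in (int(s) for s in path.split(".")):
        px, py = x, y                          # parent point
        # First child: halfway to the parent's depth-first point (px, px).
        x, y = px, (py + px) / 2
        # Later siblings: halfway to the parent's breadth-first point (py, py).
        for _ in range(sibling - 1):
            x, y = (x + py) / 2, (y + py) / 2
    return x, y


print(node_point("2.1"))   # (Fraction(1, 2), Fraction(3, 8))
```

Every coordinate produced this way has a power of two as its denominator, since each step only averages two existing values.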
Now that the mapping is defined, it is clear which dense domain we are using: it's not rationals, and not reals either, but binary fractions (although the former two would suffice, of course).
Interestingly, the descendant subtree for the parent node "1.2" is a scaled-down replica of the subtree at node "1.1". Similarly, the subtree at node "1.1" is a scaled-down replica of the tree at node "1". A structure with self-similarities is called a fractal.
Normalization
Next, we notice that x and y are not completely independent. We can tell what both x and y are if we know their sum. Given the numerator and denominator of the rational number representing the sum of the node coordinates, we can calculate the x and y coordinates back as:
function x_numer( numer integer, denom integer )
RETURN integer IS
ret_num integer;
ret_den integer;
BEGIN
-- x = (numer+1)/(2*denom)
ret_num := numer+1;
ret_den := denom*2;
-- reduce the fraction by the common power of two
while floor(ret_num/2) = ret_num/2 loop
ret_num := ret_num/2;
ret_den := ret_den/2;
end loop;
RETURN ret_num;
END;
 
function x_denom( numer integer, denom integer )
...
RETURN ret_den;
END;
in which the body of function x_denom differs from x_numer in the return variable only. Informally, the numer+1 increment moves the ret_num/ret_den point vertically up to the diagonal, and then the x coordinate is half of the value, so we just multiply the denominator by two. Next, we reduce both numerator and denominator by the common power of two.
Naturally, y coordinate is defined as a complement to the sum:
function y_numer( numer integer, denom integer )
RETURN integer IS
num integer;
den integer;
BEGIN
num := x_numer(numer, denom);
den := x_denom(numer, denom);
while den < denom loop
num := num*2;
den := den*2;
end loop;
num := numer - num;
while floor(num/2) = num/2 loop
num := num/2;
den := den/2;
end loop;
RETURN num;
END;
 
function y_denom( numer integer, denom integer )
...
RETURN den;
END;
Now, the test (where 39/32 is the node 1.3.1):
select x_numer(39,32)||'/'||x_denom(39,32),
y_numer(39,32)||'/'||y_denom(39,32) from dual
 
5/8 19/32
 
select 5/8+19/32, 39/32 from dual
 
1.21875 1.21875
I don't use floating point to represent rational numbers, and wrote all the functions with integer arithmetic instead. To put it bluntly, the floating point number concept in general, and the IEEE standard in particular, is useful for rendering 3D-game graphics only. In the last test, however, we used floating point just to verify that 5/8 and 19/32, returned by the previous query, do indeed add up to 39/32.
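The same split of the sum into x and y can be sketched in Python with exact rationals, which sidesteps floating point in the verification step as well. The function name is an assumption, and `fractions.Fraction` stands in for the hand-written power-of-two reduction loops.

```python
from fractions import Fraction


def coordinates(numer, denom):
    """Split the encoded sum numer/denom into the point (x, y)."""
    total = Fraction(numer, denom)
    x = Fraction(numer + 1, 2 * denom)   # lift to the diagonal, then halve
    return x, total - x


x, y = coordinates(39, 32)
print(x, y)                       # 5/8 19/32
print(x + y == Fraction(39, 32))  # True
```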
We'll store two integer numbers, the numerator and denominator of the sum of the coordinates x and y, as an encoded node path. Incidentally, Celko's nested sets use two integers as well. Unlike nested sets, our mapping is stable: each node has a predefined placement in the xy plane, so that queries involving node position in the hierarchy can be answered without reference to the database. In this respect, our hierarchy model is essentially a materialized path encoded as a rational number.
Finding Parent Encoding and Sibling Number
Given a child node with numer/denom encoding, we find the node's parent like this:
function parent_numer( numer integer, denom integer )
RETURN integer IS
ret_num integer;
ret_den integer;
BEGIN
if numer=3 then
return NULL;
end if;
ret_num := (numer-1)/2;
ret_den := denom/2;
while floor((ret_num-1)/4) = (ret_num-1)/4 loop
ret_num := (ret_num+1)/2;
ret_den := ret_den/2;
end loop;
RETURN ret_num;
END;
 
function parent_denom( numer integer, denom integer )
...
RETURN ret_den;
END;
The idea behind the algorithm is the following: if the node is on the very top level (and all these nodes have a numerator equal to 3), then the node has no parent. Otherwise, we must move vertically down the xy plane by a distance equal to the distance from the depth-first convergence point. If the node happens to be the first child, then that is the answer. Otherwise, we must move horizontally by a distance equal to the distance from the breadth-first convergence point until we meet the parent node.
Here is the test of the method (in which 27/32 is the node 2.1.2, while 7/8 is 2.1):
select parent_numer(27,32)||'/'||parent_denom(27,32) from dual
 
7/8
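For readers who want to trace the arithmetic outside the database, here is a Python transcription of the parent computation. It is a sketch that returns a (numer, denom) tuple instead of using two separate functions; that packaging is an assumption of the example.

```python
def parent(numer, denom):
    """Return the (numer, denom) encoding of the parent, or None at top level."""
    if numer == 3:                       # top-level nodes are 3/2, 3/4, 3/8, ...
        return None
    num, den = (numer - 1) // 2, denom // 2
    while (num - 1) % 4 == 0:            # not yet at the parent: undo one sibling step
        num, den = (num + 1) // 2, den // 2
    return num, den


print(parent(27, 32))   # (7, 8)
print(parent(19, 16))   # (3, 2)
```

The second call walks back from node 1.3 (19/16) through two sibling steps before it reaches node 1 (3/2).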
In the previous method, counting the steps when navigating horizontally would give the sibling number:
function sibling_number( numer integer, denom integer )
RETURN integer IS
ret_num integer;
ret_den integer;
ret integer;
BEGIN
-- Unlike parent_numer, there is no early NULL return for numer=3:
-- top-level nodes are handled by the stop condition inside the loop.
ret_num := (numer-1)/2;
ret_den := denom/2;
ret     := 1;
while floor((ret_num-1)/4) = (ret_num-1)/4 loop
if ret_num=1 and ret_den=1 then
return ret;
end if;
ret_num := (ret_num+1)/2;
ret_den := ret_den/2;
ret     := ret+1;
end loop;
RETURN ret;
END;
For a node at the very first level, a special stop condition (ret_num=1 and ret_den=1) is needed, because the horizontal walk would otherwise never leave the root.
The test:
select sibling_number(7,8) from dual
 
1
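The counting can be checked with a Python sketch of the same loop. This version deliberately has no early exit for numerator 3, so that top-level nodes are resolved by the ret_num=1 and ret_den=1 stop condition; that choice is an assumption of the sketch, made so that node 2 (3/4) yields 2.

```python
def sibling_number(numer, denom):
    """Return the 1-based position of a node among its siblings."""
    num, den, count = (numer - 1) // 2, denom // 2, 1
    while (num - 1) % 4 == 0:
        if num == 1 and den == 1:        # reached the root: top-level node
            return count
        num, den, count = (num + 1) // 2, den // 2, count + 1
    return count


print(sibling_number(7, 8))     # 1
print(sibling_number(27, 32))   # 2
print(sibling_number(3, 4))     # 2
```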
Calculating Materialized Path and Distance between nodes
Strictly speaking, we don't have to use a materialized path, since our encoding is an alternative. On the other hand, a materialized path provides a much more intuitive visualization of the node position in the hierarchy, so we can use the materialized path for input and output of the data if we provide the mapping to our model.
The implementation is a simple application of the methods from the previous section. We print the sibling number, jump to the parent, then repeat these two steps until we reach the root:
function path( numer integer, denom integer )
RETURN varchar2 IS
BEGIN
if numer is NULL then
return '';
end if;
RETURN path(parent_numer(numer, denom),
parent_denom(numer, denom))
|| '.' || sibling_number(numer, denom);
END;
 
select path(15,16) from dual
 
.2.1.1
Now we are ready to write the main query: given two nodes, P and C, is P a (possibly indirect) parent of C? A more general query would return the number of levels between P and C if C is reachable from P, and some exception indicator otherwise:
function distance( num1 integer, den1 integer,
num2 integer, den2 integer )
RETURN integer IS
BEGIN
if num1 is NULL then
return -999999;
end if;
if num1=num2 and den1=den2 then
return 0;
end if;
RETURN 1+distance(parent_numer(num1, den1),
parent_denom(num1, den1),
num2,den2);
END;
 
select distance(27,32,3,4) from dual
 
2
Negative numbers are interpreted as exceptions. If the num1/den1 node is not reachable from num2/den2, then the navigation converges to the root, and level(num1/den1)-999999 would be returned (readers are advised to find a less clumsy solution).
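One possible less clumsy variant, sketched in Python, walks up iteratively and returns None when the second node is not on the way to the root. The tuple representation and the use of None as the "not reachable" marker are assumptions of this sketch, not the article's design.

```python
def parent(numer, denom):
    """Parent encoding as (numer, denom), or None for a top-level node."""
    if numer == 3:
        return None
    num, den = (numer - 1) // 2, denom // 2
    while (num - 1) % 4 == 0:
        num, den = (num + 1) // 2, den // 2
    return num, den


def distance(node, ancestor):
    """Levels from node up to ancestor, or None if ancestor is not above node."""
    steps = 0
    while node is not None:
        if node == ancestor:
            return steps
        node = parent(*node)
        steps += 1
    return None


print(distance((27, 32), (3, 4)))   # 2
print(distance((27, 32), (3, 2)))   # None
```

In SQL the same idea would map naturally onto a NULL result, which a predicate such as distance(...) > 0 already treats as "no match".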
The alternative way to answer whether two nodes are connected is by simply calculating the x and y coordinates and checking whether the parent interval encloses the child. Although neither method refers to disk, checking whether the partial order exists between the points seems much less expensive! On the other hand, it is just a computer architecture artifact that comparing two integers is an atomic operation. A more thorough implementation of the method would involve a domain of integers with an unlimited range (those kinds of numbers are supported by computer algebra systems), so that a comparison operation would be iterative as well.
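The coordinate-based check can be sketched in a few lines of Python. The helper names are assumptions; Python integers and `Fraction` are arbitrary precision, so this also illustrates the unlimited-range variant mentioned above.

```python
from fractions import Fraction


def interval(numer, denom):
    """Return (lft, rgt) = (y, x) for the node encoded as numer/denom."""
    rgt = Fraction(numer + 1, 2 * denom)
    return Fraction(numer, denom) - rgt, rgt


def encloses(parent, child):
    """True if the parent's interval contains the child's interval."""
    plft, prgt = interval(*parent)
    clft, crgt = interval(*child)
    return plft <= clft and crgt <= prgt


print(encloses((3, 4), (27, 32)))   # True
print(encloses((3, 2), (27, 32)))   # False
```

Note that a node encloses itself under this test, so a strict descendant check needs an additional inequality on the encodings.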
Our system wouldn't be complete without a function inverse to path, which returns a node's numer/denom value once the path is provided. Let's introduce two auxiliary functions first:
function child_numer( num integer, den integer, child integer )
RETURN integer IS
BEGIN
RETURN num*power(2, child)+3-power(2, child);
END;
 
function child_denom( num integer, den integer, child integer )
RETURN integer IS
BEGIN
RETURN den*power(2, child);
END;
 
select child_numer(3,2,3) || '/' ||
child_denom(3,2,3) from dual
 
19/16
For example, the third child of the node 1 (encoded as 3/2) is the node 1.3 (encoded as 19/16).
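The child formula and the path encoding built on it can be tried out with the following Python sketch. The function names are illustrative assumptions; the arithmetic is the one from child_numer and child_denom, starting from the root encoded as 1/1.

```python
def child(numer, denom, k):
    """Encoding of the k-th child of the node numer/denom."""
    return numer * 2**k + 3 - 2**k, denom * 2**k


def encode_path(path):
    """Encode a materialized path such as '2.1.3' as (numer, denom)."""
    numer, denom = 1, 1                  # the root interval [0, 1]
    for k in path.split("."):
        numer, denom = child(numer, denom, int(k))
    return numer, denom


print(child(3, 2, 3))          # (19, 16)
print(encode_path("2.1.3"))    # (51, 64)
```

The denominator ends up as two raised to the sum of the sibling numbers along the path, which is what determines how many bits an encoding needs.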
The path encoding function is:
function path_numer( path varchar2 )
RETURN integer IS
num integer;
den integer;
postfix varchar2(1000);
sibling varchar2(100);
BEGIN
num := 1;
den := 1;
postfix := '.' || path || '.';
while length(postfix) > 1 loop
sibling := substr(postfix, 2,
instr(postfix,'.',2)-2);
postfix := substr(postfix,
instr(postfix,'.',2),
length(postfix)-instr(postfix,'.',2)+1);
num := child_numer(num,den,to_number(sibling));
den := child_denom(num,den,to_number(sibling));
end loop;
RETURN num;
END;
 
function path_denom( path varchar2 )
...
RETURN den;
END;
 
select path_numer('2.1.3') || '/' ||
path_denom('2.1.3') from dual
 
51/64
The Final Test
Now that the infrastructure is completed, we can test it. Let's create the hierarchy
create table emps (
name varchar2(30),
numer integer,
denom integer
)
 
alter table emps
ADD CONSTRAINT uk_name UNIQUE (name) USING INDEX
(CREATE UNIQUE INDEX name_idx on emps(name))
ADD CONSTRAINT UK_node
UNIQUE (numer, denom) USING INDEX
(CREATE UNIQUE INDEX node_idx on emps(numer, denom))
and fill it with some data:
insert into emps values ('KING',
path_numer('1'),path_denom('1'));
insert into emps values ('JONES',
path_numer('1.1'),path_denom('1.1'));
insert into emps values ('SCOTT',
path_numer('1.1.1'),path_denom('1.1.1'));
insert into emps values ('ADAMS',
path_numer('1.1.1.1'),path_denom('1.1.1.1'));
insert into emps values ('FORD',
path_numer('1.1.2'),path_denom('1.1.2'));
insert into emps values ('SMITH',
path_numer('1.1.2.1'),path_denom('1.1.2.1'));
insert into emps values ('BLAKE',
path_numer('1.2'),path_denom('1.2'));
insert into emps values ('ALLEN',
path_numer('1.2.1'),path_denom('1.2.1'));
insert into emps values ('WARD',
path_numer('1.2.2'),path_denom('1.2.2'));
insert into emps values ('MARTIN',
path_numer('1.2.3'),path_denom('1.2.3'));
insert into emps values ('TURNER',
path_numer('1.2.4'),path_denom('1.2.4'));
insert into emps values ('CLARK',
path_numer('1.3'),path_denom('1.3'));
insert into emps values ('MILLER',
path_numer('1.3.1'),path_denom('1.3.1'));
commit;
All the functions written in the previous sections are conveniently combined in a single view:
create or replace
view hierarchy as
select name, numer, denom,
y_numer(numer,denom) numer_left,
y_denom(numer,denom) denom_left,
x_numer(numer,denom) numer_right,
x_denom(numer,denom) denom_right,
path (numer,denom) path,
distance(numer,denom,3,2) depth
from emps
And, finally, we can create the hierarchical reports.
1. Depth-first enumeration, ordering by left interval boundary
select lpad(' ',3*depth)||name
from hierarchy order by numer_left/denom_left
 
LPAD('',3*DEPTH)||NAME
-----------------------------------------------
KING
   CLARK
      MILLER
   BLAKE
      TURNER
      MARTIN
      WARD
      ALLEN
   JONES
      FORD
         SMITH
      SCOTT
         ADAMS
2. Depth-first enumeration, ordering by right interval boundary
select lpad(' ',3*depth)||name
from hierarchy order by numer_right/denom_right desc
 
LPAD('',3*DEPTH)||NAME
-----------------------------------------------------
KING
   JONES
      SCOTT
         ADAMS
      FORD
         SMITH
   BLAKE
      ALLEN
      WARD
      MARTIN
      TURNER
   CLARK
      MILLER
3. Depth-first enumeration, ordering by path (output identical to #2)
select lpad(' ',3*depth)||name
from hierarchy order by path
 
LPAD('',3*DEPTH)||NAME
-----------------------------------------------------
KING
   JONES
      SCOTT
         ADAMS
      FORD
         SMITH
   BLAKE
      ALLEN
      WARD
      MARTIN
      TURNER
   CLARK
      MILLER
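The three orderings can be cross-checked without a database. The Python sketch below recomputes the interval boundaries for the thirteen employees and sorts by them. One detail is an assumption of the sketch: a first child shares its parent's right boundary (KING, JONES, SCOTT and ADAMS all have x = 1), so the right-boundary ordering uses the left boundary as a tie-breaker, which the SQL query above does not specify.

```python
from fractions import Fraction

PATHS = {
    "KING": "1", "JONES": "1.1", "SCOTT": "1.1.1", "ADAMS": "1.1.1.1",
    "FORD": "1.1.2", "SMITH": "1.1.2.1", "BLAKE": "1.2", "ALLEN": "1.2.1",
    "WARD": "1.2.2", "MARTIN": "1.2.3", "TURNER": "1.2.4",
    "CLARK": "1.3", "MILLER": "1.3.1",
}


def bounds(path):
    """Return (lft, rgt) for a materialized path, via the numer/denom sum."""
    numer, denom = 1, 1
    for k in map(int, path.split(".")):
        numer, denom = numer * 2**k + 3 - 2**k, denom * 2**k
    rgt = Fraction(numer + 1, 2 * denom)
    return Fraction(numer, denom) - rgt, rgt


by_left = sorted(PATHS, key=lambda name: bounds(PATHS[name])[0])
by_right = sorted(PATHS, key=lambda name: (-bounds(PATHS[name])[1],
                                           bounds(PATHS[name])[0]))
by_path = sorted(PATHS, key=lambda name: PATHS[name])
print(by_left[:4])          # ['KING', 'CLARK', 'MILLER', 'BLAKE']
print(by_right == by_path)  # True
```

Sorting path strings lexicographically only agrees with the tree order while all sibling numbers are single digits, as they are in this data set.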
4. All the descendants of JONES, excluding himself:
select h1.name from hierarchy h1, hierarchy h2
where h2.name = 'JONES'
and distance(h1.numer, h1.denom,
h2.numer, h2.denom)>0;
 
NAME
------------------------------
SCOTT
ADAMS
FORD
SMITH
5. All the ancestors of FORD, excluding himself:
select h2.name from hierarchy h1, hierarchy h2
where h1.name = 'FORD'
and distance(h1.numer, h1.denom,
h2.numer, h2.denom)>0;
 
NAME
------------------------------
KING
JONES
--
Vadim Tropashko works for the Real World Performance group at Oracle Corp. In a prior life he was an application programmer and translated "The C++ Programming Language" by B. Stroustrup, 2nd edition, into Russian. His current interests include SQL optimization, constraint databases, and computer algebra systems.
Contributor: Vadim Tropashko
Last modified: 2005-04-13 02:49 PM
Incorrect return in sibling_number() ?
Posted by crocodile2u on 2006-02-03 01:11 AM
[quote]
function sibling_number( numer integer, denom integer )
RETURN integer IS
ret_num integer;
ret_den integer;
ret integer;
BEGIN
if numer=3 then
return NULL;
end if;
....
RETURN ret;
END;
[/quote]
Is this behaviour really correct? What if we have numer=3 and denom=4? I'd rather expect sibling_number(numer, denom) to return integer 2, but never NULL.
( I realize that the article doesn't focus on details of implementation, but nonetheless... )
sibling number (posted by vadim on 2006-02-08 05:05 PM)
ints too small
Posted by sandfly on 2006-03-03 09:25 AM
This is an interesting idea, but it has problems. If we write out the rational number as a binary fraction, we can see that you are basically encoding the path as sequences of binary zeros, and the 'dots' are the ones: node 2.3.2.1 would encode as:
0.01001011
and the path is: x2xx3x21
This implies the sum of the node numbers equals the number of binary digits required to express the numerator. Clearly you are going to run out if you use a normal 32 or 64 bit int. In this sense it has the same problems as Celko's proposal.
For fun, I analysed our portfolio hierarchy. We would need 115 bits to represent the numerator for 2344 portfolios.
ints too small - I agree! (posted by innovate2000 on 2006-03-03 06:27 PM)
see article: (posted by innovate2000 on 2006-03-03 11:51 PM)