Hierarchical Datacubes

Abstract

Many approaches have been proposed to pre-compute data cubes in order to efficiently respond to OLAP queries in data warehouses. However, few have proposed solutions integrating all of the possible outcomes, and it is this idea that leads the integration of hierarchical dimensions into these responses. To meet this need, we propose, in this paper, a complete redefinition of the framework and the formal definition of traditional database analysis through the prism of hierarchical dimensions. After characterizing the hierarchical data cube lattice, we introduce the hierarchical data cube and its most concise reduced representation, the closed hierarchical data cube. It offers compact replication so as to optimize storage space by removing redundancies of strongly correlated data. Such data are typical of data warehouses, and in particular in video games, our field of study and experimentation, where hierarchical dimension attributes are widely represented.

Share and Cite:

Nevot, M. , Nedjar, S. and Lakhal, L. (2023) Hierarchical Datacubes. Journal of Computer and Communications, 11, 43-72. doi: 10.4236/jcc.2023.116004.

1. Introduction and Motivations

Data warehouses ( [1] ), allow the storage of huge volumes of data accumulated over time in operational databases. In fact, recently, the evolution of technologies has led companies to store their data and thus preserve the “memory” of their activities. Data warehouses have been designed for this purpose. Unlike operational databases, they have some notable characteristics. First of all, they are not intended for the day-to-day management of the information system, but to provide genuine assistance in decision-making. Their users, the decision-makers, are therefore few in number and are interested not in the detail of the data but in general trends, according to this or that criterion, not explicitly stored. Warehouses’ data are also peculiar. Often inserted in the warehouse when refreshing, these data are persistent as they will not be updated. In addition, when inserted, these data are provided with a timestamp. They are said to be historicized. From the data deposits thus constituted, it was natural to seek to make the best use of it. Here again, it is not a question of formulating classic, simple and frequent requests fetching usually some tuples1, but to carry out analyses that require aggregation of the data in order to identify major trends. Such queries are particularly expensive because they require scans of large volumes of data and they are inherently complex. However, being part of a decision support process (hence the acronym OLAP for online analytical processing2), the formulation of these queries depends on both prior knowledge and the needs of decision-makers ( [3] [4] ). OLAP is opposed to online transaction processing (OLTP). It therefore takes place on an ad hoc basis and ideally should be interactive. However, the underlying calculation is complex and time-consuming, hence the idea of pre-calculating the results.

However, as it is not obvious how to predict the decision-makers’ assumptions, the preliminary calculations must consider all possible outcomes. It is this idea that encouraged us to integrate the hierarchical dimensions into these answers ( [5] ). To meet this need, we propose in this paper a complete redefinition of the framework and formal definition of traditional database analysis ( [6] ), the datacube, and its closed datacube through the prism of hierarchical dimensions. Hierarchies add their own modeling complexity to those of data warehouses ( [7] ), it is for this reason, and the consequent computation times that they induce, that few authors have been interested in this field. However, we believe that a work of framing and formal definition of hierarchical dimensions is judicious. Moreover, in general, we will consider two classes of OLAP operators. The first is dedicated to the data structure and allows its manipulation in order to retrieve remarkable information. The second is dedicated to the granularity of the data and proposes operators that aggregate and synthesise the information in one direction (roll-up), and which break them down or specify them in the other (drill down) ( [8] ). Dimensional hierarchies define the access paths in the data and allow the implementation of this second class of OLAP operators. It is also for this reason that studying the hierarchical dimensions is relevant although the volume of data generated, even greater than in the case of the classic datacube, must be understood by a particularly reduced approach to be completely usable.

Computing a hierarchical datacube, taking into account the hierarchical structure of the dimensions, can be judicious to obtain an optimized aggregation, a simplified analysis, an intuitive navigation and to help decision-making based on the hierarchies. Using the hierarchical structure of dimensions, it is possible to perform optimized aggregations by consolidating data at different levels of granularity. This allows data to be pre-aggregated at higher levels of the hierarchy, which reduces the size of the data cube and therefore improves query performance. A hierarchical datacube also helps simplify analysis by providing an aggregated view of data at different levels of granularity. The aim is to facilitate the understanding of trends and relationships between the different dimensions, avoiding the complexity of detailed data. The hierarchical structure of the dimensions in a hierarchical datacube also allows intuitive and easy navigation between the different levels of granularity. This allows users to explore data more efficiently and navigate between levels of detail and aggregations without extra effort. This can improve user experience and speed up data analysis. Finally, in many cases, the hierarchical structure of dimensions is meaningful from a decision-making perspective. By using a hierarchical datacube, it is possible to take this hierarchical structure into account to obtain more relevant decision information that is better aligned with the reality of the domain.

2. Use Cases

2.1. Open Match 3

For this paper, we introduce an example applied to a video game, the one we initially studied and experimented with when researching this contribution.

In this example, we will use the video puzzle-game3 of the Match 34 type open source5 Open Match 3, a web browser-based video game6, developed by Giordano Ferdinandi, Stephen Surtees and Rachel Kehoe of Clockwork Chilli.

Initially, we focused on this type of video game because of its immense popularity, making the example more eloquent and its application relevant. In addition, besides its ethical side, this choice of open source allowed us to have access and to directly modify the source code of the video game in order to accommodate all the probes we wanted to listen to.

2.2. Relation Example

Let’s consider the OM3 data warehouse (cf. Figure 1) of the Open Match 3 video game (cf. Figure 2).

Figure 1. Star schema of data warehouse OM3.

Figure 2. Video game open match 3.

A DTurn is identified by IdTurn and classified by Game and by Round. These attributes constitute a hierarchical structure, called Step (cf. Table 1).

A DSeries is identified by IdSeries and characterized by a Move causing one (or more, then called a chain) Combination of an Association, characterized by a matching Color (among the 4) and a Size of at least three elements. From an Association of four elements or more, bonus points and special elements (explosion, rayon destructeur, etc., cf. Figure 3) are granted (cf. Table 2).

Un DPlayer is identified by IdPlayer and classified by its Location (full). The most general location type is a Country (C), which includes a Region (R), which itself includes a City (C), specified by an IPAddress (I), itself specified by an OS (O, i.e. an operating system), a Browser (B), then a Lang (L) thus, at last, a Player (P) name. The values of the different attributes of DPlayer are coded as follows (cf. Table 3):

Figure 3. Video game open Match 3, example of an explosion.

Table 1. Relation of the dimension DTurn.

Table 2. Relation of the dimension DSeries.

Table 3. Relation of the dimension DPlayer.

The table of facts FactOM3 is composed of a Time remaining before the end of the game (naturally decreasing, but also increasing to a maximum with the Score, it is thus a decisional element), a Duration of realization, a Number of elements in correspondence, a Score (number of points won) total and a description of Shape (0.5 for horizontal, 1 for vertical, and between these two values for mixed) for a IdPlayer (IdP), a IdTurn (IdT) and a IdSeries (IdS). The values of the different attributes FactOM3 are coded as follows (cf. Table 4):

We only give a representation of a hierarchical dimension for DPlayer (cf. Figure 4), DTurn and DSeries being hierarchical dimensionswhich are much less well founded than this one. The maximum element ALLPlayer of this representation is defined by 3.5 in Subsection 3.2.

Figure 4. Dimension of DPlayer.

Table 4. Relation of the table of facts FactOM3.

3. Hierarchical Dimension

3.1. Types of Hierarchies

Dimensional hierarchies are formed by the set of attributes of a dimensional relation having a hierarchical relationship ( a a ). They correspond to relations of type 1 to several and offer the possibility of using the class of roll up and drill down ( [8] ).

The literature provides an overview of the different types of hierarchies ( [1] ) that we propose to present.

3.1.1. Simple Hierarchy

Simple hierarchies have levels that can be considered as lists and their instances form a tree. These hierarchies can be symmetrical if the tree of instances is balanced or asymmetrical if it is not.

3.1.2. Strict Hierarchy and Non-Strict Hierarchy

In the framework of a strict hierarchy, the generalisation of a value at a given level only leads to, at most, one value.

Example 3.1 The month-level generalisation of the date July, 14th 1789 is July.

In practice, there are many situations where this type of hierarchy is too restrictive. Typically, a video game can be of several game types. Such hierarchies are called non-strict.

3.1.3. Multiple Hierarchy

Multiple hierarchies make it possible to model situations where the different levels are no longer lists but directed graphs without cycles.

Example 3.2 Let us consider a time dimension. There are two ways of generalising the day level: on the week level or the month level, although these two levels are not comparable (i.e. one is not a generalisation of the other).

3.1.4. Parallel Hierarchies

For dimensions with only one attribute, there can only be one hierarchy per dimension. In practice, it is common for a single dimension to be composed of several attributes useful for the analysis.

If these attributes have an associated hierarchy, the dimensional hierarchy is said to be parallel and can be considered as a special case of multiple hierarchies. A parallel hierarchy can be composed of other types of hierarchies. Two categories of parallel hierarchies are generally distinguished:

• Independent parallel hierarchy: there is no common level between the different hierarchies composing it;

• Parallel hierarchy: there are common levels between the different hierarchies composing it; the hierarchy is said to be independent.

3.2. Formal Framework of a Hierarchical Dimension

Let r be a relation of schema R . In addition to the attributes of R , we will consider an additional attribute RowId qui which is implicit. The values of this attribute serve as a unique identifier for each tuple and are assigned at the time of their insertion. We note t x the tuple such that t [ R o w I d ] = x .

The attributes of R are divided into two sets:

1) D is the set of attributes of hierarchical dimensions, also called hierarchical categories. Moreover, the attributes of D are ordered.

2) M is the set of measured attributes, on which the aggregative or statistical functions are applied.

We simply call dimension a hierarchical dimension attribute. Likewise, we call hierarchy a hierarchy of values.

Definition 3.1 (Dimension)—In r, we call a dimension, a dimensional attribute, or a category, d i . d i D , r ( d i ) is the projection of r onto d i .

A dimension is characterised by a structure (cf. Definition 3.2) and a hierarchy (cf. Definition 3.3).

We call D the set of dimensions d i of r.

Example 3.3 Let’s consider the data warehouse OM3. Its set of dimensions D is:

D = { D Player , D Turn , D Series }

Definition 3.2 (Structure of a dimension)—For each dimension d i D , we define the structure of a dimension, or schema, S h as follows: d i D , S h ( d i ) is the structure of the dimension d i .

Example 3.4 Let’s consider the set of hierarchical dimension structures:

S h ( D Player ) = ( Τ Player , Country , Region , City , IPAddress , O S , Browser , Lang , Player )

S h ( D Turn ) = ( Τ Turn , Game , Round )

S h ( D Series ) = ( Τ Series , Move , Combination )

Definition 3.3 (Hierarchy)—For each dimension d i D , we define h i the hierarchy associated with that dimension, which we will simply call hierarchy.

Definition 3.4 (Level of a hierarchy)—For each hierarchy h i , we define h i ( e ) , the level, or tier, or echelon, of this hierarchy.

Definition 3.5 (Maximum element of a hierarchy)—A hierarchy h i associated with the dimension d i D contains a maximum element denoted Τ i defined by the value A L L i . This value takes the idea of the ALL from [9] , which represents the generalization of all the values of the domain of a dimension.

Example 3.5 Considering the hierarchy h Player of the dimension D Player (cf. Figure 4), the value of its maximum element is A L L Player .

Definition 3.6 (Depth of a hierarchy)—For each hierarchy h i , we define its Depth as follows: h i d i , d i D , P r o f ( h i ) is the depth of the hierarchy h i , i.e. its number of levels.

Example 3.6 Considering the hierarchy h Player of the dimension D Player (cf. Figure 4), D e p t h ( h Player ) = 9 .

Definition 3.7 (Domain of a dimension)—We define the domain of a dimension Dom as follows: d i D , D o m ( d i ) is the domain of dimension d i .

Definition 3.8 (Domain of a level of a dimension)—We define the domain of a level of a dimension, or dimensional footprint, Dom as follows:

d i D , e S h ( d i ) , D o m ( d i , e ) is the domain of the dimension d i for the level e, i.e. D o m ( d i , e ) corresponds to the set of possible values for the level e of the hierarchy h i ;

d i D , D o m ( h i ) = e S h ( d i ) D o m ( h i , e ) .

Definition 3.9 (Dimensional table)—In the “star” data model used in data warehouses (or datamart), each hierarchy can be represented by a table, of dimension D d i corresponding to the dimension eponymous (the two concepts are mingled), relational schema S h ( d i ) .

Definition 3.10 (Dimensional tuple)—For each x D o m ( d i ) , we define the dimensional tuple t x corresponding to the value x in the table D d i is made up of the list of ancestors of x in h i .

Likewise, for each x D o m ( d i , e ) , we define the dimensional tuple t x corresponding to the value x for the level e of the hierarchy h i in the table D d i is also made up of the list of ancestors of x in h i .

We simply call tuple a dimensional tuple.

Example 3.7 For x = P 2 , the tuple t x corresponding in the relation D Player is:

t x = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, L i n u x , O p e r a , e n , P 2 )

For x = 139.124.242.125 , the tuple t x corresponding in the relation D Player is:

t x = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, N U L L , N U L L , N U L L , N U L L )

3.3. Formal Definition of a Hierarchical Dimension

3.3.1. Multidimensional Space

Definition 3.11 (Multidimensional space)—The multidimensional space of r is noted and redefined with hierarchical dimensions as follows:

S p a c e ( r ) = { × d i D D o m ( h i ) { A L L i , , A L L j } } { ( , , ) }

where × symbolizes the Cartesian product, and ( , , ) the combination of empty values. is a tuple and represents a multidimensional pattern.

The value of each maximal element A L L i of a hierarchy h i is naturally contained in the hierarchy, in the associated domain D o m ( h i ) , and thus also in S p a c e ( r ) .

Example 3.8 The multidimensional space of the OM3, data warehouse, and its table of facts (cf. Table 4), is illustrated in Table 5. For the sake of clarity and conciseness, not all tuples in the multidimensional space are represented (there would be several thousand), and special values are abbreviated: A L L Player to A L L P , A L L Turn to A L L T and A L L Series to A L L S .

Tuples identified by values 1, …, 13 et 31 are possible tuples for r (because all their values are real), even if the tuple identified by value 31 is not explicitly in the relation r. In this tuple, the symbol means “empty value”. The other tuples (identifiers 14, …, 30) cannot be tuples of r because they contain at least one A L L i value. They therefore convey information at a more aggregated level of detail than the previous ones.

3.3.2. Specialisation Orders

The multidimensional space of r is structured by the specialisation relation between tiles. This order was originally introduced by [10] [11] in the context of concept learning.

The ordered set C L ( r ) = S p a c e ( r ) , s is a complete lattice called cube lattice.

Definition 3.12 (Intradimensional specialization order)—Let’s consider x , y D o m ( h i ) :

x d i y x isanancestorof y onthehierarchy h i .

Definition 3.13 (Multidimensional specialization order)—Let’s consider t , t S p a c e ( r ) , we define the order relation s as follows:

t s t d i D , t [ d i ] d i t [ d i ]

3.3.3. Functions

Definition 3.14 (Dimensional Attribute function) For a tuple t x de d i , the function A t t r i b u t e d i returns the set of attributes whose value is different from NULL.

Let’s consider t x a tuple of d i :

A t t r i b u t e d i ( t x ) = { A t x | t x [ A ] N U L L }

Example 3.9 For the dimension D Player :

If t x = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, L i n u x , O p e r a , e n , P 2 )

Then A t t r i b u t e D Player ( t x ) = { F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, L i n u x , O p e r a , e n , P 2 }

Table 5. Multidimensional space of the data warehouse OM3.

If t y = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, N U L L , N U L L , N U L L , N U L L )

Then A t t r i b u t e D Player ( t y ) = { F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125 } .

3.3.4. Operators

Definition 3.15 (Dimensional min/max operators) - We define the operators min and max as follows:

m i n d i ( d i ) = { x D o m ( h i , e ) | e S h ( d i ) : e < e }

m a x d i ( d i ) = { x D o m ( h i , e ) | e S h ( d i ) : e > e }

Thus, m i n d i ( d i ) represents the set of possible values of the highest (general) level of the hierarchical dimension structure. In a similar way, m a x d i ( d i ) represents the set of possible values of the lowest (specific) level of the hierarchical dimension structure.

Example 3.10 For the dimension D Player :

m i n D P l a y e r ( D Player ) = { F r a n c e }

m a x D P l a y e r ( D Player ) = { P 1 , P 2 , P 3 }

The min/max operators can be overloaded in order to be applied to a set of tuples T of dimension d i , noted respectively m i n s ( T ) and m a x s ( T ) , by being implemented as follows:

m i n d i ( T ) = x D o m ( h i , e ) , t x | e S h ( d i ) : e < e

m a x d i ( T ) = x D o m ( h i , e ) , { t x | e S h ( d i ) : e > e }

Example 3.11 - For the dimension D Player , if T = { t x , t y } with:

t x = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, L i n u x , O p e r a , e n , P 2 )

t y = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, M a c O S , F i r e f o x , e s , P 3 )

So, we have:

m i n D Player ( T ) = ( F r a n c e , N U L L , N U L L , N U L L , N U L L , N U L L , N U L L , N U L L )

m a x D Player ( T ) = { ( N U L L , N U L L , N U L L , N U L L , N U L L , N U L L , N U L L , P 2 ) , ( N U L L , N U L L , N U L L , N U L L , N U L L , N U L L , N U L L , P 3 ) }

Definition 3.16 (Generalized min/max operators on the cube lattice)—We generalize the min operator on the cube lattice as follows:

{ T C L ( r ) , m i n s ( T ) = { t T | t T : t s t } d i D , m i n s ( T [ d i ] ) = { t [ d i ] T [ d i ] | t [ d i ] T [ d i ] : t [ d i ] s t [ d i ] }

Likewise, we generalize the max operator on the cube lattice as follows:

{ T C L ( r ) , m a x s ( T ) = { t T | t T : t s t } d i D , m a x s ( T [ d i ] ) = { t [ d i ] T [ d i ] | t [ d i ] T [ d i ] : t [ d i ] s t [ d i ] }

Definition 3.17 (Dimensional Sum operator). Let’s consider d i D , we define the Dimensional Sum + d i of d i as follows: x , y D o m ( h i ) , x + d i y is the nearest (small) common ancestor to x and y in h i . In other words:

x D o m ( h i , f ) , y D o m ( h i , g ) x + d i y = z D o m ( h i , e ) | z d i x and z d i y | e S h ( d i ) : e > e

Example 3.12 For the dimension D Player , if x = P a r i s and y = M a r s e i l l e then x + D Player y = F r a n c e .

The Dimensional Sum operator can be overloaded to be applied to all tuples t x and t y of a dimension d i , noted t x + d i t y , by being implemented as follows:

x D o m ( h i , f ) , y D o m ( h i , g ) , z D o m ( h i , e ) , e < f and e < g , t z = t x + d i t y = t x + d i y

Example 3.13 For the dimension D Player :

If x = P a r i s ,

t x = ( F r a n c e , I D F , P a r i s , N U L L , N U L L , N U L L , N U L L , N U L L )

And y = M a r s e i l l e ,

t y = ( F r a n c e , P A C A , M a r s e i l l e , N U L L , N U L L , N U L L , N U L L , N U L L )

Then z = F r a n c e ,

t z = t x + d i t y = ( F r a n c e , N U L L , N U L L , N U L L , N U L L , N U L L , N U L L , N U L L )

Definition 3.18 (Generalized Sum operator on the cube lattice). We generalize the Sum operator on the cube lattice as follows:

{ u , v C L ( r ) , { z = u + v } d i D , z [ d i ] = u [ d i ] + d i v [ d i ]

z is the sum of tuples u and v.

Definition 3.19 (Dimensional Product operator). Let’s consider d i D , we define the Dimensional Product d i of d i as follows: x , y D o m ( h i ) , x d i y is the set of nearest common descendants of x and y in h i . In other words:

f , g < P r o f ( h i ) , x D o m ( h i , f ) , y D o m ( h i , g ) , x d i y = { z D o m ( h i , e ) | x d i z et y d i z | e S h ( d i ) : e < e }

If x and y have no common descendants in h i , then:

x d i y = { }

Example 3.14 For the dimension D Player :

If x = P a r i s and y = 92.88.91.80 , x D Player y = { W i n d o w s }

If x = M a r s e i l l e and y = 139.124.242.125 , x D Player y = { L i n u x , M a c O S }

If x = P a r i s and y = M a r s e i l l e , x D Player y = { }

The Dimensional Product operator can be overloaded to be applied to all tuples t x and t y of a dimension d i , noted t x d i t y , by being implemented as follows:

f , g < P r o f ( h i ) , x D o m ( h i , f ) , y D o m ( h i , g ) , t z = t x d i t y = { { t x d i y } if z D o m ( h i , e ) , f < e and g < e { ( , , ) } otherwise .

Example 3.15 For the dimension D Player :

If x = M a r s e i l l e ,

t x = ( F r a n c e , P A C A , M a r s e i l l e , N U L L , N U L L , N U L L , N U L L , N U L L )

And y = 139.124.242.125 ,

t y = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, N U L L , N U L L , N U L L , N U L L )

Then z = { L i n u x , M a c O S } ,

t z = t x d i t y = { ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, L i n u x , N U L L , N U L L , N U L L ) ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, M a c O S , N U L L , N U L L , N U L L ) }

Definition 3.20 (Generalized Product operator on the cube lattice). We generalize the Product operator on the cube lattice as follows:

{ u , v C L ( r ) , { z = u v } d i D , { z [ d i ] = u [ d i ] d i v [ d i ] }

z is the product of tuples u and v.

Definition 3.21 (Dimensional Semi-product operator). Let’s consider d i D , we define the dimensional Semi-product d i of d i as follows: x , y D o m ( h i ) , x d i y is the set of nearest descendants of x and y in h i . In other words:

f < P r o f ( h i ) , x , y D o m ( h i , f ) , x d i y = { z D o m ( h i , e ) | x d i z et y d i z | e S h ( d i ) : e < e }

If x and y do not have the same level, or if they do not have descendants, in h i , then:

x d i y = { }

Example 3.16 For the dimension D Player :

IF x = L i n u x and y = M a x O S , x D Player y = { O p e r a , F i r e f o x }

IF x = M a r s e i l l and y = M a x O S , x D Player y = { }

The Dimensional Semi-product operator can be overloaded to be applied to all tuples t x and t y of a dimension d i , noted t x d i t y , by being implemented as follows:

f < P r o f ( h i ) , x , y D o m ( h i , f ) , z D o m ( h i , e ) , t z = t x d i t y = { t x d i y | e > f | e S h ( d i ) : e < e }

x , y D o m ( h i , f ) , t z = t x d i t y = { { t x d i y } if z D o m ( h i , e ) , f < e { ( , , ) } otherwise .

Example 3.17 For the dimension D Player :

If x = L i n u x ,

t x = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, L i n u x , N U L L , N U L L , N U L L )

And y = M a c O S ,

t y = ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, M a c O S , N U L L , N U L L , N U L L )

Then z = { O p e r a , F i r e f o x } ,

t z = t x d i t y = { ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, L i n u x , O p e r a , N U L L , N U L L ) ( F r a n c e , P A C A , M a r s e i l l e ,139.124.242.125, M a c O S , F i r e f o x , N U L L , N U L L ) }

Definition 3.22 (Generalized Semi-product operator on the cube lattice) We generalize the Semi-Product operator on the cube lattice as follows:

{ u , v C L ( r ) , { z = u v } d i D , { z [ d i ] = u [ d i ] d i v [ d i ] }

z is the product of tuples u and v.

3.3.5. Characterization of the Hierarchical Cube Lattice

By endowing the multidimensional space S p a c e ( r ) of r with orders of specialization and using the operators, notably Sum and Product, we propose an algebraic structure called Hierarchical Datacube Lattice, or Hierarchical Cube Lattice, or simply Cube Lattice, which sets a theoretical and general framework for multidimensional database mining. It is easily transposable from the standard datacube lattice. The following lemmas and propositions give the fundamental properties of the cube lattice, which are repeated in Theorem 6.

Lemma 1. The ordered set C L ( r ) = S p a c e ( r ) , s is a complete lattice called cube lattice for which:

T C L ( r ) , ⋀ T = + t T t où ⋀ symbolises the infimum.

T C L ( r ) , ⋁ T = t T t où ⋁ symbolises the supremum.

Lemma 2. The lattice C L ( r ) = S p a c e ( r ) , s is a co-atomic and atomic lattice.

Proposition 3 (Hierarchical lattice of parts of binary attributes). Let L ( r ) be the hierarchical lattice of parts of binary attributes of the binary relation, i.e.

the lattice P ( d i D d i a , a D o m ( d i ) ) , . Then there exists an order-embedding Φ :

C L ( r ) L ( r )

t { d i D d i a , a D o m ( d i ) if t = ( , , ) { d i t [ d i ] | d i A t t r i b u t e ( t ) } otherwise .

The rank of a tuple t, noted r a n k ( t ) , is the length of the smallest path (minimal number of arcs) in the cube lattice connecting it to the tuple ( A L L i , , A L L j ) . We thus have: r a n k ( t ) = | Φ ( t ) | if t ( , , ) , | D | + 1 otherwise.

Proposition 4 (Hierarchical co-atom/atom). Hierarchical co-atoms (respectively hierarchical atoms) are the maximal tuples, namely the most specific tuples (respectively maximal tuples) of the lattice deprived of its universal majorant (respectively minorant). The hierarchical co-atoms (respectively hierarchical atoms) of the hierarchical cube lattice of a relation r are noted C A t ( C L ( r ) ) (respectively A t ( C L ( r ) ) ).

Lemma 5. The cube lattice C L ( r ) is graduated. If | D | 2 then C L ( r ) is not distributive.

Lemma 5 shows that the cube lattice is a graduated lattice. Therefore, we can apply level-wise algorithms on this search space.

Theorem 6 (Hierarchical datacube lattice). Let r be a data warehouse composed of hierarchical dimensions and measures ( D M ). The ordered set C L ( r ) = S p a c e ( r ) , s is a complete, atomic, co-atomic and gradual hierarchical lattice called a hierarchical datacube lattice in which:

T C L ( r ) , ⋀ T = + t T t

T C L ( r ) , ⋁ T = t T t .

Example 3.18: Figure 5 shows the cube lattice of our example table of facts relation (cf. Table 4). In this diagram, the edges represent the links of generalization or specialization between tuples. The values of the coded attributes are as follows:

Disregarding ( , , ) , hierarchical co-atoms are tuples conveying information at the most detailed level, i.e. that of the actual values of dimensions. In other words, the hierarchical co-atoms are the potential tuples of a relation. Thus, we have:

Figure 5. Hasse diagram of the hierarchical cube lattice of OM3.

C A t ( C L ( OM 3 ) ) = { ( F r a n c e , S 1 , A 1 ) , ( F r a n c e , S 1 , A 1 1 ) , ( F r a n c e , S 1 , A 1 2 ) , ( F r a n c e , S 1 , A 2 ) , ( F r a n c e , S 1 1 , A 1 ) , ( I D F , S 1 , A 1 ) , ( e s , S 3 3 , A 7 12 ) , ( P 1 , S 1 , A 1 ) , ( P 1 , S 1 , A 2 ) , ( P 2 , S 3 , A 7 ) , }

The hierarchical atoms of the lattice offer the most synthetic information possible, with the exception of the tuple (ALLPlayer, ALLTurn, ALLSeries). Since we will only consider three dimensions here, all hierarchical atoms have two different synthetic A L L i values. Thus we have:

A t ( C L ( r ) ) = { ( F r a n c e , A L L Turn , A L L Series ) , ( I D F , A L L Turn , A L L Series ) , ( P a r i s , A L L Turn , A L L Series ) , ( A L L Player , S 1 , A L L Series ) , ( A L L Player , A L L Turn , A 7 ) , }

Property 3.1 (Size of a hierarchical cube lattice). The height (number of levels) of the hierarchical cube lattice is | D | + 1 . The number of elements for a level i ( i 1 , , | D | ) is:

d i D | d i | = i ( A d i | D o m ( h i ) | ) ( | D | i ) max A D ( | D o m ( h i ) | ) i .

The total number of elements in the hierarchical cube lattice is:

i = 1 , , | D | ( d i D | d i | = i ( A d i | D o m ( h i ) | ) ) + 2 = A D ( | D o m ( h i ) | + 1 ) + 1

The above property gives an analysis of the number of elements contained in a level of the hierarchical cube lattice and the total number of elements. This property is particularly important as it allows us to characterise the complexity of algorithms using the hierarchical cube lattice as a search space.

4. Hierarchical Datacube

The concept of a datacube in a relational world ( [9] ) is immediately transposed into a hierarchical datacube in a data warehouse world, with the integration of hierarchical dimensions.

Thus, we call the result of a partitioning query, according to a set X D of hierarchical dimensions, cuboid, noted c u b o i d X ( r ) . Likewise, the datacube constituted by the set of all cuboids is noted d a t a c u b e ( r ) . The cuboids can be ordered with respect to their level of detail by the partial order relation .

Example 4.1: The cuboid according to the hierarchical dimensions D Player , D Turn is less aggregated (more detailed) than the cuboid according to D Player or that according to D Turn . In other words:

C u b o i d D Player ( r ) C u b o i d D Player , D Turn ( r )

C u b o i d D Turn ( r ) C u b o i d D Player , D Turn ( r )

In a standard way, the set of cuboids endowed with this order forms a lattice. The cuboids of this lattice are grouped by level according to their number of hierarchical dimensions. These levels are numbered starting from the bottom of the lattice (cuboid bearing no dimension d i ) and going up towards the top (cuboid according to all possible criteria called “base cuboid”). Let’s consider two cuboids c u b o i d U ( r ) according to the hierarchical dimension subset U and c u b o i d V ( r ) according to V , if U V then c u b o i d U ( r ) c u b o i d V ( r ) , the cuboids are said to have a family relationship, c u b o i d V ( r ) is called the “ancestor” of c u b o i d U ( r ) and c u b o i d U ( r ) is the “offspring” of c u b o i d V ( r ) .

Figure 6 gives the representation of the cuboid lattice of the data warehouse OM3 according to the hierarchical dimensions D Player (IdP), D Turn (IdT) and D Series (IdS).

Figure 6. Cuboids of OM3 according to D Player (IdP), D Turn (IdT) and D Series (IdS).

By taking the example of the OM3 and its table of facts (cf. Table 4), we can express in SQL the hierarchical datacube necessary for an analysis according to the measurements Time and Score with respect to hierarchical dimensions IdP, IdT and IdS with the following query:

This query will calculate the 23 = 8 partitionings (where represents the partitioning along no dimension d i , i.e. aggregating the entire data warehouse into a single tuple:

• “IdP, IdT, IdS”;

• “IdP, IdT”;

• “IdP, IdS”;

• “IdT, IdS”;

• “IdP”;

• “IdT”;

• “IdS”;

• “ ”.

As with the traditional datacube, the naive way to compute this type of query is to rewrite it as a collection of eight aggregative queries and run them separately. However, each of its subqueries has its own pattern. To make all these cuboids uni-compatible, we use the value A L L i (cf. Definition 3.5): this value has a particular semantics, it is a generalization of all the values of the domain of an attribute of a dimension d i . Thus, the cuboids all share the same schema ( D M ) and can be grouped together in a single relation: the hierarchical datacube. Each multidimensional tuple consists of a set of values for its hierarchical dimensions and numerical values for their measures. The value of the measure is computed by aggregating the set of tuples of the original relation sharing the same values of the selected hierarchical dimensions.

Example 4.2 - The multidimensional tuple t = ( 8,1, A L L ) is obtained by aggregating by IdP and IdT the tuples t 1 = ( 8,1,1 ) and t 2 = ( 8,1,4 ) of the example data warehouse (cf. Table 4). The measures of t are obtained by applying the measure functions on the set of tuples needed for the aggregation.

In our case, we have:

f Time ( t , r ) = t 1 ( Time ) + t 2 ( Time ) = S U M ( 6.32,18.9 ) = 25.22

• …

f Score ( t , r ) = t 1 ( Score ) + t 2 ( Score ) = M A X ( 700,2300 ) = 2300

• …

With the example data warehouse (cf. Table 4), the query to naively compute the hierarchical datacube according to the hierarchical dimensions D Player (IdP), D Turn (IdT) and D Series (IdS) and the measures Time, Duration, Number, Score and Shape is as follows:

The result of this query is given by the Table 6. For clarity, different cuboids are separated by a horizontal line, and special values are abbreviated: ALLPlayer as ALLP, ALLTurn as ALLT and ALLSeries as ALLS.

5. Closed Hierarchical Datacube

The combinatorial explosion of results during datacube computations is a well known phenomenon ( [12] ), which is amplified with the computation of hierarchical datacubes. The closed hierarchical cube approach we propose aims at representing the hierarchical datacube without loss of information but with a consequent decrease of the necessary storage space. To do this, we eliminate possible redundancies by keeping only one representative for a set of tuples from the same data of the original relation. It is easily transposable from the standard closed datacube.

5.1. Experimental Results for Closed Datacube

Our objective now is to compare, through various experiments, the sizes of the datacube and the closed cube. In this sub-section, we report a summary of our results. All experiments are conducted on an Intel Pentium G2120 3.10 GHz with 3.6 GB main memory and running on Turnkey LAMP Stack (based on Debian GNU/Linux). We use the algorithm Close [13] (for which we have the sources) in order to perform experimental comparisons between the representations. We use real data sets to evaluate the effectiveness of our approach.

Table 6. Relational representation of the hierarchical datacube of OM3.

We use the real dataset SEP85L containing weather conditions at various weather stations from December 1981 through November 1991. This weather dataset has been frequently used in calibrating various cube algorithms [14] . Mushroom is a dataset widely known in frequent pattern mining. It provides various characteristics of mushrooms. Death is a dataset gathering information about patients’ decease with the date and cause. TombNecropolis and TombObjects are issued from archaeological excavation. They encompass a list of necropolises, their tombs and other properties like the country, the funeral rite, the objects discovered in the tombs and their description. Finally, Joint_Objects_ Tombs results from the natural join between TombObjects and TombNecropolis according to the identifiers of necropolises and tombs.

Table 7 gives the datasets used for experiments. The columns Attributes and Tuples stand for the number of attributes and tuples respectively. In the last column, the size in bytes of the dataset is reported (each dimension or attribute is encoded as an integer requiring 4 bytes for any value).

Table 8 illustrates the size of the studied representations for the various datasets.

These five datasets are only encompassing strongly correlated data. Thus we are in the most difficult cases. In this context, the closed cube reduces the size of the datacube from 8 to more than 300 times.

By using the SEP85L dataset, we have generated 9 datasets having from 2 to 10 dimensions by projecting the weather dataset on the first k dimensions ( 2 k 10 ). Table 9 presents the number of resulting patterns for the approaches.

Table 7. Experimental datasets.

Table 8. Size of the data cubes (in bytes).

Table 9. Experimental results for SEP85L.

Whatever the number of dimensions is, the closed cube is the smallest representation. It is always significantly reduced when compared to the datacube itself.

5.2. Formal Framework of Closed Hierarchical Datacube

The following definitions and corollary give the fundamental properties of the closed hierarchical datacube, which are repeated in Theorem 7.

Definition 5.1 (Hierarchical cube closure operator)—The hierarchical cube closure operator associates to any tuple t of the hierarchical cube lattice a single tuple called the hierarchical closure of t, or simply the closure of t. It is obtained by considering the set of tuples more specific than t and by determining the most general tuple of this set using the Sum operator. This operator is noted and defined as follows:

: C L ( r ) C L ( r ) t { + t | t s t and t r ( , , ) otherwise .

Example 5.1 Considering the multidimensional space of the data warehouse OM3 (cf. Table 5) and its table of facts (cf. Table 4), we have ( ( P 1 , A L L T , A L L S ) ) = ( P 1 , S 1 , A 1 ) + ( P 1 , S 1 , A 2 ) = ( P 1 , S 1 , A L L S ) and ( ( A L L P , A L L T , A 7 ) ) = ( P 3 , S 3 , A 7 ) .

Corollary 1 is a closure operator of C L ( r ) on r. satisfies the following properties:

t g t ( t , r ) g ( t , r ) (isotonicity);

t g ( t , r ) (extensivity);

( t , r ) = ( ( t , r ) , r ) (idempotency).

Definition 5.2 (Hierarchical cube closure system) Let’s consider ( r ) = { t C L ( r ) | ( t , r ) = t } . ( r ) is a closure system on r and the associated closure operator is . Any tuple belonging to ( r ) is a hierarchical closed tuple or a hierarchical cube closure.

Example 5.2 Considering the multidimensional space of the data warehouse OM3 (cf. Table 5) and its table of facts (cf. Table 4), we have:

( r ) = { ( P 1 , S 1 , A 1 ) , ( P 1 , S 1 , A 2 ) , ( P 2 , S 2 , A 3 ) , ( P 2 , S 2 , A 4 ) , ( P 2 , S 2 , A 5 ) , ( P 3 , S 3 , A 6 ) , ( P 3 , S 3 , A 7 ) , ( P 1 , S 1 , A L L S ) , ( P 2 , S 2 , A L L S ) , ( P 3 , S 3 , A L L S ) , ( , , ) }

Theorem 7. The partially ordered set C C L ( r ) = ( r ) , g is a complete and co-atomic lattice called a closed cube lattice such that:

T C C L ( r ) , ⋀ T = + t T t

T C C L ( r ) , ⋁ T = ( t T t , r )

All tuples with the same closure generalize the same tuples of the original relation. As they generalize the same original tuples, they share the same aggregated value of the measure (property of GROUP BY). For each tuple of the cube lattice, it is sufficient to calculate its closure to find its measure. Thus the closed hierarchical cube is a cover of the hierarchical datacube.

Example 5.3 - Figure 7 shows the closed hierarchical cube lattice of OM3.

Figure 7. Hasse diagram of the closed hierarchical cube lattice of OM3.

5.3. Possible Limitations of Closed Hierarchical Datacube

Despite offering advantages in term of storage efficiency and query acceleration, closed hierarchical datacube have some possible limitations and limitations to account for, like loss in details, analysis flexibility, hierarchy management and adaptability to non-hierarchical dimensions. One of the possible inconveniences of a closed hierarchical datacube is that data consolidation may imply a loss of detail or an over-aggregation of data. By aggregating data at higher hierarchical levels, specific details may be lost at lower hierarchical levels, which can limit the ability to perform fine-grained data analysis at the lowest levels of granularity. That can be an inconvenience if details analysis is needed for specific decisions or perspectives. A closed datacube can also be less flexible for ad hoc analysis, because data are consolidated at specific hierarchical levels. If new analysis questions need different granularity levels of different data groupings, this may require to compute new closed datacubes or to recompute the current ones, which will require more time and resources. Also, closed datacubes are typically pre-computed to aggregate data at higher hierarchical levels, which can speed up analysis queries. However, this also means that data are pre-computed and stored in advance, which can incur a cost in terms of storage space for updates and modifications of underlying data. For example, hierarchies management can be complex in a closed datacube, because there can be more consolidation and aggregation levels to consider. The definition and management of hierarchical relations between dimensions can require a special attention to ensure that consolidations and aggregations are correctly aligned with the needs of the analysis. Finally, closed datacubes are made to work with hierarchical dimensions where there is a parent-child relation between granularity levels. However, they might not be as suitable for non-hierarchical dimensions, where data don’t follow a clear hierarchical structure. In these cases, closed datacubes may not be as efficient to aggregate and consolidate data, which can lead to a loss in performance and accuracy in analysis results.

6. Conclusions

As part of multidimensional data warehouses (or datamart), after presenting an application case on a video game, having given a definition to the hierarchical dimensions and presented the different types of hierarchies ( [1] ), we have defined a theoretical framework and a formal definition of the attributes of hierarchical dimensions and their associated structure and hierarchy.

Inspired by the classical theory, and transposing it in a relatively direct way, we have defined, in a hierarchical context, the multidimensional space, specialisation orders ( [10] [11] ), functions and operators, in particular the Sum operator and the Product operator, and for each, defined at the dimensional level, overloaded and generalized on the cube lattice.

We then characterized the hierarchical datacube lattice and gave a calculation of its size, showing the complexity of algorithms using the hierarchical cube lattice as a search space.

We then proposed the hierarchical datacube, which shows the same appeal as the classic datacube ( [9] ), major concept for data warehouse management.

As with the original datacube, the closed hierarchical datacube is its most concise representation. It exploits the generally large redundancies in the data. This is a good way to reduce the size of a datacube. By removing them, it allows the necessary size to be reduced substantially without losing useful data. When the data are very correlated, the representation by a closed hierarchical cube shows all its interest because the reduction brought is quite significant, and nevertheless makes it possible to find even more information than with a traditional datacube. These strong correlations appear in real data sets and increase when the dimensional data is very scattered over very large domains. Such characteristics are typical of data warehouses ( [15] [16] ), and in particular in the field of video games where hierarchical dimension attributes are often legion.

NOTES

1Ordered collection of the values of an indefinite number of attributes relating to the same object.

2This term was defined by [2] through twelve rules to be applied to a database so that it is OLAP: 1) Multidimensional conceptual view; 2) Transparency; 3) Accessibility; 4) Constancy of response times; 5) Client/server architecture; 6) Dimension independence; 7) Management of sparse matrices; 8) Multi-user support; 9) No restrictions on inter- and intra-dimensional 10) Intuitive data manipulation 11) Flexible reporting 12) Unlimited number of dimensions and unlimited number of items on dimensions.

3A video game of puzzle is a type of video game based on reflection. Its name comes from the jigsaw puzzle, a game consisting of putting pieces in order. It can be a game in which the player has to move elements in a specific way. Many of these games are called tile-matching games.

4A match 3 game is a type of video game in which the player must combine three elements of the same colour or shape in order to eliminate them from the game board. These games are often considered as reflection games and are generally based on the combination of strategy and luck. Match 3 games can be played on various platforms, including computers, mobile phones, and game consoles, and they are generally very popular with different age groups. Match 3 games can be simple entertainment or they can offer increasingly complex challenges as the player progresses through the game.

5The open source term designates an approach to software development (and now extends to works of the mind) whose license respects criteria precisely established by the Open Source Initiative (https://opensource.org/) and in which the source code is freely available and can be used, modified and distributed by anyone. The aim of the open source approach is to allow more people to contribute to the development and improvement of software, which can lead to increased innovation and improved quality. Open source licences prescribe the conditions under which source code can be used, modified and distributed. Some licences require that changes to the source code be published so that others can benefit from them, while other licences allow the user to keep their changes private).

6A browser-based video game is a type of online game that can be played through a web browser and does not require installation or downloading.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Inmon, W.H. (1996) Building the Data Warehouse. 2nd Edition, John Wiley and Sons, Inc., Hoboken.
[2] Codd, E.F. (1990) The Relational Model for Database Management, Version 2. Addison-Wesley, Boston.
[3] Chaudhuri, S. and Dayal, U. (1997) An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26, 65-74.
https://doi.org/10.1145/248603.248616
[4] Chaudhuri, S., Dayal, U. and Narasayya, V.R. (2011) An Overview of Business Intelligence Technology. Communications of the ACM, 54, 88-98.
https://doi.org/10.1145/1978542.1978562
[5] Hurtado, C.A. and Mendelzon, A.O. (2002) OLAP Dimension Constraints. Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Madison, June 2002, 169-179.
https://doi.org/10.1145/543613.543636
[6] Kuijpers, B. and Vaisman, A. (2016) A Formal Algebra for OLAP.
[7] Malinowski, E. and Zimányi, E. (2004) OLAP Hierarchies: A Conceptual Perspective. Advanced Information Systems Engineering, 16th International Conference, Riga, 7-11 June 2004, 477-491.
[8] Favre, C., et al. (2013) Les entrepots de donnees pour les nuls... ou pas!
[9] Gray, J., et al. (1997) Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub Totals. Data Mining and Knowledge Discovery, 1, 29-53.
https://doi.org/10.1023/A:1009726021843
[10] Mitchell, T.M. (1982) Generalization as Search. Artificial Intelligence, 18, 203-226.
https://doi.org/10.1016/0004-3702(82)90040-6
[11] Mitchell, T.M. (1997) Machine Learning. McGraw-Hill, New York.
[12] Han, J.W., Kamber, M. and Pei, J. (2011) Data Mining: Concepts and Techniques. 3rd Edition, Morgan Kaufmann, Burlington.
http://hanj.cs.illinois.edu/bk3
[13] Pasquier, N., et al. (1999) Efficient Mining of Association Rules Using Closed Itemset Lattices. Information Systems, 24, 25-46.
https://doi.org/10.1016/S0306-4379(99)00003-4
[14] Wang, W., et al. (2002) Condensed Cube: An Efficient Approach to Reducing Data Cube Size. Proceedings of the 18th International Conference on Data Engineering, San Jose, 26 February-1 March 2002, 155-165.
https://doi.org/10.1109/ICDE.2002.994705
[15] Ross, K.A. and Srivastava, D. (1997) Fast Computation of Sparse Datacubes. 23rd International Conference on Very Large Databases, VLDB 1997, Athens, 26-29 August 1997, 116-125.
[16] Beyer, K.S. and Ramakrishnan, R. (1999) Bottom-Up Computation of Sparse and Iceberg CUBEs. In: Delis, A., Faloutsos, C. and Ghandeharizadeh, S., Eds., SIGMOD Conference, ACM Press, New York, 359-370.
https://doi.org/10.1145/304182.304214

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.