Evaluating Relational Ranking Queries Involving Both Text Attributes and Numeric Attributes

In many database applications, ranking queries may reference both text and numeric attributes, with ranking functions based on semantic distances/similarities for the text attributes and numeric distances for the numeric attributes. In this paper, we propose a new method for evaluating this type of ranking query over a relational database. Using statistics and training, the method builds a mechanism that combines the semantic and numeric distances, and this mechanism can balance the effects of the text attributes and the numeric attributes when matching a given query against tuples in database search. The basic idea of the method is to create an index, based on WordNet, that expands the tuple words of the text attributes semantically and records the information of the numeric attributes. The candidate results for a query are retrieved through the index and a simple SQL selection statement, and then the top-N answers are obtained. The results of extensive experiments indicate that the new strategy is both efficient and effective.


Introduction
A relational ranking query (or top-N query) finds the N tuples that satisfy a query condition best, though not necessarily completely, with the results sorted according to a given ranking function. Research on top-N queries has intensified since the late 1990s [1][2][3]; most of it involves numeric attributes and uses a numeric distance function (say, the L_p-norm distances, p = 1, 2, and ∞) to reduce the massive result set of a conventional query to a few of the most relevant answers. However, there are many applications in which ranking queries involve both text attributes and numeric attributes.
Example 1. Consider a database of used books with schema Books(id#, title, price, year, …). Suppose a customer wants to buy a used book with title on "film", price around "$50" and year about "2000", where title is a text attribute with semantics, and price and year are two numeric attributes. Obviously, a book on "algebra" with price = "$50" and year = "2000" is not the desired result for the customer, even though its values for price and year exactly match those of the query. Another book on "movie", however, with price = "$53" and year = "2001", may well satisfy the customer's need.

Example 2. As shown in Figure 1, database IPUMS has two relations, Ipum99 and Occ50, which come from [4]. In relation Occ50(num, occ50), the primary key num is the value label of occupation1950. Relation Ipum99 has 61 numeric attributes, where A29 is age, A50 is income, and A40 is the foreign key referencing Occ50.num. Furthermore, we added an attribute idx to Ipum99 as an identifier. A user is looking for information on "a horticulturist with age about 50" in IPUMS. Since the word "horticulturist" does not occur in Occ50.occ50, a traditional SQL selection statement returns nothing. In fact, there is a tuple ("Gardeners, except farm, and groundskeepers", age = 51, …) (shown in gray in Figure 1) in IPUMS that may be an answer for the user.

Figure 1. The relations Ipum99 and Occ50 of database IPUMS
WordNet::Similarity [5] is an open-source Perl module for measuring the semantic distance/similarity between two words; however, it is not easy to use this module directly to evaluate ranking queries [6].
Research on semantic search in IR and the SW (semantic web) has gained attention since the 1990s. Taking thesaurus/ontology navigation as a step in query expansion, many query-expansion techniques are employed in keyword search; for example, query terms are expanded to their WordNet synonyms and meronyms using the Boolean OR operation available in most web search engines [7]. Different from these query-expansion techniques in IR and the SW, we address semantic matching between query words and tuple words in relational databases via tuple expansion.
In recent years, ontologies have been used to build applications in the database community [8,9]; however, constructing efficient ontologies is a challenging job [9,6]. Our method differs from these ontology-based techniques. First, instead of facing the challenge of constructing an ontology, we create a simple table maintained by the RDBMS and store the index and ranking-function information in it. Second, our strategy is more general since it is based on WordNet. Finally, our techniques are more efficient than the above ontology-based methods owing to the efficiency of the RDBMS.
Another different yet related research topic is keyword search in relational databases [10][11][12], which supports free-form keyword search without necessarily requiring users to know the database schema. Keyword search may be applied to any text attribute, but it is exact search and does not deal with semantic matching; for instance, it cannot return the books on "movie" for the query keyword "film". When query words exactly match tuple words, the results of semantic search contain those of keyword search, so our method improves the effectiveness of keyword search.
There are two challenging problems in evaluating the type of relational ranking query studied in this paper. The first is how to design a good semantic distance function that measures the semantic similarity between the query words and the tuple words; [6] presented a solution to this problem. The second is how to combine the semantic and numeric distances. We employ statistics and training to solve this problem, and create an index to process ranking queries in terms of semantic and numeric matching in database search. This work is a continuation of [6], which studied the processing of relational ranking queries with text attributes only, without numeric attributes.

Problem Definition and Ranking Function
Assume that R0(idx, A, B1, B2, …, Bm) is a relation, where idx is the tuple identifier, A is a text attribute, and B1, …, Bm are numeric attributes. For a tuple t, we write t[A] = tA = (tw1, tw2, …, twn) and t[Bj] = bj, and call twi a tuple word and bj a tuple value (1 ≤ i ≤ n, 1 ≤ j ≤ m). As defined in [6], given a tuple word w, its kinship words include five kinds of words in WordNet: (1) the word w itself, (2) its morph, (3) its synonyms, (4) its immediate hyponyms (subordinates), and (5) its immediate hypernyms (superordinates). The set of all kinship words of w is denoted by K(w). For instance, the kinship word set of "computers" is K(computers) = {computers, computer, data processor, machine, internet site, calculator, …}.
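As an illustration, the kinship-word expansion can be sketched as follows. This is a minimal Python sketch: the real implementation queries WordNet, while here a tiny hand-coded lexicon and morph table stand in for it, and all identifier names are ours.

```python
# Hypothetical miniature lexicon standing in for WordNet.
LEXICON = {
    "computer": {
        "synonyms": ["data processor"],
        "hyponyms": ["internet site", "calculator"],
        "hypernyms": ["machine"],
    },
}
MORPHS = {"computers": "computer"}  # crude morph lookup (inflected -> base form)

def kinship_words(w: str) -> set:
    """Return K(w): the word itself, its morph, its synonyms, and its
    immediate hyponyms and hypernyms, as defined in [6]."""
    k = {w}
    base = MORPHS.get(w, w)                # (2) morph
    k.add(base)
    entry = LEXICON.get(base, {})
    k.update(entry.get("synonyms", []))    # (3) synonyms
    k.update(entry.get("hyponyms", []))    # (4) immediate hyponyms
    k.update(entry.get("hypernyms", []))   # (5) immediate hypernyms
    return k

print(sorted(kinship_words("computers")))
```

A word absent from the lexicon simply expands to itself, which matches the fallback behavior a WordNet-backed implementation would need for out-of-vocabulary tuple words.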
Consider a ranking query q = (qA, q1, q2, …, qm), where qA = (qw1, qw2, …, qwk) is a word-string with k query words and qj is a numeric value (1 ≤ j ≤ m). Let pj denote the numeric distance between qj and the tuple value bj, and let dA = d(qA, tA) be the semantic distance between qA and tA defined by Definitions 1 to 3 in [6]. Moreover, dA belongs to the interval (0, 1].
We need to find a mechanism combining the semantic distance dA and the numeric distances pj, which can be used to evaluate the query q over the relation R0, i.e., to define a ranking function d(q, t) = ψ(dA, p1, p2, …, pm). Intuitively, the ranking function d(q, t) should satisfy the following: first, a smaller d(q, t) indicates a closer pair (q, t); second, d(q, t) needs to balance the effects of dA and the pj's in matching q with t; finally, it should be easy to implement.
To obtain d(q, t), we use the statistics of the domains of B 1 , B 2 , …, B m and training.
Since the semantic distance dA ∈ (0, 1], we normalize it in scientific notation, dA = α × 10^(−h), where 1.0 ≤ α < 10.0 and h is a nonnegative integer (if h = 0, we define −0 = 0); for example, dA = 0.001041 = 1.041 × 10^(−3). We will see that h (the absolute value of the exponent) plays an important role in the ranking function d(q, t). In collecting statistics, there is a step of cleaning the data and removing outliers, after which we get Min(Bj) and Max(Bj) of each numeric attribute Bj (1 ≤ j ≤ m). Based on the semantics of attribute Bj, we choose a reasonable unit Zj > 0, say, Z_year = 1 for attribute year and Z_price = 0.01 for price in database BOOK.
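The decomposition dA = α × 10^(−h) is simple to compute; a minimal sketch (the function name normalize is ours) is:

```python
import math

def normalize(d_a: float):
    """Decompose d_a in (0, 1] as alpha * 10**(-h), with 1.0 <= alpha < 10.0
    and h a nonnegative integer (h = 0 exactly when d_a = 1)."""
    assert 0.0 < d_a <= 1.0
    if d_a == 1.0:
        return 1.0, 0
    h = -math.floor(math.log10(d_a))   # h is the absolute value of the exponent
    alpha = d_a * 10 ** h
    return alpha, h

print(normalize(0.001041))   # the paper's example: alpha = 1.041, h = 3
```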
Let ej := cj + 1 ≥ 1, for 1 ≤ j ≤ m. By using the statistics and the training in our experiments, we obtain the ranking function d(q, t). For example, in our experiments for relation Books(title, price, year, …) with 56180 tuples, price is in (0, 1000] (except for 633 tuples with price > 1000) and year belongs to [1958, 2008] (except for 651 tuples with year < 1958); we let attributes A, B1 and B2 be title, price and year, respectively.

For query q, we return the dis-N tuples defined in [6] instead of the top-N ones. Let T be a set of tuples; a tuple t ∈ T is called a dis-N tuple of q if d(q, t) ≤ minN{d(q, ti) | ti ∈ T}, where minN denotes the Nth minimum value in the set {d(q, ti) | ti ∈ T}.
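The dis-N selection itself is straightforward once the combined distances d(q, t) are available; a minimal sketch (the function name and the input format are ours) is:

```python
def dis_n(distances: dict, n: int) -> list:
    """Return the identifiers of all tuples whose distance is at most the
    Nth-smallest one; ties may make the answer larger than N, which is
    exactly why dis-N tuples replace top-N ones."""
    if not distances or n <= 0:
        return []
    threshold = sorted(distances.values())[min(n, len(distances)) - 1]
    hits = [tid for tid, d in distances.items() if d <= threshold]
    return sorted(hits, key=lambda tid: distances[tid])
```

For instance, with distances {t1: 0.1, t2: 0.3, t3: 0.1}, the dis-2 answer contains both t1 and t3, since both reach the 2nd-minimum distance 0.1.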

Creation of sn-Index
In this section we extend the w-index of [6] to the sn-index (which stands for semantic and numeric index).
The attribute Word indicates a kinship word of the tuple words, and the relevant information is stored in the attribute DBValue. The value of DBValue is a string of the form "tid,d,n,f;tid,d,n,f;…;", where each "tid,d,n,f;" is a node: d ∈ {0, 1, …, 5} is the subscript i of the semantic distance di defined by Definition 1 in [6], n is the number of tuple words in the tuple identified by tid, and f is the frequency of the kinship word in the set of kinship words of the tuple. The attribute Size is the number of tids associated with kw (duplicate tids are counted repeatedly). The attribute dbNSize is the number of nodes in DBValue.
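The DBValue encoding can be decoded as follows (a small sketch following the node format above; the helper names are ours):

```python
from typing import NamedTuple

class DbNode(NamedTuple):
    tid: int  # identifier of the tuple containing the kinship word
    d: int    # subscript i of the semantic distance d_i (0..5)
    n: int    # number of tuple words in the tuple tid
    f: int    # frequency of the kinship word in the tuple's kinship-word set

def parse_dbvalue(s: str) -> list:
    """Parse a DBValue string "tid,d,n,f;tid,d,n,f;...;" into its nodes."""
    nodes = []
    for part in s.split(";"):
        if part.strip():
            tid, d, n, f = (int(x) for x in part.split(","))
            nodes.append(DbNode(tid, d, n, f))
    return nodes

nodes = parse_dbvalue("7,0,3,1;12,2,4,2;")
print(nodes, "dbNSize =", len(nodes))
```

Note that dbNSize is simply the number of parsed nodes.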
The attribute BValue is a string of nodes with the form "tid,idx,b1,…,bm;…", which stores, for each tuple identifier tid, the identifier idx and the values b1, …, bm of the numeric attributes.

Procedure of Creating sn-Index
Reconstructing the w-index of [6], we obtain the sn-index shown in Figure 2. The sn-index consists of four parts: (1) a hash table with hash function h(), each bucket of which contains a pointer pWnPointer that points to a node in the wn-list; (2) one wn-list, whose nodes have the structure {iRow, kw, size, dbNSize, bSize, pDBList, pBList}; (3) the db-lists; and (4) the b-lists.

Step 1. For the special character string tA = (tw1, …, twn), calculate its hash value h = h(tA), and check the pointer pWnPointer in the bucket of the hash table.

Step 3. Storing the sn-index.
To evaluate query q, we use two strategies. In Strategy-1, the entire sn-index is kept in main memory. Strategy-2 stores the db-lists and b-lists on disk and loads only the hash table and wn-list into main memory.
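For Strategy-1, the in-memory part of the sn-index can be pictured as follows. This is a simplified sketch: a Python dict plays the role of the hash table with hash function h(), the update logic mirrors the two cases for duplicate tuple identifiers in the creation procedure, and all names are ours.

```python
from dataclasses import dataclass, field

@dataclass
class WnNode:
    kw: str
    db_list: list = field(default_factory=list)  # dbNodes (tid, d, n, f), sorted by d
    b_list: list = field(default_factory=list)   # bNodes (tid, idx, b1, ..., bm)

    @property
    def dbNSize(self) -> int:
        return len(self.db_list)

# Strategy-1: the whole index lives in main memory; a dict stands in for the
# hash table with hash function h().
sn_index = {}

def insert_kinship(kw: str, tid: int, d: int, n: int) -> None:
    """Record that kw is a kinship word of some tuple word in tuple tid,
    with semantic-distance subscript d."""
    node = sn_index.setdefault(kw, WnNode(kw))
    for i, (t, d0, n0, f) in enumerate(node.db_list):
        if t == tid:  # Case 1: tuple already listed -> keep the smaller d, bump f
            node.db_list[i] = (t, min(d0, d), n0, f + 1)
            return
    node.db_list.append((tid, d, n, 1))    # Case 2: new dbNode
    node.db_list.sort(key=lambda x: x[1])  # keep increasing order of distance d
```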

Evaluation of Ranking Query
For query q = (qA, q1, q2, …, qm), we first match the query words with the kinship words of R(tid, A) via the sn-index, obtaining the set T (= {tid}) of identifiers of candidate tuples; we compute the semantic distances between q and its candidate tuples, and then get the set L of numeric-attribute information via each identifier tid in T. Second, we compute d(q, t) = ψ(dA, p1, p2, …, pm) between q and each of its candidate tuples, and obtain {idx}, the sorted set of identifiers of the dis-N results according to d(q, t). Finally, we retrieve the dis-N tuples from the underlying database and display the ranked answers.
Obtaining the set T is an important step in query processing. The intermediate results are stored in a temporary list, denoted T-List, whose node has the structure {tid, d[K], dA}, where K is the maximum number of query words for qA = (qw1, …, qwk); if k > K, we let qA = (qw1, …, qwK) (K = 30 in our implementation). The intermediate results for obtaining the set L are saved in a temporary list L-List, whose node has the structure {dA, p1, p2, …, pm, d, tid, idx}.

We discuss Strategy-2 of storage first (i.e., only the hash table and wn-list are loaded into main memory).
(1) Normalization of qA. We remove some symbols, character strings, or stop words from qA, and replace some of them by normal strings; denote the result again by qA = (qw1, …, qwk).
(2) For each qwi ∈ qA (i = 1, …, k), calculate the hash value h = h(qwi) and check qwi in the wn-list. If qwi is found, we get its wnNode in the wn-list, and then use the SQL selection statement "select * from IndexTable where id# = wnNode.iRow" to obtain the dbNode(s). For each dbNode, we save its values into T-List, compute dA = d(qA, tA), and obtain T.
(3) If the above T ≠ ∅, then for each tid ∈ T we get its corresponding bNode(s) by select statements on IndexTable, store them into L-List, compute d := ψ(dA, p1, …, pm), and then obtain the set L (= {idx}) of candidate dis-N tuple identifiers.
(4) Given a positive integer N, we get the set of dis-N tuple identifiers LN ⊆ L; then we obtain the dis-N tuples of query q by using an SQL selection statement of the following form: select R0.*, R.A from R0, R where (R0.FKid = R.tid) and (R.idx in LN) and …

(5) Display the dis-N tuples sorted by d(q, t).
Next, we discuss Strategy-1 of storage, i.e., the entire sn-index is kept in main memory. The query processing is similar to the above, except that in steps (2) and (3) we can get the dbNode(s) and bNode(s) from the sn-index directly, without select statements. Thus, the response time of Strategy-1 is smaller than that of Strategy-2.
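Putting the steps together, Strategy-1 can be sketched as below. This is only an illustration: semantic_dist and psi are stand-ins for the semantic distance of [6] and for the trained combining function ψ, the index entries are simplified to b-lists of (tid, idx, b1, …, bm) tuples, and the absolute difference used for pj is likewise an assumption.

```python
def evaluate(sn_index, query_words, q_vals, semantic_dist, psi, n):
    """Strategy-1 sketch: match query words against the in-memory index,
    combine semantic and numeric distances with psi, and return the sorted
    idx values of the dis-N answers."""
    # Step (2): candidate tuples T, gathered from the index via the query words.
    cands = {}
    for qw in query_words:
        node = sn_index.get(qw)
        if node is not None:
            for tid, idx, *b_vals in node.b_list:
                cands[tid] = (idx, b_vals)
    # Step (3): combined distance d = psi(d_A, p_1, ..., p_m) per candidate.
    scored = {}
    for tid, (idx, b_vals) in cands.items():
        d_a = semantic_dist(query_words, tid)
        p = [abs(q - b) for q, b in zip(q_vals, b_vals)]  # assumed numeric distance
        scored[idx] = psi(d_a, p)
    # Steps (4)-(5): dis-N = everything within the Nth-smallest distance.
    if not scored:
        return []
    thr = sorted(scored.values())[min(n, len(scored)) - 1]
    return sorted((i for i, d in scored.items() if d <= thr), key=scored.get)
```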

Experimental results
Our experiments were carried out using Microsoft SQL Server 2000 and VC++ 6.0 on a PC with Windows XP, an Intel(R) Core(TM) i5-2400 3.10GHz CPU, and 2.98GB of memory. In addition, WordNet 2.1 and its API functions, ODBC, and the ODBC API functions are used.
Datasets: We use two real datasets. The first is IPUMS, with the two relations described above, which is a fragment of US Census Bureau data [4]. The second dataset, BOOK, comes from the Library at Beijing University of Technology; it is the record set of the English books in the Library and produces two relations. One is Books(id#, isbn, title, author, publisher, price, year, FKid), with 56180 tuples. The other is Titles(tid, title), with 48107 tuples (duplicate titles are removed). Books and Titles act as R0 and R respectively; Books.id# corresponds to R0.idx, and Books.FKid is the foreign key referencing Titles.tid.
Attributes tid, A, B1, and B2 described in Section 2 correspond to num, occ50, age, and income for IPUMS, and to tid, title, price, and year for BOOK, respectively.
Space cost of sn-index: Strategy-1 and Strategy-2 are used for IPUMS and BOOK, respectively. The main-memory space costs are about 3.6MB for IPUMS and about 3MB for BOOK.
Workloads: We built a program to create a workload of 100 queries for each database. First, we choose 100 tuples from R at random; then, for each tuple t, we randomly select 1~10 kinship word(s) from K(t), such that the numbers of simple queries (1~3 words) and complex queries (4~10 words) are both 50. We classify the queries into 10 groups Gi (i = 1, 2, …, 10), where each query in Gi has i search word(s); the size of Gi is random.

Elapsed Time
Figure 3 illustrates the average elapsed time for executing all queries in each Gi for IPUMS, using fourteen curves itN and dtN, which stand for index time and DB time, respectively. The seven curves it1 to it100 are (almost) the same; they depend on the queries but not on N (N = 1, 2, …, 100), and range from 11 to 230 milliseconds. The other seven curves dt1 to dt100 are the costs of retrieving tuples from the database by SQL selection statements over the natural join of Ipum99 and Occ50; they are smaller than 600 milliseconds.

Precision
It is difficult for a computer to confirm whether a retrieved tuple matches a query semantically, and it is too big a job to recognize semantic matches manually for a large dataset [6]. Therefore, the traditional recall measure for evaluating IR systems is not suitable for measuring semantic matching when the dataset is large, and we report only precision. Figures 5 and 6 illustrate the average precision for IPUMS and BOOK, respectively. We can see that a smaller N yields a larger precision; that is, a smaller N indicates that more matching tuples appear among its dis-N results. The precisions for IPUMS are 1, 0.93, 0.90, 0.73, 0.62, 0.52, and 0.44 for N = 1, 2, 3, 10, 20, 40, and 100, respectively; the precisions for BOOK are 0.93, 0.86, 0.78, 0.63, 0.55, 0.45, and 0.34 for the same values of N.

Conclusions
We proposed a new method to evaluate relational ranking queries that reference both text attributes and numeric attributes. The method builds a ranking function combining the semantic and numeric distances, and creates an index, based on WordNet, that expands the tuple words of the text attribute semantically and records the information of the numeric attributes. The results for a query are retrieved by the index and a simple SQL selection statement over the natural join of the relations, and are ranked according to the ranking function. We conducted extensive experiments on two real datasets to measure the performance of the new technique; the results demonstrate that our strategy is efficient and effective.

Figure 3. The elapsed times for IPUMS

Figure 4 shows the elapsed time with fourteen curves for BOOK. The seven curves it1 to it100 are (almost) the same; they depend on the queries but not on N (N = 1, 2, …, 100), and range between 34ms and 346ms. Curves dt1 to dt100 show the average elapsed times for accessing the database to retrieve tuples. The result sets differ for the various dis-N queries, and a larger N means a longer elapsed time. The DB-times are less than 20ms for 1 ≤ N ≤ 3, and between 50 and 300ms for 10 ≤ N ≤ 100.

Figure 4. The elapsed times for BOOK


(b) By the SQL selection statement "select B1, B2, …, Bm from R0 where FKid = tid", get the result set {(idx, b1, b2, …, bm)}, and let z be its size. Create a new b-list and z bNode(s), and insert the bNode(s) into the b-list. (c) Create a wnNode for the wn-list with kw := (tw1, …, twn), dbNSize := 1 and bSize := z.

Step 2. For each tuple word twi ∈ t and each kinship word kw of twi, get h = h(kw), and check pWnPointer in the bucket of the hash table and the wn-list. (1) If pWnPointer is NULL, or kw is not in the wn-list, do the same jobs as (1) in Step 1, except for replacing tA by kw and defining d = k, k ∈ {0, 1, …, 5}, such that dk is d(kw, twi) in Definition 1 in [6]. (2) Else, if kw is already in the wn-list, get its db-list and b-list. There are two cases. Case 1: there is a dbNode in the db-list with dbNode.tid = t.tid; replace d by the smaller value if the two distances differ, and increase f by 1. Case 2: no dbNode in the db-list satisfies dbNode.tid = t.tid; increase dbNSize by 1 in the wnNode with kw, and create and insert a new dbNode into the db-list in increasing order of distance d. In addition, by the SQL selection statement as in (b) above, get a result set {(idx, b1, b2, …, bm)} with size z ≥ 1; create z bNode(s), insert them into the b-list, and set bSize := z.