
A large number of ontologies have been introduced by the biomedical community in recent years. Knowledge discovery for entity identification from ontologies has become an important research area, and it is of particular interest to discover how associations are established to connect concepts within a single ontology or across multiple ontologies. However, due to the exponential growth of biomedical big data and their complicated associations, it is very challenging to detect key associations among entities in an efficient and dynamic manner. There is therefore a gap between the increasing need for association detection and the large volume of biomedical ontologies. To bridge this gap, we present a knowledge discovery framework, the BioBroker, for grouping entities to facilitate biomedical knowledge discovery. Specifically, we developed a knowledge discovery algorithm that combines a graph clustering method and an indexing technique to discover knowledge patterns over a set of interlinked data sources efficiently. We demonstrate the capabilities of the BioBroker for query execution with a use case study on a subset of the Bio2RDF life science linked data.

With a large number of ontologies having been introduced by the biomedical community in recent years, one of the issues researchers face in healthcare and biomedical research is the challenge of analytics on large, complex, and dynamic healthcare data (e.g., electronic health records (EHRs), biomedical ontologies). Because appropriate tools and computational infrastructure that can be fully understood and used by the personnel involved are lacking, there is very little capacity to carry out analyses of these datasets [

Increasingly, we are also seeing the emergence of biomedical and scientific collaboration. The Semantic Web Health Care and Life Sciences Interest Group (HCLSIG) [

Although the Semantic Web provides a solution for biomedical information exchange, significant difficulties remain in achieving seamless semantic interoperability and interchange [

Many efforts have been made to perform knowledge discovery with semantic web techniques. For example, in general settings, vSparQL was introduced to enable application ontologies to be derived from these large, fragmented sources such as the FMA [

In this study, we present BioBroker, a knowledge discovery framework equipped with algorithms that combine a graph clustering method and an indexing technique. The aim of this framework is to generate cohesive query statements out of heterogeneous ontologies and execute these queries for the purpose of knowledge acquisition and discovery.

In the following, we first introduce materials used in this study. Next, we describe the methods and evaluation approaches used to build and test the framework. We then present the results followed by discussion. Lastly, we conclude and discuss potential future directions.

The Resource Description Framework (RDF)

The RDF is a standard model for data interchange and information exchange on the web. It extends the linking structure of the web to use URIs to name the relationship between things as well as the two ends of the link, which are usually referred to as a triplet [

Bio2RDF

Bio2RDF is a collection of biological knowledge bases which leverages semantic web technologies to provide interlinked life science data [

Cytoscape

Cytoscape is open source software used to visualize bioinformatics information and networks [

OpenLink Virtuoso

The OpenLink Virtuoso is a triple store database for managing linked data from existing data silos [

The objective of this research is to find predicate patterns with a high degree of connectivity and to identify paths with a relatively small number of hops through highly connected nodes for traversing RDF graphs. We show how to define and discover the patterns of these significant nodes and how to use them for scalable query processing. We present our predicate-centric model in terms of the definition of predicate patterns, the discovery of patterns, and the use of patterns during query processing.

Predicate Patterns

A predicate P represents a binary relation between two concepts (c1 and c2) in an ontology. In RDF/OWL, P is represented as a property that expresses any kind of relationship (e.g., SubClassOf, Type) between a domain (subject) and a range (object) [

Share Patterns: As shown in the table below, share patterns are defined between two predicates P_{1} and P_{2}, where Si, Sj are sets of subjects and Oi, Oj are sets of objects in the given ontologies.

Patterns | Semantic and Pragmatic Knowledge: Exact | Semantic and Pragmatic Knowledge: Partial |
---|---|---|
Subject-Object Share | Si == Sj && Oi == Oj | (Si >= Sj or Si <= Sj) && (Oi >= Oj or Oi <= Oj) |
Subject Share | Si == Sj | Si >= Sj or Si <= Sj |
Object Share | Oi == Oj | Oi >= Oj or Oi <= Oj |

Connection Patterns: According to

Patterns | Semantic and Pragmatic Knowledge: Symbol | Semantic and Pragmatic Knowledge: Condition |
---|---|---|
Path Connectivity | Si → P1 → Oi → P2 → Oj | P1 ≠ P2 && Oi = Sj |
Cycle Connectivity | Si → P1 → Oi → P2 → Oj | P1 ≠ P2 && Oi = Sj && Si = Oj |
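The share and connection conditions listed in the two pattern tables can be illustrated with a minimal sketch. It assumes that `>=` and `<=` in the share table denote superset and subset relations between sets, which is an interpretation rather than something the text states. The function names and the `ex:` triples are hypothetical and are used only for illustration.

```python
from collections import defaultdict


def predicate_sets(triples):
    """Map each predicate to the set of its subjects and the set of its objects."""
    subjects, objects = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        subjects[p].add(s)
        objects[p].add(o)
    return subjects, objects


def share_kind(a, b):
    """Return 'exact' if the sets are equal, 'partial' if one contains the other
    (assumed reading of >= and <=), otherwise None."""
    if a == b:
        return "exact"
    if a >= b or a <= b:
        return "partial"
    return None


def share_patterns(triples, p1, p2):
    """Classify subject, object, and subject-object sharing between two predicates."""
    subjects, objects = predicate_sets(triples)
    subj = share_kind(subjects[p1], subjects[p2])
    obj = share_kind(objects[p1], objects[p2])
    if subj == "exact" and obj == "exact":
        both = "exact"
    elif subj is not None and obj is not None:
        both = "partial"
    else:
        both = None
    return {"subject": subj, "object": obj, "subject_object": both}


def connection_pattern(t1, t2):
    """Return 'path' if P1 != P2 and Oi = Sj, 'cycle' if additionally Si = Oj."""
    s1, p1, o1 = t1
    s2, p2, o2 = t2
    if p1 == p2 or o1 != s2:
        return None
    return "cycle" if s1 == o2 else "path"


# Hypothetical triples, not taken from Bio2RDF.
triples = [
    ("drug:A", "ex:target", "prot:X"),
    ("drug:A", "ex:enzyme", "prot:X"),
    ("drug:B", "ex:target", "prot:Y"),
    ("prot:X", "ex:encodedBy", "gene:G"),
]
shares = share_patterns(triples, "ex:target", "ex:enzyme")
link = connection_pattern(triples[0], triples[3])
```

Here `ex:enzyme` has a subset of the subjects and objects of `ex:target`, so all three share patterns are partial, and the first and last triples form a path because the object of one is the subject of the other.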

Ontology Clustering with Predicates

Based on the two predicate patterns defined above, we found that predicates play an important role as hubs that share information and connect entities among heterogeneous data. We therefore hypothesized that graphs can be fuzzy-clustered based on predicate sharing and distance measurement, and that data in the same cluster are more closely related than data in different clusters.

Predicate Neighboring Level Determination: First, we need to define the boundary of domains in terms of sets of concepts and relations over the datasets. For this purpose, we propose a predicate neighboring algorithm to determine the closeness of each pair of distinct predicates. Different shapes of edges denote different relationships between predicates p_{i} and p_{j} through concepts C. Level 1 has four different combinations that are based on a predicate sharing pattern as well as a connection pattern. Levels 2 and 3 each have two paths that are based only on a predicate connection pattern. The formal definition is given in Definition 1. The closeness of the relationship decreases as the level increases. We set the upper limit to three because we assume that relationships between predicates beyond three levels are sparse.

Definition 1: Given a directed graph $G(V, E)$, vertices $V_{s}$, $V_{p}$, $V_{so}$ denote subject, predicate, and object nodes in the RDF graph, respectively. Let $d(p_i, p_j)$ represent the shortest distance between $p_i$ and $p_j$, and let $r(p_i, p_j)$ determine the reachability between $p_i$ and $p_j$. Then $n(p_i, p_j)$ indicates the closest neighboring level between $p_i$ and $p_j$:

$$
n(p_i, p_j) =
\begin{cases}
1, & \text{if } d(p_i, p_j) = 1 \\
2, & \text{if } d(p_i, p_j) = 2 \text{ and } r(p_i, p_j) = \text{true} \\
3, & \text{if } d(p_i, p_j) = 3 \text{ and } r(p_i, p_j) = \text{true}
\end{cases}
$$
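As a rough illustration of Definition 1, the neighboring level can be computed with a breadth-first search that stops after three hops. This is a sketch under assumptions: the `adjacency` mapping between predicates (which predicates are one hop apart through a shared or connecting concept) is taken as given, and the function name and toy graph are hypothetical.

```python
from collections import deque


def neighbor_level(adjacency, p_i, p_j, max_level=3):
    """Return the neighboring level n(p_i, p_j) in {1, 2, 3}, or None if p_j is
    not reachable from p_i within max_level hops (or if p_i equals p_j)."""
    if p_i == p_j:
        return None
    seen = {p_i}
    queue = deque([(p_i, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_level:
            continue  # do not expand beyond the upper limit
        for nxt in adjacency.get(node, ()):
            if nxt == p_j:
                return dist + 1  # first hit in BFS order is the shortest distance
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical chain of predicates p1 - p2 - p3 - p4 - p5.
adjacency = {
    "p1": ["p2"],
    "p2": ["p1", "p3"],
    "p3": ["p2", "p4"],
    "p4": ["p3", "p5"],
    "p5": ["p4"],
}
levels = [neighbor_level(adjacency, "p1", p) for p in ("p2", "p3", "p4", "p5")]
```

In this toy chain, `p5` is four hops from `p1`, so it falls outside the three-level limit and gets no level.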

Predicate Similarity Measurement Calculation

We used a clustering approach to discover predicate association patterns from ontologies. The similarity-based confusion measurement for the clustering algorithm varies with the neighboring level of each pair of predicates. In essence, we give higher weight to closer predicates and lower weight to more distant ones. Definitions 2, 3, and 4 cover the three levels, respectively. The formula for generating the confusion matrix for the clustering algorithm is given in Definition 5.

Definition 2: Denote $p_i$ and $p_j$ as predicates in an RDF graph. The set of sets $\{\{S_{i1}\}, \{S_{j1}\}\}$ contains all the predicates such that $\forall m \in \{S_{i1}\} \rightarrow n(m, p_i) = 1$ and $\forall n \in \{S_{j1}\} \rightarrow n(n, p_j) = 1$. Let $P(x)$ represent the number of entities that directly connect to predicate set $x$ and $E(e)$ represent the number of entities in a given entity set $e$. Given an entity set $\{C_1\}$ such that $\forall e_1 \in \{C_1\} \rightarrow e_1 \in \{S_{i1}\}$ and $e_1 \in \{S_{j1}\}$, the probability-based similarity is

$$
PS_{ij} = \frac{E(C_1)}{P(S_{i1})} \ast \frac{E(C_1)}{P(S_{j1})}.
$$
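A small numeric sketch may help in reading the level-1 similarity in Definition 2. It assumes that the two arguments are the sets of entities directly connected to the level-1 predicate sets of the two predicates, and that the shared set C1 is their intersection. The function name and the toy entity names are hypothetical.

```python
def level1_similarity(entities_i, entities_j):
    """Product of the two shared-entity fractions, following Definition 2 under
    the assumption that C1 is the intersection of the two entity sets."""
    if not entities_i or not entities_j:
        return 0.0  # no connected entities, so no evidence of similarity
    shared = entities_i & entities_j
    return (len(shared) / len(entities_i)) * (len(shared) / len(entities_j))


# Hypothetical entity sets: 2 shared entities out of 4 and out of 3.
entities_i = {"e1", "e2", "e3", "e4"}
entities_j = {"e3", "e4", "e5"}
ps_ij = level1_similarity(entities_i, entities_j)  # (2/4) * (2/3)
```

Under this reading, identical entity sets give a similarity of 1 and disjoint sets give 0, which matches the intent of weighting closer predicates more heavily.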

Definition 3: Denote $p_i$ and $p_j$ as predicates in an RDF graph. The set of sets $\{\{S_{i1}\}, \{S_{j1}\}, \{S_{i2}\}, \{S_{j2}\}\}$ contains all the predicates such that $\forall m \in \{S_{i1}\} \rightarrow n(m, p_i) = 1$, $\forall n \in \{S_{j1}\} \rightarrow n(n, p_j) = 1$, $\forall x \in \{S_{i2}\} \rightarrow n(x, p_i) = 2$ and $\forall y \in \{S_{j2}\} \rightarrow n(y, p_j) = 2$. Let $P(x)$ represent the number of entities directly connected to predicate set $x$ and $E(e)$ represent the number of entities in a given entity set $e$.

Given an entity set $\{C_1\}$ such that $\forall e_1 \in \{C_1\} \rightarrow e_1 \in \{S_{i1}\}$ and $e_1 \in \{S_{i2}\}$, or $\forall e_1 \in \{C_1\} \rightarrow e_1 \in \{S_{j1}\}$ and $e_1 \in \{S_{j2}\}$, the probability-based similarities are

$$
PS_{ij1} = \frac{E(C_1)}{P(S_{i1})} \ast \frac{E(C_1)}{P(S_{i2})}, \qquad
PS_{ij2} = \frac{E(C_1)}{P(S_{j1})} \ast \frac{E(C_1)}{P(S_{j2})}.
$$

Given an entity set $\{C_2\}$ such that $\forall e_2 \in \{C_2\} \rightarrow e_2 \in \{S_{i2}\}$ and $e_2 \in \{S_{j2}\}$, the probability-based similarity is

$$
PS_{ij3} = \frac{E(C_2)}{P(S_{i2})} \ast \frac{E(C_2)}{P(S_{j2})}.
$$

Thus,

$$
PS_{ij} = \max(PS_{ij1}, PS_{ij2}) \ast PS_{ij3}.
$$

Definition 4: Denote $p_i$ and $p_j$ as predicates in an RDF graph. The set of sets $\{\{S_{i1}\}, \{S_{j1}\}, \{S_{i2}\}, \{S_{j2}\}, \{S_{i3}\}, \{S_{j3}\}\}$ contains all the predicates such that $\forall m \in \{S_{i1}\} \rightarrow n(m, p_i) = 1$, $\forall n \in \{S_{j1}\} \rightarrow n(n, p_j) = 1$, $\forall x \in \{S_{i2}\} \rightarrow n(x, p_i) = 2$, $\forall y \in \{S_{j2}\} \rightarrow n(y, p_j) = 2$, $\forall t \in \{S_{i3}\} \rightarrow n(t, p_i) = 3$ and $\forall k \in \{S_{j3}\} \rightarrow n(k, p_j) = 3$. Let $P(x)$ represent the number of entities directly connected to predicate set $x$ and $E(e)$ represent the number of entities in a given entity set $e$.

Given a set $\{C_1\}$ such that $\forall e_1 \in \{C_1\} \rightarrow e_1 \in \{S_{i1}\}$ and $e_1 \in \{S_{i2}\}$, or $\forall e_1 \in \{C_1\} \rightarrow e_1 \in \{S_{j1}\}$ and $e_1 \in \{S_{j2}\}$, the probability-based similarities are

$$
PS_{ij1} = \frac{E(C_1)}{P(S_{i1})} \ast \frac{E(C_1)}{P(S_{i2})}, \qquad
PS_{ij2} = \frac{E(C_1)}{P(S_{j1})} \ast \frac{E(C_1)}{P(S_{j2})}.
$$

Given a set $\{C_2\}$ such that $\forall e_2 \in \{C_2\} \rightarrow e_2 \in \{S_{i2}\}$ and $e_2 \in \{S_{i3}\}$, or $\forall e_2 \in \{C_2\} \rightarrow e_2 \in \{S_{j2}\}$ and $e_2 \in \{S_{j3}\}$, the probability-based similarities are

$$
PS_{ij3} = \frac{E(C_2)}{P(S_{i2})} \ast \frac{E(C_2)}{P(S_{i3})}, \qquad
PS_{ij4} = \frac{E(C_2)}{P(S_{j2})} \ast \frac{E(C_2)}{P(S_{j3})}.
$$

Given a set $\{C_3\}$ such that $\forall e_3 \in \{C_3\} \rightarrow e_3 \in \{S_{i3}\}$ and $e_3 \in \{S_{j3}\}$, the probability-based similarity is

$$
PS_{ij5} = \frac{E(C_3)}{P(S_{i3})} \ast \frac{E(C_3)}{P(S_{j3})}.
$$

Thus,

$$
PS_{ij} = \max(PS_{ij1} \ast PS_{ij3},\ PS_{ij2} \ast PS_{ij4}) \ast PS_{ij5}.
$$

Definition 5: Given a confusion matrix $CM$ and the total number of predicates $n$, denote $PS_{ij}$ as the probability-based similarity score between predicates $p_i$ and $p_j$ based on different levels, so that:

$$
CM[p_i, p_j] =
\begin{cases}
PS_{ij}, & \text{if } p_i \neq p_j,\ 0 \le i \le n,\ 0 \le j \le n \\
1, & \text{if } p_i = p_j,\ 0 \le i \le n,\ 0 \le j \le n
\end{cases}
$$
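The construction in Definition 5 can be sketched as a nested loop over all predicate pairs. The sketch assumes that a pairwise similarity function is already available. The `toy_similarity` lookup and its scores are hypothetical stand-ins for the level-based scores of Definitions 2 to 4.

```python
def build_confusion_matrix(predicates, similarity):
    """Fill CM[i][j] with 1 on the diagonal and similarity(p_i, p_j) elsewhere."""
    n = len(predicates)
    matrix = [[0.0] * n for _ in range(n)]
    for a, p_a in enumerate(predicates):
        for b, p_b in enumerate(predicates):
            matrix[a][b] = 1.0 if a == b else similarity(p_a, p_b)
    return matrix


# Hypothetical pairwise scores, keyed by unordered predicate pair.
scores = {
    frozenset(("p1", "p2")): 0.50,
    frozenset(("p1", "p3")): 0.10,
    frozenset(("p2", "p3")): 0.25,
}


def toy_similarity(p_a, p_b):
    """Look up a symmetric score; unknown pairs default to 0."""
    return scores.get(frozenset((p_a, p_b)), 0.0)


cm = build_confusion_matrix(["p1", "p2", "p3"], toy_similarity)
```

Because the toy lookup is keyed by unordered pairs, the resulting matrix is symmetric. Whether the authors' matrix is symmetric depends on how the level-based scores are computed, which this sketch does not settle.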

We posit that predicate clustering is a required step for efficient query processing that involves the alignment and integration of ontologies. Here we clarify our approach to efficient query processing and query generation within the above theoretical framework. A query consists of a collection of relationships between multiple properties. Because some properties are more closely related to each other than to others, property clustering and partitioning, that is, the task of classifying a collection of properties into clusters, can be used for efficient query processing. The guiding principle is to minimize inter-cluster similarity and maximize intra-cluster similarity, based on the notion of semantic distance.

Hierarchical Fuzzy C-Means Clustering Algorithm

To discover the correlation between predicates, we used an innovative Hierarchical Fuzzy C-Means (HFCM) clustering algorithm. We created the HFCM algorithm and made a functional extension based on a Fuzzy C-Means clustering algorithm [
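For orientation, the membership update of the standard Fuzzy C-Means algorithm, on which HFCM is said to be based, can be sketched as follows. This is the textbook FCM update only, not the authors' hierarchical extension, whose details are not given in this excerpt. The `distances` input, the function name, and the fuzzifier default `m=2.0` are assumptions made for illustration.

```python
def fcm_memberships(distances, m=2.0):
    """Standard FCM membership update.

    distances[k][i] is the distance of item k to cluster center i.
    Returns memberships[k][i] with each row summing to 1.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for row in distances:
        zero = [i for i, d in enumerate(row) if d == 0.0]
        if zero:
            # The item coincides with one or more centers: split membership among them.
            memberships.append(
                [1.0 / len(zero) if i in zero else 0.0 for i in range(len(row))]
            )
            continue
        memberships.append(
            [1.0 / sum((d_i / d_j) ** exponent for d_j in row) for d_i in row]
        )
    return memberships


# Hypothetical item that is three times closer to center 0 than to center 1.
u = fcm_memberships([[1.0, 3.0]])
```

With `m=2.0`, an item at distances 1 and 3 from two centers receives memberships of about 0.9 and 0.1. One way to obtain distances from the confusion matrix would be to take one minus the similarity, but that conversion is an assumption here and is not described in the text.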

Indexing for Ontology and Data

Because large and varied biomedical data are spread across different clusters, a new indexing technique was developed to represent the predicate patterns of ontologies from the clusters. Specifically, a two-level encoding approach has been developed to index the RDF schema, instances, and triples. For the cluster spaces, the two-level hierarchical indexing technique provides an efficient representation of complex relations between nodes and predicate association patterns. We used binary encoding to index the OWL/RDF schema and combined binary encoding with bitmap encoding to index OWL/RDF instances.

At the schema level, our assumption is that the size of the schema for each medical and healthcare knowledge base is constant, so the total size of the schema encoding can be controlled even if the binary encoding grows drastically. We started the binary index at binary 10 and encoded predicates first to ensure that every predicate encoding is smaller than every entity encoding.

At the instance level, we assigned a unique bitmap index to each instance under its schema encoding. Our design philosophy is that instances with different schemas can share the same bitmap encoding, but instances under the same schema must each be assigned a unique index. Therefore, even with a huge number of instances, bitmap indexing can be used in a scalable way, and the combination of the binary and bitmap indexes uniquely determines an instance.

At the triple level, we applied a logical OR operation to the schema encodings of the RDF subject, predicate, and object to generate the result. If a triple does not have a cycle, we set the object schema encoding to be larger than the subject encoding. If a triple has a cycle, we use the rightmost bit as the cycle indicator and set the subject encoding to be larger than the object encoding. With this design, a cycle triple can easily be distinguished from a non-cycle one. Definition 6 specifies this encoding approach.

Definition 6: Given the $i$-th node ($i \ge 0$) of schema set $\{S\}$ and the $j$-th node ($j \ge 0$) of instance set $\{I\}$, let the predecessor sets $\{m\}$ and $\{n\}$ contain all the father nodes of $i$ and $j$, respectively. Denote each RDF triplet $t$ as $\{s, p, o\}$. Let $S(i)$ represent the schema encoding set, $I(j)$ the instance encoding set, $TS(t)$ the triple schema encoding set, and $TI(t)$ the triple instance encoding set, and let the integer $R$ represent the magnitude of the data:

$$
S(i) =
\begin{cases}
\{S(i-1) \vee 2^{i}\}, & \text{if } \{m\} \neq \emptyset \text{ and } \forall i \in \{m\} \\
\{2^{i}\}, & \text{if } \{m\} = \emptyset
\end{cases}
$$

$$
I(j) =
\begin{cases}
\left\{S(i) + \dfrac{I(j-1) + 1}{R}\right\}, & \text{if } \{n\} \neq \emptyset \text{ and } \forall j \in \{n\} \\
\left\{S(i) + \dfrac{1}{R}\right\}, & \text{if } \{n\} = \emptyset
\end{cases}
$$

$$
TS(t) =
\begin{cases}
\{S(s) \vee S(p) \vee S(o) \mid S(o) > S(s) > S(p) > 1\}, & \text{if } \{s, p, o\} \text{ does not form a cycle} \\
\{S(s) \vee S(p) \vee S(o) \vee 1 \mid S(s) > S(o) > S(p) > 1\}, & \text{if } \{s, p, o\} \text{ forms a cycle}
\end{cases}
$$

$$
TI(t) = \left\{TS(t) + \frac{I(s)}{R} + \frac{I(o)}{R}\right\}
$$
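The schema-level and triple-level parts of this encoding can be illustrated with a simplified sketch. It assigns one bit per schema element starting at binary 10, encodes predicates before entities, combines a triple with a bitwise OR, and reserves the rightmost bit as the cycle flag. It deliberately leaves out the predecessor term of S(i), the instance-level bitmap, and the ordering constraints between subject and object encodings. All function names and the `ex:` identifiers are hypothetical.

```python
def assign_schema_codes(predicates, entities):
    """Give each schema element its own bit, predicates first, starting at 0b10.
    Bit 0 is reserved for the cycle flag."""
    codes = {}
    bit = 1
    for name in list(predicates) + list(entities):
        codes[name] = 1 << bit
        bit += 1
    return codes


def triple_schema_code(codes, s, p, o, is_cycle=False):
    """OR the schema codes of subject, predicate, and object; set bit 0 for a cycle."""
    code = codes[s] | codes[p] | codes[o]
    return (code | 1) if is_cycle else code


def involves(triple_code, codes, *names):
    """Check with a bit mask whether a triple code contains all the named elements."""
    mask = 0
    for name in names:
        mask |= codes[name]
    return (triple_code & mask) == mask


# Hypothetical schema with one predicate and two entity classes.
codes = assign_schema_codes(["ex:target"], ["ex:Drug", "ex:Protein"])
plain = triple_schema_code(codes, "ex:Drug", "ex:target", "ex:Protein")
cyclic = triple_schema_code(codes, "ex:Drug", "ex:target", "ex:Protein", is_cycle=True)
```

In this toy schema the predicate gets code 2 and the entity classes get 4 and 8, so the predicate encoding is smaller than both entity encodings. The design choice this illustrates is that matching a query pattern against a triple reduces to a mask comparison rather than a string comparison.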

Query Processing using Predicate Patterns based Clustering and Indexing

An intuitive query system was implemented on top of the predicate-pattern-based clustering and indexing of the imported medical data. With this approach, users' queries can be answered with high accuracy and performance. A structured representation of semantic relations between concepts can be intuitively extended to query systems. Some features of our prototype BioBroker framework are listed below.

Integrated OWL/RDF Schema Clustering: Different OWL/RDF medical datasets can be imported into the BioBroker. Our system parses the schema from the data and applies the HFCM algorithm to the schema based on predicate similarity. Clustering graphs are generated accordingly, and triples with the same predicate in different schema sets can be linked.

Query Boundary: Query processing can be optimized based on the proposed concept of a query boundary. The boundary can be determined by predicate associations and clustering sets. A query boundary characterizes a particular dynamic reasoning and query capability of the proposed model that is specifically tailored for query semantics. More specifically, it can be proved that for a specific kind of user query, there exists a fixed set of abstract patterns that are involved in query processing. This fixed set is called the query boundary for that type of query. A query is mapped to a query boundary within clusters, and the BioBroker highlights the boundary in green. As an example shown in

Interactive Query Design with Suggestion: The predicates extracted from a user’s SPARQL query [

Query Indexing to Optimize Benchmark Query Performance: The BioBroker translates each SPARQL query into a query indexing format based on the medical ontology and data indexing. Executing the SPARQL query therefore amounts to performing logical operations on the schema binary index and arithmetic operations on the instance bitmap index. In

Evaluation

The BioBroker prototype system was implemented using Java on Eclipse Juno Integrated Development Environment [

The BioBroker system was evaluated in terms of the validity of the clustering results and the justified generation of query benchmarks. We used three ontologies from Bio2RDF release 2 to evaluate our system. Detailed information for each ontology is given in

To select the optimal clustering algorithm for knowledge discovery, we first compared performances yielded by the Hierarchical Fuzzy C-Means (HFCM), the Partition Around Medoids (PAM) algorithm [

Ontology | Triple Types | Entities | Properties | Triple Instances |
---|---|---|---|---|
DrugBank | 306 | 91 | 56 | 3,649,750 |
HUGO Gene Nomenclature Committee (HGNC) | 84 | 19 | 18 | 3,628,205 |
Mouse Genome Informatics (MGI) | 140 | 17 | 19 | 8,206,813 |

selected optimal clustering algorithm.

For query evaluation, we selected eight query benchmarks [

Evaluation for HFCM Algorithm

As shown in

ID | Size | Predicate Name |
---|---|---|
Cluster 1 | 17 | MGI:allele, MGI:marker, MGI:phenotype, MGI:strain.type, MGI:x.ensembl.protein, MGI:x.ensembl.transcript, MGI:x.genbank, MGI:x.pubmed, MGI:x.refseq.protein, MGI:x.refseq.transcript, MGI:x.trembl, MGI:x.uniprot, MGI:x.vega.protein, MGI:x.vega.transcript, MGI:xHGNC, MGI:theoretical.pi, MGI:xENSEMBL |
Cluster 2 | 6 | HGNC:x.ccds, HGNC:x.ncbigene, HGNC:x.omim, HGNC:x.refseq, HGNC:x.uniprot, DrugBank:xref |
Cluster 3 | 9 | HGNC:has.approved.symbol, HGNC:is.approved.symbol.of, HGNC:status, HGNC:x.ensembl, HGNC:x.mgi, HGNC:x.pubmed, HGNC:x.rgd, HGNC:x.ucsc, HGNC:x.vega |
Cluster 4 | 6 | DrugBank:form, DrugBank:ingredient, DrugBank:ingredients, DrugBank:route, DrugBank:source, DrugBank:molecular.weight |
Cluster 5 | 18 | DrugBank:manufacturer, DrugBank:mechanism.of.action, DrugBank:molecular.weight, DrugBank:name, DrugBank:packager, DrugBank:pharmacology, DrugBank:protein.binding, DrugBank:route.of.elimination, DrugBank:specific.function, DrugBank:substructure, DrugBank:synonym, DrugBank:target, DrugBank:mixture, DrugBank:theoretical.pi, DrugBank:toxicity, DrugBank:transmembrane.regions, DrugBank:transporter, DrugBank:value, DrugBank:volume.of.distribution |
Cluster 6 | 19 | DrugBank:absorption, DrugBank:action, DrugBank:affected.organism, DrugBank:biotransformation, DrugBank:brand, DrugBank:calculated.property, DrugBank:category, DrugBank:cellular.location, DrugBank:dosage, DrugBank:drug, DrugBank:experimental.property, DrugBank:food.interaction, DrugBank:gene.name, DrugBank:general.function, DrugBank:half.life, DrugBank:indication, DrugBank:kingdom, DrugBank:locus, |
Cluster 7 | 10 | DrugBank:approved, DrugBank:country, DrugBank:ddi.interactor.in, DrugBank:enzyme, DrugBank:expires, DrugBank:mixture, DrugBank:patent, DrugBank:price, DrugBank:product, DrugBank:drug |

Detailed clustering information by the HFCM is given in

Evaluation for Query Performance

Query benchmark was established and detailed information of queries can be found in

We executed all queries on the Virtuoso database and retrieved relevant results as shown in

Query Number | Answers |
---|---|
Q1 | r = DrugBank:DB00157_711, s = DrugBank:DB00157, e = DrugBank:12 |
Q2 | ddi = DrugBank:DB00001_DB01381, d1 = DrugBank:DB00001, d2 = DrugBank:DB00001, patent = uspatent:5180668, data2 = 2010-01-19 |
Q3 | m = DrugBank:Cauterex, i = dornase alfa + fibrinolysin + gentamicin sulfate, d = DrugBank:DB00003 |
Q4 | d = DrugBank:DB00006, f = Dan Shen, dong quai, evening primrose oil, gingko, policosanol, willow bark |
Q5 | hgnc = HGNC:26946, marker = MGI:1913367, mgi = MGI:1913367, uni = Uniprot:Q9CR13, ens = Ensembl:ENSMUSG00000019689 |
Q6 | Hgnc = HGNC:7863, target = DrugBank:11, u1 = UNIPROT:Q13423, u2 = Uniprot:Q13423 |
Q7 | Target = DrugBank:9, hgnc = HGNC:5211 |
Q8 | Target = DrugBank:6601, hgnc = HGNC:24427, mgi = MGI:88574, e1 = Ensembl:ENSMUSG00000015340, e2 = ENSMUSG000000197953 |

System | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
BioBroker | 8 | 17 | 10 | 5 | 8 | 4 | 4 | 25 |
Virtuoso | 525 | 954 | 590 | 125 | 403 | 219 | 168 | 1037 |

We also tested query execution performance on the Bio2RDF DrugBank, HGNC, and MGI datasets with queries 1 to 8, comparing our indexed query performance with Virtuoso-based SPARQL query performance. The small-scale data we used has 3,651,750 triples and 105 properties. The performance comparison results are shown in
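To make the comparison easier to read, the per-query ratio between the two rows of the timing table for Q1 to Q8 can be computed directly. The values below are copied from that table. The unit of measurement is not stated in this excerpt, so only the unitless ratio is derived, and the variable names are illustrative.

```python
# Values copied from the BioBroker and Virtuoso rows of the timing table.
biobroker = [8, 17, 10, 5, 8, 4, 4, 25]
virtuoso = [525, 954, 590, 125, 403, 219, 168, 1037]

# Ratio of Virtuoso time to BioBroker time for each query (unitless).
speedups = {
    f"Q{i}": v / b
    for i, (b, v) in enumerate(zip(biobroker, virtuoso), start=1)
}

smallest = min(speedups, key=speedups.get)
largest = max(speedups, key=speedups.get)
```

On these figures the ratio ranges from 25 for Q4 to about 65.6 for Q1.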

There are several studies for extension of the SPARQL query with some extended patterns such as path SPARQL [

This paper presents a predicate-pattern-based model equipped with an indexing technique for query suggestion, visualization, scalable querying, and reasoning over large biomedical ontology schemas and data. The proposed model transforms conjunctive SPARQL queries into efficient pattern-based queries over a set of interlinked medical data sources. The benefits of predicate-based query processing were shown through the discovery of predicate patterns. The proposed model was evaluated with the Bio2RDF datasets, and the experimental results for query design and query execution showed the advantage of the proposed predicate-centric model over existing query models.

In the future, we will combine graph network analysis approaches [

Human Phenotype Ontology (HPO) [

Shen, F.C. and Lee, Y. (2018) BioBroker: Knowledge Discovery Framework for Heterogeneous Biomedical Ontologies and Data. Journal of Intelligent Learning Systems and Applications, 10, 1-20. https://doi.org/10.4236/jilsa.2018.101001