
F. PESSINA ET AL.
Copyright © 2013 SciRes. ENG
2. Genomic and Proteomic Data Warehouse
2.1. Integrative Data Schema
To integrate heterogeneous data available from many
different sources, abstraction and generalization of the
concepts to be integrated are paramount. Modularity and
customizability of the global data schema are equally vital
to support easy integration, extension of the data schema
with new data types and sources, and maintenance as the
data, formats and schemas of the original integrated sources
evolve. With these aims, and with the goal of creating the
GPDW, as illustrated in [6], we focused on the biomedical
molecular entities and features described by the data to be
integrated, which are provided by distinct sources. Briefly,
we abstracted and generalized such features and defined our
integrated relational data schema as composed of multiple
interconnected modules. Each module represents a single
feature, whose data are provided by one or more of the
integrated data sources, and is composed of a number of data
tables that depends on the integrated data. These tables are
hierarchically related as shown in the Directed Acyclic
Graph (DAG) in Figure 1.
Feature modules can be pairwise associated; such
associations represent the valuable association/annotation
data provided by the integrated data sources, which are
stored in hierarchically related association tables (Figure
2).
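As a minimal sketch of the hierarchical relation among the tables of a feature module, the following Java class models table dependencies as a DAG and derives an order in which every table follows the tables it depends on. The class name, its methods and the table names (gene, gene_alternative_id, gene_history, gene_history_source) are hypothetical and chosen only for illustration; the actual GPDW tables are those shown in Figure 1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: this class and all table names are hypothetical,
// not taken from the GPDW implementation.
final class FeatureModuleDag {
    // parent table -> tables that reference it
    private final Map<String, List<String>> children = new LinkedHashMap<>();
    // table -> number of parent tables it references
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addTable(String table) {
        children.putIfAbsent(table, new ArrayList<>());
        inDegree.putIfAbsent(table, 0);
    }

    // Declares that the child table references the parent table.
    void addDependency(String parent, String child) {
        addTable(parent);
        addTable(child);
        children.get(parent).add(child);
        inDegree.merge(child, 1, Integer::sum);
    }

    // Kahn's algorithm: returns the tables so that each one
    // appears after all the tables it references.
    List<String> creationOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String table = ready.remove();
            order.add(table);
            for (String child : children.get(table)) {
                if (remaining.merge(child, -1, Integer::sum) == 0) {
                    ready.add(child);
                }
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("Table dependencies contain a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        FeatureModuleDag gene = new FeatureModuleDag();
        gene.addDependency("gene", "gene_alternative_id");
        gene.addDependency("gene", "gene_history");
        gene.addDependency("gene_history", "gene_history_source");
        System.out.println(gene.creationOrder());
    }
}
```

Such an ordering could be used, for instance, to create or populate the tables of a module without violating references between them; whether the GPDW procedures work this way is not stated here.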
The feature modules and their associations contained
in a specific instance/version of our generalized global
data schema depend on the particular data sources and
their provided data that are integrated in that specific data
schema instance. To support the automatic construction
and updating of a database adopting such data schema,
we defined a procedure to register the data sources and
their feature data to be integrated, and to collect all the
required metadata about them and their associations. We
store these metadata in a specific metadata schema, which
allows seamless and transparent access to all data in the
database regardless of the specific database version.
Figure 1. Directed acyclic graph of the tables in a GPDW
feature module.
Figure 2. DAG of the association tables between two GPKB
feature modules.
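The registration of data sources, features and associations described in Section 2.1 can be pictured with the following in-memory sketch. It is only an illustration under stated assumptions: the class MetadataRegistry and its methods are hypothetical, the pairing of sources with features in the example is illustrative, and in the GPDW these metadata are kept in database tables rather than in Java collections.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: a hypothetical in-memory stand-in for a
// metadata schema that records sources, features and associations.
final class MetadataRegistry {
    // feature name -> names of the data sources providing it
    private final Map<String, Set<String>> sourcesByFeature = new LinkedHashMap<>();
    // registered pairwise associations, stored as order-independent keys
    private final Set<String> associations = new LinkedHashSet<>();

    void registerFeature(String source, String feature) {
        sourcesByFeature.computeIfAbsent(feature, k -> new LinkedHashSet<>()).add(source);
    }

    // An association can only link features that are already registered.
    void registerAssociation(String featureA, String featureB) {
        if (!sourcesByFeature.containsKey(featureA) || !sourcesByFeature.containsKey(featureB)) {
            throw new IllegalArgumentException("Both features must be registered first");
        }
        associations.add(key(featureA, featureB));
    }

    boolean areAssociated(String featureA, String featureB) {
        return associations.contains(key(featureA, featureB));
    }

    // Features available in this database instance, in registration order.
    List<String> features() {
        return new ArrayList<>(sourcesByFeature.keySet());
    }

    Set<String> sourcesOf(String feature) {
        Set<String> sources = sourcesByFeature.get(feature);
        return sources == null ? Collections.emptySet() : Collections.unmodifiableSet(sources);
    }

    // Builds the same key for (a, b) and (b, a).
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    public static void main(String[] args) {
        MetadataRegistry registry = new MetadataRegistry();
        // Source/feature pairings below are illustrative assumptions.
        registry.registerFeature("Entrez Gene", "gene");
        registry.registerFeature("GO", "biological_function_feature");
        registry.registerAssociation("gene", "biological_function_feature");
        System.out.println(registry.features());
    }
}
```

Because client code asks the registry which features and associations exist instead of hard-coding them, the same code can work against different database versions, which is the property the metadata schema is meant to provide.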
2.2. Data Integrated in the GPDW
The GPDW adopts our modular data schema to integrate data
provided by several of the main bioinformatics databases,
including Entrez Gene, Homologene, MINT, IntAct, Expasy
Enzyme, GO, GOA, BioCyc, KEGG, Reactome, eVOC and OMIM.
Currently, data in the GPDW cover several features,
including DNA sequences, genes, transcripts, proteins,
enzymes, protein domains, small molecules of biological
interest, biological function features (i.e., Gene Ontology
biological processes, molecular functions and cellular
components), pathways, gene expression features, genetic
disorders, clinical synopses and their associations.
Among others, at the time of writing the GPDW contains
9,537,645 genes of 9,631 organisms, 38,960,202 proteins
of 338,004 species, 19,522 protein domains and 824,797
protein domain annotations, 28,889 biochemical pathways
and 171,372 pathway annotations (77,812 gene and
93,560 protein annotations), 35,252 Gene Ontology terms
and 64,185,070 Gene Ontology annotations (1,272,168
gene and 62,912,902 protein annotations), 10,212 human
genetic disorders and their 27,705 gene annotations.
These figures demonstrate the valuable and unique
characteristics of the GPDW.
3. Dynamic Composition and Result Visualization of GPDW Data Extraction SQL Queries
To enable any user to easily compose queries, even complex
ones, on all data integrated in the GPDW, we developed a
Web application in the Java programming language using
Servlets and JavaServer Pages (JSP) technology. It
is publicly available at
http://www.bioinformatics.dei.polimi.it/GPKB/. Through
a visual interface (Figure 3), the user only needs to
select the features integrated in the GPDW, and their
attributes, to be included in the query, together with the
conditions on the data values to be retrieved. All
information about the GPDW content required to build the
visual interface is taken from the GPDW metadata. Thus,
transparently to the user, the visualized features and
their attributes automatically adapt to the content of the
specific GPDW instance.
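To make the idea of composing a query from user selections more concrete, the following is a minimal sketch, not the application's actual code. It builds a parameterized SQL SELECT for a single feature table from the chosen attributes and conditions. The class FeatureQuery, the table name gene and the column names symbol and taxonomy_id are hypothetical; joins across features through association tables are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: a hypothetical composer of parameterized
// SELECT statements over one feature table.
final class FeatureQuery {
    // Identifiers are concatenated into the SQL text, so they are validated.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    private static final List<String> OPERATORS =
            Arrays.asList("=", "<>", "<", "<=", ">", ">=", "LIKE");

    private final String table;
    private final List<String> columns = new ArrayList<>();
    private final List<String> conditions = new ArrayList<>();
    private final List<Object> parameters = new ArrayList<>();

    FeatureQuery(String table) {
        this.table = checked(table);
    }

    // Adds an attribute of the feature to the result columns.
    FeatureQuery select(String attribute) {
        columns.add(table + "." + checked(attribute));
        return this;
    }

    // Adds a condition; the value is bound later through a placeholder.
    FeatureQuery where(String attribute, String operator, Object value) {
        if (!OPERATORS.contains(operator)) {
            throw new IllegalArgumentException("Unsupported operator: " + operator);
        }
        conditions.add(table + "." + checked(attribute) + " " + operator + " ?");
        parameters.add(value);
        return this;
    }

    String toSql() {
        if (columns.isEmpty()) {
            throw new IllegalStateException("Select at least one attribute");
        }
        StringBuilder sql = new StringBuilder("SELECT ");
        sql.append(String.join(", ", columns)).append(" FROM ").append(table);
        if (!conditions.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", conditions));
        }
        return sql.toString();
    }

    // Values to bind, in placeholder order (e.g. on a java.sql.PreparedStatement).
    List<Object> parameters() {
        return new ArrayList<>(parameters);
    }

    private static String checked(String identifier) {
        if (!IDENTIFIER.matcher(identifier).matches()) {
            throw new IllegalArgumentException("Invalid identifier: " + identifier);
        }
        return identifier;
    }

    public static void main(String[] args) {
        FeatureQuery query = new FeatureQuery("gene")
                .select("symbol")
                .select("taxonomy_id")
                .where("taxonomy_id", "=", 9606);
        System.out.println(query.toSql());
    }
}
```

In this sketch the table and attribute names would come from the metadata rather than from free user input, while user-supplied values are passed as bound parameters instead of being concatenated into the SQL text.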
Interactive menus, present in the visual interface for