1. Introduction
The study of hydrological processes and their associated extreme events (e.g., floods and droughts) is of paramount importance to human lives, global climate change, a healthy and sustainable ecological environment, and the national/international economy. In current scientific modeling of such complex hydrological and environmental systems, coupling among different hydrological models, and between hydrological models and climate models, has become common practice.
A desirable framework for coupling these models should support modularity, flexibility and interoperability, in the sense that the development and implementation of individual models are independent of each other and each model retains its own integrity and autonomy. Such a model coupling framework can not only fit and greatly facilitate an interdisciplinary and collaborative scientific work environment, but also dramatically increase the efficiency and flexibility of model coupling in research and operational practice.
In general, integration of hydrological models can be either a single point-to-point connection with a unique one-way interaction, or a multi-point, interactive and coordinated set of collaborative activities. The former can be referred to as a simple coupling problem and the latter as a sophisticated coupling problem. In the hydrological, environmental and climate fields, models, whether physically based or data-driven, are usually realized and simulated as software systems. Because of the increasing complexity and heterogeneity of the software packages, hardware and operating system platforms used to develop individual models, we seek to develop a systematic approach and framework to facilitate such complex model coupling and integration. Service-Oriented Architecture (SOA) and scientific workflow have great potential to achieve this goal. Following the SOA approach, our idea is to encapsulate each individual model’s functionality in services, addressing the needs for modularity, flexibility, autonomy, and interoperability of individual models in the coupling framework. As services can be distributed over the Internet and reused, SOA can promote remote, collaborative and interdisciplinary teamwork and make the integration of different targeted models/systems efficient. On the other hand, SOA alone is not adequate for automating the coordinated set of collaborative interactions among models that sophisticated couplings require. This leads to scientific workflow.
The rest of the paper is organized as follows. Section 2 reviews and discusses how SOA, scientific workflow and other potential techniques could be used to integrate and coordinate systems that represent scientific models. Major advantages and disadvantages of these techniques, found in the scientific and commercial computing fields, are discussed. In Section 3, we propose a general architecture for model couplings based on SOA and scientific workflow. Section 4 describes MoteWS, a prototype web services-based system developed to publish field measurement data from wireless sensor networks, which illustrates and preliminarily tests the main SOA component of our proposed architecture for model/system couplings. Finally, Section 5 discusses the lessons learned during our prototype development and provides recommendations and our future work along this research direction.
2. Reviews and Analyses
2.1. Related Work
Some work exists on the challenges of connecting collaborative hydrological models (e.g., [1,2]). For example, a theory-based analysis (e.g., [3]) presents some solutions that can meet the needs of model integration. Some proof-of-concept work on scientific workflow or SOA also exists, including [4-9]. In [10], a scientific workflow analysis goes further to find systematic ways to study results and improve the design. One of the main goals of this paper is to show, in a condensed but easy-to-read manner, why SOA and scientific workflow are a better solution than alternative techniques to couple, coordinate, and collaboratively evolve hydrologic models in the process of hydrologic studies. While these reasons appear as assumptions in other work, we believe it is important to establish them explicitly rather than implicitly, both to provide insight and to pave the way for our proposal of a general architecture for hydrologic model coupling in the next section. In the following, we analyze the potential techniques for model coupling comprehensively; such an analysis is not available in one place in the previous work.
2.2. Potential Techniques for Integration
The techniques can be classified into the following categories.
1) Data integration: The models or systems use a shared repository of data. Each system puts information in the repository, where others can find it and read it. The receiver can poll the repository continuously until it finds a message, or, the repository can provide an event-alert system. Advantages: Usually easy to implement. Disadvantages: Affects independence of each application. The integration is easily lost if any of the applications evolves.
2) Business integration: In this type of communication, one system sends a message from a core component (or a logic component) and the other system receives it in a core component (or a logic component). The business integration can be made using a special sub-layer in the logic layer designed to process the messages (see Figure 1). Advantages: Each application keeps its independence. Disadvantages: Requires more work because some service must be developed. This service will contain the code to access the other layers, data, etc.
Figure 1. Illustration of different types of integration.
3) Presentation integration: The GUI of one system allows the user to access the GUI of another system. Advantages: Allows the end user to visually perceive the integration. Disadvantages: It usually creates strong dependencies between applications.
For the task of integration between hydrological models, the type of integration that best fits the requirements is the business integration because:
1) It makes it easier to keep models independent.
2) It makes it easier to keep low coupling1 between systems.
3) It is transparent to end users.
4) It is independent of data storage implementation.
To offer functionality to other systems, each system can publish:
1) An API2: This allows the client to access all objects, send, receive and update objects, and use services of these objects.
2) Services: The logic layer can publish independent services that are not part of the objects. It leads to Service Oriented Architecture (SOA).
For hydrological models’ integration, SOA is chosen because services keep the systems independent and allow lower coupling between them. It also keeps each system transparent to the developers of the other systems, in the sense that the developers of one system do not need to know the implementation details and object structures of another system.
The task of hydrological model integration requires a large amount of information from multiple data sources and different models to be combined in a coordinated and collaborative way to obtain a solution. Such a complex task thus leads to the next one: flow control (coordination of activities).
2.3. Potential Techniques for Coordination of Activities
The techniques can be classified into the following categories.
2.3.1. By Design
· Description: In this technique, each model can send or receive messages following the rules established in the flow-design. The flow of activities can be centralized or decentralized without any real (automated) restriction. During maintenance, these systems tend to become decentralized; that is, each model ends up sending messages to and receiving messages from other models.
· Problems:
◦ If a developer breaks the rule and sends a message back to the original sender and this message asks to re-do computations, it can produce a dead-lock.
◦ Each system must be able to implement communication protocols to any other system to be connected. For example:
- System “A” in language C++ running on Windows.
- System “B” running PHP on Linux.
- System “D” running Java on Solaris.
- System “E” running Pascal on Windows.
- Every system needs code to communicate with each of the other systems’ languages and operating systems.
◦ The whole integrated system works well only if the developers of every single system agree on the flow-design document and always respect it; in reality, nothing ensures that.
◦ If the flow-design changes, all the systems and interactions must be changed. This puts the integrity of all the individual systems at risk.
◦ Updating and improvement of the integrated system are expensive, risky and restricted.
· Advantages:
◦ Requires no investment in any control tools.
· Where the control is: There is no automated control; there is only paper control.
2.3.2. By Central Control
· Description: In this technique, one model (or system) is named the Controller. The Controller sends requests to the other models, but the other models do not send messages to each other. The Controller has the code to control the flow of information and activities that the other models perform. The system can become decentralized during maintenance.
· Problems:
◦ It works well only if the other models never send messages to each other.
◦ It works well only if the Controller implements a flow-design without mixing the flow code and the business (e.g., hydrology) code. Nothing ensures that.
◦ The Controller must be able to implement communication protocols for every other system to be connected. This gives additional responsibilities to a system made for hydrological modeling.
◦ Updating and improvement of the system are very expensive and restricted.
· Advantages:
◦ If the flow needs to change, only the Controller application must be reviewed and changed.
· Where the control is: The control is always implemented in the code of the Controller model.
2.3.3. By Message Broker
· Description: All the involved systems send the messages to a component designed to intermediate communications only. The systems publish their services in the broker. Other systems can subscribe to the published services and then consume them. See Figure 2.
· Problems:
◦ The flow control is embedded in the publish/subscribe design, in the implicit flow-design and in the connections that each system performs to the broker. Thus, the flow control becomes difficult to understand, maintain and change.
◦ If a system does not respect the flow-design semantics, it could create an infinite loop.
· Advantages:
◦ Models can send messages to each other with almost no restrictions.
◦ If there are infinite loops or errors, debugging is much easier than without the broker.
◦ Models keep a high level of independence and low coupling. Thus, the capacity of each model to evolve is not affected, as it is with the previous techniques.
◦ Each system must be able to implement only one communication protocol: The one between its own language/platform and the broker.
◦ In the category of central control, one of the models (i.e., Controller model) was required to have code to communicate with all of the other systems. Here, the broker only establishes some communication protocols for some standard platforms (or maybe only one) and the others need to adapt to it.
◦ The whole integrated system is ordered, traceable, repeatable. The flow-control is difficult to modify but is easier to analyze and debug.
· Where the control is: Mostly in the publish/subscribe protocol in the broker, but also in the flow-design semantics, and some parts are in the models. See Figure 3.
Figure 2. Integration using a message broker.
Figure 3. Flow definition in P/S protocols.
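The implicit flow control described for the message broker can be illustrated with a minimal Python sketch. The `Broker` class, the topic names and the placeholder computations below are hypothetical and are not taken from any particular broker product; the sketch only shows that the order Model 1, Model 3, Model 4 emerges from the subscriptions and is never stated in one place.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory publish/subscribe broker (hypothetical, for illustration)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks
        self.log = []                         # (topic, payload) in delivery order

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, payload):
        self.log.append((topic, payload))
        for callback in self.subscribers[topic]:
            callback(payload)


broker = Broker()
results = []


def model3(rainfall):
    # Model 3 consumes Model 1's output and publishes its own result.
    broker.publish("model3.out", rainfall * 0.5)   # placeholder computation


def model4(runoff):
    # Model 4 consumes Model 3's output.
    results.append(runoff + 1.0)                   # placeholder computation


broker.subscribe("model1.out", model3)
broker.subscribe("model3.out", model4)

# Model 1 publishes. The chain Model 1 -> Model 3 -> Model 4 follows only
# from the two subscribe() calls above; no flow definition exists.
broker.publish("model1.out", 10.0)
```

Changing the order of execution in such a setup means editing subscriptions and model code in several places, which is the maintenance problem noted above.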
2.3.4. By Workflow
· Description: The flow controller is a component called the workflow. This dedicated component has full responsibility for controlling and coordinating all the processes required to complete the computations. Each time the workflow needs to run specific models, it calls them. The workflow component may or may not also act as a broker. Here we assume that the workflow is also a broker, so as to achieve all the advantages described for both message broker control and workflow control.
· Problems:
◦ It requires developers to have skills in workflow theory and integration.
◦ It requires a thorough flow design.
◦ It requires standards for processes, communication, units and composition of data.
· Advantages:
◦ Each model can communicate with any other. Better yet, in this technique, a model does not need to communicate directly with the others; the workflow component sends requests to the other models and receives their responses.
◦ Integration loops are controlled by only one component. Thus, infinite loops and dead-locks occur less often and are easier to correct.
◦ Each system keeps as much independence as possible with a lower coupling.
◦ Each system must implement one communication protocol: The one between its own language and the workflow.
◦ The flow-control code and the hydrological code are completely separated.
◦ The whole integrated system is ordered, traceable, repeatable and the flow-control is much easier to understand, modify and debug.
· Where the control is: In the workflow component.
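As a contrast with implicit flow control, the following minimal Python sketch shows a flow that is defined explicitly and in one place. The function `run_workflow`, the service names and the lambda stand-ins are assumptions made for illustration; in a real deployment each entry in `services` would wrap a remote call to a model.

```python
def run_workflow(steps, services, initial_input):
    """Run the named steps in order, feeding each step the previous output."""
    trace = []
    value = initial_input
    for name in steps:
        value = services[name](value)   # the workflow is the only caller
        trace.append((name, value))     # keeps the run traceable and repeatable
    return value, trace


# Stand-ins for model services (placeholder arithmetic, not hydrology).
services = {
    "model1": lambda x: x * 2.0,
    "model2": lambda x: x + 3.0,
}

# The flow is plain data, kept separate from the model code.
flow = ["model1", "model2", "model1"]

final, trace = run_workflow(flow, services, 1.0)
```

Changing the coordination then amounts to editing the `flow` list; the stand-in models are untouched, which mirrors the separation of flow-control code from hydrological code listed among the advantages.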
3. Models Coupling Architecture
We propose a general architecture for the hydrological model coupling, in which the selected communication technique is web services and the selected flow-control technique is workflow.
The driving criterion for choosing the integration and flow-control techniques is maintainability. In commercial software, the maintenance cost represents between 60% and 80% of the life-cycle cost. In scientific software, maintenance (or evolution) is even more intensive because concepts and ideas change constantly during ongoing research. One of the keys to maintainability is loose coupling, because it implies that modifying one component does not affect others, which reduces the complexity of evolving the software system. The communication technique that best fits this criterion is business integration (see Section 2.2). It could be implemented through an API or through services. An API would usually be more efficient, but it requires the client component to know the object structure of the host component, whereas an implementation in terms of services is more transparent, although it incurs slightly more overhead. These services will be web services because: 1) the models are remotely reachable through a network; 2) web services are a well-known and very mature kind of service; 3) they are a standard and well-defined way of interoperation; 4) the models are deployed on machines capable of communicating by SOAP on HTTP servers.
The flow-control technique that best fits the maintainability criterion is workflow, because it keeps coupling as low as possible (see Section 2.3). The proposed architecture supports the requirement that models do not communicate with each other in any predefined way. Each model publishes some functionalities through services that are independent of other models. Moreover, each service uses WSDL to publish the metadata that allow others to find and use it. In that way the architecture fits the SOA paradigm. Nevertheless, SOA alone is not sufficient for hydrological model coupling, because the publish/subscribe protocols that are typically used in SOA define an implicit order for the model coupling interactions. Figure 3 shows an example of a configuration in a publish/subscribe broker. It can be seen from Figure 3 that Model 3 will first consume Model 1’s service before it publishes its service to Model 4. Thus, the configuration implicitly defines the service sequence.
However, implicit design tends to obscure the operational logic and flow control among services and thus is not always adequate, especially for a complex model coupling system. For example, in Figure 3 the sequence from the subscription to the publishing of Model 3 (i.e., Arrow 2-Model 3-Arrow 3) is defined outside the flow definition. Implicit definition is also insufficient and error-prone when the problem is complex or the flow control changes constantly. Scientific workflow meets this requirement through explicit definition of the interaction flow.
The workflow component uses the functionality that the models have already published as services. It can be located on a separate machine, so that its performance is not affected by individual model runs. The workflow component controls the process and is responsible for using the models as required. It can use a model multiple times or establish loops with recurrent calls to models (timesteps). The models have no direct communication with each other; they only respond to requests made by the workflow. The models publish their services as web services. Our proposed architecture is illustrated in Figure 4.
Models located in different facilities can work together through remote connections. The interaction starts when the controlling machine starts a computation (see LOCATION 2 in Figure 4). The workflow residing at LOCATION 2 knows which activities should be performed and how. For example, this workflow can execute some activities and then determine that the next step is a computation offered by Model 1.
Figure 4. General architecture proposed.
At this point, the workflow will establish a connection with, for example, web service 2 (WS2) of Model 1. The workflow will call this web service and send the required parameters. Model 1 will then use the parameters received to execute hydrological computations, obtain results and return them to the workflow as its response.
Responses from Model 1 can be used for decision making, or as parameters to send to another model, say, Model 2. For example, assume the response from Model 1’s WS2 will be used as parameters for Model 2’s WS3. In this case, the workflow will establish a connection to web service 3 of Model 2. After receiving a response from Model 2, the workflow can finish its work or start a new iteration, calling the models again as many times as required.
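The interaction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `model1_ws2` and `model2_ws3` are local stand-ins for the remote web services WS2 and WS3, the parameter names (`rainfall`, `storage`, `runoff`) and the formulas are placeholders invented for the sketch, and `stop_above` is a hypothetical decision rule.

```python
def model1_ws2(params):
    """Stand-in for web service 2 (WS2) of Model 1; the formula is a placeholder."""
    return {"runoff": 0.5 * params["rainfall"] + 0.25 * params["storage"]}


def model2_ws3(params):
    """Stand-in for web service 3 (WS3) of Model 2; the formula is a placeholder."""
    return {"storage": params["storage"] + params["runoff"]}


def run_coupling(rainfall_series, storage=0.0, stop_above=float("inf")):
    """One iteration per timestep: call Model 1, pass its response to Model 2.

    The workflow is the only caller; the two models never contact each other.
    """
    history = []
    for rainfall in rainfall_series:
        response1 = model1_ws2({"rainfall": rainfall, "storage": storage})
        response2 = model2_ws3({"storage": storage, "runoff": response1["runoff"]})
        storage = response2["storage"]
        history.append(storage)
        if storage > stop_above:      # decision made from a model response
            break
    return history


full_run = run_coupling([10.0, 0.0, 4.0])
early_stop = run_coupling([10.0, 0.0, 4.0], stop_above=6.0)
```

Here `full_run` iterates over all three timesteps, while `early_stop` ends after the second one because the response from Model 2 exceeds the threshold.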
As the models and workflow illustrated in Figure 4 will be available through the Internet, they will become part of the cyber-infrastructure. We adopt the following simple definition of cyber-infrastructure: the set of all the services and resources available through a network to work in a scientific-collaborative way.
For example, as illustrated in Figure 4, for people working on Model 1, cyber-infrastructure is the cloud that helps them work with a workflow and the collaborating model—Model 2. On the other hand, for people working on Model 2, cyber-infrastructure is the cloud that helps them work with a workflow and the collaborating model—Model 1.
4. MoteWS: A Web Services Based Prototype System
To examine and test our proposed architecture, we started by implementing the communication part (the web services), developing an online system that publishes real-world field measurements from a wireless sensor network in near real-time.
4.1. Description of the Prototype System
The field measurements are collected through a wireless sensor network made up of devices that have sensors and transmitters (we call them motes for simplicity). Each wireless mote transmits all its observations to a data sink gateway called the net-bridge. There is one net-bridge for each field network in our two testbeds. The net-bridge, which connects to both the wireless base station and the Internet, is a gateway running the Linux operating system on wall power. In addition to collecting data from the mote network, the net-bridge can also publish network services (such as web services). Accordingly, a web service was deployed on the net-bridge to publish the observations collected from the field. This web service dispatches a tar file containing all the files collected from a given date.
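The packaging step of the net-bridge service can be sketched with Python's standard `tarfile` module. This is not the prototype's implementation; the function name `build_archive`, the in-memory file layout and the sample file names and dates are assumptions, and "from a given date" is read here as "collected on or after that date".

```python
import io
import tarfile
from datetime import date


def build_archive(files, since):
    """Return tar bytes holding every file dated on or after `since`.

    `files` maps a file name to a (collection_date, content_bytes) pair;
    this layout is an assumption made for the sketch.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in sorted(files):
            collected, content = files[name]
            if collected >= since:
                info = tarfile.TarInfo(name=name)
                info.size = len(content)
                archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


# Invented sample data, for illustration only.
files = {
    "obs_a.dat": (date(2010, 5, 1), b"1,2,3\n"),
    "obs_b.dat": (date(2010, 5, 3), b"4,5,6\n"),
}
payload = build_archive(files, since=date(2010, 5, 2))
```

The resulting `payload` is what such a service could return to a client, which would unpack it with any tar reader.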
The web service has also been deployed on a central data server at the hydrology lab at the University of Pittsburgh, which collects information from all the net-bridges. This central server runs client software that polls all of the net-bridges every hour. If new files are found, they are added to the data repository. The central server publishes a web service similar to that of the net-bridges, but in this case the user can send a parameter indicating the source (i.e., site ID) of the data to be retrieved. As a result, one can retrieve either consolidated data, using the web service provided at the central data server with a delay of at most one hour, or single-site data, using the web service deployed at each gateway with little delay. Because the field measurement data are published through web services, other applications, hydrological models, workflows or desktop programs can easily be integrated with these data sources.
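A minimal Python sketch of the central server's behavior follows. It is an illustration only: `poll_once`, `files_for_site`, the site IDs and the dictionary-based repository are hypothetical, and each net-bridge is represented by a local callable standing in for the web-service client. In practice `poll_once` would be triggered once per hour by a scheduler.

```python
def poll_once(netbridges, repository):
    """Fetch the files offered by every net-bridge and store the unseen ones.

    `netbridges` maps a site ID to a callable returning {file_name: content};
    the callable stands in for the web-service client. Returns the new keys.
    """
    added = []
    for site_id, fetch in netbridges.items():
        for name, content in fetch().items():
            key = (site_id, name)
            if key not in repository:
                repository[key] = content
                added.append(key)
    return added


def files_for_site(repository, site_id):
    """Mimic the central service's site-ID parameter: return one site's files."""
    return {name: c for (site, name), c in repository.items() if site == site_id}


repository = {}
netbridges = {
    "site1": lambda: {"a.dat": b"1"},
    "site2": lambda: {"b.dat": b"2"},
}
first = poll_once(netbridges, repository)
second = poll_once(netbridges, repository)   # nothing new on the second poll
```

Keying the repository by site ID and file name is one simple way to support both the consolidated view and the per-site retrieval described above.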
4.2. Design and Implementation
Figures 5 and 6 show schematic views of the sensor data collection and of its dissemination through web services, respectively. The motes have sensors that take measurements in the field and send the observations through the multi-hop wireless networks to the net-bridges. The net-bridges collect the observations and store them in files. Each net-bridge publishes a web service that dispatches these files on demand. The central server uses client software designed to access the web service published by the net-bridge and to retrieve the files. The central server also publishes a web service of its own that users can access to get the data from multiple testbed fields (e.g., two different sites at present). Because sensor data from multiple testbeds are gathered through such web services, the architecture and communications of the wireless mote networks themselves do not need to be modified.
Figure 5. A schema of information collection setup from two different remote testbeds to the University of Pittsburgh campus.
Figure 7 shows the structural view of the web services framework. Different kinds of clients access the web services through a framework that offers a unique portal at the central server. The framework uses part of the URL as a parameter to determine which web service will respond to a given request. These components (the web services) should implement some functions with given names and parameters, and can be connected to form the web services framework.
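The idea of selecting a web service from part of the URL can be sketched as follows. This is a simplified illustration, not the framework's actual dispatch code: the function `dispatch`, the URL layout (service name as the last path segment), the host `example.org` and the service name `motews` are all assumptions.

```python
from urllib.parse import urlparse


def dispatch(url, services):
    """Pick a service from the last URL path segment and call it."""
    path = urlparse(url).path
    name = path.rstrip("/").rsplit("/", 1)[-1]
    handler = services.get(name)
    if handler is None:
        return 404, "unknown service: " + name
    return 200, handler()


# Hypothetical registry: one entry per web service behind the single portal.
services = {"motews": lambda: "observations"}

ok = dispatch("http://example.org/services/motews", services)
missing = dispatch("http://example.org/services/other", services)
```

Because every service sits behind the same entry point, adding a service in this sketch only means registering one more name in `services`.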
In our prototype system, the framework chosen is Axis2C, and it provides (among others):
· Transport of the SOAP messages.
· Queuing messages in pipes.
· Locating and dispatching incoming messages to the service that will respond to them.
· Validating messages.
· Embeddable in C applications.
Figure 8 shows a modular view of web service. The first group of functionalities has the responsibility of