X-Hamiltonian Surface Broadcast Algorithm

Broadcast is one of the most important approach in distributed memory parallel computers that is used to find a routing approach from one source to all nodes in the mesh. Broadcasting is a data communication task in which corresponds to one-to-all communication. Routing schema is the approach used to determine the road that is used to send a message from a source node to destination nodes. In this paper, we propose an efficient algorithm for broadcasting on an all-port wormhole-routed 3D mesh with arbitrary size. Wormhole routing is a fundamental routing mechanism in modern parallel computers which is characterized with low communication latency. We show how to apply this approach to 3-D meshes. In wormhole, routing large network packets are broken into small pieces called FLITs (flow control digits). The destination address is kept in the first flit which is called the header flit and sets up the routing behavior for all subsequent flits associated with the packet. In this paper, we introduce an efficient algorithm, X-Hamiltonian Surface Broadcast (X-HSB) which uses broadcast communication facility with deadlock-free wormhole routing in general three dimensional networks. In this paper, the behaviors of this algorithm are compared to the previous results using simulation; our paradigm reduces broadcast latency and is simpler. The results presented in this paper indicate the advantage of our proposed algorithm.


Introduction
One of the important and popular architecture in both of scientific computation and high speed computing applications is multicomputer architecture.In a multicomputer network, processors often need to communicate with each other for various reasons, such as data exchange or event synchronization.Efficient communication has been recognized to be critical for high performance computing.Multicomputer network consists of hundreds or thousands of nodes connected in some fixed topology; each of them contains a microprocessor, local memory, and other supporting devices.Most of the popular direct network topologies fall in the general category of either n-dimensional meshes or k-ary n-cubes because of their regular topologies and simple routing.Among topologies, the most commonly used are meshes, tori, hypercubes and trees.The mesh-based topologies are the most regular simple architecture that is used in multicomputers.This is because that the implementation of mesh is very simple and easy to understand.These properties are necessary for every topology design [1].Recent interest in multicomputer systems is therefore concentrated on two-dimensional or three-dimensional mesh and torus networks.Such technology has been adopted by the Intel Touchstone DELTA [2], MIT J-machine [3], Intel Paragon [4] and [5], Caltech MOSAIC [6], Cray T3D and T3E [7] and [8].
Wormhole routing has been widely accepted as the predominant switching mechanism in current multicomputer systems.One of the efficient message routing algorithms of multicomputers is wormhole-switching.The wormhole switching is considered widely used in routing algorithms because of some reasons, firstly the buffering that is required in it is low buffering, allowing for efficient router implementation [9].Secondly, it characterizes with low communication latency [10].In wormhole routing, each message is divided into packets; the packets are serialized into a sequence of flits (flow control units) for transmission.A flit is the smallest unit of information that a channel can accept or refuse [11].The header flit of a message governs the route, and the remaining flits follow it in a pipeline fashion through the network.Only the header flit knows where the packet is going, and the remaining data flits must follow the header flit.If all the channels which the header is routed are busy, the header is blocked until one of those channels is freed; the flow control within the network blocks the trailing flits.Wormhole routing avoids using storage bandwidth in the nodes through which messages are routed.It also makes the message latency quite insensitive to the destination in the network.Deadlock is occurs when a set of messages is blocked forever in the network because each message in the set holds one or more resources needed by another message in this set.No communication can occur over the deadlocked channels until exceptional action is taken to break the deadlock [12] [13].Based on these features, many several routing algorithms have been proposed for wormhole communications networks [14]- [25].
The main facility of the wormhole routing is the communication latency, where it is independent of the distance between the source and destination nodes; it consists of three parts, start-up latency network latency, and blocking latency.The start-up latency is the time required to start a message, which involves operation system overheads.The network latency consists of channel propagation and router latencies, i.e., the elapsed time after the head of a message has entered the network at the source until the tail of the message emerges from the network at the destination.The blocking latency accounts for all delays associated with contention for routing resources among the various worms in the network [26].
One of the most fundamental communication operations is broadcast, in which the same message is delivered from a source node to all nodes in the network.Efficient broadcast communication is widely used.This is because it's useful in message-passing applications, and is also necessary in several other operations, such as replication and barrier synchronization [27], which are supported in data parallel languages.Many algorithms have been proposed for broadcast communication in wormhole-routed networks over the past few years [28]- [31].A broadcasting algorithm is applicable in diversified manners in several fields such as management of shared variables for distributed programming, image processing, data copying in database of large-scale network, and data collection in wireless sensor network (WSN), and for this, an effective broadcasting algorithm is necessary [32].
This paper uses a 3-D mesh topology with Bi-directional channels.A graph M(V, E) can be used to model a 3-D mesh topology where processor can be represented by each node in V(M) and communication channel can be represented by each edge in E(M).The mesh graph is formally defined below.Where m (rows) × n (columns) × r (layers) 3-D mesh comprises mnr nodes connected in a grid fashion.
Definition 1: An m × n × r 3-D mesh graph is a directed graph M(V, E), where the following conditions exist: Three kinds of ports can be distinguished in mesh topology.One-Port, Multi-Ports and All-Ports [33].This study use the All-Port router model where routers can send/receive multiple messages simultaneously to a neighboring node, and that node can simultaneously send/receive messages along all ejection and injection channels [34] and [35].
Broadcast problem can be applied by using a Hamilton path from source to destinations.In this paper we introduce the new broadcast algorithm based on the Hamiltonian model for 3-D mesh multicomputers.The rest of the paper is organized as follows.Section 2 presents the related works.Section 3 presents the Hamiltonian model to the 3-D mesh networks and introduce the new broadcast algorithm based on the Hamiltonian model.In Section 4 we examine and compare the performance of our proposed algorithm with other previous algorithms.Finally, Section 5 concludes this study.

Related Works
Interconnection networks are used in massively parallel computers to allow cooperation between several nodes (processing elements).An important advantage of Hamiltonian paths is to design a deadlock-free algorithm of wormhole-routed network.The basic function of an interconnection network is the routing algorithm.Routing Algorithm determines the path from a source node to a destination node in a particular network.In this section we produce two routing algorithm depends in Hamiltonian paths that are presented in [36].

3-DHB Algorithm
The 3-DHB algorithm exploits the features of Hamiltonian paths (using Humiliation label equation for 3-D mesh that is presented in [36] to implement broadcast in two message-passing steps.In 3-DHB algorithm, the network is divided into two subnetworks N U and N L by the source, where N U contains all channels where their routing directions are from the nodes labeled from low to high number and N L contains all channels where their routing directions are from the nodes labeled from high to low numbers [36].
In 3-DHB, the source node divides the destination set D into two subsets, D U and D L , where D U contain the destination nodes in N U and D L contain the destination nodes in N L .The source sends the message to destination nodes in D U using channels in N U subnetwork and to destination nodes in D L using channels in N L subnetwork.
The destination nodes in D U is sorted in ascending order by the source and the destination nodes in D L is sorted in descending order by the source, where the source using the L value as the key in both cases.Two messages were constructed and sent into tow disjoint subnetworks N U and N L , one containing D U as part of the header and the other containing D L as part of the header [36].
To study the performance of 3-DHB algorithm, consider the example shown in Figure 1 for a 4 × 4 × 4 mesh topology labeling using a Hamiltonian path.The source node with Hamilton labeled 25 which has an integer coordinate (1, 1, 1) will send a broadcast to all nodes in its 3-D mesh multicomputer.
The 3-DSPHB algorithm for All-Port 3-D mesh based on the Hamiltonian paths to implement broadcast in six message-passing steps.A major consequence of All-Port architecture is that the local processor may transmit (receive) multiple messages simultaneously.The difference between 3-DHB and 3-DSPHB routing algorithm is that how many message preparation can be operated at the source node.In the 3-DSPHB routing algorithm the destination sets D U and D L of the 3-DHB algorithm are further partitioned.The source divide the set D U into six subsets, The first one containing the all nodes whose x coordinates are greater than or equal the x coordinate of a source, the second one containing the all nodes whose x coordinates are smaller than the x coordinate of a source, the third one containing the all nodes whose y coordinates are greater than or equal the y coordinate of a source, the fourth one containing the all nodes whose y coordinates are smaller than the y coordinate of a source, the fifth one containing the all nodes whose z coordinates are greater than or equal the z coordinate of a source and the sixth one containing the all nodes whose z coordinates are smaller the z coordinate of a source.The same partitioned schema can be done for set D L [36].
To study the performance of 3-DSPHB algorithm, consider the example shown in Figure 2 for a 4 × 4 × 4 mesh topology labeling using a Hamiltonian path.The source node with Hamilton labeled 25 which has an integer coordinate (1, 1, 1) will send a broadcast to all nodes in its 3-D mesh multicomputer.

Hamiltonian Models to the 3-D Mesh Networks as Set of Surfaces
Hamiltonian-path routing is considered as one of the most efficient important approach for designing a  deadlock-free algorithm of wormhole-routed network [37] and [38].In this paper the routing algorithm is the Hamiltonian-path, at first, a Hamiltonian path is embedded onto target 3-D mesh network topology, and then messages are routed on the basis of the Hamiltonian path.
A 3-D mesh network can be divided into set of column surfaces.Each surface will be labeled with Hamilton model.In the following, we give the node labeling function ( ) for a 3-D mesh.For an m × n × r 3-D mesh, we can assignment a label function  to each surface (according to x dimension) of 3-D mesh which can be expressed in terms of the x-, y-and z-coordinates as follows: Each node in each surface can be assign a unique number using the above labeling function starts from 0, followed by 1, and so on until the last node labeled n * r − 1.
Using the Hamiltonian model labeling, each surface (set of column surface) can be partitioned into two subnetworks.The first one is called high-channel network (N U ) and contains all channels where their routing directions are from the nodes labeled from low to high number.The second one is called the low-channel network (N L ) and contains all channels where their routing directions are from the nodes labeled from high to low numbers.According to this partition schema, the network will be divided into two disjoint subnetworks and each subnetwork has its independent set of physical links in the network.
A Hamiltonian labeling strategy for 3 × 3 × 3 3-D mesh can be shown in Figure 3(a), where every node can be represented by its integer coordinate (x, y, z) and there are three surfaces of subnet works according to x coordinates.A high-channel subnetwork (N U ) can be shown in Figure 3(b) for each surface.A low-channel subnetwork (N L ) can be shown in Figure 3(c) for each surface.

The Proposed Algorithm: X-Hamiltonian Surface Broadcast (X-HSB)
In this paper our algorithm X-HSB algorithm is based on splitting the 3-D mesh network into set of surfaces.Each surface will be labeled with Hamilton model that presented in Section 3.1.In fact, if we have m × n × r 3-D mesh, then we have m surfaces of mesh.Formally, the sub-networks are described by the following expression: Suppose that the coordinate of the source node u 0 is represented by (x 0 , y 0 , z 0 ), and D represent the all nodes of 3-D mesh.The simple idea of this algorithm is as follow: Step 1: The source split the D into 2 subsets D S and D R , where D S contain the nodes which their xcoordinates are equal x 0 , D R contain the rest nodes which their x coordinates are not equal x 0 .Formally, the subsets are described by the following expression:  Step 4: Construct two messages, one containing nodes in D U as part of the header and the other containing nodes in D L as part of the header.The source sends the message to subset D U through subnetwork N U and sends the message to subset D L through subnetwork N L .
Step 5: The source divides D R into 2 subsets D RXG and D RXS , where D RXG contain the nodes which their x coordinates are greater than x 0 and D RXS contain the nodes which their x coordinates are smaller than x 0 .
Step 6: The source divides D RXG into set of subsets.Formally, the subsets of D RXG are described by the following expression: Step 7: Each above subset of D RXG represents a surface.The source (x 0 , y 0 , z 0 ) sends the message to a node, which is represented by its integer coordinate (x 0+1 , y 0 , z 0 ) which will be act as a source on its surface, to a node represented by its integer coordinate (x 0+2 , y 0 , z 0 ) which will be act as a source on its surface and so on to a node represented by its integer coordinate (x m−1 , y 0 , z 0 ) which will be act as a source on its surface.All the alternative sources will send the message to all nodes in their surfaces by repeating the above steps from step 2 to step 4.
Step 8: The source divides D RXS into set of subsets.Formally, the subsets of D RXS are described by the following expression: { } Step 9: Each above subset of D RXS represents a surface.The source (x 0 , y 0 , z 0 ) sends the message to a node, which is represented by its integer coordinate (x 0−1 , y 0 , z 0 ) which will be act as a source on its surface, to a node represented by its integer coordinate (x 0−2 , y 0 , z 0 ) which will be act as a source on its surface and so on to a node represented by its integer coordinate (0, y 0 , z 0 ) which will be act as a source on its surface.All the alternative sources will send the message to all nodes in their surfaces by repeating the above steps from step 2 to step 4.
Theorem 1: X-HSB algorithm is deadlock-free.Proof: At the source node, X-HSB algorithm divides the network into m disjoint subnetworks.This is obvious since . Then X-HSB algorithm is deadlock-free at eachsubnetwork.To complete our proof, we should prove that there are no dependencies in each subnetwork.Each subnetwork will be divided into two disjoint subnetworks N U and N L .In N U a message entering a node and always leaves on a node with label greater than label of entered node, while in N L a message entering a node and always leaves on a node with label lower than label of entered node, therefore, no cyclic dependency can exist among the channels.So X-HSB is deadlock-free.

Comparative Study
To study the performance of X-HSB algorithm, consider the example shown in Figure 4 for 4 × 4 × 4 mesh topology labeling using a Hamiltonian path for each surface.The source node with Hamilton labeled 5 at surface 2 which has an integer coordinate (1, 1, 2) will send a broadcast to all nodes in its 3-D mesh multicomputer.
The routing pattern of X-HSB algorithm is shown with bold lines as shown in Figure 4 for each surface, where first surface have x coordinate = 0, second surface have x coordinate = 1, third surface have x coordinate = 2 and fourth surface have x coordinate = 3.

Simulation
In order to compare the performance of our proposed broadcast routing algorithms, the simulation program used to model broadcast communication in 3-D mesh networks is written in VC++ and uses an event-driven simulation package, CSIM [39].CSIM is used a multiple processes to execute testing in parallel fashion and provides a very realistic environment which can be used in modular simulation programs.The simulation program for broadcast communication is part of a larger simulator, called MultiSim [40].MultiSim was used to simulate broadcast operations in 3D mesh of different sizes.
To study the performance of our algorithm with broadcast schema we random a large number of different source nodes with different size of messages and different injection rate.All simulations were performed for a 6 × 6 × 6 3-D mesh.The software that overhead for buffers allocating, messages coping, router initializing, etc. is represented the startup latency β.The message is divided into flits, so the number of these flits in a message denotes the message length.
In order to evaluate our proposed algorithm X-HSB, we compare its performance with 3-DHB and 3-DSPHB algorithms that is presented in [36].The average broadcast latencies of three algorithms across different message lengths (100 bytes to 2000 bytes) and startup latency β is set to 100 microseconds are shown in Figure 5.The advantage of the X-HSB algorithm is peter performance than the advantage of the 3-DHB and 3-DSPHB algorithms.In X-HSB the source node divided the net into m subnets (no. of columns) as shown in Section 3.1 and  each subnets divided their destinations into two subsets (use Hamilton-path for each surface), so the path from source node to destination nodes in each surface is less than the path in the 3-DHB and 3-DSPHB routing algo-rithms; this advantage can be shown in termed of generated traffic.Since communication latency in wormhole-routed broadcast systems is dependent on the length of the message, average latencies of X-HSB algorithm is less sensitive as the length of the message as shown in Figure 5.The performance depends strongly on the start-up time that required for each algorithm.Because 3-DSPHB is all-ports it takes six startup latency times while 3-DHB takes only two start up latency times, so average latencies of 3-DSPHB algorithm linearly increase as the length of the message.However, the disadvantage of 3-DSPHB algorithm increases with the message lengths.
Figure 6 plot the network latency obtained by the three algorithms versus various network loads.The startup latency β is set to 100 microseconds.We have fixed the message length as 300 flits.All three algorithms exhibit good performance at low loads.The X-HSB algorithms, however, are less sensitive to increased load than the 3-DHB and 3-DSPHB algorithms.This result occurs because in the X-HSB algorithm the source node divide the destination nodes into 2 subset for each surface (no. of coulombs), in 3-DHB the source divide the destination nodes into two subsets while in 3-DSPHB divide destination nodes into six subsets, so when any flits from another broadcast message is send and the recent broadcast transmission is not complete, then these flits will be blocked until the recent broadcast is complete.In fact, if the load is very high, the 3-DSPHB may decrease system throughput and increase message latency.The X-HSB routing exhibits lower latency than 3-DHB and 3-DSPHB routing because, as shown earlier, paths tend to be shortest, generating less traffic.Hence, the network will not saturate as quickly.The disadvantage of 3-DHB and 3-DSPHB routing algorithms increases when loads on net is very high.

Conclusions
Broadcast, as one of the most fundamental collective communication operations, is highly demanded in massive parallel computer algorithms like parallel graph algorithms, fast Fourier transform, and barrier synchronization.Broadcast is frequently invoked as one of the most essential communication operations.
In this paper, we propose a widely applicable solution for the broadcast problem in a wormhole routed mesh.We propose a new broadcasting algorithm (X-HSB) which guarantees the delivery (to all processors connected to the source) in the 3-D meshes.Our algorithms are proposed with variants that can be easily expressed and understood.We have also shown that our proposed algorithm is deadlock-free.The performance study shows that the proposed algorithm X-HSB is the best performance than the existing algorithms 3-DHB and 3-DSPHB.

Figure 1 .
Figure 1.The routing patterns of 3-DHB algorithm, bold lines of N U subnetwork, and dashed lines of N L subnetwork.

2 :
The source node divides D S into two subsets D U and D L as follows:

3 :
The source sort the nodes in D U in ascending order and the destination nodes in D L in descending order as the label key.

Figure 4 .
Figure 4.The routing patterns of X-HSB algorithm.

Figure 5 .
Figure 5. Performance comparison of X-HSB algorithm with 3-DHB and 3-DSPHB algorithms under different message size.