The efficient construction of contours of a radio propagation map is crucial to using such maps in a number of real-time communication and network applications. In this research work, we first propose an adaptive region construction (ARC) technique capable of constructing contours of a radio propagation map at different resolutions. Next, the process of implementing the ARC technique for real-time execution on a GPU is presented. The drawbacks of the first implementation, which uses only the global memory, are discussed, and optimization techniques to improve the performance are described and implemented. Simulations are performed with varying sizes of radio propagation maps, and the suitability of the ARC technique for real-time operation is demonstrated. A speedup of 25× is achieved with the shared memory version of the GPU implementation compared to the sequential CPU implementation. Also, the contour constructed using the ARC technique is compared to that constructed using the convex hull approach, demonstrating the higher accuracy of the contour from the ARC technique.

A mobile ad-hoc network (MANET) of nodes (equipped with sensors) can be deployed rapidly in an environment to provide a communication infrastructure for applications such as environmental monitoring, rescue, and defense operations. However, the successful deployment of a MANET depends on the ability of neighboring nodes to establish wireless communication links and provide connectivity across the network. Establishing a communication link depends on the availability of radio spectrum, transmission power, interference from neighboring nodes, and the effect of terrain on radio propagation. Depending on the deployment environment, the effect of terrain on radio propagation can be modeled with free space (omni-directional) or “two-ray” propagation models if the terrain has no hills, buildings, or foliage, so that it affects the propagation of radio signals only minimally. However, in a realistic environment where MANETs are deployed, the terrain consists of hills, buildings, foliage, etc., causing reflections, diffraction, blocking, and unreliable path loss estimates, which render free space or two-ray propagation models ineffective in modeling the effect of terrain on radio propagation. Hence, more complex propagation modeling techniques like the Walfisch-Ikegami model (WIM) and 3D ray tracing have to be used to determine the effect of terrain on radio propagation. The effect of terrain on radio propagation determined by WIM or 3D ray tracing is quantified as a radio propagation map.

Radio propagation maps specify the path loss (or received signal strength) at various distances and directions from the transmitter taking into account the effect of the terrain. The path loss [

The application of radio propagation maps in localizing nodes, and antenna beam forming [

In the research work [

The success rate of the localization algorithm using convex hulls to represent the contour of a radio propagation map of a small geographical area was 80% with an accuracy of 1 m. However, as the size of the geographical area increased, the performance of the algorithm deteriorated for two reasons. First, the convex hulls did not capture the contour of the radio propagation map accurately. Second, the computational time for constructing a convex hull of a high resolution radio propagation map of a large geographical area (a suburb of a city) increased nonlinearly, rendering the localization algorithm unsuitable for real-time application.

Hence, in this work, we propose an adaptive region construction technique that captures the contour of a radio propagation map of a large geographical area accurately and is suitable for real-time implementation. We propose the use of general purpose graphics processing unit (GPGPU) based adaptive region construction (ARC) to construct multiple convex hulls of a radio propagation map of a large geographical area and to combine these convex hulls into regions that represent the contour of the radio propagation map accurately.

This paper is organized as follows: In Section 2, a brief discussion of the localization algorithm [

Miles et al. envisioned a single moving beacon mounted on a vehicle capable of moving and broadcasting its position periodically to nodes in its transmission range based on radio propagation models [

Transmission and comparison of these maps require large bandwidth and computational capability. Also, it is difficult to implement an intelligent localization algorithm based on the shapes of the range maps. The above drawbacks are addressed by using convex hulls to represent the irregular radio propagation shapes with regular geometric shapes [

Even though the use of convex hulls to represent radio propagation maps had reduced the storage requirement and transmission bandwidth, it introduced significant errors in localization and increased the computational burden. In

2.5 km by 2.5 km. The difference in area due to the artifact introduced between the contour of the radio propagation map and the corresponding convex hull was approximately 0.4 km^{2}, which is about 1/15^{th} of the original area. This difference in the area can be even more significant when considering radio propagation maps of large urban areas. Along with the introduction of artifacts, the computation of convex hulls of very large geographic areas is a computationally intensive problem. Even though efficient algorithms [

In this research work, we propose a new method to assist the computation of contours of a radio propagation map, known as the Adaptive Region Construction (ARC) technique. First, the ARC technique reduces the superfluous area introduced by the use of convex hulls to represent radio propagation maps and thereby reduces the localization error. Second, the ARC technique is implemented using general-purpose computing on graphics processing units (GPGPU) to reduce the computational time and make the use of radio propagation maps in real-time applications feasible.

Curve simplification algorithms like the Ramer-Douglas-Peucker algorithm [

Many computer vision applications make use of convex hulls to approximate blobs and shapes in images. The authors of [

The authors of [

The authors of [

Cheng et al. [

Liu et al. [

Most of the research work discussed above requires the contour and the points internal to the contour to represent the radio propagation maps. However, localization and other applications require accurate representations of the radio propagation map that reduce the storage and bandwidth requirements and are suitable for real-time use. Hence, in our work, we present the adaptive region construction (ARC) technique, which aids the construction of an accurate contour representing the shape of the radio propagation map. The ARC technique described in this paper can define the contour of a given radio propagation map more accurately than a convex hull. Even though our algorithm is developed for approximating radio propagation maps, it can improve the accuracy of many applications that make use of sample approximations and demand real-time or near real-time performance. Our algorithm has low computational complexity and is parallelizable, which makes it suited for real-time applications.

To review, by definition, a set $C$ is convex [ if, for any $x_{1}, x_{2} \in C$ and any $\theta$ where $0 \le \theta \le 1$, the following condition holds: $\theta x_{1} + (1 - \theta) x_{2} \in C$.

In simple terms, this means that a set is convex if the direct path between any two points in the set is entirely included in the set.
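The definition above can be checked numerically on simple sets. The following is a minimal sketch, not part of the original work: the membership functions `in_unit_disk` and `in_annulus` and the sampling helper `segment_stays_inside` are hypothetical names introduced only for illustration, and sampling a finite number of values of θ can only suggest convexity, not prove it.

```python
import math

def convex_combination(x1, x2, theta):
    """Return theta*x1 + (1 - theta)*x2 for two 2-D points."""
    return (theta * x1[0] + (1 - theta) * x2[0],
            theta * x1[1] + (1 - theta) * x2[1])

def in_unit_disk(p):
    """Membership test for a convex set: the closed unit disk."""
    return math.hypot(p[0], p[1]) <= 1.0 + 1e-12

def in_annulus(p):
    """Membership test for a non-convex set: a ring with inner radius 0.5."""
    return 0.5 <= math.hypot(p[0], p[1]) <= 1.0 + 1e-12

def segment_stays_inside(member, x1, x2, steps=100):
    """Sample theta in [0, 1] and test every convex combination of x1 and x2."""
    return all(member(convex_combination(x1, x2, k / steps))
               for k in range(steps + 1))

disk_ok = segment_stays_inside(in_unit_disk, (-1.0, 0.0), (1.0, 0.0))
ring_ok = segment_stays_inside(in_annulus, (-1.0, 0.0), (1.0, 0.0))
```

For the two boundary points (-1, 0) and (1, 0), the direct path stays inside the disk but crosses the hole of the ring, so the disk passes the sampled test and the ring fails it.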

convex hull of the set shown in

By constructing a convex hull of a range map, the storage and transmission bandwidth requirements can be greatly reduced. This results from the fact that only a small number of boundary points are required to represent the convex hull, which approximates the actual radio propagation map. This will serve as a lossy compression technique for the localization method.

Another benefit of the convex hull is that it can be used to make intelligent movement decisions more easily as the computation of the intersecting area only requires the use of the boundary points of the intersecting convex hulls instead of the entire radio propagation maps.

The Andrew’s monotone chain convex hull algorithm [

The upper hull is computed in a similar fashion, and the two hull sets are joined to find the final convex hull. Essentially, the algorithm works by comparing points to lines formed between previous points, moving from left to right to make the upper hull and then from right to left to make the lower hull. The algorithm decides which point belongs in the hull by computing the curl between two vectors: the vector from the second-to-last point in the hull to the previously selected point, and the vector from the second-to-last point in the hull to the current point. The curl of $P_{\mathrm{minmax},1}$ and $P_{\mathrm{minmax},2}$, computed from Equation (1), will result in a positive number, indicating that the point $P_{2}$ lies to the relative interior of the line between $P_{\mathrm{minmax}}$ and $P_{2}$.

$\mathrm{Crl} = P_{\mathrm{minmax},1} \times P_{\mathrm{minmax},2}$ (1)

In order to satisfy Equation (1), the points included in the hull must be located to the relative exterior of all points included in the hull, and in line with all points included in the hull. In the illustrated case, $P_{1}$ will be discarded from the hull and replaced by $P_{2}$. The algorithm then proceeds by checking points to the right until it reaches the right-most point, after which it moves back to the left, computing the lower hull in a similar fashion.
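The procedure described above can be sketched sequentially as follows. This is a textbook formulation of Andrew's monotone chain, not the authors' code: the function names are hypothetical, the sign convention of `cross` (positive for a counter-clockwise turn) is an assumption that may differ from the convention behind Equation (1), and the sketch builds the lower chain first and the upper chain second, which yields the same hull as the order described in the text.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def half_hull(points):
    """Build one chain of the hull from points visited in sorted order."""
    hull = []
    for p in points:
        # Discard the last point while it does not make a strict left turn.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def monotone_chain(points):
    """Return the convex hull vertices in counter-clockwise order."""
    pts = sorted(set(points))          # sort by x, then by y
    if len(pts) <= 2:
        return pts
    lower = half_hull(pts)             # left to right
    upper = half_hull(reversed(pts))   # right to left
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]

square_with_extras = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = monotone_chain(square_with_extras)
```

In this example the interior point (1, 1) and the collinear edge point (1, 0) are both discarded, leaving only the four corners of the square.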

The divide and conquer approach was developed by [

To merge the convex hulls, common tangents are constructed between two consecutive convex hulls and the convex hulls are merged hierarchically. In

This section describes the process of adaptive region construction (ARC). This approach is developed to represent radio propagation characteristics, such as signal strength at a spatial location, in an efficient way. The procedure described here combines the ideas from Andrew’s monotone chain convex hull algorithm [

map is exemplified in

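$O\left( \underbrace{\frac{n}{m} \log\left(\frac{n}{m}\right)}_{\text{Andrew's monotone chain}} + \underbrace{\frac{n}{m} \log\left(\frac{n}{m}\right)}_{\text{Divide \& Conquer}} \right) = O\left( 2\,\frac{n}{m} \log\left(\frac{n}{m}\right) \right),$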

where n is the total number of points in the dataset, m is the number of processes or threads that can execute simultaneously, and n / m ≥ 3 as at least three points are required to compute a convex hull. This shows that having many processes running in parallel reduces the computational complexity of the algorithm.
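The first half of this cost, computing one convex hull per group of n/m points, can be illustrated with a sequential sketch. It is only an approximation of the idea under stated assumptions: the names `intermediate_hulls` and `polygon_area` are hypothetical, each list slice stands in for the work of one thread, the points are split into contiguous groups after sorting by x, and the tangent-based merge step of the technique is not reproduced.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def chain(seq):
        out = []
        for p in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out
    return chain(pts)[:-1] + chain(reversed(pts))[:-1]

def polygon_area(poly):
    """Shoelace formula; degenerate polygons (fewer than 3 vertices) give 0."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def intermediate_hulls(points, m):
    """Split the x-sorted points into m contiguous groups and hull each group."""
    pts = sorted(set(points))
    size = -(-len(pts) // m)           # ceiling of n / m
    if size < 3:
        raise ValueError("each group needs at least three points (n / m >= 3)")
    return [convex_hull(pts[i:i + size]) for i in range(0, len(pts), size)]

# Hypothetical L-shaped point set: a tall block next to a short block.
pts = [(x, y) for x in range(10) for y in range(10) if x < 5 or y < 5]
full_area = polygon_area(convex_hull(pts))
strip_area = sum(polygon_area(h) for h in intermediate_hulls(pts, 3))
```

With a single group the result is the ordinary convex hull. With three groups, the combined area of the intermediate hulls is smaller than the area of the single hull, because the empty corner of the L shape is no longer covered.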

Heterogeneous computing is the approach of using accelerators/co-processors in conjunction with Central Processing Units (CPUs) to solve computationally intensive problems. Accelerators can be vector processors or many-core processors like graphics processing units (GPUs) and Intel Xeon Phis that improve the performance of applications by utilizing parallelism. GPUs are specialized hardware designed to handle the intensive operation of rendering image frames for output to a display device. With the emergence of programmable shaders, researchers started using GPUs to solve problems involving matrices and vectors to achieve performance improvement by making use of parallelism. When GPUs are used for computations in non-graphics related problems, it is known as general purpose GPU (GPGPU) computing. Initial efforts of programming GPUs involved refactoring the problems to use graphics primitives provided by the graphics application programming interfaces. NVIDIA’s Compute Unified Device Architecture (CUDA) [

The generalized hardware hierarchy in NVIDIA GPUs consists of multiple arithmetic and logic units (ALUs), and they are called CUDA cores as shown in the right-half of

as shown in the left half of

GPU memory can be classified into 3 categories namely the registers, shared memory, and global memory as shown in

NVIDIA GPUs follow the single program, multiple threads (SPMT) execution model of parallel computing. This means a group of threads executes the same set of instructions in lock-step, though conditional branches in the algorithm can break the lock-step execution of instructions, contributing to an increase in computational time. The SMs in NVIDIA GPUs always execute instructions with a granularity of 32 threads known as a warp. An SM has multiple warp schedulers that allocate hardware resources to each thread/warp and schedule the concurrent execution of multiple warps based on the shared resources requested per thread.

The CUDA programming model can be exploited to implement both data level and task level parallelism in the implementation of ARC. The given data is divided into smaller segments and Andrew’s monotone chain convex hull algorithm is used on individual segments to construct intermediate convex hulls. Each CUDA thread operates on a segment of data and computes one convex hull. CUDA has the capability to spawn a large number of threads to compute several convex hulls in parallel. Once the intermediate convex hulls are constructed by individual threads, each thread next considers two consecutive convex hulls at a time and constructs common tangents to merge the two hulls. This process is shown in

tations on the upper and lower half. The CUDA streams approach is used to compute the upper and lower hull in parallel and exploit task level parallelism.

An initial version of the algorithm utilizing both data level and task level parallelism was implemented on an NVIDIA Tesla K40c accelerator and is henceforth referred to as the naïve version. For the naïve version, the number of points processed by each thread was fixed at 4, and the kernel execution time of the naïve version is shown in

The NVIDIA Visual Profiler [

of data are not perfectly overlapped. This is mainly due to the GPU being stalled as it waits for all the data required by the kernel to be transferred before starting the computations.

Furthermore, the profiler also identifies additional performance bottlenecks which are summarized below:

・ Low warp execution efficiency due to divergent branches

The profiler indicates low warp execution efficiency for the kernel functions, signifying inefficient use of the GPU for computation. The compute resources are best utilized when all the threads in a warp are active. The algorithm is implemented with several control statements that result in branching, and the profiler reports 33.2% divergence in the kernel function that computes intermediate convex hulls and 93% divergence in the kernel function that merges the hulls. The number of active threads in an SPMD execution model can be improved by having fewer divergent branches executing different instructions within the same warp.

・ Global memory alignment and access pattern

The profiler identifies inefficient use of memory bandwidth due to misaligned global memory access patterns. As the instructions are issued per warp in an SPMD execution model, 32 threads in a warp cooperatively request a single memory access, which is serviced by one or more memory transactions. Unaligned and non-coalesced memory access due to warp divergence or the pattern of memory addresses requested by each thread can result in inefficient memory accesses. For uncached global memory accesses, the data always flows through the L2 cache, and it performs four 32-byte transactions in a single memory cycle. In ARC, redundant loads of data occur if the threads in a warp access data points such that N mod (128) ≠ 0, where N is the total number of data points accessed by the threads in a warp as shown in

・ L2 cache access latency

The profiler records 2.7 million global memory loads performed at a rate of 155.852 GB/s and 5.3 million reads from the L2 cache. The L2 cache reads are higher because the algorithm reuses spatially adjacent data in computations, benefiting from both temporal and spatial locality of data. For example, the arrays that store the sizes of the intermediate convex hulls and the convex hulls themselves are accessed repeatedly within the same kernel function and are therefore cached. However, the L2 cache, located outside the SMs, has a significant memory access latency of 100 clock cycles, and this latency can be reduced by moving data that is reused to a cache closer to the SMs. The cache closer to the SMs that can be programmatically controlled in the GPUs is known as the shared memory, which has a latency of 12 to 32 clock cycles.

Kernel functions | CGMA |
---|---|
lower Hull On GPU | 1/2 |
upper Hull On GPU | 1/2 |
merge Lower Hull | 8/15 |
merge Upper Hull | 8/15 |

GPUs use GDDR5 memory, which is a high bandwidth memory but has latency [

$\mathrm{CGMA} = \dfrac{\text{Number of floating point operations}}{\text{Number of global memory accesses}}$. (2)

If the CGMA is significantly greater than 1, the GPU spends more time performing computations than fetching data from memory. Problems of this type are called compute bound problems. On the other hand, if the CGMA is less than or close to 1, the problem is memory bound, indicating that the GPU spends most of the time fetching data from memory rather than computing.
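Equation (2) can be evaluated for the CGMA values listed in the table. The sketch below is illustrative only. The global memory access counts (16 for the hull kernels and 30 for the merge kernels) are the ones quoted for the naïve version in this paper, while the floating point operation counts of 8 and 16 are assumptions inferred from the tabulated ratios, not figures reported by the authors. The function names and the single cut-off at 1 are also hypothetical simplifications of the "significantly greater than 1" criterion.

```python
from fractions import Fraction

def cgma(flops, global_accesses):
    """Compute-to-global-memory-access ratio from Equation (2)."""
    return Fraction(flops, global_accesses)

def classify(ratio, threshold=1):
    """Simplified rule: above the threshold is compute bound, otherwise memory bound."""
    return "compute bound" if ratio > threshold else "memory bound"

# Assumed operation counts (8 and 16) chosen to match the tabulated ratios.
hull_kernel = cgma(8, 16)     # kernels that build the lower or upper hull
merge_kernel = cgma(16, 30)   # kernels that merge consecutive hulls

hull_kind = classify(hull_kernel)
merge_kind = classify(merge_kernel)
```

Both ratios are below 1, so under this simplified rule both kernels fall in the memory bound category.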

In order to improve the performance, we have to increase the CGMA of our implementation. Considering Equation (2), we can either increase the numerator or decrease the denominator. Increasing the numerator is not a feasible option because increasing the number of floating point operations amounts to artificially increasing the computational complexity of the existing algorithm. Therefore, we consider the second option, which is to decrease the denominator. This can be done by reducing the number of global memory accesses, specifically multiple accesses to the same data in either the global memory or the L2 cache. We use shared memory, which is a user controlled cache, to store chunks of data from global memory and then perform the computations on the data in the shared memory. This reduces the memory access latency caused by multiple accesses to the same data in global memory and the L2 cache.

The profiler analysis of the naïve version along with the CGMA computations provides insights about the possible approaches that can improve performance. This section discusses the various optimization approaches used to improve the performance of the naïve version.

To reduce the L2 cache access latency by reusing on-chip data, and to reduce the global memory bandwidth required by the kernels, we make use of shared memory. Assuming each thread performs only one iteration of the algorithm, the kernel function that computes convex hulls of one half of the given set of points has to access the global memory 16 times, and the kernel function that combines two consecutive convex hulls has to access the global memory 30 times in the naïve version. The shared memory latency, being 12 to 32 cycles, is about 50 times lower than the uncached global memory latency [

In the kernel function that computes either the upper half or lower half of intermediate convex hulls, the data seen by each thread block is loaded into the shared memory. Each thread computes convex hulls by considering a small number of elements. The number of elements processed by each thread is calculated as the ratio of the total number of points to the total number of threads. Copying the data from the global memory to the shared memory and performing computations using the copied data on the shared memory is shown in Step 1(a) and 1(b) of

convex hull from the next thread block is loaded into the shared memory (

We also use shared memory as a scratchpad memory to store the size of intermediate convex hulls and also enabled L1 caching (16 KB) along with the shared memory (48 KB) to cache global memory transactions.

While loading the data into shared memory for combining two consecutive convex hulls, each thread loads one convex hull into the shared memory. However, the threads at the end of each thread block (except for the last block) must load two convex hulls, one at the end of the thread block and the other from the beginning of the next thread block. This can be easily achieved by using simple control statements on a traditional CPU based computing system. CPUs have complex hardware with advanced branch prediction mechanisms to implement control statements. On a CPU, there are pipelines for each program flow of the control statement. If the predicted branch is false, a CPU can quickly switch to the other pipeline and continue with the execution flow, avoiding any significant performance penalty.

On the other hand, GPUs are simple devices with no branch prediction mechanisms, requiring all 32 threads in a warp to execute in a synchronous fashion. If different threads in a warp execute different instructions, the GPU flushes the execution pipeline each time to load new instructions, resulting in the sequential execution of each branch of the control statement. Also, since all threads in a warp execute in parallel, some of the threads in a warp will be idle and will become active during the upcoming sequence that will make the previously executing threads in that warp idle as shown in

To avoid warp divergence, we loaded both the flow paths of the control statements into the same branch by making use of multiple if statements instead of if-else chains as exemplified in

While accessing global memory, the data has to pass through the L2 cache by default, and four 32-byte transactions are performed to fetch 128 bytes of single precision data for the threads in a warp. On enabling the L1 cache, a single 128-byte transaction request is used to load single precision data for a warp. In other words, NVIDIA GPUs have an L1 cache line granularity of 128 bytes and an L2 cache line granularity of 32 bytes. Memory fetches from the global memory are a major performance bottleneck, and it is necessary to keep the number of load transactions to a minimum. One way to do so is to load only the data required by a warp and avoid redundant data loads.
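The effect of alignment on the number of load transactions can be illustrated with simple arithmetic. The sketch below is a simplified model, not a description of the actual hardware behaviour: it assumes that a warp requests one contiguous range of bytes and that every cache line touched by that range costs one full transaction. The function names are hypothetical.

```python
def transactions(start_byte, num_bytes, line_bytes):
    """Number of cache lines of size line_bytes touched by a contiguous request."""
    first_line = start_byte // line_bytes
    last_line = (start_byte + num_bytes - 1) // line_bytes
    return last_line - first_line + 1

def redundant_bytes(start_byte, num_bytes, line_bytes):
    """Bytes fetched beyond what was requested under this simplified model."""
    return transactions(start_byte, num_bytes, line_bytes) * line_bytes - num_bytes

WARP_REQUEST = 32 * 4   # 32 threads, one 4-byte single precision value each

aligned_l1 = transactions(0, WARP_REQUEST, 128)      # 128-byte granularity
aligned_l2 = transactions(0, WARP_REQUEST, 32)       # 32-byte granularity
shifted_l1 = transactions(4, WARP_REQUEST, 128)      # request offset by one float
shifted_l2 = transactions(4, WARP_REQUEST, 32)
wasted_l1 = redundant_bytes(4, WARP_REQUEST, 128)
wasted_l2 = redundant_bytes(4, WARP_REQUEST, 32)
```

Under this model, an aligned 128-byte request needs one 128-byte transaction or four 32-byte transactions. Shifting the same request by a single 4-byte value raises this to two and five transactions respectively, fetching 128 and 32 redundant bytes.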

We adjust the number of points seen by each thread to construct intermediate convex hulls such that the data requested by a warp is a multiple of cache line granularity depending on the problem size, thereby minimizing redundant loads of data. NVIDIA also reports [

The transfer of data from the host to the device takes place over the PCIe bus. Even though it is not possible to increase the speed of data transfer due to hardware limitations, it is possible to reduce the time that the GPU spends waiting for data. By default, data is allocated in CPU memory as pageable memory. Pageable memory can be swapped out to secondary storage by the operating system to give the illusion of more main memory than is physically available. Since the GPU does not have control over the paging operation, it takes more time for the data to be transferred from pageable memory to GPU memory. To decrease the data transfer time from CPU memory to GPU memory, we used pinned memory on the CPU. Pinned memory, or page-locked memory, is a non-swappable allocation in the CPU random access memory (RAM) that the operating system cannot swap out to secondary storage. This allows data transfer between the CPU and the GPU over the PCIe bus at a higher bandwidth.

We implemented the ARC technique on an NVIDIA GPU using the CUDA C programming model. The hardware platform consists of an Intel Xeon E5-2620-0 (Sandy Bridge) processor for implementing the sequential Andrew’s monotone chain convex hull algorithm and an NVIDIA Tesla K40c for implementing the ARC technique.

Sets of random points with a normal distribution were generated to test and compare the optimized implementation of the ARC technique with the sequential Andrew’s monotone chain convex hull algorithm.
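A test set of this kind and a sequential baseline timing can be sketched as follows. This is an assumed setup for illustration only: the seed, the mean and standard deviation, and the helper names `normal_points` and `time_sequential` are hypothetical, and the sketch does not reproduce the data sets or the measurement procedure used in the experiments.

```python
import random
import time

def normal_points(n, seed=0, mu=0.0, sigma=1.0):
    """Generate n reproducible 2-D points with normally distributed coordinates."""
    rng = random.Random(seed)
    return [(rng.gauss(mu, sigma), rng.gauss(mu, sigma)) for _ in range(n)]

def cross(o, a, b):
    """z-component of (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def monotone_chain(points):
    """Sequential Andrew's monotone chain convex hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def chain(seq):
        out = []
        for p in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out
    return chain(pts)[:-1] + chain(reversed(pts))[:-1]

def time_sequential(n, seed=0):
    """Return (elapsed seconds, hull) for one sequential run on n points."""
    pts = normal_points(n, seed)
    start = time.perf_counter()
    hull = monotone_chain(pts)
    return time.perf_counter() - start, hull

elapsed, hull = time_sequential(1000, seed=1)
```

Fixing the seed keeps the point set identical across runs, so timings from different implementations can be compared on the same input.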

significant improvement in performance. In

version in addition to the use of shared memory. Even though an overall 9.3× speedup has been achieved, the speedup remains constant with increasing number of points as depicted in

The second goal of using the ARC technique was to eliminate the artifacts present in the contour of a radio propagation map determined using the convex hull approach. Given a set of points, the ARC technique can construct either a convex hull or a non-convex set of points that represents the contour of a radio propagation map more accurately. If the ARC technique is forced to use a single thread, i.e., a sequential construct, the set of points obtained using ARC will match the convex hull. However, by varying the number of threads, the result can be a non-convex hull with varying levels of granularity. The resulting set of points obtained using ARC is selected by computing a number of intermediate convex hulls to fit the given set of points. These intermediate convex hulls are merged consecutively in order to obtain the resulting set. In other words, the number of intermediate convex hulls constructed represents the “resolution” or detail with which the radio propagation map is approximated. In our implementation, since each thread constructs one intermediate convex hull, the resolution of the approximation depends on the number of threads. Decreasing the number of threads decreases the number of intermediate convex hulls and degrades the application performance as the load handled by each thread increases. ARC does not yield the contour of a radio propagation map directly; its output also includes points inside the contour, which are eliminated using simple techniques [

The technique of adaptive region construction is a low complexity approach that can represent a given contour with varying degrees of detail. It provides the capability to construct the contour of a radio propagation map efficiently. The implementation of the adaptive region construction technique on a GPU using the CUDA programming model has been demonstrated. The GPU implementation provides good application performance (speedup) for high resolution representations of contours but is not suitable for low resolution representations. By applying optimization techniques to the naïve version, a 21× improvement in computational performance for large data sets was achieved. As most of the applications that use radio propagation maps benefit from a detailed representation of the maps, the ARC technique fulfills the need for a fast algorithm. The ARC technique is not only suitable for real-time operation but also avoids the artifacts found in contours determined using the convex hull approach.

In addition to using the ARC technique for determining the contour of a radio propagation map, it is also possible to approximate other spatial data. Using the ARC, multi-resolution representation of the spatial data is possible. The multi-resolution representation of large spatial data sets allows improved processing time and lower storage requirements.

As mentioned previously, the ARC technique operates inefficiently on low resolution radio propagation maps. Also, at high resolution, special attention has to be paid to the memory transfers between the CPU and the GPU. However, with newer NVIDIA GPUs equipped with NVLink technology, the latency due to memory transfers is significantly reduced.

Ramakrishnaiah, V.B., Muknahallipatna, S.S. and Kubichek, R.F. (2017) Adaptive Region Construction for Efficient Use of Radio Propagation Maps. Journal of Computer and Communications, 5, 21-51. https://doi.org/10.4236/jcc.2017.58003