Hardware Design of Moving Object Detection on Reconfigurable System

Moving object detection including background subtraction and morphological processing is a critical research topic for video surveillance because of its high computational loading and power consumption. This paper proposes a hardware design to accelerate the computation of background subtraction with low power consumption. A real-time background subtraction method is designed with a frame-buffer scheme and function partition to improve throughput, and implemented using Verilog HDL on FPGA. The design parallelizes the computations of background update and subtraction with a seven-stage pipeline. A stripe-based morphological processing and accounting for the completion of detected objects is devised. Simulation results for videos of VGA resolutions on a low-end FPGA device show 368 fps throughput for only the real-time background subtraction module, and 51 fps for the whole system, including off-chip memory access. Real-time efficiency with low power consumption and low resource utilization is thus demonstrated.


Introduction
Moving object detection is typically the first processing stage in video surveillance systems. It is one of the most important steps in smart video surveillance, which aims to detect foreground objects and events. The foreground is essential for tracking objects and maintaining their identities. Foreground objects can be detected by maintaining a representation of the scene background: a foreground object is located by finding significant differences between the current frame and the background representation.
Many methods have been proposed for detecting moving objects, such as temporal difference, optical flow and background subtraction [1]. Non-recursive methods, including temporal difference and optical flow, adopt sliding windows to build background models. This approach stores the most recent image frames in a moving window and applies a statistical measure such as a median filter [2]-[4] or a mean filter [5] to the history of each pixel within the window in order to estimate the background image. The moving window keeps the background model up to date by discarding old pixels as new pixels enter. However, slowly moving objects require longer windows, which in turn require a very large pixel memory. Background subtraction is the focus of the present study and has the best accuracy among these methods. Common background subtraction methods include Running Average (RA) [6], the Gaussian mixture model (GMM) [7] and nonparametric kernel methods [8]. Although GMM and nonparametric methods are stable algorithms, their complexity [9] [10] makes them difficult to implement in hardware. RA is also a stable algorithm, and its signal processing is more tractable for hardware implementation because its statistical model has a single modality.
Figure 1 illustrates the general algorithmic steps of background subtraction. In the figure, $\sigma_{t+1}(x, y)$ is the adaptive threshold value at location $(x, y)$ of the background image at time $t + 1$; it is the input to the morphology processing unit.
DSP (Digital Signal Processor), GPU (Graphics Processing Unit) and FPGA (Field Programmable Gate Array) are three hardware solutions used to accelerate background subtraction algorithms. A DSP is a fast microprocessor specialized for real-time digital signal processing. It can use instruction-level parallelism with multiple data paths to achieve speedup, but its computational power is limited [11]. GPUs are powerful processors that outperform central processing units in specific applications on computers and portable devices. They achieve parallelism by running many threads simultaneously on a large number of cores. However, they are not power efficient. An FPGA is an integrated circuit designed to be configured by the designer, generally using a hardware description language (HDL) similar to that used for an ASIC (Application-Specific Integrated Circuit). Its reconfigurable fabric can handle complex operations in parallel, and this capability combined with a pipelined design can speed up background subtraction. Because the FPGA architecture is highly parallel and can be tailored to image processing and other complex algorithms, FPGAs are well suited to implementing image processing and computer vision algorithms in embedded systems.
Many studies have been devoted to accelerating moving object detection on FPGAs. The following papers all aim at real-time performance for 640 × 480 or 720 × 576 resolutions on high-end, high-power FPGAs. Appiah et al. [12] achieved 60 fps on an FPGA, which is promising compared with the 25 fps achieved by a 3 GHz Pentium 4. However, the method requires more hardware resources because four frame buffers must be allocated to handle the process. Cucchiara et al. [13] applied four FPGAs with four frame buffers to achieve object detection, but the best performance is only 5 fps, which is far from meeting real-time requirements. Elgammal et al. [8] used Wronskian statistics for moving object detection and implemented it on an FPGA. However, its maximum performance of 15 fps is not sufficient for real-time detection [14]-[16]. Genovese et al. [17] implemented an OpenCV GMM algorithm on a Virtex5 FPGA, processing 22 fps at 1920 × 1080 resolution. Jang et al. [18] proposed a circuit able to process 1024 × 1024 video sequences at 38 fps when implemented on a VirtexII FPGA platform. Genovese et al. [19] presented two hardware implementations of the OpenCV version of the Gaussian Mixture Model (GMM) background identification algorithm. When implemented on a Virtex6, the proposed circuit processed 60 fps at 1920 × 1080 resolution.
The architecture proposed in this study implements the running average and morphology algorithms in hardware on an FPGA. The system is designed to achieve real-time performance with low power consumption. The proposed circuit has been validated through experimental measurements on a hardware platform.
The main contributions of this paper are the following:
1) An innovative, hardware-oriented formulation of the RA equations that improves hardware speed and saves resources.
2) A background subtraction and morphology algorithm accelerated by reconfigurable hardware, which allows embedded systems to operate in real time.
3) An experimental demonstration of the proposed FPGA circuit running in an on-line video system.
The main block includes five modules, as shown in Figure 2. The block converts the color format from RGB to YCbCr, establishes the background model from the Y component, applies background subtraction to compare the Y components of the foreground and the background, and finally performs morphology to obtain the object detection result.
The remainder of this paper is organized as follows. Section 2 reviews the running average and morphology algorithms and presents the proposed Real-Time Background Subtraction (RTBS) design. Section 3 presents the proposed RTBS system. Simulation and verification of the design are given in Section 4. Section 5 concludes with the advantages of the design.

Methodology
This section describes the algorithms and methods on which the system is built. It first reviews the formulation of background subtraction and the RTBS algorithm. The RTBS improvement over the RA algorithm, which removes the division operations, is then described, and the dataflow of the RTBS is analyzed. Next, the morphology algorithm and its dataflow are discussed. Finally, the pipeline diagram of the morphology unit is explained.

Background Subtraction
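The background subtraction algorithm consists of two steps: differencing and background modeling. The differencing step extracts motion pixels by computing the difference between the current frame and the background model. The background model is built statistically under a single-modality assumption. The differencing step can be formulated as follows:

$M_t(x, y) = |P_t(x, y) - B_t(x, y)|$ (1)

where $M_t(x, y)$ is the subtraction result, $P_t(x, y)$ is the current frame and $B_t(x, y)$ is the background model at pixel $(x, y)$. The background model of each pixel is assumed to be a single Gaussian, and its parameters can be recursively updated with each new frame, which improves computational efficiency and reduces memory requirements [8]. The recursive form of the expected value of the single Gaussian, $B_{t+1}(x, y)$, is given by Equation (2), where $B_t(x, y)$ is the established background image at time t and k is the learning parameter controlling the background learning speed. The difference image is binarized by the adaptive threshold of Equation (3), which is obtained by recursively updating the variance of the Gaussian model in Equation (4); the standard deviation $\sigma_t(x, y)$ is applied to the adaptive threshold and recursively updated by the current frame. The parameter λ determines the desired precision of thresholding.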
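The bodies of Equations (2) to (4) are not legible in this copy. As a reading aid only, one form consistent with the surrounding description (a single-Gaussian running average with one division per update and a threshold on the standard deviation) is sketched below. The placement of k, the time indices and the output symbol $D_t$ are assumptions made for illustration, not the authors' confirmed notation.

```latex
\begin{align}
M_t(x,y) &= \lvert P_t(x,y) - B_t(x,y) \rvert \tag{1}\\
% Assumed running-average form: one division by the learning parameter k
B_{t+1}(x,y) &= \frac{(k-1)\,B_t(x,y) + P_t(x,y)}{k} \tag{2}\\
% D_t is a hypothetical name for the binary foreground mask
D_t(x,y) &=
  \begin{cases}
    1, & M_t(x,y) > \lambda\,\sigma_t(x,y)\\
    0, & \text{otherwise}
  \end{cases} \tag{3}\\
% Assumed recursive variance update, again with one division by k
\sigma_{t+1}^{2}(x,y) &= \frac{(k-1)\,\sigma_t^{2}(x,y) + M_t^{2}(x,y)}{k} \tag{4}
\end{align}
```

Under this reading, the two divisions sit in (2) and (4), and the square root is implicit in recovering $\sigma_t$ from $\sigma_t^2$ for (3), which matches the operations the text singles out as costly.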
The above formulation for background subtraction and updating includes five multiplications, two divisions, and one radical (square-root) operation. The divisions in Equation (2) and Equation (4) require significant logic resources and can slow down computation. Replacing the division with bit shifting in integer arithmetic significantly improves the hardware design of the RA algorithm. In addition, the variance $\sigma_t^2$ obtained in Equation (4) must be square-rooted into the standard deviation $\sigma_t$ for the thresholding in Equation (3); this implicit radical expression must also be eliminated. Reformulation is therefore necessary in order to shorten computation time and reduce resource consumption.
In order to use a shift circuit instead of a divider, the denominators of Equation (2) and Equation (4) must be modified. The variance $\sigma_t^2$ is replaced with $V_t$, and the $\sigma_t$ in Equation (3) is substituted with $\sqrt{V_t}$, after which both sides of the thresholding condition are squared. This yields three new equations, Equations (5) to (7), in which N is the m-th power of 2, i.e., $N = 2^m$. Because N is a power of two, division by N reduces to a right shift by m bits, so a shift circuit replaces the division operations. Mathematically, the reformulation produces a residue between RA and RTBS in background updating. However, it can be demonstrated that, in practice, the residue vanishes as t increases.
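The reformulation described above can be illustrated with a minimal per-pixel sketch in Python. It assumes updates of the form $B_{t+1} = ((N-1)B_t + P_t)/N$ and $V_{t+1} = ((N-1)V_t + M_t^2)/N$ with the squared test $M_t^2 > \lambda^2 V_t$; these forms, the function name `rtbs_step`, the scaled accumulators and the value of λ are assumptions for illustration, not the paper's Verilog design.

```python
M_BITS = 7        # m, so N = 2**m = 128 (the value used in the paper's experiments)
LAMBDA_SQ = 9     # lambda**2 for lambda = 3; an illustrative choice, not from the paper


def rtbs_step(p, b_acc, v_acc, m=M_BITS, lam_sq=LAMBDA_SQ):
    """Process one pixel sample with integer adds, multiplies and shifts only.

    p      -- current luma sample (integer)
    b_acc  -- background mean scaled by N (fixed point, m fractional bits)
    v_acc  -- variance estimate V scaled by N
    Returns (is_foreground, new_b_acc, new_v_acc).
    """
    b = b_acc >> m                       # background estimate B_t
    diff = p - b                         # differencing step
    diff_sq = diff * diff                # squaring removes the need for |.|
    # Squared threshold test: diff^2 > lambda^2 * V_t, with V_t = v_acc / N.
    # Multiplying both sides by N keeps everything in integers.
    is_foreground = (diff_sq << m) > lam_sq * v_acc
    # B_{t+1} = ((N - 1) * B_t + P_t) / N, written on the scaled accumulator.
    new_b_acc = b_acc - (b_acc >> m) + p
    # V_{t+1} = ((N - 1) * V_t + diff^2) / N, same form.
    new_v_acc = v_acc - (v_acc >> m) + diff_sq
    return is_foreground, new_b_acc, new_v_acc


if __name__ == "__main__":
    b_acc, v_acc = 100 << M_BITS, 4 << M_BITS   # start at B = 100, V = 4
    for sample in (100, 101, 99, 200):
        fg, b_acc, v_acc = rtbs_step(sample, b_acc, v_acc)
        print(sample, fg)
```

Keeping the accumulators scaled by N is a design choice of this sketch: without the extra fractional bits, a plain `((N - 1) * b + p) >> m` on 8-bit data would truncate small changes and the background would never adapt.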
Before the hardware is designed in Verilog, Equation (5) is analyzed in more detail to identify the data flow of background updating, which is shown in Figure 3(a). The data flow of Equation (6), which finds the adaptive threshold, is shown in Figure 3(b). Equation (7) performs the new adaptive thresholding that detects objects; its dataflow diagram is shown in Figure 3(c).
The arithmetic operations required by RTBS and RA are now compared; a detailed comparison is given in Table 1. Although RTBS requires two extra multiplications, it eliminates the need for division and radical expression, which can dramatically reduce hardware resource utilization. For example, the radical expression (√) count in Table 1 drops from 1 for RA to 0 for RTBS. Figure 4 illustrates the residues for $N = 128$. Higher residues may exist for small t, but residues diminish to zero when $t = N$.

Morphology
Hardware designs for morphology are mainly built from two operations, erosion and dilation. Before discussing them, it is necessary to understand the structuring element, shown in Figure 5. This study uses a 3 × 3 mask, built from line buffers together with registers, to obtain the nine neighboring pixels (M1 ~ M9), of which M5 is the origin. Erosion and dilation are the two basic operations of morphological image processing, and both opening and closing are composed of them. Both erosion and dilation must first establish the 3 × 3 image window in order to obtain pixels M1 ~ M9, with P1 ~ P9 defined as the values of the structuring element. Combining the window values with an AND gate gives erosion, while combining them with an OR gate gives dilation (see Figure 6).
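How line buffers and registers can deliver the nine pixels M1 ~ M9 from a raster-scanned stream is sketched below in Python. The function name `stream_windows`, the use of `collections.deque` as a stand-in for a hardware FIFO, and the choice to emit windows only for interior pixels are assumptions made for illustration.

```python
from collections import deque


def stream_windows(pixels, width):
    """Yield (row, col, window) for each interior pixel of a raster-scanned image.

    window is a 3x3 list [[M1, M2, M3], [M4, M5, M6], [M7, M8, M9]],
    with M5 (the origin) at image position (row, col).
    """
    line1 = deque([0] * width, maxlen=width)   # FIFO delaying the stream by one line
    line2 = deque([0] * width, maxlen=width)   # FIFO delaying it by two lines
    rows = [[0, 0, 0] for _ in range(3)]       # 3x3 register window
    for i, p in enumerate(pixels):
        y, x = divmod(i, width)
        one_line_ago = line1[0]                # pixel at (y - 1, x)
        two_lines_ago = line2[0]               # pixel at (y - 2, x)
        line2.append(one_line_ago)             # oldest entry drops out automatically
        line1.append(p)
        # Shift each register row left and load the new column on the right.
        for reg_row, value in zip(rows, (two_lines_ago, one_line_ago, p)):
            reg_row.pop(0)
            reg_row.append(value)
        # The window is complete once two full lines and two columns have arrived.
        if y >= 2 and x >= 2:
            yield y - 1, x - 1, [reg_row[:] for reg_row in rows]


if __name__ == "__main__":
    for row, col, win in stream_windows(list(range(16)), width=4):
        print(row, col, win)
```

Only two lines of storage are needed regardless of image height, which is why this structure is attractive when on-chip memory is scarce.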
The flow of the morphology data processing is described next. In the DFG (data flow graph) shown in Figure 7, two kinds of node are defined as follows:
• "E" is the 3 × 3 image window for erosion.
• "D" is the 3 × 3 image window for dilation.
First, a 3 × 3 image window is established and node "E" is entered for erosion, followed by node "D" for dilation; this opening cleans up the image. Next, a 3 × 3 image window is established and node "D" is entered for dilation, and then another 3 × 3 image window is established and node "E" is entered for erosion, which completes the closing that repairs the image.
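The E and D nodes and the opening-then-closing sequence can be sketched in a few lines of Python. This is a software illustration on a binary image stored as a list of lists; the function names and the zero-padding at the image border are assumptions, and the paper's pipelined hardware is not reproduced here.

```python
def window_op(img, combine):
    """Apply a 3x3 neighbourhood operation to a binary image (list of lists)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Pixels outside the image are treated as 0 (an assumption).
            neigh = [
                img[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = combine(neigh)
    return out


def erode(img):
    """Node E: AND of the nine window pixels."""
    return window_op(img, lambda n: int(all(n)))


def dilate(img):
    """Node D: OR of the nine window pixels."""
    return window_op(img, lambda n: int(any(n)))


def opening(img):
    """Erosion followed by dilation; removes isolated noise pixels."""
    return dilate(erode(img))


def closing(img):
    """Dilation followed by erosion; fills small holes."""
    return erode(dilate(img))


def clean_mask(img):
    """E -> D -> D -> E, the order described in the text."""
    return closing(opening(img))
```

For example, a 3 × 3 block of ones with a single stray pixel elsewhere passes through `clean_mask` with the block intact and the stray pixel removed.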
The morphology unit is implemented as a 4-stage pipeline, as shown in Figure 8.

System
This section first reviews the system that implements the RTBS algorithm on a Field Programmable Gate Array. Figure 9 shows the structure of the moving object detection system. A 1.3 megapixel CMOS digital camera module acquires the images, and all further image processing is performed by the FPGA: the color format is converted from RGB to YCbCr, the background is established from the Y component, background subtraction compares the Y components of the foreground and the background, and moving object detection is finally achieved.
The Real-Time Background Subtraction (RTBS) unit integrated into the system implements background subtraction, background updating and adaptive thresholding. The overall operation can thus be implemented in hardware using subtractors, adders, shifters and multipliers. The RTBS employs hardware features such as parallelism and pipelining, and its architecture is pipelined into 7 stages. The input stage fetches 2 pixels from the input image port and forwards them to the input ports of the line buffers (10 bits to each buffer), which have a FIFO structure.
Background Subtraction and Morphology were integrated mainly due to the simple hardware structure, which can be easily parallelized to provide more than one pixel per single cycle.This is important, as the ability of the RTBS to perform parallel computations depends on the ability of the line buffer to provide multiple pixels per cycle.The Morphology employs hardware features such as parallelism and pipelining, in an effort to parallelize the repetitive calculations involved in the Erosion and Dilation operations, and uses optimized memory structures in order to reduce the memory reading redundancy.The architecture of the RTBS is shown in Figure 10.
The architecture implementation of the RTBS algorithm consists of a memory controller, RGB to YCbCr color space converters (RGB2YCbCr) and background image storage. The memory controller fetches the RGB color values for the first line buffer (FIFO Buffer 1) and the second line buffer (FIFO Buffer 2) from the external memory in a column-wise fashion (1 pixel value per input image every clock cycle). The architecture of the 4-port SDRAM control block is shown in Figure 11.
Those values are then converted by the RGB2YCbCr units to their corresponding 10-bit YCbCr representation, whose Y component is the grayscale value. The background image values of the RTBS computed by the background subtraction unit and the morphology unit are temporarily stored in the SDRAM frame buffers.
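The grayscale (Y) conversion can be illustrated with a multiplier-and-shift sketch in Python. The paper does not state which coefficients its RGB2YCbCr unit uses; the sketch assumes the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114) scaled by 256, and 8-bit full-range inputs rather than the 10-bit data path mentioned in the text.

```python
def rgb_to_luma(r, g, b):
    """Integer approximation of Y = 0.299 R + 0.587 G + 0.114 B.

    The weights 77, 150 and 29 sum to 256, so the final division
    becomes a right shift by 8 bits and no divider is needed.
    """
    return (77 * r + 150 * g + 29 * b) >> 8


if __name__ == "__main__":
    for rgb in [(0, 0, 0), (255, 255, 255), (255, 0, 0)]:
        print(rgb, rgb_to_luma(*rgb))
```

Only the Y value is needed downstream, since the background model and the subtraction both operate on the Y component.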
The design was simulated and implemented on a low-end FPGA with low power consumption. The FPGA has about five hundred thousand gates, 150 18 × 18 multipliers and 60 KB of internal memory. The main system clock runs at 25 MHz, which gives very low power consumption compared with a PC with a 2.4 GHz processor. Verilog HDL is used for implementation and QUARTUS II for synthesis. An external 64 MB SDRAM is used for the frame buffers that store the background models. A 1.3 megapixel CMOS sensor module acquires raw color images for the FPGA platform. The raw data arrive in a Bayer pattern arrangement and have to be converted to RGB color format. The system is divided into three blocks, as shown in Figure 12. The main block receives the raw Bayer data and performs the RTBS background subtraction task. The background models $B_t$ and $V_t$ are stored in the off-chip SDRAM because of the limited internal memory of the FPGA.
The subtraction result is sent to the VGA controller for display. Due to the constraints of the peripheral circuits, the system has two clock rates: 25 MHz for CMOS image acquisition and processing, and 120 MHz for SDRAM access to the background models.

Simulation and Experiment
This section first verifies the accuracy of the FPGA hardware circuit by testing it on experimental images. Next, the frame rate is measured in order to analyze and discuss the throughput. The results of the BS and RTBS equations are then compared to determine the difference between the two after image analysis. Finally, the FPGA hardware resource usage is explained.

Frame Rate Analysis and Experiment
This section further verifies the performance of the system. A seven-segment display was used to count the number of frames processed within a minute, giving an average result of 51 fps. This result differs from the 127 fps obtained from synthesis and simulation because the hardware and the simulation run at different clock rates.
Next, the main clock is analyzed. The clock rate in the main block is 25 MHz. Each frame also includes blanking time, which depends on the current exposure setting, so only around 70% of the clock cycles carry actual image data, i.e., 25 MHz × 0.70 = 17.5 MHz. With 0.3072 M pixels per 640 × 480 frame, the corresponding frame rate is 56 fps (17.5 MHz/0.3072 M).
The estimate of 56 fps from the above analysis is very close to the measured 51 fps. The experiment is therefore limited by the main clock, which is set by the CMOS sensor clock.
Moreover, the RTBS was implemented in software on a 1.8 GHz P4 CPU with 1 GB of DDR/333 MHz memory. This yielded a frame rate of 3.22 fps, far below the FPGA result. Table 2 shows more detailed frame rate results and demonstrates that the hardware is far more efficient than the software. This study uses QUARTUS II to further analyze the frame rate by simulation. QUARTUS II contains the TimeQuest Timing Analyzer, which applies the industry-standard Synopsys Design Constraints methodology for constraining designs. The timing characteristics and timing performance of the system are obtained from the analyzer. The timing analysis reports that the clock period of the main block is 8.8 ns, which corresponds to 113 MHz. Therefore, theoretically, the frame rate of the main block is 368 fps (113 MHz/0.3072 M). The critical-path clock period of the SDRAM controller is 6.4 ns, which corresponds to 156 MHz. However, because the SDRAM controller serves 4 read/write ports, only 39 MHz is available per port. Therefore, the frame rate is 127 fps (39 MHz/0.3072 M).
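The frame-rate figures quoted above all follow from dividing an effective pixel clock by the number of pixels per frame. The short Python sketch below reproduces that arithmetic; the function name `fps` and the one-pixel-per-clock assumption are illustrative, while the clock values are the ones stated in the text.

```python
PIXELS_PER_FRAME = 640 * 480     # 307,200 pixels, i.e. 0.3072 M for VGA


def fps(pixel_clock_hz, pixels_per_frame=PIXELS_PER_FRAME):
    """Frames per second when one pixel is processed per clock cycle."""
    return pixel_clock_hz / pixels_per_frame


main_block = fps(113e6)          # main block at its synthesized clock rate
sdram_port = fps(156e6 / 4)      # SDRAM controller clock shared by 4 ports
sensor_bound = fps(25e6 * 0.70)  # 25 MHz sensor clock with about 70% active pixels

if __name__ == "__main__":
    print(f"main block:   {main_block:.2f} fps")    # about 367.84
    print(f"SDRAM port:   {sdram_port:.2f} fps")    # about 126.95
    print(f"sensor bound: {sensor_bound:.2f} fps")  # about 56.97
```

The first two values round to the 368 fps and 127 fps in the text; the third is quoted there as 56 fps, which corresponds to truncating rather than rounding.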
The proposed method was tested against this test set, and achieved the result shown in Figure 13.

Difference between BS and RTBS Equations
Background subtraction approaches have one well-known issue. If an object is present in the image when the system starts to establish its background, the object remains in the background for a period of time. This issue could also affect the proposed RTBS method, because the RTBS is an approximation of background subtraction methods. An experiment on this residual image was therefore conducted for the RTBS and analyzed with $N = 128$. During the FPGA experiment the frame number is shown on a seven-segment display, which is incremented by one each time the system completes the processing of one image. The experiment uses a man's hand as the foreign object in the background and holds it in place for 20, 40, 60, 80 and 100 frames out of 128, after which the hand is removed. The number of frames needed to clear the object is then recorded.
The results of the experiment are shown in Figure 14. The RTBS needs two frames to remove the hand object if the object appears at the beginning and lasts for 20 frames. As the object lasts for more frames, the removal time increases linearly. However, after 80 frames, it only needs around 10 image frames to clear up the object.

FPGA Hardware Resource Usage
FPGA resources include logic circuits and memory. In addition, two frame buffers occupying about 998 KB of external SDRAM are required. Table 3 shows the usage of FPGA resources for each function in the system. The resource utilization and memory requirements were analyzed with the Altera QUARTUS II tool. QUARTUS II allows the user to launch the ModelSim simulator from within the software using NativeLink, which facilitates simulation by providing an easy-to-use mechanism and precompiled libraries for EDA (Electronic Design Automation) RTL (Register Transfer Level) and gate-level timing simulation. Generally speaking, an ALTERA LE (Logic Element) is equivalent to 8 ~ 21 logic gates, typically 12 [20], and 1 bit of internal memory is usually equivalent to 4 logic gates [20]. This study uses these typical values to estimate the equivalent number of standard logic gates in the design. Table 4 shows the resource analysis for the DE2 development board. From this table, it is determined that the M4K internal memory used by the FPGA is equivalent to about 371,616 logic gates, and the BS equations and other logic circuits use about 19,416 logic gates. Therefore, the whole design uses around 371,616 + 19,416 = 391,032 logic gates.
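The gate-equivalent estimate above is a simple weighted sum, sketched here in Python. The conversion factors (12 gates per LE, 4 gates per memory bit) are the typical values quoted in the text; the LE and memory-bit counts used in the example are not given in this copy and are back-solved from the stated totals (19,416 / 12 and 371,616 / 4), so they should be read as assumptions.

```python
GATES_PER_LE = 12        # typical gate equivalent of one logic element [20]
GATES_PER_MEM_BIT = 4    # typical gate equivalent of one internal memory bit [20]


def gate_estimate(logic_elements, memory_bits):
    """Return (logic_gates, memory_gates, total_gates) for a design."""
    logic_gates = logic_elements * GATES_PER_LE
    memory_gates = memory_bits * GATES_PER_MEM_BIT
    return logic_gates, memory_gates, logic_gates + memory_gates


if __name__ == "__main__":
    # Hypothetical counts, back-solved from the totals quoted in the text.
    logic, memory, total = gate_estimate(logic_elements=1618, memory_bits=92904)
    print(logic, memory, total)
```

The split makes it clear that, under these conversion factors, internal memory dominates the gate-equivalent figure rather than the arithmetic logic.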
Finally, Table 5 shows the realization of the proposed hardware performance compared with that achieved in past papers on Background Subtraction algorithms in FPGA.

Conclusion
This paper proposes a background subtraction and morphology algorithm accelerated by reconfigurable hardware, which can help embedded systems achieve real-time security monitoring. The design partitions the functions into background modeling, subtraction and morphology. The high-cost function, background modeling, is reformulated by eliminating division operations, which both reduces resource utilization and improves performance. Data flow analysis further details the calculation of the design. In simulation, a high frame rate of 384 fps for the background subtraction with morphology and modeling module can be achieved at 25 MHz for 640 × 480 resolution videos. Real-time performance of 51 fps for the whole system, including off-chip memory access, demonstrates the efficiency of the design. The implementation on a low-end FPGA with low frequency indicates low power consumption. The final verification results show resource utilization of no more than 400 K gate counts, two frame buffers, and 1 MB SDRAM memory size. Further study of complex background subtraction algorithms such as Gaussian mixture model and LBP background subtraction is promising.

Figure 1 notation: the pixel location of the established background image at time t; the adaptable threshold value of the location of the object image with noise at time $t + 1$.

Figure 2. Detailed processing modules of the main block.

Notation for Equation (2): $B_t(x, y)$ is the established background image at time t, and k is the learning parameter controlling the background learning speed. The difference image is binarized by an adaptive threshold obtained by recursively updating the variance of the Gaussian model.

Figure 9. Structure of FPGA moving object detection system.

Figure 10. The architecture of the RTBS.

Figure 14. The linear relation between the lasting time of an object and the removal time of the object.

Table 1. Arithmetic operations of running average and the RTBS.

Table 2. Frame rate analysis.

Table 3. Resource utilization in FPGA.

Table 4. Resource usage percentage in the DE2 2C35 FPGA.