1. Introduction
Controlling the anthropogenic pressure on beach areas near the shore is important for several reasons, beginning with the preservation of natural habitats: coastal ecosystems are delicate and complex, and they are often threatened by human activity such as beach development and recreational activities (Luna da Silva et al., 2017). Controlling human presence helps preserve the natural habitats and biodiversity of these areas. It also protects wildlife: many species of marine animals, such as shorebirds, use beaches as nesting and breeding grounds, and human presence can disturb these animals and interfere with their natural behaviors. By controlling human presence, we can minimize the impact on these vulnerable species.
Human activity on beaches can also lead to pollution, including litter (Corraini et al., 2018), chemical pollution from sunscreen and similar products, and plastics used and abandoned on site (Thushari and Senevirathna, 2020). Education and lower anthropogenic pressure can reduce the amount of pollution entering the marine environment near the shoreline.
Beaches can also be dangerous in at least two ways: in the water, waves and currents can be strong enough to overcome even a good swimmer, while on land, unstable cliffs can cause fatal landslides. Controlling human presence helps ensure public safety and prevent irreversible situations by keeping human pressure at a level that the locally available rescue services can respond to.
It can be challenging to accurately estimate the number of people visiting public beaches, especially during peak times or busy seasons (Green et al., 2005). This can be due to several factors, such as the extent of beach areas, the variability in visitor patterns, and the difficulty in counting people in such a dynamic environment. However, there are several methods (Morgan, 2018) that can be used to estimate the number of visitors to public beaches, such as:
1) Manual counts: This involves physically counting the number of people entering and leaving the beach area at specific times or intervals. While this method can be time-consuming and labor-intensive, it can provide an accurate estimate of the number of visitors.
2) Automated counters: These are devices that use sensors or cameras to count the number of people entering and leaving the beach area. This method can be more efficient than manual counts, but it may not be as accurate in crowded or congested areas.
3) Surveys: Visitors can be asked to fill out surveys that collect information about their visit, such as the date and time they arrived, the duration of their stay, and the activities they engaged in. This method can provide valuable insights into visitor behavior and preferences, but it may not capture the entire population of visitors.
4) Aerial imagery: UAV- or drone-acquired images (Turner et al., 2016; Adade et al., 2021) provide a bird’s-eye view of the full beach extent and can be used to estimate the number of people from crowd density when the ground resolution is not sufficient to discriminate individuals.
While estimating the number of visitors to public beaches can be challenging, a combination of these methods can provide a more accurate estimate of human presence and inform management decisions to protect the environment, ensure public safety, and enhance visitor experiences, evolving toward an ecosystem-based management system (Sardá et al., 2015). Such a system is crucial both for beaches that are still semi-wild and for beaches in urban areas (Cabioch and Robert, 2022).
In the case of beach management (Domingo, 2021), deep learning methods such as You Only Look Once (YOLO) can be used to detect and track people on the beach in images or video footage from surveillance cameras or drones. This enables authorities to monitor beach activity, estimate visitor numbers, and identify potential safety risks or environmental concerns. The YOLO series (Redmon et al., 2016) is a family of popular object detection algorithms, available online, that use convolutional neural networks (CNNs) to identify and track objects in real or near real time, with minimal requirements in terms of operator skills and hardware: any version can run on a regular laptop. YOLO can also be trained to recognize specific behaviors or activities, such as swimming or sunbathing, which can provide additional insights into visitor behavior patterns. The algorithm can be customized to detect and track specific objects or people of interest, such as lifeguards or emergency responders, to ensure fast communication and more effective coverage in terms of assistance.
If beach surveillance data is shared with road authorities, it can help avoid human overload in the most critical areas; for example, if the data shows that visitor numbers are exceeding the capacity of a particular beach area, road authorities can implement traffic control measures or divert visitors to less crowded areas. Authorities can also use the data to identify potential safety risks or environmental concerns and take proactive measures to address them.
By sharing data on visitor numbers, traffic flows, and visitor behavior patterns, authorities can work together to manage visitor flows and prevent the impact of excessive human activity on sensitive ecosystems.
Overall, sharing beach surveillance data with road authorities can help promote sustainable tourism and protect delicate ecosystems from the negative impacts of human activity. It requires collaboration and coordination between multiple stakeholders, but the benefits of such efforts can be significant for both the environment and visitors.
This paper shows the possibility of monitoring the beach population by detecting and counting people in RGB images acquired from a high point of view with a regular surveillance camera, using a YOLO v7 model to process the images. What is new is the use of high-performance, openly available deep learning software that is not fooled by differences in the background (water, sand, rocks, vegetation) because it focuses on the objects used to train it.
2. Materials and Methods
A collection of images from a surveillance camera, acquired in August 2013 in the Portinho da Arrábida zone, an area with several beaches in southwest Portugal, was used. The camera is a Nikon D80 with a 300 mm lens, programmed to aperture priority and automatic white balance. These are large three-band RGB images of 2896 × 1944 pixels, which were pre-processed in a Matlab environment (Matlab, n.d.) into tiles of a convenient size (960 × 960 pixels) given the processing capabilities of the available hardware.
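The tiling step is straightforward to reproduce; the following is a minimal sketch in Python (the authors used Matlab, so this is an equivalent re-implementation, and the overlap strategy for edge tiles is an assumption):

```python
from pathlib import Path
from PIL import Image

TILE = 960  # tile size used in this work

def tile_image(path: str, out_dir: str = "tiles") -> None:
    """Split one large frame into TILE x TILE crops; the last tile of each
    row/column is anchored to the image border, so edge tiles may overlap."""
    img = Image.open(path)
    w, h = img.size  # e.g., 2896 x 1944 for the camera described above
    Path(out_dir).mkdir(exist_ok=True)
    xs = list(range(0, w - TILE, TILE)) + [w - TILE]
    ys = list(range(0, h - TILE, TILE)) + [h - TILE]
    for j, y in enumerate(ys):
        for i, x in enumerate(xs):
            crop = img.crop((x, y, x + TILE, y + TILE))
            crop.save(Path(out_dir) / f"{Path(path).stem}_{j}_{i}.png")
```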
The methodology is based on a recent version of the YOLO family of deep-learning algorithms, YOLO v7 (July 2022), available online in a GitHub repository (GitHub, n.d.).
The conceptual framework of this methodology follows the steps of supervised classification: the objects of interest (OoI) must first be identified in a representative subset of images; these annotated images are divided into training and validation datasets and form the basis for building the model, whose CNNs learn discriminative features. It usually takes several iterations before the training data produces a model that gives satisfactory results when tested on the validation data (Figure 1).
To manually identify the targets, we used MakeSenseAI (n.d.), a tool available online that works in three distinct stages: 1) upload the set of 33 images for annotation, 2) identify the OoI in each image with the tools available in the graphical interface, and 3) download the annotations as text files in a format compatible with YOLO.
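The YOLO annotation format stores one text file per image with one line per object: the class index followed by the box center and size, all normalized to the image dimensions. A minimal parser illustrates the format (the conversion to pixel corner coordinates is for clarity; file contents are illustrative):

```python
def read_yolo_labels(txt_path: str, img_w: int, img_h: int):
    """Parse YOLO-format lines: <class> <x_center> <y_center> <width> <height>,
    all coordinates normalized to [0, 1] relative to the image size."""
    boxes = []
    with open(txt_path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
            # convert to pixel corner coordinates (x1, y1, x2, y2)
            boxes.append((int(cls),
                          (xc - w / 2) * img_w, (yc - h / 2) * img_h,
                          (xc + w / 2) * img_w, (yc + h / 2) * img_h))
    return boxes
```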
The set of annotated images and corresponding labels is divided into two subsets, one (22 images) for training and one (11 images) for validation. The first contains 781 annotated examples of the objects of interest and the second 465, together constituting a representative sample of the targets in terms of both pose and location. The remaining 180 image tiles were used to evaluate the performance of the model.
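For YOLO v7, the training and validation folders are declared in a small dataset configuration file; a hypothetical example for a single person class follows (the paths are illustrative, only the keys follow the format YOLO v7 expects):

```python
from pathlib import Path

# Hypothetical dataset configuration in YOLO v7's expected format:
# image folders for train/val, number of classes (nc) and class names.
cfg = """\
train: datasets/beach/images/train  # 22 annotated tiles
val: datasets/beach/images/val      # 11 annotated tiles
nc: 1
names: ['person']
"""
Path("data/beach.yaml").write_text(cfg)
```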
One of the strengths of YOLO v7 and its previous versions is the possibility of transfer learning (Wang et al., 2022), which consists of reusing an already trained network as the basis for a new problem. Since basic features such as edges, forms or shapes are common to many different objects, a trained network can be applied to a new problem, with a well-established set of initial weights obtained from training on very large datasets such as Common Objects in Context (COCO), which contains over 320,000 annotated images (Lin et al., 2015). The new discriminators coming from the images we annotated are added to this solid base, defining the last layers of the CNN and adjusting the detector to the new problem.
Training was done once and took two hours and nineteen minutes for 750 iterations on a laptop equipped with an Intel Core i7-10750H processor, 16 GB of RAM, and an NVIDIA GeForce RTX 2060 Max-Q 6 GB graphics unit. The resulting weights, which define the new model, can be used to detect the same objects of interest in any similar image, with a processing time of around 0.2 s per image.
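A hedged sketch of the corresponding training call, following the pattern documented in the YOLO v7 repository for its larger (p6/e6e) models; the script name and flags are the repository's, while the file paths, batch size and weights file are assumptions, not values reported here:

```python
import subprocess

# The YOLO v7 repository trains p6/e6e models with train_aux.py; starting from
# pretrained COCO weights implements the transfer learning described above.
subprocess.run([
    "python", "train_aux.py",
    "--weights", "yolov7-e6e_training.pt",   # pretrained COCO weights (assumed file name)
    "--cfg", "cfg/training/yolov7-e6e.yaml",
    "--data", "data/beach.yaml",             # dataset file from the sketch above
    "--hyp", "data/hyp.scratch.p6.yaml",     # default hyperparameters (see Section 3)
    "--img-size", "960", "960",
    "--batch-size", "4",                     # assumed, to fit a 6 GB GPU
    "--device", "0",
    "--name", "beach",
], check=True)
```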
The results are quantified, as usual, in terms of Precision and Recall. Precision is defined as the percentage of correctly classified objects among all the targets identified by the model as positives (Equation (1)).
Precision = True Positives/(True Positives + False Positives) (1)
Recall is the percentage of positives correctly detected among all occurrences that are really positive (Equation (2)), which include the false negatives, i.e., undetected occurrences.
Figure 1. Example of an image before and after inference processing. The objects of interest detected by YOLO v7 in image (a) are marked in blue in image (b), with the respective confidence. The confidence threshold is 0.24.
Recall = True Positives/(True Positives + False Negatives) (2)
The F1 score is a combined measure of the accuracy of a model on a dataset, based on the harmonic mean of the precision and recall of the results (Equation (3)).
F1 = 2 × (Precision × Recall)/(Precision + Recall) (3)
A high value of F1 (near 1, as the F1 range is [0, 1]) indicates strong results in both precision and recall, and a low value means that the model performs poorly in precision, recall, or both.
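These measures are simple to compute from the raw counts; a minimal helper, with a usage example based on the counts reported in Section 3 (the printed values may differ slightly from the rounded figures quoted in the text):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall and F1 from true positive, false positive and
    false negative counts (Equations (1)-(3))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the test set in Section 3: 5285 true positives,
# 116 false positives, 387 false negatives.
p, r, f1 = detection_metrics(5285, 116, 387)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```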
3. Results and Discussion
The largest YOLO v7 model (e6e) was used and trained with a few custom alterations to the default hyperparameters so that the data augmentation suits the nature of the “human” data: the admissible change in scale was constrained to a factor of 1:2, rotation to 0.5 degrees, and translation to zero pixels. Shear, perspective, and upside-down flipping were kept at zero; the admissible alterations of hue, saturation, and intensity in color space were left at their defaults of 0.015, 0.7, and 0.4, respectively.
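In the YOLO v7 repository these settings live in the hyperparameter file passed to training; the following dictionary is an illustrative mapping of the values above onto the key names used in the repository's hyperparameter files (it is not the authors' actual file):

```python
# Augmentation hyperparameters as described in the text, using the key
# names from the YOLO v7 hyperparameter (hyp) files.
hyp_overrides = {
    "scale": 0.5,        # scale change constrained to a factor of 1:2
    "degrees": 0.5,      # rotation limited to 0.5 degrees
    "translate": 0.0,    # no translation
    "shear": 0.0,        # kept at zero
    "perspective": 0.0,  # kept at zero
    "flipud": 0.0,       # no upside-down flips
    "hsv_h": 0.015,      # hue (default)
    "hsv_s": 0.7,        # saturation (default)
    "hsv_v": 0.4,        # intensity/value (default)
}
```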
The resulting model, presenting a precision of 0.83, a recall of 0.78, and an mAP@50 of 0.82, was fine-tuned to find the best values of the intersection over union (IoU) threshold and the minimum confidence to be applied at inference. IoU is the ratio of the intersection to the union of two bounding boxes, the predicted one and the one used in the training stage; at inference, the IoU threshold governs how strongly overlapping predictions are suppressed, avoiding double and triple detections centered on the same target.
The confidence threshold determines the minimum confidence we demand from the model that each detected target really belongs to the trained class. As the confidence of each detection can be displayed on the classified image along with the bounding box around the detected object, we can easily assess whether any changes are required. A low confidence threshold will admit detections with a small probability of being true positives, while a higher threshold keeps only detections with a high probability of being correct. We used an IoU threshold of 0.35 and a confidence threshold of 0.24.
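The interplay of the two thresholds corresponds to the standard non-maximum suppression step; a minimal, self-contained sketch of the idea follows (not the repository's optimized implementation):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray,
        conf_thres: float = 0.24, iou_thres: float = 0.35) -> list:
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the detections kept."""
    idx = np.where(scores >= conf_thres)[0]      # 1) drop low-confidence detections
    idx = idx[np.argsort(-scores[idx])]          # 2) most confident first
    keep = []
    while idx.size > 0:
        best, rest = idx[0], idx[1:]
        keep.append(int(best))
        # IoU of the kept box against the remaining candidates
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        idx = rest[iou <= iou_thres]             # 3) discard heavily overlapping boxes
    return keep
```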
The model was applied to the remaining test dataset of 180 images, and the results were evaluated by human curation, producing three numbers for each test image: the number of targets detected by the model, the number of false positives among those detections, and the number of false negatives, i.e., missed targets (Table 1).
Table 1. Detections, false positives and false negatives in 180 images.
Detected (conf = 0.24, IoU = 0.35) | 5401
Targets identified by human curation | 5672
False Positives | 116
False Negatives | 387
The model gives an overall precision of 0.980 and a recall of 0.936 on the test dataset. Considering that 93.2% of the targets were detected (5285 of the 5672 targets identified by human curation in the 180 images), the processing time of around 0.2 s per image, and the F1 score of 0.958, this is an interesting result for surveillance applications, especially as this deep-learning tool is open-source software and can be installed on an average laptop without any special high-end hardware requirements. The training stage needs a human operator to identify the objects of interest in a subset of images with an online tool, which can take a few hours, or less if automatic identification can be used.
The inference has two outputs: an image in which all the occurrences identified by the model are surrounded by boxes, each with the confidence of the detection attached, and a numeric output including the total number of objects detected and the processing time.
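In practice this corresponds to an inference call of the following shape, using the detection script and flags of the YOLO v7 repository (the weight and source paths are illustrative):

```python
import subprocess

# Hedged sketch of the inference call; the script name and flags follow the
# YOLO v7 repository, while the paths are hypothetical.
subprocess.run([
    "python", "detect.py",
    "--weights", "runs/train/beach/weights/best.pt",  # weights from the training run
    "--source", "tiles/",          # folder of 960 x 960 tiles to process
    "--img-size", "960",
    "--conf-thres", "0.24",        # confidence threshold chosen above
    "--iou-thres", "0.35",         # IoU threshold chosen above
    "--save-txt", "--save-conf",   # also save detections as text, with confidences
], check=True)
```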
The parametrization of the inference, meaning the ideal confidence and IoU thresholds for a particular dataset, is the most demanding step of the procedure; it strongly depends on the weights given to false positives and false negatives in the problem, and therefore on the goal of each application.
Since our focus was on minimizing false positives, we worked with a parametrization under which 98.3% of the 180 test images have at most 3 false positives and 76.1% have at most 3 false negatives, while the model remains able to detect many partial targets (Figure 2).
In Figure 2, several objects partially hidden by the water are detected with high confidence, as are many targets in different positions that pose no challenge to this kind of algorithm. Targets on different backgrounds (water, light-colored dry sand, dark-colored wet sand; Figure 2) have the same probability of correct detection, since the training of this type of model focuses on the characteristics of the target, regardless of the background.
Figure 2. Example of an image before and after inference with IoU = 0.35 and confidence threshold 0.24. The objects of interest detected by YOLO v7 in image (a) are marked in green in image (b), with the respective confidence.
4. Conclusion
The acquisition of images with a high line of sight over a beach area allows the methodology described here to quantify the human load with good precision, without estimates or unnecessary interactions, at any desired time interval.
To implement this type of deep learning solution, an initial annotation stage must be carried out outside the workflow, preferably by a human operator who annotates the training and validation images, because the wide variety of poses on the beach would not be easily identified by automatic labeling. The variety of situations observed goes beyond the data augmentation capabilities offered by the hyperparameters, which are mostly limited to affine transformations.
Once the model is ready, each image, with any number of targets present, takes less than a second to process, and the resulting figures are more consistent, since they are not subject to the subjectivity of a human interpreter.
There are situations in which this type of remote surveillance is highly desirable, such as beaches where the instability of the cliffs is a known risk but which are frequented anyway, often accessed by sea; in these situations surveillance addresses two relevant aspects at once, the safety of people and the preservation of ecosystems (Sardá et al., 2015).
The procedure described can be applied to any monitored area where the human load has become critical, such as historic sites or fragile dune ecosystems, where it could be used to keep the flow of people within the observed area of interest at reasonable numbers, for instance through data exchange with the entry points.
Funding
This study had the support of national funds through Fundação para a Ciência e Tecnologia, under the project LA/P/0069/2020 (https://doi.org/10.54499/LA/P/0069/2020), granted to ARNET (Aquatic Research Network Associated Laboratory), and the projects UIDB/04292/2020 (https://doi.org/10.54499/UIDB/04292/2020) and UIDP/04292/2020 (https://doi.org/10.54499/UIDP/04292/2020), granted to MARE (Marine and Environmental Sciences Centre).