Individual Minke Whale Recognition Using Deep Learning Convolutional Neural Networks

The only known predictable aggregation of dwarf minke whales (Balaenoptera acutorostrata subsp.) occurs in the Australian offshore waters of the northern Great Barrier Reef in May-August each year. The identification of individual whales is required for research on the whales’ population characteristics and for monitoring the potential impacts of tourism activities, including commercial swims with the whales. At present, it is not cost-effective for researchers to manually process and analyze the tens of thousands of underwater images collated after each observation/tourist season, and a large database of historical non-identified imagery exists. This study reports the first proof of concept for recognizing individual dwarf minke whales using Deep Learning Convolutional Neural Networks (CNNs). The “off-the-shelf” Imagenet-trained VGG16 CNN was used as the feature encoder of the per-pixel semantic segmentation Automatic Minke Whale Recognizer (AMWR). The most frequently photographed whale in a sample of 76 individual whales (MW1020) was identified in 179 images out of the total 1320 images provided. Training and image augmentation procedures were developed to compensate for the small number of available images. The trained AMWR achieved 93% prediction accuracy on the testing subset of 36 positive/MW1020 and 228 negative/not-MW1020 images, where each negative image contained at least one of the other 75 whales. Furthermore, on the test subset, AMWR achieved 74% precision, 80% recall, and a 4% false-positive rate, making the presented approach comparable to or better than other state-of-the-art individual animal recognition results.


Introduction
The dwarf minke whale (Balaenoptera acutorostrata subsp.) is the second smallest baleen whale, born at approximately 2 m in length and growing to a maximum measured length of 7.8 m [1]. Dwarf minke whales are distributed throughout the southern hemisphere, including Antarctica, and were first acknowledged as a distinct form of minke in 1985 [1]. The only known predictable aggregation of dwarf minke whales occurs in the Australian offshore waters of the northern Great Barrier Reef (GBR) each year throughout the Australian winter months [3]. This aggregation supports a local swim-with-whales tourism industry [2] [3]. The predictable nature of this aggregation has also enabled dedicated research of dwarf minke whales, which has contributed to seminal work on dwarf minke whale biology [4], behavior [5], and assessment and management of swim-with-whales activities [2]. Outputs from this work have informed and shaped management policies and expanded knowledge of both the subspecies in general and, specifically, the interactions with the tourism industry. The uniqueness of this aggregation presents an opportunity to conduct research and improve the knowledge base for a poorly understood oceanic rorqual whale, as well as a responsibility to ensure that tourism activities are managed sustainably [2] [3] [5].
The identification of individual whales underpins much of the scientific research on dwarf minke whales and the monitoring of tourism activities. While in the GBR, these whales are highly inquisitive, readily approaching vessels and divers and often maintaining contact for prolonged periods [3] [5]. This behavior provides good opportunities for passengers aboard the swim-with tourism vessels to photograph dwarf minke whales. The whales' color patterns have been shown to remain stable over many years, and are sufficiently complex to allow for unequivocal identification of individuals [3] [6] [7]. The stability of these patterns and the regular, in-water access provided to researchers by tourism vessels have made the dwarf minke whale an ideal species for photo-identification (photo-ID) [6] [8].
Photo-ID is a simple, non-invasive technique widely used to study a range of biological and behavioral characteristics of wild animal populations. Ideal candidates for photo-ID are those with stable color patterns and/or other markings that are unique to each individual, so that individuals can be easily distinguished from each other and their identifiable markings remain the same over time. The automation of the photo-ID process is often highly specific to the required species, e.g. the fin contour of great white sharks [9]. Due to its fundamental research role, photo-ID is an active research area for many species, e.g. green sea turtles [10], gorillas [11], and dolphins [12]. For minke whales, photo-ID has typically involved visual comparison of large numbers of photographs by trained researchers; thus, the process is time-intensive.
Journal of Geoscience and Environment Protection
Much of the imagery used for photo-identification of dwarf minke whales in recent years has come from tourists and crew aboard swim-with-whales dive tourism vessels [8]. The quantity of this donated imagery has increased dramatically with the availability of low-cost digital underwater cameras and the resultant rise in popularity of these items among tourists [8]. Researchers are now obtaining tens of thousands of photographs and video clips each season. Consequently, it is no longer cost-effective for researchers to manually process and analyze such quantities of images, and a large database of historical non-identified imagery exists. In order to utilize the increasing quantity of imagery to address key biological and ecological knowledge gaps about these whales, automatic computer-vision based recognition software is required, and was the main focus of this study.
Over the last few years, Deep Learning Convolutional Neural Networks (CNNs) have revolutionized the field of computer-vision image recognition [13]. For example, the AlexNet image classification CNN [14] won the Imagenet Large Scale Visual Recognition Challenge (ILSVRC) [15] in 2012, and since then all the ILSVRC13-ILSVRC17 winners have used CNNs of various architectural configurations as their key features, e.g. [16]. It is customary to refer to such CNNs as being trained on Imagenet.
A typical Imagenet-trained CNN is set up to classify as many as 1000 different types of objects. Therefore, it is plausible to expect that such a CNN could distinguish at least 1000 different individual dwarf minke whales if it is trained or re-trained appropriately. This direct approach, however, has a number of limiting factors. First, millions of images are available in Imagenet for training CNNs, which is presently not feasible for dwarf minke whales, where the number of images available for an individual whale may vary between one and several thousand. Second, typical Imagenet object categories are very different, e.g. differences in images for dogs and people, whereas all minke whales fit essentially the same category for Imagenet (i.e. near-identical body shape, proportions and general color). Third, the output of a classification CNN is a single probability number for each available class, where category and class are used as equivalent terms in this study. Such a probability prediction has limited value to a marine biologist, as it does not explain why or how the CNN arrived at its prediction. This is known as the black-box perception and/or criticism of classification CNNs. The black-box CNN prediction is unavoidable in studies where animals are identified by their "faces", e.g. for gorillas [11], where identification uses facial geometrical proportions and is essentially based on the full face. Fortunately, dwarf minke whales are currently identified by finely detailed color patterns and scars (Figure 1), which could be recognized and localized by a CNN, and then confirmed by a trained researcher.
The black-box limitation of the classification CNNs has a natural solution when the CNNs are configured to perform semantic segmentation of images, where an image is segmented into per-pixel categories [17]. The output of segmentation CNNs is a per-pixel heat-map (also known as the probability or activation map) for each class. Therefore, a researcher could easily verify the CNN prediction by viewing the heat-map corresponding to the recognized individual whale (Figure 2). This approach was successfully validated in this proof of concept study by training a segmentation CNN to recognize a single whale within 1320 images of 76 different whales.

Dataset
The underwater imagery dataset used in this study consisted of 1320 digital photographs of dwarf minke whales (Balaenoptera acutorostrata subsp.). All images were sorted according to unique individual animals. In some cases only the left or right side of a whale was identified, without knowing if the corresponding images belonged to the same whale or not. Where it was possible to match the left and right sides to the same whale, the related imagery was labeled accordingly and placed together in the same folder. As a result, the dataset identified 76 different whales. The identification process was extremely time-consuming even for trained researchers, as it required recording and cataloguing the color patterns and scars of 76 different whales, and/or reviewing any new image against at least 76 other whale images, thus relying on researchers' memory to identify matches with any efficiency. The number of available images varied greatly between individuals; the MW1020 individual had the largest number of images (179), and several whales had only one image per individual.

Segmentation Neural Network
As described in the introduction, this study used a segmentation CNN rather than a classification CNN to recognize an individual minke whale and localize the recognized unique features. Specifically, the most accurate segmentation FCN-8s model from the Fully Convolutional Networks (FCN) [17] was selected due to the following considerations.
First, the FCN-8s model is based on the VGG16 CNN model [16], which was one of the top performers in the ILSVRC14 [15]. Second, this study used the Deep Learning Python framework Keras [18] with TensorFlow [19] as the processing backend. The Imagenet pre-trained VGG16 model was available within Keras [18], and the FCN-8s model had a number of publicly available Keras-based implementations, e.g. [20]. For this study, the FCN-8s version was recreated in Keras directly from the original Caffe source code of the FCN-8s model [21], and released to the public domain [22].
Third, at the time of writing, the FCN-8s publication [17] had the largest number of citations among segmentation CNNs, making it a widely accepted baseline model for semantic segmentation. Adopting this well-known FCN-8s model for this study was intended to make the presented method easier to reproduce and/or replicate for additional/different minke whale images or for other animal species recognition studies.
In terms of the actual implementation, the FCN-8s model was built by reusing all VGG16 convolutional layers, which were loaded with the Imagenet-trained VGG16 weights available in Keras [18]. Such reuse of CNN weights is often referred to as knowledge transfer [23]. VGG16 was designed to recognize 1000 classes of objects. Since this study was dealing with a maximum of 76 individual whales, the original VGG16/FCN-8s 4096 neurons were reduced to 1024 neurons when the last two dense (non-convolutional) VGG16 layers fc1 and fc2 were converted to their convolutional equivalents as per the FCN-8s model. This reduced the total FCN-8s size to approximately 160 MB when stored on disk, compared to 540 MB for the original FCN-8s model with 4096 neurons in the fc1 and fc2 layers. The non-VGG16 convolutional layers were initialized by the uniform distribution as per [24]. The sigmoid activation function [25] was used in the last (i.e. prediction) layer.
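The reported disk sizes can be sanity-checked with a rough parameter count. The sketch below is not taken from the study's code: it assumes 32-bit floating point weights, treats fc1 as a 7×7 convolution over 512 channels and fc2 as a 1×1 convolution (the usual FCN conversion), and ignores the comparatively small scoring and upsampling layers. The function names are hypothetical.

```python
# Rough parameter count for the VGG16 encoder plus the converted fc1/fc2 layers.
# Assumptions (not from the paper): float32 weights, fc1 as a 7x7 convolution,
# fc2 as a 1x1 convolution, scoring/upsampling layers ignored.

# (input channels, output channels) of the thirteen 3x3 VGG16 convolutions.
VGG16_CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out


def fcn_encoder_params(fc_units):
    """Parameters of the VGG16 convolutions plus convolutional fc1 and fc2."""
    encoder = sum(conv_params(3, c_in, c_out) for c_in, c_out in VGG16_CONV_CHANNELS)
    fc1 = conv_params(7, 512, fc_units)
    fc2 = conv_params(1, fc_units, fc_units)
    return encoder + fc1 + fc2


def size_mb(n_params, bytes_per_param=4):
    """Approximate storage size in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6


for units in (4096, 1024):
    n = fcn_encoder_params(units)
    print(units, n, round(size_mb(n)))
```

Under these assumptions the count gives roughly 537 MB for 4096 units and roughly 166 MB for 1024 units, which is in line with the approximate 540 MB and 160 MB figures quoted above.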

Data Augmentation and Training Workflow
The adopted FCN-8s [17]
The second or training augmentation protocol (TAP480) was applied to the ISP640-processed images, where each image was:
The following training workflow was adopted for this study. All available images were sequentially numbered and split into five approximately equal subsets.
The first three subsets were used as a single training set, i.e. 60% of all available images. The fourth and the fifth subsets became the validation and testing sets, respectively. More precisely, the $i$th image was allocated to the validation or test set if $(i + 1)$ or $i$ was a multiple of 5, respectively, and all remaining images were assigned to the training set.
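The allocation rule above can be sketched in a few lines. This is an illustration only: it assumes the sequential image numbering starts at 1, and the function names are hypothetical.

```python
def assign_subset(i):
    """Return the subset name for the i-th image (1-based numbering assumed)."""
    if i % 5 == 0:
        return "test"
    if (i + 1) % 5 == 0:
        return "validation"
    return "train"


def split_indices(n_images):
    """Group image numbers 1..n_images into train/validation/test lists."""
    subsets = {"train": [], "validation": [], "test": []}
    for i in range(1, n_images + 1):
        subsets[assign_subset(i)].append(i)
    return subsets


counts = {name: len(idx) for name, idx in split_indices(1320).items()}
print(counts)  # {'train': 792, 'validation': 264, 'test': 264}
```

For 1320 images this yields 264 test images, which agrees with the 36 positive and 228 negative test images reported for AMWR.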
The training of FCN-8s was done in up to 100 cycles. In each cycle, TAP480 was further applied to the already ISP640-processed images. The training images were loaded into memory as a four-dimensional tensor $X_t$.

Minke Whale Locator
Being a segmentation model, the FCN-8s model required the ground-truth per-pixel binary mask for each of the training and validation images. Therefore, the auxiliary goal of this study was to design the required workflow to be as scalable as possible for future larger training datasets. Creating the ground-truth per-pixel binary masks was clearly the least scalable component of this study, and required a scalable solution. This was solved by training an instance of FCN-8s to be the Minke Whale Locator (MWL).
To train MWL, 100 images were segmented by hand (including 50 of the MW1020 individual) to produce a binary per-pixel ground-truth mask Y for each of the 100 images. Then MWL was trained as per preceding Section 2.2 with the following modifications. In addition to TAP480, images were flipped horizontally with 0.5 probability. The available 100 images were split into 70 for training and 30 for validation, where the rest of the not-segmented images were considered to be the testing set. The Keras version of the RMSprop optimizer was used with a $10^{-4}$ learning rate and a $10^{-3}$ learning rate decay after each weights update, where RMSprop "divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight" [26]. Once the per-pixel validation accuracy stopped improving (usually at around 95%), the Stochastic Gradient Descent (SGD) optimizer was used with a $10^{-4}$ learning rate, a $10^{-3}$ learning rate decay, 0.9 momentum, and enabled Nesterov momentum.
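To give a feel for how quickly a $10^{-3}$ per-update decay reduces a $10^{-4}$ learning rate, the sketch below assumes the time-based decay rule used by the legacy Keras optimizers, lr / (1 + decay × iterations), and a batch size of four images as used elsewhere in this study. Whether the study's Keras version applied exactly this rule is an assumption, and the function names are hypothetical.

```python
import math


def decayed_learning_rate(initial_lr, decay, iterations):
    """Time-based decay as in legacy Keras optimizers (assumed here)."""
    return initial_lr / (1.0 + decay * iterations)


def updates_per_epoch(n_images, batch_size):
    """Number of weight updates needed to pass once through n_images."""
    return math.ceil(n_images / batch_size)


steps = updates_per_epoch(70, 4)  # 70 MWL training images, batches of four
for epoch in (0, 1, 16):
    lr = decayed_learning_rate(1e-4, 1e-3, epoch * steps)
    print(f"after {epoch:2d} epochs: learning rate = {lr:.3e}")
```

Under this rule the learning rate halves after 1000 weight updates, so the decay acts slowly relative to a single 16-epoch training cycle.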
The trained MWL was applied to all available images to automatically generate one largest rectangular binary mask per ISP640 pre-processed image. Note that since MWL was fully convolutional, it was rebuilt to accommodate any required image dimensions, where one side was always 640 (due to ISP640) but the other side varied. The mask generation was done as follows. For each image, the per-pixel prediction heat-map $Y_p(i, j)$ was converted to a binary mask $B$ via $B(i, j) = 1$ if $Y_p(i, j) \geq 0.8$, where $i$ and $j$ were the row and column pixel location indices, respectively, and where the remaining mask values were set to zero, i.e. $B(i, j) = 0$ if $Y_p(i, j) < 0.8$. The largest connected non-zero area was filled to complete its minimum-enclosing rectangle, and saved as the only non-zero values of the final binary mask.
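The mask generation steps can be sketched as follows. This is a minimal illustration rather than the study's implementation: it assumes 4-connectivity for the connected area, an axis-aligned enclosing rectangle, and a heat-map stored as a two-dimensional NumPy array. The function name is hypothetical.

```python
from collections import deque

import numpy as np


def largest_rectangle_mask(heat_map, threshold=0.8):
    """Threshold a heat-map, keep the largest connected area, fill its bounding box."""
    binary = heat_map >= threshold
    rows, cols = binary.shape
    visited = np.zeros((rows, cols), dtype=bool)
    best = []  # pixels of the largest connected component found so far

    for r0 in range(rows):
        for c0 in range(cols):
            if not binary[r0, c0] or visited[r0, c0]:
                continue
            # Breadth-first search over 4-connected neighbours (an assumption).
            component = []
            queue = deque([(r0, c0)])
            visited[r0, c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and binary[rr, cc] and not visited[rr, cc]):
                        visited[rr, cc] = True
                        queue.append((rr, cc))
            if len(component) > len(best):
                best = component

    mask = np.zeros((rows, cols), dtype=np.uint8)
    if best:
        rs = [r for r, _ in best]
        cs = [c for _, c in best]
        # Fill the minimum enclosing axis-aligned rectangle of the largest area.
        mask[min(rs):max(rs) + 1, min(cs):max(cs) + 1] = 1
    return mask
```

For full-size images a library routine for connected-component labelling would be considerably faster than this pure-Python search; the loop is written out here only to make each step explicit.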

Automatic Minke Whale Recognition
Similar to the preceding MWL model, an instance of the FCN-8s model was created for a required number of $K$ individual whales to be the Automatic Minke Whale Recognition (AMWR) model. To train AMWR, the masks automatically created by MWL for the $K$ whales were reviewed for correctness. Specifically, each MWL-generated rectangular mask was checked to make sure it enclosed the correct whale if multiple whales were present in an image. Also, if the mask did not enclose the whole whale, the mask was verified to enclose all of the whale's features that a biologist could use to identify that whale, i.e. fin coloration patterns and distinct scars. Note that in this study, the MWL model was nothing more than a convenience tool to automate ground-truth mask creation.
On the test subset, AMWR achieved a 4% false-positive rate (Table 1). A low false-positive rate was viewed as essential to support a workflow where many thousands of unsorted images could be scanned for the known whales, and the number of "false-alarm" instances would remain feasible to classify manually. AMWR's test precision (74%) and recall (80%) results (last column of Table 1) were better than the corresponding state-of-the-art gorilla identification results [11] of approximately 60%. AMWR's test accuracy (93%) and precision (74%) were comparable to the 81% average precision achieved in the state-of-the-art great white shark identification results [9]. The validation and test prediction metrics were comparable (third and fourth columns in Table 1), supporting the achieved test values as the expected benchmark/baseline values of the AMWR model in future similar circumstances/studies.
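For reference, the four reported metrics follow from a binary confusion matrix as sketched below. The counts used in the example are hypothetical: they were chosen only to match the test subset sizes (36 positive and 228 negative images) and are not the confusion matrix of this study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and false-positive rate from raw counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts: 29 + 7 = 36 positives, 10 + 218 = 228 negatives.
metrics = binary_metrics(tp=29, fp=10, tn=218, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the negative class is more than six times larger than the positive class, accuracy alone is dominated by the true negatives, which is why precision, recall and the false-positive rate are reported alongside it.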

Conclusion
Due to the increasing abundance of underwater digital imagery, the manual identification of individual dwarf minke whales from images and videos has become cost-ineffective. It has become excessively time-consuming to manually check if an unsorted image contains a new whale or a known whale, e.g. from the

Figure 1. Example of an individual minke whale's distinct fin color pattern and scars.

• Randomly rotated in the range of $[-45, +45]$ degrees, where the input image was reflected to fill pixels outside the original boundary as required;
• Randomly resized in the scale range of $[0.75, 1.25]$, i.e. by up to 25% zooming in or out;
• Randomly shifted in each color channel in the $[-25.5, 25.5]$ range, where 25.5 was 10% of the maximum color value of 255;
• Randomly gamma shifted in the $[-25.5, 25.5]$ range, where all color channel values were shifted together;
• Randomly cropped to retain $480 \times 480$ pixels;
• Imagenet color mean values were subtracted, as is commonly done when working with the Imagenet-trained VGG16 model.
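A partial sketch of these augmentation steps is given below, covering only the color shifts, the random crop and the mean subtraction; the rotation and resizing steps are omitted because they need an image-processing library. Several details are assumptions rather than facts from the study: the "gamma shift" is treated as a single additive shift applied to all channels, values are clipped to [0, 255], the channels are in BGR order, and the mean values are those commonly used with the Caffe-style VGG16 weights. The function name is hypothetical.

```python
import numpy as np

# Channel means commonly used with Imagenet-trained VGG16 (BGR order assumed).
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def augment_partial(image, rng, crop=480, max_shift=25.5):
    """Apply a subset of the listed steps to an H x W x 3 image (H, W >= crop)."""
    out = image.astype(np.float32)
    # Independent random shift of each color channel.
    out = out + rng.uniform(-max_shift, max_shift, size=(1, 1, 3))
    # One random shift applied to all channels together (assumed reading
    # of the "gamma shift" step).
    out = out + rng.uniform(-max_shift, max_shift)
    # Keep values in the valid color range (an assumption).
    out = np.clip(out, 0.0, 255.0)
    # Random crop to crop x crop pixels.
    height, width = out.shape[:2]
    top = rng.integers(0, height - crop + 1)
    left = rng.integers(0, width - crop + 1)
    out = out[top:top + crop, left:left + crop]
    # Subtract the channel means.
    return out - IMAGENET_MEAN_BGR


rng = np.random.default_rng(0)
example = np.full((640, 853, 3), 128.0)  # stand-in for an ISP640-processed image
augmented = augment_partial(example, rng)
print(augmented.shape)  # (480, 480, 3)
```

Passing the random generator explicitly keeps the augmentation reproducible, which is convenient when the same protocol is re-applied before every training cycle.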

The channel dimension $C = 3$ was due to the three available color channels. Corresponding to the loaded training images were the ground-truth binary per-pixel masks, which were loaded as a one-hot encoded four-dimensional tensor $Y_t$, where $Y_t(i, m, l, k) = 1$ if the $(m, l)$ pixel belonged to the $k$th class in the $i$th image and zero otherwise. The required number of classes $K$ was $K = 1$ for the automatic whale locator and for a single-whale classifier, as described later in this paper. The validation $X_v$ and $Y_v$ tensors were constructed in a similar fashion. The per-pixel binary cross-entropy loss function, e.g. p. 231 of [25], was averaged as required and used as the training loss metric. Due to the available Graphics Processing Unit (GPU) memory limits, training was done in batches of only four images. Up to 16 training epochs were allowed per cycle, where one feed-forward and one back-propagation pass through all $N_t$ loaded image-mask pairs was considered to be one epoch. Training for a given cycle was aborted if the validation loss metric did not decrease after two epochs; this is commonly known as early stopping. Note that early stopping was the only place where the validation images were used in training. In order to prevent indirect overfitting to the validation images, they were augmented by TAP480 before each training cycle, similar to the training set.
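The loss and the stopping rule described above can be sketched in plain Python. The sketch makes two assumptions: the loss is a simple mean over all pixels, and "did not decrease after two epochs" is read as two consecutive epochs without a new best validation loss. The function names are hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean per-pixel binary cross-entropy over flat lists of labels and predictions."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def epochs_before_stop(val_losses, max_epochs=16, patience=2):
    """Number of epochs run in one cycle under the assumed early-stopping rule."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


print(binary_cross_entropy([1, 0], [0.5, 0.5]))          # ln(2) for uninformative predictions
print(epochs_before_stop([0.50, 0.40, 0.41, 0.42, 0.30]))  # stops after epoch 4
```

In the second call the validation loss fails to improve in epochs 3 and 4, so the cycle ends before the later improvement is ever seen; this is the usual trade-off of a short patience.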

Figure 2. Example of AMWR per-pixel prediction for the MW1020 individual. The pixels with prediction heat-map values above 0.99 are illustrated by amplifying the corresponding image pixel intensities by a factor of 1.5.
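The highlighting described in the caption can be sketched as below. The clipping to 255 and the 8-bit output type are assumptions, as is the function name; the threshold and the amplification factor are those stated in the caption.

```python
import numpy as np


def highlight_prediction(image, heat_map, threshold=0.99, gain=1.5):
    """Brighten pixels of an H x W x 3 image where the H x W heat-map exceeds threshold."""
    out = image.astype(np.float32)
    selected = heat_map > threshold
    out[selected] *= gain  # amplifies all three color channels of selected pixels
    return np.clip(out, 0, 255).astype(np.uint8)


image = np.full((2, 2, 3), 100, dtype=np.uint8)
heat_map = np.array([[0.995, 0.5], [0.2, 1.0]])
print(highlight_prediction(image, heat_map)[:, :, 0])
```

Such an overlay lets a researcher check at a glance whether the highlighted region coincides with the color patterns and scars used for manual identification.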