Deepfakes Detection Techniques Using Deep Learning: A Survey

Deep learning is an effective and useful technique that has been widely applied in a variety of fields, including computer vision, machine vision, and natural language processing. Deepfakes use deep learning technology to manipulate images and videos of a person so convincingly that humans cannot differentiate them from real ones. In recent years, many studies have been conducted to understand how deepfakes work, and many deep learning-based approaches have been introduced to detect deepfake videos or images. In this paper, we conduct a comprehensive review of deepfake creation and detection technologies using deep learning approaches. In addition, we give a thorough analysis of various technologies and their application in deepfake detection. Our study will be beneficial for researchers in this field, as it covers the recent state-of-the-art methods for discovering deepfake videos or images in social content. In addition, it will facilitate comparison with existing works because of its detailed description of the latest methods and datasets used in this domain.


Introduction
With the technology becoming accessible to any user, many deepfake videos have been spread through social media. Deepfake refers to manipulated digital media, such as images or videos, in which the image or video of a person is replaced with another person's likeness. In fact, deepfakes are one of the increasingly serious issues in modern society. Deepfakes have frequently been used to swap the faces of popular Hollywood celebrities onto pornographic images and videos. Deepfakes have also been used to produce misleading information and rumors about politicians [1] [2] [3]. In

Basics of Artificial Neural Networks (ANNs)
The basic concept of Artificial Neural Networks (ANNs) is partially inspired by how the human brain functions. Figure 1 shows the architecture of an artificial neural network. Neural networks are multi-layer networks that consist of a single input layer, one or more hidden layers, and one output layer. The input to a neural network is a set of input values [12]. The goal of the network is to predict and classify those values into predefined categories.
The first layer in a neural network is the input layer, which takes the input values and passes them to the next layer [13]. In our example, the input values are x_1, x_2, x_3 and x_4. The second layer is the hidden layer, which is a set of connected units called artificial neurons (nodes). The edges that connect the neurons represent how the neurons are interconnected and how they receive and send signals across multiple layers. Each connection has an associated weight, which represents the strength of the connection between two units. In our network, the 1st hidden layer consists of 3 neurons and the 2nd hidden layer contains 4 neurons. Each neuron receives a number of inputs from the previous layer together with a bias value. The bias is an extra input whose value is equal to 1. If a neuron has n inputs, it has n weight values, and its output is given by the learning formula in Equations (1) and (2), with the sigmoid function of Equation (3) as the activation.
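The forward computation described above can be illustrated with a minimal sketch. It assumes the common formulation in which each neuron computes a weighted sum z = w_1 x_1 + ... + w_n x_n + b and outputs sigmoid(z) = 1 / (1 + e^(-z)); the bias is modelled as a learnable term b, which is equivalent to a weight on a constant input of 1. The layer sizes follow the example network (4 inputs, hidden layers of 3 and 4 neurons), while the single output neuron, the random weights, and all function names are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer.

    weights[j][i] is the weight from input i to neuron j; biases[j] is the
    bias term added to neuron j's weighted sum before the activation.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs


def random_layer(n_inputs, n_neurons, rng):
    """Create random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases


def forward(x, layers):
    """Pass the input vector through every layer in order."""
    activation = x
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation


rng = random.Random(0)
# 4 inputs -> 3 hidden neurons -> 4 hidden neurons -> 1 output neuron (assumed).
layers = [random_layer(4, 3, rng), random_layer(3, 4, rng), random_layer(4, 1, rng)]
output = forward([0.5, -1.2, 0.3, 0.8], layers)
print(output)  # a single value strictly between 0 and 1
```

Because the weights here are random rather than trained, the output carries no meaning; the sketch only shows how values flow from the input layer through the hidden layers.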

Deep Learning
Deep learning is a machine learning method based on the same idea as neural networks [13] [14]. In deep learning, the word deep indicates the use of multiple hidden layers in the network. Inspired by artificial neural networks, the deep learning architecture uses an unbounded number of hidden layers of bounded size to extract higher-level information from raw input data. The number of hidden layers is determined based on the complexity of the training data [6]. More complex data requires more hidden layers to effectively produce the correct results. In recent years, deep learning has been used successfully in a variety of areas, including computer vision, audio processing, automatic translation, and natural language processing. Applying deep learning in these fields provides state-of-the-art results compared with traditional machine learning approaches. Deep learning has also shown promising results in deepfake detection. In the literature, several techniques based on deep learning have been proposed, including: 1) convolutional neural networks (CNN); 2) recurrent neural networks (RNN); 3) long short-term memory (LSTM). In the following sections, we briefly describe these techniques and then explain their application to deepfake discovery.

Convolutional Neural Network (CNN)
A convolutional neural network (CNN) is the most commonly used deep neural network model. A CNN, like other neural networks, has an input layer and an output layer, as well as one or more hidden layers. In a CNN [15], the hidden layers first read the inputs from the first layer and then apply a mathematical convolution operation to the input values. Here, convolution indicates a matrix multiplication or other

Deepfake Generation and Detection
Deepfake is a technique that uses generative adversarial networks (GANs) to generate fictitious photographs and videos. In this section, we first give an overview of the current applications and tools for creating deepfake images and videos. Then, we discuss some deep learning detection techniques to overcome this issue.

Deepfake Generation
A. M. Almars Journal of Computer and Communications
Generative adversarial networks (GANs) are a form of deep neural network that has been commonly used to generate deepfakes. One advantage of GANs is that they are capable of learning from a training data set and creating data samples with the same features and characteristics. For example, GANs can be used to swap a "real" image or video of a person with a "fake" one [1]. The architecture of GANs consists of two neural network components: a generator and a discriminator, referred to here as the encoder and the decoder, respectively. First, the encoder is trained on a large data set to create fake data. Then, the decoder is trained to distinguish the fake data from realistic data. However, this model requires a large amount of data (images and videos) to generate realistic-looking faces. Figure 2 shows the GAN architecture. As illustrated in the figure, the encoder first receives random input seeds to generate fake samples. Those fake samples are used to train the decoder. The decoder is simply a binary classifier: it takes the real samples and fake samples as inputs and applies a softmax function to distinguish the realistic data from the fake data.
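The opposing objectives of the two components described above can be sketched as follows. The network that produces fake samples is commonly called the generator and the binary classifier the discriminator, and the function names below follow that convention. The sketch assumes the classifier emits two logits per sample that are turned into probabilities by a softmax, and it uses a standard cross-entropy loss; the class indices and the example logits are hypothetical, and no actual networks are trained.

```python
import math

REAL, FAKE = 0, 1  # hypothetical class indices for the two-way classifier


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def discriminator_loss(real_logits, fake_logits):
    """Average loss for labelling real samples REAL and generated samples FAKE."""
    losses = [cross_entropy(l, REAL) for l in real_logits]
    losses += [cross_entropy(l, FAKE) for l in fake_logits]
    return sum(losses) / len(losses)


def generator_loss(fake_logits):
    """The generator is rewarded when its samples are classified as REAL."""
    return sum(cross_entropy(l, REAL) for l in fake_logits) / len(fake_logits)


# Classifier outputs for one real and one generated sample (made-up numbers).
real_logits = [[4.0, -4.0]]   # confidently "real"
fake_logits = [[-4.0, 4.0]]   # confidently "fake"

d_loss = discriminator_loss(real_logits, fake_logits)
g_loss = generator_loss(fake_logits)
print(d_loss, g_loss)  # small classifier loss, large generator loss
```

When the classifier is confident and correct, its own loss is small while the generator's loss is large; training alternates between the two so that the generated samples become progressively harder to tell apart from real ones.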
Many deepfake applications have already been around for quite a few years.
FakeApp is the first method that has been used widely for deepfake creation.
FakeApp is capable of swapping faces in videos using an autoencoder-decoder pairing structure developed by a Reddit user [20] [21]. Similar to GANs, FakeApp consists of an autoencoder, which is used to construct latent features of human face images, and a decoder, which is used to reconstruct the human face images from those features. This simple technique is powerful, as it is capable of producing extremely realistic fake videos that are hard for people to differentiate from real ones. VGGFace is another popular deepfake technique based on the generative adversarial network (GAN). The architecture of VGGFace [22] is improved by adding two components called adversarial loss and perceptual loss.
These components are added to the autoencoder-decoder to capture latent features of face images, such as eye movements, in order to produce more believable and realistic fake images.
CycleGAN [23] is a deepfake technique that extracts the characteristics of one image and produces another image with the same characteristics via the GAN architecture. This method applies a cycle loss function that enables the model to learn the latent features. Unlike FakeApp, CycleGAN is an unsupervised method that can perform image-to-image conversion without using paired examples. In other words, the model learns the features of a collection of images from the source and the target that do not need to be related to each other.
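The cycle loss mentioned above can be illustrated with a small sketch. It assumes the usual L1 cycle-consistency formulation: an image translated from the source domain to the target domain and back should match the original, and likewise in the opposite direction. The two "generators" below are hypothetical toy functions on short vectors that stand in for trained networks.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(source_batch, target_batch, g_forward, g_backward):
    """L1 cycle loss: x -> G(x) -> F(G(x)) should return to x, and y -> F(y) -> G(F(y)) to y."""
    forward_cycle = sum(
        l1_distance(g_backward(g_forward(x)), x) for x in source_batch
    ) / len(source_batch)
    backward_cycle = sum(
        l1_distance(g_forward(g_backward(y)), y) for y in target_batch
    ) / len(target_batch)
    return forward_cycle + backward_cycle


def to_target(x):
    """G: source domain -> target domain (toy stand-in)."""
    return [2.0 * v + 1.0 for v in x]


def to_source(y):
    """F: target domain -> source domain (exact inverse of G)."""
    return [(v - 1.0) / 2.0 for v in y]


def bad_to_source(y):
    """An imperfect inverse of G, used to show a non-zero loss."""
    return [v / 2.0 for v in y]


sources = [[0.0, 1.0], [2.0, 3.0]]
targets = [[5.0, 7.0]]  # unpaired: not matched to any item in `sources`

perfect = cycle_consistency_loss(sources, targets, to_target, to_source)
imperfect = cycle_consistency_loss(sources, targets, to_target, bad_to_source)
print(perfect, imperfect)  # 0.0 1.5
```

Note that the loss never compares a source image with a target image directly, which is why the two collections do not need to be paired.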

Deepfake Detection
Deep learning has achieved great success in deepfake detection. In the subsections below, we first discuss image detection models that use deep learning technologies and then present video detection models.

Image Detection Models
Different methods have been proposed to detect GAN-generated images using deep networks. Tariq et al. [24] suggested neural network-based methods for detecting fake GAN videos. This method employs pre-processing techniques to analyze the statistical features of an image and enhances the detection of fake face images created by humans [25]. Nhu et al. [26] also introduced another approach. In addition to the traditional deepfake detection models, hybrid approaches were introduced to effectively detect fake images [28] [29] [30]. Zhou et al. [29], for example, proposed a two-stream network for detecting face tampering (see Figure 3). The face classification stream is based on GoogLeNet [31] and is trained on tampered and authentic images. Then, the patch triplet stream is used to analyze features using a steganalysis feature extractor and captures low show that this approach can learn both fake and real images. Another hybrid approach uses pairwise learning for deepfake image detection [30]. The approach first uses GANs to generate fake images.
Then, on top of the common fake feature network (CFFN), a pairwise-learning model is used to capture the discriminative information between the fake images generated by GANs and the real images. The evaluation results show that this approach can overcome the shortcomings of the existing state-of-the-art fake image detectors.
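One way to picture the pairwise-learning idea above is a contrastive loss on pairs of feature vectors. This is a sketch under stated assumptions rather than the exact objective of [30]: it assumes a standard margin-based contrastive loss, in which pairs with the same label (both real or both fake) are pulled together and pairs with different labels are pushed at least a margin apart. The feature vectors, the margin value, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(feat_a, feat_b, same_class, margin=1.0):
    """Margin-based contrastive loss for one pair of feature vectors.

    same_class is True when both images are real or both are fake.
    Similar pairs are penalised by their squared distance; dissimilar pairs
    are penalised only when they are closer than `margin`.
    """
    d = euclidean(feat_a, feat_b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical features produced by a feature network for three images.
real_1 = [0.0, 0.0]
real_2 = [3.0, 4.0]
fake_1 = [0.0, 0.0]

print(contrastive_loss(real_1, real_2, same_class=True))   # 12.5: same label but far apart
print(contrastive_loss(real_2, fake_1, same_class=False))  # 0.0: already beyond the margin
print(contrastive_loss(real_1, fake_1, same_class=False))  # 0.5: different labels, identical features
```

Minimising such a loss over many real/fake pairs encourages the feature network to place real and fake images in separable regions, after which a simple classifier can operate on the learned features.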

Video Detection Models
In recent years, deep learning methods have been successfully applied to fake image detection. However, the current deep learning methods for images cannot be directly applied to fake video detection because of the significant loss of frame information after video compression [32] [33]. In the subsections below, we divide the related work on deepfake video detection into two main categories: biological signals analysis and spatial and temporal features analysis.
the similarity and identify the fake video and the real one. This approach was tested on two deepfake identification benchmark datasets, DeepfakeTIMIT [42] and DFDC [43]. The models yield an accuracy of 96.6 percent on the DF-TIMIT dataset and 84.4 percent on the DFDC dataset, respectively.

2) Spatial and Temporal Features Analysis
Most current deepfake detection methods use only single video frames [44].
In fact, video manipulation can be carried out on multiple frame-level features.
Recently, many studies have shown that analyzing the temporal sequence between frames can help to discriminate real videos from fake ones.
In [8], the authors introduced a temporally-aware model to detect deepfake videos. The model first employs a convolutional neural network (CNN) for frame feature extraction. Afterwards, these features are passed to an LSTM layer to analyze the temporal sequence for face manipulation between frames. Finally, a softmax function is used to classify the video as either real or fake. Building on Cycle-GAN [45], Bansal et al. [46] introduced a new approach called Recycle-GAN, which uses conditional generative adversarial networks to merge spatial and temporal data. The evaluation results show that combining spatial and temporal constraints can produce an effective output. Furthermore, Sabir et al. [47] proposed a new approach based on a recurrent convolutional network. The approach consists of two stages: a face processing stage followed by face manipulation detection. In the processing stage, face cropping and alignment are performed using a Spatial Transformer Network (STN). Then, the output of the previous stage is passed to the recurrent convolutional network for face manipulation detection, where the temporal information across frames is analyzed (see Figure 5). The approach is evaluated on the publicly available dataset FaceForensics++ [48]. The results show state-of-the-art performance compared with the existing models.
Figure 5. The proposed method is a two-step process. The first step is for face detection, cropping and alignment. The second step is for manipulation detection.
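The temporally-aware pipeline of [8] described above (per-frame features, an LSTM over the frame sequence, and a softmax over two classes) can be sketched in a few lines. This is an illustrative toy and not the authors' implementation: the CNN is replaced by a hypothetical hand-written feature function, the LSTM cell omits bias terms, and all weights are random and untrained.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def frame_features(frame):
    """Hypothetical stand-in for a CNN: mean and range of the pixel values."""
    return [sum(frame) / len(frame), max(frame) - min(frame)]


class TinyLSTM:
    """Single LSTM cell with input, forget and output gates (biases omitted)."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One weight matrix per gate, applied to the concatenation [x, h_prev].
        self.w = {g: random_matrix(n_hidden, n_in + n_hidden, rng) for g in "ifoc"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation of input and previous hidden state
        i = [sigmoid(v) for v in matvec(self.w["i"], xh)]
        f = [sigmoid(v) for v in matvec(self.w["f"], xh)]
        o = [sigmoid(v) for v in matvec(self.w["o"], xh)]
        g = [math.tanh(v) for v in matvec(self.w["c"], xh)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def classify_video(frames, lstm, w_out):
    """Return [P(real), P(fake)] for a sequence of frames."""
    h = [0.0] * lstm.n_hidden
    c = [0.0] * lstm.n_hidden
    for frame in frames:
        h, c = lstm.step(frame_features(frame), h, c)
    return softmax(matvec(w_out, h))


rng = random.Random(0)
lstm = TinyLSTM(n_in=2, n_hidden=4, rng=rng)
w_out = random_matrix(2, 4, rng)
video = [[0.1, 0.4, 0.3], [0.2, 0.5, 0.1], [0.9, 0.8, 0.7]]  # three toy "frames"
probs = classify_video(video, lstm, w_out)
print(probs)  # two untrained probabilities that sum to 1
```

The point of the design is that the final hidden state summarises the whole frame sequence, so inconsistencies between frames can influence the decision in a way that a single-frame classifier cannot capture.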
The FFHQ dataset contains a collection of 70,000 high-resolution face images and was originally created as a benchmark for generative adversarial networks (GANs). The images were collected from the Flickr platform and include faces with a variety of accessories such as eyeglasses, sunglasses, and hats. According to the authors, a pre-processing step was applied to the dataset to prune the set and remove noise from the photos.

100K-Faces
100K-Faces [50] is a well-known publicly available dataset that includes 100,000 unique human face images generated using StyleGAN [49]. StyleGAN was applied to a large dataset consisting of more than 29,000 images gathered from 69 different models, generating photos with a flat background.

CASIA-WebFace
Dong et al. [52] presented a database called CASIA-WebFace that includes about Then the photos of those celebrities are extracted using clustering methods.

VGGFace2
A large-scale face dataset called VGGFace2 was introduced by Cao et al. [53].
This database contains over three million face photographs of over nine thousand different subjects, with an average of more than 300 images per subject. The images were gathered from the Google search engine and cover a wide range of attributes such as ethnicity, illumination, age, and occupation (e.g., actors, athletes, and politicians).

The Eye-Blinking Dataset
The currently available datasets were not designed for eye-blinking detection. Li et al. [34] released an eye-blinking dataset specially designed for this purpose. This dataset consists of 50 interviews and videos for each person, each lasting approximately thirty seconds and containing at least one eye blink. Using their own tools, the authors then tagged the left and right eye states for each video clip. The details and description of the dataset are available at http://www.cs.albany.edu/%E2%88%BClsw/downloads.html .

DeepfakeTIMIT
DeepfakeTIMIT is a dataset of videos released by Korshunov et al. [42]. The database contains a collection of face-swapped videos generated using a GAN-based approach. The dataset was produced with a lower quality model with 64 × 64 input/output size and a higher quality model with 128 × 128 input/output size.
Each non-real video collection contains 32 subjects. The authors created ten fake videos for each subject.

Challenges and Open Issues
The massive availability of applications and tools that create deepfake images and videos leads to large numbers of deepfake images and videos being generated every

Conclusion and Future Directions
Deepfakes have become popular due to the massive availability of images and videos in social content. This is particularly important nowadays because the tools for making deepfakes are becoming more accessible, and social media sites easily allow people to distribute and share such fake content. Deep learning methods have received a lot of interest in a variety of areas. Recently, various deep learning-based methods have been proposed to address this issue and successfully detect fake images and videos. In this paper, we first discussed the current applications and tools that have been widely used to create fake images and videos. Then, we reviewed current deepfake detection methods and divided them into two major groups: image detection techniques and video detection techniques. We provided a detailed description of the current deepfake methods in terms of architecture, tools, and performance. We also highlighted the publicly accessible datasets used by the research community, categorizing them by dataset type, source, and method. Finally, we discussed the current challenges and provided insights into future research on deepfake detection using deep learning.
Although deep learning has shown remarkable performance in deepfake detection, the quality of deepfakes has been increasing. Hence, the current deep learning methods need to improve as well in order to successfully identify fake videos and images. In addition, for the current deep learning methods, there is no clear way to determine the number of layers needed or which architecture is appropriate for deepfake detection. Another area of investigation is the integration of deepfake detection methods into social media platforms in order to improve their effectiveness in coping with the pervasive effects of deepfakes and to reduce their impact.