Carving Thumbnail/s and Embedded JPEG Files Using Image Pattern Matching

Images (typically JPEG) are used as evidence against cyber perpetrators. Typically, such files are carved using standard patterns. Many carvers concentrate on JPEG files and overlook the importance of thumbnails in assisting forensic investigation. In this work, unique hex patterns (UHP) are used to detect thumbnails and embedded JPEG files. This paper introduces a tool called PattrecCarv, which uses UHP to automatically recognize and carve thumbnails and embedded JPEG files, and evaluates it on the DFRWS 2006 and DFRWS 2007 datasets. The tool successfully recovers 11.5% more thumbnails and embedded JPEG files than PredClus.


Introduction
File carving is a method of recovering files, known as cyber evidence, from a disk collected at a crime scene [1][2][3]. Year by year, with the increasing use of computers and other digital devices, file carving techniques have evolved drastically. In carving JPEG images, the focus is on carving fragmented files. Although researchers have addressed fragmentation, for example in [2][3][4], the problem is not yet completely solved.
This paper introduces a way to reduce the number of JPEG files that must be processed for fragmentation point detection. A JPEG image can be embedded into other file types such as doc and ppt. Furthermore, thumbnails can be embedded into a JPEG image itself to ease the recovery and organization of the original image. A JPEG image can contain no thumbnail, a single thumbnail or two thumbnails. A thumbnail, which is a reduced version of an image, carries features similar to those of the original. Such a thumbnail is often mistaken for another JPEG image. Therefore, knowing that a thumbnail exists helps investigators to set aside JPEG files with thumbnails and concentrate on the points where real fragmentation occurs. They can then identify which JPEG image is fragmented with other JPEG images. In this way, they can ascertain that the fragments belong to two or more different files, not to a file within a file. This is important because, during reassembly, if a thumbnail is mistakenly identified as another JPEG file, the original file may be corrupted by missing fragments. Consequently, this also accelerates reassembly by allowing investigators to concentrate on real fragmentation.
In this paper, a novel algorithm called PattrecCarv is proposed. PattrecCarv is developed to recover thumbnails and embedded files from the DFRWS 2006 and 2007 datasets using unique hex patterns (UHP). Its output can be used as pre-processing data to simplify the recovery of JPEG images.
The rest of the paper is organized as follows. Section II covers related work, consisting of an overview of the JPEG standard and of thumbnails and embedded JPEG files, while Section III briefly describes the PredClus algorithm. Section IV describes the proposed PattrecCarv algorithm. Section V describes the experiments. Section VI presents the results and discussion. Finally, Section VII concludes the paper.

An Overview of JPEG Standard
Computer forensics aims to recover evidence that resides on a computer, for example to help solve pornography cases [1][2][3]. This involves image files obtained from the perpetrator in formats such as Bitmap and JPEG, of which JPEG is the most common. JPEG is popular because its compression reduces the space required to store an image. The Joint Photographic Experts Group (JPEG) was formed by the International Telegraph and Telephone Consultative Committee in 1986, inspired by an effort of the International Organization for Standardization (ISO) to find ways to use high-resolution graphics and pictures in computers [4]. JPEG introduced a compression standard for both grayscale and color continuous-tone images. The details of the JPEG compressed data formats can be found in [5]. Two types of JPEG are mostly used today: JPEG File Interchange Format (JFIF) and Exchangeable Image File Format (Exif) [6,7]. JFIF is popular for Internet files, while Exif is the popular image file format for digital cameras [8].
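The distinction between JFIF and Exif files can be illustrated with a minimal sketch. It assumes the conventional header layout in which a JFIF file places an APP0 segment (0xFFE0) with the identifier "JFIF" directly after SOI, and an Exif file places an APP1 segment (0xFFE1) with the identifier "Exif" there. The function name jpeg_flavor is hypothetical and is not part of PattrecCarv.

```python
def jpeg_flavor(data: bytes) -> str:
    """Classify a byte string as 'JFIF', 'Exif', 'unknown' or 'not a JPEG'.

    Illustrative helper only; it inspects just the first segment after SOI.
    """
    # Every JPEG stream starts with the SOI marker 0xFFD8.
    if not data.startswith(b"\xff\xd8"):
        return "not a JPEG"
    marker = data[2:4]   # first marker after SOI
    # Bytes 4-5 hold the segment length; the identifier starts at byte 6.
    ident = data[6:12]
    if marker == b"\xff\xe0" and ident.startswith(b"JFIF\x00"):
        return "JFIF"
    if marker == b"\xff\xe1" and ident.startswith(b"Exif\x00\x00"):
        return "Exif"
    return "unknown"


if __name__ == "__main__":
    print(jpeg_flavor(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01"))
    print(jpeg_flavor(b"\xff\xd8\xff\xe1\x00\x20Exif\x00\x00MM"))
```

Real files may carry other segments first, or both APP0 and APP1, so a carver would normally walk the segment chain rather than rely on this single check.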

Thumbnail and Embedded JPEG Files
Both the JFIF and Exif formats allow thumbnails to be embedded into a JPEG file. A JPEG image with a complete SOI/EOI pair can be embedded into an original JPEG image to ease the recovery and organization of the original image. This embedded image is known as a thumbnail. Thumbnails are reduced-size versions of images that can be used to recover and organize pictures [9], while embedded JPEG files are original JPEG files that are embedded in other types of files such as PowerPoint, Word and Excel documents. Thumbnails are used to speed up image search or page loading on the Internet, and are also used in image organizing programs. Thumbnails are supported on most modern operating systems and desktop environments such as Microsoft Windows, Mac OS X, KDE and GNOME [10]. A JPEG image can contain no thumbnail, a single thumbnail or two thumbnails. Therefore, a JPEG image can have several SOI/EOI pairs [11]. Mohamad in [12] and [13] asserted that thumbnails can serve as a means of recognizing corrupted images because their small size gives them a better chance of full recovery without corruption [14]. A thumbnail carries features similar to those of the original. Hence, using thumbnails, crime investigators can identify which images or pictures have the potential to be used as evidence against cyber perpetrators.
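The observation that one JPEG image can hold several SOI/EOI pairs can be made concrete with a small sketch. It pairs each EOI marker (0xFFD9) with the most recent unmatched SOI marker (0xFFD8) and reports the nesting depth, so that a pair at depth 1 lies inside another image. This is only an illustration of nesting under simplifying assumptions, not the PattrecCarv algorithm; the function name soi_eoi_pairs is hypothetical, and the byte pairs are matched naively without parsing segment lengths.

```python
SOI = b"\xff\xd8"  # start of image
EOI = b"\xff\xd9"  # end of image


def soi_eoi_pairs(data: bytes):
    """Return sorted (soi_offset, eoi_offset, depth) tuples.

    depth 0 is an outermost image; depth >= 1 is nested inside another
    SOI/EOI pair (for example, a thumbnail inside its parent image).
    """
    pairs = []
    open_sois = []  # stack of offsets of SOI markers not yet closed
    i = 0
    while i < len(data) - 1:
        two = data[i:i + 2]
        if two == SOI:
            open_sois.append(i)
            i += 2
        elif two == EOI and open_sois:
            start = open_sois.pop()
            # After popping, the stack size equals the nesting depth.
            pairs.append((start, i, len(open_sois)))
            i += 2
        else:
            i += 1
    return sorted(pairs)


if __name__ == "__main__":
    sample = SOI + b"AA" + SOI + b"BB" + EOI + b"CC" + EOI
    for start, end, depth in soi_eoi_pairs(sample):
        print(start, end, depth)
```

On fragmented data an SOI may never see its EOI, which is one reason a structural check of this kind cannot replace fragmentation point detection.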
Guo in [9] proposed thumbnails as a means of recovering JPEG images from fragmented data. In brief, thumbnails serve multiple roles. Besides contributing to the recovery and organization of JPEG files, thumbnails help in recognizing corrupted images, and information about a thumbnail's location can be used in carving fragmented JPEG images to recover the original files. Abdullah et al. [15] proposed PredClus as a method to recognize thumbnails and embedded JPEG files. However, PredClus, which uses the cluster size to determine the location of a thumbnail or embedded file, may miss thumbnails that reside at the start of a cluster. This situation occurs when a JPEG image with thumbnails requires more than one cluster to store its data. Sometimes, the start of a thumbnail falls at the start of the second cluster. In this situation, the thumbnail is ignored by PredClus. Hence, an alternative technique for distinguishing a thumbnail or embedded JPEG file from the original is pattern matching. In carving JPEG images, especially fragmented JPEG files, preparing evidence is easier if the carver can distinguish between original images, thumbnails and embedded images.

PredClus Algorithm
PredClus is developed to automatically determine the cluster size of a dataset. Using this information, JPEG images that are not located at the starting address of any cluster are marked as thumbnails or embedded JPEG files. The PredClus algorithm is used to predict the cluster size of both the DFRWS 2006 and 2007 datasets.
First, data from the dataset is read. These data are hex values. The hex values are then matched against the standard JPEG header. In this experiment, however, additional markers are used rather than the standard JPEG header, 0xFFD8, alone. The additional markers are 0xFFE0, 0xFFE1, 0xFFE2, 0xFFC4 and 0xFFDB. When a match is found, the offset of each matched marker is retrieved. Using the formula mentioned earlier, the determinant value is calculated. If the determinant value is 0, the file found is counted. This is done for each candidate cluster size: 512 bytes, 1 KB, 2 KB, 4 KB and 8 KB. Please refer to [15] for a detailed explanation of PredClus.
After the determinant value has been extracted for all JPEG files in the datasets, the files found for each cluster size are summed. The percentage for each cluster size is calculated. Then, a report is produced.
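The cluster-size prediction described above can be sketched as follows. The formula for the determinant value is not reproduced in this paper, so the sketch assumes it is the header offset modulo the candidate cluster size, which is 0 exactly when the header sits on a cluster boundary; see [15] for the actual definition. It also assumes that a header match means SOI immediately followed by one of the additional markers. The names header_offsets and aligned_percentages are hypothetical.

```python
SOI = b"\xff\xd8"
# Additional markers listed in the text; assumed here to directly follow SOI.
FOLLOW = {b"\xff\xe0", b"\xff\xe1", b"\xff\xe2", b"\xff\xc4", b"\xff\xdb"}
CLUSTER_SIZES = [512, 1024, 2048, 4096, 8192]  # bytes


def header_offsets(data):
    """Return offsets of every SOI that is followed by a known marker."""
    offsets = []
    pos = data.find(SOI)
    while pos != -1:
        if bytes(data[pos + 2:pos + 4]) in FOLLOW:
            offsets.append(pos)
        pos = data.find(SOI, pos + 1)
    return offsets


def aligned_percentages(offsets):
    """For each candidate cluster size, return the percentage of headers
    whose assumed determinant (offset % size) equals 0."""
    result = {}
    for size in CLUSTER_SIZES:
        aligned = sum(1 for off in offsets if off % size == 0)
        result[size] = 100.0 * aligned / len(offsets) if offsets else 0.0
    return result


if __name__ == "__main__":
    image = bytearray(16384)
    image[0:4] = SOI + b"\xff\xe0"        # header on a cluster boundary
    image[4096:4100] = SOI + b"\xff\xdb"  # header on a 4 KB boundary
    image[4196:4200] = SOI + b"\xff\xc4"  # header inside a cluster
    for size, pct in aligned_percentages(header_offsets(bytes(image))).items():
        print(size, round(pct, 1))
```

Under this assumption, headers whose offset is not a multiple of the predicted cluster size are the candidates for thumbnails or embedded files, and a thumbnail that happens to start on a cluster boundary is indistinguishable from an original file.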

PattrecCarv Algorithm
This section discusses the development of the proposed algorithm, called PattrecCarv. The algorithm is adapted from the dual-byte-marker algorithm proposed in [1] to detect JPEG headers (SOI), thumbnails and embedded JPEG files. The DFRWS 2006 and 2007 datasets are used to test this algorithm. Nevertheless, the algorithm can also work with other datasets.
A thumbnail in JFIF format can be recognized using the UHP in Table 1, while a thumbnail in Exif format can be recognized using the UHP in Table 2. Embedded JPEG files, on the other hand, can be recognized using the UHP in Table 3.   2), then thumbnail is found.
Repeat STEP 1, STEP 2 and STEP 3 until the end of the data.
The algorithm of PattrecCarv is illustrated in Figure 1.

Experimentation
This section discusses the experiment designed for the ThumbedCarv model, as illustrated in Figure 2. The details of the PredClus algorithm are discussed in [15]. PattrecCarv consists of functions to carve thumbnails and embedded JPEG files using a pattern recognition technique. To carve thumbnails, three sets of validated markers, shown in Table 1 and Table 2, are used, while the validated markers for carving embedded JPEG files are shown in Table 3. The algorithm starts by reading the dataset. Once a JPEG SOI marker is found, the next two bytes are checked for an APP0 marker. If APP0 is matched, the algorithm reads the 9th hex value that follows. If the next two-byte hex value matches the embedded JPEG file markers, the current offset is recorded. If not, the next one-byte hex value is read. Once JFIF thumbnail markers are detected, the offset of the thumbnail is recorded. If no APP0 marker is found after the SOI marker, the algorithm checks the next two-byte hex value. If it matches the validated markers in Table 1, a thumbnail is detected. Finally, a report of the thumbnails and embedded JPEG files is generated.
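The scanning and reporting loop described above can be sketched in simplified form. Tables 1 to 3 are not reproduced here, so the pattern table below is a hypothetical stand-in that uses only the marker pairs named elsewhere in the text (SOI followed by APP0, APP1, DQT 0xFFDB or DHT 0xFFC4), and the 9th-hex-value check for embedded files is omitted. The sketch only groups SOI hits by the two bytes that follow them and builds a report; it does not by itself decide whether a hit is a thumbnail, an embedded file or an original image.

```python
from collections import defaultdict

SOI = b"\xff\xd8"
# Hypothetical stand-in for the UHP tables; labels are illustrative only.
PATTERNS = {
    b"\xff\xe0": "SOI+APP0 (JFIF header)",
    b"\xff\xe1": "SOI+APP1 (Exif header)",
    b"\xff\xdb": "SOI+DQT (0xFFD8 and 0xFFDB)",
    b"\xff\xc4": "SOI+DHT (0xFFD8 and 0xFFC4)",
}


def scan_patterns(data):
    """Return {label: [offsets]} for every SOI followed by a known marker."""
    hits = defaultdict(list)
    pos = data.find(SOI)
    while pos != -1:
        label = PATTERNS.get(bytes(data[pos + 2:pos + 4]))
        if label is not None:
            hits[label].append(pos)  # record the offset of the match
        pos = data.find(SOI, pos + 1)
    return dict(hits)


def report(hits):
    """Format the grouped hits as a plain-text report with a total count."""
    lines = [f"{label}: {len(offs)} at {offs}"
             for label, offs in sorted(hits.items())]
    lines.append(f"total: {sum(len(offs) for offs in hits.values())}")
    return "\n".join(lines)


if __name__ == "__main__":
    blob = (b"\x00" * 10 + SOI + b"\xff\xe0" + b"\x00" * 5
            + SOI + b"\xff\xdb" + b"xx" + SOI + b"\xff\xdb")
    print(report(scan_patterns(blob)))
```

Grouping by pattern mirrors the way the results are broken down by UHP in the discussion; a fuller implementation would add the lookups from Tables 1 to 3 at the point where the label is chosen.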

Result and Discussion
The PattrecCarv output is shown in Figure 3 and Figure 4. Figure 3 shows that the total number of thumbnails with JFIF headers detected in the DFRWS 2006 dataset is 6, with no embedded JPEG files. There are also no thumbnails using UHP 0xFFD8 ... Figure 4 shows that the total number of thumbnails and embedded JPEG files detected is 33, which corresponds to 1 thumbnail with the JFIF header, 18 thumbnails recognized using the UHP of 0xFFD8 and 0xFFDB, 2 thumbnails detected using the UHP of 0xFFD8 and 0xFFC4, and 12 embedded JPEG files.
Table 4 compares the PredClus and PattrecCarv algorithms. The table shows no distinct difference in execution time, but there are some interesting findings in terms of thumbnails or embedded files found, original files detected as thumbnails, thumbnails or embedded JPEG files missed, and false detections. Clearly, PattrecCarv detects more thumbnails and embedded files than PredClus, though it falsely detects 4 files as thumbnails. PredClus uses cluster size information to determine whether a detected file is a thumbnail or embedded file, or an original file. That is the reason it did not make any mistake in determining thumbnails or embedded files. However, PredClus can miss some thumbnails. This is caused by a big JPEG file that pushes the header of a thumbnail to the start of a cluster. Furthermore, PredClus cannot differentiate between thumbnails and embedded files because it has no information about the detected file; only the cluster size is known. Although PattrecCarv has falsely detected 5 files, it does not miss any thumbnails or embedded files, and it separates thumbnails from embedded files. The original files detected as thumbnails are caused by fragmented data. An experiment was conducted manually to investigate the cause of this condition. It was found that all original files detected as thumbnails are fragmented with other JPEG files or other files.

Conclusion
A JPEG file can take the form of an original JPEG file, a thumbnail or a file embedded in another file. The importance of thumbnails and embedded files should not be overlooked in forensic investigation. This paper introduces a unique file pattern matching technique, which is implemented in a tool called PattrecCarv. PredClus assumes that every image starting at the first byte of a cluster is an original image, and so may mistakenly detect a thumbnail as an original file, whereas PattrecCarv uses the unique file pattern matching technique. Based on experiments using the DFRWS 2006 and 2007 datasets, PattrecCarv successfully carves thumbnails and embedded JPEG files more efficiently than PredClus.

Figure 2.
Both algorithms, PredClus and PattrecCarv, are installed into this model and their results are compared. Both algorithms are developed in C++ on Windows 7 with an Intel® Core™2 Quad CPU and 2 GB of physical memory. The input to this model comes from two datasets, DFRWS 2006 and DFRWS 2007. The two algorithms (PredClus and PattrecCarv) are compared based on the number of JPEG thumbnails and embedded JPEG files successfully recovered.

Figure 1. Algorithm used in PattrecCarv for carving thumbnails and embedded JPEG files.

Table 1)
 If the two-byte structure read is an SOI marker, then jump to the 9th hex value.
STEP 2: Once SOI is found, locate the embedded UHP (refer to Table 3)
 If the 9th hex value is the embedded UHP, then em-