Creating a Dataset to Boost Civil Engineering Deep Learning Research and Application

Cutting-edge breakthroughs in deep learning are stimulating innovation in many fields, including civil engineering. However, a fundamental issue facing the civil engineering research community is the lack of a publicly available, free, quality-controlled, human-annotated large dataset that supports and drives deep learning research and applications in areas such as intelligent transportation (including connected vehicles), structural health monitoring, and bridge inspection. This paper is a general discussion of the pressing need for, and the construction of, a long-anticipated dataset that provides critical training, testing and benchmarking data for researchers and engineers in civil engineering and beyond. Establishing such a free dataset will remove a major hurdle and boost deep learning research in civil engineering, and we hope this work will urge researchers, engineers, government agencies and even computer scientists to work together to start building such datasets. A framework has been developed for the proposed database. In addition, pilot study databases were developed for concrete crack detection, pavement crack detection using normal and infrared thermography images, and pedestrian and bicyclist detection. A convolutional neural network model, Faster R-CNN, was deployed to check detection accuracy, and a detection accuracy of 98% was obtained on the proposed datasets.


Introduction
With the breakthrough of deep learning, driven by advances in hardware such as GPUs and the Google Cloud TPU chip [1], the availability of large datasets such as ImageNet [2] and of benchmarks, and algorithmic improvements such as better activation functions, weight-initialization schemes and optimization schemes [3], numerous innovations have been unveiled and researchers from various communities are excited. Civil engineering researchers have applied this technology to damage detection in structures [4] [5], concrete cracks [6] [7], bridge structural components [8], pavements [9], tunnels [10], transmission towers [11] and roofs [12]. Some of the most significant commercial products built on deep learning are Apple HomePod®, Amazon Echo®, and Google Home®.
One major force driving advances in computer vision, machine learning and deep learning is the availability of large, high-quality public datasets, which stand behind the various data science and artificial intelligence competitions hosted by the Kaggle platform. One example of such a competition is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which is hosted annually to challenge computer algorithms for object localization/detection in images and videos [13].
Data has played an irreplaceable role in the recent boom in artificial intelligence (AI). The foundation of deep learning research and application is the dataset used to train and test deep neural networks. In the computer vision and machine learning (ML) research community, there are openly available datasets such as ImageNet [2], which supported the pioneering deep learning publication AlexNet [14] and spawned the current AI boom; MS COCO [15] for object detection, segmentation and captioning; CIFAR-10 [16], a labelled subset of the 80 million tiny images that support ML [17]; and MNIST [18] for handwritten digits. Building such large-scale datasets took tremendous effort, time and investment. For example, MS COCO researchers used more than 70,000 worker hours [15] to label and annotate millions of images and object instances. Data is so important to AI and deep learning that, in June 2018, the President's most senior technology advisor was reported to have said that the White House may consider releasing some government data to push AI research [19].
However, these open datasets are of limited value to the civil engineering research community. Most research papers in civil engineering use proprietary datasets, which are generally quite small compared to the general-purpose datasets above. Because the success of deep learning owes much to the availability of large-scale labeled data [20], work that relies only on small datasets is at a disadvantage. Sun et al. [20] concluded that performance on vision tasks increases logarithmically with the size of the training dataset. If research data could be widely shared, a larger dataset could be created with less effort. But researchers are reluctant to share their data due to legal issues and many other barriers. Reference [22] proposed a few simple techniques to address small datasets in image classification problems, including data augmentation and transfer learning [23]. Nevertheless, transfer learning only works for deep learning when the features learned from the large pre-training dataset are general.
In this sense, using, for example, a ConvNet pre-trained on ImageNet for transfer learning on concrete crack and damage detection is still questionable, and further investigation is needed to properly evaluate the feasibility of this technique for civil engineering applications.
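As a minimal illustration of the data augmentation idea mentioned above, the following sketch generates flipped and rotated copies of a small grayscale image stored as nested lists. The function names and the choice of transformations are assumptions made for illustration only; they are not taken from [22] or from the pilot study.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel intensities


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the original image plus three transformed copies."""
    return [img, flip_horizontal(img), flip_vertical(img), rotate_90_clockwise(img)]


# Hypothetical 2x2 "image" used only to show the call pattern.
tiny = [[1, 2], [3, 4]]
variants = augment(tiny)
print(len(variants), "training samples derived from one image")
```

Such flips and rotations keep a crack/no-crack class label unchanged. For object detection, the bounding boxes would have to be transformed together with the pixels, which this sketch does not do.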
The rest of the paper discusses the application of AI in civil engineering research and the importance of a good database in making AI techniques more effective for users. The following sections propose a dataset structure and discuss how it can overcome the limitations of traditional databases. Finally, several pilot databases are developed and tested using convolutional neural network (CNN) models.

Civil Engineering Research and Application Needs
Modern computer technologies are making civil engineering, one of the oldest engineering disciplines, smarter and more intelligent. Cities and governments worldwide are launching smart city [24] initiatives. Intelligent infrastructure can be one of the key characteristics supporting these initiatives. It encompasses intelligent transportation [25] [26], smart buildings and structures [27] [28], smart bridges and tunnels [29], smart pavement monitoring systems [30], and more.
The USDOT's Intelligent Transportation Systems (ITS) Strategic Plan 2015-2019 outlined the goals of "Realizing Connected Vehicle (CV) Implementation" and "Advancing Automation" as the primary technological drivers of ITS [31]. Connected vehicle applications build an interoperable wireless communication network through dedicated short-range communications (DSRC) [32], or possibly in the near future over the 5G telecom networks now being deployed. This network connects vehicles, infrastructure (such as traffic lights), and wireless devices (such as cell phones) to prevent vehicle crashes, improve mobility by reducing delay and congestion, and benefit the environment by cutting emissions. In September 2016, the USDOT selected New York City, Tampa-Hillsborough and Wyoming to launch its Connected Vehicle Pilot Deployment Program, which serves as an initial effort to deploy and test cutting-edge CV technology [33]. Connected vehicle research is still in its early stage and demands sizable datasets that can serve as benchmarking data for developing critical algorithms. Valuable, publicly available, carefully selected and well-organized data will always be appreciated, as it helps advance fundamental knowledge and provides strong and timely support to connected vehicle research.
The allocation of Federal transportation funds and the management and planning of transportation infrastructure require traffic monitoring, vehicle counts, and classification. Very affordable video devices and state-of-the-art deep neural network based algorithms such as YOLO [34] and Faster R-CNN [35] can provide a simple, manageable, cost-effective, real-time solution to any type of traffic counting and classification problem. A large dataset with images of all types of motorized and nonmotorized vehicles, pedestrians and bicyclists will help ensure a well-trained deep neural network for detecting, classifying and counting traffic.
Structural health monitoring [36] of buildings, bridges, tunnels, and other civil infrastructure provides a real-time, preventative strategy to identify and monitor potential damage to a structure. The application of wireless sensor networks and wireless smart sensors [37] is the recent trend and the future of civil infrastructure health monitoring. A trained deep neural network can be a strong candidate for automatically processing the collected ambient vibration, wind, strain and displacement data for structural damage detection and condition assessment. However, training a deep neural network calls for a large, reliable dataset.
Other potential applications of deep neural networks include health monitoring of concrete structures in dams and nuclear power plants, which may safeguard the welfare and lives of hundreds of thousands of citizens.
Another critical civil engineering need is bridge inspection automation. With more than 56,000 (9.1%) structurally deficient bridges in the US, more and more bridges may require inspection intervals even shorter than the basic 2-year requirement, which means more inspection and maintenance effort, higher costs and more dangerous work. With affordable Unmanned Aerial Vehicles (UAVs) and deep learning technology, the trend is to partially or completely replace visual inspection. Many civil engineering researchers, including the authors of this article, and computer scientists are working enthusiastically towards the automation of bridge inspections. Recently we proposed a framework coupling UAVs and deep learning for civil infrastructure condition assessment.
One of the major challenges the civil engineering community faces in applying deep learning to bridge inspection is the shortage of image datasets that represent all the bridge components to be inspected [38].
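To make the counting and classification step concrete, the sketch below tallies detections per class for a single video frame. It assumes a detector (for example YOLO or Faster R-CNN) has already produced (label, score) pairs; the `Detection` type, the 0.5 score threshold and the sample values are hypothetical and serve only as an illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Detection = Tuple[str, float]  # (class label, confidence score), assumed detector output


def count_by_class(detections: Iterable[Detection], min_score: float = 0.5) -> Dict[str, int]:
    """Count detections per class, ignoring those below the score threshold."""
    counts = Counter(label for label, score in detections if score >= min_score)
    return dict(counts)


# Hypothetical detections for one frame.
frame = [("car", 0.97), ("car", 0.88), ("bicycle", 0.91), ("pedestrian", 0.42)]
print(count_by_class(frame))
```

Counting over a whole video would additionally require tracking objects across frames so that the same vehicle is not counted more than once; that step is outside the scope of this sketch.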
Deep learning has also been attempted to solve time-series based real-world ap-

The Proposed Dataset
This section presents how the proposed dataset differs from existing ones, what it should include, how it can be organized, the data collection methods, and finally the tools to build such a dataset.

Difference from Existing Datasets
The image datasets ImageNet provided by FHWA National Highway Institute in order to produce quality labels for images to be used for bridge inspection. In addition, the proposed dataset relies not only on the internet and contributions from researchers, but also on government agencies. Even though the government open data site data.gov offers a tremendous amount of data, finding the right data for state-of-the-art deep learning research in civil engineering remains difficult.

Data Structure
The popular ImageNet was created with a hierarchical structure based on WordNet [44], a large lexical database of English words grouped into sets of synonyms, or synsets. Because the proposed dataset is more discipline-specific (civil engineering) than ImageNet, it would be more appropriate to organize it according to government-published documents, national standards or widely accepted classifications in the discipline. As seen in Figure 1, which depicts the hierarchical structure of the proposed dataset, the subtrees of the class bridge are organized following the National Bridge Inspection Standards and the Bridge Inspector's Engineering Reference Manual [45].
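One way to picture such a hierarchy is as a nested mapping whose leaves are the labelled classes. The sketch below is illustrative only: the category names are assumptions chosen for the example and do not reproduce Figure 1 or the cited standards.

```python
from typing import Dict, List, Tuple

# Illustrative hierarchy only; node names are hypothetical.
Tree = Dict[str, "Tree"]

HIERARCHY: Tree = {
    "bridge": {
        "deck": {},
        "superstructure": {},
        "substructure": {},
    },
    "pavement": {
        "flexible": {},
        "rigid": {},
    },
}


def leaf_paths(tree: Tree, prefix: Tuple[str, ...] = ()) -> List[str]:
    """Return the slash-separated path of every leaf class in the hierarchy."""
    paths: List[str] = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append("/".join(path))
    return paths


for class_path in leaf_paths(HIERARCHY):
    print(class_path)
```

Paths of this kind could double as directory names or label identifiers, so that every image maps to exactly one node of the hierarchy.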

Data Collection
Data should be representative and cover a broad range of research and applications to serve cutting-edge deep learning research in civil engineering. Labelling images can be a daunting task, which may be completed in multiple ways, such as the web-based annotation tool LabelMe [46], Amazon Mechanical Turk (AMT), or even a computer game [47].
Some types of data, such as images for highway bridge inspection, may need to be labelled by people who have received special training.
The following is a list of various methods that may be used to collect data for the proposed dataset.
• Data mining online resources. Internet-based data collection [48] is a relatively easy and cheap way to collect a large, and thus more representative, body of data. Both ImageNet and MS COCO collected images from the internet. It is not a challenging task to use a Python script to automatically extract and scrape data. However, while data collected from the internet may be satisfactory for everyday applications such as detecting and recognizing cats and dogs, it may not always meet the demands of specific civil engineering research and applications. Another concern with internet-based data is piracy and copyright protection. Fortunately, non-commercial use of data for education and research is generally allowed. For example, researchers and educators may download images that ImageNet acquired from the web for non-commercial research and/or educational purposes under certain terms and conditions.
• Image data may be obtained by querying several image search engines such as Google Images, Bing Images, and Flickr. For example, MS COCO collected non-iconic images from Flickr. Kaggle can also be a good source to identify good data.
• Request data from the USDOT, state and local DOTs, and other government agencies.
This could be the best approach to obtaining high-quality data. The Freedom of Information Act (FOIA) [49] is a Federal law that gives individuals the right to access any US federal agency records unless release is prohibited by law or the records are protected by one of nine exemptions, which means we may not necessarily be able to obtain all the data of interest.
There is a tremendous amount of data, so we first need to identify the most valuable data to request, namely data that can potentially be used for deep learning studies. One example of valuable data is the images taken by bridge inspectors and owned by state DOTs. These images may serve as high-quality data for supervised training and testing of deep neural networks for bridge inspection.
• Collect and archive data, such as traffic data, from publicly available open data portals provided by state DOTs and other agencies. Table 1 shows a few examples of free, publicly available traffic data portals provided by state DOTs. The proposed dataset website should extract data from those portals, then organize and classify it for deep learning traffic flow research, which may require substantial data cleansing to better serve deep learning research needs.
• Promote and encourage data sharing.
High-quality data is the heart of any research work, and excellent data builds the best possible foundation for deep neural network related publications. However, individual authors may have only small datasets and look for larger ones. A "take one, return one" policy (a researcher may download data if they contribute data) may encourage data sharing among peer researchers.
• Launch competitions based on the proposed dataset to help advance and develop better algorithms for civil engineering deep learning research and application. Kaggle sets an example of providing predictive modeling competition platform to solve a wide variety of problems in different fields of computer science, Engineering
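Images gathered from several of the sources listed above are likely to overlap, so a cleansing pass is usually needed before labelling. The sketch below removes exact duplicates by hashing file contents. It is an assumed, minimal cleaning step rather than a procedure described in this paper, and the file names and byte strings are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def deduplicate(files: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Keep the first file name seen for each distinct file content."""
    first_seen: Dict[str, str] = {}
    for name, content in files:
        digest = hashlib.sha256(content).hexdigest()
        # setdefault leaves the earlier name in place when the digest repeats.
        first_seen.setdefault(digest, name)
    return list(first_seen.values())


# Hypothetical downloads: two of the three files have identical bytes.
downloads = [
    ("search_engine_a/crack_001.jpg", b"\xff\xd8 image bytes A"),
    ("search_engine_b/img_93.jpg", b"\xff\xd8 image bytes A"),
    ("dot_portal/deck_17.jpg", b"\xff\xd8 image bytes B"),
]
print(deduplicate(downloads))
```

A content hash only catches byte-identical copies. Resized or re-encoded versions of the same photograph would need a perceptual similarity measure instead.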

Construction and Maintenance
We can build an online dataset using the open-source data portal platform CKAN, which allows easy data storage, distribution and sharing. CKAN is used by public institutions [67] and government data catalogues such as Data.gov and HealthData.gov in the US, data.gov.uk in the UK, and many others [68].
Construction and maintenance of such a dataset will need support from research grants and donations.
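CKAN portals expose an HTTP Action API, which would let users of the proposed dataset search it programmatically. The sketch below builds a `package_search` request URL and extracts dataset titles from a response body. The portal address and the sample response are hypothetical, and no network request is made here.

```python
import json
from typing import List
from urllib.parse import urlencode


def package_search_url(portal_base: str, query: str, rows: int = 10) -> str:
    """Build the URL for a CKAN package_search call on the given portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> List[str]:
    """Extract dataset titles from a package_search JSON response body."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [pkg.get("title", pkg.get("name", "")) for pkg in payload["result"]["results"]]


# Hypothetical portal address and hand-written response, for illustration only.
url = package_search_url("https://civil-dataset.example.org/", "pavement crack", rows=5)
sample_response = json.dumps(
    {"success": True, "result": {"count": 1, "results": [{"name": "pavement-crack", "title": "Pavement Crack Images"}]}}
)
print(url)
print(dataset_titles(sample_response))
```

In practice the response text would come from an HTTP GET on the generated URL, using any HTTP client.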

Building and Testing a Pilot Study Database
A small dataset for concrete crack detection was built with 1499 concrete-crack images and 589 concrete-not-crack images. Figure 2 shows the sprite image of the proposed database for concrete crack detection. For bicycle detection and counting, a database was created with 4822 training images and 988 test images. The bicycle images were collected from Google using a data scraping software tool, and were labeled and annotated using the LabelImg software.
Moreover, a dataset was created for pavement crack detection with 2284 training images and 336 test images. The images were collected using a hand-held mobile phone and a drone. A total of 11 categories of flexible pavement crack images and 7 types of rigid pavement crack images were included in this dataset. The images were annotated and labeled using the LabelImg software, which took more than 50 hours of manual labor. Additionally, an infrared thermography dataset was created with 84 training and 24 test infrared thermography images. Figure 3 shows the sprite images of the pavement crack detection images together with the infrared thermography images. The databases were uploaded to a shared Google Drive (http://bit.ly/2ujAhMd).
All the developed databases were tested using a deep learning convolutional neural network model called Faster R-CNN. The description of the test parameters and procedures is beyond the scope of this paper. The reader can find more in-depth information on model selection and training in [7].
The Faster R-CNN model successfully detected pavement cracks in normal images (Figure 4) and in infrared thermography images (Figure 5), as well as pedestrians and bicyclists (Figure 6). All the test images from the proposed pilot study databases were detected with a 98% confidence level, which suggests that the database annotation and labeling are heading in the right direction.
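The image counts reported for the pilot datasets can be summarized with a few lines of arithmetic. The sketch below computes the share of images held out for testing in each dataset and the class balance of the concrete crack dataset. The counts come from the text, while the variable and function names are chosen here for illustration.

```python
from typing import Dict

# Train/test image counts as reported for the pilot datasets.
PILOT_SPLITS: Dict[str, Dict[str, int]] = {
    "bicycle": {"train": 4822, "test": 988},
    "pavement crack": {"train": 2284, "test": 336},
    "infrared thermography": {"train": 84, "test": 24},
}

# Class counts as reported for the concrete crack dataset.
CONCRETE_CLASSES: Dict[str, int] = {"crack": 1499, "not_crack": 589}


def held_out_fraction(train: int, test: int) -> float:
    """Fraction of all images that is reserved for testing."""
    return test / (train + test)


for name, counts in PILOT_SPLITS.items():
    total = counts["train"] + counts["test"]
    print(f"{name}: {total} images, {held_out_fraction(**counts):.1%} held out for testing")

crack_share = CONCRETE_CLASSES["crack"] / sum(CONCRETE_CLASSES.values())
print(f"concrete: {sum(CONCRETE_CLASSES.values())} images, {crack_share:.1%} labelled as crack")
```

The test share differs between the three datasets, and the concrete dataset contains more crack than not-crack images, both of which are worth keeping in mind when comparing results across datasets.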

Conclusion and Discussion
A big dataset is like the fuel that an engine turns into power for the civil engineering AI research plane. A publicly available, free and labelled dataset addresses a fundamental issue in advancing deep learning research and application in civil engineering and beyond: high-quality data. It would be a tremendous help for civil engineering researchers to build their innovative, cutting-edge work in intelligent transportation, connected vehicles, structural health monitoring, bridge inspection, and other real-life applications on the proposed dataset. Our pilot study shows some of the proposed datasets for concrete crack detection, pavement crack detection, and pedestrian and bicyclist detection, with a 98% confidence level.
Figure 5. Concrete crack detection using Faster R-CNN with 98% confidence.