Hybrid Warehouse Model and Solutions for Climate Data Analysis

With the rapid growth of data sensors, massive volumes of data are now generated from many different sources. Administering such data, that is, storing, managing, analyzing, and extracting insightful information from it, is a challenging task. Big data analytics is becoming a vital research area in domains such as climate data analysis, which demand fast access to data. Nowadays, MapReduce, an open-source distributed computing framework, is widely used in many domains of big data analysis. In this work, we develop a conceptual data modeling framework for the implementation of a hybrid data warehouse model that stores the features of National Climatic Data Center (NCDC) climate data. The hybrid data warehouse model for climate big data enables the identification of weather patterns applicable in agricultural and other climate change-related studies, which in turn can play a major role in recommending actions to domain experts and in making contingency plans for extreme cases of weather variability.


Introduction
Nowadays, the volume of data generated from different sources is rapidly increasing, and administering and processing such massive volumes of data is challenging: it is time-consuming, costly, and presents many obstacles to the researchers involved. Administering such data, in the sense of storing, managing, analyzing, and extracting insightful information from it, is a demanding task. The accelerating daily growth of data is driven by the Internet of Things (IoT), Artificial Intelligence, and Data Science. To make use of these huge volumes of data, researchers often rely on cutting-edge analytical tools and techniques supported by Artificial Intelligence and machine learning methods. The main concerns with big data are privacy, security, and discrimination. Researchers have therefore started building support systems such as data warehouses.
Huge amounts of weather data and climatic variables are recorded manually or digitally using many resources, such as weather stations and satellites, around the country. Thus, the storage and manipulation of this information have to be effective, integrated, and flexible if meteorological and climatological studies are to achieve effective and intensive analysis. Access to recorded meteorological data varies with the process at hand; for instance, in weather forecasting the raw data must be accessed rapidly, while in climatology, high-quality historical weather data is important for accuracy [1].
With technological advances in the tools and equipment used to collect weather data, researchers can access and share huge datasets. The NCDC archive, including the Daily Global Weather Measurements 1901-2020 (GSOD, NCDC), is the largest active archive of climate data available online so far. The history of the NCDC dataset goes back more than 150 years, and the volume of new data collected daily is close to 224 gigabytes. Moreover, the NCDC and GSOD datasets are easily accessible and can be downloaded from the NCDC website [2].
The literature suggests a clear definition of a data warehouse: a single, consistent, complete store of data acquired from various sources in order to analyze it using business intelligence tools [3] [4]. Data warehouse technology is applicable in many industrial domains that use historical data for prediction, statistical analysis, and decision making, for instance banking, consumer goods, finance, weather, healthcare, and the Internet of Things (IoT) [1] [5] [6] [7] [8] [9]. Moreover, the huge amount of sensor data collected in the meteorological domain raises many problems that must be addressed in the preprocessing stages; it demands careful manipulation so that weather analysis can be carried out accurately by extracting relevant patterns [10].
The traditional way of storing meteorological data was file-based, whereas saving weather data in relational database management systems (RDBMS) has recently been attracting the attention of researchers in climatology. As a result, many institutions operating in this domain are shifting towards storing their data in an RDBMS [11]. Chen [12] describes Cheetah, a data warehouse framework designed on top of the MapReduce platform. Dimri et al. [13] proposed a data warehouse model for storing weather data that uses On-Line Analytical Processing (OLAP) to generate data suitable for weather analysis and to provide multi-dimensional reports. José et al. [14] developed a data warehouse to store climatic variables from the weather stations of Mexico; the authors used SQL Server to store the final version of the data in the proposed model. Sameer and Madhu [15] discussed how to design a data warehousing model using Hadoop.
The main contributions of this work are:
• The definition and implementation of a hybrid data warehouse infrastructure that supports distributed weather data storage, distributed computing, and parallel programming.
• The new data warehouse can be implemented in different ways to store huge datasets and to distribute workloads over a hybrid Hadoop platform.
• The model preserves and improves a traditional data warehouse for reporting, OLAP, and performance management, while adopting new data platform developments for advanced analytics.
• The hybrid data warehouse model for climate big data enables the identification of weather patterns applicable in agricultural and other climate change-related studies, which can play a major role in recommending actions to domain experts and in making contingency plans for extreme cases of weather variability.
The rest of the paper is organized as follows. Section 2 presents a review of the related works in the data warehouse and big data context. Section 3 deals with the dataset description. Section 4 describes the concept and architecture of the data warehouse. Section 5 presents the concept of Big Data and the required tools such as Hadoop, Pig, Hive, and Sqoop. Section 6 shows the proposed data warehouse model for the weather data under consideration. Finally, concluding remarks and summaries are presented in Section 7.

Related Work
Many research works have designed data warehouses based on big data and the Hadoop framework. Kalra and Steiner [7] explained that the quality and content of data vary over time depending on the type of information, for instance weather and health data; such data therefore needs to be gathered, processed, and stored in different formats. The authors developed a weather data warehouse model that enables the dynamic and smooth integration of new information sources and data formats. The proposed architecture depicts an active and flexible weather data warehouse that provides a broad variety of weather data from various sources to different weather-based applications.
Néstor et al. [16] discussed the problems of meteorologically recorded data, which requires strategies for capturing, delivering, storing, and processing in order to increase data quality and stability. They proposed a star schema model for a data warehouse that allows the storage and analysis of historical multidimensional hydro-climatological data. The proposed warehouse provides efficient storage for data collected from two networks of hydro-meteorological stations, going back more than 50 years, in the city of Mani-. Doreswamy et al. [1] proposed a scalable architecture for a hybrid data warehouse approach for climate data using Hadoop and various big data tools. The proposed schema enables the identification of weather patterns and derives knowledge from the data better than a traditional database does.
Vuong et al. [17] show how designing and developing a data warehouse for agriculture plays a main role in establishing a crop intelligence platform. They describe the requirements of an efficient agricultural data warehouse, such as privacy, security, and real-time access for its stakeholders, and propose a system architecture and database schema for designing and implementing an efficient agricultural data warehouse with big data and data mining.

NCDC Dataset Details
This section describes the data in the NCDC dataset for Saudi Arabia. The NCDC is a large dataset covering more than 9000 stations around the globe and is available online from the NCDC meteorological site [2]. Figure 1 shows the selected Saudi Arabia weather stations from the NCDC dataset; each station has 16 attributes. Table 1 presents the column names and their corresponding descriptions. Table 2 presents information such as station number, station name, latitude, longitude, and begin and end dates for an illustrative selection of the Saudi Arabia weather stations in the NCDC dataset. Moreover, Table 3 shows sample data from one station of the NCDC dataset and the corresponding value of each feature. A problem with the NCDC dataset is that it has several missing values, encoded for the selected attributes as 9999.9, 999.9, or 99.99. Specifically, 9999.9 marks a missing value for the variables TEMP, DEWP, SLP, STP, MAX, and MIN; 999.9 indicates a missing value for the columns VISIB, WDSP, MXSPD, GUST, and SNDP; and 99.99 marks a missing value for the column PRCP.
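The sentinel codes above can be handled in a short preprocessing step. The following is a minimal Python sketch, where the per-column sentinel table follows the description above and the sample record values are illustrative:

```python
# Sketch: map NCDC sentinel codes to missing values (None).
# Per the text: 9999.9 for TEMP/DEWP/SLP/STP/MAX/MIN, 999.9 for
# VISIB/WDSP/MXSPD/GUST/SNDP, and 99.99 for PRCP.
SENTINELS = {
    "TEMP": 9999.9, "DEWP": 9999.9, "SLP": 9999.9, "STP": 9999.9,
    "MAX": 9999.9, "MIN": 9999.9,
    "VISIB": 999.9, "WDSP": 999.9, "MXSPD": 999.9, "GUST": 999.9, "SNDP": 999.9,
    "PRCP": 99.99,
}

def clean_record(record):
    """Replace a column's sentinel code with None, leaving real readings intact."""
    return {col: (None if SENTINELS.get(col) == val else val)
            for col, val in record.items()}

row = {"TEMP": 9999.9, "PRCP": 0.12, "VISIB": 999.9}  # illustrative values
print(clean_record(row))  # {'TEMP': None, 'PRCP': 0.12, 'VISIB': None}
```

Doing this substitution before loading keeps the sentinel codes from contaminating later aggregations such as averages.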

Data Warehouse (DW)
In the 1990s, Bill Inmon suggested the first architecture of the data warehouse (DW) model, while Gartner, in 2005, provided a clear concept of the DW. The main task of a DW schema is to accumulate and keep data from various sources for future analysis and decision making. Generally, a classic relational database schema is used to store, manage, and query structured data. The present DW model can be summarized as follows [4]:
• Subject oriented: the entire data is classified into various domain areas, such that each domain has complete data related to one subject.
• Nonvolatile: unmodified data is kept for a long period of time in the DW.
• Time variant: the DW is able to store newly modified versions of records.
• Not virtual: the DW saves the data persistently in physical storage for a long period of time.
ETL stands for Extract, Transform and Load, three database operations combined into one tool to extract data from different sources and place it into a data warehouse. Extract is the operation of reading data from multiple sources of different natures, such as structured, semi-structured, and unstructured data. The transform operation fits the data to the analytical model, and the load operation physically stores the data in the data warehouse. Finally, analysis, visualization, and report generation are carried out as the final core activities in constructing a DW model (see Figure 2) [3].
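The three ETL operations can be sketched end to end in a few lines. The following Python example uses an in-memory SQLite database as a stand-in warehouse; the table name, column names, and sample readings are all illustrative assumptions:

```python
# Sketch of the three ETL operations using Python's stdlib and an in-memory
# SQLite database as a stand-in warehouse; names and data are illustrative.
import csv, io, sqlite3

raw = "station,temp_f\n401040,68.0\n404160,9999.9\n"          # extract: read a source
rows = list(csv.DictReader(io.StringIO(raw)))

def transform(r):                                              # transform: fit the model
    t = float(r["temp_f"])
    return (r["station"], None if t == 9999.9 else round((t - 32) * 5 / 9, 1))

dwh = sqlite3.connect(":memory:")                              # load: physical storage
dwh.execute("CREATE TABLE fact_temp (station TEXT, temp_c REAL)")
dwh.executemany("INSERT INTO fact_temp VALUES (?, ?)", [transform(r) for r in rows])

print(dwh.execute("SELECT station, temp_c FROM fact_temp").fetchall())
# [('401040', 20.0), ('404160', None)]
```

The same extract/transform/load split scales up to the Sqoop-and-Hive pipeline used later in the paper.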

The OLTP Data Processing
The Online Transactional Processing (OLTP) model is a data processing technique that deals with transaction-related tasks and sub-tasks [18]. The common tasks in OLTP are inserting, updating, and deleting data in a database by a large number of concurrent users. OLTP handles recent operational data of small to medium size with the goal of performing daily operations, using simple queries with read/write operations that are essential for fast transaction speeds. The OLTP architecture manages the day-to-day transactions of a business entity; its main aim is data processing, not data analysis. An example of an OLTP-related system is an ATM center, which makes sure that the amount withdrawn from an ATM machine cannot exceed the amount deposited in the savings account. OLTP systems are therefore highly optimized for transactional tasks such as online banking, online flight ticket booking, sending a text message, order entry, and adding a textbook to a shopping cart [19]. OLTP is known for two features, atomicity and concurrency: atomicity guarantees that if a single step of a transaction is incomplete, the entire process automatically stops, while concurrency prevents multiple users from altering the same data simultaneously. The main benefit of an OLTP system is that it allows the administration of an organization's daily transactions and increases customer satisfaction by simplifying routine processes. Table 4 illustrates the differences between OLTP and OLAP.
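The ATM example and the atomicity property can be demonstrated concretely. The following Python/SQLite sketch (the `accounts` table and balances are illustrative) shows a withdrawal that would overdraw the account being rejected and rolled back as a whole:

```python
# Sketch of OLTP atomicity with SQLite: a withdrawal that would overdraw the
# account violates a constraint, and the whole transaction is rolled back.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
           " balance REAL CHECK (balance >= 0))")
db.execute("INSERT INTO accounts VALUES (1, 100.0)")
db.commit()

try:
    with db:  # 'with' wraps the statements in one atomic transaction
        db.execute("UPDATE accounts SET balance = balance - 250.0 WHERE id = 1")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired; the partial step is undone automatically

print(db.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0])  # 100.0
```

Real OLTP engines layer locking or multi-version concurrency control on top of the same atomic-transaction guarantee.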

Data Warehouse: Terminologies
The famous approaches used for data modeling are: 1) the Dimensional Model, or Star Schema, which is created using two types of tables, namely fact and dimension tables; and 2) the Normalized Model, which is designed in the same way that an OLTP schema is designed. A star schema is easier and faster for executing queries, while the normalized model is easier when information must be updated [1] [4]. Table 5 presents a short summary comparing the dimensional model and the normalized approach (3NF model) in data warehousing.

Table 5. The dimensional approach versus the normalized approach (3NF model).

Data warehouse:
- Dimensional approach: states that the data warehouse should be modeled using a dimensional model (star or snowflake schema).
- Normalized approach: states that the data warehouse should be modeled using an E-R/normalized model.

Data:
- Dimensional approach: the data is partitioned into facts, which are generally numeric transaction data, and dimensions, which are the reference information that gives context to the facts.
- Normalized approach: the data is stored following database normalization rules. Tables are grouped by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides the data into entities, which creates several tables in a relational database; each entity becomes a separate physical table when the database is implemented.

Advantages:
- Dimensional approach: the data warehouse is easier for the user to understand and use, and retrieval of data tends to operate very quickly.
- Normalized approach: it is straightforward to add information into the database.

Disadvantages:
- Dimensional approach: loading the data warehouse with data from different operational systems is complicated, because the integrity of facts and dimensions must be maintained.
- Normalized approach: because of the number of tables involved, it can be difficult for users both to join data from different sources into meaningful information and to access the information without a precise understanding of the data sources and the data structure of the warehouse.

Fundamentals of Data Warehouse
To design the data warehouse model, there is a set of fundamentals such as grain, additivity, fact tables, dimension tables, and calendar tables. These fundamentals are described as follows:
• Grain: the scale or level of granularity of a fact table; all facts in a table should have the same grain.
• Additivity: the property of numeric facts that allows them to be used in computations such as averaging, minimum, and maximum, while the other types of facts must also be taken into account.
• Fact tables: tables connected to dimension tables through one or more foreign keys.
• Dimension tables: tables used to execute the required queries and generate the reports.
• Calendar dimension: the basic table for simple dates, linked to both fact and dimension tables.
The quality of the data warehouse plays a significant role in the accuracy of data analysis [23]. The main criteria for measuring the quality of a data warehouse are as follows: 1) access to information should be easy; 2) recorded data should be consistent and correctly integrated; 3) the data warehouse should adapt to any change; and finally, 4) the data warehouse should gain acceptance from end-users [17].
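Additivity matters in practice because some aggregates can be combined across partitions and some cannot. A small Python illustration (the readings are made-up daily temperatures) shows why a warehouse stores additive facts such as sums and counts rather than pre-averaged values:

```python
# Sketch of additivity: SUM and COUNT are additive across partitions, but an
# average is not; averaging two partial averages gives the wrong answer unless
# the partitions are equal-sized.
jan = [20.0, 22.0, 24.0]          # three daily readings (illustrative)
feb = [30.0]                      # one daily reading (illustrative)

wrong = (sum(jan) / len(jan) + sum(feb) / len(feb)) / 2        # average of averages
right = (sum(jan) + sum(feb)) / (len(jan) + len(feb))          # additive facts combined

print(wrong, right)  # 26.0 24.0
```

Storing (sum, count) per grain lets the correct average be recovered at any coarser level of aggregation.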

Big Data
In recent years, big data has come to denote datasets whose size, velocity, variety, and diversity grow at an alarming rate, creating challenges for storage and analysis with conventional database systems [23]. Many popular data technologies are trying to address the challenges of new Big Data and Internet of Things (IoT) applications, which generate structured, semi-structured, and unstructured data currently ranging from terabytes to petabytes [4]. Many domains produce such huge volumes of data (big data), for instance video archives, sensor data, Internet text and documents, social networks, tweets, blogs, log files, biochemical and medical records, and transactional records [3]. The datasets in big data are often characterized by n Vs, where n currently refers to 9 characteristics: Veracity, Variety, Velocity, Volume, Validity, Variability, Volatility, Visualization, and Value [24].

Hadoop
Apache Hadoop is an open-source distributed framework widely used for parallel storage and efficient processing of big data on clusters of machines using high-level programming languages. Hadoop modules offer many features to help developers and researchers, such as graphical user interfaces, simple administration tools, and support for high-level languages such as Java and Python. The advantages of this framework are high availability, fault tolerance, and the scalability to process petabytes of data on a Hadoop cluster. A Hadoop cluster is a group of up to thousands of computers connected to each other to store big data and run MapReduce programs in parallel; users can generally execute jobs remotely on the cluster [25] [26]. Figure 3 shows the principal components of the Hadoop framework [28]. Basically, HDFS is the primary storage of large datasets in Hadoop and manages the distributed file system, while MapReduce is a general programming framework designed specifically for parallel processing of large volumes of unstructured data, as shown in Figure 4.

HDFS Architecture
The general structure of the HDFS system is based on a master/slave communication framework. An HDFS cluster has two essential elements, namely the NameNode and the DataNodes [29]. The cluster has a dedicated NameNode, which is responsible for storing metadata, controlling access to the files stored in the DataNodes, and managing the file system, and it can have a number of DataNodes, which store the data, as shown in Figure 5.

MapReduce
MapReduce was introduced by Dean and Ghemawat at Google and is supported under the Apache Hadoop project [31]. In general, developers address big problems by writing MapReduce programs in which the input files are split into small chunks to fit the HDFS system and the parallel computations [4]. The "Map" and "Reduce" classes are the fundamental components of MapReduce, and Figure 6 shows the main steps of its process flow. In the Map step, the input data is divided into a number of small chunks such that all sub-files are allocated in parallel to different mappers. The mapper method reads its chunk as a batch of keys and their value pairs and likewise emits a set of (key, value) pairs as its result. The shuffle-and-sort phase comes immediately after the map phase; its input is the output of the mapper functions, and it produces the (key, value) pairs that are assigned to the reducers. In the final step, the (key, value) pairs of the previous tasks are grouped by key and assigned to the reducers to produce the final results, which are stored in HDFS [32]. The Hadoop framework has a JobTracker service that assigns MapReduce tasks to specific nodes within the Hadoop cluster, ideally the nodes that hold the data. The TaskTracker, in turn, is a node that accepts operations such as Map, Reduce, and Shuffle from the JobTracker. During the execution of a MapReduce job, the JobTracker and TaskTracker are used to schedule, monitor, and restart processes in case of failure, as shown in Figure 6.
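The map, shuffle-and-sort, and reduce steps above can be simulated in a few lines of Python. This sketch computes a per-station average temperature; the record format and station numbers are illustrative, and real Hadoop would of course run the phases on distributed nodes:

```python
# Minimal single-process simulation of the MapReduce flow: map emits
# (key, value) pairs, shuffle groups them by key, reduce folds each group.
from collections import defaultdict

def mapper(line):                       # map: one (station, temp) pair per record
    station, temp = line.split(",")
    yield station, float(temp)

def shuffle(pairs):                     # shuffle/sort: group values by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(station, temps):            # reduce: aggregate each group
    return station, round(sum(temps) / len(temps), 1)

chunk = ["401040,20.0", "404160,30.0", "401040,22.0"]   # illustrative records
pairs = [p for line in chunk for p in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'401040': 21.0, '404160': 30.0}
```

Each mapper sees only its own chunk, which is what lets Hadoop parallelize the map phase across the cluster.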

Hadoop Ecosystem
This section describes the Hadoop ecosystem, whose modules, together with MapReduce and HDFS, are used in the ETL operations [33].

2) Apache Hive
Apache Hive is an open-source project and an efficient query language that simplifies the development of applications on the MapReduce framework. It has two main components, namely the Hive Query Language (HiveQL) and the Hive metastore [3] [40] [41]. HiveQL is the query language of Hive and facilitates writing queries over data stored in Apache HBase and HDFS, while the Hive metastore is the master repository of Hive metadata and stores metadata about data files and blocks in the HDFS NameNode [42]. Moreover, Hive is used in data warehousing implementations, as it facilitates operations such as reading, writing, and managing large files stored in HDFS [43].

Implementation
The commonly used methods for building a meteorological data warehouse are: the classical DW; big data tools such as Hadoop, Sqoop, Hive, Pig, and Spark; and, finally, the hybrid DW model, which merges the traditional DW with big data.
In this work, the proposed data warehouse model is constructed following the third approach, and the following big data tools are applied to implement the model, as shown in Figure 7:
• RDBMS: It is used to store the records of the collected data.
• SQOOP: It is used to import records from RDBMS into HDFS, and to transfer the final results of aggregation operations to the data warehouse.
• HDFS: It is used to save big files.
• Hive: It is used to create tables and databases on HDFS. Moreover, it has the ability to perform join, partition, merge, or aggregation operations.

High Dimensional Schema for NCDC Dataset
The data-warehouse schema is a logical description of the whole database [8].
The main components of a data warehouse are fact tables, dimension tables, and their dependencies. A dimension is simply a row-and-column table that contains the samples and their corresponding attributes; operations such as grouping, filtering, and labeling can be applied to such a high-dimensional table. The samples in dimension tables are identified by a unique key, and each column represents a range of values measured in given units. The available dataset is collected as compressed files, one per year. Figure 8 shows the design of the relational schema for the collected data. For instance, the weather_station table contains information about each station in columns such as station_STN, latitude, longitude, etc. Similarly, the countries table holds information about each country and is linked to the states table. Finally, each parameter key in the Parameter table connects to the corresponding table. Figure 9 demonstrates the proposed star schema for the NCDC weather datasets, which contains one fact table and seven dimension tables. The primary task of the OLTP schema is to perform the preprocessing stage, such as reducing redundancy, normalizing the data, and checking its integrity; it also makes it possible to create, update, or delete any column in a particular table. The weather fact table has seven dimension tables, namely STATION, TIME, TEMPERATURE, PARAMETER, SLP, DEWP, and PRCP. In general, the fact table consists of a primary key and many keys referring to the corresponding dimension tables, while the primary key in each dimension table has the same name as the table containing the observations collected from a particular station. Table 6 describes the dimension tables and all their relevant attributes.
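The fact/dimension relationship can be sketched with a tiny SQLite database. The names below echo the STATION dimension and weather fact table described above, but the columns, keys, and sample rows are illustrative assumptions, not the actual NCDC schema:

```python
# Sketch of the star-schema idea: a fact table holding foreign keys and a
# measure, joined to a STATION dimension; all names and data are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE station (station_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE weather_fact (station_key INTEGER REFERENCES station, temp REAL);
INSERT INTO station VALUES (1, 'RIYADH', 'SA'), (2, 'JEDDAH', 'SA');
INSERT INTO weather_fact VALUES (1, 20.0), (1, 24.0), (2, 30.0);
""")

rows = db.execute("""
    SELECT s.name, AVG(f.temp)
    FROM weather_fact f JOIN station s ON s.station_key = f.station_key
    GROUP BY s.name ORDER BY s.name
""").fetchall()
print(rows)  # [('JEDDAH', 30.0), ('RIYADH', 22.0)]
```

The query pattern (join the fact table to a dimension, then group by a dimension attribute) is exactly the shape of OLAP queries the star schema is optimized for.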
Figure 11 illustrates the simplified design of the proposed Hadoop data warehouse model. This design comprises three main tables, namely Countries, Parameters, and Weather_Stations; the other tables from the OLTP schema have been denormalized into one of these three main tables. For instance, the daily_measure_VISIB, daily_measure_PRCP, daily_measure_DEWP, and daily_measure_SLP tables have been denormalized into Parameters.

The Proposed Hadoop Data Warehouse Model
There are many reasons to use the Hadoop framework in the data warehouse model: it can store the same records as the OLTP schema, it can save large files distributed over HDFS, and data can be moved efficiently between the DW and Hadoop. However, the way data is stored in Hadoop is totally different from OLTP. In OLTP, data may be updated one record at a time with a single insert/update/delete operation, so the OLTP schema allows updates, deletes, and inserts to change record values. The HDFS system in Hadoop, on the other hand, does not allow updating the values of records; it only allows deleting old data and replacing it with new data. To address this in Hadoop, two versions of a table are created, with the second, append-only table used to record new rows. For instance, the weather_stations table has another version, Station_history, which is an append-only table that saves the history of all stations. Each time the import job executes, the new data imported from OLTP is appended to the end of the Station_history table, and the final data is stored in weather_stations.
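The append-only pattern above can be sketched in plain Python: updates are never applied in place; new versions are appended to a history list, and the current table is reconstructed by keeping the latest version per station key. The station numbers, names, and version tags are illustrative:

```python
# Sketch of the append-only pattern: Station_history only ever grows, and the
# current weather_stations view keeps the latest version per station key.
history = []  # stands in for the append-only Station_history table

def import_job(records):
    """Each import appends the new rows; nothing is updated or deleted."""
    history.extend(records)

import_job([("401040", "RIYADH", "v1"), ("404160", "JEDDAH", "v1")])
import_job([("401040", "RIYADH/KING KHALED", "v2")])   # a later correction

latest = {}
for stn, name, version in history:   # later appends overwrite earlier ones
    latest[stn] = (name, version)

print(latest["401040"])  # ('RIYADH/KING KHALED', 'v2')
```

In the actual pipeline the "keep the latest version" step would be a Hive query that rewrites weather_stations from Station_history after each import.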
To make this clear, Table 7 and Table 8 illustrate the concept of insert/update/delete in Hadoop, while Table 9 demonstrates the final version of the Weather_stations table, which holds only the complete rows.

The Aggregates Design
The primary task of a data warehouse is to provide great flexibility and efficiency for the query process, so aggregations are used to decrease query time by means of pre-computed summary data. This section briefly covers the ingestion, aggregation, and data export tasks.
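The benefit of pre-computed summaries can be seen in a toy Python sketch: the per-station count and running sum are materialized once at load time, so a later average query is a plain lookup instead of a scan over every record. The readings are illustrative:

```python
# Sketch of pre-computed aggregates: build the summary once, then answer
# average-temperature queries with an O(1) lookup instead of a full scan.
readings = [("401040", 20.0), ("404160", 30.0), ("401040", 24.0)]  # illustrative

summary = {}                               # the pre-computed aggregate table
for stn, temp in readings:
    cnt, total = summary.get(stn, (0, 0.0))
    summary[stn] = (cnt + 1, total + temp)

def avg_temp(stn):                         # query: lookup only, no scan
    cnt, total = summary[stn]
    return total / cnt

print(avg_temp("401040"))  # 22.0
```

Keeping (count, sum) rather than a pre-averaged value is what makes the summary additive, so new daily imports can extend it incrementally.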

1) Ingestion
The first stage of the conventional ETL approach is to transfer the stored data from the OLTP schema to the data warehouse. Generally, the Sqoop tool is used for ingesting data from OLTP into HDFS in Hadoop, as shown in Figure 12. Ingesting a small amount of data from OLTP into Hadoop requires a single Sqoop task, whereas for large data sizes many tasks are needed, or the Sqoop job must be repeated several times, as shown in Figure 13. The list of files transferred after the Sqoop task completes is shown in Figure 14.

2) Aggregation
The aggregation operation is one of the most time-consuming operations in a classical database. Thus, to reduce the running time of ETL, the aggregation operations are implemented in Hadoop. As the Hadoop framework is distributed and can run tasks in parallel, aggregation tasks execute faster than in the OLTP database. Hive and Impala are the most popular tools for aggregation over Hadoop. Aggregation operations such as average, max, count, and summation are implemented more cheaply and scalably using Hadoop, which stores a huge number of records. For example, the following Hive code calculates the record count and the average temperature for each station (the original statement is truncated after CREATE, so the source table and column names here are reconstructed from the schema above and should be read as an illustrative sketch):

CREATE TABLE avg_temp AS
SELECT station_STN,
       COUNT(*) AS records_count,
       AVG(TEMP) AS avg_temp
FROM weather
GROUP BY station_STN;

3) Data Export
The next step after ingestion and aggregation is to transfer the cleaned data from HDFS to the actual data warehouse. The preferred tool in this stage is again Sqoop, which can move the data from the Hadoop system to the traditional data warehouse with insert and update operations. For instance, the following Sqoop code exports data from the avg_temp table in the weather_dwh database stored in Hadoop to a database table:

sqoop export --connect jdbc:mysql://localhost:3306/weather_dwh \
  --username hive_user \
  --table avg_temp \
  --export-dir /user/hive/warehouse/avg_temp \
  -m 16 \
  --update-key station_STN \
  --input-fields-terminated-by '\001' \
  --lines-terminated-by '\n'

Conclusion
In this work, we explored different notions of big data and data warehousing, and presented in detail the proposed data warehouse for weather data, constructed on top of the robust Hadoop system. A flexible meteorological data warehouse was successfully produced using the suggested star schema model and various big data software tools. The presented schema includes all necessary fact and dimension tables to support scalable and efficient analytical models, and the suggested data warehouse model is optimized for the available NCDC data. The advantages of this model are the possibility of adding new variables, the flexibility with which different queries can be run, and the ease of extracting data. Finally, the proposed model is flexible, adaptable, and well suited to the rapid increase of data from different weather variables without any major change.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.