Fifty-Six Big Data V’s Characteristics and Proposed Strategies to Overcome Security and Privacy Challenges (BD2)

Enormous amounts of data travel across the internet today: very large, complex sets of raw facts that are not only big, but also noisy, heterogeneous, and longitudinal. Companies, institutions, healthcare systems, mobile applications, capturing devices and sensors, traffic management, banking, retail, education, etc., use piles of data that are further used to create reports in order to ensure the continuity of the services they offer. Big data has recently become one of the most important topics in the IT industry. Managing big data requires new techniques, because traditional security and privacy mechanisms are inadequate and unable to manage complex, distributed computing over different types of data, and new types of data bring new challenges of their own. Much research has dealt with big data challenges over the previous two decades, starting from Doug Laney's landmark paper; the big challenge is how to operate on a huge volume of data that has to be securely delivered through the internet and reach its destination intact. The present paper highlights important concepts of the fifty-six big data V's characteristics. It also highlights the security and privacy challenges that big data faces, and addresses them through proposed technological solutions that help avoid these problems.


Introduction
The term big data refers to data that is so large and complex that it exceeds the processing capability of traditional data management systems and software.

Fifty-Six V's Characteristics of Big Data [1]
Many researchers have paid a great deal of attention to studying big data characteristics, starting from the three V's (Volume, Velocity, and Variety), which led to further V's being added to the characterization of big data. Other authors use the term pillars or dimensions instead of big data V's characteristics [2]. Over time, the three V's, extended with "Veracity", "Value", and "Variability", grew to fifty-six V's characteristics, two of which were added by the author as explained in [1]. Several V's are shown in Figure 1, and the definitions of each "V" characteristic (dimension) are presented in Table 1. Many companies already hold an "ocean" of archived data or information that can come from every possible sensor, logs, hundreds of hours of uploaded YouTube videos, and billions of gigabytes of global mobile traffic [1].

Variety
Big Data comes in different formats and varied types: structured, semi-structured, multi-structured and, mostly, unstructured data from many kinds of data sources. It is therefore heterogeneous in both size and type and consequently cannot be put together into a relational database [1] [3].

Vagueness
The meaning of the data that is found is often very unclear: the question is not only how much data is available, but also how much of it remains obscure [1] [5].

Vulnerability
This means that no system is perfect: it is probable that there is a way for its hardware or software to be compromised, which in turn means that any associated data can be attacked or manipulated [1] [4].

Volatility
How long does data remain valid, and how long should it be stored? How old does data need to be before it is considered irrelevant [1] [4]?

Visualization
Refers to the application of more recent visualization techniques that explain the relationships between data and can display real-time changes and more illustrative graphics, thus going beyond pie, bar and other charts [1] [4].

Viscosity
It is occasionally used to express the delay, latency or lost time in the data relative to the phenomenon being described [1] [6].

Virality
Measures the rate at which data can propagate through a network [1] [6].

Virtual
Enterprises and other groups can benefit from big data virtualization because it allows them to use all the data assets they gather to accomplish various goals and objectives [1].

Valences
It is a measure indicating how dense the data is [1].

Viability
Viability could be seen as carefully choosing those attributes in the data that are most likely to forecast the outcomes that matter most to organizations [1].

Virility
With Big Data, it means that the data creates itself: the more Big Data you have, the stronger and more forceful Big Data becomes [1] [7] [8] [9] [10].

Vendible
The very existence of clients for Big Data shows crucially that it is valuable; this is evident from the communications of some well-known channels that trade in subscribers' data [1] [7] [11] [12].

Vanity
Vanity of data means that it is pleased with the effect it produces on other individuals [1] [7] [11] [12].

Voracity
Big Data is potentially so insatiable that it may acquire the influence, the control and the possibility to consume itself [1] [7] [11] [12].

Visual
We currently live in a world of seeing, watching, and exchanging photos and videos through the Internet, whether they are personal pictures, product pictures or weather photos [1] [7] [13].

Vitality
Vitality of the data is an important perception that is vital and is included in the concept of value [1] [7] [13].

Vincularity
In its exact meaning, it implies connectivity or linkage. This idea is very pertinent in today's world, interconnected through the internet [1] [7] [14].

Valor
The specific data that has the possibility to produce value, and guidance on how this can be accomplished [1] [7] [15].

Verbosity
Understanding how to quickly separate the meaning you care about from its repetition is important for processing efficiency [1] [7] [13].

Big Data Privacy and Security Challenges
Several studies have addressed the various threats to big data privacy and security from more than one concept or view. One view divides the challenges into categories: some stem from the characteristics of BD themselves, some from its existing analysis methods and models, and some from the limitations of current data processing systems. All are introduced in [20], together with issues of privacy [17] and ethical considerations relevant to mining such data [21]. Tole [3] asserts that building a viable solution for large and multifaceted data is a challenge for which businesses are constantly learning and then implementing new approaches. For example, one of the biggest problems regarding BD is the infrastructure's high cost, centered on its most important component, namely hardware equipment, which is very expensive even with the availability of cloud computing technologies [22].
In addition, to arrange and sort data so that valuable information can be constructed, human analysis is often required. While the computing technologies required to handle these data are keeping pace, the human expertise and talent needed to benefit from BD are not always available, and this proves to be another big challenge. As reported by Akerkar [23] and Zicari [24], the broad challenges of BD can be seen through three main classifications, based on the data life cycle: data, process and management challenges.
2) Process Challenges: these are related to a series of "how" questions: how to capture data, how to merge data, how to modify data, how to choose the right model for analysis, and how to present the results.
3) Management Challenges: these cover, for example, privacy, security, governance and ethical aspects [16]. Data censorship can be challenging, since it includes everything from security and privacy to meeting compliance standards and the ethical use of data.
With big data, management problems grow even bigger because the data is unstructured and unpredictable in shape.

Data Challenges View No. 2
The web makes it easier to collect and share knowledge, as well as data in raw form. Big Data is about how these data can be stored, processed, and comprehended, with the aim of using them to predict future actions with reasonable accuracy and an allowable time delay. Here, the security and privacy challenges of big data in each phase of its three-phase lifecycle are evaluated, as explained in Figure 3.

1) Big data acquisition
Different data processing architectures for big data have been suggested to address different properties of big data [27]. Data acquisition can be defined as the process of gathering, filtering, and cleaning data before it is put into a data warehouse or any other storage solution. Overall, due to the above issues, big data, once acquired, could become the carrier of an advanced persistent threat (APT) [28]. APTs thrive in situations where there is a variety of data sources and non-standard data formats for data streams arriving as an ongoing process in social networks [18]. When APT code is hidden in big data, it becomes difficult to detect in real time. Hackers could attack the data source, the destination, and all the connectivity between them by capitalizing on their vulnerabilities, which could result in an enlarged attack launched through a botnet. Therefore, it is important to enforce data security and privacy policies within a real-time big data processing environment during the data acquisition stage itself. It is essential to connect the right endpoints of a network for the data flow, along with sophisticated authentication and privacy policies for big data.
2) Big data storage
With the high rate of data explosion, the storage systems of organizations and companies are facing challenges from massive amounts of data and the ever-increasing volume of generated data [25]. Value can be generated from large data sets. For example [18], Facebook increases its ad revenue by mining its users' personal preferences and creating profiles, showing advertisers which products users are most interested in. Google also uses data from different applications, such as Google Search, YouTube, and Gmail accounts, to profile users' manners and habits. Despite the tremendous benefits that can be acquired from large data sets, the storage and processing that big data requires pose a major challenge. The total size of data generated by the end of 2015 is estimated at 7.9 zettabytes (ZB), and it is expected to reach 35 ZB by 2020, almost five times as much. The key challenges of this phase of the big data life cycle are as follows:
a) Venue: With the growing massiveness of big data, the Volume dimension affects the server infrastructure of an organization. Traditional data warehouses may not suffice, and alternative storage systems such as distributed, cloud, and other outsourced big data servers need to be employed to cope with the volume as well as the increasing Velocity of big data storage [18].
b) Volatility: Structured, unstructured, and semi-structured data accumulate in big data storage with high Volatility from various channels, including online sales transactions, customer feedback and social media messages, indirectly associated with the business operations of an organization [18].
c) Valence: Big data is also shared among multiple related departments for their day-to-day transactions and functional operations.
Hence, the big data connectivity among different data centers, whether in-house, cloud-based, or outsourced, can make the big data quite dense, impacting the Valence dimension of big data [18].
d) Validity: How consistent, precise, reasonable and correct the data is for its intended use. Validity in data gathering means that your findings accurately represent the phenomenon you claim to measure.
According to Forbes, data scientists spend an estimated 60 percent of their time cleansing their data before being able to do any analysis.
Due to the above, the integrity of the data could be affected when multiparty operations take place on the same data storage in huge amounts and in increasing real-time speed [27]. Traditional encryption and security measures to maintain data integrity may not help as multiple mechanisms could be in place. Such a disparate environment could encourage sniffers to reach the servers by exploiting their security policy differences and vulnerabilities. Any misuse of data could lead to privacy leakage. These factors increase the risk of information theft and user privacy infringement.
Traditional access control methods are mainly classified as mandatory, discretionary or role-based, and none of them can be effectively applied to big data storage due to the diversity of users and their permission rights in such a highly dynamic environment [18]. Hence, new trustworthy data access controls must be established, adhering to appropriate security and privacy protection schemes and policies [29]. Good practices for backup and recovery must be followed in dealing with historic data that require archiving or destruction at every stage of the big data life cycle.
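To make the limitation concrete, here is a minimal sketch of a classic role-based check (the roles, users and permission names are hypothetical, not from [18]): the static role-to-permission map is exactly the structure that becomes unmanageable when users, roles and data change as dynamically as they do in a big data environment.

```python
# Minimal role-based access control (RBAC) sketch with hypothetical names.
# The static role -> permission map must be hand-maintained, which is why
# RBAC struggles in highly dynamic big data environments.

ROLE_PERMISSIONS = {
    "analyst": {"read:logs", "read:reports"},
    "admin":   {"read:logs", "read:reports", "write:config", "delete:data"},
}

USER_ROLES = {"alice": {"analyst"}, "bob": {"admin"}}

def is_allowed(user: str, permission: str) -> bool:
    """Grant access if any of the user's roles carries the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_allowed("alice", "read:logs"))    # True
print(is_allowed("alice", "delete:data"))  # False
```

Every new data set, department or permission requires editing these maps by hand, which motivates the risk-adaptive and behavior-based alternatives discussed later in the paper.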

3) Big data analytics
Big Data analytics is the key to data value creation, which is why it is important to focus on this aspect of analytics [18]. Data analytics requires the implementation of an algorithmic or mechanical process that derives insights by looking for significant correlations across several data sets. It is used in several fields to enable better decision-making. The focus of data analytics lies in deduction, the process of deriving conclusions based on what the researcher already knows. Successful implementation of big data analytics requires a combination of skills, people and processes that can work in perfect synchronization with each other. We have identified key challenges in this phase, mapped to the prominent V's of big data, as follows:
a) Volume (the amount of collected data): In today's data-driven organizations that use big data, risk managers and other employees are often overwhelmed by the amount of data that is collected [30].

Proposed Big Data Security and Privacy Strategies
Four main technologies are proposed to comprehensively cope with the 56 V's during the three phases of the big data lifecycle as in [18], namely data acquisition, data storage and data analytics. These are: 1) Data Provenance Technology, 2) Data Encryption and Access Control Technology, 3) Data Mining Technology, and 4) Blockchain Technology.
Some adjustments are made to cope with the large and increased number of V's characteristics, which reached 56 V's rather than only the 11 V's of [18], and a promising proposed method is chosen for each of the four techniques, as explained in the following subsections.

Data Provenance Technology
The first strategy is to adapt data provenance technology to address the security and privacy challenges in the data acquisition phase of big data. In traditional computing systems, the data provenance method was used to determine the source of data in a data warehouse by adopting a labeling technique. With big data, data acquisition involves diverse data sources from the Internet, cloud, social, and IoT networks. While big sensing data streams come with novel encryption schemes, attacks are possible right from the data acquisition phase [18] [31].
Hence, metadata about these data sources such as the data origin, the process used for dissemination and any intermediate calculations could be recorded in order to facilitate mining of the information at the time of data streaming itself.
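As a hedged sketch of the kind of provenance record meant here (the field names are illustrative assumptions, not taken from [18]), each incoming data item can carry metadata about its origin, the process that produced it, and a content hash that makes later tampering evident:

```python
import hashlib
import json
import time

def provenance_record(source: str, process: str, payload: bytes) -> dict:
    """Attach provenance metadata (origin, processing step, content hash,
    acquisition time) to an incoming data item so it can be audited and
    mined for anomalies later. Field names are illustrative."""
    return {
        "source": source,                               # where the data came from
        "process": process,                             # how it was acquired/cleaned
        "sha256": hashlib.sha256(payload).hexdigest(),  # tamper-evidence
        "acquired_at": time.time(),                     # acquisition timestamp
    }

rec = provenance_record("sensor-17", "filter+clean", b'{"temp": 21.4}')
print(json.dumps(rec, indent=2))
```

Recording such records at streaming time is what makes it possible to run the anomaly detection of the next paragraphs over the acquisition phase itself, at the cost of the growing provenance graphs the text warns about.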
Hence, the proposed technique is to adapt data provenance technology so that data analytics techniques can be used effectively for detecting anomalies in the data acquisition phase of big data. However, collecting provenance metadata must adhere to privacy compliance. Another important issue is that it could become complex, with application tools generating increasingly large provenance graphs for establishing metadata dependencies [18].
In this paper, a promising algorithm by Zirije Hasani and Samedin Krrabaj [18] [33] is chosen. The proposed algorithm enhances forecasting models; the authors give a short description of the algorithms and categorize them by type of prediction as predictive and non-predictive [33]. They implement a Genetic Algorithm (GA) to periodically optimize the model parameters:
• Starting from the ideas of numerous papers [35] [36] [37], the GA optimization process is used to optimize α, β, γ, ω, the HW and TDHW smoothing parameters, with the added optimization of three new parameters k, n and δ [33];
• An improvement is made in the new definition of the optimization function, based on input training datasets with annotated anomaly intervals and an enhanced version of Hyndman's MASE [38], where k and n define the two sliding-window intervals and δ is the threshold parameter [33];
• A positive feedback learning process is achieved if the anomalies detected in the next time frame by the proposed detection engine, based on the optimal parameters computed from the annotated anomalies of the previous frame, are verified/acknowledged by a human and reused for parameter optimization.
The data used for the experiments are the well-known anomaly detection benchmarks NUMENTA [39] and Yahoo [40], with annotated anomalies, and real log data from the Macedonian national education system e-dnevnik [33].
Based on the experimental evaluation of detection rate and precision, performed on sets of synthetic and real periodic data streams, it can be concluded that the proposed HW with GA-optimized parameters (α, β, γ, δ, k, n) and with improved MASE outperforms the other algorithms. The same cannot be concluded for TDHW with GA optimization. Due to the HW iterative procedures, the detection time is appropriate for real-time anomaly detection. Optimization with GA is also rather fast, requiring a rather small number of iterations (about 25 - 30 iterations are needed to recognize all tagged anomalies in the training sets); it can be done in batch mode on training sets, as can re-optimization with verified newly detected anomalies [33].
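The core of the HW approach can be sketched compactly. The following is a simplified illustration of additive Holt-Winters one-step-ahead forecasting with a fixed anomaly threshold δ; in [33] the parameters α, β, γ and δ (plus the sliding-window parameters k, n) are tuned by a genetic algorithm, whereas here they are set by hand and the sliding windows and MASE objective are omitted.

```python
def holt_winters_additive(series, m, alpha, beta, gamma, delta):
    """Additive Holt-Winters (HW) one-step-ahead forecasting over a series
    with season length m. A point is flagged as an anomaly when the
    absolute forecast error exceeds the threshold delta.
    Simplified sketch: in [33] the parameters are GA-optimized."""
    level = sum(series[:m]) / m                                   # initial level
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)    # initial trend
    seasonals = [series[i] - level for i in range(m)]             # initial season
    anomalies = []
    for t in range(m, len(series)):
        forecast = level + trend + seasonals[t % m]
        if abs(series[t] - forecast) > delta:
            anomalies.append(t)                                   # flag outlier
        last_level = level
        level = alpha * (series[t] - seasonals[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (series[t] - level) + (1 - gamma) * seasonals[t % m]
    return anomalies

series = [10, 20, 30, 20] * 6       # clean periodic signal, season length 4
series[13] += 50                    # inject one anomalous spike
print(holt_winters_additive(series, 4, 0.3, 0.05, 0.1, 20.0))  # → [13]
```

The GA in [33] replaces the hand-picked parameter values above by searching for the combination that best recognizes the annotated anomaly intervals in the training data.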

Data Encryption and Access Control Technology
The second strategy is to adapt advanced encryption techniques and access control schemes in big data storage systems [18]. Contemporary schemes such as homomorphic, attribute-based, and image encryption are being explored to ensure that sensitive private data is secured in cloud and other big data storage and service platforms [41] [42]. Even though homomorphic encryption allows some operations on encrypted data without decrypting it, the computing efficiency and scalability of homomorphic encryption schemes need improvement in order to handle big data. On the other hand, the attribute-based encryption technique is regarded as more appropriate for end-to-end security of private data in the cloud environment, since decryption of the encrypted data is possible only if a set of attributes of the user's secret key matches the attributes of the encrypted data [42]. One of the major challenges of this scheme is the implementation of revocation, since each attribute may belong to multiple different sets of users [43] [44]. Anonymizing data with a hidden key field could be useful for privacy protection; however, using data analytics such as correlation of data from multiple sources, an attacker would be able to identify the anonymized data. Hence, in addition to having good cryptographic techniques to ensure the privacy and integrity of active big data storage, proof of data storage needs to be continuously ensured. Another important aspect is to provide proof of the archived data storage in order to verify that files are not deleted or modified by attackers. Hadoop has become a promising platform to reliably process and store big data, and the chosen Attribute Based Honey Encryption (ABHE) approach [45] produces data that is difficult to decrypt by any unauthorized access.
• User authorization access is based on a user-defined policy that reflects the overall organizational structure and also depends upon a set of attributes within the system.
• With the proposed algorithm, the security of data is not only dependent on the secrecy of encryption algorithm but also on the security of the key. This provides dual layer security for the data.
This proves that the proposed technique is fast enough to secure the data without adding delay. The proposed ABHE algorithm also has a higher throughput, which proves its applicability to big data. It provides a feasible solution for secure communication from one Data Node to another. The proposed encryption technique does not increase the file size; it therefore saves memory and bandwidth, and hence reduces network traffic. It is also able to encrypt structured as well as unstructured data under a single platform. Only the HDFS client can encrypt or decrypt data, with the correct attributes and password. The proposed technique provides dual-layer security for every Data Node, as data is not confined to a specific device and clients can access the system and data from anywhere. This encryption approach may be reckoned a premise for envisioning and designing even more robust approaches to ensure optimum security of big data.
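The two properties emphasized above can be illustrated with a toy sketch: the key is derived from both an attribute set and a password ("dual-layer" security), and the ciphertext is exactly the plaintext's length (no file-size increase). This is emphatically not the actual ABHE construction of [45]; the key derivation and the XOR keystream cipher here are illustrative assumptions only, and real honey encryption additionally makes wrong-key decryptions look plausible.

```python
import hashlib

def derive_key(attributes: frozenset, password: str) -> bytes:
    """Key depends on BOTH the attribute set and the password (dual layer)."""
    material = "|".join(sorted(attributes)) + "#" + password
    return hashlib.sha256(material.encode()).digest()

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR with a SHA-256-derived keystream.
    Ciphertext length equals plaintext length. Illustration only."""
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, keystream))

attrs = frozenset({"dept:finance", "role:analyst"})
plaintext = b"quarterly revenue: 1.2M"
ct = xor_stream(derive_key(attrs, "s3cret"), plaintext)
# Decryption succeeds only with the same attributes AND password:
print(xor_stream(derive_key(attrs, "s3cret"), ct) == plaintext)   # True
print(xor_stream(derive_key(attrs, "wrong"), ct) == plaintext)    # False
print(len(ct) == len(plaintext))                                  # True
```

A wrong attribute set or a wrong password derives a different key and yields garbage, which mirrors the attribute-plus-password gate that the ABHE scheme enforces at the HDFS client.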

Data Mining Technology
The third strategy is to adapt data mining techniques within big data analytics to intelligently perform behavior mining of access controls, authentication and incident logs [18].
Data mining technologies are on the rise for identifying vulnerabilities and risks in big data and for predicting threats, as a preventive technique against possible malicious attacks [46] [47] [48]. Role mining algorithms automatically extract and optimize roles that can be generated from users' access records, in order to efficiently provide personalized data services for mass users.
However, in a big data environment it is important to track the dynamic changes and the quality of data pertaining to the roles assigned to users, and of the roles related to the permission sets that simplify rights management. In a big data environment, it may not be possible to specify accurately which data users can access. In such a context, adopting risk-adaptive access controls using statistical methods and information theory would be applicable; however, defining and quantifying the risks is quite difficult. Hence, authentication based on the behavior characteristics of users could be adopted, but the big data system needs to be trained with the training dataset as a continuous process. Incident logs pertaining to Intranets, the Internet, social and IoT networks, as well as email servers, can be analyzed to detect abnormal behavior or anomaly patterns using appropriate data mining techniques [18] [32] [49]. While traditional threat analysis cannot cope with big data, by using behavior mining of metadata of the various resource pools related to big data, anomalies can be analyzed to predict threats such as an APT attack. In behavior mining, trend analysis is performed, and pattern proximity is measured to define the relation between datasets. A distance function is usually used to measure the pattern proximity [28]. The distance function defines the proximity between two datasets based on their attributes. A group of datasets with the minimum distance between them belongs to the same cluster.
The most popular general distance function $d_{ij}$ between two datasets $x_i$ and $x_j$ with $p$ attributes is the Minkowski distance metric in the normed vector space of order $m$, used to calculate the pattern proximity as follows:

$$d_{ij} = \left( \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right|^{m} \right)^{1/m}$$

When $m = 2$, the Minkowski distance reduces to the commonly used Euclidean distance metric:

$$d_{ij} = \sqrt{ \sum_{k=1}^{p} \left( x_{ik} - x_{jk} \right)^{2} }$$

The Euclidean function works well when the datasets exhibit compact or isolated clusters and is suitable for patterns with multiple dimensions [18]. Big data security can be enhanced by studying the pattern proximity to predict threats, by training with similarity metrics of distances between anomaly datasets and normal datasets based on server/network logs, historical incident data and social media data. However, threat detection schemes require scalability and interoperability for the big data environment. A common way of protecting servers from malicious attacks is to record server logs that are analyzed for anomalies. In such contexts, a MapReduce model for distance-based anomaly analysis can be deployed to find the k nearest neighbors of a data point and to use its total distance to the k nearest neighbors as the anomaly score in a data mining algorithm [51]. In order to perform the anomaly detection task, the MapReduce functionality can be divided into two jobs, as given below:
• A MapReduce job to find the pairwise distances between all data points.
• A MapReduce job to find the k nearest neighbors of a data object and to find the weight of the object with respect to the k nearest neighbors.
Data objects can be partitioned by hashing the object ID, and all possible combinations of hash values for every pair of object types can be considered [18]. The Euclidean distance computation between objects for each hash-value pair can be distributed among the reducers, with the mapper output key being a function of the two hash values. However, a configurable number of hash buckets should be chosen appropriately to distribute the load uniformly across the reducers using Hadoop's default reducer partitioner [18]. The parallel processing feature of Hadoop speeds up the processing time, resulting in efficient real-time data mining of large log-file datasets for detecting anomalous events on the server.
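The quantity the two MapReduce jobs compute at scale, pairwise Euclidean distances and the total distance to the k nearest neighbors as an anomaly score, can be sketched in-memory for a handful of points (a sequential sketch, not the distributed Hadoop implementation):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points with equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_anomaly_scores(points, k):
    """Anomaly score of each point = total distance to its k nearest
    neighbours; isolated points accumulate large scores."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]))       # sum over the k closest neighbours
    return scores

points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]  # one obvious outlier
scores = knn_anomaly_scores(points, k=2)
print(max(range(len(points)), key=scores.__getitem__))  # → 4 (the outlier)
```

In the MapReduce version, the first job would emit the pairwise distances keyed by hash buckets and the second would aggregate each object's k smallest distances, exactly as the two bullet points above describe.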
In this paper, we chose the proposed method for anomaly detection in log files based on data mining techniques for dynamic rule creation [52]. To support parallel processing, this method employs the Apache Hadoop framework, providing distributed storage and distributed processing of data. The outcomes of its testing show the potential to discover new types of breaches, with plausible error rates below 10%.

Blockchain Technology
The fourth strategy is to adapt a distributed trusted system based on blockchain for big data security and privacy protection [18]. Blockchain technology has proven to be a new mode of trusted interaction and exchange of information, eliminating intermediate parties and supporting direct communication between two parties in a network through replication of information and validation processes [53]. In a big data context, each data item or record in a database is a block containing transaction details, including the transaction date and a link to the previous block. The integrity of data is maintained in blockchain technology, because corrupted data cannot enter the blockchain: checks are carried out continuously, in search of patterns in real time, by the various computers on the network. Blockchain also allows data to be shared more wisely, as contracted by the users, thereby preventing cybercrime and data leakage. Blockchain data could also provide valuable insights into behaviors and trends and can be used for predictive analytics. However, the technology also has its challenges:
• Irreversibility: encrypted data may be unrecoverable when the private key of a user is lost.
• Adaptability challenges: organizations need to adapt the technology to integrate it into their existing supply chain systems, which may require big change management and a steep learning curve.
• Current limitations: there are high operational costs associated with running blockchain technology, as it requires expert developers, substantial computing power, and revamped resources to cater to its storage limitations.
• Risks and threats: while blockchain technology greatly addresses the security challenges of big data, it is not threat-proof. If attackers are able to penetrate the majority of the network, there is a risk of losing the entire database.
In this paper, blockchain implementations are classified, using [57], into three categories [58] [59] that differ from each other in the permission levels assigned to different categories of participants.
• Public blockchains are accessible to all participants, anywhere in the world.
Anyone can join or leave the network at any time, record a transaction, take part in the validation of the blocks or obtain a copy of them, without any previous control.
• Permissioned blockchains have rules that set out who can take part in the validation process or even register transactions. Depending on the rules set, they can be more or less open to participants.
• Private blockchains restrict participation to a single organization or a closed set of members.
Due to Bitcoin and similar cryptocurrencies, the first category is the best known, as these digital currencies tend to operate on public blockchains, where any participant in the network can see all transactions already made and update the ledger with new ones. This is also the riskiest type of blockchain, according to [60]. Permissioned blockchains allow any user to see the history of transactions, but only selected members can update it. Because this model contains more restrictive rules about who can participate, observe and validate transactions, it is emerging in industry sectors, being used for the exchange of tangible and intangible assets between enterprises. Finally, according to some experts [3], the parameters of private blockchains do not respect the traditional properties of blockchains, such as decentralization and shared validation. In any case, private blockchains do not raise specific issues regarding their compliance with the EU GDPR; for such cases, traditional distributed databases can be considered.
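The integrity property described above, that each block links to the previous block's hash so that altering history breaks every later link, can be shown with a minimal sketch (a toy in-memory chain, with no networking, consensus or permission handling):

```python
import hashlib
import json

def block_hash(block):
    """Hash of the block body (everything except the stored hash itself)."""
    body = {k: block[k] for k in ("transactions", "timestamp", "prev_hash")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def make_block(transactions, prev_hash, timestamp=0):
    """Each block records its transactions, a timestamp and a hash link to
    the previous block (timestamp fixed here for reproducibility)."""
    block = {"transactions": transactions, "timestamp": timestamp,
             "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    return block

def chain_is_valid(chain):
    """Integrity check: every block's stored hash must match its contents,
    and every link must point at the previous block's hash."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False                      # block contents were altered
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False                      # hash link is broken
    return True

genesis = make_block(["genesis"], "0" * 64)
chain = [genesis, make_block(["alice->bob:5"], genesis["hash"])]
print(chain_is_valid(chain))                          # True
chain[0]["transactions"] = ["mallory->mallory:999"]   # tamper with history
print(chain_is_valid(chain))                          # False
```

Tampering with any stored block invalidates its own hash (and, for earlier blocks, every subsequent link), which is the mechanism that keeps corrupted data out of the chain; the public/permissioned/private distinction concerns who is allowed to run this validation and append new blocks.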

Conclusion and Future Work
We introduced big data challenges from more than one view. One view divides the challenges into categories: some stem from the characteristics of BD, some from its existing analysis methods and models, and some from the limitations of current data processing systems. Another view holds that appropriate protection and privacy need to be enforced throughout the phases of the big data lifecycle. To address these challenges, we propose the use of four main technologies to comprehensively cope with the 56 V's during the three phases of the big data lifecycle, namely data acquisition, data storage and data analytics. The first technology we propose to use is Data Provenance Technology.
Here, we chose a promising algorithm by Zirije Hasani and Samedin Krrabaj [33] that is evaluated on the well-known anomaly detection benchmarks NUMENTA and Yahoo, with annotated anomalies, and on real log data generated by the national education information system. Based on the experimental evaluation of detection rate and precision, performed on sets of synthetic and real periodic data streams, we can conclude that the proposed HW with GA-optimized parameters (α, β, γ, δ, k, n) and with improved MASE outperforms the other algorithms. The second technology we propose to use is Data Encryption and Access Control Technology, adapting advanced encryption techniques and access control schemes in big data storage systems. We chose the Attribute Based Honey Encryption (ABHE) methodology to solve the issue of data security in Hadoop storage, as in [45]. It shows considerable improvement in performance during the encryption and decryption of files. This approach works on files that are encoded inside the HDFS and decoded inside the Mapper, which proves that it is fast enough to secure the data without adding delay. It also has a higher throughput, which proves its applicability to big data. The proposed ABHE algorithm provides dual-layer security for every Data Node, as data is not confined to a specific device and clients can access the system and data from anywhere. The third technology we propose to use is Data Mining Technology, where we intend to adapt data mining techniques within big data analytics to intelligently perform behavior mining of access controls, authentication and incident logs [18]. We chose the proposed method for anomaly detection in log files based on data mining techniques for dynamic rule creation [52]. To support parallel processing, this method employs the Apache Hadoop framework, providing distributed storage and distributed processing of data.
The outcomes of its testing show the potential to discover new types of breaches, with plausible error rates below 10%. Compared to the Java implementation of this method, the single-node Hadoop cluster implementation performs more than ten times faster. The fourth technology we propose to use is Blockchain Technology, where we intend to adapt a distributed trusted system based on blockchain for big data security and privacy protection. Blockchain implementations are classified, using [57], into three categories that differ in permission levels, named Public Blockchains, Permissioned Blockchains, and Private Blockchains. This classification gives everyone different privileges according to their needs. This paper (BD2) is considered complete in our series of big data papers. We started the series with the paper titled (BD1) [1], which introduces excellent detail about old and new big data V's characteristics and their applications.
Future work, which is already in progress, is "Using Hadoop Technology to Overcome Big Data Trials (BD3)".

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.