Analysis of Computer Network Reliability and Criticality: Technique and Features

The paper describes modern techniques for the reliability analysis of computer networks. A software tool has been developed to estimate the probability of critical failures of a corporate computer network (CCN) by constructing a criticality matrix from the results of the FME(C)A-technique. Internal information factors, such as collisions and congestion of switches, routers and servers, influence network reliability and safety in addition to hardware and software reliability and external extreme factors. The means and features of Failure Modes and Effects (Criticality) Analysis (FME(C)A) for the reliability and criticality analysis of CCNs are considered. An example of the FME(C)A-technique applied to a structured cabling system (SCS) is given. We also discuss measures that can be used for criticality analysis and possible means of criticality reduction. Finally, we describe a technique and basic principles for the dependable development and deployment of computer networks, based on the results of FMECA analysis and on procedures for the optimal choice of fault-tolerance means.


Introduction
Many formalized dependability assessment techniques have been developed during the last decade, based on failure criticality analysis (FME(C)A), construction of event and fault trees (FTA), hazard and operability studies (HAZOP) [1,2], etc. The International Standard [3] describes Failure Mode, Effects and Criticality Analysis (FMECA) and gives guidance on how it may be applied to achieve various objectives by: providing the procedural steps necessary to perform an analysis; identifying appropriate terms, assumptions, criticality measures and failure modes; defining basic principles; and providing examples of the necessary worksheets and other tabular forms. FME(C)A is a methodology to identify and analyze the potential failure modes of the various parts of a system and the effects these failures may have on the system. The purpose of the FME(C)A-technique is to specify the modes, sources and effects of critical failures, including multiple and dependent failures, and to assess the methods and means of ensuring CCN fault tolerance and safety. It includes four main steps.
1) Analysis of a system structure and possible failures of different systems.
2) Analysis of the failure modes and effects. As a result, the FMEA-table should be built.
3) Qualitative analysis of failure criticality on the basis of the probability of occurrence and the severity of the failures. As a result, the criticality matrix should be built.
4) Identification of the most critical failures as those that lie above the established criticality diagonal.
FME(C)A is used to identify, prioritize, and eliminate potential failures from the system, design or process before they reach the customer. FME(C)A is a technique to "resolve potential problems in a system before they occur". However, this technique has to be adapted to the features of the system.
Ensuring the safety and fault tolerance of CCNs for critical applications (CA), such as NPP I&C systems, airspace control systems and banking systems, is a topical and important problem. The use of the FME(C)A-technique [3] makes it possible to identify the critical failures and failure effects for CCNs for critical applications (CCNCA) and other kinds of CCNs, to detect safety threats, and to determine whether redundancy and other means of raising the probability of accident-free failure effects are needed.
The purpose of this paper is to analyze the features of applying the FME(C)A-technique to corporate computer networks, which are the core of distributed information and control systems (I&CS).
The appropriateness of the FME(C)A-technique is also confirmed by publications that apply it to security assessment, using the so-called F(I)MEA (Failure (and Intrusion) Modes and Effects Analysis) technique, and to the analysis of failure effects from the point of view of recovery time [4,5].

Features of FME(C)A-Technique Application for CCN Dependability Analysis
Applying failure mode and effects analysis (FMEA) and failure mode, effects and criticality analysis (FME(C)A) to assess the reliability of critical-application systems makes it possible to identify failures and their effects and to determine whether element redundancy and other measures that raise the probability of failure-free operation are needed [6,7]. The tasks of ensuring the reliability of computer networks that are based on open standards and models (for example, the OSI or TCP/IP models) and are used for critical applications according to the COTS approach [8] are solved at the various layers of these models. A distinctive feature of networks is that their failures stem from four basic causes: defects in the design and production of the network hardware and software; aging of the physical network components; objective and subjective external extreme factors (EEF) such as seismic loads, electromagnetic disturbance (ED), human errors, hacking, etc.; and internal information factors, namely periodic increases of network traffic and the resulting congestion of switches, routers and servers. The basic functional elements of a network that may be analyzed with the FME(C)A-technique are the SCS, passive and active telecommunication devices such as hubs, switches and routers, servers, workstations, etc., which work at various layers of the OSI or TCP/IP models and may fail due to the four causes mentioned above. However, applying the FME(C)A-technique to the evaluation of reliability and fault tolerance under traffic overloads, unauthorized operations or human errors requires a separate discussion and is not considered in this paper. The objects of FME(C)A are, as usual, the hardware and software components of I&CS. There is a modification of the FME(C)A method for software, SFME(C)A [9]. In [10] it is proposed to apply FME(C)A to hierarchical structures and to map them to a hierarchy of FME(C)A-tables.

Results of Applying the FME(C)A-Technique to CCN Reliability Analysis
The classification of failure modes, causes, effects and means of ensuring safety and fault tolerance for the network functional elements is obtained in the FME(C)A format. The various means of ensuring the safety and fault tolerance of the network hardware and software are listed in the last column of the table. The probability and the severity of each failure mode of a specified computer network are determined on the basis of statistical information or expert estimates. This makes it possible to construct a criticality grid and, with its help, to perform a qualitative analysis of CCN reliability and to determine the set of the most critical failures and the means for recovering from them. The use of the FME(C)A-technique is illustrated by an analysis of the National Airspace University computer network. Figure 1 shows the university structured cabling system (SCS) [11]. As an example, the backbone subsystem was analyzed: its FME(C)A-table was obtained (Table 1) and its criticality matrix was constructed (Table 2).
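The construction of a criticality grid from an FME(C)A-table can be sketched as follows. This is a minimal illustration and not the authors' tool: the level names, the `FailureMode` record and the sample rows are hypothetical and are not taken from Table 1 or Table 2.

```python
from collections import defaultdict
from dataclasses import dataclass

# Ordinal scales, lowest to highest (assumed labels, not from the paper).
PROBABILITY_LEVELS = ["low", "moderate", "high", "very high"]
SEVERITY_LEVELS = ["minor", "major", "critical", "catastrophic"]


@dataclass(frozen=True)
class FailureMode:
    element: str      # network element the failure belongs to
    mode: str         # failure mode
    probability: str  # one of PROBABILITY_LEVELS (statistics or expert estimate)
    severity: str     # one of SEVERITY_LEVELS


def build_criticality_grid(rows):
    """Group failure modes into grid cells keyed by (probability, severity)."""
    grid = defaultdict(list)
    for row in rows:
        if row.probability not in PROBABILITY_LEVELS:
            raise ValueError(f"unknown probability level: {row.probability}")
        if row.severity not in SEVERITY_LEVELS:
            raise ValueError(f"unknown severity level: {row.severity}")
        grid[(row.probability, row.severity)].append(row)
    return dict(grid)


# Hypothetical FME(C)A-table rows, for illustration only.
fmea_table = [
    FailureMode("backbone cable", "cable break", "low", "critical"),
    FailureMode("patch panel", "connector contact loss", "moderate", "major"),
    FailureMode("backbone cable", "signal attenuation", "low", "critical"),
]

grid = build_criticality_grid(fmea_table)
for (probability, severity), modes in sorted(grid.items()):
    print(probability, severity, [m.mode for m in modes])
```

Keeping the levels ordinal rather than numeric mirrors the qualitative character of the analysis: the grid only records which cell each failure mode falls into.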
Figure 2 shows a hierarchical approach to the FME(C)A analysis of the computer network of the National Airspace University "Kh.A.I.".

Failures Criticality Analysis
The second step of the FME(C)A technique is a criticality analysis of all failure modes. It is performed in order to identify the most serious failures and to determine ways in which the criticality of these failures can be reduced (Figure 3).
Two common measures are used for criticality assessment: the probability of failure occurrence and the severity of the failure effects. The critical failures are those that lie above the criticality diagonal (see Figure 3). The criticality diagonal itself has to be set taking into account the system reliability requirements or the system safety level. For example, six different criticality diagonals in total can be set in the criticality matrix shown in Figure 3. The higher the criticality diagonal, the more critical the system.
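The role of the criticality diagonal can be illustrated with a short sketch. It assumes that probability and severity are encoded as integer level indices starting at 0 and that a diagonal is a threshold on their sum; both the encoding and the sample failure modes are assumptions made for illustration. With four levels per axis (also an assumption), the index sum ranges from 0 to 6, so thresholds 1 to 6 give six non-trivial diagonals.

```python
def is_critical(prob_index: int, sev_index: int, diagonal: int) -> bool:
    """A failure is critical when its index sum reaches the diagonal threshold."""
    return prob_index + sev_index >= diagonal


def critical_failures(failures, diagonal):
    """Return the names of the failures lying on or above the chosen diagonal."""
    return [name for name, p, s in failures if is_critical(p, s, diagonal)]


# Hypothetical failure modes: (name, probability index, severity index).
failures = [
    ("cable break", 0, 3),
    ("connector contact loss", 1, 1),
    ("switch port failure", 2, 2),
]

# In this encoding a lower threshold marks more cells as critical.
strict = critical_failures(failures, diagonal=2)
relaxed = critical_failures(failures, diagonal=4)
print(strict)
print(relaxed)
```

In this convention the choice of threshold is the only place where the reliability or safety requirements of the system enter the classification.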
In this paper we also propose an additional, third measure for assessing failure criticality, which describes the duration of system inoperability [12]. It is very important for computer and telecommunication systems, where a small number of incorrect connections (due to incorrect routing) is tolerable, whereas high availability of the network is required. This measure depends on the recovery time, which can be reduced by using automated (computer-aided) recovery means instead of manual operations, or automatic (unattended) means instead of automated ones (Figure 4). For computer networks these means include dynamic routing, which is preferable to static routing, the spanning tree protocol instead of manual recovery, etc.
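The effect of recovery time on availability can be illustrated with the standard steady-state formula A = MTBF / (MTBF + MTTR). This formula is a common reliability-engineering relation and is used here as a stand-in; it is not necessarily the measure defined in [12]. The MTBF value and the recovery times for manual, automated and automatic recovery are hypothetical.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: share of time the system is operable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of inoperability per year."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical values, for illustration only.
MTBF_HOURS = 10_000.0
RECOVERY_HOURS = {
    "manual": 4.0,            # operator locates and repairs the fault
    "automated": 0.5,         # computer-aided recovery with operator involvement
    "automatic": 30 / 3600,   # unattended reconfiguration, about 30 seconds
}

downtime = {
    kind: annual_downtime_hours(MTBF_HOURS, mttr)
    for kind, mttr in RECOVERY_HOURS.items()
}
for kind, hours in downtime.items():
    print(f"{kind:>9}: {hours:.3f} h/year")
```

With the failure rate held fixed, the expected downtime shrinks with the recovery time, which is why the class of recovery means affects this third measure directly.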

Means of Failure Criticality Reduction
Many techniques can be used for failure criticality reduction, for example: the Patch View System, which controls the integrity of cabling channels and patch panels at the level of the structured cabling system; Adapter Fault Tolerance (AFT) technology, which provides hot sparing of network adapters; Adaptive Load Balancing (ALB), which distributes network traffic between four server network adapters and four switch ports and also provides AFT; Fast EtherChannel (FEC) technology, which supports flexible channel capacity as well as AFT; the Spanning Tree Protocol (STP) for dynamic network reconfiguration; and dynamic routing protocols such as OSPF and Cisco EIGRP, which support load balancing.
Most of the means mentioned above rely on redundancy of the cabling channels, ports and network equipment. Some technologies also make it possible to increase network throughput by using the existing redundant routes (as in trunk technology) and allow automatic network reconfiguration to isolate failures.
Thus, combining different fault-tolerance mechanisms makes comprehensive and efficient reduction of failure criticality possible. However, all existing means have to be ranked taking into account their cost and effectiveness as well as their compatibility with one another.

Dependable Development and Deployment of Computer Networks

Using FMEA-Technique for Dependable Network Development
To develop and deploy dependable computer networks, the common FMEA-table and criticality matrix describing failure modes and effects have to be detailed, taking into account the actual logical and physical architecture of the particular computer network as well as the set of network hardware, communication protocols and application software used (Figure 5). Two different development strategies are possible. For critical and business-critical applications it is necessary, as a rule, to provide the required level of dependability at the minimum cost, whereas for commercial applications it is important to provide the maximum dependability at a limited cost.
These goals can be achieved by solving an optimization problem that takes into account failure criticality, probability of occurrence, and the cost, effectiveness and failure coverage of the fault-tolerance means. As a result, the particular computer network must be updated with the chosen fault-tolerance means.
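The two strategies can be sketched as a small selection problem solved by exhaustive search. The objective used here, the summed criticality weight of the failure modes covered by the selected means, is a simplifying assumption, and the means, costs, weights and coverage sets are hypothetical; the paper does not specify the optimization model.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Means:
    name: str
    cost: float
    covered: frozenset  # failure modes whose criticality this means reduces


def subsets(items):
    """Yield every subset of items, including the empty one."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def total_cost(selection):
    return sum(m.cost for m in selection)


def coverage(selection, weights):
    """Summed criticality weight of all failure modes covered by the selection."""
    covered = set().union(*(m.covered for m in selection))
    return sum(weights[f] for f in covered)


def cheapest_meeting_target(means, weights, target):
    """Strategy 1: required dependability level at minimum cost."""
    feasible = [s for s in subsets(means) if coverage(s, weights) >= target]
    return min(feasible, key=total_cost, default=None)


def best_within_budget(means, weights, budget):
    """Strategy 2: maximum dependability at limited cost (ties go to the cheaper set)."""
    affordable = [s for s in subsets(means) if total_cost(s) <= budget]
    return max(affordable, key=lambda s: (coverage(s, weights), -total_cost(s)))


# Hypothetical criticality weights and fault-tolerance means.
weights = {"adapter failure": 3, "link failure": 5, "switch failure": 4}
means = [
    Means("AFT", 2.0, frozenset({"adapter failure"})),
    Means("STP", 1.0, frozenset({"link failure"})),
    Means("redundant switch", 5.0, frozenset({"switch failure", "link failure"})),
]

plan_a = cheapest_meeting_target(means, weights, target=8)
plan_b = best_within_budget(means, weights, budget=6.0)
print([m.name for m in plan_a])
print([m.name for m in plan_b])
```

Exhaustive search is adequate only for a handful of candidate means; for larger sets the same formulation could be passed to an integer programming solver.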
The principles proposed are in line with recent research [13] where a functional failure mode, effects and criticality analysis approach is proposed to address the dependability optimization of large and complex systems.

The Principles of Dependable and Secure Deployment of Computer Networks
The dependability and security of a computing system is its ability to deliver, in a timely manner, service that can justifiably be trusted [14]. The typical network faults are physical faults of network equipment and communication media (i.e. the cabling system), configuration errors (e.g. errors in static routing, firewall filtering rules or security policies), design faults, as a rule in software components, and interaction faults of a physical nature (electromagnetic interference) or an informational nature (traffic congestion).
Fault and intrusion tolerance of computer networks, their security and dependability as a whole could be improved using the following principles.
1) Defense in depth and diversity (D&D). Defense in depth implies the joint use of existing intrusion-tolerance and fault-tolerance mechanisms at the different levels of the network architecture (cabling systems, network equipment, network technologies) and at the different layers of the communication model (OSI or TCP/IP) to provide a comprehensive solution for ensuring dependability.
2) Adaptability and update (A&U). The essence of this principle is the dynamic changing of the network architecture and diversity modes according to the observed failures and intrusions. Intelligent monitoring means for detecting failures and intrusions, analyzing them and choosing better network configurations could be used to achieve this.

Conclusions
CCN reliability and safety estimation is a complex task that cannot be solved in isolation from the application area. It is established that internal information factors, such as collisions and congestion of switches, routers and servers, influence network reliability and safety in addition to hardware and software reliability and external extreme factors.
Computer networks are complex systems that contain many elements. Therefore network failures are unavoidable. In this case risk and criticality analysis [15] and survivability and safety assessment [16] are more relevant tasks than evaluation of the probability of failure-free operation.
As computer networks have a multilevel hierarchy, the failures of network elements are, generally, dependent, i.e. the failure effects at one layer of the OSI or TCP/IP models are the sources of new failures at succeeding layers. This feature of computer networks can be taken into account by using layered analysis and representing its results as a hierarchy of FME(C)A-tables. A characteristic feature of active telecommunication devices is that they contain not only hardware but also software components. For the qualitative analysis of software reliability and safety, the Software FME(C)A-technique may be used [17].
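The dependent character of failures across layers can be sketched as a chain of per-layer tables in which the effect recorded at one layer is looked up as a cause at the next. The layer names, table contents and the `propagate` helper are hypothetical and serve only to illustrate the idea of a hierarchy of FME(C)A-tables.

```python
# One simplified FME(C)A-table per layer, lowest layer first:
# failure cause at this layer -> failure effect at this layer.
# Contents are hypothetical, for illustration only.
LAYER_TABLES = [
    ("physical", {"cable break": "link down"}),
    ("data link", {"link down": "frame loss"}),
    ("network", {"frame loss": "route unavailable"}),
]


def propagate(initial_failure, layer_tables):
    """Trace how a failure propagates upward through the layer tables.

    Returns a list of (layer, cause, effect) triples. A layer whose table
    has no entry for the current failure adds nothing to the chain.
    """
    chain = []
    current = initial_failure
    for layer, table in layer_tables:
        effect = table.get(current)
        if effect is None:
            continue  # this layer's table does not list the failure
        chain.append((layer, current, effect))
        current = effect  # the effect becomes the cause for the next layer
    return chain


for layer, cause, effect in propagate("cable break", LAYER_TABLES):
    print(f"{layer}: {cause} -> {effect}")
```

A failure that originates at a higher layer is handled by the same lookup: the lower-layer tables simply do not list it, and the chain starts at the layer where it first appears.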
A software tool has been developed to estimate the probability of critical CCN failures (by constructing a criticality matrix) from the results of the FME(C)A-technique. This tool consists of: a database containing common FME(C)A-tables for the network elements with a priori information; an interactive procedure for FME(C)A analysis and evaluation of the specified network; a procedure for the automatic generation of criticality grids and identification of the most critical network failures; and a procedure for the automatic choice of critical failure recovery and fault-tolerance means. This tool may also be extended with procedures for network simulation and probabilistic assessment of reliability, safety and survivability. Our future research will address the analysis of multiple failures during network development and maintenance and cost-effective means of reducing failure criticality.

Figure 2. Mapping of assessed system hierarchy to hierarchy of FME(C)A-tables.
Figure 3. Criticality matrix (axis label: probability of failure occurrence).

Figure 5. Using FMEA-technique for dependable web services development.

Table 2. Fragment of criticality matrix of university SCS backbone subsystem.
Failure severity is defined by the "weight" of the failure effects on the whole system and depends on the function of the faulty element. For a computer network it can be the degree of connectivity decrease. The probability of failure occurrence is determined by the network service conditions. It can be reduced by using structural redundancy.