Designing Intrusion Detection System for Web Documents Using Neural Network

Cryptographic systems are the most widely used techniques for information security. These systems, however, have their own pitfalls, as they rely on prevention as their sole means of defense. That is why most organizations are attracted to intrusion detection systems. Intrusion detection systems can be broadly categorized into two types: anomaly detection and misuse detection systems. An anomaly-based system detects computer intrusions and misuse by monitoring system activity and classifying it as either normal or anomalous. Misuse detection systems can detect almost all known attack patterns; they are, however, of hardly any use in detecting as yet unknown attacks. In this paper, we use neural networks to detect intrusive web documents available on the Internet. For this purpose, the Back Propagation Neural (BPN) network architecture, one of the most popular network architectures for supervised learning, is applied. Analysis is carried out on an Internet Security and Acceleration (ISA) Server 2000 log to find the web documents that should not be accessed by unauthorized persons in an organization. Many web documents available online may be harmful to an organization. Most of these documents are blocked, but users of the organization still try to access them and may cause problems in the organization's network.


Introduction
Information is the most important resource and must be managed efficiently. Besides management, its protection is also very important, as inadequate protection may lead to economic losses in today's electronic environment. For example, we can control our bank accounts from almost anywhere in the world using a suitable network, such as satellite and cellular phone networks to interact with bank representatives, or the specialized wired ATM networks and the Internet for online banking services. The services supported by networks are very useful and efficient, but they can be subverted by unscrupulous elements for their own benefit. So, a suitable mechanism needs to be employed to protect the information. A survey of fraud against automated teller machines [1] reports that the patterns of fraud depend on who was responsible for implementing and managing the systems. In the USA, if a customer disputes a transaction, it is the responsibility of the bank to prove that the customer is mistaken or lying. This forced the US banks to protect their systems properly. But in Britain, Norway and the Netherlands, the burden of proof lies on the customer: the bank is right unless the customer can prove it wrong. That is why the banks in these countries became careless. Eventually, epidemics of fraud demolished their complacency, while the US banks suffered much less fraud. Although the US banks spent less money on security than their European counterparts, they spent it more effectively [2]. A different kind of incentive failure was seen in early 2000, with distributed denial-of-service attacks against a number of high-profile websites. Those attacks exploited a number of weak machines to launch a large coordinated packet flood at a host. Since many of them flooded the victim at the same time, the traffic was more than the host could handle. Furthermore, because it came from many different sources, it could be very difficult to stop. Varian [3] discusses different kinds of attacks and their effects. The suggestions made in [3] are that the costs of distributed denial-of-service attacks should fall on the operators of the networks from which the flooding traffic originates, and that legal liability should be assigned to the parties best able to manage the risk, as they will develop expertise in computer security and provide the required services to their clients. In the next section, we review intrusion detection systems.

Early Intrusion Detection System
An intrusion occurs when an attacker gains unauthorized access to a valid user's account and performs disruptive behavior while masquerading as that user. The attacker may harm the user's account directly or use it to launch attacks on other accounts or machines. In such a scenario, a useful detection method is to develop "patterns" of the users of a computer system. Early intrusion detection efforts relied on manual review of a system audit trail, an inefficient approach because many systems did not collect enough data to provide an audit trail or failed to protect the data against modification. Studies in [4] show that nearly all large corporations and most medium-sized organizations have installed some form of intrusion detection tool. In [5], misuse detection methods using mobile agents are discussed. Methods for detecting intrusions can be based on anomaly detection or misuse detection. Misuse detection is mainly suitable for reliably detecting known patterns, but it is of hardly any use against as yet unknown attack methods. The mobile agents provide computational security by constantly moving around the Internet and propagating rules to solve misuse detection. The paper [6] discusses an Intrusion Detection System (IDS) architecture integrating both anomaly and misuse detection approaches. This architecture consists of three main modules: an anomaly detection module, a misuse detection module, and a decision support system module. The anomaly detection module uses a Self-Organizing Map (SOM) structure to model normal behavior, and any deviation from the normal behavior is considered an attack. The misuse detection module uses the J.48 decision tree algorithm to classify different types of attacks. The decision support system analyzes and interprets the results of both the anomaly and misuse detection modules. In [7], a strict anomaly detection method is discussed that uses neural networks to great effect. We now review the important approaches used in intrusion detection systems.

Rule Based Intrusion Detection Systems
The basic assumption in rule-based intrusion detection systems is that intrusion attempts can be characterized by sequences of user activities that lead to compromised system states; based on that, the systems predict intrusions. These systems fire rules when audit records or system status information begin to indicate illegal activity. Two major approaches are followed in rule-based intrusion detection: the state-based and the model-based approach. In the former, the rule base is codified using the terminology found in the audit trails, and intrusion attempts are sequences of system states, as defined by audit trail information, leading from an initial, limited-access state to a final compromised state [8]. In the latter, the known intrusion attempts are modeled as sequences of user behavior. The intrusion detection system itself is responsible for determining how an identified user behavior may manifest itself in an audit trail. These systems have many benefits, such as the ability to process large amounts of data, more intuitive explanations of intrusion attempts, and prediction of future actions. Rule-based systems, however, have some limitations. They lack flexibility in the rule-to-audit record representation. Slight variations in an attack sequence can affect the activity-rule comparison to such an extent that the intrusion may not be detected. While increasing the level of abstraction of the rule base does provide a partial solution to this weakness, it also reduces the granularity of the intrusion detection device. A number of non-expert system-based approaches to intrusion detection have been discussed in [9][10][11][12]. Most current approaches to detecting intrusions utilize some form of rule-based analysis. Expert systems are the most common form of rule-based intrusion detection approaches [13][14][15][16]. An expert system consists of a set of rules that encode the knowledge of a human "expert". These rules are used by the system to draw conclusions about the security-related data from the intrusion detection system. Unfortunately, expert systems require frequent updates to remain current. While expert systems offer an enhanced ability to review audit data, the required updates may be ignored or performed infrequently by the administrator. At a minimum, this leads to an expert system with reduced capabilities. At worst, it degrades the security of the entire system by misleading the system's users into believing that the system is secure, even as one of its key components becomes increasingly ineffective over time.

Network-Based and Host-Based Intrusion Detection Systems
A network-based intrusion detection system (NIDS) observes the traffic at specified points in the network and checks that traffic packet by packet in real time to detect intrusion patterns. It can examine activity at any layer of the network, such as the network, transport, and application layers. Network-based systems are generally best at detecting unauthorized outsider access and bandwidth theft or denial of service. When an unauthorized user logs in successfully, or attempts to log in, they are tracked with a host-based IDS. However, detecting unauthorized users before their logon attempt is best accomplished with a network-based IDS. The packets that initiate bandwidth theft attacks can best be noticed with a network-based IDS. Some network-based IDSs are Shadow, Dragon, NFR, RealSecure, and NetProwler. Host-based intrusion detection systems were the first IDSs to be developed and implemented. They collect and analyze data originating on a computer that provides a service, such as a web server. Once the data has been collected from a given computer, it is analyzed. One example of a host-based system is a program that operates on a system and receives application or operating system audit logs. Such programs are highly effective at detecting insider abuse. Residing on the trusted network systems themselves, they are close to the network's authenticated users. If one of these users attempts an unauthorized activity, host-based systems usually detect and collect the most pertinent information in the quickest possible manner. In addition to detecting unauthorized insider activity, host-based systems are also effective at detecting unauthorized file modification. Audit sources for host-based IDSs include Windows NT/2000 Security Event Logs, RDMS audit sources, Enterprise Management systems audit data (such as Tivoli), and UNIX Syslog in their raw forms.
Graph-Based Intrusion Detection System (GrIDS) [17] uses a graphical representation to monitor the activity of the entire network. EMERALD eXpert-BSM, a real-time forward-reasoning expert system, uses a knowledge base to detect multiple forms of system misuse [18]. In [19], a technique is discussed for detecting intrusions at the level of privileged processes. It is reported that short sequences of system calls executed by running programs are a good discriminator between normal and abnormal operating characteristics of several common UNIX programs. Analyzing the system calls made by a program is a reasonable approach to detecting intrusions based on program behavior profiles [20].

Neural Network Based Intrusion Detection Systems
Neural network based intrusion detection systems can be trained to learn patterns in a given environment, which can then be used to detect intrusions by recognizing the patterns of an intrusion. Artificial neural network based methods for intrusion detection are quite popular. Recently, an investigation of unsupervised neural network models, and the choice of the most appropriate one among them for evaluation and implementation, was discussed in [21]. These models can be used for both host-based and network-based intrusion detection systems. One reason for the success of IDSs is the failure of firewalls to prevent many security intrusions; intrusion detection systems can detect many of the intrusions that slip through firewalls. Many anomaly-based and misuse-based intrusion detection techniques have been designed to detect abnormal user behavior [22][23][24][25][26][27]. Artificial neural networks have been suggested as alternatives to statistical analysis [28][29][30]. Statistical analysis involves the statistical comparison of current events to a predetermined set of baseline criteria. Neural networks are specifically discussed as a means to identify the typical characteristics of system users and to identify statistically significant variations from a user's established behavior. Artificial neural networks have also been discussed for use in the detection of computer viruses. In [31], neural networks are discussed as statistical analysis approaches for the detection of viruses and malicious software in computer networks. The neural network intrusion detection (NNID) system [32] uses neural networks to predict the next command a user will enter based on previous commands. We now discuss our neural network based intrusion detection system.

Audit Logs Analysis Using Neural Networks
In this work, we collect data from the ISA 2000 Web Access Log and analyze it for possible intrusion attacks using the back propagation neural (BPN) network model. Different numbers of hidden layers are considered in the BPN algorithm.

ISA 2000 Web Access Log Analysis
Internet bandwidth is consumed by a variety of Internet application protocols. The most popular application layer protocol for accessing Internet resources is HTTP, which is used to access resources on the World Wide Web. Although the cost of bandwidth per kilobyte or per megabyte has come down over the years, the amount of bandwidth consumed by users on the campus network increases year after year. HTTP connections to Internet resources not only increase bandwidth usage, they also reduce the amount of bandwidth available on the Internet link for other important protocols and applications, such as SMTP, POP3 and VPN. To provide the desired data resources to users, the data is stored at different locations on servers of some kind.
To further help the user in a computer network environment, proxy servers are employed. A proxy server is a server (a computer system or an application program) that services user requests by making requests to other servers. A user connects to the proxy server and requests a file, connection, web page, or other resource available from a different server. In an enterprise that uses the Internet, a proxy server acts as an intermediary between a workstation user and the Internet so that the enterprise can ensure security, administrative control, and caching service. It can receive a request for an Internet service (such as a web page request) from a user. If the request passes the filtering requirements, the proxy server, assuming it is also a cache server, looks in its local cache of previously downloaded web pages. If the desired pages are there, it returns them to the user without needing to forward the request to the Internet. If the required pages are not in the cache, the proxy server, acting as a client on behalf of the user, uses one of its own IP addresses to request the pages from the server out on the Internet. When the pages are received, the proxy server forwards them on to the user.

r-host
The domain name for the remote computer that provides service to the current connection.

r-ip
The network IP address of the remote computer that provides service to the current connection.

r-port
The reserved port number on the remote computer that provides service to the current connection.

time-taken
The total time, in milliseconds, that is needed by ISA Server to process the current connection.

cs-bytes
The number of bytes sent from the remote computer and received by the client during the current connection.

sc-bytes
The number of bytes sent from the client to the remote computer during the current connection.

cs-protocol
The application protocol used for the connection.
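The proxy behaviour described in this section (filter the request, look in the local cache, and forward to the Internet only on a cache miss) can be sketched as follows. This is a minimal illustration, not ISA Server's implementation: the class `CachingProxy` and the callables `fetch` and `is_allowed` are hypothetical names introduced here.

```python
from typing import Callable, Dict


class CachingProxy:
    """Toy model of the filter -> cache lookup -> forward sequence."""

    def __init__(self, fetch: Callable[[str], str],
                 is_allowed: Callable[[str], bool]) -> None:
        self._fetch = fetch            # hypothetical: retrieves a page from the remote server
        self._is_allowed = is_allowed  # hypothetical: the enterprise's filtering rule
        self._cache: Dict[str, str] = {}

    def request(self, url: str) -> str:
        # Step 1: the request must pass the filtering requirements.
        if not self._is_allowed(url):
            raise PermissionError(f"request blocked by filter: {url}")
        # Step 2: serve from the local cache when the page is already there.
        if url in self._cache:
            return self._cache[url]
        # Step 3: cache miss, so the proxy acts as a client on the user's behalf.
        page = self._fetch(url)
        self._cache[url] = page
        return page
```

A second request for the same URL is answered from `_cache` without calling `fetch` again, which is where the bandwidth saving comes from.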

ISA Server 2000 Web Access Log
Internet Security and Acceleration (ISA) Server 2000 can help reduce overall bandwidth usage and cost by caching web content on the server. We use the Microsoft ISA Server 2000 log to monitor and analyze the status of the web proxy requests and to find the documents that are of no use to the organization. Table 1 shows the attributes used in the ISA Server 2000 log file.

Experiment
The input data is collected in terms of the attributes mentioned above. Table 2 contains the values of the input data.
The data shown in Table 2 is not a valid input pattern for the BPN. Before the data is provided to the BPN for training, it needs to be converted into a valid pattern. We perform the following steps to produce a valid input for the BPN.
• Select the IP address part of the destination web server and convert it into an integer without delimiters. For example, the IP address 216.239.63.83 is converted into 2162396383. This is a long number, which in itself is not a valid input pattern for the BPN.
• Normalize the input pattern to real numbers. The input data pattern after normalization is shown in Table 3. The first column shows the normalized IP addresses, and the second column shows 1 for a valid IP address and 0 for an invalid IP address.
• Train the BPN on this input pattern with different numbers of hidden layers. We use 2, 5 and 10 hidden layers. The number of epochs is taken as 50,000.
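The first two preprocessing steps above can be sketched in Python. The delimiter removal follows the example in the text (216.239.63.83 becomes 2162396383). The normalization rule is not stated in the paper, so dividing by 10^12 (one more than the largest possible 12-digit value, 255255255255, rounded up to a power of ten) is purely an assumption made here to map every value into [0, 1). The function names are illustrative.

```python
from typing import Iterable, List

# Assumed scale factor: the longest delimiter-free address has 12 digits.
ASSUMED_SCALE = 10 ** 12


def ip_to_integer(ip: str) -> int:
    """Concatenate the four octets without dots, as in the text's example."""
    octets = ip.strip().split(".")
    valid = len(octets) == 4 and all(
        o.isascii() and o.isdigit() and 0 <= int(o) <= 255 for o in octets
    )
    if not valid:
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return int("".join(octets))


def normalize(values: Iterable[int], scale: int = ASSUMED_SCALE) -> List[float]:
    """Map delimiter-free integers to real numbers in [0, 1) (assumed rule)."""
    return [v / scale for v in values]


# The address used as the example in the text.
as_integer = ip_to_integer("216.239.63.83")   # 2162396383
as_input = normalize([as_integer])[0]
```

Note that dropping the delimiters is not a one-to-one mapping (for instance, 1.23.4.5 and 12.3.4.5 both give 12345), so distinct addresses can share an input value under this encoding.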

Results
The neural network was trained using the back propagation algorithm for 50,000 iterations over the selected training data. After training the BPN, the following results are obtained.
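To make the training procedure concrete, the following is a minimal back propagation sketch in plain Python. It is not the authors' implementation. It assumes a single normalized input, one hidden layer whose size is varied, sigmoid activations, a squared-error loss, and per-pattern weight updates; the class name `TinyBPN`, the learning rate and the weight initialization range are all choices made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyBPN:
    """One input -> n_hidden sigmoid units -> one sigmoid output."""

    def __init__(self, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # input -> hidden weights
        self.b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # hidden biases
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # hidden -> output weights
        self.b2 = rng.uniform(-1, 1)                             # output bias

    def _forward(self, x: float) -> Tuple[List[float], float]:
        hidden = [sigmoid(w * x + b) for w, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def predict(self, x: float) -> float:
        return self._forward(x)[1]

    def train(self, patterns: Sequence[Tuple[float, float]],
              epochs: int, lr: float = 0.5) -> None:
        for _ in range(epochs):
            for x, target in patterns:
                hidden, out = self._forward(x)
                # Error signal at the output unit (derivative of 0.5 * (out - target)^2).
                delta_out = (out - target) * out * (1.0 - out)
                for j, h in enumerate(hidden):
                    # Propagate the error back through the old output weight.
                    delta_h = delta_out * self.w2[j] * h * (1.0 - h)
                    self.w2[j] -= lr * delta_out * h
                    self.w1[j] -= lr * delta_h * x
                    self.b1[j] -= lr * delta_h
                self.b2 -= lr * delta_out
```

A call such as `TinyBPN(n_hidden=10).train(patterns, epochs=50_000)`, where `patterns` is a list of (normalized input, label) pairs, would mirror the epoch count quoted in the text; smaller epoch counts are enough to see the error fall on toy data.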
The results obtained match the desired root mean square (RMS) error very closely, as shown in Table 5. Although this method is not designed to be used as a complete intrusion detection system, the results show the potential of neural networks to detect individual instances of possible misuse from representative web-based data. The graphs in Figure 1 show the results for different numbers of hidden layers used in the BPN. It is evident from the graphs that the results are very close to the desired output values when we use 10 neurons in the hidden layer.
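For reference, a root mean square error between predicted and desired outputs can be computed as below. The paper does not spell out exactly how its RMS values are calculated, so this is the standard definition (square root of the mean squared difference) offered as an assumption; `rms_error` is an illustrative name.

```python
import math
from typing import Sequence


def rms_error(predicted: Sequence[float], desired: Sequence[float]) -> float:
    """Square root of the mean squared difference between two equal-length sequences."""
    if len(predicted) != len(desired) or len(predicted) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - d) ** 2 for p, d in zip(predicted, desired)]
    return math.sqrt(sum(squared) / len(squared))
```

An RMS error of 0 means every predicted value equals its desired value, and larger values indicate a worse fit on the test patterns.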

Discussions
The method described above can be used to find the web documents that should not be allowed in the organization. The web server log file is divided into two parts. One file contains only the destination IP addresses, and the second file contains the corresponding source IP address and the date and time of the site being accessed. The first file, holding the IP addresses of the sites being accessed, is converted into normalized IP addresses. This is the input pattern to the neural network for testing. The neural network has already been trained on IP addresses with errors (invalid websites) and without errors (valid websites). When a user tries to access a website that is in the invalid website record, it is detected by the system. At the time there is a

Figure 1. Predicted output for test patterns using (a) 2, (b) 5, and (c) 10 hidden layers.

Table 1. Attributes in ISA Server 2000 log file
Common values are http for Hypertext Transfer Protocol, https for Secure HTTP, and ftp for File Transfer Protocol.

Table 3. Normalized training patterns for BPN. Columns: Normalized IP addresses; Valid(0) / Invalid(1).

for different number of hidden layers are shown in Table 5.
• After training, the BPN is tested with the test patterns shown in Table 4.