Two Approaches on Implementation of CBR and CRM Technologies to the Spam Filtering Problem
1. Introduction
The development of the Internet has generated many problems, one of which is spam. Spam is an unwanted message appearing in e-mail, search engines, chats, forums and instant messaging (IM). The best-known and most annoying kind of spam is e-mail spam, since e-mail is an effective, fast and cheap means of communication. Almost every computer user has e-mail and faces the spam problem.
For 2010, Symantec reports that spam made up 89.1% of mail traffic, and according to the Kaspersky Lab annual report the figure was 90.8% [1,2]. Such a volume of spam makes electronic communication useless, and sometimes insecure. As spam grows very fast, spammers have begun to send harmful software, Trojans and other malicious content within it. According to the Symantec annual report for 2010, more than 339,600 different viruses were registered, which is hundreds of times more than in 2009 [1]. As seen from Figure 1, the number of registered malicious attacks increased in the summer of 2010, when they were found in approximately 6% of all e-mails. According to Ferris Research estimates, the worldwide cost of e-mail spam in 2009 was roughly 130 billion dollars [3]. All these facts once again urge us to fight spam with the most effective new methods. Since spam changes very quickly (the body, subject, sender’s e-mail and IP addresses all change) and e-mail filtering should be individual (a message marked as spam by one user may be desirable for another), an effective anti-spam system should be trainable and personalized.
2. Related Works
Every day computer users receive hundreds of spam messages in their mailboxes from new e-mail accounts. These messages frequently arrive with different subjects and bodies that are generated automatically by robot software. It is almost impossible to filter them with traditional methods such as black and white lists. Applying artificial intelligence methods to the problem of filtering unsolicited messages can raise the efficiency of spam filtering. Artificial intelligence methods are divided as follows [4]:
• Conventional: machine learning methods based on formalism and statistical analysis;
• Computational: methods of iterative development and training based on empirical data;
• Hybrid: methods that use conventional and computational methods together.
One of the conventional methods is Case-Based Reasoning (CBR). This paper considers the possibility of applying the CBR method to the spam filtering problem. CBR is a method of reasoning based on precedents: a computational model that uses previous events to understand and solve new problems. In some of the scientific literature CBR appears as “the theory of precedents”. The construction of CBR systems began in 1982 with Schank’s arguments, in which reminders link past events with current events to allow generalization and prediction [5]. Kolodner then developed the first CBR system, CYRUS, expanding Schank’s ideas. A CBR system differs from an expert system. Expert systems store past experience as generalized rules and objects, whereas CBR systems store past experience as separate problem-solving episodes [6]. CBR systems try to solve a new problem using events from previously solved problems. The main principle of such systems is therefore that one can solve new problems by remembering similar events from similar situations.

Figure 1. The percentage of email spam with malicious attachments in 2010, Kaspersky Lab [2].
CBR methods are successfully applied in various areas such as classification, diagnostics, forecasting, planning and design. Regardless of the problem being solved with CBR methods, a certain sequence of tasks must be executed (Figure 2).
The basic stages of the CBR task cycle are carried out in the following sequence [7]:
1) Retrieval of the most similar case (or cases) from those accumulated in the case base.
2) Reuse of the information and knowledge in this case (or set of cases) to solve the new problem.
3) Revision and adaptation of the solution of the new problem.
4) Retention of this experience for solving future problems.
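The four stages above can be sketched in code. The following is a minimal illustration, not the system described in this paper: the `Case` class, the `jaccard` word-overlap similarity and the 0.5 threshold are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Case:
    text: str    # problem description (here: message text)
    label: str   # stored solution (here: "spam" or "legit")


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; an assumed stand-in measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def retrieve(case_base, text):
    """Stage 1: pick the stored case most similar to the new problem."""
    return max(case_base, key=lambda c: jaccard(c.text, text), default=None)


def reuse(nearest, text, threshold=0.5):
    """Stage 2: propose the nearest case's label if it is similar enough."""
    if nearest is not None and jaccard(nearest.text, text) >= threshold:
        return nearest.label
    return None  # no sufficiently similar precedent


def revise(proposed, user_feedback):
    """Stage 3: the user's verdict, if any, overrides the proposal."""
    return user_feedback if user_feedback is not None else proposed


def retain(case_base, text, label):
    """Stage 4: store the confirmed case for future problems."""
    if label is not None:
        case_base.append(Case(text, label))


case_base = [Case("cheap pills buy now", "spam")]
message = "buy cheap pills now online"
proposed = reuse(retrieve(case_base, message), message)
final = revise(proposed, user_feedback=None)
retain(case_base, message, final)
```

In this toy run the new message shares four of five distinct words with the stored case, so the stored label is proposed, left unchanged by the revision stage, and the message is added to the case base.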
The application of the CBR method to the spam filtering problem is considered in [8-12]. According to these works, a classifier based on CBR performs better than Naive Bayes in spam filtering. A distributed CBR approach can combine content-based spam filtering with collaborative filtering.
The work [13] describes the anti-spam system ACABARASE, developed on the basis of CBR, which after some training filters spam with fewer false positives.
The spam filtering model SPAMHUNTING presented in [14-18] is also based on CBR and applies a disjoint knowledge representation engine. This spam filter is able to address the concept drift problem by combining a relevant term identification technique with an evolving sliding window strategy. The idea is to identify and remove obsolete and irrelevant knowledge that has accumulated over time. The continuous updating technique used in SPAMHUNTING works at two levels: 1) indexation of the knowledge base; 2) continuous search for its best representation.
Another machine learning technology is Customer Relationship Management (CRM). Although CRM theory has a 20-year history, and the expression customer relationship management has been in use since the early 1990s, it has not yet been applied to the spam filtering problem. There is, however, extensive practice of applying CRM to other problems [19-25].
3. CBR and CRM Implementation Approaches
This paper considers a centralized system for filtering unsolicited bulk messages that coordinates all Internet Service Providers (ISPs) within a country and functions as a collaborative spam filter involving the e-mail users of the system and all ISPs. This mechanism can be realized at the ISP level by continuously updating the system database with new spam templates and white, black and grey lists. An ISP can promptly delegate data to the databases or delete data from them, or transfer the data to the Network Service Provider (NSP) that provides the ISP with Internet traffic (Figure 3).
The proposed system has a multilayered hierarchical structure consisting of three levels: state, corporate and personal. At each level of the hierarchy there are server nodes, each of which holds a database of spam templates. These databases collect the spam templates coming from lower-level nodes or from ordinary user nodes of the same level.
3.1. CBR
For the spam filtering problem considered above, we define the following cycle of tasks according to CBR theory. At the first step, when a user of our multilayered hierarchical spam filtration system reports a newly arrived spam message to the server, the system marks it as a new case.
This new case is compared with the previous cases accumulated in the case base (the database of spam templates), and the most similar one is retrieved. Combining the retrieved case with the new case gives a suggested solution. The combined case is called a solved case.

Figure 3. Architecture of multilayered hierarchical system of spam filtration.
This solution is then revised and checked for success and applicability to the real world. The solution obtained at this step is the confirmed solution, and the case is called a tested case.
In case of failure, a new, more suitable case is retrieved. At the retention stage, the successful case with its corresponding solution is stored in the case base for future use and is called a learned case.
Mathematical methods should be developed for solving the tasks belonging to each step. For the comparison and retrieval of cases, one can use the different methods described in [26,27]. To compare a newly arrived message with the spam messages collected in the database, we define the following case parameters, that is, the set of characteristics of a message:
1) Sender’s e-mail address
2) Sender’s IP address
3) Subject of message
4) Key words in message body
5) Key phrases in message body
6) Message body
Let us introduce some notation.
The notation covers the number of layers of the proposed multilayered system, the number of server nodes on each level, and the number of nodes on a given level that are connected to a given server node.
Since the proposed system is assumed to be dynamic and trainable, and the database of spam templates is gradually updated with new templates, we also introduce a time parameter.
Each message is described by a fixed number of case parameters; in this work there are six, as listed above.
Each message arriving at a node as spam at a given time is described by its case parameters. During the filtering process, each new message arriving for a user is compared with the spam messages previously delegated by the same user. For this purpose the system keeps, for every user, the set of spam messages that the user has delegated to the server node up to the time of the current delegation.
Spam filtration at each level is carried out according to the anti-spam policy of that level. The anti-spam policy contains each user’s files, formed from the user’s official reports about spam in the received correspondence. Spam filtration is performed on the basis of these official reports, which serve as cases [28].
The set of legitimate messages arriving at a node is defined by the anti-spam policy of the same node.
Depending on anti-spam policy of each node, comparison can be made by one criterion or by combination of different parameters.
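To illustrate comparison by one criterion or by a combination of parameters, the sketch below represents a message by the six case parameters listed earlier and lets a node's policy choose which of them are compared. The field names, the exact-match and set-overlap rules, and the `matches` function are hypothetical; the paper does not prescribe a specific comparison method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str          # 1) sender's e-mail address
    ip: str              # 2) sender's IP address
    subject: str         # 3) subject of message
    keywords: frozenset  # 4) key words in message body
    phrases: frozenset   # 5) key phrases in message body
    body: str            # 6) message body


def field_match(name, new, template):
    """Compare one case parameter; set-valued fields match on any overlap."""
    a, b = getattr(new, name), getattr(template, name)
    if isinstance(a, frozenset):
        return bool(a & b)
    return a.lower() == b.lower()


def matches(new, template, policy):
    """True if every parameter named in the (non-empty) policy matches."""
    return bool(policy) and all(field_match(f, new, template) for f in policy)


template = Message("promo@example.com", "203.0.113.7", "Win a prize",
                   frozenset({"prize", "win"}), frozenset({"click here"}),
                   "Click here to win a prize")
new = Message("other@example.net", "203.0.113.7", "WIN A PRIZE",
              frozenset({"prize"}), frozenset(), "You may win a prize")

by_ip = matches(new, template, policy=("ip",))                    # one criterion
by_sender_subject = matches(new, template, policy=("sender", "subject"))
by_combo = matches(new, template, policy=("ip", "subject", "keywords"))
```

Here the single-criterion policy on the IP address matches, the policy requiring the same sender and subject does not (the sender differs), and the combined policy on IP address, subject and key words matches.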
The number of comparisons of two spam messages is

The number of comparisons of
spam messages is

The proposed system allows a message previously marked as spam to be withdrawn (restored). In this case, the message delegated by the user as spam at the given time is removed from that user’s set of spam templates. Accordingly, the set of spam templates and the anti-spam policy for that level are also changed. The dynamic algorithm of the system restores the state of the system in real time (during the process), using the input information about the system at the current discrete time.
In the absence of spam templates no decision is taken for that user. This means that either the user has recently connected to the spam filtration system, or the user is tolerant of spam messages.
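The delegation, restoration and no-decision behaviour described above could be modelled as follows. This is a sketch under assumed names (`TemplateStore`, `delegate`, `restore`, `classify`); matching is reduced to exact subject equality only to keep the example short.

```python
from collections import defaultdict


class TemplateStore:
    """Per-user spam templates kept on a server node (hypothetical model)."""

    def __init__(self):
        # user -> list of (time, subject) pairs reported as spam
        self.templates = defaultdict(list)

    def delegate(self, user, time, subject):
        """The user reports a message as spam at a given discrete time."""
        self.templates[user].append((time, subject))

    def restore(self, user, time, subject):
        """The user withdraws an earlier report; the template is removed."""
        try:
            self.templates[user].remove((time, subject))
        except ValueError:
            pass  # nothing to restore

    def classify(self, user, subject):
        """Return 'spam', 'not spam', or None when the user has no templates."""
        reported = self.templates.get(user)
        if not reported:
            return None  # new or spam-tolerant user: no decision is taken
        known = {s for _, s in reported}
        return "spam" if subject in known else "not spam"


store = TemplateStore()
before = store.classify("alice", "Win a prize")      # no templates yet
store.delegate("alice", 1, "Win a prize")
after_report = store.classify("alice", "Win a prize")
other = store.classify("alice", "Meeting agenda")
store.restore("alice", 1, "Win a prize")
after_restore = store.classify("alice", "Win a prize")
```

Because restoring the only report empties the user's template set, the store returns to the no-decision state for that user.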
3.2. CRM
The expression CRM has a variety of meanings. One of them is that CRM is an information industry term for methodologies, software and, usually, Internet capabilities that help an enterprise manage customer relationships in an organized way [29].
Some papers identify three types of CRM: operational, analytical and collaborative. There are different interpretations of these three types. According to one of them [30]:
• Analytical CRM is responsible for analyzing customers’ behavior in terms of sales, marketing or any other service provided. It utilizes data warehouse to extract appropriate data regarding different customers;
• Operational CRM is responsible for automating business processes that are related to customers like marketing and sales etc.;
• Communication/Collaborative CRM as the name implies, is responsible for efficient collaboration/association with the customers through e-mails, fax, phone, SMS or face to face communication.
A graphical interpretation of the above steps according to Liu & Zhu [31] is shown in Figure 4.
Xu & Walton [32] refer to these steps as the main principles of CRM and define them as follows:
• Collection of information;
• Efficient use of the collected data;
• Automation of the process.
In this paper we consider CRM theory as the management of the relation between customers and their choices. By learning relevant information about customers, such as names, habits, preferences and expectations, a one-to-one relation can be formed [33]. Learning this information can help in making the right decision. Sometimes during the spam filtration process legitimate messages are marked as spam and the user loses important mail. Even the best anti-spam solutions produce some percentage of false positives. The advantage of the CRM approach is that it can decrease the number of false positives.
In the spam filtering problem, we treat the e-mail user as the customer and the messages that the user has marked as spam as the customer’s choices. Our approach uses the main idea of CRM theory: with more information about the customer (the user), one can increase the efficiency of spam filtering. The CRM database, which contains user profile data such as preferences, interests and scientific direction, is at the input of our filtration system (Figure 5). Processing this profile can manage filtration automatically. Over time this profile can be changed manually by the user or updated through automatic analysis of information derived from the user’s mail and/or visited Web resources.
Following the main steps of any CRM system presented above, we can define the following sequence of tasks describing the technology framework of our CRM-based spam filtering system (Figure 6):
• The first step is the construction of an analytical CRM system, which relies on data mining tools to gather, analyze and interpret the large amount of data belonging to users. This data can be derived from e-mail and visited Web resources. All information belonging to a user, such as preferences regarding e-mail (which content the user likes and which the user dislikes) and the user profile, is key to filtering that user’s e-mail.

Figure 4. Technology framework of CRM [31].

Figure 6. Technology Framework of CRM based Spam Filtering System.
• The second step is the construction of an operational CRM system. After the data is collected, it should be placed in the database of the CRM-based spam filtering system at the input of the system, where it is also accessible to the user so that the user can manage it from time to time.
• The third step is the automated filtration process. During this process the filtering system can recognize newly arrived spam messages by comparing the spam indicators of a message with the corresponding data from the spam templates reported by the user and stored in the database, and also with information from the user profile.
The efficiency of spam filtration depends on the comparison method used and on the volume of collected data. A well-trained CRM-based spam filtering system is therefore expected to show high efficiency with fewer false positives.
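One way to picture how a user profile might reduce false positives is sketched below. It is an illustration only: the `UserProfile` structure, the overlap-based `spam_score`, the discount applied to messages that touch the user's interests, and the 0.5 threshold are assumptions, not the method of this paper.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    interests: set = field(default_factory=set)          # e.g. research topics
    spam_templates: list = field(default_factory=list)   # word sets of reported spam


def words(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def spam_score(message, profile):
    """Highest word overlap with any reported template, in [0, 1]."""
    w = words(message)
    scores = [len(w & t) / len(w | t) for t in profile.spam_templates if w | t]
    return max(scores, default=0.0)


def is_spam(message, profile, threshold=0.5, discount=0.5):
    """Flag as spam unless the profile suggests the user wants the message."""
    score = spam_score(message, profile)
    if words(message) & profile.interests:
        score *= discount  # the message touches a declared interest
    return score >= threshold


profile = UserProfile(
    interests={"cryptography"},
    spam_templates=[words("call for papers submit now")],
)
plain = is_spam("call for papers submit now", profile)
relevant = is_spam("cryptography call for papers submit now", profile)
```

In this toy run the first message repeats a reported template and is flagged, while the second overlaps the same template but mentions a declared interest, so its discounted score falls below the threshold and it is delivered.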
4. Conclusion
This work proposes a concept for applying two well-known approaches to spam filtering. One of them is CBR technology, which has recently begun to be applied to spam filtering. The other is CRM technology, which has not yet been applied to the spam filtering problem. Both are machine learning concepts and could be used effectively in spam filtering. As spammers constantly change the external features of spam messages to evade spam filtering systems, there is a need for an adaptive, trainable filtering system. The development of server-side personalized e-mail filtering systems that use learning-based classification algorithms built on CBR and/or CRM technology is therefore a very promising direction.
5. Future Work
Future work will focus on providing methods and experiments to demonstrate the effectiveness of applying CBR and CRM technologies to the spam filtration problem.