The Application of Book Intelligent Recommendation Based on the Association Rule Mining of Clementine

doi:10.4236/jsea.2013.67B006


Journal of Software Engineering and Applications, 2013, 6, 30-33

Published Online July 2013 (http://www.scirp.org/journal/jsea)


Jia Lina, Mao Zhiyong

Graduate School, Liaoning Technical University, Huludao, China.

Email: jelena1988@sina.cn

Received May, 2013

ABSTRACT

Traditional libraries cannot provide a personalized recommendation service for users. This paper uses Clementine to address this problem. First, a K-means clustering model analyzes the initial data so that redundant data can be deleted, which avoids scanning the database repeatedly and producing a large number of false rules. Second, the clustering results are used for association rule mining, which yields valuable information and enables an intelligent recommendation service.

Keywords: Data Mining; Association Rules; Clustering; Intelligent Recommendation; Clementine

1. Introduction

Recommendation services play an important role as digital libraries move toward personalization and intelligence. The system can recommend books to readers using relevant information that data mining extracts from the readers’ lending behavior and preferences. Mining such relevance information is association rule mining[1]. Since Rakesh Agrawal put the problem forward, it has attracted the attention of many international researchers, who have proposed many kinds of algorithms.

Association rules were put forward to break the limit of a single transaction: they find relationships between different transactions so that the events a user is interested in can be predicted reasonably. When transaction analysis is carried out on a large database, however, mining takes a long time and produces a large number of rules, many of them false, so mining efficiency is reduced. For this reason, this paper first uses the data mining software Clementine to perform clustering analysis on readers, clustering their borrowing behavior into high frequency, medium frequency and low frequency[2]. Association rule mining is then applied to the books borrowed by high-frequency and medium-frequency readers. Finally, the mining results are delivered to the client user through a Web service. The books borrowed by high-frequency and medium-frequency readers are chosen because these readers borrow a large number of books and the association rules among those books are strong. This reduces the amount of data involved in association rule mining, saves scanning time, and improves the quality of mining.

2. Clementine Software Introduction

Clementine is data mining software developed by SPSS. It integrates clustering, association rules, decision trees, neural networks and many other data mining techniques in an intuitive visual graphical interface. Clementine combines data mining with business practice so that data models can be built quickly and applied to business activities, helping people improve the decision-making process. This paper applies the clustering and association rule mining functions of Clementine 12.0 to an intelligent book recommendation service[3].

2.1. Characteristics of Clementine

1) It provides a visual, powerful and easy-to-use data mining platform. The user builds a model by connecting nodes, so a data mining model can be built without programming. Users can therefore focus on solving specific business problems with data mining rather than on operating the tool.

2) It fully follows the CRISP-DM standard. Clementine provides good project management functions and can manage the whole process effectively, from business understanding to result deployment.

3) It provides stable and powerful deployment functions. Clementine can publish a data mining model or the whole data mining flow to improve operational efficiency.

4) It is highly flexible and extensible. Clementine has an open database interface that supports almost all relational databases. It also offers extension functions.

2.2. Six Stages of CRISP-DM Process Model

1) Business understanding. This is the most important stage in data mining. It includes determining the business objectives, assessing the situation, determining the data mining goals and producing the project plan.

2) Data understanding. This stage provides the raw material for data mining and reveals the characteristics of the data source. It includes collecting the initial data, describing the data, exploring the data and checking data quality.

3) Data preparation. This stage organizes the source data for mining. It includes data selection, cleaning, construction, integration and formatting.

4) Modeling. This is the core part of data mining. It includes choosing the modeling technique, generating the test design, and building and assessing the model.

5) Model evaluation. After a model is chosen, this stage evaluates whether the data mining results help achieve the business objectives. It includes evaluating the results, reviewing the data mining process and determining the next steps.

6) Result deployment. This stage integrates the new knowledge into the daily business flow to solve the initial business problems. It includes planning deployment, monitoring and maintenance, producing the final report and reviewing the project[4].

3. Library Data Mining Based Clementine

The information needs of library users, and the forms those needs take, are diverse, so the library should provide a personalized recommendation service based on the needs and interests of readers. This paper performs clustering analysis on how often readers borrow books, which divides readers into three types: high frequency, medium frequency and low frequency. Association rule analysis is then applied to the books borrowed by high-frequency and medium-frequency readers to realize a personalized recommendation service[10].

3.1. Data Acquisition

The data in this paper come from the library of Liaoning Technical University. The total number of book borrowings by readers from Nov 7th, 2011 to Mar 7th, 2012 is 62261, from which 3108 were extracted as the experimental data.

3.2. Data Pre-Processing

The data are exported to an Excel table and imported into a SQL Server 2000 database for pre-processing. Pre-processing mainly reprocesses the data from the previous stage to check their integrity and consistency. It includes removing noise, inferring missing values, removing duplicate records and converting data types. In the pre-processing stage, “dirty data” (redundant, empty, incomplete or noisy records) are deleted. This lays the foundation for the data mining in the next step and improves mining efficiency and quality[7].
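The cleaning operations listed above can be illustrated with a minimal Python sketch. It is not the SQL Server 2000 procedure used in the paper; the field names `reader_id`, `call_number` and `borrow_date` and the sample rows are assumptions made only for illustration.

```python
from datetime import date

def preprocess(records):
    """Drop incomplete or noisy rows, convert types, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        reader = (rec.get("reader_id") or "").strip()
        call_no = (rec.get("call_number") or "").strip()
        borrowed = rec.get("borrow_date")
        # Incomplete records (missing reader, book or date) are discarded.
        if not reader or not call_no or not borrowed:
            continue
        # Data type conversion: ISO date strings become date objects.
        if isinstance(borrowed, str):
            try:
                borrowed = date.fromisoformat(borrowed)
            except ValueError:
                continue  # an unparseable date is treated as noise
        key = (reader, call_no, borrowed)
        if key in seen:  # duplicate record
            continue
        seen.add(key)
        cleaned.append(
            {"reader_id": reader, "call_number": call_no, "borrow_date": borrowed}
        )
    return cleaned

# Hypothetical raw rows, not taken from the library data set.
raw = [
    {"reader_id": "r1", "call_number": "B83/20=3", "borrow_date": "2011-11-07"},
    {"reader_id": "r1", "call_number": "B83/20=3", "borrow_date": "2011-11-07"},
    {"reader_id": "", "call_number": "B83-09/13", "borrow_date": "2011-11-08"},
    {"reader_id": "r2", "call_number": "B83-09/13", "borrow_date": "not a date"},
    {"reader_id": "r2 ", "call_number": "B83-09/13", "borrow_date": "2012-03-07"},
]
cleaned = preprocess(raw)
```

In this sketch the duplicate, the row without a reader and the row with an invalid date are removed, so only two records remain.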

3.3. Modeling Based on Clementine

3.3.1. Cluster Modeling

The pre-processed data are fed into the clustering model in SPSS Clementine for cluster analysis. This paper uses the K-means algorithm to cluster readers’ borrowing behavior. The K-means algorithm[15] iteratively calculates “centroids” and assigns every sample to a cluster according to the distance between the sample and each centroid. The process is as follows[5].

1) Determine the initial centroids. Select the first sample as the first centroid, and for every sample calculate the Euclidean distance between it and the centroid. Define the centroid vector $C = (c_1, c_2, \ldots, c_Q)$ and a sample vector $X = (x_1, x_2, \ldots, x_Q)$, where $Q$ is the number of attributes in the data set and $x_q$ is the $q$-th attribute value, $q = 1, 2, \ldots, Q$. The Euclidean distance between the sample and the centroid is

$$d(X, C) = \sqrt{\sum_{q=1}^{Q} (x_q - c_q)^2}$$

Select the sample with the largest Euclidean distance as another centroid, and repeat until all K centroids are identified. After the initial K centroids are generated, the algorithm begins to iterate over the assignment and update steps[14].

2) Assign samples. In every iteration, each sample is assigned to the cluster whose centroid is nearest to it. The distance is defined as the square of the Euclidean distance, so the distance between sample $i$ and centroid $j$ is

$$d_{ij} = \lVert X_i - C_j \rVert^2 = \sum_{q=1}^{Q} (x_{qi} - c_{qj})^2$$

where $X_i$ is the vector of attribute values of sample $i$, $C_j$ is the centroid vector of cluster $j$, $Q$ is the number of attributes, $x_{qi}$ is the $q$-th attribute value of sample $i$, and $c_{qj}$ is the $q$-th attribute value of the centroid of cluster $j$. After all records have been assigned, the centroid of every cluster is updated.

3) Update centroids. Some samples may move from one cluster to another during assignment, so the centroid of every cluster has to be recalculated. Let $m_j$ be the number of samples in cluster $j$ after assignment. The recalculated centroid of cluster $j$ is $C_j = \bar{X}_j = (\bar{x}_{1j}, \bar{x}_{2j}, \ldots, \bar{x}_{Qj})$, whose $q$-th component ($q = 1, 2, \ldots, Q$) is

$$\bar{x}_{qj} = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{qi}(j)$$

where $x_{qi}(j)$ is the $q$-th attribute value of sample $i$ in cluster $j$[11].

4) Stopping criterion. First, the “maximum iterations” setting limits how long the algorithm searches for stable clusters. The algorithm repeats the “assign samples, update centroids” cycle until the maximum number of iterations is reached; it then stops updating the clusters and generates the final model[13]. The “tolerance” setting provides another way to stop the algorithm. After every iteration, the distance each centroid has moved is calculated. For example, after iteration $t$ the distance moved by the centroid of cluster $j$ is

$$\lVert C_j(t) - C_j(t-1) \rVert$$

where $C_j(t)$ is the centroid vector of cluster $j$ at iteration $t$ and $C_j(t-1)$ is the centroid vector of cluster $j$ at the previous iteration. The K clusters therefore produce K values, and the largest of them is selected:

$$\max_{j} \lVert C_j(t) - C_j(t-1) \rVert$$

If this maximum is less than the predefined tolerance, the algorithm stops; otherwise it continues.
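Steps 1) to 4) can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the Clementine K-means node itself: the function names, the default `max_iter` and `tol` values and the one-dimensional borrowing counts are all hypothetical. The initialization uses the squared distance, which selects the same farthest sample as the Euclidean distance.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between sample x and centroid c."""
    return sum((xq - cq) ** 2 for xq, cq in zip(x, c))

def init_centroids(samples, k):
    """Step 1: first sample, then repeatedly the sample farthest from its nearest centroid."""
    centroids = [list(samples[0])]
    while len(centroids) < k:
        farthest = max(samples, key=lambda x: min(sq_dist(x, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(samples, k, max_iter=20, tol=1e-6):
    centroids = init_centroids(samples, k)
    labels = [0] * len(samples)
    for _ in range(max_iter):  # stop 1: maximum iterations
        # Step 2: assign every sample to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in samples]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        # Step 4: largest centroid movement compared with the tolerance.
        shift = max(sq_dist(a, b) ** 0.5 for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < tol:  # stop 2: tolerance
            break
    return centroids, labels

# Hypothetical number of loans per reader (one attribute, Q = 1).
counts = [[1], [2], [3], [14], [15], [16], [40], [42], [44]]
centroids, labels = kmeans(counts, k=3)
```

On these made-up counts the three centroids settle at 2, 42 and 15 loans, which would correspond to low, high and medium borrowing frequency.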

Figure 1 shows the view of the cluster model obtained through these steps.

The result divides the readers into three classes: high frequency (cluster 2), medium frequency (cluster 3) and low frequency (cluster 1). The high-frequency and medium-frequency users are extracted because they borrow a large number of books and the association rules among those books are strong. Cluster 1 is regarded as noisy data and deleted so that the association rules are more typical.
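The extraction of high-frequency and medium-frequency readers can be sketched as a filter on the clustering output. The sketch assumes that the low-frequency cluster is the one with the smallest centroid; the reader identifiers, labels and centroid values are hypothetical and do not reproduce Figure 1.

```python
def keep_active_readers(reader_counts, labels, centroids):
    """Return the readers outside the cluster with the smallest centroid."""
    low = min(range(len(centroids)), key=lambda j: centroids[j][0])
    return [reader for (reader, _), lab in zip(reader_counts, labels) if lab != low]

# Hypothetical clustering output: (reader id, number of loans) with one label per reader.
reader_counts = [("r1", 2), ("r2", 15), ("r3", 41), ("r4", 1), ("r5", 16)]
labels = [0, 2, 1, 0, 2]
centroids = [[1.5], [41.0], [15.5]]
active = keep_active_readers(reader_counts, labels, centroids)
```

Only the borrowing records of the readers in `active` would then be passed on to association rule mining.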

Figure 1. Model view.

3.3.2. Association Rules Mining

The clustering analysis is regarded as the pre-processing part of association rule mining. It allows association rules to be found efficiently and avoids generating false rules[6]. It also makes the data more representative, pertinent and accurate. Cluster 3 and Cluster 2 together contain 764 readers. The borrowing information of these 764 students is queried from the database and saved as a data sheet. The Apriori node in Clementine is then used for association rule mining. The process is:

1) Generate frequent item sets. Based on the frequent $(k-1)$-item sets, which make up the set $L_{k-1}$, generate all candidate $k$-item sets $C_k$, prune $C_k$, and calculate the support of every item set $w$ in $C_k$: $\mathrm{support}(w) = N_w / N$, where $N_w$ is the number of transactions that include item set $w$ and $N$ is the total number of transactions. Put the item sets whose support is at least $\mathit{min\_sup}$ into the set of frequent $k$-item sets $L_k$. Frequent $k$-item sets are searched for only while $k$ is less than the maximum predefined by the user. Repeat the above steps to find all frequent item sets $L = \bigcup_k L_k$.

2) Generate association rules. After all the frequent item sets $L$ have been found, the algorithm generates association rules from them. First, for each frequent item set $l$ in $L$, generate all non-empty proper subsets of $l$. Second, for every such subset $A$, if it satisfies the evaluation criterion

$$\frac{\mathrm{support}(l)}{\mathrm{support}(A)} \ge \mathit{min\_conf}$$

where $\mathrm{support}(l)$ and $\mathrm{support}(A)$ are the supports of item sets $l$ and $A$, then output the rule “$A \Rightarrow (l - A)$”[12].

The resulting association rules are shown in Figure 2 and Figure 3.

The library’s call numbers follow the Chinese Library Classification. Figure 2 shows that readers who borrow B83-09/13 (historical pedigree and theoretic finality) also tend to borrow B83/20=3 (aesthetics introduction, revised edition), which can serve as the basis for a recommendation to the reader. Figure 3 clearly shows the association rules among books; rules drawn with a thick line are stronger than those drawn with a thin line.
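How a mined rule such as B83-09/13 ⇒ B83/20=3 could be turned into a recommendation is sketched below. The rule list, the confidence values and the call numbers TP311/5 and TP312/9 are hypothetical, and the ranking by confidence is an assumption rather than the matching method of the paper.

```python
def recommend(borrowed, rules, top_n=3):
    """Rank books implied by rules whose antecedent is contained in `borrowed`."""
    borrowed = set(borrowed)
    scores = {}
    for antecedent, consequent, confidence in rules:
        if set(antecedent) <= borrowed:
            for book in consequent:
                if book not in borrowed:
                    # Keep the strongest rule that suggests this book.
                    scores[book] = max(scores.get(book, 0.0), confidence)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [book for book, _ in ranked[:top_n]]

# Hypothetical rules as (antecedent, consequent, confidence); values are made up.
rules = [
    ({"B83-09/13"}, {"B83/20=3"}, 0.8),
    ({"B83/20=3"}, {"B83-09/13"}, 0.6),
    ({"B83-09/13", "TP311/5"}, {"TP312/9"}, 0.9),
]
suggestions = recommend(["B83-09/13"], rules)
```

A reader who has borrowed only B83-09/13 is offered B83/20=3, while the third rule fires only for readers who have borrowed both books in its antecedent.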

Figure 2. Model view.

Figure 3. Association rules webs.

4. Realize Intelligent Recommendation

Through the data mining process, the association rules are delivered to readers by an agent. When a client sends a request to the Web server, the server forwards the request to the reader recommendation agent for matching. The agent returns the matching recommendation rules to the Web server, which finally passes them to the user at the client[8]. This gives readers more choices and improves the utilization rate of books. Figure 4 shows the model of intelligent book recommendation.

(Figure 4 diagram; recovered labels: History access data, Data Preparation, Association rule mining, Readers recommend Agent, User currently access data, WebServer, Client, Recommended rules after matching.)

Figure 4. Mode pattern of intelligent recommendation.

5. Conclusions

It is important to provide flexible and targeted book recommendation services as digital libraries develop in the direction of intelligence[9]. This paper treats clustering as the data pre-processing step of association rule mining to make the rules more accurate. The results show that the approach is effective and viable.

REFERENCES

[1] C. G. Yuan, “Data Mining Theory and SPSS Clementine Application,” Beijing: Electronic Industry Publishing, 2009, pp. 547-578.

[2] F. Y. You, “Data Mining and Digital Library Application,” Office Automation Magazine, 2007, pp. 51-52.

[3] C. H. Bao, “Data Warehouse and Data Mining,” Beijing: Tsinghua University Press, 2006.

[4] J. Han, M. Kamber, “Data Mining and Technology,” M. Fan, Translator, Beijing: China Machine Press, 2001, pp. 10-33.

[5] H. Y. Chen, “Based on Weighting Association Rules and Browse Behavioral Personality,” Chongqing University, 2005.

[6] W. Wang, “Reader Behavior Analysis Based on Data Mining,” Modern Library and Information Technology, 2006, pp. 51-54.

[7] H. Y. Cai, “The Application in University Library System for Data Mining about Association Rules,” NUT College Journal, 2005, pp. 85-88.

[8] W. H. Li, “Personality Information Recommend System in Digital Library,” 2007, pp. 109-110.

[9] W. W. Chen, “Data Mining Research about Reader Behavior,” Chongqing Southwest University, 2007.

[10] B. C. Xie, “Data Mining Clementine Application,” Beijing: THU Press, 2008, pp. 213-215.

[11] J. Bao, S. W. Fan, “The Data Pre-processing for Data Mining,” Library and Information Science, Vol. 26, No. 2, 2008, pp. 31-33.

[12] Z. G. Li, G. Ma, “DW and DM Application,” Beijing: Higher Education Press, 2008, pp. 150-170.

[13] Q. H. Xiao, “Data Mining Apply in Information Server,” Library Forum, Vol. 24, No. 1, 2004, pp. 140-142.

[14] B. H. Wang, “Data Mining and Application,” Statistics and Decision, 2006, pp. 122-123.

[15] X. Li, C. H. Yang, “K-means Cluster Application,” Library and Information Science, Vol. 25, No. 2, 2009, pp. 15-17.