An Improved Name Disambiguation Method Based on Atom Cluster

Existing name disambiguation methods that compute the similarity of person-related attributes depend heavily on information extraction. To address this, a new name disambiguation method is proposed in this paper, together with an improved k-means algorithm for name disambiguation. Atom clusters are introduced into the cluster analysis used in the disambiguation process. Experimental results show that the proposed method has high execution efficiency and can distinguish different people who share the same name.


Introduction
Recently, with the development of endowment insurance and medical treatment, a variety of core business processing systems have come into being. However, in the databases of these core business processing systems, limitations in how information is recorded make it difficult to find a unique primary key for real customers. The first problem encountered in insurance business processing is therefore the collision of customer names. Data mining [1][2][3] is an advanced and effective method that searches large amounts of data, reveals the hidden regularities in them, and builds models from them according to established business objectives.
The methods in [4,5] represent the context of a name in a vector space model to perform name disambiguation. The method in [6] performs name disambiguation by further extracting information such as race, gender, education, work and family relationships, and then calculating the similarity between persons. These methods all take many irrelevant words into account, and the disambiguation process relies strongly on information extraction.
A new name disambiguation method is proposed in this paper, in which atom clusters are used to improve the traditional clustering algorithm. Experiments show that the method can optimize the matching of Chinese characters and is highly efficient, which makes it suitable for the insurance field, where the volume of data is large.

System Framework and Algorithm Principle
Disambiguating identical names is a common problem in the insurance field. A clustering algorithm is used here to disambiguate identical names.
Definition 1. The same-name disambiguation problem in insurance can be defined as follows: given a customer identifier set r and a cooperation relation set, for identical or similar customer identifiers i and j in r, calculate the set c, in which e is an entity set and i denotes the entity corresponding to a customer identifier.

Name Disambiguation Based on Atom Cluster
An atom cluster is a group of entities with strong ties that is not split apart during the clustering process. The same-name disambiguation flow based on atom clusters is shown in Figure 1.
Same-name disambiguation based on atom clusters consists of two main steps: identification of atom clusters and same-name disambiguation. In the first step, entities with strong relationships are identified and used as the input to name disambiguation; a classifier based on AdaBoost calculates the connection level between entities to judge whether they satisfy the criterion for an atom cluster. In the second step, k-means clustering is applied to the output of the first step to perform name disambiguation.
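The first step can be illustrated with a small sketch. It assumes that the classifier has already produced a connection score for each pair of entities, and that entities linked by a chain of strong ties belong to one atom cluster; the function name `atom_clusters`, the score values and the threshold of 0.8 are hypothetical choices for the example, not values from the experiments.

```python
def atom_clusters(entities, scored_pairs, threshold=0.8):
    """Group entities whose pairwise connection score reaches `threshold`.

    `scored_pairs` holds (entity_a, entity_b, score) triples; the scores
    stand in for the output of the connection-level classifier.
    """
    parent = {e: e for e in entities}

    def find(e):
        # Follow parent links to the representative, shortening the path.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b, score in scored_pairs:
        if score >= threshold:          # strong tie: merge the two groups
            parent[find(a)] = find(b)

    groups = {}
    for e in entities:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())


# Hypothetical customer identifiers and connection scores.
entities = ["c1", "c2", "c3", "c4"]
pairs = [("c1", "c2", 0.93), ("c2", "c3", 0.35), ("c3", "c4", 0.88)]
clusters = atom_clusters(entities, pairs)
print(clusters)
```

Each resulting group is then treated as one indivisible unit when k-means is applied in the second step.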
The principle of the AdaBoost algorithm is as follows. A training set $(x_1, y_1), \ldots, (x_m, y_m)$ is given, in which each $x_i$ belongs to a domain or instance space $X$ and each label $y_i$ belongs to a label set $Y$. Initially, the distribution over the training set is uniform, with weight 1/m on each sample. A weak learner is trained on the training set under this distribution; the distribution is then updated according to the training result, and the weak learner is trained again under the new distribution. After the iterative rounds, a sequence of hypotheses $h_1, \ldots, h_T$ is obtained. Each hypothesis carries a weight, and the final hypothesis H is obtained by weighted voting. The probability of each sample appearing in the new training set is determined by the AdaBoost algorithm. The training error of the final prediction function H satisfies Equation (1). In Equation (1), $\epsilon_t$ is the training error of hypothesis $h_t$, and from Equation (1) we can conclude that the training error decreases as t increases.
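The reweighting loop described above can be sketched in plain Python. This is a minimal illustration, not the classifier used in the experiments: the decision-stump weak learner, the toy feature vectors and labels, and the function names `best_stump`, `adaboost` and `predict` are all assumptions made for the example. The sketch also accumulates the product of the per-round normalizers $Z_t$, which is the standard upper bound on the training error of H. With the usual vote weight $\alpha_t = \tfrac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$ this product equals $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$, the form that Equation (1) is assumed to take.

```python
import math


def stump_predict(x, stump):
    """Decision stump: vote `polarity` if x[feature] >= threshold, else the opposite."""
    feature, threshold, polarity = stump
    return polarity if x[feature] >= threshold else -polarity


def best_stump(X, y, w):
    """Return the stump with the lowest weighted error under distribution w."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, stump) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X, y, rounds=10):
    m = len(X)
    w = [1.0 / m] * m                    # initial distribution: 1/m per sample
    ensemble, bound = [], 1.0
    for _ in range(rounds):
        stump, eps = best_stump(X, y, w)
        eps = min(max(eps, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)    # vote weight of this stump
        # Raise the weight of misclassified samples, lower the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, stump))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)                       # normalizer Z_t
        w = [wi / z for wi in w]
        bound *= z                       # training error of H <= product of Z_t
        ensemble.append((alpha, stump))
    return ensemble, bound


def predict(x, ensemble):
    """Final hypothesis H: sign of the weighted vote."""
    score = sum(alpha * stump_predict(x, s) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data invented for illustration; labels are +1 / -1.
X = [(0, 1), (1, 0), (2, 1), (3, 0), (4, 1), (5, 0)]
y = [1, 1, -1, -1, 1, 1]
ensemble, bound = adaboost(X, y, rounds=10)
train_error = sum(predict(xi, ensemble) != yi for xi, yi in zip(X, y)) / len(X)
print(train_error, bound)
```

The bound never grows from one round to the next, and it shrinks in every round in which the weak learner does better than chance.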
After the entities have been classified by connection strength using the AdaBoost algorithm, k-means clustering is used to perform name disambiguation as follows.
The core idea of k-means clustering is to partition n data objects into k clusters so that the sum of squared distances from every object to the center of its cluster is minimized. The algorithm proceeds as follows.
Algorithm 1. k-means clustering for name disambiguation.
Input: the number of clusters k and a data set containing n data objects.
Step 1. Choose any k of the n data objects as the initial cluster centers.
Step 2. Calculate the distance from every object to every cluster center, and assign each object to the nearest cluster center.
Step 3. After all objects have been assigned, recalculate the k cluster centers.
Step 4. Compare the new k cluster centers with those of the previous iteration. If any cluster center has changed, go to Step 2; otherwise go to Step 5.
Step 5. Output the k clusters and stop.
The algorithm can be described in more detail as follows.
First, k objects are chosen from the n data objects as the initial cluster centers; each remaining object is assigned to the cluster whose center it is most similar to.
Then the new center of every cluster is calculated, and the process is repeated until the criterion function converges. The squared-error criterion is used as the criterion function, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2$$

where $E$ is the sum of squared errors over all objects in the database, $p$ is a point in the object space, and $m_i$ is the mean of cluster $C_i$.
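The k-means procedure and its criterion function can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: objects are numeric tuples, similarity is squared Euclidean distance, and the function names, the `seed` and `max_iter` parameters and the toy points are invented for the example. How atom clusters are turned into feature vectors is not specified in the text and is left out.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # choose k initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every object to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Recalculate each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        converged = new_centers == centers       # no center moved
        centers = new_centers
        if converged:
            break
    # Criterion function E: sum of squared distances to the cluster means.
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, sse = kmeans(points, k=2)
print(labels, centers, sse)
```

Because the initial centers are chosen at random, different seeds can give different label numbering; on data that is less cleanly separated they can also give different partitions, which is a known limitation of k-means.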

Problem Description
The data set used in the experiment comes from the core business systems of a large insurance company, mainly the integrated business management system, the universal system and the pension system. The policyholder information contains customer number, social insurance number, certificate type, certificate number, occupation category, subcategory, minor category, health status, years of smoking, amount smoked, marital status and relationship with the insured. The insured information contains customer number, certificate type, certificate number, name, etc. The policy information contains insured number, signing date, effective date, premium paid and policy duration. The insurance product information contains insurance code, insurance name, insurance type, duration and agent-related information.

Atom Cluster Simulation
In the life insurance field, a policyholder usually takes out the insurance and the insured designates a beneficiary, so relationships exist among the policyholder, the insured and the beneficiary. Any two of the three form a two-node connected network. In social network analysis, such a small network is called an atom cluster. Through cluster analysis of the atom clusters, same-name disambiguation is achieved. Figures 2 and 3 show the insurance networks before and after name disambiguation. Comparing the simulation results shows that the clusters found can effectively distinguish customers with the same name.
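The pairwise links described above can be built directly from policy records. The sketch below is illustrative only: the record layout, the field names `policyholder`, `insured` and `beneficiary`, and the customer numbers are hypothetical and do not come from the data set.

```python
from collections import defaultdict
from itertools import combinations


def relation_network(policies):
    """Link every pair of distinct parties that appear on the same policy."""
    neighbours = defaultdict(set)
    for policy in policies:
        parties = {policy["policyholder"], policy["insured"], policy["beneficiary"]}
        for a, b in combinations(sorted(parties), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours)


# Hypothetical records: customer numbers 101 and 204 carry the same name.
policies = [
    {"policyholder": 101, "insured": 102, "beneficiary": 103},
    {"policyholder": 204, "insured": 204, "beneficiary": 205},
]
network = relation_network(policies)
print(network[101])
print(network[204])
print(network[101].isdisjoint(network[204]))
```

In this toy case the two same-named customers have no neighbours in common, which is the kind of structural evidence that clustering over atom clusters can use to keep them apart.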

Conclusions
A novel improved name disambiguation method based on cluster analysis is introduced, in which atom clusters are used to improve the traditional clustering algorithm. The comparative experiment shows that the method can solve the name disambiguation problem and that its execution efficiency satisfies practical demands. The method has been successfully applied in an insurance company. Future work is to improve the efficiency of the algorithm, and to extract other named entities and combine them with text clustering to disambiguate customer names.
Further work is also needed to evaluate the performance difference between the proposed method and other name disambiguation methods.

Figure 1. Name disambiguation flow based on atom clusters.