Identification of Categorical Registration Data of Domain Names in Data Warehouse Construction Task

This work is dedicated to the formation of a data warehouse for processing a large volume of domain name registration data. Data cleaning is applied in warehouses to detect and remove errors and discrepancies in data, which improves data quality and thereby increases the effectiveness of decision support. For this purpose, fuzzy record comparison algorithms for cleaning domain name registration data are reviewed in this work. An identification method for domain name registration data for data warehouse formation is also proposed. Decision-making algorithms for identification of registration data are implemented in DrRacket and Python.


Introduction
Decision support systems (DSS) have long been used to form administrative decisions, i.e. to discover the best alternative. The majority of commercial, social and governmental organizations no longer make serious decisions without using elements of computer analysis [1]. At all stages of complex administrative decision making, from identification of the problem to control of the realization of the decision made, the more accurate, timely and complete the relevant information is, the more effective the decision will be. Information support of the decision-making process determines the effectiveness of the decision at all stages [2].
The modern approach to automating decision support is based on the data warehouse (DW) concept. Rapid development of information technologies, and of data collection, storage and processing tools in particular, makes it possible to collect vast volumes of data that require analysis. A DW gives analysts, executives and top managers the capability to study large volumes of interdependent data using fast interactive presentation of information at different levels of detail and from different points of view, in accordance with the user's notion of the subject field [3,4].
Bill Inmon, the author of the DW concept, defined warehouses as "subject oriented, integrated, unchanged, supporting data collection, organized to support management", designed to act as a "unique and only source of truth", providing managers and analysts with the trustworthy information necessary for operative analysis and decision making [5,6]. Richard Hackathorn, another founder of this concept, wrote that the objective of a DW is to provide a "unique image of existing reality" for organizations [7]. Ralph Kimball, one of the DW authors, described the warehouse as "a place, where people can access their data". He also formulated the main requirements for data warehouses in [8,9].
For a correct understanding of the DW concept, the following principal points must be understood: the DW concept is not a concept of data analysis, but rather a concept of preparing data for analysis; the DW concept does not predetermine the architecture of the target analytical system. It describes the processes to be implemented in the system, but not where precisely and how these processes must be implemented. The main objective of warehouses is the creation of a single logical representation of data contained in heterogeneous databases (DB), or in other words, a single corporate data model. Moving data from one DB to another does not cause difficulties. However, if we simply relocate the data from all sources to a single DB, we obtain a "dump", an uncoordinated set of data. To create something more accessible for analysis by the end user, it is necessary to coordinate the data entering the DB from the warehouse's data sources; in other words, to solve the main task of DW construction: to create the most coordinated, subject-oriented, integrated, time-dependent data set.
Warehouses are, as a rule, filled with information from several data sources. Development and subsequent support of high-quality information storage and processing systems is a complex problem. The human factor and partial absence of control at submission lead to distortions in data. Misprints and omissions are present in almost all attributes of stored objects, as well as in identifying attribute sets. At the stage of entering information into the database, the human factor is the main source of distortions. Damerau showed that about 80% of human typing errors are single-character errors: one insertion, deletion, substitution, or transposition of two adjacent characters [10].
During DW development, very little attention is paid to cleaning incoming information. Apparently, the larger the warehouse, the better it is considered to be. This is an erroneous practice and the surest way to turn a data warehouse into a dump. Data must be cleaned, since information is heterogeneous and is collected from different sources.
It is precisely the presence of many information collection points that makes the cleaning process especially relevant. Generally speaking, errors are always made, and it is impossible to eliminate them completely. Sometimes it may be more reasonable to accept them than to spend money and time getting rid of them. In general, however, the aim should be to reduce the number of errors to an acceptable level.
There are different kinds of errors, including errors characteristic of a certain subject field or task. Errors that do not depend on the task include: contradictory information; data omissions; anomalous values; noise; data entry errors, etc. There are different solutions for each of these problems. Data omissions are a very serious problem for the majority of DWs.
Due to lack of information, DW application is in practice realized poorly or with significant limitations. The number of errors during data input is excessive, for example misprints, deliberate data distortions and inconsistent formats, not counting typical errors related to the specifics of the data input application.

Objective
A domain is a region of the domain namespace. It is characterized by independent allocation of data, inclusion of an information system in the domain contents, and the presence of special information systems (DNS servers) that contain data on the domain names allocated in the domain and carry out the function of organizing the domain name space [11].
Registration data of a domain include: domain name (domain), registrar identifier (registrar), full name of the physical person (person), contact address of the physical person (address), domain administrator identifier (admin-o), organization identifier for administrative communication (admin-c), title of organization (organization), domain registration time (created), domain free date (free-date), telephone numbers with international codes (phone), e-mail address (e-mail), list of DNS servers supporting the domain (nserver), domain type (type), information source (source), domain registration payment time (paid-till) [12].
In this work, domain name registration data are cleaned, and an identification method for domain name registration data is developed based on decision trees. Decision trees are selected as the main algorithmic approach for constructing an effective data integration system.
During data warehouse construction, problems arise from misprints and omissions in data. As domain registration data consist mainly of categorical data, it was decided to apply fuzzy search algorithms to clean them.
The following tasks are formulated for the research: 1) Processing of domain name registration data and data cleaning using the Damerau-Levenshtein algorithm; 2) Identification of domain name registration data by constructing a decision tree.

Algorithm Choice
The generalized string matching task, which includes detection of substrings in text strings, is also called the fuzzy string matching task. This task is important in cases where errors must be taken into account. Reviews of fuzzy comparison algorithms are provided in [13][14][15][16][17].
Fuzzy search algorithms (also known as fuzzy string search) are the foundation of spell checkers and full-fledged search engines such as Google or Yandex. There are many methods of analysis [18][19][20][21][22][23][24][25] and of fuzzy (inexact) string matching [26][27][28][29]. The most popular fuzzy string matching methods are those based on calculating an edit distance [30][31][32]. Generally, an edit distance is a metric that numerically expresses the cost of transforming one string into another. Several different operations are allowed, each of which can have its own cost: insertion, deletion and replacement of a character, and transposition of adjacent characters. Different fuzzy string matching algorithms are based on different edit distances. The Hamming distance is the number of positions at which the corresponding symbols of two words of equal length differ [33]. More generally, the Hamming distance is applied to strings of the same length over any q-ary alphabet and serves as a difference metric for objects of equal dimension. The Hamming distance is commonly used in bioinformatics and genomics.
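As a minimal illustration of the Hamming distance defined above, the following Python sketch counts differing positions in two equal-length strings. The function name `hamming_distance` is our own choice, and the sample strings are taken from the misprint examples discussed later in this paper.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # The Hamming distance is only defined for strings of equal length.
        raise ValueError("strings must have equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# "KANATA" and "CANADA" differ in the first and fifth positions.
print(hamming_distance("KANATA", "CANADA"))  # 2
```

Because insertions and deletions change the string length, this measure cannot compare values such as "RUSSIAN" and "RUSSIA", which motivates the edit distances discussed next.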
If matching of two strings of different lengths is allowed, then insertion and deletion are, as a rule, also required. If they are given the same weight as replacement, the minimal total cost of transformation equals one of the metrics proposed by Levenshtein [34]. As a result, the more general task for an arbitrary alphabet became associated with his name. Gusfield made a significant contribution to the study of this issue [35].
Levenshtein distance and its generalizations are actively applied for correcting errors in words (in search systems, databases, during text input, during automatic recognition of scanned text or speech), for comparing text files, and in bioinformatics for comparing genes, chromosomes and proteins. From the application point of view, determining the distance between words or text fields according to Levenshtein has the following disadvantages: 1) comparatively large distances result from rearrangement of words or parts of words; 2) distances between completely different short words are small, while distances between very similar long words are significant. Analysis of different DNS databases shows that the contents of the same fields of domain name registration records can be expressed in different (non-identical) forms. The differences in field values may be caused by lack of information, misprints, use of abbreviations, duplication of records, etc.
When domain name registration data are entered into the DW, abbreviations, misprints, omissions, duplicate records and other distortions are encountered. To increase the quality of input registration data such as "registrar", "person", "address", "organization", "admin-o", etc., errors, inconsistencies and duplications in records must be prevented. For example, names of countries and cities can contain misprints such as Kanata (Canada), Russian (Russia), Frankfrut Am Main (Frankfurt Am Main), etc.; the registrar value "MONIKER ONLINE SERVICES, INC." can appear as "MONIKER, INC.", "MONIKER", "MONIKERS ONLINE SERVICES". Despite the differences between these strings, it is clear that all of these titles stand for the same registrar. Note, however, that comparing the strings "MONIKER ONLINE SERVICES, INC." and "MONIKER, INC." using the Levenshtein metric yields a large edit distance. In such cases it is necessary to process the words contained in phrases separately from each other and/or to apply algorithms that can compare similar strings for equivalence (for example, full inclusion of one string in another as a subsequence) and detect them as "similar".
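The word-by-word processing described above can be sketched in Python as follows. This is only an illustration under our own assumptions, not the implementation used in this work: the helper names (`levenshtein`, `tokens`, `token_similar`), the tolerance of one typo per word, and the rule that every word of the shorter phrase must have a close match in the longer one are all hypothetical choices.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]


def tokens(phrase: str) -> list:
    """Split a phrase into upper-case words, dropping punctuation."""
    return re.findall(r"[A-Z0-9]+", phrase.upper())


def token_similar(a: str, b: str, max_typos: int = 1) -> bool:
    """Assumed rule: every word of the shorter phrase has a near match in the longer."""
    ta, tb = tokens(a), tokens(b)
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    return bool(short) and all(
        any(levenshtein(w, v) <= max_typos for v in long_) for w in short
    )


full = "MONIKER ONLINE SERVICES, INC."
print(levenshtein(full, "MONIKER, INC."))               # 16: whole-string distance is large
print(token_similar(full, "MONIKER, INC."))             # True
print(token_similar(full, "MONIKERS ONLINE SERVICES"))  # True
```

A rule of this kind is deliberately permissive: very short words such as "INC" can match unrelated words within one typo, so in practice the tolerance would need tuning against real registration data.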
In this work we use the Damerau-Levenshtein distance, a measure of the difference between two strings of symbols, defined as the minimal number of insertion, deletion, replacement and transposition (swap of two adjacent symbols) operations necessary to transform one string into the other. It is a modification of the Levenshtein distance and differs from it by the addition of the transposition operation.
The edit distance defined in this way can be calculated using dynamic programming [36]. The algorithm requires O(MN) operations, where M and N are the lengths of the compared strings: MN elements of the so-called dynamic programming matrix must be calculated to find the distance value.
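The dynamic programming matrix mentioned above can be sketched in Python as follows. Note the assumption: this sketch implements the restricted variant usually called optimal string alignment, in which no substring is edited more than once; it fills exactly one (M+1) by (N+1) matrix, whereas the unrestricted Damerau-Levenshtein distance needs additional bookkeeping that is not shown here. The function name `osa_distance` is ours.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    m, n = len(a), len(b)
    # d[i][j] is the distance between the first i symbols of a and the first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i symbols
    for j in range(n + 1):
        d[0][j] = j  # insert all j symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement or match
            # Transposition of two adjacent symbols.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(osa_distance("FRANKFRUT", "FRANKFURT"))  # 1: one transposition
print(osa_distance("RUSSIAN", "RUSSIA"))       # 1: one deletion
print(osa_distance("KANATA", "CANADA"))        # 2: two replacements
```

With plain Levenshtein distance the "FRANKFRUT" misprint would cost two operations; counting a transposition as a single operation is what makes this family of distances match typical typing errors more closely.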

Symbol Fields for Registration Data Recording
As a rule, symbol fields consist of a string that contains one or several words separated by spaces and punctuation marks. Nonstandard phrases are used in the fields. Data are entered manually by an operator, often in distorted form. Therefore, punctuation marks such as ".", ",", ":", the quotation mark, "-", etc. that carry no functional significance are replaced by the "blank" sign. The name of the physical person (person), contact address of the physical person (address), domain administrator identifier (admin-o), organization identifier for administrative communication (admin-c) and organization title (organization) are the key fields for identification of registration data.
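The replacement of insignificant punctuation by blanks can be sketched in Python as follows. The exact character set, the conversion to upper case and the collapsing of repeated blanks are our assumptions for illustration; the source only states that such punctuation marks are replaced by the "blank" sign. The name `normalize_field` is hypothetical.

```python
import re

# Assumed set of punctuation marks without functional significance.
PUNCTUATION = re.compile(r'[.,:;"\-]')


def normalize_field(value: str) -> str:
    """Replace punctuation by blanks, then collapse blanks and upper-case (assumed steps)."""
    cleaned = PUNCTUATION.sub(" ", value)
    return " ".join(cleaned.upper().split())


print(normalize_field("MONIKER ONLINE SERVICES, INC."))  # MONIKER ONLINE SERVICES INC
print(normalize_field("Frankfurt-am-Main"))              # FRANKFURT AM MAIN
```

Such normalization is meant for free-text fields like person, address or organization; fields where a hyphen or dot is significant, such as the domain name itself, would need to be excluded.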
Components of these fields can appear in arbitrary order, which makes automatic processing of such information difficult. To compare two strings containing such information, each field must be split into its components, and then only components with identical meaning are compared; for example, in an address, the name of the city is compared with the name of the city, and the name of the street with the name of the street. For content analysis we introduce so-called samples. For example, a sample address template is a combination of address components: [

Decision Tree
Decision trees are the most convenient decision-making method for record (object) identification because of their clarity in use, minimal computational resources and simplicity of implementation. A decision tree is one of the methods for automatic analysis of vast amounts of data. Decision trees represent rules in a hierarchical, sequential structure, where each object corresponds to a single node that gives the decision. By a rule we mean a logical construction of the form "if…then". The main advantages of decision trees are: generation of rules in fields where experts have difficulty formalizing their knowledge; extraction of rules in natural language; an intuitively understandable classification model; high prediction accuracy, comparable to other methods (statistics, neural networks); construction of non-parametric models [37].
Suppose that we are given a training set T containing objects, each of which is characterized by m attributes, one of which indicates the class to which the object belongs. The idea of constructing a decision tree from the set T, first expressed by Hunt, is presented following R. Quinlan [38].
Let us denote the classes (values of the class label) by $\{C_1, C_2, \dots, C_k\}$. Then three situations are possible:
1) The set T contains one or more examples, all belonging to one class $C_k$. Then the decision tree for T is a leaf identifying the class $C_k$.
2) The set T contains no examples, i.e. it is an empty set. Then it is again a leaf, and the class associated with the leaf is selected from a set other than T, for example from the set associated with the parent.
3) The set T contains examples belonging to different classes. In this case T must be divided into subsets. For this purpose, we choose one of the attributes that has two or more distinct values $O_1, O_2, \dots, O_n$. T is divided into subsets $T_1, T_2, \dots, T_n$, where each subset $T_i$ contains all examples that have the value $O_i$ for the selected attribute. This procedure continues recursively until each final set consists of examples belonging to the same class.
The procedure described above is the basis of many modern decision tree construction algorithms. Obviously, with this methodology the decision tree is constructed from the top down. Currently there is a significant number of algorithms implementing decision trees: CART, C4.5, NewId, ITrule, CHAID, CN2, etc. [39,40]. The two most widespread and popular are CART (Classification and Regression Tree) and C4.5.
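The three-case recursive procedure described above can be sketched in Python as follows. This is a minimal illustration, not C4.5 or CART: the attribute is chosen simply as the first one with two or more distinct values, with no information-gain criterion, and the names `build_tree`, `classify` and `majority_class` as well as the toy training set are hypothetical.

```python
from collections import Counter


def majority_class(examples):
    """Most frequent class label among (features, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def build_tree(examples, attributes, parent_class=None):
    """Return a leaf (class label) or a node (attribute, {value: subtree})."""
    if not examples:                       # case 2: empty set, use the parent's class
        return parent_class
    classes = {label for _, label in examples}
    if len(classes) == 1:                  # case 1: a single class, make a leaf
        return classes.pop()
    for attr in attributes:                # case 3: look for an attribute to split on
        values = {features[attr] for features, _ in examples}
        if len(values) >= 2:
            break
    else:
        return majority_class(examples)    # no attribute separates the examples
    remaining = [a for a in attributes if a != attr]
    default = majority_class(examples)
    branches = {}
    for value in sorted(values):
        subset = [(f, l) for f, l in examples if f[attr] == value]
        branches[value] = build_tree(subset, remaining, default)
    return (attr, branches)


def classify(tree, features, unknown="?"):
    """Walk the tree; an attribute value never seen in training yields `unknown`."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(features[attr], unknown)
    return tree


# Toy training set: outcomes of comparing two records, labelled by hand.
training = [
    ({"person": "+", "address": "+"}, "="),
    ({"person": "+", "address": "-"}, "≠"),
    ({"person": "-", "address": "+"}, "≠"),
    ({"person": "-", "address": "-"}, "≠"),
]
tree = build_tree(training, ["person", "address"])
print(tree)
print(classify(tree, {"person": "+", "address": "+"}))  # =
```

In this sketch subsets are built only from values that occur in the data, so the empty-set case arises only when `build_tree` is called on an empty set directly; it would occur naturally if branches were created for every value in an attribute's declared domain.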
The decision tree is generated in the following way: for exact identification of objects, we use fields that uniquely identify the object. Domain registration data are identified by the following attributes: registrar, person, address, admin-o, admin-c, organization, created, updated, free-date, phone, e-mail, nserver, type, source. Intermediate nodes of the tree correspond to these attributes, and arcs correspond to the possible outcomes of comparing attribute values: "+" (same), "±" (similar), and "-" (different). Tree leaves are labeled with one of three classes: "=" (the compared objects are identical), "≠" (the objects are different), "?" (unknown). An example of the tree for identification of domain name registration data is provided in Figure 1.
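The way such a tree is traversed can be sketched in Python as follows. Everything specific here is an assumption for illustration only: the small tree `TREE` is hypothetical and does not reproduce Figure 1, the sample records are invented, the threshold of two edits for "similar" is arbitrary, and plain Levenshtein distance stands in for the Damerau-Levenshtein distance to keep the sketch short.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in for Damerau-Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def compare(a: str, b: str, max_typos: int = 2) -> str:
    """Map two attribute values to an arc label: same, similar or different."""
    if a == b:
        return "+"
    return "±" if edit_distance(a, b) <= max_typos else "-"


# Hypothetical expert-defined tree: (attribute, {arc label: subtree or leaf class}).
TREE = ("registrar", {
    "-": "≠",
    "±": "?",
    "+": ("person", {
        "+": "=",
        "±": ("address", {"+": "=", "±": "?", "-": "≠"}),
        "-": "≠",
    }),
})


def identify(rec_a: dict, rec_b: dict, tree=TREE) -> str:
    """Follow arcs chosen by attribute comparisons until a leaf class is reached."""
    node = tree
    while isinstance(node, tuple):
        attr, arcs = node
        node = arcs[compare(rec_a[attr], rec_b[attr])]
    return node


a = {"registrar": "MONIKER", "person": "JOHN SMITH", "address": "FRANKFURT AM MAIN"}
b = {"registrar": "MONIKER", "person": "JON SMITH", "address": "FRANKFURT AM MAIN"}
c = {"registrar": "MONIKER", "person": "JON SMITH", "address": "FRANKFRUT AM MAIN"}
print(identify(a, b))  # =
print(identify(a, c))  # ?
```

Placing the most significant attributes near the root means that many record pairs are decided after one or two comparisons, which keeps the computational cost of identification low.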
The most significant attributes are: registrar identifier (registrar), full name of the physical person (person), contact address of the physical person (address), domain administrator identifier (admin-o), organization identifier for administrative communication (admin-c), title of organization (organization), list of DNS servers supporting the domain (nserver).
Less significant but still informative attributes are: domain registration time (created), domain update time (updated), domain free date (free-date), telephone numbers with international codes (phone), e-mail addresses (e-mail), domain type (type), domain registration payment time (paid-till), information source (source), etc.
The decision tree is formed based on the knowledge of subject field experts. The depth of the tree is selected according to the objects that must be precisely identified.

Software Implementation
The identification algorithms for domain name registration data are implemented in the Racket language using the DrRacket environment. Racket (formerly PLT Scheme) is a multi-paradigm programming language in the Lisp/Scheme family, which also serves as a platform for language creation, design and implementation. The language is known for its extensive macro system, which allows creating embedded and domain-specific languages, language constructs such as classes or modules, and separate Racket dialects with different semantics. The platform is distributed as free and open-source software under the LGPL license [41][42][43].
We use Racket to demonstrate that, without brackets and other superfluous symbols, complex program code is easy to understand. Although the Racket language is simple and easy to read, it is quite difficult to find errors made in a program. A fragment of the decision algorithm for identification of registration data in DrRacket is provided in Figure 2.
For comparison of results, the identical algorithm is implemented in Python, which is used to solve a wide range of applied programming tasks [44][45][46][47]. The same algorithm is presented as a Python program in Figure 3.
Experiments conducted with the decision tree described above gave identical positive results in the DrRacket and Python programs.
Copyright © 2013 SciRes. ICA

Conclusions
Modern approaches to solving this task involve the construction of a DW, which makes it possible to "liberate" information from the strict frames of operational systems and to better comprehend the problems of real activity.
DW provides high speed of data retrieval, the ability to retrieve and compare data, as well as consistency, completeness and authenticity of data. After extraction from the source, data are loaded into the warehouse, where cleaning is carried out with comprehensive support for data cleaning [48]. Identification of categorical domain name registration data was reviewed in this work. Abbreviations, misprints, omissions, deliberate data corruption, duplicate records, etc. that were introduced at the information collection stage are considered errors. A variety of alternative proximity functions have been proposed, but from our point of view, for cleaning and consistency checking the Damerau-Levenshtein distance most accurately corresponds to the intuitive concept of similarity. A method for identifying domain name registration records based on a decision tree was also proposed.
The developed identification method was implemented to create a DW from the information of several DNS servers, using as an example the domain name registration data serving the interests of the Republic of Azerbaijan.