Construction and Application of the Multidimensional Table for Knowledge Discovery in Ancient Chinese Books on Materia Medica

Knowledge discovery, an increasingly adopted information technology in biomedical science, has shown great promise in the field of Traditional Chinese Medicine (TCM). In this paper, we present a multidimensional table that is well suited for organizing and analyzing the data in ancient Chinese books on Materia Medica. Moreover, we demonstrate its capability to facilitate further mining work in TCM through two illustrative studies that discover meaningful patterns in the three-dimensional table of Shennong’s Classic of Materia Medica. This work might provide an appropriate data model for the development of knowledge discovery in TCM.


Introduction
Data mining and knowledge discovery, as increasingly adopted information technologies in biomedical science, have shown great promise in the field of Traditional Chinese Medicine (TCM) for years. Based on a different view of human life and disease, TCM has developed a distinct medical system for diagnosis and treatment over thousands of years and has accumulated a large body of medical and pharmaceutical data [1]. In recent years, it has been increasingly adopted as an important complementary therapy around the world [2] and has attracted researchers from different areas to mine the "knowledge gold" buried in the mountains of TCM data [3][4][5]. Thus, data mining techniques are believed to be able to bridge the gap between the availability of large amounts of data and the difficulty of obtaining novel knowledge about TCM, especially about medical theories such as yin-yang and the five elements.
Drawing on the rich dialectical thought of ancient Chinese philosophy, TCM views the world and the human body as a whole and analyzes their relationship through the yin-yang and five elements theories. These theories build a universal foundation for the specific theories related to diagnosis and treatment, such as syndrome differentiation theory, Zang Fu theory, and Chinese herbal medicine theory [1]. Among them, Chinese herbal medicine theory (herbal property, compatibility, the multiple effectiveness of herbal medicine, etc.) is believed to be a potential breakthrough point for TCM modernization and is worthy of further investigation. Thus, Chinese Herbal Medicine Informatics (CHMI) has gradually arisen [6][7][8], and ancient Chinese books on Materia Medica, the conventional media storing information on medicinal herbs, are often the preferred materials for study.
Shennong's Classic of Materia Medica (SCMM), also known as Shennong Bencao Jing, is one of the great classics of herbal pharmacology and the earliest extant one. The book records 365 kinds of Chinese medicines and covers many of their aspects, such as alias, qi and flavor, efficacy and origin. More than 170 kinds of diseases are discussed, including diseases of internal medicine, surgery, gynecology, pediatrics, etc. [9]. Since many of the recorded herbs are still used in TCM therapies today, SCMM has received considerable attention in modern research. However, because the text is written in ancient Chinese vocabulary, expert data cleansing and integration are needed to make it accessible to modern researchers.
Moreover, for data mining to be effective and valuable, the data source must be credible and the results must contribute to the acquisition of new knowledge. For knowledge discovery in TCM, three aspects of data quality should be highlighted to improve data credibility: representation granularity, representation consistency and completeness [10]. Another key issue is the transformation of computer-generated data mining results into novel TCM knowledge. As a solution, Wang proposed the two-cycle model in 2008 [11], which attaches importance to the collaboration between medical researchers and data mining researchers.
In this paper, we intended to establish a multidimensional table to manage the herbal information contained in Materia Medica books and to allow the data to be easily accessed and analyzed. Taking SCMM as an example, we constructed a three-dimensional table that presented the major aspects of herbs, including herbal qi, herbal flavor and herbal efficacy. Furthermore, we applied the three-dimensional table of SCMM to mining novel knowledge related to Chinese herbal theory. This framework might serve as a helpful tool for information management and understanding in TCM.
The rest of the paper is organized as follows. Section 2 describes the process of constructing the multidimensional table for appropriate organization of the information contained in SCMM. Section 3 presents two application examples involving association rules mining and clustering analysis. Finally, we provide the conclusions in Section 4.

The Construction of Multidimensional Table
Ancient Chinese materia medica books are among the most important TCM resources for data mining and constitute the foundations of CHMI. Such a book serves as a practical manual of TCM drug therapy: the information on herbal names and botanical origins helps ensure that the correct medicinal herbs are used, while the information on herbal property and efficacy reflects the direct experience of TCM practitioners in clinical drug use [12,13].
From the viewpoint of data management, the text in these books shares the common features of semi-structured data, which contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data [14]. For example, SCMM is composed of 365 medicine records written in classical Chinese. Each record follows a fixed format that can be divided into six parts: herbal name, herbal qi, herbal flavor, herbal efficacy, alias and source land (Figure 1). The first four parts, which form the main body of Chinese herbal theory, were selected in this work to construct the multidimensional table.

Table Structure
A multidimensional table is a multidimensional array consisting of records (rows) and fields (columns), which is suited for organizing and analyzing the data in ancient Chinese books on Materia Medica. In the data table of SCMM, each row represented a single herbal medicine. Herbal qi, herbal flavor and herbal efficacy, which are among the most significant parameters defining the clinical performance of medicinal herbs, were employed as fields. Each of the first two fields could be split into five categories because its data are already structured. The efficacy field, in contrast, is presented as semi-structured text and could be split into a fixed number of categories only after appropriate data integration. The resulting table therefore had three dimensions, since each categorized variable represented one dimension. The final data model is shown in Table 1, which also contains a unique identifier (Herb ID) and the herbal name. The details of each dimension are as follows:
1) Herbal qi dimension: Structured data with five attributes (equivalent to the categories of the field in this paper): cold, cool, neutral, warm and hot. Only one attribute can serve as the marker for each herb in this dimension.
2) Herbal flavor dimension: Structured data with five attributes: pungent, sweet, sour, bitter and salty. Only one attribute can serve as the marker for each herb in this dimension.
3) Herbal efficacy dimension: Semi-structured data that can be divided into a finite number of attributes after data integration. Several attributes can serve as the markers for each herb in this dimension.
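The constraints of the three dimensions can be expressed as a small record type. The following is a minimal sketch, not the authors' implementation: the class name `HerbRecord`, its field names and the sample efficacy terms are assumptions made for illustration, and only the attribute lists for qi and flavor come from the text.

```python
from dataclasses import dataclass

# Attribute lists taken from the description of the first two dimensions.
QI = ("cold", "cool", "neutral", "warm", "hot")
FLAVOR = ("pungent", "sweet", "sour", "bitter", "salty")


@dataclass(frozen=True)
class HerbRecord:
    """One row of the table (hypothetical layout, for illustration only)."""
    herb_id: int
    name: str
    qi: str              # exactly one attribute from QI
    flavor: str          # exactly one attribute from FLAVOR
    efficacy: frozenset  # one or more integrated efficacy terms

    def __post_init__(self):
        # Enforce "only one attribute per herb" for the structured dimensions.
        if self.qi not in QI:
            raise ValueError(f"unknown qi attribute: {self.qi!r}")
        if self.flavor not in FLAVOR:
            raise ValueError(f"unknown flavor attribute: {self.flavor!r}")
        # The efficacy dimension allows several attributes but not zero.
        if not self.efficacy:
            raise ValueError("at least one efficacy attribute is required")


# Illustrative values only; the flavor and efficacy terms are placeholders,
# not data extracted from SCMM.
ginseng = HerbRecord(
    herb_id=1,
    name="ginseng",
    qi="cool",
    flavor="sweet",
    efficacy=frozenset({"nourishing essence-spirit", "promoting longevity"}),
)
```

Holding the efficacy terms in a set reflects that several attributes may mark one herb, while the single string fields reflect the one-attribute rule of the qi and flavor dimensions.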

Data Preprocess
Since most ancient Chinese Materia Medica books are written in classical Chinese and exist in different versions, data preprocessing (e.g. data cleaning, data integration and annotation) is indispensable for ensuring data quality. In this work, to resolve synonyms among efficacy terms in classical Chinese, several ancient and contemporary references, including Zhu Bing Yuan Hou Lun [15], Internal Medicine of TCM [16], Surgery of TCM [17], Obstetrics and Gynecology of TCM [18] and two proofreading and annotation books for SCMM [19,20], were employed to achieve representation consistency.
Finally, 196 items were acquired as attributes of the efficacy dimension. Thus, the semi-structured data records presented in Figure 1 can be converted into the data table shown in Table 2.
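The synonym integration step can be pictured as a lookup from raw efficacy terms to canonical attributes. The sketch below is only an assumption about how such a step might look: the dictionary `SYNONYMS`, the function `normalize_terms` and the English placeholder terms are hypothetical, whereas the actual mapping was compiled by experts from classical Chinese sources.

```python
# Hypothetical synonym table: raw term -> canonical efficacy attribute.
# The English entries are placeholders for expert-curated classical terms.
SYNONYMS = {
    "clearing heat": "clearing heat",
    "removing heat": "clearing heat",
    "dispelling heat": "clearing heat",
    "prolonging life": "promoting longevity",
    "promoting longevity": "promoting longevity",
}


def normalize_terms(raw_terms, synonyms=SYNONYMS):
    """Map raw terms to canonical attributes.

    Returns (canonical, unmapped): a sorted list of distinct canonical
    attributes and a list of terms with no entry in the table, which
    would be passed back to a domain expert instead of being guessed.
    """
    canonical = set()
    unmapped = []
    for term in raw_terms:
        key = term.strip().lower()
        if key in synonyms:
            canonical.add(synonyms[key])
        else:
            unmapped.append(term)
    return sorted(canonical), unmapped


attrs, todo = normalize_terms(["Removing heat", "clearing heat", "calming fright"])
```

Returning unmapped terms separately keeps the expert in the loop, in line with the emphasis on representation consistency.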
After the attributes of the three dimensions had been defined separately, a three-dimensional table was constructed in Excel file format. Each row of the table represented the information of a single herbal medicine. The medicine was located in the table using Boolean values: a cell was set to 1 if the medicine had the corresponding attribute and to 0 if it did not (Table 3). Taking ginseng as an example, the cell identified by the row of ginseng and the column (attribute) of cool was 1, while the other values in this dimension were 0, because the herbal qi of ginseng was cool.
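The Boolean encoding described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' Excel workflow: the function `build_boolean_table`, the dictionary layout of the input records and the sample herbs are hypothetical.

```python
QI = ("cold", "cool", "neutral", "warm", "hot")
FLAVOR = ("pungent", "sweet", "sour", "bitter", "salty")


def build_boolean_table(herbs):
    """Encode herb records as rows of 0/1 values.

    `herbs` is a list of dicts with keys "name", "qi", "flavor" and
    "efficacy" (a set of integrated terms); this layout is an assumption.
    Returns (columns, rows) where columns are (dimension, attribute) pairs.
    """
    efficacy_attrs = sorted({e for h in herbs for e in h["efficacy"]})
    columns = (
        [("qi", a) for a in QI]
        + [("flavor", a) for a in FLAVOR]
        + [("efficacy", a) for a in efficacy_attrs]
    )
    rows = []
    for h in herbs:
        row = (
            [int(h["qi"] == a) for a in QI]               # exactly one 1
            + [int(h["flavor"] == a) for a in FLAVOR]     # exactly one 1
            + [int(a in h["efficacy"]) for a in efficacy_attrs]  # several 1s allowed
        )
        rows.append(row)
    return columns, rows


# Placeholder records; apart from ginseng's qi, the values are illustrative.
sample = [
    {"name": "ginseng", "qi": "cool", "flavor": "sweet",
     "efficacy": {"nourishing essence-spirit", "promoting longevity"}},
    {"name": "herb B", "qi": "cold", "flavor": "bitter",
     "efficacy": {"clearing heat"}},
]
columns, rows = build_boolean_table(sample)
```

For the full SCMM table this layout would give 5 + 5 + 196 = 206 Boolean columns per herb, in addition to the identifier and name.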

The Application of Multidimensional Table
With the multidimensional table, the information in ancient Chinese materia medica books was appropriately digitized, which could facilitate further data mining work. The complete three-dimensional table of SCMM consisted of 365 herb records, with 5 attributes in the herbal qi dimension, 5 attributes in the herbal flavor dimension and 196 attributes in the herbal efficacy dimension. Two data mining studies, an association rules mining [21] and a cluster analysis [22], were then implemented to search for correlations between attributes and between records, respectively (Figure 2). They would contribute to the acquisition of novel knowledge about Chinese herbal theory.

Note to Table 3: Five attributes in the herbal efficacy dimension were chosen for display, among them clearing heat and curing war wounds.

Association Rules Mining
In this section, frequent patterns and valuable association rules between attributes in the herbal qi/flavor dimensions and the herbal efficacy dimension were mined. These association rules demonstrated strong relations between herbal property and herbal efficacy, promoting the understanding of Chinese herbal theory. With suitable parameter settings, we acquired 115 strong association rules using the Apriori algorithm (Table 4), which provided evidence for inferring the qi/flavor of a medicinal herb with a specific efficacy. Some of the efficacy attributes in Table 2, such as promoting longevity, clearing heat and warming the middle qi, appeared among these rules.
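The support and confidence measures behind rules of the form efficacy ⇒ qi or efficacy ⇒ flavor can be sketched as follows. This is a simplified illustration rather than the Apriori run reported here: it handles only single-antecedent, single-consequent rules, so Apriori's level-wise candidate generation is not needed, and the thresholds, the function name `mine_rules` and the toy records are assumptions (the paper does not state its parameter values).

```python
from collections import Counter


def mine_rules(records, min_support=0.5, min_confidence=1.0):
    """Find rules efficacy => (dimension, attribute) above both thresholds.

    support    = count(efficacy and consequent) / number of records
    confidence = count(efficacy and consequent) / count(efficacy)
    """
    n = len(records)
    antecedent_counts = Counter()
    pair_counts = Counter()
    for rec in records:
        consequents = [("qi", rec["qi"]), ("flavor", rec["flavor"])]
        for eff in set(rec["efficacy"]):
            antecedent_counts[eff] += 1
            for cons in consequents:
                pair_counts[(eff, cons)] += 1

    rules = []
    for (eff, cons), count in pair_counts.items():
        support = count / n
        confidence = count / antecedent_counts[eff]
        if support >= min_support and confidence >= min_confidence:
            rules.append((eff, cons, support, confidence))
    # Strongest rules first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Toy records, not SCMM data.
toy = [
    {"qi": "hot", "flavor": "pungent", "efficacy": {"warming the middle qi"}},
    {"qi": "warm", "flavor": "pungent",
     "efficacy": {"warming the middle qi", "relieving cough"}},
    {"qi": "cold", "flavor": "bitter", "efficacy": {"clearing heat"}},
    {"qi": "cold", "flavor": "bitter",
     "efficacy": {"clearing heat", "removing water retention"}},
]
rules = mine_rules(toy)
```

On the toy data, "warming the middle qi" predicts the pungent flavor with full confidence but does not predict a single qi value, which shows how a rule can hold for one dimension and fail for the other.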

Table 4 (fragment recovered from the source). Example rules with a flavor consequent: Warming the middle qi ⇒ Pungent; Relieving cough with dyspnea ⇒ Pungent; Nourishing essence-spirit ⇒ Sweet; Removing water retention ⇒ Bitter. Efficacy ⇒ Qi∧Flavor (9): Warming the middle qi ⇒ Pungent∧Hot; Resolving hard mass in stomach and intestine ⇒ Bitter∧Cold.

Cluster Analysis
In this section, a classification study was implemented using a semi-supervised incremental clustering algorithm.
After calculating the Jaccard index of similarity between every pair of herb records, we first selected the micro-clusters whose members had exceptionally close correlations. Then a k-nearest neighbor algorithm (k = 3) was used to classify the rest of the herbs. The results showed that 253 herbal medicines were reasonably classified into 14 types, such as invigoration, clearing heat, diuresis, treating impediment disease and treating gynecological disease, while the other 112 medicines each formed an individual type. Equally high similarity to several different known types might be the main reason why these herbs remained unassigned. Table 5 shows the major clusters involving more than 10 herbs.
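The two ingredients named above, the Jaccard index and a k-nearest neighbor vote with k = 3, can be sketched as below. This is a minimal illustration under assumptions, not the semi-supervised incremental algorithm of [22]: micro-cluster selection is omitted, the seed labels are taken as given, and returning `None` on a tied vote is only one possible reading of how herbs might end up in individual types. The function names and toy data are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two attribute sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_label(query, labelled, k=3):
    """Assign the majority label of the k most similar labelled herbs.

    `labelled` is a list of (attribute_set, label) pairs. Returns None
    when the top vote is tied, leaving the herb unassigned.
    """
    ranked = sorted(labelled, key=lambda item: jaccard(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None
    return votes[0][0]


# Toy seed herbs, each described by the attributes set to 1 in its row.
seeds = [
    ({"qi:cold", "efficacy:clearing heat"}, "clearing heat"),
    ({"flavor:bitter", "efficacy:clearing heat"}, "clearing heat"),
    ({"qi:cold", "efficacy:diuresis"}, "diuresis"),
    ({"flavor:sweet", "efficacy:invigoration"}, "invigoration"),
]
label = knn_label({"qi:cold", "flavor:bitter", "efficacy:clearing heat"}, seeds)
```

Because every herb row is a Boolean vector, representing it as the set of attributes equal to 1 makes the Jaccard index a natural similarity measure: shared zeros, which dominate a 206-column table, do not inflate the score.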

Conclusion
Data mining is a promising technology that can be applied to analyzing vast amounts of TCM data in the search for novel knowledge. In this paper, we provided a multidimensional table suited to the data in ancient Chinese materia medica books, in order to help researchers manage the data efficiently. Moreover, we introduced two illustrative studies that mined meaningful patterns in the three-dimensional table of SCMM. The results provided evidence that the multidimensional table could facilitate data mining work in TCM.

Figure 2. Two data mining studies based on the three-dimensional table of SCMM.