Data mining of hospital characteristics in online publication of medical quality information

Information disclosure can reduce information asymmetry between health care providers and patients, thus improving both patient safety and medical quality. The Bureau of National Health Insurance (BNHI) in Taiwan currently publishes health-related information online in order to enhance service efficiency and enable the public to monitor the country’s medical system. A data mining technique, classification and regression tree (CART), is used in this work to investigate publicly available online quality information and to compare the characteristics of hospitals. The hospital quality indicators and characteristics data are available on the websites of the BNHI (http://www.nhi.gov.tw/AmountInfoWeb/Index.aspx) and the Department of Health (http://www.doh.gov.tw/). The full classification and regression tree presented in this work, grown using the hospitals’ medical quality indicators and characteristic values, classifies all hospitals into seven groups. The rate of stays longer than 30 days, which is the dependent variable in this study, is most influenced by the number of medical staff. This reflects the fact that the fewer medical staff a hospital employs, the smaller the hospital is, and patients who are likely to have longer stays tend to go to medium or large hospitals. Policy makers should work to decrease or eliminate persistent healthcare disparities among different socioeconomic groups and offer more online health-related services to reduce information asymmetry between health care providers and patients.


INTRODUCTION
Advances in information, telecommunication, and network technologies have led to the emergence of a revolutionary new paradigm for health care that some refer to as e-health. By reducing information asymmetry between health care providers and patients, information disclosure can improve both patient safety and medical quality. With more disclosure, patients can search for more appropriate health-related knowledge, while health care providers are encouraged to provide higher quality services in order to attract patients [1][2][3].
Efforts to reduce information asymmetry put pressure on health care providers, in the hope that public awareness and informed consumers and patients will indirectly lead to improvements in quality of care [4,5]. For example, if information asymmetry exists with regard to prices, then this may cause increased prices for health services, because information holders, i.e. the health care providers, can charge monopoly prices [6]. In order to reduce information asymmetry, the information disclosed should also include unobservable quality measures in health care services, such as hospital mortality rates or the extent to which appropriate care is provided.
In the Health Maintenance Organization (HMO) market in the US, the quality scores of health care service providers have been disclosed since 1996 as part of the Health Plan Employer Data and Information Set (HEDIS). The demand for quality information came from HMOs themselves, who wished to demonstrate their quality improvement efforts [7]. Since HMOs with low quality scores are less likely to attract customers, such companies choose not to disclose such information, especially if nondisclosure carries little stigma. This implies that voluntary disclosure of quality data, the national mechanism for HMO quality oversight in the US, is failing to meet its stated goals of improving consumer decision making, providing incentives to raise quality, and increasing public accountability [8]. However, Jung (2010) finds positive effects of disclosure on HMO quality, supporting the view that the public release of quality information may lead to improvements in quality [9]. High-quality care may be offered in markets with consumers who have greater willingness to pay for quality, and high-quality plans, which benefit from data disclosure, tend to release quality information to the public voluntarily.
In 1995, the National Health Insurance (NHI) program was introduced in Taiwan, offering a comprehensive, unified, and universal health care system. The NHI program is a single-payer program managed by the Bureau of National Health Insurance (BNHI), a government agency, to offer comprehensive benefit coverage. To raise the quality of medical care services, mandatory publication of service quality information was adopted in 2007. The Department of Health (DOH) is responsible for assisting in the selection of medical care quality indicators and related information, as well as for cooperative information planning with the BNHI. All the relevant medical care quality indicators were selected by the NHI Committee on Quality of Medical Care Services (CMCQ), composed of clinical medicine, medical management, information, law, and health education experts, as well as representatives from consumers, patients, the media, and various associations.
Under the quality assurance programs agreed with the medical sectors, health care quality indicators are established and quality information is posted on the BNHI website as a reference for medical institutions, to help them continue improving the quality of their care. The BNHI is committed to making health-related information more open and transparent to improve service efficiency and enable the public to monitor the country's medical system. A total of 73 quality indicators are posted on the agency's website (http://www.nhi.gov.tw/AmountInfoWeb/Index.aspx), which had received 3,497,611 hits as of the end of March 2012.
In practical terms, the timely availability of information is required for effective public health policy making and decision making [10]. However, online publication of quality information only reports the reference statistics, and not the meanings that they contain. Data mining is the analysis of data sets to find hidden relationships and extract useful patterns and rules. Data mining in medicine is most often used for building classification models for diagnosis, prognosis or treatment planning [11][12][13][14]. Healthcare processes are typically data rich, and thus many patterns can be discovered by the use of different algorithms [15][16][17][18], following techniques derived from machine learning, artificial intelligence, and statistics. The effectiveness of data mining has been proven in improving marketing campaigns, detecting fraud, and predicting diseases based on medical records [17,19,20].
As noted above, with advances in information and communication technologies, the widespread use of the Internet has a number of implications for medical practice [21][22][23][24]. The use of patient population characteristics as surrogates for the characteristics of a particular patient has been widely reported in the medical literature [20,25,26,27]. However, there are few studies exploring the providers' characteristics when health care information is disclosed [28,29]. This study thus uses data mining to investigate the differences in quality information in order to compare hospitals. The aim of this study is to identify the conditions that facilitate voluntary disclosure of quality information, and to further develop policies to encourage this, so that the characteristics of a hospital, such as its human resources, can be considered by patients.

Data Mining
Over the last decade, there has been widespread use of medical information systems and an explosive growth of medical databases. Data mining is defined as the nontrivial extraction of implicit, previously unknown and potentially useful information from data [30]. To aid healthcare management, data mining applications can be developed to better identify and track chronic disease states and high-risk patients, design appropriate interventions, and reduce the number of hospital admissions and claims.
In general, data mining in medicine is most often used for building classification models, these being used for diagnosis, prognosis or treatment planning. Among the various data mining methods, the classification and regression tree (CART) technique, which uses a top-down greedy approach to tree construction, is used to produce accurate predictions or classifications based on a few logical if-then conditions. The main difference between decision trees and regression trees is that decision tree construction involves classification into a finite set of discrete classes, whereas in regression tree learning the decision variable is continuous and each leaf of the tree consists of either a prediction with a numeric value or a linear combination of variables.
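The distinction between the two kinds of leaves can be illustrated with a minimal Python sketch. The function names and the toy values below are hypothetical and are not taken from the study's data: a classification leaf predicts the most frequent class among its training cases, while a regression leaf, in its simplest form, predicts their mean.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Predict the most frequent class among the training cases in a leaf."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """Predict the mean of the continuous target among the cases in a leaf."""
    return mean(values)


# Hypothetical contents of one leaf (illustrative values only).
leaf_types = ["general", "general", "chronic"]
leaf_rates = [0.02, 0.04, 0.03]

print(classification_leaf(leaf_types))  # a discrete class label
print(regression_leaf(leaf_rates))      # a numeric prediction
```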

CART
CART is a nonparametric statistical procedure that identifies mutually exclusive and exhaustive subgroups of a population whose members share common characteristics that influence the dependent variable of interest [11]. Either continuous or categorical variables can be taken as input in CART, and no distributional hypothesis is required for these variables. CART has been applied to the problem of mining a diabetic data warehouse composed of a complex relational database with time series and sequencing information [17,31,32,33].
Tree methods are nonparametric and nonlinear, and the simplicity of the results that CART provides is useful not only for the rapid classification of new observations but also for explaining why observations are classified or predicted in a particular manner. The final results of using a binary tree structure for classification can be summarized in a series of logical if-then conditions. Therefore, tree methods can often reveal simple relationships between just a few variables that could easily have gone unnoticed using other analytic techniques.
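As a minimal sketch of how a binary tree reduces to if-then conditions, the following Python code walks a small tree and emits one rule per terminal node. The tree, its thresholds, and the group labels are hypothetical and are not the tree reported in this study.

```python
def tree_to_rules(node, conditions=()):
    """Yield one if-then rule per terminal node of a binary tree."""
    if "split" not in node:  # terminal node: emit the accumulated conditions
        premise = " AND ".join(conditions) if conditions else "TRUE"
        yield f"IF {premise} THEN predict {node['predict']}"
        return
    feature, threshold = node["split"]
    yield from tree_to_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    yield from tree_to_rules(node["right"], conditions + (f"{feature} > {threshold}",))


# Hypothetical tree: thresholds and predictions are illustrative only.
toy_tree = {
    "split": ("medical_staff", 4),
    "left": {"predict": "group A"},
    "right": {
        "split": ("chronic_beds", 20),
        "left": {"predict": "group B"},
        "right": {"predict": "group C"},
    },
}

for rule in tree_to_rules(toy_tree):
    print(rule)
```

Each root-to-leaf path becomes a single conjunction of conditions, which is why a tree with few terminal nodes can be read directly as a short list of rules.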
In general, CART analysis consists of four basic steps, described as follows:
1) Tree building. A tree is built using recursive splitting of nodes. Each resulting node is assigned a predicted class, based on the distribution of classes in the learning dataset which would occur in that node and the decision cost matrix. The assignment of a predicted class to each node occurs whether or not that node is subsequently split into child nodes.
2) Stopping the tree building. At this point a "maximal" tree has been produced, which probably greatly overfits the information contained within the learning dataset.
3) Pruning the tree. A sequence of simpler and simpler trees is created by cutting off increasingly important nodes.
4) Selecting an optimal tree. The tree selected from among the sequence of pruned trees fits the information in the learning dataset, but does not overfit it.
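Steps 1 and 2 can be sketched for a regression tree as follows. This is a minimal illustration under stated assumptions, not the software used in the study: the split criterion (sum of squared errors), the stopping parameters `min_size` and `max_depth`, the function names, and the toy rows are all hypothetical, and only numeric predictors are handled. Pruning and optimal-tree selection (steps 3 and 4) are omitted.

```python
def sse(values):
    """Sum of squared deviations from the mean (the impurity of a node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, target, min_size):
    """Return (feature, threshold, sse_after) of the best binary split, or None."""
    best = None
    for feature in (k for k in rows[0] if k != target):
        # The largest observed value cannot separate any rows, so skip it.
        for threshold in sorted({r[feature] for r in rows})[:-1]:
            left = [r[target] for r in rows if r[feature] <= threshold]
            right = [r[target] for r in rows if r[feature] > threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


def grow(rows, target, min_size=2, max_depth=3, depth=0):
    """Recursively split nodes; every node stores a prediction (step 1)."""
    values = [r[target] for r in rows]
    node = {"n": len(rows), "prediction": sum(values) / len(values)}
    # Stopping rules (step 2): too few cases or maximum depth reached.
    if len(rows) < 2 * min_size or depth >= max_depth:
        return node
    split = best_split(rows, target, min_size)
    if split is None or split[2] >= sse(values):
        return node  # no admissible split reduces the impurity
    feature, threshold, _ = split
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    node.update(
        feature=feature,
        threshold=threshold,
        left=grow(left, target, min_size, max_depth, depth + 1),
        right=grow(right, target, min_size, max_depth, depth + 1),
    )
    return node


# Hypothetical hospitals (illustrative values only, not the study's data).
rows = [
    {"medical_staff": 2, "chronic_beds": 0, "rate": 0.00},
    {"medical_staff": 3, "chronic_beds": 0, "rate": 0.01},
    {"medical_staff": 4, "chronic_beds": 20, "rate": 0.01},
    {"medical_staff": 20, "chronic_beds": 10, "rate": 0.10},
    {"medical_staff": 35, "chronic_beds": 40, "rate": 0.12},
    {"medical_staff": 50, "chronic_beds": 30, "rate": 0.11},
]

tree = grow(rows, target="rate")
print(tree["feature"], tree["threshold"])
```

The search is greedy: each node takes the single split that most reduces the squared error at that point, without looking ahead, which matches the top-down construction described for CART.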

Data
Under the health care system in Taiwan, hospitals must register details of their beds, medical staff and related operating items with the Department of Health (DOH). The data set in this study, covering the first quarter of 2011, consists of hospital characteristics data and quality indicators of licensed health care facilities from the BNHI website (http://www.nhi.gov.tw/AmountInfoWeb/Index.aspx). Data on the number of beds, physicians, other manpower and hospital characteristics were collected from the DOH (http://www.doh.gov.tw/). The two sources were merged into a single data set using hospital registered identification codes, and the descriptive statistics are listed in detail in Table 1.
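The merge on hospital identification codes can be sketched as a simple inner join. The field names (`hospital_id`, `rate_over_30_days`, `acute_beds`) and the records below are hypothetical, and keeping only hospitals present in both sources is an assumption rather than the study's documented procedure.

```python
def merge_by_id(quality_rows, characteristic_rows, key="hospital_id"):
    """Inner-join two lists of records on a shared identification code."""
    by_id = {row[key]: row for row in characteristic_rows}
    merged = []
    for quality in quality_rows:
        characteristics = by_id.get(quality[key])
        if characteristics is not None:  # keep hospitals found in both sources
            merged.append({**characteristics, **quality})
    return merged


# Hypothetical records (illustrative values only).
quality_rows = [
    {"hospital_id": "H1", "rate_over_30_days": 0.01},
    {"hospital_id": "H3", "rate_over_30_days": 0.12},
]
characteristic_rows = [
    {"hospital_id": "H1", "acute_beds": 50},
    {"hospital_id": "H2", "acute_beds": 300},
]

print(merge_by_id(quality_rows, characteristic_rows))
```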
A quality indicator, the rate of hospital stays longer than 30 days, is used as the dependent variable in the regression tree. There are seven independent variables: acute beds, chronic beds, physicians, nursing staff, pharmacists, medical staff and hospital type. The detailed definitions are as follows:
Acute beds: the number of acute beds provided by a hospital.
Chronic beds: the number of chronic beds provided by a hospital.
Physician: the number of full-time physicians employed by a hospital.
Nursing staff: the number of full-time nursing staff employed by a hospital.
Pharmacist: the number of full-time pharmacists employed by a hospital.
Medical staff: the number of full-time medical staff employed by a hospital.
Hospital type: the category of the hospital, which is one of hospital, specialist hospital, chronic hospital or general hospital.
Hospitals in Taiwan are classified into these four types according to their attributes and tasks. The primary difference between general hospitals and hospitals lies in the number of beds and the departments they offer. General hospitals must have over 100 beds and provide health care services in at least six departments, such as Medicine, Surgery, Obstetrics and Gynecology, Pediatrics, Anesthesiology, and Radiology. Hospitals usually have fewer than 100 beds and offer one department or several specialist departments. Only hospital type is categorical, while the other variables are continuous.

RESULTS
The disclosure of quality information related to individual hospitals in the Taiwan NHI system is intended to identify the quality differences among hospitals and thus enable people to make informed choices to best meet their needs for health care services. Disclosure should not simply be focused on the perspective that patients should only receive disclosures and providers should only give them in an effort to promote patient safety.
To focus on the relationship between hospital characteristics and medical quality in the Taiwan NHI system, CART is used to partition the hospitals into clusters with minimum internal variance but maximum variance between clusters, using a tree structure. The full tree, grown using the hospitals' quality indicators and characteristic values, contains seven predictor variables and seven terminal nodes. The full tree is shown in Figure 1. This tree successfully classified 387 cases into seven groups, with significant differences among them (as shown in Table 2). Simple resubstitution classification rates can suffer considerable bias, and a more realistic assessment of the performance of this tree would be to apply it to data other than that used in its construction.
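The criterion of minimum internal variance and maximum variance between clusters rests on a standard identity: the total sum of squares of the dependent variable equals the within-group sum of squares plus the between-group sum of squares, so for a fixed data set, minimizing one is the same as maximizing the other. The following Python sketch checks this numerically on hypothetical values, which are not the study's data.

```python
def mean(values):
    return sum(values) / len(values)


def sum_of_squares_decomposition(groups):
    """Return (total, within, between) sums of squares for a list of groups."""
    everything = [v for g in groups for v in g]
    grand_mean = mean(everything)
    total = sum((v - grand_mean) ** 2 for v in everything)
    within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return total, within, between


# Hypothetical rates of long stays for the hospitals in two terminal nodes.
nodes = [[0.00, 0.01, 0.01], [0.10, 0.12, 0.11]]
total, within, between = sum_of_squares_decomposition(nodes)

# The identity total = within + between holds up to rounding error.
print(abs(total - (within + between)) < 1e-12)
```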

DISCUSSION
The aim of public disclosure of medical quality data is not only to reveal information to patients, but also to stimulate quality improvement efforts in hospitals [3]. For example, the public release of performance data has been proposed as a mechanism to improve quality of care [34] by providing more transparency and greater accountability of health care providers [35]. In the US, voluntary national efforts to publicly report on hospital quality include pilot projects that have tested the use of a standardized instrument, the Hospital CAHPS Survey, to measure patient perspectives on hospital care [36].
Unlike most previous studies of data mining applications in health care services, which classify patients' characteristics, this study focuses on the relationship between hospital characteristics and medical quality. In this study, CART is adopted to discover the hidden connections between hospital characteristics and medical quality indicators. The most important factor that influences the length of a hospital stay is the number of medical staff, such as medical laboratory scientists, physical therapists and occupational therapists. With regard to the services offered by a hospital, the physicians provide the direct health care services needed to improve patients' health, with medical staff providing supportive health care services. The number of medical staff employed is based on the organizational structure of a specific hospital, and reflects both the operating scale and services provided. For example, a small hospital with fewer than five physicians may only provide primary acute health care services, without a laboratory scientist or physical therapist. Therefore, patients with an expected length of stay over 30 days will prefer to seek health care services at medium or large hospitals, as these have more medical staff and better equipment to provide more health care services.
[Table row: The rate of hospital stays longer than 30 days: 0.0039 0.0182 0.0162 0.0210 0.6947 0.1317 0.1017 0.1907 0.1500 0.3000 0.1158 0.2670 0.0150 0.0484]
The hospital characteristics are classified as shown in Table 3, based on the association rules derived from CART. The results show that the number of medical staff is the most important factor in classifying the groups. Almost 22% of the hospitals, those with four or fewer medical staff, are classified into group I. Such hospitals offer very limited services, and are typically local district hospitals offering primary health care services in rural areas. Medium and large-sized hospitals with more than 12 physicians, like Na-Hwa Hospital, reflect the fact that more chronic beds offer patients more opportunities to increase the length of their hospital stays. In addition, some small or chronic hospitals, such as those in group IV, have a shortage of pharmacists. Most hospitals in group VII are small and medium-sized; only Chiu Hospital, You-Chang United Hospital, Yang-Ming Hospital (Taoyuan County) and En-Hua Hospital have more than 70 beds.
Health policy makers have given considerable attention to the effect of information disclosure on medical quality, such as improving the performance of hospitals and physicians. However, the results of this study show that differences in the medical quality indicator result from differences in the scale of hospitals, including their equipment and medical human resources. Small hospitals perform better on the medical quality indicator than large ones, because patients with serious illnesses usually go to large hospitals for adequate health care services. From the online publication of medical quality information, patients might wrongly conclude that small hospitals perform better on the quality indicator than large ones, and overlook the fact that large hospitals are able to care for patients with serious, chronic, and terminal illnesses. Hence, the quality indicators selected for disclosure should convey a range of clinical information in a form that is easy for people to read and understand.
The online publication of information on the quality of medical care services offers patients a chance to reduce the information asymmetry that exists between patient and physician when seeking services. However, access to outside medical information is linked to a patient's socioeconomic status [25,26,27,37,38], which can create a digital divide. Policy makers should thus target educational efforts to decrease or eliminate the persistent healthcare disparities among different socioeconomic groups. Educating and empowering e-health care consumers through online information enables them to become active participants in their own health care, thus potentially resulting in higher satisfaction.

CONCLUSIONS
Unlike traditional studies, this study explores the characteristics of hospitals, rather than those of patients, to reveal the hidden issues behind the quality of care of health service providers. Hospitals with similar characteristics and similar quality indicator values have been classified into groups with a tree structure through data mining. The number of medical staff plays an important role in reflecting the size of a hospital, and reveals the departments or services that the hospital provides. Hence, the fewer medical staff a hospital employs, the smaller the hospital is, which indicates that patients who are likely to stay longer than 30 days seek health care services at medium or large hospitals, which therefore have a higher rate of stays over 30 days.
Nowadays, the online publication of medical quality information in the NHI system of Taiwan helps people identify the quality differences among individual hospitals and make informed choices when seeking health care services, which helps to reduce the information asymmetry that exists in health care delivery and to improve health care quality. Health policy makers should improve access to medical information through the Internet for people in different socioeconomic groups.

Figure 1. Classification and regression tree. Dependent variable: the rate of hospital stays longer than 30 days.

Table 1. Descriptive statistics of hospitals by type.

Table 2. Descriptive statistics of end nodes classified by CART.

Table 3. Association rules of hospital characteristics classified by CART.