Medical Data Visualization Analysis and Processing Based on Machine Learning

To provide a medical data visualization analysis tool, machine learning methods are introduced to classify malignant neoplasm of the lung within the medical database MIMIC-III (Medical Information Mart for Intensive Care III, USA). K-Nearest Neighbor (KNN), Support Vector Machine (SVM) and Random Forest (RF) are selected as the predictive tools. Based on the experimental results, the machine learning predictive tools are integrated into the medical data visualization analysis platform. The platform software provides doctors with a flexible medical data visualization analysis tool. Related practice indicates that doctors can generate visualization analysis results in a few simple steps and carry out research on the data accumulated in the hospital, even if they have not received special data analysis training.


Introduction
Medical data mainly include clinical trial data, biomedical data, electronic medical records and diagnosis books, and individual health information [1]. The data types vary from images and text to numbers. The huge volume leaves doctors drowning in the medical data accumulated in hospitals but starved of information. Sometimes doctors want to reveal the rules behind the data; for instance, whether a particular disease is related to sex, age, region of residence, or other factors, and why. Medical data visualization analysis and processing can provide an intuitive graphical tool, and more and more methods have been developed in the past decades.
For instance, in 2014, Akilah L. [2] organized hierarchical data structures by using treemaps to examine large amounts of data in one overall view, which served as proof that treemaps could be beneficial in assessing surgical data retrospectively by allowing surgeons and healthcare administrators to make quick visual judgments. In 2015, Gilbert Chien Liu [3] provided health services researchers with a visualization tool to construct logic models for clinical decision support within an electronic health record. The mapping relationships could be acquired with software for social network analysis: NodeXL and CMAP. Seonah Lee [4] developed a time-oriented visualization for problems and outcomes and a matrix visualization for problems and interventions by using PHN-generated Omaha System data to help PHNs consume data and plan care at the point of care. In 2016, Shahid Mahmud [5] presented a data analytics and visualization framework for health-shock prediction on a large-scale health informatics dataset based on fuzzy rule summarization, which can provide interpretable linguistic rules to explain the causal factors affecting health shocks. Usman Iqbal [6] put forward an animated visualization tool called Cancer Associations Map Animation (CAMA), which can depict the association of 9 major cancers with other diseases over time based on 782 million outpatient records in a health insurance database. Dror G. Feitelson [7] introduced the multilevel spie chart to create a visualized combination of cancer incidence and mortality statistics. In 2017, Fleur Mougin [8] reviewed the current methods and techniques dedicated to information visualization and their current use in software development related to omics and/or clinical data.
It can be seen from the past research on medical visualization that progress has been made on the processing of medical big data, the visualization of electronic health records and the correlation analysis of disease characteristics. However, research on medical data visualization analyzed by fusion algorithms is still to be explored. Against this background, this paper puts forward a general-purpose medical data visualization analysis tool built with R and machine learning methods, which serve as the prediction tools.

Machine Learning Classification Algorithms
Sometimes medical data of a special type need to be classified into clusters, so that the relationship between the clusters and a disease can be examined. Cluster analysis is an important method of data visualization analysis. Therefore, the typical machine learning methods KNN, Support Vector Machine and Random Forest are selected as the prediction tools for data classification.

K-Nearest Neighbor
K-Nearest Neighbor (KNN) [9] [10] is a typical supervised machine learning method. KNN is a non-parametric method used for classification, where the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being allocated to the class most common among its k nearest neighbors. For a medical data object, if most of the k nearest samples in the feature space belong to a certain category, the object is assigned to that category as well. KNN thus makes decisions based on the dominant category of k objects, rather than on a single object category. The KNN algorithm can be described as:
Step 1: Calculate the distance between the test data and each training data.
Step 2: Sort the distances in increasing order.
Step 3: Select the K points with the nearest distance.
Step 4: Determine the occurrence frequency of each category among these K points.
Step 5: Return the category with the highest frequency among the K points as the predicted classification of the test data.
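The five steps can be illustrated with a minimal sketch. It assumes Euclidean distance and uses Python only for illustration (the platform itself is based on R); the function name `knn_predict` and the sample values are hypothetical and are not taken from MIMIC-III.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 1: distance between the test sample and every training sample
    distances = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    # Step 2: sort by increasing distance
    distances.sort(key=lambda pair: pair[0])
    # Step 3: keep the k nearest points
    nearest = distances[:k]
    # Step 4: count how often each category occurs among them
    votes = Counter(label for _, label in nearest)
    # Step 5: return the most frequent category
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples: A = tumor, B = non-tumor
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
```

Because the vote depends directly on distances, features measured on large scales dominate the result unless the inputs are rescaled first.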

Support Vector Machines
Support Vector Machines (SVM) [11]: based on the training set D, a hyperplane found in the sample space serves as the boundary that marks a sample as belonging to one or the other of two categories. For medical data, there are two possible situations: linearly separable data and non-linearly separable data. If the data are linearly separable, Equation (3) is used in the n-dimensional space to find a set of weights (4) that specify two hyperplanes.
The distance between the two planes is $2/\|w\|$, where $\|w\|$ stands for the Euclidean norm of w. Such task situations are expressed as a set of constraints (5). When the data are not linearly separable, the constraint condition of the task is (6).
When dealing with a vector $x_i$, it can be mapped to a high-dimensional space through a kernel function. Commonly used kernel functions include the linear kernel (7), polynomial kernel (8), sigmoid kernel (9), Gaussian RBF kernel (10), etc.
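The four kernels named above can be sketched in their common textbook forms. The parameter names and default values (`degree`, `coef0`, `gamma`, `sigma`) are assumptions for illustration, and the exact parameterization of Equations (7)-(10) may differ. The helper `margin_width` evaluates the distance $2/\|w\|$ between the two hyperplanes.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    # (u . v + coef0) ** degree; degree and coef0 are assumed defaults
    return (dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    # tanh(gamma * u . v + coef0); gamma and coef0 are assumed defaults
    return math.tanh(gamma * dot(u, v) + coef0)

def rbf_kernel(u, v, sigma=1.0):
    # exp(-||u - v||^2 / (2 * sigma^2)); sigma is an assumed default
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def margin_width(w):
    """Distance 2 / ||w|| between the two separating hyperplanes."""
    return 2.0 / math.sqrt(dot(w, w))
```

The RBF kernel equals 1 for identical vectors and decays toward 0 as they move apart, which is why it is often a reasonable first choice when the class boundary is not linear.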

Random Forest
Random Forest (RF) [15] is a combinatorial classifier algorithm, that is, a classifier composed of multiple decision trees [16] $\{h(x, \theta_i), i = 1, 2, \ldots\}$, where $\{\theta_i\}$ are independent and identically distributed random vectors, and the final class label of the input vector x is determined by all decision trees. The growth of each decision tree depends on an independent, identically distributed random vector. The overall generalization error depends on the classification ability of the single decision trees in the forest and on the degree of correlation between the trees. The algorithm consists of two parts: the growing process of the decision trees and the voting process. The random forest generates multiple decision tree classifiers by bagging with bootstrap sampling.
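The two parts of the algorithm, growing and voting, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: each tree is reduced to a one-split stump on a single randomly chosen feature, and all names and sample values are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) samples with replacement (the bagging step)."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def train_stump(rows, labels, rng):
    """Grow a one-split tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda r: left_label if r[feature] <= threshold else right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    """Growing process: one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(*bootstrap_sample(rows, labels, rng), rng) for _ in range(n_trees)]

def forest_predict(forest, row):
    """Voting process: the class chosen by most trees wins."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# Hypothetical two-feature samples: A = tumor, B = non-tumor
rows = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2),
        (3.0, 3.1), (3.2, 2.9), (2.9, 3.0), (3.1, 3.2)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
forest = train_forest(rows, labels)
```

Each tree sees a different resampled data set and a different feature, which lowers the correlation between trees; the vote then averages out the errors of individual trees.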

Data Processing
Considering that medical data vary from numeric values to images, the original data may need to be pre-processed before the visualization analysis.

Data Filling
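In a medical database, some values may not be available. Therefore, in most cases, the database is incomplete. The methods to deal with incomplete data sets mainly include [17] deleting tuples, data complement, and leaving the null values without processing. For filling by regression, with w and b absorbed into $\hat{w} = (w; b)$, the data set represented as the matrix X, and the labels written as $y = (y_1; y_2; \ldots; y_m)$, the least squares objective is $(y - X\hat{w})^{T}(y - X\hat{w})$. Taking the derivative with respect to $\hat{w}$, we get $2X^{T}(X\hat{w} - y)$ (13).
If $X^{T}X$ is a positive definite matrix, setting the derivative to zero gives $\hat{w}^{*} = (X^{T}X)^{-1}X^{T}y$; if $X^{T}X$ is not a positive definite matrix, a regularization term is introduced. At first, KNN is taken as the extractor and the distances between features are calculated, which requires converting the nominal features in the data set into a numerical format. The dummy variable encoding method (14) is used.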
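Regression-based filling can be sketched for the simplest case of one predictor plus an intercept, where the normal equations reduce to a 2 x 2 system. The functions `fit_linear` and `regression_fill`, the `ridge` parameter and the sample values are assumptions for illustration; the ridge term stands in for the regularization mentioned above and is applied to the intercept as well.

```python
def fit_linear(xs, ys, ridge=0.0):
    """Solve (X^T X + ridge * I) w_hat = X^T y for rows (x_i, 1)."""
    m = len(xs)
    # Entries of X^T X (plus the ridge term on the diagonal) and of X^T y
    sxx = sum(x * x for x in xs) + ridge
    sx = sum(xs)
    n = m + ridge
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; use ridge > 0")
    # Closed-form inverse of the 2 x 2 matrix
    w = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b

def regression_fill(xs, ys):
    """Replace missing targets (None) by the fitted value w * x + b."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    w, b = fit_linear([x for x, _ in observed], [y for _, y in observed])
    return [y if y is not None else w * x + b for x, y in zip(xs, ys)]

# Hypothetical values, not taken from MIMIC-III
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, None, 9.0]
filled = regression_fill(xs, ys)
```

When all predictor values are identical, the matrix is singular and the plain solution does not exist; a small positive `ridge` makes the system solvable again.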

Data Visualization Analysis and Processing Platform
To reduce the dependence of KNN on the measurement scale of the input features, min-max standardized data are adopted. The classification results are shown in Table 1, where amount is the total number of classified records, category A represents malignant neoplasm of the lung, category B represents non-tumors, and the classification rate means the accuracy of the classifier, i.e., the proportion of items correctly classified among all classified items.
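The two preprocessing steps used before KNN, min-max standardization and dummy variable encoding of nominal features, can be sketched as follows. The function names and sample values are hypothetical. The encoding shown keeps one indicator column per category, which is an assumption, since some dummy encodings drop one column.

```python
def min_max_scale(values):
    """Rescale numeric values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no scale information
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def dummy_encode(values):
    """Turn a nominal feature into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

# Hypothetical values for illustration
scaled = min_max_scale([130.0, 140.0, 150.0])
encoded, categories = dummy_encode(["F", "M", "F"])
```

After both steps every feature is numeric and lies in [0, 1], so no single measurement dominates the distance calculation.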
Then, taking SVM as the extractor, the classification results are shown in Table 2.
Turning to RF, choosing an appropriate value of mtry by testing can improve the accuracy. As shown in Figure 2, the horizontal axis represents the 26 different measurement indexes, and the vertical axis represents the mean error rate of each measurement index. Here, the limit selected based on Figure 2 is 0.14, and the number of measurement indexes whose mean error rate is less than 0.14 is selected as mtry.
Finally, the optimized value of mtry is set to 9. Then the relationship between the model error and the number of decision trees is examined by experiments, as shown in Figure 3. Once the number of decision trees reaches 100, the curve flattens, so ntree is set to 100. The final classification results are shown in Table 3.
Sensitivity is the proportion of diseased samples that are detected among all diseased samples. Specificity is the proportion of non-diseased samples that are correctly identified among all non-diseased samples. Comparing the classification results in Table 1, Table 2 and Table 3, the sensitivity of KNN is slightly higher than that of SVM, the specificity of SVM is far higher than that of KNN, and RF is significantly higher than both of the former algorithms. As shown in Table 4, on the 750 records of the test set, the accuracy can reach 99%.
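The three evaluation measures can be computed from paired actual and predicted labels, as in the following minimal sketch. It assumes category A is the diseased class and that both classes occur in the data; the function name and the labels are hypothetical and do not reproduce the values of Tables 1-4.

```python
def classification_metrics(actual, predicted, positive="A"):
    """Return sensitivity, specificity and accuracy for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # diseased, detected
    fn = sum(a == positive and p != positive for a, p in pairs)  # diseased, missed
    tn = sum(a != positive and p != positive for a, p in pairs)  # healthy, cleared
    fp = sum(a != positive and p == positive for a, p in pairs)  # healthy, flagged
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical labels for ten records
actual = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
predicted = ["A", "A", "A", "B", "B", "B", "B", "B", "A", "B"]
metrics = classification_metrics(actual, predicted)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are unbalanced, because a classifier can reach high accuracy while missing most of the diseased samples.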
As shown in Figure 4, RF has higher performance than KNN and SVM. We introduced the machine learning methods to help medical personnel with the diagnosis and treatment of diseases (the disease selected in our experiment is malignant neoplasm of the lung), and to obtain the influence of different characteristics on diseases in the analysis process, as shown in Figure 6.
At the same time, on this basis, we can provide prediction tools for doctors.
To provide a general-purpose medical data visualization analysis tool with KNN, SVM and RF, platform software including data processing, data extraction and data analysis was developed based on the R language, the image software ImageJ, and the database PostgreSQL.
As shown in Figures 7-9, we have provided a visual platform on which doctors can run the algorithms to obtain disease classification results. Doctors can also perform statistical analysis of the data through the platform and manually control the visualization of the data. This supports intuitive analysis through human-machine coupling to find the relationship between potential influencing factor(s) and disease or recovery.

Conclusion
For medical data visualization analysis, machine learning methods can provide both prediction and classification tools. We select three typical machine learning methods, K-Nearest Neighbor (KNN), Support Vector Machine (SVM) and Random Forest (RF), as the classifiers to predict whether patients suffer from malignant neoplasm of the lung. Considering sensitivity, specificity and detection accuracy, RF has the better performance. With the medical data visualization analysis platform based on machine learning tools, we can further identify that the most influential factors for the classification results are pH, Platelet Count and Creatinine. The platform can also provide various graphics generators according to the doctor's query operations, which gives doctors an intuitive analysis for finding the relationship between potential influencing factor(s) and disease or recovery. The experiment and practice were carried out within the medical database MIMIC-III.

Figure 1. Missing data analysis based on box plot.

Figure 3. Model error and number of decision trees chart.

Figure 4. Performance evaluation indexes of the KNN, SVM and RF.

Figure 5. Histogram of the measure index of the importance of eigenvalues.

Figure 6. The structural framework of data visualization analysis platform.

Figure 8. Manually control the visualization operation of the data.

MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center. The data used in the experiment included the data of patients with pulmonary malignant tumors and of healthy people. The main test items included Anion Gap, Base Excess, Bicarbonate, Calcium (Total), Calculated Total CO2, Chloride, Creatinine, Glucose, Hematocrit, Hemoglobin, Magnesium, MCH, MCHC, MCV, pCO2, pH, Phosphate, Platelet Count, pO2, Potassium, RDW, Red Blood Cells, Sodium, Urea Nitrogen, White Blood Cells.
The methods to deal with incomplete data sets are:
1) Delete tuples, which is used to delete objects (tuples, records) with missing information attribute values.
2) Data complement, which is used to fill a null value with a certain value to complete the data table. In general, a null value is filled according to the distribution of the values of other objects in the data table, based on statistical principles. Common methods include k-means clustering, regression, etc.
3) Without processing. In some cases, null values have little impact on the study, or the data analysis method adopted can process the null values automatically; in this case no additional operations are required.
For the missing values in the medical database, a box plot can be used as a missing data analysis tool. For instance, by calling the summary function in R, as shown in Figure 1, the box plot can give a missing data summary of the properties Sodium and Potassium. Considering various factors, the data in MIMIC-III were filled by means of regression assignment. Given the dataset D, linear regression tries to get a linear model that predicts the real-valued output labels as accurately as possible. We can construct the model as (11). The least squares method is used to estimate w and b, and w and b are absorbed into the vector form $\hat{w} = (w; b)$. The data set D is represented as a matrix X.

Table 1. Classification results of KNN.

Table 2. Classification results of SVM.

Table 3. Classification results of RF.

Table 4. Comparison of classification performance of three algorithms.