Support Vector Machine-Based Fault Diagnosis of Power Transformer Using k Nearest-Neighbor Imputed DGA Dataset

Missing values are prevalent in real-world datasets and may reduce the predictive performance of a learning algorithm. Dissolved Gas Analysis (DGA), one of the most deployable methods for detecting and predicting incipient faults in power transformers, is one of the casualties. Thus, this paper proposes filling in the missing values found in a DGA dataset using the k-nearest neighbor imputation method with two different distance metrics: Euclidean and Cityblock. Thereafter, using these imputed datasets as inputs, this study applies Support Vector Machine (SVM) to build models which are used to classify transformer faults. Experimental results are provided to show the effectiveness of the proposed approach.


Introduction
Power transformers are used for transmitting and distributing electricity from plant to customer in utility companies worldwide. Therefore, it is important to always ensure the good operating condition of power transformers in order to provide a reliable and continuous supply of electrical power, a necessity in this modern world. Acting on this fact, utility companies have implemented various condition assessment and maintenance measures, and Dissolved Gas Analysis (DGA) is one of them. DGA is a method that detects and predicts faults in oil-filled transformers by: a) analyzing the concentration of certain gases dissolved in the insulating oil, together with their gassing rates and gas ratios, and b) identifying the fault using diagnostic tools such as Key Gas [1], IEC ratios, Rogers ratios [1], Dornenburg ratios [1] and Duval Triangle [2]. However, these tools have some shortcomings. In some cases, the calculated gas ratios fall outside the established ratio codes of the aforementioned tools. As a result, faults that occur inside transformers may not be identifiable [3]. In addition, these tools can give different analysis results for the same dissolved gas record, and it is difficult for engineers to reach a final assessment when faced with so much diverse information [4].
These drawbacks have motivated many researchers to develop fault diagnostic tools embedded with machine learning techniques that learn from historic DGA data to predict new or unknown faults. In recent years, Support Vector Machine (SVM) has been widely applied for the classification of power transformer faults. Such interest is justified by: 1) SVM's excellent generalization ability to new knowledge; 2) the limited effort SVM requires for architecture design (i.e., it involves few control parameters); and 3) SVM's ability to classify non-linear problems. Capitalizing on these strengths, this paper adopts SVM to learn from historic DGA data to predict power transformer faults.
Alas, these strengths of SVM are not the only deciding factor for achieving higher predictive accuracy in its learning task. The representation and quality of the training data come first and foremost [5]. One factor that affects data quality is the presence of missing values in a dataset. Unfortunately, missing values persistently appear in most real-world data sources, and DGA datasets are no exception, as documented by [6]. It has been shown [7] [8] that as missing values increase in a dataset, the predictive accuracy of an algorithm that learns from this dataset decreases in tandem. As many statistical and learning methods cannot deal with missing values directly, examples with missing values are often deleted. However, deleting cases can result in the loss of a large amount of valuable data. Thus, much previous research has focused on filling in the missing values with estimated values ("imputing") before learning and testing are applied.
In this paper, we propose imputing the missing values in a DGA dataset before the learning process of SVM takes place, with the main objective of increasing SVM performance in classifying power transformer faults. At present, there are many established imputation methods to choose from, such as mean/mode, regression, expectation maximization and multiple imputation. These techniques, however, require a priori knowledge of the data distribution in order to produce estimates that are as accurate as possible. In view of this hard-to-realize prerequisite, this paper adopts a simpler, well-known method, the k-nearest neighbour (kNN), to estimate plausible values to fill in the missing values in a DGA dataset. The successful attempts by [9] [10] in using kNN for imputing missing values have also motivated us to do the same.
Section 2 briefly describes imputation methods and applications of SVM. The proposed combination of the kNN imputation method and SVM is elaborated in Section 3. Section 4 presents and analyses the experimental results. Section 5 concludes with our findings and future work.

Imputation Methods
Various methods have emerged over the years to determine and assign replacement values for missing data items. These methods can be categorised into statistical [11] [12] and machine learning methods [13]. Linear regression, multiple imputation, parametric imputation, and non-parametric imputation fall into the first category, whilst neural networks, decision tree imputation, and kNN fall into the latter.
The kNN algorithm fills in missing data by taking values from other observations in the same dataset. This method searches for the k-nearest neighbours of the case with missing value(s) and replaces the missing value(s) with the mean or mode value of the corresponding feature values of the k-nearest neighbours. The advantages of kNN imputation are:
1) It does not require creating a predictive model for each feature with missing data.
2) It can treat both continuous and categorical values.
3) It can easily deal with cases with multiple missing values.
4) It takes into account the correlation structure of the data.
Most notably, kNN is also a non-parametric imputation which makes it practically

Application of Support Vector Machine
In the area of pattern recognition, SVM has been applied to recognize offline and online handwritten characters.
After extracting important features such as chain code, density, and number of lines, SVM was trained on these features to build a model to recognize offline handwritten numbers in [14]. The authors reported a 98.99% recognition rate. Meanwhile, researchers in [15] combined two classifiers, a Convolutional Neural Network (CNN) and SVM, to solve the handwritten digit recognition problem. Their hybrid model was evaluated in two aspects: recognition accuracy and reliability performance. Experimental results on the MNIST digit database showed that the proposed hybrid model surpassed CNN and human subjects. SVM has also gained attention in the image recognition field. The authors in [16] automated bacterial recognition and counting during the detection of food contamination. Compared with recognition by the human eye, SVM can effectively distinguish bacteria from non-bacteria in the image and greatly reduce the detection time for each sample. SVM was applied in [17] to recognize facial expressions using Gabor features, which separate the facial region from images, and satisfactory results were reported.
In the context of fault prediction for power transformers, some experimental investigations pointed out the effectiveness of SVM in fault classification despite the small size of the DGA training and testing samples. Researchers in [18]- [20] trained and tested their SVM using fewer than 100 samples for each investigated fault and reported accuracy between 80% and 100%. Their SVM classifier also performed better when compared to ratio-based DGA diagnostic tools and NN and fuzzy logic

The Proposed Method
Figure 1 depicts the system architecture for classifying transformer faults. The input is the incomplete DGA dataset (a dataset that contains missing values). All missing values are then filled in with estimated values obtained from the kNN algorithm, which results in a complete DGA dataset (a dataset with no missing values). Using this complete dataset, SVM learns and predicts the transformer faults.

Impute Missing Values Using kNN
As an imputation method, the kNN algorithm is very efficient and simple to execute. In this method, the missing values of a sample are imputed by considering the k samples that are most similar to the sample of interest. The similarity of two samples is determined using a distance metric. This study uses two well-known distance metrics to gauge similarities among samples.
1) City Block Distance (CB): This is based on taxicab geometry, first considered by Hermann Minkowski in the 19th century, a form of geometry in which the distance between two points is the sum of the absolute differences of their coordinates. For two samples $X_i$ and $X_q$ with $m$ features it is defined using the following equation:

$$d_{CB}(X_i, X_q) = \sum_{j=1}^{m} \left| x_{ij} - x_{qj} \right|$$

2) Euclidean Distance (EU): This is the most usual way of computing a distance between two objects. It takes the root of the sum of squared differences between the coordinates of a pair of objects and is defined using the following equation:

$$d_{EU}(X_i, X_q) = \sqrt{\sum_{j=1}^{m} \left( x_{ij} - x_{qj} \right)^2}$$

Generally, the steps of kNN are as follows:
a) Choose k, the number of nearest neighbours to be selected.
b) Calculate the distance between the sample with the to-be-imputed missing value and another sample using a distance metric. Let $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ denote the instance that contains the to-be-imputed missing values and $X_q = (x_{q1}, x_{q2}, \ldots, x_{qm})$ be the other sample. If the metric is CB, then the distance between $X_i$ and $X_q$ is:

$$d_{CB}(X_i, X_q) = \sum_{j=1}^{m} \left| x_{ij} - x_{qj} \right|$$

where $m$ is the number of features in $X_i$ and $X_q$, $x_{ij}$ is the $j$th feature of sample $X_i$, and $x_{qj}$ is the $j$th feature of sample $X_q$.
c) Repeat step b) to compute the distance between $X_i$ and each remaining sample in the dataset.
d) Sort all $X_q$, excluding $X_i$, in ascending order of the calculated distance values.
e) Select the top k samples from the sorted list as the k-nearest neighbours of $X_i$. These k-nearest neighbours are denoted $X_{kNN}$.
Let $x_{ij}$ be the to-be-imputed missing value in $X_i$. Then the estimated value is obtained from

$$\hat{x}_{ij} = \frac{1}{k} \sum_{X_l \in X_{kNN}} x_{lj}$$

where $k$ is the number of nearest neighbours, $x_{lj}$ is the $j$th feature of sample $X_l$, and $X_l \in X_{kNN}$.
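The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the MATLAB routine used in the study. The function names (`city_block`, `euclidean`, `knn_impute`) are hypothetical, and missing entries are represented as `None`. The text does not say how a distance is computed when the compared samples themselves have gaps. The sketch therefore assumes that distances use only the features observed in both samples, and that a neighbour must have the target feature observed.

```python
import math


def city_block(a, b, features):
    """City block (CB) distance over the given feature indices."""
    return sum(abs(a[j] - b[j]) for j in features)


def euclidean(a, b, features):
    """Euclidean (EU) distance over the given feature indices."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in features))


def knn_impute(data, k, metric=euclidean):
    """Return a copy of `data` with each None replaced by the mean of that
    feature over the k nearest neighbours.

    Assumptions (not stated in the paper): distances are computed on the
    features observed in both samples, and only samples that have the
    target feature observed can act as neighbours.
    """
    imputed = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for q, other in enumerate(data):
                if q == i or other[j] is None:
                    continue
                shared = [f for f in range(len(row))
                          if row[f] is not None and other[f] is not None]
                if shared:
                    candidates.append((metric(row, other, shared), q))
            candidates.sort()                  # step d): ascending distance
            neighbours = candidates[:k]        # step e): top k samples
            if neighbours:
                total = sum(data[q][j] for _, q in neighbours)
                imputed[i][j] = total / len(neighbours)
    return imputed


# Toy data (made-up numbers, not DGA measurements).
samples = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [2.0, 2.0, 5.0],
    [9.0, 9.0, 40.0],
]
completed = knn_impute(samples, k=2, metric=city_block)
print(completed[1])
```

In the toy data, the two nearest neighbours of the second sample are the first and third samples, so the missing third feature is filled with the mean of 3.0 and 5.0.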

Classify Using SVM
Figure 2 depicts the implementation of SVM for classifying transformer faults using DGA datasets. In this phase, the dataset used as input is the imputed and complete dataset from the imputation phase executed earlier.
a) Normalization: A common characteristic of DGA datasets is the wide range of values contained in some of the attributes. To avoid the possible domination of attributes with greater numeric ranges, this study adopts a preprocessing technique called normalization, specifically min-max normalization. All attributes are normalized to [0,1] as follows:

$$v' = \frac{v - \min_A}{\max_A - \min_A}$$

where $v$ is the actual value, $\min_A$ is the minimum value of attribute A, $\max_A$ is the maximum value of attribute A, and $v'$ is the normalized value.
b) Training and Testing Data: Since SVM is a supervised learning algorithm, the original DGA dataset is randomly split into two parts: a training dataset for deciding the hyperplane that separates the samples into different classes (i.e. different fault types) and a testing dataset for verifying the classification accuracy of the algorithm. Note that the distribution of samples among the different classes in both the training and testing datasets is kept the same as in the original dataset.
c) Model Selection: In view of the possibility that classifying faults using a DGA dataset is a non-linear problem, the study chooses three different kernels that help SVM to solve non-linear classification. They are the Radial Basis Function (RBF), Polynomial and Sigmoid kernels, whose standard forms are as follows:

$$K_{RBF}(x_i, x_j) = \exp\left(-\gamma \left\| x_i - x_j \right\|^2\right)$$

$$K_{POLY}(x_i, x_j) = \left(\gamma \, x_i^{T} x_j + r\right)^{d}$$

$$K_{SIG}(x_i, x_j) = \tanh\left(\gamma \, x_i^{T} x_j + r\right)$$

where $\gamma$, $r$ and $d$ are kernel parameters.
d) 1-vs-1 SVM: SVM was originally designed for binary classification. However, some datasets contain multi-class labels to learn from, and DGA is an example of such datasets. Fortunately, a few extensions to SVM have been developed to solve this multi-category issue, such as multi-level, one-against-all, one-against-one, and DAGSVM. Following the recommendations made by a few researchers [21] [22] on the benefits of the one-against-one method, this study adopts it to diagnose different fault types in DGA datasets. Based on the idea of "divide and conquer", the one-against-one method decomposes the multi-class problem into c(c-1)/2 binary SVMs, where c is the number of classes in the experimented dataset. Upon learning from the training set, a classifier is built which is used to classify faults in the testing set.
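As a rough illustration of the normalization, kernel and one-against-one steps, the following Python sketch implements min-max scaling and the three kernels in their common textbook forms. It also lists the c(c-1)/2 class pairs. The parameter names `gamma`, `r` and `d` and their default values are assumptions for illustration and are not the MATLAB defaults used in the study. The class labels `A` to `D` are hypothetical placeholders rather than the fault types of the datasets, and majority voting over the pairwise winners is one common way to combine the binary SVMs.

```python
import math
from collections import Counter
from itertools import combinations


def min_max_normalize(column):
    """Scale one attribute to [0, 1]; a constant column maps to 0.0 (assumption)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=3):
    """(gamma * <x, y> + r) ** d"""
    return (gamma * dot(x, y) + r) ** d


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    """tanh(gamma * <x, y> + r)"""
    return math.tanh(gamma * dot(x, y) + r)


def one_against_one_pairs(classes):
    """All c(c-1)/2 class pairs, one binary SVM per pair."""
    return list(combinations(sorted(classes), 2))


def vote(pairwise_winners):
    """Return the class that wins the most pairwise contests."""
    return Counter(pairwise_winners).most_common(1)[0][0]


scaled = min_max_normalize([10.0, 20.0, 60.0])
pairs = one_against_one_pairs(["A", "B", "C", "D"])  # hypothetical labels
print(scaled, len(pairs))
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], d=2))
```

With four classes the decomposition yields six binary problems, which matches c(c-1)/2 for c = 4.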

Experimental Setup
One of the advantages of the kNN method in imputing missing values is that it requires only a few parameters to be tuned for achieving high estimation accuracy: k and the distance metric. Because both datasets in Table 1 are quite small in size, this study chooses k = {1, 3, 5, 7, 9}. The two distance metrics mentioned in Section 3.1 are compared. Therefore, each incomplete dataset is filled in using kNN for the different values of k and for the two distance metrics.
After filling in the missing values in the DGA datasets, SVM was trained using the imputed datasets as input, and its model predicted the fault types. The effectiveness of SVM is measured using accuracy, defined as follows:

$$\text{Accuracy} = \frac{n_c}{n} \times 100\% \tag{1}$$

where $n_c$ is the number of instances whose class labels are predicted correctly and $n$ is the total number of instances in a test set. SVM with three different kernels was used, as mentioned in Section 3.2.
As we want to estimate how accurately a predictive model will perform in practice, this study performed a 5-fold cross-validation in which a dataset is divided into 5 disjoint sets (folds); 4 folds are used for training and the last fold is used for evaluation. This process is repeated 5 times, leaving a different fold for evaluation each time. Further, to reduce variability, 100 runs of this 5-fold cross-validation were carried out, and the accuracy of (1) equals the mean of the accuracies of all runs. This paper used the commercial software package MATLAB [23] to impute the missing values as well as to classify the faults. The three kernels of SVM require different parameters whose values affect the performance of SVM. However, optimizing these parameters is not the purpose of this study; therefore, the default parameter values provided by MATLAB were adopted. The effectiveness of our proposed method for diagnosing power transformer faults was validated using a "before-and-after" experiment in which the accuracy of each kernel learned on the original incomplete datasets was compared with that learned on the imputed datasets. Because MATLAB cannot deal with datasets with missing values unless they are deleted, we filled in the missing values with zero instead to enable MATLAB to perform the classification task. However, we note that zero is not a missing value. As such, the before-and-after experiment reduces to a comparison between the zero-filled datasets and the datasets imputed using the kNN imputation method.
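The evaluation protocol can be sketched as follows. This is a minimal Python illustration, not the MATLAB procedure used in the study. The helper names (`accuracy`, `stratified_folds`, `repeated_cv`, `majority_class`) are hypothetical, and the `majority_class` placeholder stands in for a trained SVM so that the example stays self-contained. Spreading each class evenly across the folds is an assumption, since the text does not say whether the folds were stratified.

```python
import random
from collections import defaultdict


def accuracy(predicted, actual):
    """Percentage of test instances whose label is predicted correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def stratified_folds(labels, n_folds, rng):
    """Assign sample indices to folds so each class is spread evenly
    (stratification is an assumption of this sketch)."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def repeated_cv(features, labels, fit_predict, n_folds=5, n_runs=100, seed=0):
    """Mean accuracy over n_runs repetitions of n_folds-fold cross-validation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        for test_idx in stratified_folds(labels, n_folds, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            predicted = fit_predict(
                [features[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [features[i] for i in test_idx],
            )
            scores.append(accuracy(predicted, [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)


def majority_class(train_x, train_y, test_x):
    """Placeholder classifier: always predicts the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common for _ in test_x]


# Toy data: 10 samples of class "A" and 5 of class "B" (made-up values).
toy_x = [[float(i)] for i in range(15)]
toy_y = ["A"] * 10 + ["B"] * 5
mean_acc = repeated_cv(toy_x, toy_y, majority_class, n_folds=5, n_runs=3)
print(round(mean_acc, 2))
```

To reproduce a before-and-after comparison in this style, the same `repeated_cv` call would be made once on a zero-filled copy of the data and once on each imputed copy, with `fit_predict` replaced by an actual SVM.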

Case Study 1: IEC10DB Dataset
Figures 3 and 4 show the comparative performances of SVM_RBF, SVM_POLY, and SVM_SIG in predicting faults using the IEC10DB dataset imputed with different values of k for the two distance metrics. Using CB as the distance metric, the individual performance of each of the three kernels was much the same across the different values of k, as is evident in Figure 3. Among the three kernels, SVM_RBF registered the highest accuracy, at k = 5, whilst SVM_SIG performed the worst. In the case of EU (Figure 4), similar observations to CB were recorded in terms of the influence of the different values of k on the individual performance of each kernel. Again, SVM_RBF outperformed the other kernels and SVM_SIG performed the worst.
Table 3 shows the values of k that helped achieve the highest accuracy for each SVM kernel using the two distance metrics. SVM_POLY performed best when k = 1, whilst the other two kernels predicted better at higher k. It can be said that the choice of distance metric and kernel determines the best k. The results of the before-and-after experiment for this dataset are shown in Figure 5. For this comparison, only the highest accuracy for each kernel over each distance metric was taken. It is noted that two kernels, namely SVM_RBF and SVM_SIG, performed better using the imputed datasets than when learning from the zero-filled dataset. The imputed datasets improved, albeit slightly, the accuracy of these two kernels. The opposite was observed for SVM_POLY.

Case Study 2: MAL
Figures 6 and 7 show the comparative performances of SVM_RBF, SVM_POLY, and SVM_SIG on each imputed dataset using the two distance metrics. It can be seen that, in the case of CB (see Figure 6), higher values of k increased the performance of all of the kernels. However, performance varies greatly from one kernel to another. SVM_RBF outperformed the other two kernels by a large margin and reached its highest accuracy at k = 9. SVM_SIG came second after SVM_RBF, and its highest accuracy was obtained when k = 9. The worst performer was SVM_POLY, which was at its most accurate when k = 7. Similar observations to CB were seen when EU was the distance metric, as shown in Figure 7. However, the effect of higher k on the individual performance of each kernel was more pronounced and more beneficial using EU as the distance metric. Again, SVM_RBF was the most effective of all and achieved its highest accuracy when k = 9. Next was SVM_SIG, which recorded its highest accuracy when k = 5. SVM_POLY was the least effective and performed best when k = 5. It can be seen that EU improved the performance of each kernel more than CB did for the MAL dataset.
Table 4 shows the values of k that helped achieve the highest prediction rate for each kernel. For this dataset, it is clear that all kernels predicted better at higher values of k. In fact, SVM_RBF worked best when k = 9 for both distance metrics. The results of the before-and-after experiment are shown in Figure 8. For this comparison, only the highest accuracy for each kernel over each distance metric was taken. For this dataset, all of the kernels performed better using the imputed datasets than when learning from the zero-filled dataset. Although SVM_POLY was the least effective, it benefited the most when missing values were imputed before learning took place.

Analysis
a) The value of the best k, the number of nearest neighbours, is determined by the individual dataset and the choice of distance metric. However, larger values of k increase the kernels' performance when a large number of missing values are found in a dataset, as in the case of the MAL dataset.
b) For both datasets, the datasets imputed using EU provide better performance than CB for two kernels (SVM_RBF and SVM_SIG). SVM_POLY works better with CB than with EU on the small dataset IEC10DB. The opposite is true for the large dataset MAL.
c) The best kernel is SVM_RBF, which consistently outperforms the other two kernels, and the best distance metric is EU. The combination of these two also produces the highest accuracy for both DGA datasets, which differ in size and in percentage of missing values.

Conclusions
This paper proposes imputing the missing values found in a DGA dataset using the well-known kNN imputation method before letting SVM, a widely applied classification algorithm, learn and build a classifier to predict transformer faults. The experiments conducted using this proposed combination show significant improvements in classifying the faults of power transformers, especially when the percentage of missing values in the DGA dataset is high. Moreover, imputing the missing values in a dataset enables some software to carry out statistical analysis or machine learning tasks without having to omit the samples that contain missing values.
For future research, this study intends to conduct experiments using other combinations of classification algorithms and/or imputation methods.

Figure 1. The proposed method for classifying transformer faults.

Figure 5. The before-and-after comparative performances on the IEC10DB dataset.

Figure 7. SVM performances on the MAL imputed datasets using Euclidean.

Figure 8. The before-and-after comparative performances on the MAL datasets.

A sample of DGA data consists of a number of dissolved gases in oil and the corresponding fault type, as shown in Table 2. Dashes in Table 2 represent missing values (missing gases).

Table 1. The characteristics of the DGA datasets used in this study.

Table 2. A DGA dataset with missing values.

Table 3. The best value of k for each kernel for the IEC10DB dataset.

Table 4. The best value of k for each kernel for the MAL dataset.