
This paper analyzes how a selected set of vehicles of different makes survive after each change of owner. The data come from GitHub and cover 1996 to 2012. The four cars are the Honda Accord, Mini Cooper, Chevy Cavalier, and Toyota Avalon, and the two faults are engine-system and transmission-system faults. We use Kaplan-Meier curves for the survival analysis; we also calculate and discuss a self-comparison of each car across four time periods, a comparison of the four periods through the median, and a median comparison of fault conditions over all years. We find that all the vehicle types have improved over the years and that Toyota vehicles are more reliable than Honda vehicles.

Vehicle survival concerns the total time a vehicle works after it is sold to a customer, together with the vehicle's malfunctions. Vehicle survival analysis is used in numerous areas, such as vehicle quality assessment. For example, it can anticipate latent problems that might occur in vehicles and so help ensure the safety of the driver and passengers; it is also employed in large-scale vehicle scrappage programs to make the fullest use of vehicles and maintain their price.

Furthermore, vehicle survival analysis can be used to estimate a car's stability even before customers purchase it, so that money is spent more efficiently. Analysis of specific vehicles also gives car manufacturers an opportunity to improve their products, attracting new consumers and securing the loyalty of existing ones.

Former researchers have done a similar analysis to estimate vehicle performances. Data mining and neural network methods are utilized to estimate the reliability of a vehicle [

In this paper, we use data from GitHub (https://github.com/tcrug/car-reliability). The data are in tabular form, with six columns in total. The first column is the date the vehicle was bought; the second is the vehicle manufacturer; the third is the specific model produced by that manufacturer; the fourth is the total distance traveled by the vehicle before it was examined at its first malfunction; the fifth and sixth columns give the state of the engine and of the transmission system, respectively. Four types of vehicles from different car brands are analyzed: Honda, Toyota, MINI, and Chevrolet. To make the data represent the complicated vehicle market more generally, we deliberately chose car manufacturers from three different countries, which stand for different manufacturing criteria and styles. The malfunctions are separated into two major categories: engine problems and transmission problems. When the vehicle is examined, the state of each of these two parts is recorded as 0 or 1: 0 means that no problem was found in the examination, whereas 1 means that a problem exists in the corresponding part.
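As a minimal sketch of how such six-column records could be turned into survival data, the snippet below reads a few rows and keeps the mileage as the duration and the two 0/1 flags as event indicators. The column names and the sample rows are hypothetical (the headers of the actual file are not given in this paper), so they would need to be adapted to the real data.

```python
import csv
import io

# Hypothetical column names and made-up sample rows; the real file may differ.
SAMPLE = """purchase_date,make,model,mileage,engine,transmission
1998-05-01,Honda,Accord,120000,0,1
2003-07-15,Toyota,Avalon,95000,0,0
2004-02-10,MINI,Cooper,60000,1,0
"""


def load_records(text):
    """Return one dict per vehicle: mileage as duration, 0/1 flags as booleans."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "year": int(row["purchase_date"][:4]),   # purchase year
            "make": row["make"],
            "model": row["model"],
            "mileage": float(row["mileage"]),        # duration variable
            "engine_fault": row["engine"] == "1",    # True = fault observed
            "transmission_fault": row["transmission"] == "1",
        })
    return records


records = load_records(SAMPLE)
```

A vehicle whose flag is 0 is treated as censored for that fault: it is only known to have been fault-free up to the recorded mileage.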

In this paper, we mainly employ three kinds of methods: non-parametric, semiparametric, and parametric.

The Kaplan-Meier estimate and the log-rank test are used to analyze the state of each type of vehicle.

The graph of the Kaplan-Meier estimator descends like a staircase. It is composed of horizontal and vertical segments that show the chance of an individual surviving beyond a given time, and it is referred to as the survival function. It is mainly used in medicine to estimate the probability that patients survive under certain circumstances, but in this paper it serves as the main method for evaluating vehicle survival probabilities. Applying this method to vehicles is valuable for promoting higher-quality vehicle production [

The estimator has the basic function of

$$\hat{s}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)$$

Here t_{i} is a time at which at least one event happened, d_{i} is the number of events that happened at t_{i}, and n_{i} is the number of individuals known to have survived (not yet had an event or been censored) up to time t_{i}. There is no unknown parameter, so Kaplan-Meier can be included among the non-parametric methods [ The ratio d_{i}/n_{i} can be regarded as a parameter, and we can use the method of maximum likelihood to estimate its value.

We hypothesize the new function to be

$$\hat{s}(t) = \prod_{i:\, t_i \le t} \left( 1 - h_i \right)$$

The likelihood function is

$$L(h_j : j \le i) = \prod_{j=1}^{i} h_j^{d_j} \left( 1 - h_j \right)^{n_j - d_j}$$

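To maximize the likelihood function, we first simplify it by taking the natural logarithm:

$$\ln(L) = \sum_{j=1}^{i} \left[ d_j \ln(h_j) + (n_j - d_j) \ln(1 - h_j) \right]$$

Setting the derivative with respect to h_{i} to zero then gives the estimate:

$$\frac{\partial \ln(L)}{\partial h_i} = \frac{d_i}{\hat{h}_i} - \frac{n_i - d_i}{1 - \hat{h}_i} = 0 \;\Rightarrow\; \hat{h}_i = \frac{d_i}{n_i}$$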

The Kaplan-Meier estimator is one of the most frequently used methods for survival analysis. It has a comparative advantage in estimating the death rate, which here is the rate of malfunction in each part of the vehicle. Its result is also clearer because it can be visualized.
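The product formula above can be sketched directly in a few lines of Python. This is only an illustration of the estimator's definition, not the code used for the results in this paper; the function name and the toy durations are assumptions made for the example.

```python
def kaplan_meier(durations, events):
    """Return [(t_i, s(t_i))] at each distinct event time t_i.

    durations: mileage (or time) at which each vehicle was examined
    events:    True if the fault was observed, False if censored
    """
    # d_i: number of events at each distinct event time t_i
    event_counts = {}
    for t, e in zip(durations, events):
        if e:
            event_counts[t] = event_counts.get(t, 0) + 1

    curve = []
    survival = 1.0
    for t in sorted(event_counts):
        n_i = sum(1 for u in durations if u >= t)  # still at risk just before t_i
        d_i = event_counts[t]
        survival *= 1.0 - d_i / n_i                # multiply in the factor (1 - d_i / n_i)
        curve.append((t, survival))
    return curve


# Toy data (made up): six vehicles, two of them censored.
durations = [5, 8, 8, 12, 15, 20]
events = [True, True, False, True, False, True]
curve = kaplan_meier(durations, events)
```

The curve only steps down at event times; censored vehicles do not create a step, but they leave the risk set n_{i} for all later times.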

The log-rank test statistic compares estimates of the hazard functions of the two groups at each observed event time. It is constructed by calculating the observed and expected number of events in one of the groups at each observed event time and then adding these to obtain an overall summary across all time points at which there is an event [

Let j = 1, ..., J index the distinct times of observed events in either group. For each time j, let N_{1j} and N_{2j} be the numbers of subjects "at risk" (those who have not yet had an event or been censored) in the two groups at the start of period j, and let N_{j} = N_{1j} + N_{2j}. Let O_{ij} be the observed number of events in group i at time j. Under the null hypothesis, the expected number of events is E_{ij}, and the variance of the difference O_{ij} − E_{ij} is V_{j}. [

$$Z = \frac{\sum_{j=1}^{J} \left( O_{1j} - E_{1j} \right)}{\sqrt{\sum_{j=1}^{J} V_j}} \sim N(0, 1)$$

The calculated value of Z is then compared with the standard normal distribution to determine whether it falls within the acceptance region.

The log-rank test can detect a difference between two groups with significantly different risks, but it is only a test of significance, so it is not the primary method in this paper.
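The Z statistic can be computed by hand as a check on any library result. The sketch below assumes the standard two-group forms E_{1j} = O_{j} N_{1j} / N_{j} and the hypergeometric variance V_{j} = O_{j} (N_{1j}/N_{j}) (N_{2j}/N_{j}) (N_{j} − O_{j}) / (N_{j} − 1), where O_{j} is the total number of events at time j. These expressions are not spelled out in the text above, so they are an assumption of the example, as are the function name and toy data.

```python
import math


def logrank_z(durations1, events1, durations2, events2):
    """Two-group log-rank Z statistic and two-sided p-value (normal approximation)."""
    event_times = sorted(
        {t for t, e in zip(durations1, events1) if e}
        | {t for t, e in zip(durations2, events2) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n1 = sum(1 for u in durations1 if u >= t)  # N_1j
        n2 = sum(1 for u in durations2 if u >= t)  # N_2j
        n = n1 + n2                                # N_j
        o1 = sum(1 for u, e in zip(durations1, events1) if e and u == t)  # O_1j
        o2 = sum(1 for u, e in zip(durations2, events2) if e and u == t)  # O_2j
        o = o1 + o2                                # O_j
        observed_minus_expected += o1 - o * n1 / n  # O_1j - E_1j
        if n > 1:
            # hypergeometric variance V_j (zero contribution when only one subject is left)
            variance += o * (n1 / n) * (n2 / n) * (n - o) / (n - 1)
    z = observed_minus_expected / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided tail of N(0, 1)
    return z, p_value


# Toy data (made up): group 1 fails early, group 2 fails late.
z_stat, p_val = logrank_z([1, 2, 3, 4], [True] * 4, [5, 6, 7, 8], [True] * 4)
```

A positive Z here means group 1 had more events than expected under the null hypothesis of equal hazards. The sketch does not guard against the degenerate case in which the summed variance is zero.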

The Cox regression model uses the hazard function h(t, X) = h_{0}(t) exp(β_{1}X_{1} + ⋯ + β_{m}X_{m}) as an intermediate quantity instead of directly determining the relationship between the causal factors X and the survival function S(t, X) [

$$RR = \frac{h(t, X_i)}{h(t, X_j)}$$

The Cox regression model accounts for multiple factors that affect the survival time of the studied subject.
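Because the baseline hazard h_{0}(t) appears in both the numerator and the denominator, the ratio RR reduces to exp(Σ β_{k}(X_{ik} − X_{jk})) and does not depend on t. The short sketch below illustrates this; the coefficient values and covariates are hypothetical and are not estimates from this paper's data.

```python
import math


def hazard_ratio(beta, x_i, x_j):
    """RR = h(t, X_i) / h(t, X_j) = exp(sum_k beta_k * (x_ik - x_jk)); h_0(t) cancels."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_i, x_j)))


# Hypothetical coefficients: one 0/1 indicator covariate and one numeric covariate.
beta = [0.7, -0.1]
rr = hazard_ratio(beta, x_i=[1, 3.0], x_j=[0, 3.0])  # subjects differ only in the indicator
```

With these made-up numbers, the two subjects differ only in the first covariate, so the ratio equals exp(0.7), roughly a doubling of the hazard.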

The exponential and Weibull distributions model the occurrence of a specific event over a time interval. Their probability density functions are, respectively:

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & (x > 0) \\ 0 & (x \le 0) \end{cases}, \qquad f(x; \lambda, k) = \begin{cases} \dfrac{k}{\lambda} \left( \dfrac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^{k}} & (x \ge 0) \\ 0 & (x < 0) \end{cases}$$

Weibull distribution and exponential distribution are very alike [
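One way to see the similarity is that the Weibull density with shape k = 1 is an exponential density. Note that in the formulas above λ is the rate of the exponential distribution but the scale of the Weibull distribution, so the matching exponential rate is 1/λ. The sketch below checks this numerically; the function and parameter names are chosen for the example.

```python
import math


def exponential_pdf(x, rate):
    """Exponential density with rate parameter `rate`."""
    return rate * math.exp(-rate * x) if x > 0 else 0.0


def weibull_pdf(x, scale, shape):
    """Weibull density with scale `scale` and shape `shape`.

    The point x = 0 with shape < 1 is not handled (the density is unbounded there).
    """
    if x < 0:
        return 0.0
    return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-((x / scale) ** shape))


# With shape k = 1 and scale 2.0, the Weibull density matches an exponential with rate 0.5.
pairs = [(weibull_pdf(x, 2.0, 1.0), exponential_pdf(x, 0.5)) for x in (0.5, 1.0, 2.5)]
```

For shape values above 1 the failure rate increases with mileage (wear-out), and for values below 1 it decreases, which is what makes the Weibull family the more flexible of the two.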

The parameters have to be determined before the precise probability density function can be drawn. We mainly use two methods for parameter estimation: point estimation and maximum likelihood estimation [
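As a small worked case of maximum likelihood estimation, consider the exponential distribution with right-censored data. The log-likelihood is d ln(λ) − λ Σ t, where d is the number of observed events and Σ t is the total exposure over all subjects, so the maximizer is λ̂ = d / Σ t. The sketch below applies this closed form to toy data; it illustrates the principle and is not necessarily the exact procedure used in this paper.

```python
def exponential_rate_mle(durations, events):
    """MLE of the exponential rate with right censoring: events / total exposure."""
    total_exposure = sum(durations)              # every subject contributes its full duration
    n_events = sum(1 for e in events if e)       # only observed faults count as events
    return n_events / total_exposure


# Toy data (made up): six vehicles, four observed faults, two censored.
rate_hat = exponential_rate_mle(
    [5, 8, 8, 12, 15, 20],
    [True, True, False, True, False, True],
)
```

Ignoring the censored vehicles, or counting them as failures, would bias the estimated rate upward; the closed form above uses their exposure without counting them as events.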

Parametric estimation can be combined with predefined equations and functions to estimate duration of a project.

To determine whether there is a significant cause-and-effect relationship, we apply a regression test to the results from the Weibull and exponential distributions. First, we hypothesize that there is no relationship between the factor and the result; then we set up the test statistic and plug in the required data from the data set. Next, we determine the rejection region and check whether the value falls within it. Finally, we report whether the hypothesis is accepted or rejected.

Use SQL to sort out the data distribution of the four cars, listed in

As we can see from

We adopt a non-parametric method to analyze the automobile faults. Given the data distribution and the gaps in the data, we analyze and model the data from two angles. The first is to compare the survival curves of the four cars using the Kaplan-Meier estimator. The second is to compare the survival curves of different time periods. We divide the years into four periods: 1996-1999, 2000-2003, 2004-2007, and 2008-2012. The last period is one year longer than the first three because the data for 2012 are relatively sparse.
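The period split described above amounts to mapping each purchase year to one of four labels before fitting a separate survival curve per label. A minimal sketch follows; the list of years is made up for illustration and the helper name is an assumption.

```python
from collections import Counter

# The four periods used in this paper (inclusive year ranges).
PERIODS = [(1996, 1999), (2000, 2003), (2004, 2007), (2008, 2012)]


def period_of(year):
    """Return the label of the period that contains the given purchase year."""
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"year {year} is outside 1996-2012")


# Made-up purchase years, grouped by period.
years = [1996, 1999, 2000, 2003, 2004, 2011, 2012]
counts = Counter(period_of(y) for y in years)
```

Each group of records can then be passed to the Kaplan-Meier estimator on its own, which gives one survival curve per period.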

Year | Chevrolet | Honda | MINI | Toyota
---|---|---|---|---
1996 | 48 | 393 | | 109
1997 | 98 | 509 | | 139
1998 | 116 | 977 | | 171
1999 | 146 | 1171 | | 152
2000 | 209 | 1382 | | 354
2001 | 222 | 1271 | | 229
2002 | 335 | 1419 | 129 | 175
2003 | 348 | 1650 | 254 | 135
2004 | 354 | 1152 | 239 | 96
2005 | 195 | 984 | 337 | 123
2006 | | 574 | 299 | 203
2007 | | 570 | 245 | 105
2008 | | 481 | 194 | 49
2009 | | 231 | 158 | 9
2010 | | 206 | 61 | 11
2011 | | 101 | 25 | 11
2012 | | 47 | 12 | 1

The analysis is implemented in Python 3.6 [

We compare the four types of vehicles over all years using the K-M method; the results for the two kinds of faults are shown in

From

From

The overall K-M comparison across the different time periods, for both faults, is shown in

From

From

In the fifth part of the article, we gave two results: the survival curves of the four cars and the survival curves of the different time periods. To compare the reliability of the four cars more deeply, we carry out some further analysis and calculation here.

Here we select Honda and Toyota and compare their faults in each of the four time periods.

From

From

The median here is the mileage at which 50% of the cars have failed. We compare the medians of the four periods here, and the results are shown in

From

Similar to the method of 6.2, we compare the median of all years here, and the calculation results are listed in

In

Period | The first fault | The second fault
---|---|---
1996-1999 | 270,919 | 329,885
2000-2003 | 250,720 | 374,217
2004-2007 | inf | inf
2008-2012 | inf | inf

Year | Honda, the first fault | Honda, the second fault | Toyota, the first fault | Toyota, the second fault
---|---|---|---|---
1996 | 296,188 | inf | inf | inf
1997 | 255,692 | inf | inf | inf
1998 | 260,828 | 329,885 | 287,496 | inf
1999 | 257,376 | 323,454 | inf | inf
2000 | 237,485 | 301,751 | 324,786 | inf
2001 | 225,687 | inf | inf | inf
2002 | 212,010 | inf | inf | inf
2003 | inf | inf | inf | inf
2004 | inf | inf | inf | inf
2005 | inf | inf | inf | inf
2006 | inf | inf | inf | inf
2007 | inf | inf | inf | inf
2008 | inf | inf | inf | inf
2009 | inf | inf | inf | 114,932
2010 | inf | inf | inf | inf
2011 | inf | inf | inf | inf
2012 | inf | inf | inf | inf
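The "inf" entries in the median tables are consistent with a common convention: when a Kaplan-Meier curve never drops to 0.5 within the observed mileage, the median is reported as infinity. The sketch below shows that convention on two toy curves; it is an illustration under this assumption, not the exact routine used to produce the tables.

```python
import math


def median_survival(curve):
    """Smallest t with s(t) <= 0.5, or infinity if the curve never gets that low.

    curve: [(t, s(t))] sorted by t, e.g. the output of a Kaplan-Meier estimate.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return math.inf


# Toy curves (made up): the first crosses 0.5 at t = 12, the second never does.
reached = median_survival([(5, 0.83), (8, 0.67), (12, 0.44), (20, 0.0)])
not_reached = median_survival([(5, 0.9), (8, 0.8)])
```

Read this way, an infinite median means that fewer than half of the vehicles in that group had failed by the largest observed mileage, which is more likely for recent model years with little accumulated mileage.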

The most obvious shortcoming of this study is that it relies on a single data source and that the data for two of the cars are incomplete: within the 17 years, there are 6 years of missing data.

This paper analyzes the survival of vehicles with respect to engine and transmission faults and compares the reliability of four vehicles from different manufacturers. The research shows that applying the Kaplan-Meier fitter method and the log-rank test yields insight into how a car brand can be improved and into which vehicles perform best. A comparative analysis of the four time periods suggests that the entire industry may be getting better. Data analysis can provide customers with very useful vehicle reliability information for reference at the time of purchase. Survival analysis methods can also be applied to specific parts of a vehicle, such as the most commonly damaged parts, for example a tire or the suspension. This is one direction for our follow-up studies.

The authors declare no conflicts of interest regarding the publication of this paper.

Xu, P.Z. and Gao, J.H. (2019) Case Study of Four Vehicle Reliability Comparison Based on Survival Analysis. Journal of Transportation Technologies, 9, 109-119. https://doi.org/10.4236/jtts.2019.91007