Recent developments in database technology have led to a wide variety of data being stored in huge collections. This variety makes the analysis of a generic database a strenuous task in knowledge discovery. One approach is to summarize large datasets so that the resulting summary dataset is of manageable size. The histogram has received significant attention as a summarization, or representative, object for large databases, but it suffers from computational and space complexity. In this paper, we propose transforming the histogram object into a Piecewise Linear Regression (PLR) line object and suggest that PLR objects can be less computationally and storage intensive than histograms. To carry out cluster analysis, we also propose a distance measure for computing the distance between PLR lines. A case study is presented based on real data from an online education system (LMS). It demonstrates that PLR is a powerful knowledge representative for very large databases.

Knowledge Discovery in Databases (KDD) is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [

As we live in a digital world, there is a steady increase in accumulation of structured and unstructured data from various sources such as transactions, social media, sensors, digital images, videos, audios and click streams for domains including healthcare, retail, energy and utilities. For instance observation [

To incorporate new concepts in knowledge representation, Diday [

Our objectives in this paper are to propose a new histogram-based piecewise linear regression method, to use it to summarize very large datasets into smaller ones, and thereby to enhance the data mining technique used to mine knowledge patterns in big data. Section 2 presents a brief literature review of symbolic data analysis and gives the reasons for selecting PLR. Section 3 describes histogram-based piecewise linear regression lines and proposes a distance measure between two piecewise linear regression features. Section 4 presents a case study on an online education system to evaluate the method and discover knowledge. Section 5 concludes.

As in [

Also in the case of histogram there are some disadvantages. It is observed that algorithms for producing histograms are required to have the same number of bins and same bin width for all datasets for the effective characterization of data into histograms [

To overcome the disadvantages, an idea has been proposed [

Analysis of the regression model shows that a single linear regression line is not always the “best fit” for the histogram model, because the sum of the squared residuals can remain large. To be precise, a single linear regression model may not provide an adequate description for generic databases, and a nonlinear model may not be appropriate either. Besides, the corresponding regression-based distance measure would produce factual errors. To overcome this problem, we propose a model using PLR [

A problem which recurs occasionally is the estimation of regression parameters when the data sample is hypothesized to have been generated by more than a single regression model. This has been referred to as “piecewise linear regression” [

To state our problem, let Y be the response variable and X the explanatory variable. Assume that there is a sample of n observations governed by a histogram model. Our objective is to represent the histogram by a monotonically increasing piecewise linear regression with k segments separated by breakpoints (BP). The simplest piecewise regression model joins two straight lines sharply at the breakpoint as follows:

\[ y_{i} = \begin{cases} m_{1} x_{i} + c_{1}, & x_{i} \le BP \\ m_{2} x_{i} + c_{2}, & x_{i} > BP \end{cases} \tag{1} \]

where y_{i} is the value for the i^{th} bin, x_{i} is the corresponding value of the independent variable, m_{1} and m_{2} are the slopes of the line segments, and c_{1} and c_{2} are the intercepts on the y-axis. The present paper recommends the following procedure for fitting a piecewise regression line.
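The two-segment model described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and assigning the breakpoint itself to the first segment is an assumption not stated in the text.

```python
def piecewise_two_segment(x, bp, m1, c1, m2, c2):
    """Evaluate a two-segment piecewise linear model at x.

    Assumption: the breakpoint bp belongs to the first segment.
    """
    if x <= bp:
        return m1 * x + c1   # first line segment
    return m2 * x + c2       # second line segment


# Two segments that meet at the breakpoint x = 2 (both give y = 2 there).
left = piecewise_two_segment(1, 2, 1.0, 0.0, 3.0, -4.0)
right = piecewise_two_segment(3, 2, 1.0, 0.0, 3.0, -4.0)
```

With these illustrative parameters the two lines join at the breakpoint, which is the "sharp join" the model describes.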

The major problem is to determine the number and location of the “break points” of the underlying regression systems. The procedure we employ is to examine the first-order derivative of the histogram, similar to the procedure proposed by Howard Wainer [ The first-order derivative can be found between each pair of successive bins i and i + 1 using (2).

where I is the number of bins. Hence the places where the first order derivatives
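The first-derivative screening described above can be sketched as follows. The exact form of (2) is not reproduced here, so the finite difference between successive bins is an assumption, as are the function names and the choice to rank candidates by absolute value.

```python
def first_differences(counts, centers=None):
    """Finite-difference slope between each pair of successive bins i and i + 1.

    Assumption: bins are unit-spaced (1, 2, ..., I) unless centers are given.
    """
    if centers is None:
        centers = list(range(1, len(counts) + 1))
    return [
        (counts[i + 1] - counts[i]) / (centers[i + 1] - centers[i])
        for i in range(len(counts) - 1)
    ]


def rank_candidate_breakpoints(counts, centers=None):
    """Return 1-based bin indices ordered by decreasing |first difference|.

    Index i refers to the step between bin i and bin i + 1.
    """
    diffs = first_differences(counts, centers)
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    return [i + 1 for i in order]


candidates = rank_candidate_breakpoints([1, 1, 1, 5, 5])
```

The first entry of `candidates` is the most likely breakpoint; the later entries are examined only if the earlier ones prove significant.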

The next problem is to determine which of these points are breakpoints and which are simply bad data points that caused a spuriously high first derivative. Suppose a breakpoint occurs at point j: we wish to determine whether the parameters of the regression system determined by points 1 to j are significantly different from those of points j to n, where n is the total number of points. To do this we construct a confidence interval around m_{1} and b_{1}, where m_{1} and b_{1} are the slope and the y-intercept, respectively, of the best-fitting (in a least-squares sense) straight line for points 1 to j. We then inspect our estimates of m_{2} and b_{2} (the same parameters for points j to n) to see whether they lie in the confidence interval. The uncertainty about the distribution of regression parameters indicates that the jackknife [

It is clear that if a breakpoint exists at all, it is most likely to be where the first derivative is largest. If the regression systems on either side of point j are in fact different, one can continue the same process by examining the second largest first derivative; if it lies below point j, say at point k, repeat the above procedure for points 1 to k and k to j. This can be continued until no further significant differences are obtained.
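A minimal sketch of the jackknife test on the slope and intercept is given below. It assumes leave-one-out pseudo-values and a user-supplied critical value `t_crit` (the default of 2.0 is a placeholder, not a value from the paper); all function names are hypothetical.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Least-squares slope and intercept for the points (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, my - m * mx


def jackknife_intervals(xs, ys, t_crit=2.0):
    """Jackknife confidence intervals for slope and intercept.

    xs and ys must be lists with at least three points.
    Returns ((m_low, m_high), (b_low, b_high)).
    """
    n = len(xs)
    m_all, b_all = fit_line(xs, ys)
    pseudo_m, pseudo_b = [], []
    for i in range(n):
        # Refit with point i left out, then form the pseudo-values.
        m_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        pseudo_m.append(n * m_all - (n - 1) * m_i)
        pseudo_b.append(n * b_all - (n - 1) * b_i)

    def interval(values):
        centre = mean(values)
        half = t_crit * stdev(values) / n ** 0.5
        return centre - half, centre + half

    return interval(pseudo_m), interval(pseudo_b)


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

To test a candidate breakpoint j, one would call `jackknife_intervals` on the points before j and on the points after j, and treat j as a breakpoint when the corresponding intervals do not overlap.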

To fit the PLR, the histogram is converted to a normalized cumulative histogram, because a histogram is not monotonically increasing in shape, whereas a cumulative histogram is always non-decreasing. The estimated breakpoints are mapped to their respective positions. We then connect each pair of adjacent points by a straight line; the resulting segments are represented by a set of slopes and intercepts.
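The conversion and line-joining step can be sketched as follows. Which points are joined (here: the first bin, the breakpoint, and the last bin) is an assumption for illustration, and the function names are hypothetical.

```python
def normalized_cumulative(counts):
    """Running sum of the bin counts divided by the total count."""
    total = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / total)
    return out


def segments_through(points):
    """Slope and intercept of the line joining each pair of adjacent (x, y) points."""
    segments = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        m = (y1 - y0) / (x1 - x0)
        segments.append((m, y0 - m * x0))
    return segments


cum = normalized_cumulative([1, 1, 2])            # three bins
knots = [(1, cum[0]), (2, cum[1]), (3, cum[2])]   # assumed breakpoint at bin 2
plr = segments_through(knots)                      # list of (slope, intercept)
```

Each symbolic object is then stored as the list `plr` of slope and intercept pairs rather than as the full vector of bin counts.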

Having built the piecewise regression line, this section proposes a distance measure between two piecewise regression lines, that is, a distance between two symbolic objects. The key idea is to find the area between consecutive breakpoints.

Consider parallel lines drawn from every breakpoint to the other piecewise regression line, with respect to the x axis. This splits the area between the two lines into subareas.

Each subarea now has one pair of parallel sides with respect to the x axis and is bounded on its other two sides by a pair of linear regression lines, so each subarea can be treated as a trapezium [

Case 1: When the pair of lines do not intersect, as shown in

where c_{1}, m_{1} are the intercept and slope of simple linear regression line 1 and c_{2}, m_{2} are the intercept and slope of simple linear regression line 2.

Case 2: When the pair of lines intersect, as shown in

where

The distances computed from all subregions are summed. A distance measure (also called a metric) is a dissimilarity measure: values closer to zero reveal that two symbolic objects are highly similar, and larger values that they are less similar.
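The area-based distance can be sketched as below. The paper's own formulas for the two cases are not reproduced here, so this uses the standard trapezium area for the non-intersecting case and two triangles for the intersecting case; the function names and the input layout are assumptions.

```python
def subarea(line1, line2, a, b):
    """Area between two lines y = m*x + c over the interval [a, b].

    line1 and line2 are (slope, intercept) pairs.
    """
    m1, c1 = line1
    m2, c2 = line2
    da = (m1 * a + c1) - (m2 * a + c2)   # vertical gap at the left edge
    db = (m1 * b + c1) - (m2 * b + c2)   # vertical gap at the right edge
    if da * db >= 0:
        # Case 1: lines do not cross inside [a, b]; the region is a trapezium.
        return abs(da + db) / 2 * (b - a)
    # Case 2: lines cross at x_star; the region is two triangles.
    x_star = (c2 - c1) / (m1 - m2)
    return abs(da) * (x_star - a) / 2 + abs(db) * (b - x_star) / 2


def plr_distance(pieces):
    """Sum of subareas; pieces is a list of (line1, line2, a, b) tuples."""
    return sum(subarea(l1, l2, a, b) for l1, l2, a, b in pieces)


d = plr_distance([
    ((0.0, 1.0), (0.0, 0.0), 0.0, 2.0),   # parallel lines, gap 1, width 2
    ((1.0, 0.0), (-1.0, 1.0), 0.0, 1.0),  # lines crossing at x = 0.5
])
```

Here each tuple describes one subarea between consecutive breakpoints; a distance of zero is returned only when the two piecewise lines coincide on every interval.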

In this section, we present a case study on a real dataset. The case study uses quiz data pooled by a Learning Management System (LMS) called EkLuv-Ya [Ekl0]. EkLuv-Ya is a product created by Amphisoft Technologies for revolutionizing engineering education through Automated Evaluation Systems in different branches of engineering.

From the standpoint of big data, the main objective of this case study is to show how PLR summarizes the datasets in a meaningful and intelligent fashion, retaining their important and relevant features, and how it enhances the data mining technique used to mine knowledge patterns in big data. Hence we need to use a clustering technique in integration with SDA. Clustering is an unsupervised learning problem that groups objects based on distance or similarity. Each group is known as a cluster [

A typical web-based LMS such as Moodle [

The dataset under investigation is the quiz data. It contains the marks scored in different quizzes by 52 sixth-semester Bachelor of Engineering students from Aditya Institute of Technology, Coimbatore (Tamil Nadu), India. The subjects of study, in order of appearance, are: Computer Graphics, Mobile Computing, Numerical Methods, Object Oriented System Design and Open Source Software. For the sake of experimentation, the minimum mark for each quiz is 2 and the maximum mark is 10. Students who could not take the quiz in a particular subject have been given a mark of “1”. The total number of bins is ten, which is proportional to the number of quizzes. For ten bins, the number of breakpoints is fixed at 1.
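As an illustration of how one student's quiz marks in one subject could be binned, the sketch below counts marks into ten bins. Using one bin per integer mark from 1 to 10 is an assumption, and the sample marks are made up for the example rather than taken from the dataset.

```python
def marks_histogram(marks, low=1, high=10):
    """Count integer marks into one bin per mark value from low to high.

    A mark of 1 denotes a quiz the student did not take.
    """
    counts = [0] * (high - low + 1)
    for mark in marks:
        if not low <= mark <= high:
            raise ValueError(f"mark {mark} outside [{low}, {high}]")
        counts[int(mark) - low] += 1
    return counts


# Hypothetical marks for ten quizzes in one subject.
hist = marks_histogram([1, 7, 8, 8, 9, 9, 9, 10, 10, 6])
```

The resulting list of ten counts is one cell of the histogram matrix, which is then summarized by its piecewise regression line.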

Results and Discussion

The histogram matrix computed by summarizing the distribution of marks of ten quizzes under each of five subjects for every student is illustrated in

Tukey’s “jackknife” is used to test the meaningfulness of that breakpoint. For illustration, the data are shown in

The points 8 - 10 were also jackknifed and yielded the following parameter estimates:

where M and B are the slope and the y-intercept of the regression line indicated. These two pairs of intervals do not even overlap and hence we concluded that point 7 was a break point.

Then histogram is converted to cumulative histogram

Piecewise regression line

The obtained distance matrix is given as input to cluster analysis [

above 70%, which indicates good knowledge in the subjects. Cluster 2 (red) contains students scoring below 70%, who have poor knowledge in the subjects. To validate the results, we compared them with the assessments of the teachers handling the subjects, and they matched the expected results exactly. This experimentation will help the teacher identify the students of a particular group and counsel them, since the serial numbers of the students are retained as knowledge. It also promises to enhance the academic planner’s decision making about which specific subject should be improved to achieve student learning effectiveness and progress [
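The clustering step takes the distance matrix as its only input. Since the specific clustering algorithm is not named here, the sketch below uses average-linkage agglomerative clustering purely as a stand-in; the function name and the toy matrix are hypothetical.

```python
def agglomerative(dist, k):
    """Average-linkage agglomerative clustering on a square distance matrix.

    dist[i][j] is the distance between objects i and j.
    Returns k clusters as sorted lists of object indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise distance between the members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]   # merge the closest pair
        del clusters[q]
    return [sorted(c) for c in clusters]


# Toy distance matrix: objects 0 and 1 are close, as are objects 2 and 3.
toy = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
groups = agglomerative(toy, 2)
```

Because the indices are preserved in the output, each student's serial number can be traced back from the cluster it falls in.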

Big data are a new phenomenon, and a characteristic of modern large-scale datasets is that they often have a nontraditional form. New statistical methodologies with radically new ways of thinking about data are required. In that sense, PLR can be considered a method for big data: it holds more detail about the data in a compressed form that requires only two parameter sets, slopes and intercepts. As a knowledge representative used instead of a histogram, PLR also reduces the memory requirement and the computational complexity of the symbolic histogram. We employed Tukey’s “jackknife” to test the meaningfulness of any suggested subdivision of the regression curve into pieces. To support the proposed theory, we carried out a case study in an educational environment. We expect that this work can be applied in other areas, including finance and banking.

Mythili, S., Pradeep Kumar, R. and Nagabhushan, P. (2016) Knowledge Discovery in Learning Management System Using Piecewise Linear Regression. Circuits and Systems, 7, 3862-3873. http://dx.doi.org/10.4236/cs.2016.711322