Using Data Mining with Time Series Data in Short-Term Stocks Prediction: A Literature Review

Data Mining (DM) methods are increasingly used in prediction with time series data, in addition to traditional statistical approaches. This paper presents a literature review of the use of DM with time series data, focusing on short-term stocks prediction. This is an area that has been attracting a great deal of attention from researchers in the field. The main contribution of this paper is to provide an outline of the use of DM with time series data, using mainly examples related to short-term stocks prediction. This is important for a better understanding of the field. Some of the main trends and open issues are also introduced.


Introduction
Data Mining (DM) is a challenging field for research and has been applied successfully in practice in several different areas. DM methods are increasingly used in prediction with time series data, in addition to traditional statistical approaches [1][2][3].
DM can be presented as one of the phases of the Knowledge Discovery in Databases (KDD) process [4][5][6], and is identified as "the means by which the patterns are extracted from data" [7]. Nowadays, the two terms, DM and KDD, are used interchangeably.
The OECD Glossary of statistical terms [8] presents the following definition: "A time series is a set of regular time-ordered observations of a quantitative characteristic of an individual or collective phenomenon taken at successive, in most cases equidistant, periods/points of time". There are several application domains of DM with time series data, one of the most important being short-term stocks prediction, which is the focus of this paper. Short-term stocks prediction is a difficult problem and can be considered an open research issue [9,10]. Intelligent forecasting models have achieved better results than traditional methods, particularly in short-term forecasts [11]. Although intelligent forecasting methods perform better, there is still room to improve the results, in terms of accuracy as well as other factors.
The main contribution of this paper is to provide an outline of the use of DM with time series data, using mainly examples related to the prediction of short-term stocks or market indexes. This is important for a better understanding of the field. Some of the main trends and open issues are also introduced.
The paper is organized as follows: DM with time series data is presented in Section 2, the integration of fundamental data is explored in Section 3, and data frequency issues are introduced in Section 4. The paper closes in Section 5 with conclusions and future research directions.

Data Mining with Time Series Data
Since the seminal paper of Fayyad in 1996 [4], the Data Mining (DM) area has attracted a great deal of interest and can nowadays be considered an established field. DM applications can be found in a diverse range of application domains. One important application domain is that of time series data. "A time-series data set consists of sequences of numeric values obtained over repeated measurements of time. The values are typically measured at equal time intervals (e.g., every minute, hour, or day)" [5]. These measurements can be taken over one variable or over several variables, giving univariate or multivariate time series, respectively.

Data Mining with Time Series Data Applications
DM with time series data is popular and many applications can be found in the literature, for instance, earthquake forecasting [12], characterization of ozone behavior [13], or flood prediction [14]. Another application example is financial decision making. A decision support tool for financial forecasting, named EDDIE, is presented in [15]. In [16], a new architecture is presented that implements a binary neural network, AURA, to produce discrete probability distributions as forecasts, using high frequency data sets. The use of support vector machines and back propagation neural networks to predict credit ratings is presented in [17].
One important application concerns short-term stocks prediction, which is the main focus of this paper. In [18], an approach to the paradox of obtaining better results with long-horizon forecasts than with short-horizon forecasts is presented, and it is claimed that the paradox is solved, since the proposed model obtains promising results. Nevertheless, there is a great deal of interest from investors in short-horizon forecasts, so the authors consider that research focusing on this issue is important, namely in using data mining with time series for short-term stocks prediction.

Data Mining Techniques Used with Time Series Data for Short-Term Stocks Prediction
Several DM techniques are used with time series data in order to obtain short-term stocks predictions. An interesting approach to portfolio management, using the Gaussian temporal factor analysis technique, is introduced in [19]. Neural networks are one of the most popular techniques for stocks prediction; [20][21][22][23][24][25] are some examples.
In [22], rough sets and classification trees are used as well. Rough sets are also used in [26]. Support Vector Machines are used in [27]. No strong evidence has yet been given that any one technique is better than the others, but nonlinear models are more popular.

Specific Challenges
Using DM with time series data presents several specific challenges. In [28,29] the authors focus on the issue of representing time series data in order to apply DM effectively and efficiently. In [28], three types of algorithms are presented and compared, namely the sliding window algorithm, the top-down algorithm, and the bottom-up algorithm, and a new approach, which is claimed to overcome the drawbacks of these three algorithms, is introduced. In [29], a new concept, named median strings, is presented as a simple and, at the same time, powerful representation for time series data.
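To make the sliding window idea concrete, the following is a minimal sketch of a piecewise linear segmentation in Python. It is not the implementation evaluated in [28]: the error measure (largest vertical distance to the straight line joining the segment endpoints), the threshold `max_error`, the function names, and the sample values are all assumptions chosen for illustration.

```python
def interpolation_error(values, start, end):
    """Largest vertical distance between the points strictly inside
    [start, end] and the straight line joining the two endpoints."""
    span = end - start
    worst = 0.0
    for i in range(start + 1, end):
        fitted = values[start] + (values[end] - values[start]) * (i - start) / span
        worst = max(worst, abs(values[i] - fitted))
    return worst


def sliding_window_segments(values, max_error):
    """Return (start, end) index pairs of consecutive linear segments.

    A segment is anchored at `start` and grown one point at a time until
    adding the next point would push the error above `max_error`.
    """
    segments = []
    start = 0
    last = len(values) - 1
    while start < last:
        end = start + 1
        while end < last and interpolation_error(values, start, end + 1) <= max_error:
            end += 1
        segments.append((start, end))
        start = end  # the next segment begins where this one stopped
    return segments


# Hypothetical series: a rise, a fall, and a flat stretch.
prices = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 1.0, 1.0]
print(sliding_window_segments(prices, max_error=0.1))
```

With this threshold the sample series is split at the two turning points, so nine observations are represented by three segments. A larger `max_error` gives fewer, coarser segments.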
Another interesting issue is to find out whether different time series, or parts of a time series, behave similarly. This issue can be approached through the use of similarity measures and indexing techniques. Interesting reviews can be found in [30,31].
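As an illustration of a similarity measure, the sketch below contrasts the Euclidean distance with dynamic time warping (DTW). The text above does not name a specific measure; these two are chosen here only because they are widely known, and the function names and sample series are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Point-by-point distance; both series must have the same length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost.

    cost[i][j] holds the cheapest alignment of the first i points of `a`
    with the first j points of `b`.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to an already used b point
                cost[i][j - 1],      # b[j-1] is matched to an already used a point
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


# Hypothetical series: the same bump, shifted by one time step.
bump = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
late_bump = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(euclidean_distance(bump, late_bump), dtw_distance(bump, late_bump))
```

The Euclidean distance penalizes the one-step shift, while DTW aligns the two bumps and reports a distance of zero. That tolerance to shifts in time is the usual reason for preferring a warping measure, at the price of quadratic running time in this plain form.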
Overfitting is a common problem across DM applications, and DM with time series data is no exception. In [32], an approach that aims to overcome this problem is presented.
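One common safeguard against overfitting with time-ordered data is to evaluate a model only on observations that come after those used for training. The sketch below generates such walk-forward splits. It is not the approach proposed in [32]; the function name, its parameters, and the expanding-window scheme are assumptions made for illustration.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs in time order.

    The training window always ends before the test window starts and it
    expands from fold to fold, so no future observation leaks into training.
    """
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        test_end = train_end + test_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Ten hypothetical daily observations, three folds, at least four for training.
for train, test in walk_forward_splits(n_samples=10, n_folds=3, min_train=4):
    print(train, test)
```

A model whose error is low on the training indices but markedly higher on the later test indices is a candidate for overfitting.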
Another important issue concerns how to implement each of the phases of the KDD process, taking into account the specificities of time series data. An application of DM with time series data for short-term stock prediction is presented in [1], analyzing all the phases of the KDD process. Promising results were achieved, but the authors note that including fundamental data could help improve the results obtained.
Table 1 presents a summary of the main techniques and challenges.

Including Fundamental Data
Concerning short-term stocks prediction, one possible approach is to collect historical financial data, such as open price, high price, low price, close price, and volume. These can be used on a daily basis, or at other frequencies considered appropriate. Several indicators can be derived from them and used for a more adequate analysis. This approach is named technical analysis. Another possible approach is to use statistical data, such as macroeconomic indexes and the basic financial indicators of the company. This approach is named fundamental analysis. Table 2 summarizes some of the technical and fundamental features found in the literature. Other studies, for instance [37][38][39], present similar indicators.
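As a minimal sketch of how indicators may be derived from historical prices, the code below computes a simple moving average and a rate of change from daily close prices. The choice of these two indicators, the function names, and the sample prices are assumptions made for illustration; they are not taken from Table 2.

```python
def simple_moving_average(closes, window):
    """Average of the last `window` closes, one value per day once enough
    history is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(closes[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(closes))
    ]


def rate_of_change(closes, period):
    """Relative price change with respect to the close `period` days earlier."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return [
        (closes[i] - closes[i - period]) / closes[i - period]
        for i in range(period, len(closes))
    ]


# Hypothetical daily close prices.
closes = [10.0, 11.0, 12.0, 11.0, 13.0]
print(simple_moving_average(closes, window=3))
print(rate_of_change(closes, period=1))
```

Both functions return shorter lists than the input, because the first days lack the history needed to compute a value. Any model that consumes these features has to align them with the remaining inputs accordingly.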
From the literature review it is clear that one of the main issues in obtaining good predictions is related to the first phase of the KDD process, that is to say, the selection of features. Another aspect that arises from the literature review is that most researchers use only one of the two types of analysis, technical or fundamental. Thus, the combination of both types of indicators is still under-explored. In addition, most studies use macroeconomic variables and neglect the important financial indicators of the companies. Considering the application domain, it is clear that the evolution of stock prices is influenced by both types of variables, so considering both could lead to good results.
One of the main issues related to the combination of both types of features is that time series data have different frequencies (Figure 1). Usually, technical features have daily frequencies and fundamental features have monthly, quarterly, or lower frequencies, which raises integration issues. These integration issues are very important and have several implications.
In the literature review, only a few works use DM with time series data with different frequencies; [22,34] are two examples. These studies present promising results, but the use of neural networks is something of a limitation. Neural networks, despite usually yielding good results, function as a "black box", which makes it difficult to understand the mechanism and the generated model.

Integrating Features with Different Frequencies
From the literature review it can be concluded that these issues need further research, and that it can be useful to test other methods and to explore the selection of different features.
As stated above, interesting results could be obtained through the integration of time series data with different frequencies. Short-term stocks prediction mainly requires time series with data collected daily, yielding high frequency time series, as opposed to the low frequency time series obtained from the collection of fundamental data. Forecasts should be made on a daily basis, so there are some important issues for research. Some research approaching the integration of time series features with different frequencies can be found in the literature. Traditional approaches use regression algorithms such as MIDAS [37,38]. Nevertheless, this approach does not use DM.
The application domain is an important issue to consider when applying DM, and it should also be considered in this case. Taking the application domain into account is likely to bring good insights and to improve the results.
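The frequency mismatch discussed in this section can be illustrated with a simple alignment step: each daily observation is paired with the most recent low frequency value that had already been released on that day. This is only a carry-forward sketch under stated assumptions. It is neither MIDAS nor the method used in [22,34], and the function name, the dates, and the values are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def align_low_frequency(daily_dates, release_dates, release_values):
    """For each daily date, return the latest low frequency value released
    on or before that date, or None if nothing had been released yet.

    `release_dates` must be sorted in ascending order and must have the
    same length as `release_values`.
    """
    aligned = []
    for day in daily_dates:
        position = bisect_right(release_dates, day)
        aligned.append(release_values[position - 1] if position > 0 else None)
    return aligned


# Hypothetical trading days and quarterly releases of a fundamental indicator.
trading_days = [date(2012, 3, 29), date(2012, 3, 30), date(2012, 4, 2), date(2012, 7, 2)]
release_days = [date(2012, 3, 30), date(2012, 6, 29)]
indicator = [1.5, 1.8]
print(align_low_frequency(trading_days, release_days, indicator))
```

Using the release date rather than the end of the reporting period keeps the daily feature free of information that was not yet available to investors. Days before the first release are left as `None` so that the missing history stays visible instead of being silently filled.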

Conclusions and Future Research Directions
This paper presents a literature review of the use of data mining with time series data. The review provides a better understanding of the field of study, which is an important contribution of this paper.
From the literature review it can be concluded that this subject attracts a great deal of interest from researchers. Nevertheless, several research issues remain unexplored. One of those identified during this research is the combined use of fundamental and technical indicators. The combined use of both types of indicators also raises the issue of integrating time series with different frequencies.
Feature selection, corresponding to the first phase of the KDD process, is also an issue that requires more research.
Future research directions include the study of ways to select the best features for DM with time series data. The existence of features with different frequencies is a concern, and methods that help to address this problem will be planned and implemented.

Figure 1. Time series with different frequencies.

Table 2. Features for technical and fundamental analysis.