A Short-Term Stock Exchange Prediction Model Using Box-Jenkins Approach

This paper developed a short-term stock exchange prediction model using the Box-Jenkins approach. Monthly data from the Ghana Stock Exchange market report, spanning March 2013 to February 2018, were used to develop the model. An ARIMA (0, 2, 1) model was fitted to the data, with the Bayesian Information Criterion (BIC) used for model selection. Diagnostic checks showed that the residuals of the fitted model were uncorrelated. The developed model was used to forecast a period of six months. The trend of the forecasted values showed a significant increase in the Ghana Stock Exchange performance for the next six months.


Introduction
A stock exchange market is the center of a network of transactions where buyers and sellers of securities meet to provide a clear indication of the market price for each investment. The exchange also plays a key role in the mobilization of capital from shareholders for companies in exchange for shares in ownership to investors in emerging and developed countries. This leads to growth of industry and commerce of the country; and this is a consequence of liberalized and globalized policies adopted by most emerging and developed governments [1] [2] [3] [4].
Even though stock exchange markets have been classified as the most volatile in the world and are full of anonymity and escapade performances [5], stock investments are one of the investment options that have become very attractive to both foreign and local investors, owing to the ease of access to the stock market and the expectation of a high rate of return [6]. In a stock market, financial information is one of the key elements among several factors (e.g. financial policy, monetary policy, foreign trade policy and macroeconomic factors) that influence stock prices and inform investors whether or not to invest their savings in a company's stock [6] [7].
In the stock exchange market, it is known that changes in the stock prices as well as the returns may be attributed to various prevailing risks and events such as economic crisis, natural disasters, movements in international oil prices, inflation effects, foreign exchange rates, changes in government policies, regulations and norms occurring within a country and across the world [8]. Hence, the study of stock market price volatility has been a subject of interest in finance and econometrics. The study of these price changes has become relevant in the context of quantitative analysis, financial time series modelling, volatility assessment and risk analysis [9]. In addition to that, these occurring variations have necessitated the need to investigate the determinants of the stock market performance, analyse the factors causing the variations in the performance indicators, formulate mathematical models that can best fit the performance indicators, explain the underlying behavioural patterns and forecast these indicators using appropriate dataset.
For years, the relationship between financial sector development and real economic activity has been a debatable issue in theoretical and empirical research [10]. Reference [10] argued that well-functioning financial systems en- earity. In the analytical time series, a good forecasting result is achieved on condition that the data being analysed is stationary [11] and [12].
References [13] [14] [15] [16] each addressed this issue and developed Regression Models (RMs) to determine the relationship between stock market performance and its macroeconomic determinants. However, according to [17], the empirical results are still debatable because of inconsistency in the macroeconomic determinants employed in formulating the models. On the question of which macroeconomic determinant(s) to include in the RMs, [18] argued that stock prices or returns follow a random walk, so that accurately predicting future returns is a difficult task. Nevertheless, numerous studies on stock return prediction have focused on classical statistical methods, in particular ARIMA, which has dominated the analysis of financial data as a popular model for predicting future stock prices [19]. In this regard, this study employs the Box-Jenkins approach as an alternative to the RMs in stock market research.
An example of a stock market that requires attention is the Ghana Stock Exchange (GSE). The GSE plays an important role in the economic development of Ghana and its corporate finance. It is well known that an organised and well-managed stock market stimulates investment opportunities by recognizing and financing productive projects that lead to real economic activity.
Reference [20] affirmed this assertion and showed from their study that there exists a strong positive relationship between stock market development and economic growth.
Since systemic risk in GSE performance strongly affects stock market investments and the country's economic development, this study seeks to develop a time series model based on the Box-Jenkins approach to help capital investors identify the trend in the GSE and forecast it appropriately.

Related Works
Reference [21] investigated the dimensionality and expectancy of a naïve investor. The authors used a historical dataset of four Indian midcap companies to train the ARIMA model. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were applied to select the most accurate prediction model. The formulated prediction model was tested on individual stocks and the Nifty 50 Index. It was observed that the Nifty Index is the better option for naïve investors because of its low error and volatility. Reference [22] studied the relationship between some macroeconomic variables (exchange rate and oil price) and stock prices in the following emerging countries: Brazil, China, India and Russia. Monthly data spanning March 1999 to June 2006 were analyzed using the Box-Jenkins approach. The results showed no significant relationship between oil prices or the exchange rate and the stock markets of these emerging countries. As a result, a weak form of market efficiency exists in those capital markets.
In Thailand, [23] examined the stock market to find out the relation between the following selected macroeconomic variables: money supply, exchange rate, oil prices, industrial production and share price index by performing time series analysis. It was concluded that money supply positively affected stock prices while exchange rate negatively influenced stock prices.
Reference [24] studied the trends, similarities and patterns in the activities and movements of the Indian Stock Market in comparison to its international counterparts. The time period was divided into various eras to test the correlation between the exchanges and to show that the Indian markets had become more integrated with their global counterparts and that their reactions were in tandem with those seen globally.
Reference [25] analyzed the extensive process of building ARIMA models. To identify the optimal model, the authors employed the following criteria: standard error of regression, adjusted R-squared and Bayesian Information Criterion (BIC). Based on these criteria, the best ARIMA model did a satisfactory job of predicting the stock prices of Nokia and Zenith Bank. In addition, the authors made a strong argument for the forecasting potential of ARIMA models in stock analysis, because they could compete reasonably well against emerging modern forecasting techniques for short-term prediction.

Resources
The study used two main resources: 1) Monthly data that spans from March 2013 to February 2018 obtained from Ghana Stock Exchange Market Report (Table 1); and 2) R Statistical software.

Linear Model
In this study, the Ordinary Least Squares (OLS) technique was used to fit a regression equation to the GSE time series data. The purpose, according to [26] and [27], is to determine whether the time series data (i.e. GSE) exhibits a linear trend.
Knowledge of the linear trend projection enables the modeller and the user to: 1) describe historical trend patterns; 2) project past trend patterns into the future; and 3) eliminate the trend component from the time series data. Consider the Simple Linear Regression (SLR) given in Equation (1). The OLS method, which minimises the sum of squared errors, yields the coefficient estimates in Equations (2) and (3), where n is the sample size.
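The trend fit described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the regressor is the time index t = 1, ..., n and the model is y_t = alpha + beta * t; the function name `fit_linear_trend` and the sample values are hypothetical and are not taken from the study.

```python
def fit_linear_trend(y):
    """Fit y_t = alpha + beta * t by ordinary least squares, with t = 1..n.

    Assumption: the time index is the only regressor.
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    t = range(1, n + 1)
    sum_t = sum(t)
    sum_y = sum(y)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))
    sum_tt = sum(ti * ti for ti in t)
    # Closed-form OLS estimates obtained by minimising the sum of squared errors.
    beta = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
    alpha = sum_y / n - beta * sum_t / n
    return alpha, beta


# Hypothetical series lying exactly on the line y = 3 + 2t.
alpha, beta = fit_linear_trend([5.0, 7.0, 9.0, 11.0])
```

A slope estimate that differs significantly from zero indicates a linear trend in the series.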

Hypothesis Testing
The hypothesis for the study is formulated as follows:
$H_0$: $\beta$ is equal to zero
$H_1$: $\beta$ is different from zero

ARIMA Model
According to [28] and [29], the Box-Jenkins Autoregressive Integrated Moving Average (ARIMA) model consists of the Autoregressive (AR (p)) model and the Moving Average (MA (q)) model. When these two models are put together, the Autoregressive Moving Average (ARMA (p, q)) model is formed.
ARMA processes form the core of time-series analysis. According to [30] and [31], the first-order moving average, abbreviated as MA (1), is the simplest non-degenerate time-series process and is defined in Equation (4),
where $\phi_0$ and $\phi_1$ are unknown model coefficients whose values are estimated from the data, and $\varepsilon_t$ is a white noise process.
Journal of Applied Mathematics and Physics
The first-order autoregressive process, abbreviated AR (1), has the dynamics given in Equation (5), where $\theta_0$ and $\theta_1$ are unknown model coefficients whose values are estimated from the data and $\varepsilon_t$ is a white noise process. An Autoregressive Moving Average process with orders P and Q, ARMA (P, Q), has the dynamics given in Equation (6).
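For reference, the textbook forms of these three processes can be sketched with the coefficient symbols used above. The symbol $y_t$ for the observed series, the placement of the intercepts and the use of $\theta_0$ as the ARMA intercept are assumptions made for this sketch and may differ from the layout of Equations (4) to (6).

```latex
\begin{align}
y_t &= \phi_0 + \varepsilon_t + \phi_1 \varepsilon_{t-1}
    && \text{MA}(1) \\
y_t &= \theta_0 + \theta_1 y_{t-1} + \varepsilon_t
    && \text{AR}(1) \\
y_t &= \theta_0 + \sum_{i=1}^{P} \theta_i y_{t-i}
       + \varepsilon_t + \sum_{j=1}^{Q} \phi_j \varepsilon_{t-j}
    && \text{ARMA}(P, Q)
\end{align}
```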

Hypothesis Test
The hypothesis for the study is formulated as follows:
$H_0$: Series is not stationary
$H_1$: Series is stationary

Results and Discussion
In formulating the OLS model, descriptive statistics of the data (Table 1) were computed using R statistical software version 3.6.1 to characterise the data (see Table 2). The data size is 60, and the maximum and minimum GSE values are 3337.2 and 2113.58 respectively. The corresponding standard deviation (Equation (7)) is 312.14, which indicates that the GSE data points are widely dispersed about the mean. The positive skewness (Equation (8)) (Table 2) implies that the distribution of the data set is skewed to the right: the right tail of the distribution is longer than the left tail, and the bulk of the observations lies on the left side of the distribution curve. Skewness thus provides a measure of the asymmetry of the probability distribution of the GSE data set about its mean. The kurtosis (Equation (9)), which measures the peakedness of the distribution, is 3.75, indicating that the distribution is leptokurtic. Subsequently, using Equations (1), (2) and (3), the linear model for the GSE was developed (Equation (10)).
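The three summary measures discussed above can be computed as in the following sketch. It assumes moment-based skewness and kurtosis (so that a normal distribution has kurtosis 3, consistent with reading 3.75 as leptokurtic) and the n - 1 divisor for the standard deviation; the exact estimators behind Table 2 are not stated, so these choices and the function name `describe` are assumptions.

```python
import math


def describe(values):
    """Return the sample standard deviation, moment skewness and moment kurtosis."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Central moments with divisor n.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "sd": math.sqrt(m2 * n / (n - 1)),  # sample standard deviation
        "skewness": m3 / m2 ** 1.5,         # > 0 means a longer right tail
        "kurtosis": m4 / m2 ** 2,           # > 3 means leptokurtic
    }


# Hypothetical values with one large observation, giving a longer right tail.
summary = describe([1.0, 2.0, 3.0, 4.0, 10.0])
```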

2.549
An analysis of variance (ANOVA) test was then performed to assess the significance of the coefficients of the developed model (Equation (10)) (see Table 3). Figure 1 shows the time series plot of the GSE data, which was used to assess the stationarity of the data. It shows sudden changes in trend, which indicate that the series is not stationary.
The Augmented Dickey-Fuller (ADF) stationarity test was then performed on the data. The test gave a p-value of 0.99, which is greater than the α = 5% level of significance. Hence, the null hypothesis that the series is not stationary cannot be rejected.
Graphical plots of the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) were further examined to confirm the nonstationarity of the data (Figure 2 and Figure 3). The ACF plot in Figure 2 shows a sine-wave pattern with strong, slowly decaying spikes, which confirms that the series is not stationary. The PACF in Figure 3 has one significant lag with the rest decaying, which is also an indication of nonstationarity. Since Figure 2 and Figure 3 show that the GSE data is not stationary, it was differenced once (see Figure 4). The first-differenced GSE series in Figure 4 does not appear to be stationary because an upward movement is still present. An ADF test was therefore performed to confirm this.
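The ACF values read off such plots can be computed as in the sketch below. It uses the usual sample autocorrelation estimator and the approximate 95% white-noise bounds of ±1.96/√n; the function names and the example series are hypothetical and are not the GSE data.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series)
    acf = []
    for k in range(1, max_lag + 1):
        ck = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
        acf.append(ck / c0)
    return acf


def white_noise_bound(n):
    """Approximate 95% significance bound for the ACF of white noise."""
    return 1.96 / math.sqrt(n)


# A steadily trending (nonstationary) series of 60 points: its ACF starts
# close to 1 and decays slowly, far outside the significance bounds.
trend = [float(t) for t in range(1, 61)]
r = sample_acf(trend, 5)
bound = white_noise_bound(len(trend))
```

A slowly decaying ACF of this kind is the visual signature of nonstationarity that motivates differencing.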
The ADF test shows that the first-differenced data was not stationary, since the p-value of 0.6235 was still greater than the α = 0.05 significance level. Therefore, the first-differenced data was differenced again (see Figure 5). Figure 5 shows the time series plot of the second-differenced GSE data, which appears to be stationary: there is no upward trend as the years progress and the amplitude of the variations is roughly constant. Figure 6 and Figure 7 show the ACF and PACF plots for the second-differenced GSE data. In the ACF plot, the autocorrelation at lag 1 exceeds the significance bounds, but all other autocorrelations lie within them. The PACF, on the other hand, shows that the partial autocorrelations at lags 1, 2 and 5 exceed the significance bounds and decrease slowly in magnitude as the lag increases. From these plots, the MA and AR terms were identified. Since the ACF plot (Figure 6) of the second-differenced series cuts off after the first lag, MA (1) was assumed, which resulted in IMA (2, 1). The PACF plot (Figure 7) of the second-differenced series, on the other hand, tails off after lag 2 and cuts off after lag 5. As a result, MA (2) and AR (5) were considered. Consequently, the mixed models ARIMA (5, 2, 1) and ARIMA (5, 2, 2) were formed by combining the AR and MA terms.
The ADF test shows that the second differenced data is stationary since it has a p-value of 0.01 which is less than α = 0.05 significance level and that confirms the claim of a stationary time series. Consequently, an ARIMA (p, 2, q) model is probably appropriate for the GSE data.
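The differencing step that gives d = 2 can be sketched as follows. The function name `difference` and the example series are hypothetical; the example uses a quadratic trend because its second difference is constant, which shows how two rounds of differencing remove a trend that one round does not.

```python
def difference(series, order=1):
    """Apply first differencing `order` times; each pass shortens the series by one."""
    out = list(series)
    for _ in range(order):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


# Hypothetical series y_t = t^2 for t = 0..5.
quadratic = [t * t for t in range(6)]
first = difference(quadratic, order=1)   # still trending upward
second = difference(quadratic, order=2)  # constant, no remaining trend
```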
After the model identification, the Bayesian Information Criterion (BIC) and the coefficient of determination, $R^2$, were used to select a reliable model. Table 4 summarises the BIC and $R^2$ values of the candidate ARIMA models. The $R^2$ is a goodness-of-fit measure of prediction accuracy. From Table 4, the model with the smallest BIC value, 704.5556, is ARIMA (0, 2, 1), which has an $R^2$ of 0.9010; it was selected as the model that best fits the GSE data. The autoregressive order p is the lag value after which the PACF plot crosses the upper confidence interval for the first time. The PACF plot (Figure 7) did not cross the upper confidence interval at any lag value. As a result, the p value was 0, and the integrated value was 2 since the GSE data was differenced twice. The moving average order q, on the other hand, was obtained from the ACF plot: it is the lag value after which the ACF plot crosses the confidence interval for the first time. From Figure 6, it can be seen clearly that after lag 1 the ACF graph crosses the lower confidence interval for the first time. Consequently, the q value was 1.
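Selection by BIC can be illustrated with the standard definition BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations and L the maximised likelihood. The log-likelihood values and parameter counts below are hypothetical placeholders, not the figures behind Table 4, and counting the innovation variance as a parameter is an assumption; statistical packages differ on such details.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; smaller values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_by_bic(candidates, n_obs):
    """Pick the candidate with the smallest BIC.

    `candidates` maps a model label to (log_likelihood, n_params).
    """
    scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fits on 58 observations (60 data points differenced twice).
candidates = {
    "ARIMA(0,2,1)": (-100.0, 2),  # one MA coefficient plus the variance
    "ARIMA(5,2,1)": (-98.5, 7),   # five AR, one MA, plus the variance
}
best, scores = select_by_bic(candidates, n_obs=58)
```

In this illustration the larger model fits slightly better, but the ln(n) penalty on its five extra parameters outweighs the gain, so the smaller model is selected.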
The ARIMA (0, 2, 1) model explained about 90% of the total variation in the composite index data set. Figure 8 shows the ACF of the residuals of the model fitted to the second-differenced GSE data. In the plot, almost all the lags lie within the significance bounds, which indicates that there is no autocorrelation in the residuals. This suggests that the information in the second-differenced GSE data used for the modelling has been accounted for by the model.
The developed ARIMA (0, 2, 1) model was used to produce a six-month monthly forecast of the GSE. Table 5 shows the forecasted GSE values for the next six months. From Table 5, it can be deduced that the forecasted values show a significant increase from March 2018 to August 2018. This can additionally be confirmed from Figure 9, which presents a graphical illustration of the forecasted values. In Figure 9, the six-month forecast is shown as a blue line. The dark ash blue shaded area shows 80% to 100% prediction intervals.
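The shape of such a forecast follows from the model itself. The sketch below computes point forecasts for an ARIMA (0, 2, 1) process written as y_t = 2y_{t-1} - y_{t-2} + e_t + theta * e_{t-1}, assuming no constant term and a plus sign in front of the MA coefficient. The function name and all input values are hypothetical; they are not the fitted coefficient or the GSE observations.

```python
def forecast_arima_021(y_prev, y_last, last_residual, theta, horizon):
    """Point forecasts for an ARIMA(0, 2, 1) model without a constant.

    y_prev, y_last: the last two observations, in time order.
    last_residual:  the final in-sample one-step-ahead error.
    theta:          the MA(1) coefficient (plus-sign convention assumed).
    """
    history = [y_prev, y_last]
    forecasts = []
    for h in range(1, horizon + 1):
        # Future shocks have expected value zero, so the MA term only
        # contributes to the first step ahead.
        ma_term = theta * last_residual if h == 1 else 0.0
        nxt = 2.0 * history[-1] - history[-2] + ma_term
        forecasts.append(nxt)
        history.append(nxt)
    return forecasts


# Hypothetical inputs.
path = forecast_arima_021(100.0, 110.0, last_residual=4.0, theta=-0.5, horizon=3)
```

After the first step the forecasts lie on a straight line, here rising by 8 per period, which is why a model with d = 2 extrapolates the most recent local trend.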

Conclusions and Recommendation
In this paper, an ARIMA (0, 2, 1) model has been developed from the observed GSE monthly market report data over a period of five consecutive years to predict future stock exchange prices or returns. In developing the ARIMA (0, 2, 1) model, the nonstationarity that existed in the GSE sample data, and that could have led to wrong statistical inferences, was resolved by differencing the data twice. A confirmatory test of the stationarity of the GSE data was also carried out using the widely known Augmented Dickey-Fuller (ADF) test. A diagnostic check was performed using the ACF plot of the residuals for the second-differenced GSE data to ensure that there is no autocorrelation in the residuals. This suggests that all the information in the second-differenced GSE data was used for the model development.
ACF and PACF plots were used to identify the appropriate ARIMA model. After the model identification, the Bayesian Information Criterion (BIC) and the coefficient of determination, $R^2$, were used to select a reliable model. The $R^2$ of the developed ARIMA model shows that it explained about 90% of the total variation in the composite index. The developed ARIMA (0, 2, 1) model was used to forecast a period of six months, and the trend of the forecasted values showed a significant increase in the GSE. In conclusion, the ARIMA (0, 2, 1) is a good model that companies and investors can rely upon to predict future stock prices or returns.