Can Machine Learning Unlock the Continuous Alpha? Empirical Study Based on China A-Share Market


With the development of fintech and artificial intelligence, machine learning algorithms are widely used in quantitative investment. Based on the listed companies in China A-share market from February 2005 to July 2020, quantitative stock selection models with machine learning algorithms are established to obtain continuous alpha returns. The results show that machine learning algorithms can effectively identify the relationship between factors and returns and then improve the performance of the quantitative stock selection model. China A-share market is a weak-form efficient market. By mining the factors that are not fully digested by the market, continuous alpha returns can be obtained. The ensemble algorithms represented by the extremely randomized tree (ET) and light gradient boosting machine (LGBM) perform best in stock market prediction.

Share and Cite:

Lin, Y. and Ye, R. (2021) Can Machine Learning Unlock the Continuous Alpha? Empirical Study Based on China A-Share Market. Open Journal of Business and Management, 9, 2358-2369. doi: 10.4236/ojbm.2021.95127.

1. Introduction

The Efficient Markets Hypothesis (EMH) is a theoretical cornerstone of modern financial economics (Zhang et al., 2016). Malkiel & Fama (1970) systematically elaborated the EMH and divided markets into three types according to the availability of information: strong-form, semi-strong-form and weak-form. Under the strong-form EMH, all assets are efficiently priced to reflect their real value. No valuable information remains unexploited in the market, so there is no possibility of obtaining continuous alpha returns. The efficiency of China’s capital market has been empirically tested from many angles. For example, Zhang & Zhang (2005) found, based on logarithmic futures price series, that the weak-form EMH could not be rejected for China’s futures market. Huang et al. (2008) pointed out that the market can achieve Pareto optimality in advance when the two information conditions of strong symmetry and strong perfection are both satisfied, but these strict conditions are often not satisfied in practice. Based on an empirical analysis of panel data, Wang & Su (2013) showed that the internal capital market of China’s listed companies was efficient. From the perspective of behavioral finance, Zhang (2015) and Ding et al. (2017) argued that the “irrationality” of investors was particularly prominent in China, which shook the theoretical premise of the EMH. Moreover, the weak-form EMH is supported by the long-term performance of capital markets: most portfolios constructed by professional managers have failed to outperform the market index over the long term.

The views on EMH have evolved into two investment philosophies. The first investment philosophy, based on the strong-form EMH, is that the market is always correct. Specifically, the information about the asset value is fully reflected in the prices, because any temporary mispricing will be quickly eliminated by the invisible hand of the market. Thus, no one can obtain continuous alpha returns. The second investment philosophy, based on the weak-form EMH, is that the market can be beaten. The “smart money” can detect mispricing and gain alpha until markets reach equilibrium. At present, China’s capital market is still incomplete. There are opportunities for quantitative investment models to obtain alpha returns due to the deviation in asset pricing and the lack of consistency between markets.

A quantitative investment model is an innovative form of financial technology that combines computer technology with securities price prediction. It can improve asset management efficiency and investment performance. Thanks to the long history of their capital markets and a favorable atmosphere for fintech innovation, quantitative investment has become quite mature in several developed countries. It has also become a hot topic in China. As of February 2021, there were 567 quantitative public funds in China, among which China and Europe Quant Drive Hybrid (001980), Shanghai Investment Morgan Alpha Hybrid (377010), Guangfa Contrarian Strategy Hybrid (000747) and Invesco Great Wall Quantitative Selected Stock (000978) have all performed well. At present, quantitative investment can be divided into two types: technical and fundamental. The former pays more attention to the real-time market, uses computer technology to find arbitrage opportunities, and relies on advances in computing power to compete on speed at the tick level. The latter focuses more on underlying value, looking for undetected mispricing and profiting from it.

In China, against the macro background of revitalizing the real economy, guiding capital from the virtual economy to the real economy and promoting financial services for the real economy, the fundamental quantitative model has attracted more and more attention, and the factor model is the most widely used form. The factor model is built on modern financial frameworks, including the capital asset pricing model (CAPM), arbitrage pricing theory (APT), and so on (Wang, 2016). Factors are widely recognized by market participants as a way to explain and predict stock returns and risks. Finding effective factors is crucial to the performance of the factor model (Wang, 2017). Under Markowitz’s assumptions, CAPM holds that stock returns have a linear relationship only with systematic risk (Sharpe, 1964). However, it is hard to achieve complete efficiency in a real market filled with asymmetric information, so the stock price can still be explained by factors other than systematic risk. To take advantage of this, many multi-factor models were established, such as the Fama-French three-factor model and the five-factor model. On this basis, empirical tests of the effectiveness of factors in China’s market were conducted. For example, Li et al. (2017) empirically tested the Fama-French five-factor model in China’s stock market, and the results showed that the five-factor model outperformed CAPM, the three-factor model and the Carhart four-factor model. From the three aspects of safety, cheapness and quality, Hu & Gu (2018) selected 8 abnormal factors as comprehensive indicators and tested the applicability of Buffett’s alpha strategy in China A-share market. However, with the advent of the big data era, these traditional models are unable to digest massive and dynamic information. As a consequence, machine learning algorithms have been introduced into the quantitative investment model.

The machine learning algorithm is a data mining technology based on artificial intelligence, which has been widely used in finance, economics, psychology, biomedicine, and so on. It can not only learn the complex logical relationships behind the data but also improve its performance through repeated training. For foreign markets, research on quantitative investment and machine learning algorithms is abundant. For example, Nair et al. (2010) used decision tree (DT), neural network (NN) and naive Bayes classifiers to identify the upward and downward trends of stock prices and compared the performance of different investment strategies. Kourentzes et al. (2014) found that the integrated (ensemble) NN model was superior to the single model, and that integrated learning could improve the accuracy and robustness of prediction. Choudhry & Garg (2008) developed a hybrid machine learning system combining a genetic algorithm and a support vector machine (SVM) to predict stock prices. For China, Chen & Yu (2014) used a heuristic algorithm to extract data features and then constructed a quantitative stock selection model based on SVM, whose annualized return was significantly better than the benchmark over the same period. Yu et al. (2015) established a grey NN model based on the Shanghai securities composite index and introduced the E-GARCH model to predict individual stock returns. Based on the convolutional neural network (CNN) and the long short-term memory neural network (LSTM), Sun & Bi (2018) constructed a binary classification model of securities ups and downs, which showed strong profitability and generalization ability. Li et al. (2019) compared the performance of more than 10 machine learning algorithms in stock price prediction, including Lasso regression, gradient boosting tree and integrated NN.

In this paper, quantitative stock selection models with machine learning algorithms are established based on the listed companies in China A-share market. The contributions of this paper are as follows. First, it enriches the empirical research on alpha returns in China A-share market and provides empirical support for the weak-form EMH. Second, it combines machine learning algorithms with the classical multi-factor model in quantitative stock selection, which improves the efficiency with which factor information is used and the performance of the investment model. Third, it compares the performance of 16 machine learning algorithms in quantitative stock selection models, which enriches academic research in this new interdisciplinary field.

2. Research Design

2.1. Model Design

The framework of the quantitative stock selection model with a machine learning algorithm is shown in Figure 1.

As shown in Figure 1, the task of the machine learning module is to obtain the asset return prediction function with good generalization ability based on the train sets. First, assume that

$$R_i = f(x_i; \theta) + \varepsilon, \tag{1}$$

where $R_i$ represents the stock return of the $i$-th company, $f(\cdot)$ represents the asset return prediction function, $x_i = (x_{i1}, \ldots, x_{ik})$ represents the factor vector of the $i$-th company, $\theta$ is the parameter, and $\varepsilon$ is the error term.

Figure 1. The framework of the quantitative stock selection model with a machine learning algorithm.

Then, based on the asset return prediction function $f(\cdot)$ trained in the machine learning module, the next-period return of each stock in the current stock pool is predicted. To avoid using future information, the factor data all lag the stock return data by at least one period. That is, the required factor data have already been realized and are available for the prediction task. According to the prediction results, the model buys or holds the stocks in the top 1% of predicted returns, sells a stock if its ranking falls outside this range, and builds an equal-weight portfolio. Because of the thresholds and restrictions on short selling in China A-share market, only long positions are considered here.
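The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_top_percent`, the stock codes and the predicted returns are hypothetical, and the sketch assumes that at least one stock is always held. Rebuilding the portfolio from the current ranking in each round implicitly sells any stock that has dropped out of the top 1%.

```python
def select_top_percent(predicted, percent=1):
    """Return an equal-weight, long-only portfolio of the top `percent`% of stocks.

    `predicted` maps a stock code to its predicted next-period return.
    """
    if not predicted:
        return {}
    # Integer arithmetic avoids floating-point surprises; hold at least one stock (assumption).
    n_hold = max(1, len(predicted) * percent // 100)
    # Rank stock codes by predicted return, highest first.
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    weight = 1.0 / n_hold
    return {code: weight for code in ranked[:n_hold]}


# Hypothetical predictions for a pool of 300 stocks.
predicted = {f"S{i:03d}": i / 1000.0 for i in range(300)}
portfolio = select_top_percent(predicted)
print(portfolio)
```

With a pool of 300 stocks the rule keeps three names, each with a weight of one third.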

Finally, the alpha return of the model is calculated by the following formula

$$\alpha = R_a - \left( R_f + \beta \left( R_M - R_f \right) \right), \tag{2}$$

where $R_a$ represents the monthly return rate, $R_M$ represents the monthly return rate of the market (benchmark), $R_f$ represents the risk-free return rate (monthly compound interest calculation), and $\beta$ represents the sensitivity of the model return to the market return fluctuation. The calculation formula is as follows

$$\beta = \frac{\mathrm{Cov}(R_a, R_M)}{\mathrm{Var}(R_M)}, \tag{3}$$

where $\mathrm{Cov}(\cdot)$ stands for the covariance and $\mathrm{Var}(\cdot)$ stands for the variance.

If $\alpha > 0$, then the model performance is better than the benchmark performance. If $\alpha = 0$, then the model performance is comparable to the benchmark performance. If $\alpha < 0$, then the model performance is worse than the benchmark performance.
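As a worked illustration of Equations (2) and (3), the following sketch computes beta and alpha from two monthly return series. It rests on assumptions that the paper does not spell out: alpha is evaluated at the mean monthly returns, population moments are used for the covariance and variance, and the return series and risk-free rate are made-up numbers. The function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def beta(r_a, r_m):
    """Equation (3): Cov(R_a, R_M) / Var(R_M), using population moments (assumption)."""
    ma, mm = mean(r_a), mean(r_m)
    cov = sum((a - ma) * (m - mm) for a, m in zip(r_a, r_m)) / len(r_m)
    var = sum((m - mm) ** 2 for m in r_m) / len(r_m)
    return cov / var


def alpha(r_a, r_m, r_f):
    """Equation (2), evaluated at the mean monthly returns (assumption)."""
    b = beta(r_a, r_m)
    return mean(r_a) - (r_f + b * (mean(r_m) - r_f))


# Made-up monthly returns: the portfolio moves twice as much as the market
# and earns an extra 0.5% every month.
r_m = [0.01, -0.02, 0.03, 0.00]
r_a = [2 * m + 0.005 for m in r_m]
r_f = 0.002  # hypothetical monthly risk-free rate

print(beta(r_a, r_m), alpha(r_a, r_m, r_f))
```

In this constructed example beta is 2 and alpha is positive, so by the rule above the model would be judged to beat the benchmark.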

2.2. Dynamic Time Window

At the beginning of each investment round, the train set is built based on the realized factors and returns. Let the interval of the population sample be $[1, T]$, and let the time window of all train sets be $w$. Then, the interval of the $n$-th train set is $[n, n + w]$. With $w$ set to 12 months, the design of the dynamic time window is shown in Figure 2.
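The rolling scheme can be sketched as a small generator. The indexing below is one possible reading of the interval notation, not a statement of the authors' exact convention: it assumes that each train set covers $w$ monthly observations, that the month immediately after the window is the one being predicted, and that the sample contains 186 months (a count of the months from February 2005 to July 2020 inclusive). The function name `rolling_windows` is hypothetical.

```python
def rolling_windows(total_months, window):
    """Yield (train_months, test_month) pairs, with months numbered from 1."""
    for n in range(1, total_months - window + 1):
        train = list(range(n, n + window))  # months n, ..., n + window - 1
        yield train, n + window             # the month to be predicted


rounds = list(rolling_windows(total_months=186, window=12))
print(len(rounds))   # number of investment rounds
print(rounds[0])     # first train set and its prediction month
print(rounds[-1])    # last train set and its prediction month
```

Under these assumptions a 12-month window yields 174 rounds, and the window slides forward one month at a time so that each prediction uses only the most recent year of realized data.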

2.3. Machine Learning Algorithm

Figure 2. Design of dynamic time window.

Machine learning comprises many forms of prediction functions and algorithms. In this paper, 16 representative algorithms are selected. Eight are linear algorithms: ordinary least squares (OLS) regression, partial least squares (PLS) regression, Ridge, Bayesian Ridge, Lasso, LassoLars, Elastic Net, and linear support vector regression (LSVR). Three are single machine learning algorithms: support vector regression (SVR), decision tree (DT) and gradient boosting decision tree (GBDT). The remaining five are integrated machine learning algorithms: random forest (RF), adaptive boosting (AdaBoost), extremely randomized tree (ET), extreme gradient boosting (XGBoost), and light gradient boosting machine (LGBM). The algorithms are implemented with the “sklearn”, “xgboost” and “lightgbm” packages in Python, and the “GridSearch” method is used to tune the hyperparameters during training.
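To make the idea of hyperparameter tuning by grid search concrete, the sketch below tries every value in a small grid and keeps the one with the lowest validation error. It is a deliberately simplified stand-in that uses only the Python standard library: a one-factor ridge regression without intercept replaces the 16 algorithms, a single chronological hold-out replaces whatever validation scheme the authors used, and the data, grid and function names are all hypothetical.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope of a no-intercept ridge regression on one factor."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, slope):
    """Mean squared prediction error of y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, grid):
    """Return (best_lam, best_mse): the penalty with the lowest validation error."""
    best = None
    for lam in grid:
        slope = fit_ridge_1d(train[0], train[1], lam)
        score = mse(valid[0], valid[1], slope)
        if best is None or score < best[1]:
            best = (lam, score)
    return best


# Synthetic factor/return pairs (hypothetical data, fixed seed for repeatability).
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(120)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

train = (xs[:80], ys[:80])
valid = (xs[80:], ys[80:])  # later observations only, so no shuffling across time
grid = [0.0, 0.1, 1.0, 10.0, 100.0]

best_lam, best_mse = grid_search(train, valid, grid)
print(best_lam, best_mse)
```

The same loop structure applies to richer models: only the fitting function and the grid of candidate hyperparameters change.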

2.4. Data Source and Sample Selection

The sample of this paper consists of the listed companies in China A-share market from February 2005 to July 2020, a period that covers several economic cycles, which enhances the empirical robustness. The portfolio is rebalanced monthly. The key variable is the monthly stock return with cash dividends reinvested. The benchmark of the model is the Shanghai Securities Composite Index (000001). The sample data are obtained from the CSMAR database. To increase the reliability and accuracy of the empirical results, the samples are filtered and treated as follows at the starting point of each round.

1) Exclude ST, *ST and stocks listed for less than one year.

2) Eliminate the stocks whose data are largely missing because of prolonged trading suspension.

3) If the factor value of a stock is still missing, it will be filled with 0.

4) Z-score standardization of data. Because the differences in dimensions of each factor would increase the complexity of the algorithm and affect the performance of the model, z-score standardization is performed.
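Steps 3) and 4) can be illustrated for a single factor in a single round. The sketch below assumes that the steps are applied in the listed order (fill first, then standardize), that the standardization is done across stocks within the round, and that the population standard deviation is used. None of these details is stated explicitly in the text, and the function name and input values are hypothetical.

```python
from statistics import mean, pstdev


def clean_factor(values):
    """Fill missing values (None) with 0, then z-score the cross-section."""
    filled = [0.0 if v is None else float(v) for v in values]
    mu = mean(filled)
    sigma = pstdev(filled)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant factor carries no cross-sectional information.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


# Hypothetical factor values for four stocks; the second one is missing.
raw = [12.0, None, 8.0, 20.0]
z = clean_factor(raw)
print(z)
```

One consequence of this ordering is that a filled value is not pulled to the cross-sectional mean; standardizing first and filling with 0 afterwards would instead treat a missing value as average.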

Selecting effective factors is fundamental to enhance the model’s information capture ability and improve investment performance. However, many research reports of financial securities companies are based solely on data or on models. In this paper, the construction of the factor pool starts from a prudent literature analysis. On this basis, we consider each company from four aspects and divide the factors into four types: transaction friction, profitability, valuation and growth. The selected factors are shown in Table 1.

3. Empirical Results and Analysis

3.1. Portfolio Performance Analysis

The model is built based on the factor pool and the dynamic time window design. If the time window is $w = 12$ months, then the number of rounds is $N = 174$. According to the data cleaning rules in the previous section, all samples of China A-share market companies are processed at the starting point of each round. The sample number entered into the model is shown in Figure 3.

Table 1. Transaction friction, profitability, valuation, and growth factors.

a. Refer to Hu & Gu, 2018; Zhang et al., 2014; Amihud, 2002; Goyenko et al., 2009; Jiang et al., 2018; Pastor & Stambaugh, 2003; Roll, 1984.
b. Refer to Wu & Wu, 2003; Yang & Huang, 2005; Yang et al., 2020; Zhang et al., 2020; Zhao, 1998; Chan & Jegadeesh, 1996.
c. Refer to Hu & Gu, 2018; Jiang et al., 2018; Loughran & Wellman, 2011.
d. Refer to Li & Liao, 2007; Zhang et al., 2020.

From Figure 3, the number of effective samples in China A-share market shows an overall upward trend, which is related to the number of listed companies, the proportion of ST and *ST companies, and the data quality of company factors. Considering this dynamic feature, this paper buys or holds the stocks in the top 1% of predicted returns in each round of portfolio construction. When all rounds are completed, the overall performance of the quantitative stock selection model is analyzed. The returns and risks of the investment portfolios are shown in Table 2.

Figure 3. Dynamic change of sample number in China A-share market.

Table 2. Performances of quantitative stock selection models based on 16 algorithms.

From Table 2, the quantitative stock selection models based on machine learning algorithms all obtain positive alpha returns during the research period of nearly 16 years. Over the same period, the average and median monthly returns of the benchmark are 0.89% and 0.72% respectively. All algorithms perform better than the benchmark.

The portfolios derived from different algorithms perform differently. Firstly, in ascending order of overall performance, the algorithms rank as OLS, other linear algorithms, single machine learning algorithms, and integrated machine learning algorithms. Secondly, ET achieves the highest alpha of all algorithms and ranks in the top five in Sharpe ratio and win rate, showing that ET has a strong ability to predict security prices and can still select high-quality stocks when the market is volatile or falling. Thirdly, LGBM achieves the highest Sharpe ratio of all the algorithms, ranks second in both alpha and win rate, and performs well and steadily on all indicators. Finally, PLS achieves the highest win rate of all the algorithms but performs only moderately on the other two indicators.

Table 3. Quantitative stock selection model performance under different dynamic time windows.
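The Sharpe ratio and win rate used in the comparison above are not defined in the text, so the sketch below shows one common way to compute them from monthly returns. It assumes that the Sharpe ratio is the mean excess return over the risk-free rate divided by its sample standard deviation, without annualization, and that the win rate is the share of months in which the portfolio return exceeds the benchmark return. The authors may have used other conventions, and all numbers and function names here are hypothetical.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, r_f):
    """Mean excess return divided by its sample standard deviation (not annualized)."""
    excess = [r - r_f for r in returns]
    return mean(excess) / stdev(excess)


def win_rate(returns, benchmark):
    """Share of months in which the portfolio beats the benchmark (assumed definition)."""
    wins = sum(1 for r, b in zip(returns, benchmark) if r > b)
    return wins / len(returns)


# Made-up monthly returns for a portfolio and its benchmark.
portfolio_returns = [0.03, -0.01, 0.02, 0.04]
benchmark_returns = [0.01, 0.00, 0.01, 0.05]
r_f = 0.002  # hypothetical monthly risk-free rate

print(sharpe_ratio(portfolio_returns, r_f))
print(win_rate(portfolio_returns, benchmark_returns))
```

Because the two indicators capture different things (risk-adjusted return versus consistency), an algorithm can lead on one and lag on the other, as PLS does in the results above.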

3.2. The Influence of Time Window Selection on Model Performance

The selection of the dynamic time window $w$ is one of the factors affecting model performance. In the previous section, a 12-month window was used. This section examines the performance of the model under different dynamic time windows.

From Table 3, the relationship between the dynamic time window and alpha presents an inverted U shape. Alpha exceeds 0.04 for both the 6-month and the 12-month windows, and the out-of-sample generalization is best with the 12-month window.

4. Conclusion and Enlightenment

The value of an investment is derived from the present value of all the cash flows generated over the life of the investment, so an accurate judgment of the future value of the asset is the key to achieving excess investment returns. Both scholars and market investors have tried to build models with strong prediction and generalization ability. However, because financial data are nonlinear and noisy, the prediction performance of traditional statistical models is inadequate. In this paper, 16 algorithms, including machine learning algorithms, are used to predict the returns of A-share listed companies, and investment portfolios are constructed according to the forecasts. The results show that: 1) The models based on machine learning algorithms perform better than the other models. This is mainly reflected in their strong out-of-sample generalization ability, which allows the portfolio return obtained by the quantitative stock selection model based on machine learning algorithms to far exceed the market benchmark. 2) China A-share market follows the weak-form EMH. By mining factor information that has not been fully digested by the market, it is still possible to achieve continuous alpha in China A-share market. 3) The integrated machine learning algorithms represented by ET and LGBM perform well in stock return prediction. The comparison of 16 algorithms in quantitative stock selection shows that integrated machine learning algorithms have significant advantages in analyzing nonlinear and high-noise data and have strong out-of-sample generalization ability. To further promote the intelligent development of quantitative models and big data analysis, and to improve the efficiency and accuracy of data mining, this paper puts forward the following enlightenments based on the above conclusions.

Firstly, machine learning algorithms should be applied to quantitative research in finance, economics and management. Machine learning algorithms can effectively digest and utilize high-frequency and high-noise data, and have better explanatory power for nonlinear or chaotic data relationships. Because they are implemented as programs, machine learning algorithms are easy to operate and to generalize. In addition, the fundamental principle of machine learning, “there is no free lunch”, reminds us that different algorithms should be applied to different problems. Some machine learning algorithms suffer from poor interpretability and behave as information black boxes, so more empirical studies are needed to test and analyze their specific application scenarios.

Secondly, optimize the system design and regulatory mechanism of China A-share market. Although the introduction of securities margin trading and stock index futures ended the lack of short selling mechanism in China A-share market, there are still many restrictions on short-selling operations because of the relatively late start, imperfect mechanism and irrational investors. The high cost of short selling also limits the scope for “smart money” in capital markets, hindering the process of achieving equilibrium and efficiency. Regulatory authorities need to reasonably optimize the institutional design and regulatory mechanism of the capital market in the fintech era based on strengthening risk education for investors and improving information disclosure in the capital market.

Thirdly, institutional and individual investors should find and construct an effective factor pool to give full play to the advantages of the quantitative stock selection model. The empirical study shows that the quantitative model based on fundamental factors can still obtain continuous alpha returns in China A-share market. In addition, stock selection by quantitative models can effectively reduce the degree to which market participants are affected by factors such as cognitive bias and herd behavior, so it is suggested that quantitative models be considered when making investment decisions.


This paper is supported by the Humanities and Social Science Projects of the Ministry of Education (19YJA910006), the NSF of Zhejiang Province (LY20A010019) and the Fundamental Research Funds for the Provincial Universities of Zhejiang (GK199900299012-204).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.


References

[1] Amihud, Y. (2002). Illiquidity and Stock Returns: Cross-Section and Time-Series Effects. Journal of Financial Markets, 5, 31-56.
[2] Chan, L., & Jegadeesh, N. (1996). Momentum Strategies. Journal of Finance, 51, 1681-1713.
[3] Chen, R. D., & Yu, H. H. (2014). Stock Selection Model Based on Support Vector Machine within Heuristic Algorithm. Systems Engineering, 2, 40-48.
[4] Choudhry, R., & Garg, K. (2008). A Hybrid Machine Learning System for Stock Market Forecasting. World Academy of Science, Engineering and Technology, 39, 315-318.
[5] Ding, Z. G., Jin, B., & Xu, D. C. (2017). Test of Efficient Market: Criticism of Behavioral Finance to EMH. Contemporary Economic Research, 3, 51-59.
[6] Goyenko, R. Y., Holden, C. W., & Trzcinka, C. A. (2009). Do Liquidity Measures Measure Liquidity? Journal of Financial Economics, 92, 153-181.
[7] Hu, Y., & Gu, M. (2018). Buffett’s Alpha: Evidence from China Stock Market. Management World, 8, 41-54, 191.
[8] Huang, Z. X., Zeng, L. H., Jiang, Q., & Duan, Z. D. (2008). Information Revelation and Capital Market Efficiency: Information Efficiency and Allocation Efficiency. China Economic Quarterly, 2, 665-684.
[9] Jiang, F. W., Qi, X. L., & Tang, G. H. (2018). Q-Theory, Mispricing, and Profitability Premium: Evidence from China. Journal of Banking & Finance, 87, 135-149.
[10] Kourentzes, N., Barrow, D. K., & Crone, S. F. (2014). Neural Network Ensemble Operators for Time Series Forecasting. Expert Systems with Applications, 41, 4235-4244.
[11] Li, B., Shao, X. Y., & Li, Y. Y. (2019). Research on Machine Learning Driven Quantitative Investing. China Industrial Economics, 8, 61-79.
[12] Li, J., & Liao, H. (2007). The Investment Tactics of the P/E Ratio: Appraise and Test. Business Management Journal, 6, 73-79.
[13] Li, Z. B., Yang, G. Y., Feng, Y. C., & Jing, L. (2017). Fama-French Five-Factor Model in China Stock Market. Journal of Financial Research, 6, 191-206.
[14] Loughran, T., & Wellman, J. W. (2011). New Evidence on the Relation between the Enterprise Multiple and Average Stock Returns. Social Science Electronic Publishing, 46, 1629-1650.
[15] Malkiel, B. G., & Fama, E. F. (1970). Efficient Capital Markets: A Review of Theory and Empirical Work. The Journal of Finance, 25, 383-417.
[16] Nair, B. B., Mohandas, V. P., & Sakthivel, N. R. (2010). A Decision Tree-Rough Set Hybrid System for Stock Market Trend Prediction. International Journal of Computer Applications, 6, 1-6.
[17] Pastor, L., & Stambaugh, R. F. (2003). Liquidity Risk and Expected Stock Returns. Journal of Political Economy, 111, 642-685.
[18] Roll, R. (1984). A Simple Implicit Measure of the Effective Bid-Ask Spread in an Efficient Market. Journal of Finance, 39, 1127-1139.
[19] Sharpe, W. F. (1964). Capital Asset Prices: A Theory of Market Equilibrium under Conditions of Risk. Journal of Finance, 19, 425-442.
[20] Sun, D. C., & Bi, X. C. (2018). High-Frequency Trading Strategies Based on Deep Learning Algorithms and Their Profitability. Journal of University of Science and Technology of China, 11, 923-932.
[21] Wang, F. J., & Su, L. Z. (2013). Is Internal Capital Market Efficient in Chinese Listed Companies? Empirical Evidences from Multiple Divisions Listed Companies in H-Stock. Accounting Research, 1, 70-75, 96.
[22] Wang, R. (2016). Research on Multiple-Factor Quantitative Stock Selection in A-Share Market. Master’s Thesis, Shanxi University of Finance & Economics.
[23] Wang, Y. (2017). Study on Multi-Factor Model Based on Grey Correlation Analysis Method. Master’s Thesis, Beijing Jiaotong University.
[24] Wu, S. N., & Wu, C. P. (2003). An Empirical Study on Price Inertia Strategy and Earnings Inertia Strategy in China’s Stock Market. Economic Science, 4, 41-50.
[25] Yang, S. E., & Huang, L. (2005). Financial Crisis Warning Model Based on BP Neural Network. Systems Engineering-Theory & Practice, 1, 12-18, 26.
[26] Yang, W., Feng, L., Song, M., & Li, C. T. (2020). Can the Reference Point Ratio Measure Stock Price Overvaluation? Evidence from Stock Crash Risk. Management World, 1, 167-186, 241.
[27] Yu, Z. J., Yang, S. L., Zhang, Z., & Jiao, J. (2015). Stock Returns Prediction Based on Error-Correction Grey Neural Network. Chinese Journal of Management Science, 12, 20-26.
[28] Zhang, L., Deng, L. Y., & Zhou, Y. (2016). Contrarian Effect of Semi-Parametric Alpha Strategy. Chinese Journal of Management Science, 12, 30-38.
[29] Zhang, N., Shi, H. W., Zheng, L., Shan, Z. H., & Wu, H. X. (2020). PCANet-Based Multi-Factor Stock Selection Model for Value Growth. Computer Science, S2, 64-67.
[30] Zhang, X. Y., & Zhang, Z. C. (2005). Empirical Tests on Efficiency of Commodity Futures Markets in China. Chinese Journal of Management Science, 6, 1-5.
[31] Zhang, Y. P. (2015). Are Investors Really Rational: The Challenge of Behavioral Finance to Fama’s EMH. Academics, 1, 116-125.
[32] Zhang, Z., Li, Y. Z., Zhang, Y. L., & Liu, X. (2014). A Test on Indirect Liquidity Measures in China Stock Market: An Empirical Analysis of the Direct and Indirect Measures of Bid-Ask Spread. China Economic Quarterly, 1, 233-262.
[33] Zhao, Y. L. (1998). Information Content of Accounting Earnings Disclosure: Empirical Evidence from Shanghai Stock Market. Economic Research Journal, 7, 42-50.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.