AI Implemented Stock Price Prediction and Stock Selection via Patent Forward Citation: A Study on China Stock Market and Patents
1. Introduction
Stock market prediction has long been an attractive topic because of the important role the stock market plays in the economy. Investment forecasting is difficult due to the intrinsic complexity of the financial system and the extrinsic uncertainty of the macroeconomic environment. Setting the extrinsic parameters aside, the differential model of geometric Brownian motion has been applied to predict future stock prices for years (Agustini, Affianti, & Putri, 2018; Suganthi & Jayalalitha, 2019; Maiti, 2021). To handle the complexity of financial prediction involving both the intrinsic and the extrinsic parameters, the application of Artificial Intelligence (AI) has attracted extensive research attention since the 1990s, and four categories can be identified: portfolio optimization, stock market prediction using AI, financial sentiment analysis, and combinations involving more approaches (Ferreira, Gandomi, & Cardoso, 2021). Many factors could be applied as the input variables for AI to predict the stock market; however, innovation-related factors, e.g. patent indicators, are rarely discussed.
A patent is a typical outcome of innovation, which is an essential driver of economic progress. China has been the country with the largest number of domestic patent applications in the world for many years. The China National Intellectual Property Administration (CNIPA) is now the world’s largest patent office. In 2020, more than three million patents were published and/or granted by CNIPA, including 1517 thousand invention publications, 530 thousand invention grants and 2377 thousand utility model grants. Meanwhile, China is now the world’s second-largest economy with the world’s second-largest stock market in terms of transactions. China-listed companies lead the development of China patents, and unlisted companies and individuals follow their trend.
Based on patent information, Motohashi (2008) examined China’s development of innovation capabilities from 1985 to 2005 by using more than 679 thousand China invention patents. Motohashi (2009) reviewed the substantial trend of Chinese firms catching up with Western counterparts via patent statistics in two high-tech sectors: the pharmaceutical industry and mobile communications technology. He found that these two fields show contrasting trends: rapid catching up can be found in mobile communications technology, while Chinese companies are still lagging behind Western counterparts in the pharmaceutical industry. Hu & Jefferson (2009) used a firm-level dataset that spans the population of China’s large and medium-size industrial enterprises to explore the factors that account for China’s rising patent activity. They found that China’s patent surge is seemingly paradoxical given the weak protection of intellectual property rights in China.
Lei, Zhao, & Zhang et al. (2011) found that the inventive activities of China have experienced three developmental phases and have grown quickly in recent years. The main source of innovation across the three phases has shifted from government to universities and research institutes and then to industry. Liu & Qiu (2016) used Chinese firm-level patent data from 1998 to 2007, a period that featured a drastic input tariff exemption in 2002 because of China’s WTO accession. They found that the input tariff exemption resulted in less innovation undertaken by Chinese firms.
Boeing & Mueller (2019) proposed a patent quality index based on internationally comparable citation data from international search reports (ISR) to compare the foreign, domestic, and self citations. They found that all three citation types may be used as economic indicators if policy distortion is not a concern. They also suggested that the domestic and self citations lead to an upward bias in China and should be employed with caution if they are to be interpreted as a measure of patent quality.
Dang & Motohashi (2015) proposed that China patent statistics are meaningful indicators because the China valid patent count is correlated with R&D input and financial output. Chen & Zhang (2019) studied China’s patent surge and its driving forces using patent applications filed by Chinese firms and found that R&D investment, foreign direct investment, and patent subsidy had different effects on different types of patents. They found that R&D investment had a positive and significant impact on patenting activities for all types of patents; the stimulating effect of foreign direct investment on patent applications is only robust for utility model patents and design patents; and the patent subsidy only has a positive impact on design patents.
He, Tong, & Zhang et al. (2016) found it difficult to integrate Chinese patent data with company data, so they constructed a China patent database of all China-listed companies and their subsidiaries from 1990 to 2010. Chen, Wei, & Che (2018, 2020) used the patent data and stock data of China-listed companies of RMB common stocks (hereinafter, the A-shares) on the Shanghai main board (SH main board) from 2011 to 2017 and found that the patent indicators have a leading effect on the A-share’s stock price.
Chiu, Chen, & Che (2020a, 2020b) focused on the whole China A-shares without distinguishing the stock boards from 2016 Q4 to 2018 Q3. They found that the patent indicators also have leading effect on the financial indicators including the stock price, return-on-asset (ROA), return-on-equity (ROE), book-value-per-share (BPS), earnings-per-share (EPS), price-to-book (PB) and price-to-earnings (PE). The patent prediction equations for quantitatively giving the predictive values of the aforementioned financial indicators are proposed.
The China A-shares are listed on four stock boards: the SH main board, the Shenzhen main board (SZ main board), the Growing-Enterprises board (GE board) and the Small-and-Medium Enterprises board (SME board). The majority of A-shares on the SH main board and the SZ main board are state-owned companies and big companies; most A-shares on the GE board and the SME board are small and medium companies. Chiu, Chen, & Che (2020c, 2020d, 2020e, 2020f, 2021) and Li, Deng, & Che (2020a, 2020b, 2021) further studied the patent leading effect on the various stock boards, proposed each stock board’s patent prediction equations for the stock return rate, ROA, ROE, BPS, EPS, PB and PE via time series regression, and finally proposed patent-based stock selection criteria and showed that the selected stocks outperform the market trend. However, the detailed relationships between the China patent indicators comprised in the patent prediction equations and the aforementioned financial indicators have not been thoroughly discussed.
The high-low fluctuation of the China stock market always far exceeds any trend reflected in the patent indicators. Tsai, Che, & Bai (2021a, 2021b, 2021c, 2021d, 2021e, 2021f, 2022a, 2022b, 2022c) systematically discussed the relationship between China patents and China A-share’s stock performance. The A-shares having higher innovation continuity indicated higher stock return rate mean regardless of patent types (Tsai et al., 2021a). The A-shares having higher patent counts indicated higher stock price mean and higher stock return rate mean with regard to the whole China stock market (Tsai et al., 2021b). The A-shares having patents of higher technology variety indicated higher stock return rate mean (Tsai et al., 2021c). The A-shares having invention grants of longer examination duration indicated higher stock return rate mean (Tsai et al., 2021d). The A-shares having higher backward citation counts indicated higher stock price mean (Tsai et al., 2021e). The A-shares having higher patent counts also indicated higher stock price mean on each of the SH main board, SZ main board, GE board and SME board (Tsai et al., 2021f). The A-shares having invention grants of longer patent lives indicated significantly higher market capitalization mean (Tsai et al., 2022a). The A-shares having patents but receiving no forward citations indicated the highest stock price mean whereas the A-shares receiving forward citation counts above the average indicated the lowest stock price mean (Tsai et al., 2022b).
Among various patent indicators, patent forward citation is an important issue for evaluating patent value. When an earlier patent is published or granted, it can be applied by the examiners as prior art for testing the novelty and non-obviousness of a new patent application, which is then recognized as a forward citation of the earlier patent. The forward citation count of a patent is the number of times the patent is cited by the examiners. A patent with a high forward citation count is implied to have high influence on the industry and/or the technology involved and is regarded as high value. Companies having more patents of high forward citation count are usually regarded as having better financial achievement (Hall, Jaffe, & Trajtenberg, 2005; Hirshleifer, Hsu, & Li, 2013). However, the features of China patent forward citation are somewhat different from those of US patent forward citation. Tsai et al. (2022b) found that the China A-shares having higher China patent forward citation count do not prove to have better financial achievement. The China A-shares having China patents without forward citations show significantly higher stock price mean than those A-shares having patent forward citations. Tsai et al. (2022c) further proposed a novel indicator called the price-citation, which is defined as the product of the current stock price and the currently received forward citation count, and showed its excellence in discriminating the stock return rate. The A-shares of higher price-citation showed significantly higher stock return rate mean while the A-shares of lower price-citation showed significantly lower stock return rate mean.
With regard to the quantitative valuation models involving the patent forward citation, Chiu et al. (2020a-2020f, 2021) and Li et al. (2020a, 2020b, 2021) applied the China forward citation count as one of the parameters to form linear prediction equations and give the predictive values of China A-share’s financial indicators by time series regression. However, the explanatory ability of the resulting prediction equations is low while the error of prediction is high. Lai & Che (2009a, 2009b, 2009c) applied the US patent forward citation count as one of the parameters of an AI approach to model the damage award of US patent infringement lawsuits.
Since the US patent forward citation count has been applied as one of the input parameters of an AI approach to model the damage award of infringement lawsuits (Lai & Che, 2009a, 2009b, 2009c), the China patent forward citation count might likewise be applied as one of the input parameters of an AI approach to model the stock price of China A-shares, though the features of China patent forward citation are different from those of US patent forward citation. The objectives of this research are therefore as follows:
1) To discuss the process and result on applying China patent forward citation count as the input parameter of AI approach to predict China A-share’s stock price, and more particularly, to give the predictive stock price rather than to give the current stock price; and
2) To discuss the investment performance of stock portfolios, wherein the stocks are selected by AI predictive stock return rates derived from AI predictive stock prices.
The managerial implications of this research are to extend the application of China patent forward citations to the China stock market, discuss the combination of patent indicators and AI approaches, show the advantage of the AI approach in reducing prediction error compared with traditional linear models, identify the limitations of the AI approach, and propose stock portfolios based on AI predictive stock return rates that have better investment performance than the market trend.
In the following sections, Section 2 presents the data and methodology, including the delimitation and limitation, the population and effective samples, and the instrumentation, which covers the company integrated patent database used, the calculation of patent forward citation count, the stock price selected, and the AI approaches applied; Section 3 presents the results of AI training/testing and further proposes stock selection criteria for building stock portfolios based on AI predictive stock return rate to improve investment performance; Section 4 presents the conclusion and recommendation.
2. Data and Methodology
2.1. Delimitation and Limitation
The objective of this research is to use the China patent forward citation count as the input parameter of an AI approach to give China A-share’s predictive stock price. Therefore, only the patents filed by companies are discussed, while the patents filed by the government, R&D institutes, academic organizations, or individuals are all excluded.
There are listed Chinese companies all over the world. In this research, China companies listed with RMB common stocks in Shanghai stock exchange or Shenzhen stock exchange, so called China A-shares, are discussed. Though Hong Kong is a special administrative region of China, the criteria for initial public offering are different from those in Shanghai and Shenzhen stock exchanges. Hence, Chinese companies listed in Hong Kong or any other overseas regions are excluded in this research.
Although China is now the world’s largest patent application country, China patents are less discussed than US patents. Therefore, only China patents are analyzed in this research. Foreign patents are excluded even though these foreign patents are filed by China A-shares. There are four major patent types in China: the invention publication, the invention grant, the utility model grant and the design grant. The design grant is a design application of a product which is granted after passing the preliminary examination by having a distinct configuration, distinct surface ornamentation or both. The utility model grant is a utility model application of a product which is granted after passing the preliminary examination. The invention publication is an invention application of a product or a process which is published after passing the preliminary examination. The invention grant is an invention application which is granted after passing not only the preliminary examination but also the substantive examination by having novel and distinct technical features over the prior art, especially the prior patents. All four patent types are considered in this research.
When an earlier patent, regardless of patent type, is published or granted, it can be applied by any country’s examiners to test the patentability of later patent applications. For example, if an examiner of the USPTO is familiar with Chinese patents, the examiner may apply China patents as prior art to test the patentability of later US patent applications. The forward citations of an earlier China patent might therefore comprise not only China patents but also foreign patents. However, it is hardly feasible to count completely the forward citations of any earlier patent all over the world. In fact, most forward citations are domestic patents due to the patent examiner’s language proficiency. Since foreign patents are excluded in this research, the foreign patent forward citations are also excluded while only China patent forward citations are considered.
2.2. Population and Sample
The population comprises all China A-shares listed in Shanghai exchange and Shenzhen exchange. There are twenty quarters from 2016 Q1 to 2020 Q4 for collecting effective samples. For each of the twenty quarters, an effective sample must meet the following conditions:
1) The A-share was listed to have definite stock closing prices in the last trading days of any specified current quarter and the next quarter so as to have a quarterly future stock return rate; and
2) The A-share had at least one new China patent published or granted for receiving forward citations by the end of any specified quarter according to the patent retrieval intervals of four years, five years and six years.
2.3. Instrumentation
2.3.1. Company Integrated Patent Database
It is a common phenomenon that a listed company has a lot of subsidiaries. When a subsidiary’s revenue is merged to its parent listed company in the formal financial reports, the subsidiary’s patents are therefore inferred to contribute to its parent company’s financial performance in this research. In order to collect the correct patents and count the correct forward citations, a company integrated patent database is built in this research by carefully reviewing all China A-share’s formal financial reports. In the company integrated patent database, all subsidiaries’ patents are integrated together with parent A-share’s patents as a whole to count total forward citations of the parent A-share.
It is also common that a patent is co-owned by several companies. To avoid double counting, if a patent is co-owned by the parent A-share and its subsidiaries, it is regarded as a single patent of the parent A-share; if a patent is co-owned by several subsidiaries, it is also regarded as a single patent of the parent A-share. However, if a patent is co-owned by two or more A-shares, it is assumed to contribute equivalently to each parent A-share, so the patent is duplicated and distributed to each of the co-owning A-shares.
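The attribution rule above can be sketched as follows. This is a minimal illustration rather than the database code used in this research; the function name `attribute_patents`, the `parent_of` mapping and the sample identifiers are all hypothetical.

```python
from collections import defaultdict

def attribute_patents(patent_owners, parent_of):
    """Map each parent A-share to the set of patents attributed to it.

    patent_owners: dict of patent id -> list of owner names (hypothetical input).
    parent_of: dict of owner name -> parent A-share; an A-share maps to itself.
    Owners absent from parent_of (non-listed entities) are ignored.
    """
    portfolio = defaultdict(set)
    for patent_id, owners in patent_owners.items():
        # Co-owners inside one corporate group collapse to a single parent,
        # so the patent is counted once for that parent A-share.
        parents = {parent_of[o] for o in owners if o in parent_of}
        # A patent co-owned by several A-shares is duplicated to each of them.
        for parent in parents:
            portfolio[parent].add(patent_id)
    return dict(portfolio)

patent_owners = {
    "CN-1": ["A", "A-sub1"],       # parent A-share and its subsidiary
    "CN-2": ["A-sub1", "A-sub2"],  # two subsidiaries of the same parent
    "CN-3": ["A", "B"],            # two different A-shares
}
parent_of = {"A": "A", "A-sub1": "A", "A-sub2": "A", "B": "B"}
portfolio = attribute_patents(patent_owners, parent_of)
```

Using a set per parent makes the "single patent" rule automatic, while iterating over distinct parents reproduces the duplication across co-owning A-shares.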
2.3.2. Patent Forward Citation and Patent Retrieval Interval
A patent with more forward citations is implied to have higher influence on the technology involved. Companies with more patent forward citations are usually implied to have higher influence on the industry involved.
The forward citation is technology-sensitive. Patents in different technologies receive different numbers of forward citations. Patents in semiconductor or biology technologies usually receive more forward citations than patents in mechanical technologies. The forward citation is also time-sensitive. Older patents usually receive more forward citations than younger patents because the time interval of older patents for receiving forward citations is longer than that of younger patents.
In order to derive proper forward citation counts for all A-shares, the patent retrieval interval for retrieving the earlier patents and counting the forward citations is an important issue. Thomas (2001) focused on US patents and proposed a “current impact index” which retrieves the earlier patents granted over the previous five years, i.e. a patent retrieval interval of five years, and counts patent forward citations in the current year, i.e. a patent retrieval interval of one year. However, China patents do not receive as many forward citations as US patents do, and the “current impact index” does not work well on China patents. The patent retrieval interval for counting forward citations should therefore be modified.
Tsai et al. (2022b) focused on China patents and further tested the patent retrieval intervals for both retrieving earlier patents and counting forward citations. For the patent retrieval interval of one year, the earlier patents and the forward citations are both retrieved and counted over the previous one year. For the patent retrieval interval of two years, the earlier patents and the forward citations are both retrieved and counted over the previous two years, and so forth for the patent retrieval intervals of three, four, five and six years. It was found that the forward citation counts based on patent retrieval intervals of four, five and six years showed higher significance on the stock price while the patent retrieval intervals of one, two and three years showed lower significance. Therefore, the patent retrieval intervals of four, five and six years for calculating forward citation count are applied in this research.
The patent forward citation count of an A-share is therefore defined as the sum of the forward citation counts of all individual patents of the A-share, regardless of patent type. The Kolmogorov-Smirnov test is applied to the forward citation counts of the various patent retrieval intervals. The test result shows that the original distributions of forward citation counts are seriously skewed with high kurtosis. Therefore, all forward citation counts in this research are transformed by natural logarithm before any analysis.
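The counting and transformation steps described above might be sketched as below. The tuple layout of `patents`, the function names, and the half-open window `(start, as_of]` are assumptions made for illustration. How zero counts are handled before the logarithm is not stated in the text, so the sketch simply returns `None` for them.

```python
import math
from datetime import date

def forward_citation_count(patents, as_of, interval_years):
    """Sum forward citations of patents published within the retrieval interval.

    patents: list of (publication_date, [citation_dates]) tuples (hypothetical layout).
    Both the earlier patents and their citations must fall in (start, as_of].
    as_of is assumed to be a quarter-end date, which exists in every year.
    """
    start = date(as_of.year - interval_years, as_of.month, as_of.day)
    total = 0
    for published, citation_dates in patents:
        if start < published <= as_of:
            total += sum(1 for d in citation_dates if start < d <= as_of)
    return total

def log_count(count):
    # ln(0) is undefined; returning None for zero counts is an assumption.
    return math.log(count) if count > 0 else None

patents = [
    (date(2013, 5, 1), [date(2014, 1, 1), date(2016, 2, 1)]),
    (date(2011, 6, 1), [date(2015, 6, 1)]),  # outside the four-year window
]
as_of = date(2016, 3, 31)
counts = {k: forward_citation_count(patents, as_of, k) for k in (4, 5, 6)}
log_counts = {k: log_count(c) for k, c in counts.items()}
```

The three values in `log_counts` correspond to the three forward citation inputs used later for the neural network.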
2.3.3. Stock Price
The stock price on every trading day is dynamic. The opening price, the closing price, the highest price, the lowest price, and the mean price are extensively applied in various analyses according to different purposes. Which of the aforementioned stock prices is used does not matter in this research. For simplification and consistency, the closing price of every China A-share on the last trading day of every quarter from 2016 Q1 to 2021 Q1 is applied as the stock price in this research. According to the result of the Kolmogorov-Smirnov test, all stock prices in the analysis are also transformed by natural logarithm.
2.3.4. Back Propagation Neural Network
The Back Propagation Neural network (BPN), one of the most widely known neural networks, is a well-developed AI learning technique. BPN is applied in this research to build the AI implemented prediction model for the stock price. A neural network is a group of connected input/output units in which each connection has an associated weight. It helps to build predictive models from large databases. Back propagation is the essence of neural network training: it fine-tunes the weights of a neural network based on the error obtained in the previous iteration. Proper tuning of the weights reduces the error and makes the model more reliable by improving its generalization. Some prominent advantages of BPN include: it is fast, simple and easy to code, and it has no parameters to tune apart from the number of inputs.
In this research, a three-layer BPN comprising an input layer, a hidden layer and an output layer is applied to predict the future stock price from the patent forward citation counts under a lag of one quarter. The input nodes of the input layer are the current patent forward citation counts, with or without the current stock price, under the predetermined lag; the output node of the output layer is the predictive stock price. The number of nodes of the hidden layer is a sensitive setting that affects the convergence efficiency and the error of the output. In this research, the number of nodes of the hidden layer is set to two times the number of input nodes.
In order to test the prediction ability and limitation of BPN, two input modes for specifying the input nodes are compared:
#1) Three input nodes consisting of three patent forward citation counts according to patent retrieval intervals of four years, five years and six years, respectively.
#2) Four input nodes consisting of the stock price and three patent forward citation counts according to patent retrieval intervals of four years, five years and six years.
The input mode #1 is applied for predicting the future stock price by only the current patent citation counts. The input mode #2 is applied for predicting the future stock price by not only the current patent citation counts but also the current stock price, in order to see whether the performance is improved or not. In addition, the number of nodes of the hidden layer with regard to the input mode #1 is six, while the number of nodes of the hidden layer with regard to the input mode #2 is eight.
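The three-layer structure described above can be illustrated with a small NumPy sketch. It is not the implementation used in this research: the class name `ThreeLayerBPN`, the sigmoid hidden layer with a linear output node, the plain full-batch gradient descent, the learning rate and the synthetic data are all assumptions chosen to keep the example short. Only the rule that the hidden layer has two times as many nodes as the input layer comes from the text.

```python
import numpy as np

class ThreeLayerBPN:
    """Input layer -> one sigmoid hidden layer -> one linear output node."""

    def __init__(self, n_inputs, seed=0):
        rng = np.random.default_rng(seed)
        n_hidden = 2 * n_inputs  # hidden nodes = 2 x input nodes
        self.W1 = rng.normal(0.0, 0.5, size=(n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, X):
        # Hidden activations are cached for the backward pass.
        self.H = 1.0 / (1.0 + np.exp(-(X @ self.W1 + self.b1)))
        return self.H @ self.W2 + self.b2

    def train_step(self, X, y, lr=0.05):
        """One full-batch gradient-descent step on the mean squared error."""
        err = self.forward(X) - y                      # shape (n, 1)
        grad_out = 2.0 * err / X.shape[0]
        # Propagate the output error back through the sigmoid hidden layer.
        grad_hidden = (grad_out @ self.W2.T) * self.H * (1.0 - self.H)
        self.W2 -= lr * (self.H.T @ grad_out)
        self.b2 -= lr * grad_out.sum(axis=0)
        self.W1 -= lr * (X.T @ grad_hidden)
        self.b1 -= lr * grad_hidden.sum(axis=0)
        return float(np.mean(err ** 2))                # error before the update

# Synthetic stand-in for input mode #2: column 0 plays the role of the current
# log price, columns 1-3 the three log forward citation counts.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 0.9 * X[:, :1] + 0.05 * X[:, 1:].sum(axis=1, keepdims=True)

net = ThreeLayerBPN(n_inputs=4)
losses = [net.train_step(X, y) for _ in range(500)]
```

Constructing `ThreeLayerBPN(n_inputs=3)` gives the six-node hidden layer of the input mode #1, and `n_inputs=4` gives the eight-node hidden layer of the input mode #2.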
3. Result and Finding
3.1. AI Training and Testing
Table 1 shows the effective sample statistics in every quarter. A total of 59,498 effective samples are collected from 2016 to 2020.
All effective samples are divided into five datasets, namely dataset 2016, dataset 2017, dataset 2018, dataset 2019 and dataset 2020, for BPN training and testing respectively. Dataset 2016 comprises the A-shares of which the patent forward citation counts according to the three patent retrieval intervals are calculated by the end of 2016 Q1, 2016 Q2, 2016 Q3 and 2016 Q4, while the stock prices are retrieved on the last trading days of 2016 Q2, 2016 Q3, 2016 Q4 and 2017 Q1. Dataset 2017 comprises the A-shares of which the patent forward citation counts according to the three patent retrieval intervals are calculated by the end of 2017 Q1, 2017 Q2, 2017 Q3 and 2017 Q4, while the stock prices are retrieved on the last trading days of 2017 Q2, 2017 Q3, 2017 Q4 and 2018 Q1. The datasets 2018, 2019 and 2020 are built in the same way.
With regard to each dataset and each input mode, the lag of one quarter is applied to build BPN. That means, the inputs are the current data and the output is the predictive stock price in the next quarter. Therefore, four prediction periods as shown below are grouped together to build an AI implemented stock price prediction model for each dataset.
Table 1. Effective samples statistics in every quarter from 2016 Q1 to 2020 Q4.
Data source: author’s preparation.
Period 1: patent forward citation counts of Q1 to predict stock prices of Q2;
Period 2: patent forward citation counts of Q2 to predict stock prices of Q3;
Period 3: patent forward citation counts of Q3 to predict stock prices of Q4;
Period 4: patent forward citation counts of Q4 to predict stock prices of next Q1.
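The one-quarter lag behind these four periods can be sketched as follows. The function names and the dictionary layout keyed by `(share, (year, quarter))` are hypothetical, and the sample values are made up purely to show the pairing.

```python
def next_quarter(year, quarter):
    """Return the quarter that follows (year, quarter)."""
    return (year + 1, 1) if quarter == 4 else (year, quarter + 1)

def prediction_periods(dataset_year):
    """Return the four (input quarter, target quarter) pairs of one dataset."""
    return [((dataset_year, q), next_quarter(dataset_year, q)) for q in (1, 2, 3, 4)]

def build_samples(features, prices, dataset_year):
    """Pair each A-share's current-quarter inputs with its next-quarter price.

    features: dict of (share, (year, quarter)) -> input vector (hypothetical layout).
    prices:   dict of (share, (year, quarter)) -> stock price.
    """
    samples = []
    for current, target in prediction_periods(dataset_year):
        for (share, quarter), x in features.items():
            # Keep a sample only if the next-quarter price exists.
            if quarter == current and (share, target) in prices:
                samples.append((share, current, x, prices[(share, target)]))
    return samples

periods_2016 = prediction_periods(2016)
features = {("X", (2016, 4)): [1.1, 1.4, 1.6], ("Y", (2016, 4)): [0.7, 0.7, 1.1]}
prices = {("X", (2017, 1)): 2.3}  # share Y has no next-quarter price
samples = build_samples(features, prices, 2016)
```

Share Y is dropped because it has no next-quarter price, which mirrors the first condition for an effective sample.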
The parameters of BPN applied are set as below:
Maximum epochs = 20,000
Number of nodes of the hidden layer = 2 * (Number of input nodes)
Bit fail = 0.035
Error function: FANN
Hidden activation steepness = 0.5
Output activation steepness = 0.5
Quickprop factor = 1.75
RPROP increase factor = 1.2
RPROP decrease factor = 0.5
The results of AI training and testing are shown in Table 2. Four indicators of goodness of fit are tested in Table 2, wherein, GOF represents “Goodness of Fit”, MSE represents “Mean Square Error”, MAE is “Mean Absolute Error”, MAPE represents “Mean Absolute Percentage Error”.
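For reference, the four indicators can be computed as in the sketch below. The function name `goodness_of_fit` is hypothetical, and the text does not state whether R2 is computed as the coefficient of determination or as a squared correlation, nor whether the errors are measured on the log-transformed prices; the sketch assumes the coefficient of determination and works on whatever scale is passed in.

```python
import numpy as np

def goodness_of_fit(actual, predicted):
    """Return MSE, MAE, MAPE (in percent) and R2 for paired values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    # MAPE is undefined when any actual value is zero.
    mape = float(np.mean(np.abs(err / actual)) * 100.0)
    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "MAE": mae, "MAPE": mape, "R2": r2}

gof = goodness_of_fit([2.0, 3.0, 4.0], [2.5, 3.0, 3.5])
```

For this toy input the errors are 0.5, 0 and -0.5, which gives an MSE of 1/6, an MAE of 1/3, a MAPE of 12.5% and an R2 of 0.75.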
With regard to dataset 2016, MSE of the input mode #1 varies from the best 0.3814 (period 2) to the worst 0.4053 (period 1); MAE of the input mode #1 varies from the best 0.4923 (period 2) to the worst 0.5076 (period 1); MAPE of the input mode #1 varies from the best 18.35% (period 3) to the worst 20.02% (period 1); R2 of the input mode #1 varies from the best 0.0240 (period 1) to the worst 0.0192 (period 4). MSE of the input mode #2 varies from the best 0.0212 (period 3) to the worst 0.0860 (period 1); MAE of the input mode #2 varies from the best 0.0995 (period 4) to the worst 0.1937 (period 1); MAPE of the input mode #2 varies from the best 3.49% (period 3) to the worst 7.14% (period 1); R2 of the input mode #2 varies from the best 0.9475 (period 3) to the worst 0.7930 (period 1).
Table 2. Results of AI training and testing.
Data source: author’s preparation.
With regard to dataset 2017, MSE of the input mode #1 varies from the best 0.3785 (period 4) to the worst 0.4081 (period 3); MAE of the input mode #1 varies from the best 0.4911 (period 2) to the worst 0.5107 (period 3); MAPE of the input mode #1 varies from the best 18.65% (period 2) to the worst 22.20% (period 4); R2 of the input mode #1 varies from the best 0.0333 (period 3) to the worst 0.0121 (period 1). MSE of the input mode #2 varies from the best 0.0240 (period 3) to the worst 0.0538 (period 1); MAE of the input mode #2 varies from the best 0.1160 (period 3) to the worst 0.1576 (period 4); MAPE of the input mode #2 varies from the best 4.47% (period 3) to the worst 6.07% (period 1); R2 of the input mode #2 varies from the best 0.9431 (period 3) to the worst 0.8621 (period 1).
With regard to dataset 2018, MSE of the input mode #1 varies from the best 0.3862 (period 2) to the worst 0.4275 (period 1); MAE of the input mode #1 varies from the best 0.4945 (period 2) to the worst 0.5181 (period 1); MAPE of the input mode #1 varies from the best 21.29% (period 4) to the worst 26.53% (period 3); R2 of the input mode #1 varies from the best 0.0311 (period 1) to the worst 0.0019 (period 4). MSE of the input mode #2 varies from the best 0.0233 (period 3) to the worst 0.0849 (period 4); MAE of the input mode #2 varies from the best 0.1186 (period 3) to the worst 0.2519 (period 4); MAPE of the input mode #2 varies from the best 5.63% (period 2) to the worst 10.26% (period 4); R2 of the input mode #2 varies from the best 0.9397 (period 3) to the worst 0.7934 (period 4).
With regard to dataset 2019, MSE of the input mode #1 varies from the best 0.4357 (period 1) to the worst 0.6063 (period 4); MAE of the input mode #1 varies from the best 0.5197 (period 1) to the worst 0.6105 (period 4); MAPE of the input mode #1 varies from the best 24.59% (period 1) to the worst 29.41% (period 4); R2 of the input mode #1 varies from the best 0.0473 (period 3) to the worst 0.0334 (period 1). MSE of the input mode #2 varies from the best 0.0219 (period 3) to the worst 0.0286 (period 4); MAE of the input mode #2 varies from the best 0.1024 (period 3) to the worst 0.1273 (period 4); MAPE of the input mode #2 varies from the best 4.32% (period 3) to the worst 5.73% (period 4); R2 of the input mode #2 varies from the best 0.9625 (period 3) to the worst 0.9382 (period 1).
With regard to dataset 2020, MSE of the input mode #1 varies from the best 0.6904 (period 4) to the worst 0.7305 (period 3); MAE of the input mode #1 varies from the best 0.6449 (period 4) to the worst 0.6654 (period 1); MAPE of the input mode #1 varies from the best 29.59% (period 4) to the worst 34.33% (period 1); R2 of the input mode #1 varies from the best 0.0338 (period 1) to the worst 0.0277 (period 4). MSE of the input mode #2 varies from the best 0.0277 (period 4) to the worst 0.0411 (period 1); MAE of the input mode #2 varies from the best 0.1279 (period 4) to the worst 0.1432 (period 1); MAPE of the input mode #2 varies from the best 5.39% (period 2) to the worst 6.69% (period 1); R2 of the input mode #2 varies from the best 0.9610 (period 4) to the worst 0.9449 (period 1).
With regard to the input mode #1, MSE varies from 0.3785 (period 4 of dataset 2017) to 0.7305 (period 3 of dataset 2020); MAE varies from 0.4911 (period 2 of dataset 2017) to 0.6654 (period 1 of dataset 2020); MAPE varies from 18.65% (period 2 of dataset 2017) to 34.33% (period 1 of dataset 2020). With regard to the input mode #2, MSE varies from 0.0212 (period 3 of dataset 2016) to 0.0860 (period 1 of dataset 2016); MAE varies from 0.0995 (period 4 of dataset 2016) to 0.2519 (period 4 of dataset 2018); MAPE varies from 3.49% (period 3 of dataset 2016) to 10.26% (period 4 of dataset 2018).
Apparently, the AI prediction models based on the input mode #2 have lower MSE, lower MAE, and lower MAPE than the AI prediction models based on the input mode #1. Any goodness-of-fit of AI prediction models based on the input mode #2 is absolutely better than that of AI prediction models based on the input mode #1.
The goodness of fit R2, usually applied for linear models such like regressions, is seldom used for AI prediction models. However, R2 is more easily understood when comparing the errors. It is because this statistic indicates the percentage of the variance between the model output and the real value, i.e. the variance between the predictive stock price and real stock price in this research. The value range of R2 is from 0.0 to 1.0. When R2 = 1.0, there is no any error, while when the R2 = 0, the error is infinite. In Table 2, the R2 of the input mode #1 varies from 0.0019 (2018 Q4 of dataset 2018) to 0.0473 (2019 Q3 of dataset 2019) while the R2 of the input mode #2 varies from 0.7930 (2016 Q1 of dataset 2016) to 0.9534 (2019 Q3 of dataset 2019). The explanatory ability of AI prediction models based on the input mode #2 is pretty good whereas the explanatory ability of AI prediction models based on the input mode #1 is poor.
Chiu et al. (2020a-2020f, 2021) and Li et al. (2020a, 2020b, 2021) applied time series regression with the forward citation as one of the input variables to form linear prediction equations that give the predictive stock prices of China A-shares. The highest adjusted R2 of all these prediction equations is 0.6568, which is lower than any R2 of the AI prediction models based on the input mode #2 shown in Table 2. The AI prediction models based on the input mode #2 thus show better goodness-of-fit than the linear models.
Using the constructed AI prediction models, Figures 1-5 show the distributions of AI predictive stock prices versus real stock prices for datasets 2016 to 2020, respectively. Each green dot is the real stock price of an A-share, each gray dot is the AI predictive stock price of the same A-share obtained with the input mode #1, and each orange dot is the AI predictive stock price of the same A-share obtained with the input mode #2. For clarity, the vertical axis in Figures 1-5 is logarithmic, and the A-shares are arranged from left to right in order of increasing real stock price.
Figure 1. Real stock prices vs AI predictive stock prices for dataset 2016. Data source: author’s preparation.
Figure 2. Real stock prices vs AI predictive stock prices for dataset 2017. Data source: author’s preparation.
Figure 3. Real stock prices vs AI predictive stock prices for dataset 2018. Data source: author’s preparation.
Figure 4. Real stock prices vs AI predictive stock prices for dataset 2019. Data source: author’s preparation.
Figure 5. Real stock prices vs AI predictive stock prices for dataset 2020. Data source: author’s preparation.
For dataset 2016 as shown in Figure 1, the orange dots lie close to the green dots, and most orange dots lie a little higher than the green dots except on the right-hand side; the gray dots, in contrast, fall within a horizontal band from left to right. This means that the predictive stock prices via the input mode #2 are very close to the real stock prices and most of them are a little higher than the real stock prices, whereas the predictive stock prices via the input mode #1 range from RMB 6.8 to RMB 22.8 and show no relation to the real stock prices.
For dataset 2017 as shown in Figure 2, the orange dots lie close to the green dots; the orange dots in the left half mostly lie a little higher than the green dots and those in the right half mostly lie lower than the green dots. The gray dots fall within a horizontal band from left to right. This means that the predictive stock prices via the input mode #2 are very close to the real stock prices: they are mostly a little higher when the real stock prices are approximately below RMB 19.1 and mostly lower when the real stock prices are approximately above RMB 19.1. The predictive stock prices via the input mode #1 range from RMB 8.5 to RMB 17.5 and show no relation to the real stock prices.
For dataset 2018 as shown in Figure 3, the orange dots lie close to the green dots; the orange dots in the left half mostly lie a little higher than the green dots and those in the right half lie around the green dots. The gray dots fall within a horizontal band from left to right. This means that the predictive stock prices via the input mode #2 are very close to the real stock prices, and are mostly a little higher when the real stock prices are approximately below RMB 19.1. The predictive stock prices via the input mode #1 range from RMB 8.2 to RMB 11.9 and show no relation to the real stock prices.
For dataset 2019 as shown in Figure 4, the orange dots lie close to the green dots, whereas the gray dots fall within a horizontal band from left to right. This means that the predictive stock prices via the input mode #2 are very close to the real stock prices, whereas the predictive stock prices via the input mode #1 range from RMB 7.5 to RMB 14.8 and show no relation to the real stock prices.
For dataset 2020 as shown in Figure 5, the orange dots lie close to the green dots and mostly lie higher than the green dots, whereas the gray dots fall within a horizontal band from left to right. This means that the predictive stock prices via the input mode #2 are very close to and a little higher than the real stock prices, whereas the predictive stock prices via the input mode #1 mostly range from RMB 8.4 to RMB 19.1 and show no relation to the real stock prices.
Overall, the AI predictive stock prices via the input mode #1 show no relation to the real stock prices. The AI predictive stock prices via the input mode #2 lie close to the real stock prices: they are mostly higher than the real prices for datasets 2016 and 2020; about half are higher and half are lower than the real prices for dataset 2017; about half are higher than and half are around the real prices for dataset 2018; and most are around the real prices for dataset 2019.
An important point emerges from Figures 1-5: although the AI technique is powerful, it is not omnipotent, because its goodness-of-fit is seriously limited by the input data. With regard to the input mode #1, the inputs consist of three current patent forward citation counts according to patent retrieval intervals of four years, five years and six years. The linear correlation between these inputs and the desired outputs, i.e. the stock prices in the next quarter, is poor for every dataset from dataset 2016 to dataset 2020, so the goodness-of-fit of the trained AI prediction models is also poor. With regard to the input mode #2, the current stock price is added as an input alongside the three current patent forward citation counts. The linear correlation between the input current stock prices and the output stock prices in the next quarter is high, so the AI prediction model is easily trained to provide preferable goodness-of-fit.
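The contrast in linear correlation described above can be illustrated with a small sketch of the Pearson correlation coefficient. The function name `pearson_r` and all numbers below are made up for illustration; they are not drawn from the datasets of this research.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five A-shares (illustration only).
current_price = [5.0, 10.0, 15.0, 20.0, 25.0]   # RMB, current quarter
next_price = [6.0, 9.0, 16.0, 19.0, 26.0]       # RMB, next quarter
citation_count = [3.0, 1.0, 2.0, 1.0, 3.0]      # forward citation counts

r_price = pearson_r(current_price, next_price)      # strong linear relation
r_citation = pearson_r(citation_count, next_price)  # weak linear relation
```

In this toy example the current price is almost perfectly correlated with the next-quarter price, while the citation count is only weakly correlated with it, which mirrors the qualitative difference between the two input modes.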
3.2. AI Implemented Stock Portfolios
The main objective of predicting the stock price by patent forward citations is to understand how considering patent forward citations can be beneficial when investing in stocks. Since the predictive stock price produced by the AI prediction model based on the input mode #2 (hereinafter, the AI implemented stock price prediction model) has been shown to provide preferable goodness-of-fit, it might be applied for building beneficial stock portfolios.
Chen et al. (2020) and Chiu et al. (2020a) proposed to select stocks for stock portfolios by using the predictive stock return rate rather than the predictive stock price, and showed that the investment performance of the stock portfolios based on the predictive stock return rate is preferable to that of the stock portfolios based on the predictive stock price. The AI implemented stock price prediction model is built with a lag of one quarter, so the predictive stock price is one quarter ahead of the current stock price. By subtracting the current stock price from the predictive stock price and dividing the difference by the current stock price, the quarterly predictive stock return rate is derived and applied as the stock selection criterion for building stock portfolios in this research. The real quarterly stock return rate of the stock portfolios is then used as the reference for comparing performance.
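The return rate calculation described above can be written as a short sketch. The function name `predicted_return_rate` and the example prices are hypothetical.

```python
def predicted_return_rate(current_price, predicted_price):
    """Quarterly predictive return rate: (predicted - current) / current."""
    if current_price <= 0:
        raise ValueError("current_price must be positive")
    return (predicted_price - current_price) / current_price


# Hypothetical example: current price RMB 10.0, predicted next-quarter price RMB 11.0.
rate_up = predicted_return_rate(10.0, 11.0)    # positive: price expected to rise
rate_down = predicted_return_rate(20.0, 18.0)  # negative: price expected to fall
```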
The performance comparisons for the various datasets are shown in Table 3. “All” consists of all A-shares for the specified period, and its mean stock return rate represents the market trend. “Top 100”, “Top 200” and “Top 300” are the stock portfolios consisting of the 100, 200 and 300 A-shares with the highest AI predictive stock return rates, respectively.
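The selection and comparison procedure can be sketched as follows, assuming an equally weighted portfolio mean (the weighting is an assumption, not stated above). The helper names `select_top_n` and `mean_return`, the tickers, and all return values are hypothetical.

```python
def select_top_n(predicted_returns, n):
    """Return the n tickers with the highest predicted return rates."""
    ranked = sorted(predicted_returns, key=predicted_returns.get, reverse=True)
    return ranked[:n]


def mean_return(real_returns, tickers):
    """Equally weighted mean of the real return rates of the given tickers."""
    return sum(real_returns[t] for t in tickers) / len(tickers)


# Made-up quarterly return rates for five shares (illustration only).
predicted = {"A": 0.12, "B": -0.03, "C": 0.08, "D": 0.01, "E": -0.10}
real = {"A": 0.05, "B": -0.02, "C": 0.07, "D": -0.04, "E": -0.11}

top2 = select_top_n(predicted, 2)             # portfolio chosen by predicted returns
portfolio_return = mean_return(real, top2)    # realized return of the portfolio
market_return = mean_return(real, list(real)) # "All": the market trend
excess_return = portfolio_return - market_return
```

A positive `excess_return` means the selected portfolio outperformed the market trend in that period, which is the comparison reported for Top 100, Top 200 and Top 300.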
Table 3. Stock performance comparisons of market trends and stock portfolios selected by AI predictive stock return rates.
Data source: author’s preparation.
Of the twenty periods in the five datasets, there are seven periods in which the market trend shows a positive quarterly stock return rate, i.e. periods 2 and 3 of dataset 2016, period 2 of dataset 2017, period 4 of dataset 2018, period 3 of dataset 2019, and periods 1 and 2 of dataset 2020; in the other thirteen periods the market trend shows a negative quarterly stock return rate. The overall market trend from 2016 to 2020 shows a declining tendency.
Among the aforementioned seven periods in which the market trend shows a positive quarterly stock return rate, there are six periods, namely periods 2 and 3 of dataset 2016, period 2 of dataset 2017, period 4 of dataset 2018, period 3 of dataset 2019 and period 2 of dataset 2020, in which every one of Top 100, Top 200 and Top 300 performs better than the market trend. There is only one period, i.e. period 1 of dataset 2020, in which none of Top 100, Top 200 and Top 300 performs better than the market trend. The stock portfolios based on AI predictive stock return rates therefore work well in most of these seven periods.
Figure 6. Quarterly stock return rate comparisons of stock portfolios selected by AI Predictive stock return rates from 2016 Q2 to 2020 Q4. Data source: author’s preparation.
Among the aforementioned thirteen periods in which the market trend shows a negative quarterly stock return rate, there are nine periods, namely periods 1 and 4 of dataset 2016, periods 1, 3 and 4 of dataset 2017, period 2 of dataset 2018, period 1 of dataset 2019, and periods 3 and 4 of dataset 2020, in which every one of Top 100, Top 200 and Top 300 performs better than the market trend. There are three periods, namely period 3 of dataset 2018 and periods 2 and 4 of dataset 2019, in which at least one, but not all, of Top 100, Top 200 and Top 300 performs better than the market trend. There is only one period, i.e. period 1 of dataset 2018, in which none of Top 100, Top 200 and Top 300 performs better than the market trend. The stock portfolios based on AI predictive stock return rates therefore also work well in most of these thirteen periods.
Since the average quarterly stock return rate represents the market trend, Figure 6 shifts it to zero in order to show more clearly how the stock portfolios based on AI predictive stock return rates perform relative to the market trend in the twenty periods. Top 100, Top 200 and Top 300 are represented by green bars, orange bars and cyan bars, respectively. Although Top 100, Top 200 and Top 300 do not always perform better than the market trend, they do so in most periods, and Top 100 usually performs best. This shows that investment performance could be improved by selecting stocks according to the predictive stock return rate, where the predictive stock price is produced by AI implemented stock price prediction models that include patent forward citation counts.
4. Conclusion & Recommendation
Based on the company integrated patent database of China A-shares and the stock price data from 2016 Q1 to 2021 Q1, five datasets covering twenty prediction periods were formed to examine the effect of the AI approach for predicting the future stock price based on patent forward citation counts.
A three-layer BPN, one of the most popular AI approaches, was applied for predicting the future stock price. Five datasets, i.e. datasets 2016, 2017, 2018, 2019 and 2020, were provided for training the BPN. Two input modes for specifying the BPN’s input nodes were compared: the input mode #1 consists of three input nodes for the current patent forward citation counts according to patent retrieval intervals of four, five and six years; the input mode #2 consists of four input nodes for the current stock price and the three current patent forward citation counts according to patent retrieval intervals of four, five and six years. The following conclusions were reached.
1) Different AI prediction models for predicting the future stock price were trained and built via two input modes and five datasets. MSE, MAE and MAPE provided by the input mode #2 were clearly preferable to those provided by the input mode #1. The R2 was further applied for evaluating the explanatory ability of the AI prediction models. The R2 of the input mode #2 varied from 0.8919 (dataset 2018) to 0.9534 (dataset 2019) while the R2 of the input mode #1 varied from 0.0230 (dataset 2016) to 0.0427 (dataset 2019). The explanatory ability of the AI prediction models based on the input mode #2 was good for every dataset, whereas the explanatory ability of the AI prediction models based on the input mode #1 was poor. In addition, the goodness-of-fit of the AI prediction models based on the input mode #2 was better than that of the time series regression prediction equations proposed by Chiu et al. (2020a-2020f, 2021) and Li et al. (2020a, 2020b, 2021).
2) The performance of the AI prediction models was significantly limited by the input data. The inputs of the input mode #1 consisted of only three patent forward citation counts. The linear correlation between these inputs and the future stock price output was poor, so the trained AI prediction models based on the input mode #1 were also poor. The inputs of the input mode #2 consisted of not only the three patent forward citation counts but also the current stock price. The linear correlation between the input current stock price and the output future stock price was relatively high, so the AI prediction models based on the input mode #2, called the AI implemented stock price prediction models, were efficiently trained and showed preferable goodness-of-fit.
3) By using the current stock price and the predictive stock price derived from the AI implemented stock price prediction model, the quarterly predictive stock return rate was formed and applied as the criterion for selecting stocks to build the stock portfolios Top 100, Top 200 and Top 300. These stock portfolios showed better performance than the market trend in most periods from 2016 to 2020, and Top 100 usually performed best.
The patent forward citation had previously been shown to be significant in discriminating the current stock prices of China A-shares. In this research, the patent forward citation count and the current stock price were further incorporated with a BPN to build the AI implemented stock price prediction model for predicting the future stock price. The stock portfolios based on the AI predictive stock return rates also showed better investment performance than the market trend. The findings may inspire practitioners and scholars in patent valuation and patent informatics. It would be worthwhile for related researchers to compare these results by applying AI approaches to predict the future stock price via other patent indicators that are significantly related to the stock market, such as the Innovation Continuity (Tsai et al., 2021a), the patent count (Tsai et al., 2021b, 2021f), the technology variety (Tsai et al., 2021c), the patent examination duration (Tsai et al., 2021d), the patent backward citation count (Tsai et al., 2021e), the patent life (Tsai et al., 2022a), the patent forward citation count (Tsai et al., 2022b, 2022c), etc. The findings of this research would also contribute to listed company evaluation and investment evaluation. Investment institutions could apply the AI implemented stock price prediction model to improve their company evaluation criteria and investment strategies for better performance.
Acknowledgements
The authors acknowledge the financial support from the Ministry of Science and Technology, Taiwan under Grant No. MOST 109-2410-H-011-021-MY3. The authors are also grateful for permission to use the patent data of China A-shares, which were collected and processed by Shenzhen TekGlory Intellectual Property Data Technologies, Ltd.