Did You Really Beat the Market? A Practical and Parsimonious Approach to Evaluating Risk-Adjusted Performance

This study contributes to the literature by modifying and recasting the Modigliani and Modigliani M-Squared risk-adjusted performance measure in a practical setting. Specifically, rather than combine the risk-free asset (Treasury Bill) with the portfolio under consideration to match market risk, this study combines the risk-free asset with a levered (or unlevered) market ETF to match portfolio risk. In so doing, this study addresses the question: Could an investor have earned higher returns with the same risk (standard deviation) using a simple combination of the risk-free asset and a readily available levered (or unlevered) market ETF? The study also addresses the impact of the context where one captures return measurements on outperformance conclusions. Although this study focuses its analysis on in-sample descriptive statistics, the new Risk-Equivalent Excess Performance measure and contextualization provide a basis for future out-of-sample inferential analysis.


Introduction
This study presents a simple framework to evaluate claims of superior investment performance. As a finance professor, I often hear claims of superior returns and therefore claims of superior skill from "investment managers" in the guise of students, faculty colleagues, family, friends, and strangers. I have heard such market-beating claims regarding stocks, bonds, foreign exchange, real estate, cryptocurrency, and even baseball cards. The frequency of such claims increases during bull markets. This is consistent with Hoffman and Post [1], who find that individual investors' belief in their skill increases with recent returns and that this belief is not affected by market returns. In other words, bias obscures the possibility that recent high returns result from high market returns in general and not from skill.
Outperformance claims also appear in national media. For instance, the April 2021 Wall Street Journal article titled "The pandemic year's top stock-fund managers" reports manager Dennis P. Lynch of the Morgan Stanley Inception fund (ticker: MSSGX) earned a 12-month net return of 273% as of March 31, 2021 [2]. Astonishingly, the word "risk" is not mentioned in the Wall Street Journal article at all. This would not surprise Mark Hebner, founder and president of Index Fund Advisors, who in a January 2016 Money Management Executive column [3] states: "But here's the next number that I've never seen in the press: volatility or the deviation."
We live in an era with increased FinTech (Financial Technology) adoption and, unfortunately, a proliferation of misinformation sources. As evidence of FinTech's increased adoption, Robinhood's July 1, 2021, S-1 filing reveals 18 million accounts as of March 31, 2021 [4]. Meanwhile, misinformation regarding returns and risk is on the rise, fueled by posts from non-professionals (and professionals) on Facebook, Twitter, Reddit, etc. Together, increased FinTech adoption and misinformation proliferation lead to bubbles like the recent "meme stock" craze (see "$26 Billion Gone! 'Meme Stock' Crash Erases Nearly Half of Gains" [5]).
Undoubtedly, many of Robinhood's millions of account holders are new to financial markets. The majority of those lured into the "meme stock" craze likely lost their investment, or more, as many used borrowed funds to purchase stocks. As evidence of borrowing to purchase "meme stocks", see "Robinhood claims it ONLY forced the sale of GameStop shares if they were bought with borrowed funds" [6] that states: "At one point, an estimated half of Robinhood's 13 million users owned some GameStop stock." This study serves as a counterbalance to the misinformation that abounds in social media and the omission of important performance measurement context in mainstream media.
Shifting the focus to the academic literature, the sheer volume devoted to risk measurement indicates risk is critical for performance reporting. For instance, Cogneau and Hubner [7] perform a census of over 100 risk-adjusted performance measures in the academic literature. For now, the introductory quote in Chapter 9 of Altman [8], originally from Schoolman et al. [9] serves as a compass for the current study: "Good answers come from good questions not from esoteric analysis." In the context of assessing market-beating performance claims, the [hopefully] good question I address in this study is: The Moore Performance Question: Could an investor have earned higher returns with the same risk (standard deviation) using a simple combination of the risk-free asset and a readily available levered (or unlevered) market ETF?
The Moore Performance Question takes into account two key factors that are often overlooked in outperformance claims: risk and market (benchmark) returns.
While risk is the primary focus of this study, I must note risk is just one of many critically important factors to consider when assessing performance. Table 1 lists what I term the nine ingredients of valid performance measurement. I discuss these in greater detail in Sections 2.2 and 3.
In sum, the importance and contribution of this study are twofold. First, the nine ingredients of valid performance measurement serve as a reminder for practitioners (veteran and novice alike) and academics to be mindful of the full context of performance reporting. Second, the new Risk-Equivalent Excess Performance measure provides a straightforward measure of value-added performance inclusive of the nine ingredients.
The remainder of the paper is as follows. Section 2 presents a literature review reflective of the starting point of this study and the nine ingredients of valid performance measurement. Section 3 describes the approach this study takes in addressing performance measurement concerns, the sample construction, summary statistics, and construction of the REEP measure. Section 4 presents results and Section 5 concludes.

The Starting Point: Modigliani and Modigliani RAP Measure
This study begins with the Modigliani and Modigliani [10] M-2 (also called Risk-Adjusted Performance or RAP) measure. Modigliani and Modigliani construct RAP by combining the risk-free asset (Treasury Bill) with the portfolio under consideration to match market risk. Higher RAP corresponds to a higher ranking. In contrast, the measure of this study, Risk-Equivalent Excess Performance or REEP, combines the risk-free asset with a levered (or unlevered) market ETF to match portfolio risk.
The study of Cogneau and Hubner [7] notes the RAP measure is a linear function of the Sharpe Ratio and therefore shares its disadvantages. Specifically, Cogneau and Hubner state the Sharpe Ratio (and by extension RAP) 1) does not quantify value-added in that it only ranks funds, 2) produces rankings affected by the choice of the risk-free rate, 3) is suitable for investors who invest in only one fund, 4) is subject to sampling error in the standard deviation calculation, and 5) presumes normality while most fund returns are not normally distributed.
I refer to this set of observations as the Cogneau-Hubner Critique. [...] returns not excess returns and 2) actual borrowing costs are inherent in the construction of the measure.
Regarding one-fund investing, REEP shares the same disadvantage as RAP in that it considers funds in isolation. However, the primary focus of this study is not to rank funds, but rather to evaluate claims of outperformance. The same can be said in the use of standard deviation as a measure of risk. Regarding nonnormal fund returns, this leaves room for future research using alternative nonparametric measures of risk. Regardless, as we shall see, the REEP measure constructed in Section 3 does indeed answer the Moore Performance Question.
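The linear link between RAP and the Sharpe Ratio noted above can be illustrated with a minimal Python sketch. It assumes the standard textbook forms, Sharpe = (R_P - R_f) / sigma_P and RAP = R_f + (sigma_M / sigma_P)(R_P - R_f), so that RAP = R_f + sigma_M x Sharpe. The fund labels and every numeric input are hypothetical and are not taken from this study's data.

```python
def sharpe_ratio(mean_return, risk_free, std_dev):
    """Excess return per unit of total risk."""
    return (mean_return - risk_free) / std_dev


def rap(mean_return, risk_free, std_dev, market_std):
    """M-squared: (de)lever the portfolio with the risk-free asset to market risk."""
    leverage = market_std / std_dev
    return risk_free + leverage * (mean_return - risk_free)


# Hypothetical monthly inputs: name -> (mean return, standard deviation).
RISK_FREE = 0.001
MARKET_STD = 0.04
FUNDS = {"X": (0.012, 0.06), "Y": (0.009, 0.03), "Z": (0.007, 0.05)}


def rank_by(score):
    """Fund names sorted from best to worst under the given score."""
    return sorted(FUNDS, key=lambda name: score(*FUNDS[name]), reverse=True)


sharpe_rank = rank_by(lambda m, s: sharpe_ratio(m, RISK_FREE, s))
rap_rank = rank_by(lambda m, s: rap(m, RISK_FREE, s, MARKET_STD))
print(sharpe_rank, rap_rank)
```

Because RAP is an increasing linear transformation of the Sharpe Ratio for a fixed risk-free rate and market standard deviation, the two rankings coincide, which is why RAP inherits the Sharpe Ratio's limitations.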

Ingredients of Valid Performance Measurement
This section discusses performance measurement ingredients and their importance in the context of extant literature. It is important to bear in mind that although all ingredients are critical, not all are present in performance measurements obtained from acquaintances, the media, or even academic literature. However, by the end of this section readers will be more aware (or reminded) of important pieces of information necessary for valid performance measurement claims.
Ingredient 1: Gains and losses. The first step to valid performance measurement is procurement of accurate returns that include both gains and losses. I emphasize both gains and losses because, as Thaler [11] shows in his seminal work on consumer choice, humans suffer from "mental accounting." In the context of reporting returns from their investments, it is much like reporting "winnings" from a casino: many often neglect money they lost and speak only of their winning bets.
Ingredient 2: Cash. Related to the "mental accounting" phenomenon pointed out by Thaler, some investment managers report returns "net of cash." In a May 2019 Bloomberg article [12], Warren Buffett criticized the practice of reporting net of cash returns stating "It makes their return look better if you sit there a long time in Treasury Bills … It's not as good as it looks." As a simple illustration, consider hypothetical performance reported by two different investment managers. Presume fund manager A held 50% of assets under management in cash and the other 50% in equities that earned 12% over a year. Now presume manager B held 5% of their assets in cash and the other 95% in equities that earned 8% over the same year. Table 2 presents the illusion of "net of cash" return reporting. Clearly, Manager B earned more money for their clients (7.6%) than Manager A (6.0%) even though Manager A may claim higher net of cash returns (12% vs. 8%).
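The arithmetic behind the two-manager illustration can be reproduced with a short Python sketch. It assumes, as the illustration does, that the cash portion earns a zero return; the function name and the `cash_return` parameter are hypothetical conveniences.

```python
def total_return(cash_weight, equity_return, cash_return=0.0):
    """Return on all assets under management, cash included."""
    return cash_weight * cash_return + (1.0 - cash_weight) * equity_return


# Manager A: 50% cash, equities earned 12%. Manager B: 5% cash, equities earned 8%.
manager_a = total_return(cash_weight=0.50, equity_return=0.12)
manager_b = total_return(cash_weight=0.05, equity_return=0.08)
print(f"Manager A: {manager_a:.1%}, Manager B: {manager_b:.1%}")
```

Setting `cash_return` to a Treasury Bill yield instead of zero would raise both totals, with the larger lift going to the manager holding more cash.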
Buffett further states "Firms will include money that's sitting in Treasury Bills waiting to be deployed when charging management fees, but will exclude it when calculating a so-called internal rate of return, the performance measure in which most funds are judged [12]." Thus, returns inclusive of cash positions (and the associated zero return of that portion of the portfolio) are requisite for valid performance measurement.
[...] indexing. In particular, active strategies may execute more short-term trades and thereby expose investors to higher short-term gains taxes [14]. Thus, the higher taxation that comes with the more frequent trading of active management must be considered when comparing results to passively managed instruments with infrequent trading.
As an illustration, presume short-term gains are taxed at 35% and long-term gains at 15%. Presume returns from the active strategy are $R_{active}$ and returns from the passive strategy are $R_{passive}$. In order to have comparable after-tax returns, the following inequality shows the active manager's returns need to exceed the passive manager's by over 30%:
$$R_{active}(1 - 0.35) \geq R_{passive}(1 - 0.15) \;\Rightarrow\; R_{active} \geq \frac{0.85}{0.65}\, R_{passive} \approx 1.31\, R_{passive}$$
[...] [7]. Authors Modigliani and Modigliani [10] address risk with a straightforward question: "do returns adequately compensate us for the risk what we bear?" In this study I address risk via a similar yet distinct question: Could an investor have earned higher returns with the same risk (standard deviation) using a simple combination of the risk-free asset and a readily available levered (or unlevered) market ETF? Section 3 describes the approach this study uses to address these questions.
While selection of an appropriate benchmark is pervasive in academic literature and finance texts, it sometimes eludes our friends outside the ivory towers and the media (see Hebner's critique [3] of media omitting "risk" when presenting performance numbers).
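As a quick check of the tax illustration above (35% on short-term gains, 15% on long-term gains), the following Python sketch computes the break-even multiple. It assumes that all active gains are taxed at the short-term rate, all passive gains at the long-term rate, and that both returns are positive; the function name is hypothetical.

```python
def breakeven_multiple(short_term_rate, long_term_rate):
    """Multiple of the passive pre-tax return the active strategy must earn
    for the two after-tax returns to be equal."""
    return (1.0 - long_term_rate) / (1.0 - short_term_rate)


multiple = breakeven_multiple(short_term_rate=0.35, long_term_rate=0.15)
print(f"Active pre-tax return must be at least {multiple:.3f}x the passive return")
```

Under these assumptions a passive return of 10% would require an active pre-tax return of roughly 13.1% merely to tie after taxes.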
In a strict sense, the market portfolio is a market capitalization weighted portfolio of all risky assets around the world. Unfortunately, such a portfolio is unobservable [16]. However, Doeswijk et al. [17] construct an index of market portfolio returns through extensive data collection. The authors note that the tests of Stambaugh [18] found exclusion of assets such as bonds and residential real estate from the market portfolio had little impact on CAPM inferences. Yet, Doeswijk et al. [17] do note that "certain asset pricing applications" do necessitate a broader market portfolio representation than just the S&P 500.
[...] Treynor and Mazuy find no evidence that mutual fund managers outguessed the market.
"The Fundamental Law of Active Management." Grinold [20] introduces "The Fundamental Law of Active Management" which leads to a series of equations in Grinold and Kahn [21] that relate ex-ante information ratios to managerial skill and breadth of investments. However, Goodwin [22] notes that ex-ante information ratios and breadth measures are difficult to estimate making [...]
Generalized Binomial Distribution (GBD) simulation. Bhootra et al. [24] employ GBD simulation to identify whether or not observed persistence of mutual funds in the top 25% of returns can occur via chance. Using a sample of 981 mutual funds over the 1995-2009 period, the authors find evidence that more funds achieve persistence in the top 25% than would be predicted by chance.
While the results are promising in that they confirm the presence of skill in the mutual fund industry, the results are still subject to Ingredient 6: Time period. A process that worked in the 1995-2009 time period has no guarantee of working in the 2021-2034 time period. Furthermore, Bhootra et al. document the presence of skill ex-post with no mechanism to identify persistent top 25% performers in advance.
Collectively, the extant literature covers the ingredients for valid performance measurement and developing risk-adjusted measures. This study extends that literature stream by summarizing the relevant factors of valid performance measurement and developing a parsimonious and practical measure that addresses the Cogneau-Hubner Critique of the Modigliani and Modigliani [10] RAP measure.

Addressing Performance Measurement Concerns
The previous section detailed nine distinct ingredients or considerations for valid performance measurement. This section details the approaches used in this study to ensure validity of performance measurements herein.
1) Gains and losses. Returns obtained from Bloomberg L.P. [25] and the local pension fund include both gains and losses.
2) Cash. Returns obtained from Bloomberg and the local pension fund include cash holdings.
3) Fees and costs. Returns from mutual funds obtained from Bloomberg are based on net asset value (NAV) which is net of fees and costs. Return data for the local pension fund are in both gross and net terms as are the Student Investment Fund returns.
4) Taxes. To abstract from taxes, this study presumes assets are held in a non-taxable or tax-deferred account. This is the case for both the local pension fund and the Student Investment Fund and could be the case for the other settings (e.g., IRA, 401k, and 403b accounts).
5) Risk. This study follows Modigliani and Modigliani [10] and others using standard deviation as the risk measure. This study also modifies the risk-adjusted performance measure of Modigliani and Modigliani [10]. More on this in Section 3.4.
6) Time period. Section 2.2 suggests the time-varying investment environment could nullify out-of-sample inferences. To illustrate the time-varying investment environment, Figure 1 and Figure 2 present the rolling 10-year (120 month) mean and standard deviation and rolling 10-year (120 month) cumulative return for the S&P 500 index, respectively. Both figures illustrate significant volatility in average monthly returns, monthly standard deviation, and cumulative 10 year returns. The bottom panel of Figure 2 highlights how we have been in a bull run for more than a decade while the top panel reveals bull runs historically precede bear markets.
However, this study focuses on evaluation of claims during a specific time period and thereby does not make any out-of-sample claims. In the process, this study raises awareness of the time-varying investment environment in US equity markets and that out-of-sample results could vary substantially from in-sample results.
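The rolling 10-year (120 month) statistics described for Figure 1 and Figure 2 can be computed along the following lines. This is a minimal sketch using only the Python standard library; it assumes simple monthly returns, a sample (n - 1) standard deviation, and geometric compounding for the cumulative return, none of which is confirmed by the text. The toy series and the 3-month window are hypothetical and chosen only to keep the example short.

```python
import math
import statistics


def rolling_stats(monthly_returns, window=120):
    """Rolling mean, sample standard deviation, and cumulative return.

    Returns one (mean, std, cumulative) tuple per complete window.
    """
    results = []
    for end in range(window, len(monthly_returns) + 1):
        chunk = monthly_returns[end - window:end]
        # Compound the simple returns over the window.
        cumulative = math.prod(1.0 + r for r in chunk) - 1.0
        results.append((statistics.mean(chunk), statistics.stdev(chunk), cumulative))
    return results


# Hypothetical toy series with a 3-month window, purely for illustration.
toy = [0.02, -0.01, 0.03, 0.00, -0.02]
for mean, std, cum in rolling_stats(toy, window=3):
    print(f"mean={mean:.4f} std={std:.4f} cumulative={cum:.4f}")
```

With real data, `window=120` applied to the monthly S&P 500 return series would yield one observation per month once ten years of history are available.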
Another time period consideration is the time-varying manager scenario. To illustrate performance measurement in the context of inherited holdings, I utilize the first and last trade dates for the two most recent managers (Z and M) of the Student Investment Fund used in this study. As such, I conduct analysis of SIF performance in four time periods shown in Table 3.
7) Sample size. From a statistical perspective in a financial return context, the population includes returns we have observed (e.g., Figure 1 and Figure 2) and future returns we have yet to observe (future return graph not available). Having surrendered the focus to in-sample descriptive statistics, I alleviate the pressure to have a large dataset or one representative of the population. However, future research with a larger sample or simulation can contribute to the literature.
[...] For example, consider one of the small cap growth funds mentioned in "The pandemic year's top stock-fund managers" [2]. If markets are [reasonably] efficient, and the S&P 500 [reasonably] approximates the market portfolio, then Modigliani and Modigliani [10], this study, and numerous others, use an "appropriate benchmark" to judge performance.
9) Luck vs. skill. Luck vs. skill is both difficult to quantify [22] and project into the future (Ingredients 6, 7, and 9). While Fama and French [23] and Bhootra et al. [24] find some evidence that skill exists in mutual fund management, Bessembinder [26] points out that identifying managers with such skill reliably in advance is still unresolved. As such, I save luck vs. skill analysis for future research that may utilize the Risk-Equivalent Excess Performance measure developed herein.
Table 4 provides a brief description of the data used in this article. All data are obtained from Bloomberg L.P. [25] with the exceptions of Funds C and Cn, obtained from a local pension plan.
I compute the expense ratio associated with Fund Cn by subtracting the mean of returns net of fees (Fund Cn) from the mean of gross returns (Fund C). This amounts to 0.24%.
[Table 4 row: One month treasury bill rate; n/a; n/a; n/a; 1989-12-01]

Sample Construction
Since this study is an in-sample assessment of the Moore Performance Question, missing data issues are mitigated. To extrapolate results out-of-sample, or to make comparisons between financial instruments, one must deal with varying fund return data availability. Such considerations are left for future research. As such, this study does not fill missing data with any values. Rather, it focuses on the data that are available. This focus is evident in the following section that explicitly lists start and end dates for each time series.
Table 5 presents summary statistics for monthly returns. The table illustrates the diversity in data availability (start and end dates) as indicated in the last two columns.
Figure 3 [...] [27]. de la Hoz [27] states "returns could be, and have been, outstanding." Although returns were +150% in 2020, ARKK was down over 3% through July 2021 [28]. Skepticism in ARKK's continued success is prevalent, to the extent that an anti-ARKK ETF (SARK for short ARKK) is in the works [28].

Summary Statistics
On to Apple Inc., which is substantially above the Practical CML. Apple has been around since 1980 and has had its ups and downs. But for the past 12 years, Apple significantly outperformed the market. This relates to Bessembinder [26] who finds the bulk of US stock market gains are concentrated in the top 4% of listed companies while the remainder earn roughly the same as Treasury Bills.
However, Bessembinder points out that the existence of persons able to reliably identify such top performing stocks in advance is an open question.
One final observation before moving on to developing the Risk-Equivalent Excess Performance measure (which measures the distance from the Practical CML). Where the theoretical CML presumes borrowing and lending at $R_f$, the Practical CML relies on ETF efficiencies (economies of scale, use of derivatives, etc.) to implement leverage at a much lower cost than many individual investors in a real-world setting.
In all sub-plots, which reflect varied time-frames, we see the cost of leverage increases with leverage. That is, a line between $R_f$ and 1X will have a higher slope than a line between $R_f$ and 2X which in turn will have a higher slope than a line between $R_f$ and 3X. This is not surprising looking at the expense ratios for the market ETFs in Table 4 (0.03% for 1X, 0.90% for 2X, 0.92% for 3X).
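The slope comparison above amounts to computing, for each market ETF, the excess mean return over the risk-free rate per unit of standard deviation. The Python sketch below illustrates the calculation; the (mean, standard deviation) pairs and the risk-free rate are hypothetical placeholders chosen to show a declining slope, and they are not the values reported in Table 5.

```python
def slope_from_risk_free(mean_return, std_dev, risk_free):
    """Slope of the line from (0, R_f) to (std_dev, mean_return)."""
    return (mean_return - risk_free) / std_dev


# Hypothetical monthly (mean, std) pairs; NOT the paper's reported statistics.
RISK_FREE = 0.0005
ETFS = {"1X": (0.0120, 0.040), "2X": (0.0225, 0.081), "3X": (0.0325, 0.123)}

slopes = {name: slope_from_risk_free(m, s, RISK_FREE) for name, (m, s) in ETFS.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:.4f}")
```

A slope that falls as leverage rises is what one would expect when each additional unit of leverage carries financing and expense costs above the risk-free rate.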

Risk-Equivalent Excess Performance (REEP) Measure
Authors Modigliani and Modigliani [10] address performance measurement with a straightforward question: "do returns adequately compensate us for the risk what we bear?" In this study I address performance measurement via a similar yet distinct question: Could an investor have earned higher returns with the same risk (standard deviation) using a simple combination of the risk-free asset and a readily available levered (or unlevered) market ETF? Figure 4 depicts the Risk-Adjusted Performance (RAP) measure of Modigliani and Modigliani [10] and the Risk-Equivalent Excess Performance (REEP) measure of this study.
In their RAP measure, Modigliani and Modigliani [10] [...]
Risk-Equivalent Excess Performance (REEP). Consider the same portfolio under consideration, market portfolio, and risk-free asset of the RAP measure above. The market portfolio M can be levered (or de-levered) using the risk-free asset to construct a new portfolio P** with the same standard deviation as the portfolio under consideration. The leverage, mean return, and REEP measure are as follows:
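The equations themselves did not survive in this copy of the text. A plausible reconstruction, inferred from the verbal description above and from the later references to the leverage column d, is d = sigma_P / sigma_M for the leverage, R_P** = d x R_M + (1 - d) x R_f for the mean return of the risk-matched mix, and REEP = R_P - R_P** for the measure. The Python sketch below implements that reading; it should be treated as an assumption rather than the author's exact formulation, and the numeric inputs are hypothetical.

```python
def reep(portfolio_mean, portfolio_std, market_mean, market_std, risk_free):
    """Return (d, matched_mean, reep) under the assumed reconstruction.

    d             weight in the market ETF so the mix has the portfolio's risk
    matched_mean  mean return of the risk-matched mix P**
    reep          portfolio mean minus matched_mean
    """
    d = portfolio_std / market_std
    matched_mean = d * market_mean + (1.0 - d) * risk_free
    return d, matched_mean, portfolio_mean - matched_mean


# Hypothetical monthly inputs for illustration only.
d, matched, excess = reep(
    portfolio_mean=0.015, portfolio_std=0.06,
    market_mean=0.020, market_std=0.08, risk_free=0.001,
)
print(f"d={d:.2f} matched mean={matched:.5f} REEP={excess:.5f}")
```

In this toy case the portfolio falls slightly short of the risk-matched mix of 75% market ETF and 25% Treasury Bills, so REEP is negative; a positive value would indicate the portfolio earned more than that mix at the same standard deviation.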

Full Sample
For the context of computing REEP, I define the full sample as the time period from the first available monthly return of the youngest leveraged market ETF (Fund 3X: UPRO, 2009-07-31) to 2021-06-30. Table 6 presents results for the full sample. The results are consistent with the findings in Section 3.3: only three portfolios have positive Risk-Equivalent Excess Performance (Fund W5, Fund A, and AAPL).
As an example of interpreting the results, look to Fund C and Fund Cn of Table 6. First, the risk (standard deviation) of Fund C exceeds that of the S&P 500 (or the unlevered S&P 500 ETF IVV), as indicated by the selection of benchmark 2X. Note that although the portfolio standard deviation (sp) is less than the market standard deviation (sm), sm refers to the standard deviation of Fund 2X (SSO), not the S&P 500. Second, Fund C does not generate positive REEP before or after fees.
Over the period of analysis, the pension fund would have had higher returns at the same level of risk by purchasing Treasury Bills and Fund 2X (SSO). Thus, the issue of underperformance is more than just fees.
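The Fund C example suggests a benchmark selection rule: use the least-levered market ETF whose standard deviation is at least that of the portfolio, so the risk-matched mix holds some Treasury Bills and the ETF rather than requiring the investor to borrow. This rule is an inference from the example and is not stated explicitly in the text. The Python sketch below illustrates it with hypothetical standard deviations.

```python
def select_benchmark(portfolio_std, etf_stds):
    """Pick the least-levered ETF whose risk is at least the portfolio's.

    etf_stds maps a label to its standard deviation. Returns None when the
    portfolio is riskier than every available ETF.
    """
    for label, std in sorted(etf_stds.items(), key=lambda item: item[1]):
        if std >= portfolio_std:
            return label
    return None


# Hypothetical monthly standard deviations, for illustration only.
ETF_STDS = {"1X": 0.040, "2X": 0.081, "3X": 0.123}
print(select_benchmark(0.035, ETF_STDS))
print(select_benchmark(0.050, ETF_STDS))
```

A portfolio only slightly riskier than the unlevered ETF is therefore compared against a mix of the 2X fund and Treasury Bills, which is why sm can exceed sp even for a fund whose risk is close to that of the S&P 500.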

Pre-Pandemic Peak
In 2020 the S&P 500 peaked on 2020-02-19 at 3386 and bottomed on 2020-03-23 at 2237. Therefore I define the pre-pandemic peak period as the monthly returns from 2009-07-31 (first month of returns for youngest market ETF) to 2020-02-29. Table 7 presents REEP calculations and Figure 5 visualizes the summary statistics for the pre-pandemic peak period.
Like the full-sample in Table 6, three portfolios in the pre-pandemic period have generated positive REEP. However, Fund W3 in the pre-pandemic period replaces Fund W5 from the full sample.

Post-Pandemic 12 Month Bull
In 2020 the S&P 500 bottomed on 2020-03-23 at 2237. Therefore I define the post-pandemic 12-month bull period as the monthly returns from 2020-04-30 to 2021-03-31. Table 8 presents REEP calculations and Figure 6 visualizes the summary statistics for the post-pandemic 12-month bull period.
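The three analysis windows defined so far can be expressed as date ranges over month-end return observations. The Python sketch below counts the monthly returns in each window, assuming both endpoints are included and no months are missing; the function and dictionary names are hypothetical.

```python
from datetime import date


def months_in_period(start, end):
    """List (year, month) pairs from start to end, both inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return months


PERIODS = {
    "full sample": (date(2009, 7, 31), date(2021, 6, 30)),
    "pre-pandemic peak": (date(2009, 7, 31), date(2020, 2, 29)),
    "post-pandemic 12-month bull": (date(2020, 4, 30), date(2021, 3, 31)),
}

for name, (start, end) in PERIODS.items():
    print(f"{name}: {len(months_in_period(start, end))} monthly returns")
```

Note that the return for March 2020, the month containing the market bottom, falls in neither the pre-pandemic peak window nor the 12-month bull window.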
Consistent with the Wall Street Journal (WSJ) article praising pandemic performance, Funds W1-W5 all generated positive REEP during this period. In fact, of the 13 portfolios examined, only three did not generate positive REEP: Fund L, Fund D, and AAPL. Looking to Table 4, we see that Fund L has the second highest expense ratio at 1.78%. Given Fund L, a large-cap value fund, has a much higher cost than its benchmark (Fund 1X with a 0.03% expense ratio), it is not surprising that Fund L generated negative REEP. Fund D represents a concentrated position of 7 stocks that generated higher return than the market (Fund 1X), but less than a combination of 93% in Fund 3X (Table 8, column d) and 7% in Treasury Bills.
Table 9 presents the results for the varied management time periods. Bear in mind that the column for market standard deviation (sm) is the standard deviation corresponding to the selected benchmark (1X, 2X, or 3X). The Manager Z era is the only era where the portfolio risk (standard deviation) necessitated moving to a levered ETF (2X). However, Manager Z did earn the highest monthly REEP (0.0031 or 0.31%), albeit by a minuscule amount (0.0002, 0.02%, or 2 basis points) and by taking on more risk than stated by the fund's prospectus (footnote 1).
Footnote 1: The prospectus states that risk should be in-line with the S&P 500. However, as shown in Table 9, the risk (standard deviation) exceeded that of the benchmark (1X) necessitating the use of the 2X levered fund as the benchmark.
Journal of Mathematical Finance
Again, Ingredients 6 (Time period) and 7 (Sample size) apply: we have very small sample sizes (13 months for Manager Z and 7 months for Manager M) from a rather unique time-period in history (COVID-19 era) that we hope is not representative of the population that includes future returns. Thus, any conclusions of superior performance should be taken with several grains of salt. This reiterates the need for future research that utilizes a larger sample or simulation.
Figure 7 visualizes the summary statistics associated with the Student Investment Fund returns net of fees during different management regimes. Three observations are of note. First, the market ETF (1X:IVV) is above the Theoretical CML in three out of four time frames and on the line in the third. This suggests the managers at iShares are doing a good job keeping costs down and in fact enhancing returns of clients vs. the benchmark index. Second, although the SIF portfolio (Sn) is above the Theoretical CML under both Manager Z and Manager M regimes, Manager M is farther away from the CML, indicating Manager M's superior performance with respect to the fund's stated benchmark (S&P 500 Index rather than any of the market ETFs). Finally, when looking at the full sample (fourth chart on the right), the SIF portfolio net of fees lies on the Theoretical CML.

Time-Varying SIF Management
One last comment before moving on to the conclusion. As mentioned in Section 3, this study does not address Ingredient 9: Luck vs. skill. Given the absence of a luck-vs-skill measure (combined with a small sample that is not representative of the population) one cannot determine if the performance results from Managers Z and M are due to luck or skill.

Conclusions
In this study, I introduce a parsimonious and practical Risk-Equivalent Excess Performance (REEP) measure based on the well-known Modigliani and Modigliani [10] Risk-Adjusted Performance (RAP) measure. In addition, I highlight nine key ingredients of valid performance measurement: gains and losses, cash, fees and costs, taxes, risk, time period, sample size, appropriate benchmark, and luck vs. skill. I survey the literature relevant to these nine ingredients and discuss the approach of this study to address those concerns.
The [...] Furthermore, Fund W5 generated negative REEP in the pre-pandemic period. I also provide an example of how to address time-varying fund management when assessing performance. I accomplished this utilizing proprietary data from a university Student Investment Fund. Results affirm the need to be mindful of the nine ingredients of valid performance measurement in that there is insufficient evidence to conclusively determine a "winning" manager. Ultimately, the real winners are all students that participate in Student Investment Funds as they gain knowledge and skills useful in the workforce.
Rather than attempt to develop a method to reliably predict the future, i.e., which financial instruments or managers will outperform the market on a risk-equivalent basis, this study examines claims of ex-post (observed) data. No one can reliably predict the future and therefore no one can reliably measure future performance. However, the new Risk-Equivalent Excess Performance measure and contextualization provide a basis for future out-of-sample inferential analysis. But, as always, "past performance is no guarantee of future results."