No. of Recommendations: 74
May, 2006

What is MI? part 2 (of 6): Problems, Lingo

---- Problems
What are some of the limits of backtesting?
What is a crystal ball effect?
What is Survivorship Bias?
What is the Multiple Hypothesis problem?
What is the Efficient Market Hypothesis?
---- Lingo
What is canaricide?
What is a Mound of Toast?
What is Toastiness?
What is RRS?
What is XG?
What is Radiscript?
What is Belize?
Who is Lord Voldemort?


“People should be given respect. Ideas should be beaten within an inch of their lives. That's how you can tell the gold from the junk.”
MarkW in MI #99074, 4/15/2001


What are some of the limits of backtesting?

Just about every site that promotes stock-screening will present some form of backtest results to justify the screening method. This usually comes with some hype, as I groused in the Math section above: “If you had invested using this method, $1 would have turned into the Gross Domestic Product in just 10 minutes!!!!” These results can be difficult to parse.

The #1 risk of backtesting is that you'll fool yourself. You'll find some relationship that seemed to hold true during the backtest period, but that wasn't really well-founded and is unlikely to continue going forward. The classic example of this is the “butter in Bangladesh” story, from a Business Week article in 1997:
Possibly the most notorious group of data miners are stock market researchers who seek to predict future stock price movement. Most if not all stock market anomalies have been discovered (or at least documented) via data mining of past prices and related (or sometimes unrelated) variables. When market-beating strategies are discovered via data mining, there are a number of potential problems in making the leap from a back-tested strategy to successfully investing in future real-world conditions. The first problem is determining the probability that the relationships occurred at random or whether the anomaly may be unique to the specific sample that was tested. Statisticians are fond of pointing out that if you torture the data long enough, it will confess to anything.
In what is becoming an infamous example, David Leinweber went searching for random correlations to the S&P 500. Peter Coy described Leinweber's findings in a Business Week article titled “He who mines data may strike fool's gold”. The article discussed data mining, Michael Drosnin's book The Bible Code (much more on this topic later), and the fact that patterns will occur in data by pure chance, particularly if you consider many factors. Many cases of data mining are immune to statistical verification or rebuttal. In describing the pitfalls of data mining, Leinweber "sifted through a United Nations CD-ROM and discovered that historically, the single best predictor of the Standard & Poor's 500-stock index was butter production in Bangladesh." The lesson to learn, according to Coy, is that a "formula that happens to fit the data of the past won't necessarily have any predictive value."
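Leinweber's point is easy to reproduce yourself. Here's a toy sketch in Python (all numbers invented): generate ten years of fake "index returns," then test a thousand pure-noise series against them and report the best correlation found. With only ten data points and a thousand tries, chance alone will hand you a very impressive-looking "predictor."

```python
# Toy demo of the butter-in-Bangladesh effect: test enough unrelated
# series against a target and one will correlate strongly by pure chance.
import random

random.seed(1)

def correlation(xs, ys):
    # plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Pretend this is 10 years of annual index returns (made-up numbers).
index = [random.gauss(0.08, 0.15) for _ in range(10)]

# 1,000 candidate "predictors" that are pure noise (butter, rainfall, ...).
best = max(
    abs(correlation([random.gauss(0, 1) for _ in range(10)], index))
    for _ in range(1000)
)
print(f"best |correlation| among 1000 noise series: {best:.2f}")
```

The winning series is, by construction, garbage: it just happened to fit the past.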

We are particularly susceptible to this, because (A) we want to find something that works, i.e. something that will create a market-beating screen; and (B) the Screen Builders (Jamie's and Keelix's) give lots of us the opportunity to look at lots of combinations of factors. There is no doubt we will find something that had breathtaking performance. But will it have any predictive value? This notion is examined a little more below, under the Multiple Hypothesis problem.

The #2 risk of backtesting is that if your backtest period only includes one type of market, you can find strategies that really do have some kind of predictive value, but that value is limited to the kind of market the backtest period includes. For example, our first really extended set of data on which to perform backtests was the ValueLine stocks data from 1986 to 1999. Those years just happened to be the greatest extended bull market in US stock history. All of our backtests showed momentum was the place to be: anything else degraded returns, even risk-adjusted returns. And that is true: when the market is only going up and up and up, the best-performing stocks are the ones leading the pack. One poor soul bet the ranch at the market top, at age 57:

Millionaires' Club, 3/8/2000
Our plan is that when my own portfolio reaches $2,000,000, I will set aside one million in a more secure environment as our “insurance” in the unlikely event that everything tanks. Then I will continue using RS-IBD on the other million in my portfolio

Old Head Needs Current MI Guidance, 7/16/2003
the leading MI strategies took me from $272,000 (all IRA) in July 99 to just over $1,000,000 in February 2000. … Friends told me to cash out then and there. I [stayed in]. But by May, I couldn't stand it any longer and cashed out at $435,000.

Note that since then, we've managed to extend data in two directions: we have more performance data, going back to 1969 for some VL-based screens; and of course we experienced the 2000-2006 market. It's no longer the case that our backtest period covers only a single type of market.

Other risks of backtesting are that you can identify something that has some predictive value, but that for one reason or another you can't really take advantage of. For example, most backtests don't take friction into account. For a strategy that has you trading in and out of stocks a dozen times a week, the stock price movement that the strategy tries to take advantage of may be real, but your commission costs may be so high that you actually lose money trading the strategy. Or, for a strategy that involves buying stock of very small companies, you may find that the stocks are so illiquid that they experience a big run-up in price when people try to buy into them, or that they have very large spreads which take a big bite out of your profits.
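The friction point is worth a quick back-of-envelope. Here's a sketch with entirely made-up numbers (the edge, commission, and spread are all assumed for illustration): a strategy whose per-trade edge is perfectly real, but smaller than the round-trip cost of trading it.

```python
# Hypothetical numbers: a high-turnover strategy whose per-trade edge
# is real but smaller than the round-trip cost of trading it.
position = 10_000.0        # dollars per trade (assumed)
edge = 0.0020              # 0.20% average gain per round trip (assumed)
commission = 10.0          # per side, so $20 per round trip (assumed)
spread = 0.0015            # 0.15% bid/ask spread paid per round trip (assumed)
trades_per_year = 12 * 52  # a dozen trades a week

gross = position * edge
cost = 2 * commission + position * spread
net_per_trade = gross - cost
annual_net = net_per_trade * trades_per_year

print(f"gross per trade: ${gross:.2f}, cost per trade: ${cost:.2f}")
print(f"net per year:    ${annual_net:,.2f}")
```

The backtest (which ignores friction) shows a steady gain; the brokerage account shows a steady bleed.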

Another risk of backtesting is that you might take the high CAGR shown in a backtest as being literally predictive of the CAGR you can expect to see going forward. You see an investing method which in backtesting shows a CAGR of 78%, and from that you want to conclude that you can expect to gain 78% on that screen for the next decade, starting right now. Uh, no. For one thing, a screen may just be a high-beta screen, riding a bull-market wave; it may tank when the market turns south; or it may continue to return 3 times what the market returns, but if the market is flat, 3 * flat may not turn out to be very much. Or maybe its outperformance of the market is due strictly to certain kinds of stocks being favored in a certain market: large cap stocks, say, or internet stocks. If the market turns from that type and starts to favor some other kind of stock, then the screen performance going forward is likely to look very different from that 78% CAGR we saw in the backtest.
(That all assumes there was something “real” about the screen's outperformance to begin with, that it wasn't a data-mining mirage.)

A nice point of comparison is with football stats. Do you play fantasy sports? To me, fantasy sports are a great education in the practical aspects of using stats. A fantastic way to hone your intuition about the predictive value of numbers.

• Peyton Manning set a single-season record for passing touchdowns in 2004. So would he be a good bet to do that again? Uh, no. His team made a commitment to running the football more, and his TDs were way, way down in the first half of the 2005 season. He still had a great season; but he threw for 20 fewer TDs and almost a thousand fewer yards, and was not the overwhelmingly dominant force in the fantasy game that he was the prior year.
• Daunte Culpepper had a season in 2004 that was nearly as magnificent as Manning's: one of the great seasons ever by a QB, over 4700 yards passing and an additional 400 yards rushing. The next year his top wide receiver was traded to Oakland, his offensive coordinator left to take a job with Miami, one of his all-pro offensive linemen was lost due to injury, and he himself struggled with injury. The bottom line on his season was < 1600 yards passing, with 6 TDs to 12 INTs. Ouch.
• Jamal Lewis rushed for over 2,000 yards in 2003, the 2nd-highest single-season total of all time. So was he then clearly the #1 back to get, going into 2004? His rushing numbers dropped by half, as he struggled with injury; his numbers declined further in 2005, dropping below the thousand-yard mark.
• Priest Holmes was the dominant player in fantasy football a couple years running, let's say 2001-2003. In 2004 he struggled with injury and saw his touchdown total cut in half; in 2005 his numbers declined further. He turns 33 during the 2006 season, and is probably close to being out of football.
• In 2004, Muhsin Muhammad led all NFL wide receivers in fantasy production, catching 93 passes for over 1400 yards and 16 TDs. Heading into the 2005 season, would you have picked him to repeat as the most productive receiver? I would not have. Dude was 32, he had a season that was really out of line with his career numbers, and it happened to be a contract year. He signed in the offseason with the Bears, thus going from a team with a good QB (Delhomme) to a team with no QB. In 2005 his yardage totals were cut almost in half, and his TDs cut by 75%.

What do we learn from this? We learn that (A) past performance numbers are not literal predictions of future performance numbers; (B) statistical outliers are unlikely to be repeated; so (C) you need to take backtest results with a grain (or cupful) of salt; and (D) there are often extraneous factors which impact the performance of professional athletes and stock screens.

Don't be less careful in examining and drawing conclusions from investment backtest results than you would be in looking at fantasy football stats.

Those are just the risks with well-designed backtests. There is also the risk that the person who conducted the backtest didn't know what he was doing: the backtest data may be rife with Crystal Ball effects or Survivorship Bias, and therefore useless.

What is a crystal ball effect?

A crystal ball effect is introduced into backtesting when you include information that would not have been available at the time the stocks were supposed to be purchased. At its most brutally obvious, it would go something like: take the stocks that right now have the highest 1-yr price gain. Do a backtest where you buy these stocks a year ago. You'd have done pretty well, right? Yeah; imagine that. Slightly more subtle: Google just got added to the S&P 500. So now let's say we construct some backtest where we look at the group of stocks in the S&P 500 and buy the ones with greatest revenue growth or the greatest stock momentum. If we're not careful, we might have Google in our list of stocks for 2004 and 2005, so our backtest might show us buying that stock.

Crystal ball effects are tough to eradicate from backtests. If some stock XYZ got added to the list of stocks rated Timeliness=1 by ValueLine on April 20th of 1989, a backtest could have us buying that stock as part of the April '89 basket: but if we were supposed to be trading that basket the 1st or 2nd week of April, then the inclusion of XYZ is a crystal ball effect. You have to be very careful to stick to what would have been known at the time. The best way to do that is to get ahold of copies of publications which were released at the time. We've been able to do this over the years with the ValueLine and the Stock Investor Pro data we've compiled. There is other quality data out there: but some of it is expensive.
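The mechanical fix is a point-in-time filter: a stock only enters a basket if the information that qualifies it was public on or before the trade date. Here's a minimal sketch with made-up data (the tickers and dates are invented, loosely echoing the XYZ example above):

```python
# A minimal point-in-time filter: a stock may enter a basket only if
# its rating was published on or before the trade date.
from datetime import date

# (ticker, date the Timeliness=1 rating became public) -- hypothetical data
ratings = [
    ("ABC", date(1989, 3, 30)),
    ("XYZ", date(1989, 4, 20)),   # added AFTER the April trade date
]

trade_date = date(1989, 4, 7)     # 1st week of April 1989

basket = [t for t, published in ratings if published <= trade_date]
print(basket)  # XYZ is excluded: including it would be a crystal ball effect
```

Simple in a sketch; the hard part in practice is that most historical datasets record only the current state, not when each fact became known, which is why period copies of the actual publications are so valuable.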

What is Survivorship Bias?

I dunno, let's ask Wikipedia!

Suppose you're doing a backtest of a Value strategy where you buy stocks after they experience large drops in price. If you take as your set all the stocks that are trading today, you might find as you look back that stocks which experienced a precipitous drop in price make great buys: in general their stocks gained x% after the drop, and you'd be sitting on some tidy gains. The problem is, all the stocks that tumbled in price and then got suspended because the companies went out of business, they've all been excluded from your study. Where's Enron, where's Worldcom? You've introduced a crystal ball effect, in that the only stocks you're considering are those of companies that survived to the present day.

Survivorship Bias is present in a number of places, not just in backtest data. For example, consider the question of whether Mechanical Investing as discussed on this board works. Let's say in 1998 there were 500 people participating on this board and trading with MI strategies. Let's say half of them made money and half of them lost money, and the half that lost left the board in disgust, so that by 2000 there were 250 people participating on the board and actively trading with MI strategies. Then the tech crash: suppose 3 out of 5 people wiped out, and either left the board in disgust or else just didn't have enough money left to stay on after the Valentine's Day Massacre in 2002, when TMF started charging a measly $30 for participating on the boards. Now it's 2006, and we take a poll and find out there are 100 people on the MI board who have been trading MI strategies since 1998, and they've all made good money. Can we take that as proof that MI works?
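You can see how badly a survivors-only sample distorts the buy-the-dip backtest with a toy simulation (all the probabilities and returns here are invented for illustration): suppose that after a big price drop, some fraction of companies go bust and the rest rebound.

```python
# Toy simulation of survivorship bias: after a big price drop, assume 40%
# of companies go bust (-100%) and the rest rebound +30% (numbers invented).
import random

random.seed(2)

post_drop_returns = []
for _ in range(10_000):
    if random.random() < 0.40:
        post_drop_returns.append(-1.00)   # delisted: lost everything
    else:
        post_drop_returns.append(0.30)    # survived and rebounded

everyone = sum(post_drop_returns) / len(post_drop_returns)
survivors = [r for r in post_drop_returns if r > -1.00]
survivors_only = sum(survivors) / len(survivors)

print(f"true average return after a big drop:      {everyone:+.1%}")
print(f"average if you only see today's survivors: {survivors_only:+.1%}")
```

A backtest built from today's stock list sees only the second number. Enron and Worldcom never make it into the dataset, so the strategy looks like a winner even though, played live, it would have been a loser.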

What is the Multiple Hypothesis problem?

Suppose you have a source for historical stock data covering a period of about the last 10 years. (If it's the next 10 years, please show it to me.) Let's say that over that period, the “average stock” in your dataset had an annual return of x%, or rather that the set of “all stocks” in the data had that average annual return; with a certain risk/return profile as measured by a Sharpe Ratio of r. Now suppose that you generate some stock screen and backtest it over that 10-yr period. Forget what the exact steps of this screen are: we've got a lot of screens, it could be one of them or some entirely new screen no one has seen before. Question: what are the chances that your screen will show a CAGR higher than x%, with a Sharpe Ratio higher than r?

While you're thinking about that, let's consider some other tests. Suppose you have a hypothesis that some screening criterion, let's say momentum, will create a screen that outperforms “the market” consisting of your dataset. So you generate some stock screen using momentum, let's say RS26 (as in the simple screen mentioned above), and you backtest it – and it does not show any outperformance. Maybe some small amount, (x+.25)%, but something that you're pretty sure will not hold up to transaction fees. What do you do? Well, you could give up on momentum – but let's say you don't. After all, there are other momentum “lookbacks” to check. So you try RS4 (relative strength over the last 4 weeks), and RS13, and RS52 – and you find some outperformance! You get good results with RS13: CAGR approximately doubled, GSD nearly doubled, a Sharpe Ratio noticeably higher than that of the market as a whole. The other RS lookbacks didn't show anything special, but this one does. Eureka! You've found a screen that will reliably beat the market! So you take out a second mortgage and sink it all into this screen…

Ok, let's say you don't do that. What if none of the momentum criteria showed any outperformance? Well, if momentum doesn't outperform, maybe value does. You generate a screen that looks for stocks with low PEs. When you backtest it, you find only tepid performance. Maybe a different ratio: you generate a screen that looks for stocks with low Price-to-Sales ratios. Eureka! You've found a screen that will reliably beat the market! So you take out a second mortgage and sink it all into this screen…

Do you see where I'm headed here? If you keep trying stuff, you'll eventually find something that seemed to work. If you backtest a large number of possible screens, you'll get a number of different returns. They won't all return x%, the total return of the stocks in your dataset: many of them will be lower, and some of them will be higher, just by the operation of randomness. If you randomly generate a large number of screens, just by chance you are likely to see a distribution of returns around the median return of the whole market. That's if you randomly generate a bunch of screens: in practice you're not going to do that. What you're going to do if you hit on one that shows outperformance over the period is, you're going to generate screens that are slight tweaks of that one, with returns that cluster near the returns of that one; and you're going to quietly drop those ideas that are similar to ones that underperformed.
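That best-of-many effect is easy to demonstrate. The sketch below (all parameters invented) backtests 500 "screens" that are pure coin-flip stock pickers over the same ten years, then reports the best one. The winner looks like a market-beater; it is nothing of the kind.

```python
# Sketch: 500 "screens" that are pure noise around the market, each
# backtested over the same 10 years. Numbers are invented.
import random

random.seed(3)

def random_screen_cagr(years=10, market_mean=0.08, market_sd=0.20):
    growth = 1.0
    for _ in range(years):
        # each year's return is market drift plus pure noise, floored at -95%
        r = max(random.gauss(market_mean, market_sd), -0.95)
        growth *= (1 + r)
    return growth ** (1 / years) - 1

cagrs = [random_screen_cagr() for _ in range(500)]
best = max(cagrs)
typical = sorted(cagrs)[len(cagrs) // 2]

print(f"median screen CAGR: {typical:.1%}")
print(f"best screen CAGR:   {best:.1%}  <- looks like a market-beater")
```

And remember, this simulation is *fairer* than what we actually do: it generates the screens at random, rather than tweaking the winners and quietly dropping the losers.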

This is the Multiple Hypothesis problem. Each one of the screens that you tried can be considered a “hypothesis”; and if you generate many of them, just by chance you're going to find some that “worked” over the period. The problem is knowing whether those screens outperformed by chance, or whether they “really” outperformed: that is, whether we can conclude that there really is some predictive value to that screen. This is often referred to pejoratively as “datamining”. (That usage of “datamining” as a pejorative occurs only in academic discussions of investing. In business, “datamining” is something that you pay good money to be able to do. The process of going into a large data set and looking for relationships between variables is data mining. When Amazon tells you that “customers who bought this book also bought War and Peace,” that is only possible through datamining.) By the way, the problem is subtler than you think. You don't have to have actually done the backtests to have generated some hypotheses: if you know something about the market, you will have already formed some preliminary hunches about what might work and what might not. These hunches can be considered “hypotheses suggested by the data,” since you were influenced by the data in forming them.

And this is largely how we form hypotheses about the market, right? From exposure to the data: that is to say, from learning about the stock market and what methods seem to do well and what methods seem to do poorly. From reading articles; or from reading Graham's book, which itself was written to convey the knowledge Graham acquired from interacting with the market. Just about all of our hypotheses about the market are “suggested by the data”.

We rely to some extent on our ability to make a sensible story out of the results, to help us separate “real” results from results generated just by chance. Momentum is a plausible story. After all, “the trend is your friend.” Value is a plausible story. After all, “buy low and sell high.” Butter in Bangladesh is not a plausible story. However: we're human beings. We have a fantastic ability to generate stories. I was able, almost automatically, barely even thinking about it at all, to generate a story about Butter in Bangladesh that had at least a veneer of plausibility to it:

Peering into the Crystal Ball for 2006! MI # 183306 2/6/2006
Price of butter in Bangladesh, absurd. But if it keeps going on for another five years, post-discovery, then at some point I might start wondering if there is something going on. Does Bangladesh peg its currency to the Euro? Does inflation in Bangladesh imply increased use of outsourcing by US corporations? Rather than being causative, is there something causing price changes in dairy products on the Indian subcontinent, that also causes swings in things I care about? Is there something we haven't seen or figured out yet?

One of the main things our brains do is generate patterns connecting events. So the fact that we can “explain it”: does that even matter at all?

To establish statistical validity, you want to be able to exclude the null hypothesis to a certain statistical level of confidence. The “null hypothesis” is that the results you're seeing are just the product of chance.

When statisticians talk about excluding the null hypothesis, they usually say they want to get a P-value below some standard, typically .05:
a p-value is the probability of obtaining a finding at least as “impressive” as that obtained, assuming the null hypothesis is true, so that the finding was the result of chance alone.

We just cannot do that, with the data we have. It would take literally decades to generate the amount of post-discovery data it would require to exclude the null hypothesis to any standard P-value. Do we just wait around for all those years, sitting in index funds, not putting our money into anything that can't demonstrate a good P-value?
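Here's the arithmetic behind that "literally decades" claim, with assumed numbers (a 5% true edge and 20% annual volatility are illustrative guesses, not measurements). To reject "no edge" at the standard p < .05 level, the edge has to be about two standard errors of the mean away from zero:

```python
# Back-of-envelope, with assumed numbers: how many years of post-discovery
# returns would it take to reject "no edge" at p < .05 (two-sided)?
edge = 0.05        # assumed true outperformance: 5% per year over the market
sd = 0.20          # assumed annual volatility of the excess return

# need edge / (sd / sqrt(n)) >= 1.96, i.e. n >= (1.96 * sd / edge)^2
n_years = (1.96 * sd / edge) ** 2
print(f"years of live data needed: {n_years:.0f}")
```

Sixty-odd years of live, post-discovery data to validate one screen to the standard academic threshold. Nobody has that kind of patience, or lifespan.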

The TMF boards were the flashpoint to one of the most lively “datamining / risks of backtesting” debates on the internet. Read about the Foolish Four and the Dogs of the Dow: it's a good introduction to the type of concerns raised by the Multiple Hypothesis problem. Here are a couple links.

The Unauthorized Dow FAQ and Compendium, FF # 40127 4/3/2001

And here are a few more, to get a taste of how really hot the arguments about this issue got to be:

Bob Brinker & Backtesting, MI # 59741 2/27/2000
I know how to spell BS too. … Both happened to discover the Keystone screen independently. Is one's discovery more valid than the other's? I presume not. If not, then we must consider the number of possible permutations of strategies in the universe, because there might be a mad scientist in the basement of Merrill Lynch who has tried them all.

The Best Argument I've Seen, MI # 70495 5/31/2000
Are we to conclude from this that … all MI screens are curve fit, or that all statistical inference is curve fit? There's no there there.

Which BSP? Or Is It BS?, FF # 35665 8/15/2000
I am referring to the effect of testing multiple hypotheses (strategies) in the hope of turning up a market-beating strategy. It is a truism that even if you are dealing with pure noise, you will find a signal in it if you look hard enough. The more you look, the more you need to guard against this danger.

Which BSP? Or Is It BS?, FF # 35682 8/15/2000
You need to know (or at least approximate) the distribution of the screen return under the null hypothesis that it doesn't work, otherwise you can't (obviously) make any inference regarding whether it likely worked. The null distribution of a screen return chosen as the best performing screen among a number of alternatives can easily have a mean of 50%. In this case, if the screen truly doesn't work and will likely match the 20% benchmark we expect the in-sample average return to be 50%. Thus when viewed in the proper context 50% is not greater than 20%. In fact, the null distribution could have a mean of 60% whereby the strategy underperformed and will likely (assuming stability) return less than the 20% benchmark in the future. In this case 50% is less than 20% when properly measured. However, it might be that the true null distribution has a mean of 30% whereby your strategy outperforms possibly even when properly tested. In either case you need to know the null distribution. If you don't, you know nothing.

So. What is the Multiple Hypothesis problem? One of our basic premises here is that backtested results will bear some positive correlation to future returns. The Multiple Hypothesis problem is a really really fancy way of reminding us:

Past performance is no guarantee of future results.

So what's the solution?

Dude, did I say there was a solution? If there was a solution then it wouldn't really be a PROBLEM, would it?

I'm going to defer some of this to a discussion around a later question, about how one chooses from among all our screens. However, I will say that when theory fails us, like when there is no way to directly calculate an accurate solution to something, then you do what you always do: fall back on practical solutions that are “good enough”, where you can get them; use real-world rule-of-thumb approximations that are “close enough”, where you can generate them; and gather knowledge to fill in around the gap, where you can find it.

Another interesting point is that we don't necessarily need a 95% statistical certainty to proceed with using a stock-picking method. That's an awfully high threshold. A gambler could win with a much smaller edge: Vegas thrives with a relatively tiny edge, say 52%. Of course that's not exactly the same thing. It's a statistical certainty that Vegas has its edge, whereas there's no mathematical certainty surrounding the measurement of screen performance. But what I'm trying to say is, how sure do you have to be? If the statistical certainty that any particular stock screen will continue to confer an advantage is “low”, let's say a P-value of up around 40% or so, what exactly is your risk in using that screen? (Other than the obvious risk associated with volatility.) I don't think there's any claim from the multiple-hypothesis crowd that MI screens are likely to underperform post-discovery: indeed, that might be an impossible position for critics to sustain, since it would imply that there is some predictive value to the MI stock screens. So why not use stock screening? And seriously, what's the alternative? Invest in something that doesn't show outperformance in a backtest? How does that make sense?
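For what it's worth, here's the Vegas arithmetic in miniature (stake, volume, and win probability all invented): a 52% edge on even-money bets is worth only pennies per bet, but it compounds with volume.

```python
# Sketch of the Vegas point with invented numbers: a 52% edge on
# even-money bets is tiny per bet but adds up over volume.
p_win = 0.52
bets = 10_000
stake = 1.0

# expected value per $1 even-money bet: win 52% of the time, lose 48%
expected_profit = bets * stake * (p_win - (1 - p_win))
print(f"expected profit on {bets:,} $1 bets: ${expected_profit:,.0f}")
```

The analogy is imperfect for the reason stated above (the casino *knows* its edge; we only estimate ours), but it illustrates that a small, uncertain edge can still be worth exploiting.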

Here are a couple of fantastic posts:

Backtesting, MI # 60027 2/29/2000
The underlying equations are a very complicated set of differential equations - the Navier-Stokes (sp?) differential equations. To the best of my knowledge they have yet to be solved for even a specific case. However, as engineers, instead of physicists, our goal was not to solve them, but to estimate the solution in order to produce something that could fly. The joke was that if the Wright Brothers were scientists, we would never have gotten off the ground. Instead, they were engineers. Results are important – exact solutions are not. Through the use of logic and reasoning, we would simplify the problems by estimating the value of certain terms, and those that were close to 0 were ignored. The result – we could design working aircraft and rockets without having complete solutions. To offset the errors that we knew were there we would use a factor of safety – in effect, overdesign it, so it would still work even if there were some effects of the terms that we ignored. On this board, we function more as engineers than as scientists. We are looking for results, not absolute solutions to some arbitrary problem.

Datamining vs. The Hunt, MI # 81834 10/3/2000
The philosophical chasm that separates statistics from mathematics is very deep, and completely unrecognized by most people. This “chasm” is a disconnect that profoundly affects every aspect of our discourse on datamining.

On “Data Mining”, MI # 120539 3/18/2002
What we need to be clear on is what it is we are actually talking about.
1) All screens are "data mined."
2) What is curve fitting?
3) Why is curve fitting bad?

What is the Efficient Market Hypothesis?

God I love Wikipedia.
The efficient market hypothesis implies that it is not possible to consistently outperform the market - appropriately adjusted for risk - by using any information that the market already knows, except through luck or obtaining and trading on inside information. … EMH allows that when faced with new information, some investors may overreact and some may underreact. All that is required by the EMH is that investors' reactions be random enough that the net effect on market prices cannot be reliably exploited to make an abnormal profit. Under EMH, the market may, in fact, behave irrationally for a long period of time. Crashes, bubbles and depressions are all consistent with efficient market hypothesis, so long as this irrational behavior is not predictable or exploitable. There are three common forms in which the efficient market hypothesis is commonly stated - weak form efficiency, semi-strong form efficiency and strong form efficiency, each of which have different implications for how markets work.

Weak-form efficiency – No Technical analysis techniques will be able to consistently produce excess returns. Current share prices are the best, unbiased, estimate of the value of the security. Fundamental analysis can be used to identify stocks that are undervalued and overvalued.
Semi-strong form efficiency – Share prices adjust instantaneously and in an unbiased fashion to publicly available new information, so that no excess returns can be earned by trading on that information. Fundamental analysis techniques will not be able to reliably produce excess returns.
Strong-form efficiency – Share prices reflect all information and no one can earn excess returns.
Even though many fund managers have consistently beaten the market, this does not necessarily invalidate strong-form efficiency. We need to find out how many managers in fact do beat the market, how many match it, and how many underperform it. The results imply that performance relative to the market is more or less normally distributed, so that a certain percentage of managers can be expected to beat the market. Given that there are tens of thousands of fund managers worldwide, then having a few dozen star performers is perfectly consistent with statistical expectations.

Some economists, mathematicians and market practitioners cannot believe that man-made markets are strong-form efficient when there are prima facie reasons for inefficiency including the slow diffusion of information, the relatively great power of some market participants (e.g. financial institutions), and the existence of apparently sophisticated professional investors. … It may be that professional and other market participants who have discovered reliable trading rules or stratagems see no reason to divulge them to academic researchers. It might be that there is an information gap between the academics who study the markets and the professionals who work in them. … Regardless of the validity of the EMH, there exists a small number of investors who have outperformed the market over long periods of time, including Peter Lynch, Warren Buffett, and Bill Miller.

As I understand it, the Efficient Market Hypothesis implies, at least in its semi-strong form, that guys like Peter Lynch & Warren Buffett et al are just lucky. Given a very large group of investors, some of them are going to finish ahead of the market and some behind it, just by luck: the success of those lucky ones does not imply a sustainable edge on their part. And in fact, like with coin flips, over the next several years (or trials) the previously-successful investors could all seriously underperform.

The hypothesis also seems to imply that mechanical stock screening will not (can not) be a reliable method of beating the market. At least not on a risk-adjusted basis. (Risk-adjusted usually means the Sharpe Ratio.) If there is any advantage to owning stocks that meet certain screening criteria, then that will become known and it will get “priced in”: those stocks will rise to an efficient price, and there will be no extra gain to be had from buying them.

There are lots of hugely successful traders, you know the names, guys who could buy & sell you with as much financial effort as you spend going to the vending machine in your office, who chuckle at the idea that markets are “efficient”. Are those guys just the lucky ones? Or do they know something?

One small piece of evidence for or against the Efficient Market Hypothesis could be generated by the following little test: suppose you learn MI techniques from this board, and then apply them for the next decade. If you generate market-beating returns, especially on a risk-adjusted basis (which generally means the Sharpe Ratio), then I would consider that very suggestive. ;-)


What is canaricide?

The murder of canaries, of course.

It's a joke about datamining, jointly brought to life on the MI board by Elan and Sparfarkle in October 1999:

More thoughts on monthly seasonal grails, MI # 43262 10/29/1999
This is so deep down the data mine that all the parakeets are dead. Better get out before she blows.

More thoughts on monthly seasonal grails, MI # 43263 10/29/1999
HeeHee. Actually, the miners used canaries not parakeets to test whether the air was breathable.

Over the years this became a standard way to joke about screens that seem too complex or too curve-fit, and a tongue-in-cheek comment about how we work here.

Just kill the canary now, MI # 45890 11/17/1999
CAGR of 111.2%

1969-85: RS in Bear Markets, MI # 51104 12/21/1999
In the next 2-4 weeks I will use the daily data and get much more specific returns. Once I have those, I will feel somewhat comfortable letting the canary killing MI board community at the data.

T-shirt idea:
We are not “DATA MINING”
in red letters under a dead canary.

A lot of us think it's funny, but maybe you had to be there.

What is a Mound of Toast?

Heh. So we have a long-time participant on this board, named MrToast. In 2000 he was working on developing a stock screen, and made this comment:

The Monster Screens, MI # 70437 5/31/2000
I was graphing the returns with different cut-offs. 50 and 60 work well. 55 worked the best. My long-term goal is to produce visualization software to help us see how robust screens are by their topography. (A single spike would be a danger sign. A heartier, smoother mound is what I prefer.)

55 happens to be the peak, but these screens are quite robust in many ways. You can get great results at other nearby cut-offs. They also work well with 39-week returns. This screen is part of a large mound of relative outperformance during the time period. I've been looking at all the relatives of this screen. Its immediate neighbors are very strong. As are many of its more distant relatives. I'll be working on some graphic ways to show its robustness.

MI Thoughts, MI # 73266 6/30/2000
I have been doing a pretty exhaustive survey of the terrain of backtest data. There are some extremely large “mounds” of outperformance in the terrain.

Naturally this usage got named after him. It refers to varying a parameter in a screen. Say you have a step like “keep the top x% by market cap”. It's natural in screen development to use the value of x that gives you the best CAGR. Ok: but we don't want that value of x to be a datamined peak, in the sense that there's a spike there but the values next to it show a dramatic dropoff. We prefer to see a smooth slope, where the value chosen for x gives the best CAGR but the values near it are almost as good, and the values near those are pretty decent too.

An alternate way of using the “mound” would be to pick the value at the center of that mound, even if it doesn't show the very best performance of the range of parameters: the idea being that future performance will show some variance, but if we're in the center of the backtested “good” range that might help us stay in the “good” range going forward.

The “brittleness” of screening parameters is one criterion to use in evaluating the robustness of a screen.
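The mound-versus-spike idea fits in a few lines of code. Everything below is invented for illustration (the cutoffs, the CAGR numbers, and the `pick_robust_param` helper are all hypothetical, not anyone's actual tooling): instead of taking the single best backtested parameter, score each parameter by the average of its neighborhood, so a lone spike loses out to the center of a smooth mound.

```python
def neighborhood_mean(cagr_by_param, center, radius=5):
    """Average CAGR of a parameter value and its neighbors within `radius`.
    radius=5 matches one grid step of the hypothetical sweep below."""
    vals = [c for p, c in cagr_by_param.items() if abs(p - center) <= radius]
    return sum(vals) / len(vals)

def pick_robust_param(cagr_by_param, radius=5):
    """Pick the parameter whose *neighborhood* backtests best,
    rather than the single best point (which may be a datamined spike)."""
    return max(cagr_by_param,
               key=lambda p: neighborhood_mean(cagr_by_param, p, radius))

# Hypothetical backtest results: CAGR (%) for "keep top x%" cutoffs.
cagr = {40: 18.0, 45: 19.5, 50: 21.0, 55: 22.0, 60: 21.2,
        65: 19.8, 70: 14.0, 75: 26.0, 80: 13.5, 85: 12.0}

best_point = max(cagr, key=cagr.get)   # 75: the tallest bar, but a lone spike
robust = pick_robust_param(cagr)       # 55: the center of the mound
print(best_point, robust)              # -> 75 55
```

Note that the naive choice (75) sits between two dropoffs, while the mound-aware choice (55) has neighbors nearly as good as itself.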

What is Toastiness?

Named after the same guy, of course. It's an attempt from 2002 to develop a ranking that will proxy for ValueLine's Timeliness ranking. There was a long period of time when we were very disturbed about an overreliance on ValueLine's “black box” system. How do we know it will continue to work? What happens if they change it? What if they go out of business? Etc. MrToast took a shot at an alternative method. Another attempt along similar lines is Sliminess aka Zippiness.

An Alternate Universe, MI # 121078 3/25/2002

What is RRS?

Regression Relative Strength. A measurement of price momentum based on a fitted trend line rather than on two endpoint prices.

Standard relative strength (or total return) looks at a stock's return from one point in time to today: that is, it's dependent on only two stock prices, that at the beginning and that at the end of the time period. If you look at some stock charts, you can see that this method might be limited. A stock can zig-zag around a lot: its total return could vary by several percentage points based on what week or what day of the week (or what time of day!) you check it. You could theoretically check the results of the RS26 screen (ValueLine Timeliness 1 stocks, sort by 26-wk total return descending) every day for a week, and each day you could have a slightly different list.

BarryDTO had an idea. Suppose instead of looking at the two endpoints, you look at each day's return for a stock. One day it'll go up 1%, one day it'll go down 0.5%, one day it'll stay flat – if you take that series of returns, you can plot a regression line and calculate the slope of that line. Now instead of a ranking based on what price a potentially-volatile stock has on a single day, you have a ranking that rewards “smooth” price appreciation over every day of the lookback period. You can see how this is conceptually related to what technical analysts do when they draw trend lines: this is a mathematically rigorous method of “drawing” a trend line.

BarryDTO's theory was that this would give better risk-adjusted returns than the simple RS measurement would. And he carried it a step further, adding a “penalty” for stocks with high volatility. The backtesting he did seemed to bear out his theories. Regress to Win! See BarryDTO's original posts:

Daily Data Analysis: #3, MI # 82733 10/10/2000

Daily Data Analysis #4: Regress to Win, MI # 83661 10/21/2000
The data indicates that using the regression line slope may consistently improve returns compared to using total return.

Screens that use RRS are usually described with notation like this:

RRS42 = rank by the slope of the regression line over the past 42 trading days

RRS189 -2Sigma = rank by the slope of the regression line over the past 189 trading days minus 2 * the standard deviation of the stock's daily returns over the past year

• That “minus 2 Sigma” is the penalty for volatility: in this example, subtract double the standard deviation from the slope, to get the ranking.
• A typical month has 21 trading days in it, so RRS42 takes the slope over about the last 2 months, and RRS189 looks at approximately the last 9 months.
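As a rough sketch of the computation (this is my reading of the notation, not BarryDTO's actual spreadsheet math): regress log price against trading-day index, take the slope as the daily growth rate, and optionally subtract a multiple of the standard deviation of daily log returns.

```python
import math

def rrs_score(prices, lookback, sigma_mult=0.0):
    """RRS sketch: slope of the least-squares line through log(price)
    over the last `lookback` trading days, minus sigma_mult times the
    standard deviation of daily log returns over that same window.
    (One plausible reading of 'RRS42 -2Sigma'-style notation.)"""
    y = [math.log(v) for v in prices[-lookback:]]
    n = len(y)
    xbar = (n - 1) / 2                       # mean of day indices 0..n-1
    ybar = sum(y) / n
    sxx = sum((i - xbar) ** 2 for i in range(n))
    slope = sum((i - xbar) * (yi - ybar) for i, yi in enumerate(y)) / sxx
    rets = [y[i + 1] - y[i] for i in range(n - 1)]   # daily log returns
    rbar = sum(rets) / len(rets)
    sigma = math.sqrt(sum((r - rbar) ** 2 for r in rets) / (len(rets) - 1))
    return slope - sigma_mult * sigma

# Two hypothetical stocks with the same overall trend (1%/day):
smooth = [100 * 1.01 ** d for d in range(60)]                            # steady climber
choppy = [100 * 1.01 ** d * (1.05 if d % 2 else 0.95) for d in range(60)]  # whipsaws daily
# rrs_score(smooth, 42, 2) comes out well above rrs_score(choppy, 42, 2):
# the sigma penalty hammers the whipsaw stock.
```

To rank a list of stocks, compute this score for each one and sort descending, just as you would with plain relative strength.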

BarryDTO created a most excellent spreadsheet that takes a list of stocks, downloads daily price data from Yahoo for each of them, crunches the numbers, and creates a ranked list of the stocks using whichever RRS lookback and sigma combination you want. A stunning, powerful tool: lots of us use it gratefully.

Note you can apply this RRS ranking to any group of securities with daily price data. BarryDTO originally looked at the ValueLine T1 stocks: that was the data he had. But you can apply this idea to the top-ranked stocks at Zacks, or anywhere else. DreadPotato and I use this method to rank ETFs for trading, although he and I do it slightly differently. It's very versatile stuff.

This work is conceptually very closely related to LorenCobb's work on XG.

What is XG?

Exponential Growth. Very similar in concept to RRS; really it was the precursor to or inspiration for RRS.

Projected Growth of RS Picks, MI 70943 6/5/2000
The perfect stock for an RS screen would have a constant relative strength over time. The graph of the stock price of this wonder-stock would be beautifully exponential, with total predictability. We would all make a fortune investing in this stock, with no risk at all. In the real world, which I occasionally visit, things are more chancy. But suppose we could evaluate the stocks picked by any of our mechanical screens with respect to how closely they come to the ideal?

Exponential Growth Backtest, MI # 79616 9/6/2000
The fundamental concept that risk can be reduced by ranking stocks with risk-adjusted growth rates seems to be confirmed in this table. Risk as measured by GSD generally increases within each column, reading from top to bottom.
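The same log-linear fit used for RRS, expressed as an annualized growth rate, gives a minimal sketch of the XG idea (an illustration of the concept, not Loren's exact formulation; the 252 trading days per year is my assumption):

```python
import math

def xg_rank(prices):
    """XG sketch: fit price ~ P0 * exp(g*t) by least squares on log(price),
    then return the implied annualized growth rate (252 trading days/yr).
    The closer a stock tracks a clean exponential, the better this fit."""
    y = [math.log(p) for p in prices]
    n = len(y)
    xbar = (n - 1) / 2
    ybar = sum(y) / n
    slope = sum((i - xbar) * (yi - ybar) for i, yi in enumerate(y)) \
            / sum((i - xbar) ** 2 for i in range(n))
    return math.exp(252 * slope) - 1   # fractional growth per year

# A perfect 1%/day wonder-stock over one quarter (63 trading days):
ideal = [50 * 1.01 ** t for t in range(63)]
# xg_rank(ideal) recovers 1.01**252 - 1, the annualized growth rate.
```

Ranking stocks by this rate (or by the rate adjusted for GSD, as in Loren's table) is the “risk-adjusted growth rate” idea the quoted posts describe.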

Loren's web site has some further discussions of this, and of other matters:

Kaellner has been posting these picks weekly for a long time, bless him:

Exponential Growth Rates, MI # 186943 4/16/2006
Here are the Exponential Growth rankings for Friday, April 14, 2006.

What is Radiscript?

RadiScript is a language for precisely defining a stock screen; Excel macros then use those screen definitions plus downloaded stock data to generate the “picks” or “rankings”. Strictly speaking, RadiScript is part of RadiScreen, but we tend to refer to the whole shebang as RadiScript. If you have a data provider like ValueLine or Stock Investor Pro, you can export the latest data update into spreadsheets and then run the RadiScript stuff to generate your screen picks. This is convenient for two reasons:

1. Screen definitions can sometimes be ambiguous when described in natural language; or steps can create different results when run in a different order. An automated process reduces error, and makes it possible for different people at different locations to replicate the same picks.

2. The screening utilities provided with the commercial data providers can be clunky. Great tools, but they don't have anywhere near the flexibility of something like Excel.

RadiScript is pretty intuitive to read. Here's an example:

Define {RS26WK}
Uses [Timeliness Rank] [Total Return 26-Week]
Deblank [Timeliness Rank] [Total Return 26-Week]
Keep :[Timeliness Rank]=1
Sort Descending [Total Return 26-Week]
; Top :10

This piece of Radiscript:

1. Defines a screen named “RS26WK”;
2. Uses 2 fields delivered with the ValueLine data;
3. Discards any stocks for which data is not available in 2 specific fields;
4. Keeps the Timeliness 1 stocks;
5. Sorts by 26-week return, descending;
6. Keeps the top 10 from the sort.

Some screens are more complicated, and some screens define new variables to be used in the sorting and filtering, so some chunks of Radiscript are longer & more complex than others. But this is basically how it goes. You state which fields are used, you deblank some of them, you KEEP stocks meeting certain criteria, you SORT based on other criteria, and at the end you take the top of the list.
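The deblank/keep/sort/top pipeline translates directly into any scripting language. Here's a sketch of the RS26WK steps in Python on made-up data (the tickers and field values are hypothetical; the field names mirror the RadiScript above):

```python
# Hypothetical ValueLine-style records; None marks a blank field.
stocks = [
    {"ticker": "AAA", "Timeliness Rank": 1, "Total Return 26-Week": 0.42},
    {"ticker": "BBB", "Timeliness Rank": 1, "Total Return 26-Week": None},  # deblanked
    {"ticker": "CCC", "Timeliness Rank": 2, "Total Return 26-Week": 0.55},  # not T1
    {"ticker": "DDD", "Timeliness Rank": 1, "Total Return 26-Week": 0.67},
]
fields = ["Timeliness Rank", "Total Return 26-Week"]

# Deblank: discard stocks missing data in either field.
picks = [s for s in stocks if all(s.get(f) is not None for f in fields)]
# Keep: Timeliness 1 stocks only.
picks = [s for s in picks if s["Timeliness Rank"] == 1]
# Sort descending by 26-week total return.
picks.sort(key=lambda s: s["Total Return 26-Week"], reverse=True)
# Top 10.
picks = picks[:10]

print([s["ticker"] for s in picks])   # -> ['DDD', 'AAA']
```

Running the same unambiguous steps is exactly what lets different people at different locations replicate identical picks.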

Because it's basically a scripting language, RadiScript has become the standard for building & testing screens using Keelix' SIPro backtester.

RadiScreen Introduction, MI # 120668 3/20/2002

Introduction to RadiScreen, MI # 123421 4/21/2002

What is Belize?

A country in Central America, neighbor to Mexico and Guatemala.

After this post from Sux2BeU:

Microsoft's 10-stock portfolio, MI # 11151 12/7/1998
I hope you one day will join US in our vision of the future and that, maybe one day, we can all get together and sip margaritas on the coast of Belize in retirement

it became emblematic of that fantasy destination we were all going to retire to, after our Mechanical Investing made us rich. “Belize Strategies” are “get so rich you'll never have to work again and can live wherever you want” strategies. Not that “where you want” is necessarily Belize: it could be Chamonix or Jackson Hole or Anchorage. You get the idea. “Hope to see you all in Belize.”

Who is Lord Voldemort?

The fictional arch-villain of the Harry Potter series, of course. An evil wizard bent on securing unmatched power and achieving immortality through the practice of Dark Magic. We started using that (or LV) as a nom de guerre for ValueLine in 2001, when they pressured TMF to stop posting the rankings for the “Foolish Workshop screens.” We got a little paranoid on this board, thinking that ValueLine might even run searches on the board for references to themselves, and that we therefore shouldn't invoke them by actual name. This whole episode shook our faith a little in ValueLine as the beneficent grantor of our economic good-fortune that we'd always sort of assumed:

Workshop Out of Order, MI # 90540 1/18/2001
You've probably noticed that the Workshop area has been disabled. We hope this is a temporary situation. Here's what's happening: We've been working to develop a business partnership with Value Line -- one that will be good for all of us. As you know we've been operating rather informally, but now we are working to put into place an official licensing agreement. In the meantime, we have restricted access to all Fool content that references Value Line's ratings. We hope we'll soon have in place a partnership agreement that will allow us to reopen the Workshop.

Discontinued Strategies Workshop (1/8/01-5/4/01)
The Workshop Portfolio died aborning. Two weeks after its first trades, the portfolio was suspended and the entire Workshop area temporarily blocked at the request of the financial information company Value Line. Value Line objected to the fact some of our mechanical stock selection processes allowed a reader to infer that Value Line had ranked a particular stock number 1 for Timeliness.

It's been fun, MI # 92628 2/8/2001

MI and the Fool, MI # 92966 2/10/2001

MI Board Archival, MI # 93048 2/10/2001

Some Proposals, MI # 93100 2/11/2001

Decentralize = Devolve, MI # 93130 2/11/2001

The Future of the Workshop, MI # 93395 2/13/2001

So that was interesting. A year later we had this:

Failed Attempt To Help Add Value, MI # 122395 4/12/2002

Smoke, Fire, and Voldemort, MI # 122709 4/14/2002

I'm trying to assess the real, long-term impact of that stuff – and frankly, I'm not sure what it was. I know some of us made concerted efforts to expand our backtested capability in SIP, so that we had a viable complete replacement for VL. The big man in that was Keelix. We already had WER, though of course that has always been just off the radar. But we were never really dependent on the workshop for screens: for an influx of new blood, yes; for screens, no. And just about everyone here makes a big effort to be self-sufficient in their screening. In some ways this was just a storm that blew over. A hurricane. The community is still here, still vital; Elan is still here (still vital); we still use VL, largely because of Jamie's backtester. Maybe we don't like VL as much as we used to; we're certainly not as dependent on them. But we still use them. In the end, things remained largely the same.
