Before you read this post, read the (Wagner award winning) Know Your System! – Turning Data Mining from Bias to Benefit through System Parameter Permutation by Dave Walton.
The concept is essentially to use all the results from a brute force optimization and pick the median as the best estimate of out of sample performance. The first step is:
Parameter scan ranges for the system concept are determined by the system developer.
And herein lies the main problem. The scan range will determine the median. If the range is too wide, the estimate will be too low and based on data that is essentially irrelevant because the trader would never actually pick that combination of parameters. If the range is too narrow, the entire exercise is pointless. But the author provides no way of picking the optimal range a priori (because no such method exists). And of course, as is mentioned in the paper, repeated applications of SPP with different ranges is problematic.
To illustrate, let’s use my UDIDSRI post from October 2012. The in sample period will be the time before that post and the “out of sample” period will be the time after it; the instrument is QQQ, and the strategy is simply to go long at the close when UDIDSRI is below X (the value on the x-axis below).
As you can see, the relationship between next-day returns and UDIDSRI is quite stable. The out of sample returns are higher across most of the range but that’s just an artifact of the giant bull market in the out of sample period. What would have been the optimal SPP range in October 2012? What is the optimal SPP range in hindsight? Would the result have been useful? Ask yourself these questions for each chart below.
Let’s have a look at SPY:
Whoa. The optimum has moved to < 0.05. Given a very wide range, SPP would have made a correct prediction in this case. But is this a permanent shift or just a result of a small sample size? Let’s see the results for 30 equity ETFs:
Well, that’s that. What about SPP in comparison to other methods?
The use of all available market data enables the best approximation of the long-run so the more market data available, the more accurate the estimate.
This is not the case. The author’s criticism of CV is that it makes “Inefficient use of market data”, but that’s a bad way of looking at things. CV uses all the data (just not in “one go”) and provides us with actual estimates of out of sample performance, whereas SPP just makes an “educated guess”. A guess that is 100% dependent on an arbitrarily chosen parameter range. Imagine, for example, two systems: one has stable optimal parameters over time, while the other one does not. The implications in terms of out of sample performance are obvious. CV will accurately show the difference between the two, while SPP may not. Depending on the range chosen, SPP might severely under-represent the true performance of the stable system. There’s a lot of talk about “regression to the mean”, but what mean is that?
SPP minimizes standard error of the mean (SEM) by using all available market data in the historical simulation.
This is true, but again what mean? The real issue isn’t the error of the estimate, it’s whether you’re estimating the right thing in the first place. CV’s data splitting isn’t an arbitrary mistake done to increase the error! There’s a point, and that is measuring actual out of sample performance given parameters that would actually have been chosen.
tl;dr: for some systems SPP is either pointless or just wrong. For some other classes of systems where out of sample performance can be expected to vary across a range of parameters, SPP will probably produce reasonable results. Even in the latter case, I think you’re better off sticking with CV.
Jaffray Woodriff, who runs QIM, a highly successful systematic fund, has provided enough details about his data mining approach in various interviews (particularly the one in the excellent book Hedge Fund Market Wizards) that I think I can approximate it. Even though QIM has been lagging a bit the last few years, they have an excellent track record, so their approach is certainly worthy of imitation if possible. They trade commodities, currencies, etc. so the approach seems to be highly portable. And while they suffer from significant price impact issues (not to mention being forced into longer holding periods) due to their size, a small trader could probably do far better with the same strategies.
The approach, as much as he has detailed it, goes as follows:
- Generate random data.
- Mine it for trading strategies.
- The best strategies resulting from the random data are now the benchmark that you have to beat using real data.
- Mine the real data, discard anything that isn’t better than the best models from the random data (this ensures that you have found an actual edge despite the excessive mining).
- Use cross validation to more accurately estimate the performance of the models and avoid curve fitting.
- Test the model out of sample, and retain it if it performs reasonably well compared to the in-sample results.
The point is essentially to generate an environment in which we know that we have no edge whatsoever, mine the data for the best possible results, and then use those as a benchmark that we have to clear in order to prove that an edge exists in the real data.
What they do after this is also quite interesting and important: checking the correlation between the newly discovered models and the models they already use. This ensures that any new “edge” they incorporate is a novel one and not simply a copy of something they already have. Supposedly this approach has yielded over 1500 different signals which they then use to trade, on medium-term horizons (if I remember correctly their average holding period is roughly one week). The issue of combining the predictions of 1500 signals into a decision to trade or not trade is beyond the scope of this post, but it’s a very interesting “ensemble model” problem.
It is clear that the approach requires not only rigorous statistical work, but also tons and tons of computing power (the procedure is highly parallelizable however, so you can just throw hardware at it to make it go faster). One potentially interesting way of tempering that requirement would be using genetic algorithms instead of brute force to search for new strategies. There are tricky issues with that approach, though: constructing the genome so that it can describe all possible trading models we want to look at, for example. How does one encode a wide array of chart patterns in a genome? There do not seem to be obvious/intuitive solutions.
Generating random data sets
There are several issues that have to be looked at here. Do we randomly sample the real data or do we use the parameters of that data and plug it into a known statistical distribution to generate completely new numbers? How many times do we repeat this procedure? In either case we are bound to lose some features of real financial time series, but this is probably a good thing since those features may result in true exploitable edges. It is important to generate a healthy number of data series. Some are simply going to be “better” than others for any one particular trading model, so testing over a single randomly generated series is not enough.
In general we want at least the semblance of a “real” data series. As such we can’t simply select random OHLC data; it would just result in a nonsensical time series with giant gaps all over the place. Instead I will use the following procedure:
- Start by selecting a random day’s OHLC points. This forms our first data point.
- Select any random day, and compute the day’s (close to close) percentage return from the previous day.
- Use this value to generate the next fake closing price.
- From that same (real) day, calculate the OHL prices in terms relative to the closing price.
- Use those relative prices to generate the fake OHL prices.
I find this approach gives rather good results, producing series that look realistic and give the appearance of trends, different volatility regimes, etc.
Naturally I can’t test the billions upon billions of models that they test at QIM, and taking the model-agnostic approach is currently beyond my abilities. I can kind-of get around the issue by testing a very narrow range of models: moving average crossovers (another simple and interesting thing to test would be 1/2 day candlestick patterns). This still leaves a significant number of parameters to test:
- The type of moving average to use (simple, exponential, or Hull)
- The length of each moving average.
- The values that the moving averages will be based on (open, high, low, or close).
- The holding period. I’ll be using a technical entry, but a partially time-based exit. This may or may not be a good idea, but I’m running with it.
- Trend-following vs contrarian (i.e. trade in the direction of the “fast” moving average or against it).
Evaluating the results
An important question remains: what metric do we use to evaluate the models? The use of cross validation presents unique problems in performance measurement, and we have to take these into account from this stage, because these results will be used for comparison to the real ones later on.
Drawdown is a problematic measure because drawdown extremes tend to be rare. When dividing a set into N folds for cross validation, a set of parameters may be rejected simply because a certain period generated a high drawdown, despite this drawdown being consistent with long-term expectations.
Another issue arises with the use of annualized returns: they may be rather meaningless if the signal fires very frequently. If what we care about is short-term predictability, it may be more prudent to look at average daily returns after a signal, instead of CAGR. This could also be ameliorated by taking trading costs into account, as weak but frequent signals would be filtered out.
In the end, many of these decisions depend on the trader’s choice of style. Every trader must decide for him or her self what risks they care about, and in what proportion to each other. As an attempt at a balanced performance metric, I will be using my Trading System Consistency, Drawdown, Return Asymmetry, Volatility, and Profit Factor Combination Metric (or TRASYCODRAVOPFACOM for short), which is calculated as follows:
St. Dev. is the annualized standard deviation of daily returns, and the profit factor is calculated based on daily returns.
The TRASYCODRAVOPFACOM still has weaknesses: a set of parameters may pick only a tiny amount of trades over the years. If they’re successful enough, it can lead to a high score but a useless signal. To avoid this I’ll also be setting the minimum number of trades to 100, a reasonable hurdle given the 17 years long sample.
The random return results
Using a brute force approach, I collected approximately 704,000 results from 5 randomly generated series. It took several hours on my overclocked i5-2500K, so it’s definitely not a viable “real-world” approach (I am a terrible programmer, so some of the slowness is of my own making). The results look like you’d expect them to, with a few outliers at the top and to bottom:
Here are the best values achieved:
Note that this isn’t a “universal” hurdle: it’s a hurdle for this specific subset of moving average signals, on the GBPUSD pair. I am certain that a wide array of signals and data would generate higher hurdles.
Brute force takes ages, even for just 5 return series, which is far too low to draw any conclusions. Are there any faster ways than brute force to find the best possible results from our random data? If this were a “normal” dataset, I would say yes, of course! However I was not sure about this case due to the randomly generated data that we are dealing with.
If the data is random, does it follow that the optimal strategy parameters are also randomly distributed? Are they uniformly distributed or are there “clusters” that, due to somehow exploiting the structure of the time series, perform better or worse than the average? The question is: is the performance slope around local maxima smooth, or not? A simple method to make this thing go faster is to throw the problem into a genetic algorithm, but a GA will offer no performance improvement if the performance is uniformly randomly distributed.
Testing this is simple: I just ran a GA search on the same 5 series I brute forced above. If the GA results are similar to the brute force results, we can use the GA and save a lot of time. As long as there are enough populations, and they are large enough (I settled on 4 populations with 40 chromosomes each), the results are “close enough”: roughly 3-20% lower than the brute force (max CAGR was 7.043%, max avg. daily return was 0.176%). It might be a good idea to scale the GA results by, say, an additional 10-20% in order to make up for this deficit.
I then generated 100 series and put the GA to use. Here are the results:
And here are the distributions of maximum values achieved for each individual series:
These results have set the bar rather high. One might imagine that throwing out everything below this hurdle will leave us with very little (nothing?) in the end. But if Woodriff is to be believed, he has found upwards of 1500 signals that perform better than the hurdle (and that’s 1500 signals that were uncorrelated enough with each other that they were added to their models). So there’s got to be a lot of interesting stuff to find!
In part 2 I will take a look at cross validation and what we can do with the real data.