Doing the Jaffray Woodriff Thing (Kinda), Part 1
Jaffray Woodriff, who runs QIM, a highly successful systematic fund, has provided enough details about his data mining approach in various interviews (particularly the one in the excellent book Hedge Fund Market Wizards) that I think I can approximate it. Even though QIM has been lagging a bit the last few years, they have an excellent track record, so their approach is certainly worthy of imitation if possible. They trade commodities, currencies, etc. so the approach seems to be highly portable. And while they suffer from significant price impact issues (not to mention being forced into longer holding periods) due to their size, a small trader could probably do far better with the same strategies.
The approach, as much as he has detailed it, goes as follows:
- Generate random data.
- Mine it for trading strategies.
- The best strategies resulting from the random data are now the benchmark that you have to beat using real data.
- Mine the real data, discard anything that isn’t better than the best models from the random data (this ensures that you have found an actual edge despite the excessive mining).
- Use cross validation to more accurately estimate the performance of the models and avoid curve fitting.
- Test the model out of sample, and retain it if it performs reasonably well compared to the in-sample results.
The point is essentially to generate an environment in which we know that we have no edge whatsoever, mine the data for the best possible results, and then use those as a benchmark that we have to clear in order to prove that an edge exists in the real data.
What they do after this is also quite interesting and important: checking the correlation between the newly discovered models and the models they already use. This ensures that any new “edge” they incorporate is a novel one and not simply a copy of something they already have. Supposedly this approach has yielded over 1500 different signals which they then use to trade, on medium-term horizons (if I remember correctly their average holding period is roughly one week). The issue of combining the predictions of 1500 signals into a decision to trade or not trade is beyond the scope of this post, but it’s a very interesting “ensemble model” problem.
It is clear that the approach requires not only rigorous statistical work, but also tons and tons of computing power (the procedure is highly parallelizable however, so you can just throw hardware at it to make it go faster). One potentially interesting way of tempering that requirement would be using genetic algorithms instead of brute force to search for new strategies. There are tricky issues with that approach, though: constructing the genome so that it can describe all possible trading models we want to look at, for example. How does one encode a wide array of chart patterns in a genome? There do not seem to be obvious/intuitive solutions.
Generating random data sets
There are several issues that have to be looked at here. Do we randomly sample the real data or do we use the parameters of that data and plug it into a known statistical distribution to generate completely new numbers? How many times do we repeat this procedure? In either case we are bound to lose some features of real financial time series, but this is probably a good thing since those features may result in true exploitable edges. It is important to generate a healthy number of data series. Some are simply going to be “better” than others for any one particular trading model, so testing over a single randomly generated series is not enough.
In general we want at least the semblance of a “real” data series. As such we can’t simply select random OHLC data; it would just result in a nonsensical time series with giant gaps all over the place. Instead I will use the following procedure:
- Start by selecting a random day’s OHLC points. This forms our first data point.
- Select any random day, and compute the day’s (close to close) percentage return from the previous day.
- Use this value to generate the next fake closing price.
- From that same (real) day, calculate the OHL prices in terms relative to the closing price.
- Use those relative prices to generate the fake OHL prices.
Naturally I can’t test the billions upon billions of models that they test at QIM, and taking the model-agnostic approach is currently beyond my abilities. I can kind-of get around the issue by testing a very narrow range of models: moving average crossovers (another simple and interesting thing to test would be 1/2 day candlestick patterns). This still leaves a significant number of parameters to test:
- The type of moving average to use (simple, exponential, or Hull)
- The length of each moving average.
- The values that the moving averages will be based on (open, high, low, or close).
- The holding period. I’ll be using a technical entry, but a partially time-based exit. This may or may not be a good idea, but I’m running with it.
- Trend-following vs contrarian (i.e. trade in the direction of the “fast” moving average or against it).
Evaluating the results
An important question remains: what metric do we use to evaluate the models? The use of cross validation presents unique problems in performance measurement, and we have to take these into account from this stage, because these results will be used for comparison to the real ones later on.
Drawdown is a problematic measure because drawdown extremes tend to be rare. When dividing a set into N folds for cross validation, a set of parameters may be rejected simply because a certain period generated a high drawdown, despite this drawdown being consistent with long-term expectations.
Another issue arises with the use of annualized returns: they may be rather meaningless if the signal fires very frequently. If what we care about is short-term predictability, it may be more prudent to look at average daily returns after a signal, instead of CAGR. This could also be ameliorated by taking trading costs into account, as weak but frequent signals would be filtered out.
In the end, many of these decisions depend on the trader’s choice of style. Every trader must decide for him or her self what risks they care about, and in what proportion to each other. As an attempt at a balanced performance metric, I will be using my Trading System Consistency, Drawdown, Return Asymmetry, Volatility, and Profit Factor Combination Metric (or TRASYCODRAVOPFACOM for short), which is calculated as follows:
The TRASYCODRAVOPFACOM still has weaknesses: a set of parameters may pick only a tiny amount of trades over the years. If they’re successful enough, it can lead to a high score but a useless signal. To avoid this I’ll also be setting the minimum number of trades to 100, a reasonable hurdle given the 17 years long sample.
The random return results
Using a brute force approach, I collected approximately 704,000 results from 5 randomly generated series. It took several hours on my overclocked i5-2500K, so it’s definitely not a viable “real-world” approach (I am a terrible programmer, so some of the slowness is of my own making). The results look like you’d expect them to, with a few outliers at the top and to bottom:
Here are the best values achieved:
Note that this isn’t a “universal” hurdle: it’s a hurdle for this specific subset of moving average signals, on the GBPUSD pair. I am certain that a wide array of signals and data would generate higher hurdles.
Brute force takes ages, even for just 5 return series, which is far too low to draw any conclusions. Are there any faster ways than brute force to find the best possible results from our random data? If this were a “normal” dataset, I would say yes, of course! However I was not sure about this case due to the randomly generated data that we are dealing with.
If the data is random, does it follow that the optimal strategy parameters are also randomly distributed? Are they uniformly distributed or are there “clusters” that, due to somehow exploiting the structure of the time series, perform better or worse than the average? The question is: is the performance slope around local maxima smooth, or not? A simple method to make this thing go faster is to throw the problem into a genetic algorithm, but a GA will offer no performance improvement if the performance is uniformly randomly distributed.
Testing this is simple: I just ran a GA search on the same 5 series I brute forced above. If the GA results are similar to the brute force results, we can use the GA and save a lot of time. As long as there are enough populations, and they are large enough (I settled on 4 populations with 40 chromosomes each), the results are “close enough”: roughly 3-20% lower than the brute force (max CAGR was 7.043%, max avg. daily return was 0.176%). It might be a good idea to scale the GA results by, say, an additional 10-20% in order to make up for this deficit.
I then generated 100 series and put the GA to use. Here are the results:
And here are the distributions of maximum values achieved for each individual series:
These results have set the bar rather high. One might imagine that throwing out everything below this hurdle will leave us with very little (nothing?) in the end. But if Woodriff is to be believed, he has found upwards of 1500 signals that perform better than the hurdle (and that’s 1500 signals that were uncorrelated enough with each other that they were added to their models). So there’s got to be a lot of interesting stuff to find!
In part 2 I will take a look at cross validation and what we can do with the real data.