Jaffray Woodriff, who runs QIM, a highly successful systematic fund, has provided enough details about his data mining approach in various interviews (particularly the one in the excellent book Hedge Fund Market Wizards) that I think I can approximate it. Even though QIM has been lagging a bit the last few years, they have an excellent track record, so their approach is certainly worthy of imitation if possible. They trade commodities, currencies, etc. so the approach seems to be highly portable. And while they suffer from significant price impact issues (not to mention being forced into longer holding periods) due to their size, a small trader could probably do far better with the same strategies.
The approach, as much as he has detailed it, goes as follows:
- Generate random data.
- Mine it for trading strategies.
- The best strategies resulting from the random data are now the benchmark that you have to beat using real data.
- Mine the real data and discard anything that isn’t better than the best models from the random data (this ensures that you have found an actual edge despite the excessive mining).
- Use cross validation to more accurately estimate the performance of the models and avoid curve fitting.
- Test the model out of sample, and retain it if it performs reasonably well compared to the in-sample results.
The point is essentially to generate an environment in which we know that we have no edge whatsoever, mine the data for the best possible results, and then use those as a benchmark that we have to clear in order to prove that an edge exists in the real data.
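A minimal sketch of that benchmark idea, in Python. Everything in it is an illustrative assumption rather than QIM's actual method: the Gaussian random walk, the toy momentum rule, and the names `random_walk`, `momentum_score`, `best_score`, `hurdle` and `beats_hurdle` are all hypothetical.

```python
import random

def random_walk(n_days, rng, daily_vol=0.01):
    """Driftless price series: by construction there is no edge to find."""
    price, prices = 100.0, []
    for _ in range(n_days):
        price *= 1.0 + rng.gauss(0.0, daily_vol)
        prices.append(price)
    return prices

def momentum_score(prices, lookback, direction):
    """Mean next-day return of a toy rule: follow (direction=1) or fade
    (direction=-1) the sign of the last `lookback`-day price change."""
    total, count = 0.0, 0
    for t in range(lookback, len(prices) - 1):
        sign = 1.0 if prices[t] > prices[t - lookback] else -1.0
        total += direction * sign * (prices[t + 1] / prices[t] - 1.0)
        count += 1
    return total / count

def best_score(prices, lookbacks):
    """Best score found by exhaustively 'mining' one series."""
    return max(momentum_score(prices, lb, d)
               for lb in lookbacks for d in (1, -1))

def hurdle(n_series, n_days, lookbacks, seed=0):
    """Best score achieved on any of the random series: the bar to clear."""
    rng = random.Random(seed)
    return max(best_score(random_walk(n_days, rng), lookbacks)
               for _ in range(n_series))

def beats_hurdle(real_prices, lookbacks, hurdle_value):
    """Parameter sets whose score on the real data exceeds the hurdle."""
    return [(lb, d) for lb in lookbacks for d in (1, -1)
            if momentum_score(real_prices, lb, d) > hurdle_value]
```

Because the sketch tests both the trend-following and the contrarian version of each rule, the best score on random data is never negative, which is exactly why a raw positive backtest proves nothing on its own.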
What they do after this is also quite interesting and important: checking the correlation between the newly discovered models and the models they already use. This ensures that any new “edge” they incorporate is a novel one and not simply a copy of something they already have. Supposedly this approach has yielded over 1500 different signals which they then use to trade, on medium-term horizons (if I remember correctly their average holding period is roughly one week). The issue of combining the predictions of 1500 signals into a decision to trade or not trade is beyond the scope of this post, but it’s a very interesting “ensemble model” problem.
It is clear that the approach requires not only rigorous statistical work, but also tons and tons of computing power (the procedure is highly parallelizable however, so you can just throw hardware at it to make it go faster). One potentially interesting way of tempering that requirement would be using genetic algorithms instead of brute force to search for new strategies. There are tricky issues with that approach, though: constructing the genome so that it can describe all possible trading models we want to look at, for example. How does one encode a wide array of chart patterns in a genome? There do not seem to be obvious/intuitive solutions.
Generating random data sets
There are several issues that have to be looked at here. Do we randomly sample the real data, or do we estimate the parameters of that data and plug them into a known statistical distribution to generate completely new numbers? How many times do we repeat this procedure? In either case we are bound to lose some features of real financial time series, but this is probably a good thing since those features may result in true exploitable edges. It is important to generate a healthy number of data series. Some are simply going to be “better” than others for any one particular trading model, so testing over a single randomly generated series is not enough.
In general we want at least the semblance of a “real” data series. As such we can’t simply select random OHLC data; it would just result in a nonsensical time series with giant gaps all over the place. Instead I will use the following procedure:
- Start by selecting a random day’s OHLC points. This forms our first data point.
- Select any random day, and compute the day’s (close to close) percentage return from the previous day.
- Use this value to generate the next fake closing price.
- From that same (real) day, calculate the OHL prices in terms relative to the closing price.
- Use those relative prices to generate the fake OHL prices.
I find this approach gives rather good results, producing series that look realistic and give the appearance of trends, different volatility regimes, etc.
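The procedure above can be sketched roughly as follows. This is my reading of the steps, not tested production code; the function name `synthetic_ohlc` and the representation of a bar as an `(open, high, low, close)` tuple are assumptions made for illustration.

```python
import random

def synthetic_ohlc(bars, n, seed=None):
    """Build n fake (open, high, low, close) bars by resampling real days.

    bars: list of real (open, high, low, close) tuples in date order.
    """
    rng = random.Random(seed)
    # Step 1: a random real day becomes the first fake bar.
    fake = [bars[rng.randrange(len(bars))]]
    for _ in range(n - 1):
        # Step 2: pick any real day that has a previous day to compare with.
        i = rng.randrange(1, len(bars))
        o, h, l, c = bars[i]
        prev_real_close = bars[i - 1][3]
        # Step 3: apply that day's close-to-close return to the last fake close.
        close = fake[-1][3] * (c / prev_real_close)
        # Steps 4-5: keep the day's open/high/low in the same proportion
        # to its close as on the real day.
        fake.append((close * (o / c), close * (h / c), close * (l / c), close))
    return fake
```

Since each fake bar inherits its shape from a single real day, the low never ends up above the high, and the only "gaps" are the ones implied by the real day's own open relative to its close-to-close move.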
Naturally I can’t test the billions upon billions of models that they test at QIM, and taking the model-agnostic approach is currently beyond my abilities. I can kind-of get around the issue by testing a very narrow range of models: moving average crossovers (another simple and interesting thing to test would be 1/2 day candlestick patterns). This still leaves a significant number of parameters to test:
- The type of moving average to use (simple, exponential, or Hull)
- The length of each moving average.
- The values that the moving averages will be based on (open, high, low, or close).
- The holding period. I’ll be using a technical entry, but a partially time-based exit. This may or may not be a good idea, but I’m running with it.
- Trend-following vs contrarian (i.e. trade in the direction of the “fast” moving average or against it).
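To give a feel for the size of this search space, here is a rough sketch of the parameter grid and a basic crossover detector. The names (`parameter_grid`, `crossover_entries`, `MA_FUNCS`) are hypothetical, the Hull moving average is left out to keep the example short, and the holding-period exit is not modelled.

```python
from itertools import product

def sma(values, n):
    """Simple moving average; None until n values are available."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= n:
            total -= values[i - n]
        out.append(total / n if i >= n - 1 else None)
    return out

def ema(values, n):
    """Exponential moving average seeded with the first value."""
    alpha = 2.0 / (n + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

MA_FUNCS = {"simple": sma, "exponential": ema}  # Hull omitted in this sketch
FIELDS = ("open", "high", "low", "close")

def parameter_grid(lengths, holding_periods):
    """Yield every combination with fast length < slow length."""
    for ma, fast, slow, field, hold, direction in product(
            MA_FUNCS, lengths, lengths, FIELDS, holding_periods, (1, -1)):
        if fast < slow:
            yield {"ma": ma, "fast": fast, "slow": slow, "field": field,
                   "hold": hold, "direction": direction}

def crossover_entries(values, ma, fast, slow, direction=1):
    """Return (index, side) for each cross of the fast MA over the slow MA.

    direction=1 trades with the fast MA (trend-following),
    direction=-1 trades against it (contrarian).
    """
    f, s = MA_FUNCS[ma](values, fast), MA_FUNCS[ma](values, slow)
    entries = []
    for t in range(1, len(values)):
        if None in (f[t - 1], s[t - 1], f[t], s[t]):
            continue
        prev, cur = f[t - 1] - s[t - 1], f[t] - s[t]
        if prev <= 0 < cur:
            entries.append((t, direction))
        elif prev >= 0 > cur:
            entries.append((t, -direction))
    return entries
```

Even with only three candidate lengths and two holding periods, `parameter_grid([5, 10, 20], [1, 5])` already yields 96 combinations, so the count grows quickly once realistic ranges are used.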
Evaluating the results
An important question remains: what metric do we use to evaluate the models? The use of cross validation presents unique problems in performance measurement, and we have to take these into account from this stage, because these results will be used for comparison to the real ones later on.
Drawdown is a problematic measure because drawdown extremes tend to be rare. When dividing a set into N folds for cross validation, a set of parameters may be rejected simply because a certain period generated a high drawdown, despite this drawdown being consistent with long-term expectations.
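A small sketch of why this happens, assuming contiguous (unshuffled) folds and an equity curve given as a plain list of account values. The helper names `max_drawdown` and `contiguous_folds` are hypothetical.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def contiguous_folds(items, n_folds):
    """Split a sequence into n_folds consecutive chunks of near-equal size."""
    size, extra = divmod(len(items), n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        end = start + size + (1 if k < extra else 0)
        folds.append(items[start:end])
        start = end
    return folds
```

For an equity curve such as `[100, 110, 105, 120, 60, 70, 80, 90, 100, 130]` split into two folds, the first fold shows a 50% drawdown and the second shows none, so a per-fold drawdown limit would judge the same strategy very differently depending on which fold it looks at.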
Another issue arises with the use of annualized returns: they may be rather meaningless if the signal fires very frequently. If what we care about is short-term predictability, it may be more prudent to look at average daily returns after a signal, instead of CAGR. This could also be ameliorated by taking trading costs into account, as weak but frequent signals would be filtered out.
In the end, many of these decisions depend on the trader’s choice of style. Every trader must decide for themselves which risks they care about, and in what proportion to each other. As an attempt at a balanced performance metric, I will be using my Trading System Consistency, Drawdown, Return Asymmetry, Volatility, and Profit Factor Combination Metric (or TRASYCODRAVOPFACOM for short), which is calculated as follows:
St. Dev. is the annualized standard deviation of daily returns, and the profit factor is calculated based on daily returns.
The TRASYCODRAVOPFACOM still has weaknesses: a set of parameters may pick only a tiny number of trades over the years. If they’re successful enough, this can lead to a high score but a useless signal. To avoid this I’ll also be setting the minimum number of trades to 100, a reasonable hurdle given the 17-year sample.
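The combined metric itself is not reproduced here, but the building blocks named in the text can be sketched with their usual textbook definitions. The 252-trading-day year and all function names below are assumptions, and the code makes no claim about how the components are weighted.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed number of trading days per year

def annualized_stdev(daily_returns):
    """Sample standard deviation of daily returns, scaled to one year."""
    return statistics.stdev(daily_returns) * math.sqrt(TRADING_DAYS)

def profit_factor(daily_returns):
    """Sum of winning daily returns divided by the sum of losing ones."""
    gains = sum(r for r in daily_returns if r > 0)
    losses = -sum(r for r in daily_returns if r < 0)
    return gains / losses if losses > 0 else math.inf

def cagr(daily_returns):
    """Compound annual growth rate implied by a series of daily returns."""
    growth = math.prod(1.0 + r for r in daily_returns)
    years = len(daily_returns) / TRADING_DAYS
    return growth ** (1.0 / years) - 1.0

def enough_trades(n_trades, minimum=100):
    """Reject parameter sets that trade too rarely to be meaningful."""
    return n_trades >= minimum
```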
The random return results
Using a brute force approach, I collected approximately 704,000 results from 5 randomly generated series. It took several hours on my overclocked i5-2500K, so it’s definitely not a viable “real-world” approach (I am a terrible programmer, so some of the slowness is of my own making). The results look like you’d expect them to, with a few outliers at the top and bottom:
Here are the best values achieved:
Note that this isn’t a “universal” hurdle: it’s a hurdle for this specific subset of moving average signals, on the GBPUSD pair. I am certain that a wide array of signals and data would generate higher hurdles.
Brute force takes ages, even for just 5 return series, which is far too few to draw any conclusions. Are there any faster ways than brute force to find the best possible results from our random data? If this were a “normal” dataset, I would say yes, of course! However I was not sure about this case due to the randomly generated data that we are dealing with.
If the data is random, does it follow that the optimal strategy parameters are also randomly distributed? Are they uniformly distributed or are there “clusters” that, due to somehow exploiting the structure of the time series, perform better or worse than the average? The question is: is the performance slope around local maxima smooth, or not? A simple method to make this thing go faster is to throw the problem into a genetic algorithm, but a GA will offer no performance improvement if the performance is uniformly randomly distributed.
Testing this is simple: I just ran a GA search on the same 5 series I brute forced above. If the GA results are similar to the brute force results, we can use the GA and save a lot of time. As long as there are enough populations, and they are large enough (I settled on 4 populations with 40 chromosomes each), the results are “close enough”: roughly 3-20% lower than the brute force (max CAGR was 7.043%, max avg. daily return was 0.176%). It might be a good idea to scale the GA results by, say, an additional 10-20% in order to make up for this deficit.
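For readers who want to try the same comparison, here is a bare-bones sketch of a multi-population GA over integer parameters, using the 4 populations of 40 chromosomes mentioned above as defaults. It is not the code used for the results in this post: the selection scheme, uniform crossover, 20% mutation rate, 30 generations and the name `ga_search` are all assumptions, and the populations evolve independently with no migration between them.

```python
import random

def ga_search(fitness, bounds, n_pops=4, pop_size=40, generations=30, seed=0):
    """Maximize fitness(chromosome) over integer genes within bounds.

    bounds: list of (low, high) inclusive ranges, one per gene.
    Returns (best_chromosome, best_fitness).
    """
    rng = random.Random(seed)

    def random_chromosome():
        return [rng.randint(lo, hi) for lo, hi in bounds]

    pops = [[random_chromosome() for _ in range(pop_size)]
            for _ in range(n_pops)]
    for _ in range(generations):
        for k, pop in enumerate(pops):
            # Keep the better half unchanged (elitism), breed the rest.
            ranked = sorted(pop, key=fitness, reverse=True)
            parents = ranked[: pop_size // 2]
            children = []
            while len(children) < pop_size - len(parents):
                a, b = rng.sample(parents, 2)
                # Uniform crossover: each gene comes from either parent.
                child = [rng.choice(pair) for pair in zip(a, b)]
                # Occasionally re-draw one gene at random (mutation).
                if rng.random() < 0.2:
                    j = rng.randrange(len(bounds))
                    child[j] = rng.randint(*bounds[j])
                children.append(child)
            pops[k] = parents + children
    best = max((c for pop in pops for c in pop), key=fitness)
    return best, fitness(best)
```

In this setting a chromosome would encode things like the two moving average lengths and the holding period, and `fitness` would run the backtest on one random series. The sketch re-evaluates fitness on every sort, so caching results per chromosome would be the first thing to add if the backtest is slow.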
I then generated 100 series and put the GA to use. Here are the results:
And here are the distributions of maximum values achieved for each individual series:
These results have set the bar rather high. One might imagine that throwing out everything below this hurdle will leave us with very little (nothing?) in the end. But if Woodriff is to be believed, he has found upwards of 1500 signals that perform better than the hurdle (and that’s 1500 signals that were uncorrelated enough with each other that they were added to their models). So there’s got to be a lot of interesting stuff to find!
In part 2 I will take a look at cross validation and what we can do with the real data.
This is a cool subject. I’ve toyed around a bit with an ensemble of neural networks after hearing about QIM’s method. Any optimization will probably curve-fit the data pretty badly, but if you aggregate the output of several models that perform well in-sample (I like to think of them as mountains in an n-dimensional parameter space) you’re taking a much more generalized approach to the idea that there are useful patterns in the data.
Anyway, looking forward to the next post!
I think the idea is that the combination of setting a hurdle using random data + proper use of cross validation and true out of sample testing is enough to overcome the extreme curve-fitting that the approach entails. Whether it is enough, I don’t know yet.
The aggregation of the models afterward certainly plays a role too, though…but in the end it’s garbage in, garbage out. I suppose the key there is to maintain a good signal-to-noise ratio and weigh each model’s predictions appropriately. Not a trivial problem…
If you haven’t read it, I recommend Ensemble Methods in Data Mining by Seni and Elder. It’s got a foreword by Woodriff and I believe that the material in the book overlaps with his ideas.
My biggest doubt about the QIM approach is that “properly” mining random data would yield absurdly good results that would be impossible to replicate with real data. If some simplistic moving average model can manage 0.242% per day it spends in market, what happens when you test a billion different (and highly diverse) models, on thousands of random datasets?
Jim the Trader says:
I think those fund managers know that once they describe an edge it is only a matter of time before people replicate it, so they misdirect people towards impossible tasks.
Ah, but it’s not an edge, just a process for arriving at edges. In any case, even if it turns out to be a dud, I’m still having fun and learning things by testing it, so it’s a win-win situation!
Can you give some color regarding the TRASYCODRAVOPFACOM? I can understand each block, but why did you choose these precise functions (e.g. (1-2*(1-R²)), Min(1,skew))?
Regarding the scaling between indicators (/10 for the skew, /2 for the profit factor), is there any rationale behind it, or is it purely based on experience?
It’s 100% arbitrary based on what I consider relevant factors to the quality and tradability (in psychological terms) of a strategy. In retrospect I actually regret those weightings (a bit too much on the profit factor and R^2 I think) and should’ve added a measure of daily returns in the mix.
Thank you for the write-up. I came to a slightly different interpretation of his approach though.
Rather than generate random synthetic time series data, then mine it as you describe, the actual empirical data is transformed many times, with many combinations, to generate multiple input data transformation rules (e.g. Close-Low/High-Low) and corresponding model fits (a genetic programming algorithm can be useful here). The models are trained against randomly generated target variables in order to determine if they are just curve fitting well on noise vs. actual real signals. Good candidate models can be selected based upon comparing the edge of the real target data over random target data. These models are then mixed into ensemble models, which can be further trained, weighted, and validated with cross-validation procedures, and then applied to future out of sample data (carefully monitoring expectations on a walk-forward basis as real data streams in).
I’m not sure what you mean by “The models are trained against randomly generated target variables”…In Market Wizards Woodriff says:
>How do you do that?
>Let’s say instead of training with the target variable, which is the price change over the subsequent 24 hours, I generate random numbers that have the same distribution characteristics.
This actually suggests he uses some sort of distribution and not bootstrapping, but he definitely uses random data.
I think we agree on the rest…
A target variable, in ML nomenclature, is the dependent or response variable side of the input training data to a learner. My interpretation is that he is only using a randomized target variable to test his various hypotheses against the true response variable (price change in next 24 hrs). Whereas, if I understood correctly, you are randomizing to generate variations of the price series itself on the input attribute side.
I also interpreted it as him only transforming (via randomization) the input processing rules on the raw data, not the raw data itself, e.g. “I was trying different combinations of secondary variables that I generate from the daily price data… An example would be a volatility measure, which is a data series that is derived from price….(p. 148)” “You are constructing models by selecting combinations of secondary variables..(p. 151)” *see Market Wizards.
I see what you’re saying and I think you may be right in terms of the target variable. Does it make a difference though? If the target variable is randomized, then any predictive information in the data is useless anyway, whether the data that the transformations are taken from is randomized or not.
He clearly trains models on the fake data, and the results of that training are then used for hypothesis testing on the models trained on the real data (“It is only the performance difference between the models using real data and the baseline that is indicative of expected performance, not the full performance of the models in training.”)
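To make the randomized-target interpretation concrete, here is a toy sketch. It is only one possible reading of the quote: "training" is reduced to picking the candidate feature with the highest absolute correlation to the target, and the fake targets are produced by shuffling the real one, which preserves its distribution exactly. The names `pearson`, `best_fit` and `edge_over_baseline` are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def best_fit(features, target):
    """Stand-in for model training: best in-sample |correlation|."""
    return max(abs(pearson(f, target)) for f in features)

def edge_over_baseline(features, target, n_shuffles=200, seed=0):
    """Real in-sample fit minus the average fit on shuffled targets."""
    rng = random.Random(seed)
    real = best_fit(features, target)
    baseline = []
    for _ in range(n_shuffles):
        fake = list(target)
        rng.shuffle(fake)  # same distribution, no link to the features
        baseline.append(best_fit(features, fake))
    return real - sum(baseline) / len(baseline)
```

The returned difference is the quantity the quote points at: the in-sample fit on its own is inflated by the search, and only the part that exceeds what the same search finds on noise says anything about expected performance.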
Now that I think about it, this whole exercise seems somewhat pointless…if we know the distribution and its characteristics (or if we’re bootstrapping), it’s trivial to calculate the nth percentile of the average of x draws from that distribution. Why go through the trouble of training models?
As for the 2nd paragraph I agree completely, but that’s a different issue…this is essentially what I’m trying to do (in miniature) with the moving averages. I’m not sure whether the choice of secondary variables is discretionary or not…probably not, but the problem of generating them is quite interesting.
Isn’t this pretty much the same as White’s reality check, only that bootstrap has been replaced by Monte Carlo?
Sorry if I misunderstand, but isn’t the generation of the fake data basically a simple bootstrap using the distribution of close-close daily returns of the real data?
In other words:
(a) Calculate the distribution of close-close daily returns from the real data (typically non-normal with fat tails).
(b) Create a fake time series of close-close returns by drawing from this return distribution
(c) Assemble the fake return time series (close-close) into a fake price time series of closes
(d) Generate fake OHL data from the fake closing prices
(e) The initial price of the time series is unimportant, it can be scaled to anything, like 1.0
If this is correct, there will be no correlation of returns in the fake data, no trends, no volatility clustering etc.
>If this is correct, there will be no correlation of returns in the fake data, no trends, no volatility clustering
Absolutely. But that doesn’t preclude the _appearance_ of those things existing, as you can clearly see in the randomly generated series. The fact that the data doesn’t actually contain any of these features, but still seems like it does, is rather important I think. It gives us something to mine, not to mention a cautionary tale when it comes to dealing with real data.
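This is easy to check numerically. The sketch below, with hypothetical helper names, resamples returns with replacement and measures the lag-1 autocorrelation; for a long resampled series it should come out close to zero regardless of how autocorrelated the source returns were.

```python
import random

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a sequence."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

def bootstrap(returns, n, seed=0):
    """Draw n returns independently, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(returns) for _ in range(n)]
```

Comparing `lag1_autocorr(real_returns)` with `lag1_autocorr(bootstrap(real_returns, len(real_returns)))` shows how much serial structure the resampling throws away, even though a chart of the cumulated fake returns can still look like it trends.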
Hi, did you ever follow up with a part 2 of this?
Underfitting, misfitting and understanding alpha’s drivers | Math Trading says:
[…] Schwager’s Market Wizards series presents supporter of both sides, under the names of D.E. Shaw and Jaffray Woodriff. You can read more about their views in William Hua’s post in Adaptive Trader: Ensemble Methods with Jaffray Woodriff, or have a look at this QUSMA’s post for a more in-depth example of Woodriff’s approach: Doing the Jaffray Woodriff Thing (Kinda) […]
2013: Lessons Learned and Revisiting Some Studies says:
[…] Doing the Jaffray Woodriff Thing. I still need to follow up on that… […]
Did/does trading work for you or have you applied your time elsewhere?