This is the second, and probably final, followup to the Mining for Three Day Candlestick Patterns post. Previously, we improved performance by adding more data to the search. In this post we’ll try to improve the system further by combining multiple predictors. The central question is how to combine the forecasts. I test averaging, weighted averaging, regression, and a voting scheme and compare them against a baseline one-predictor strategy.

### Set-Up

Combining predictors is a standard tactic in machine learning, but the case of k-NN predictors is a bit of an outlier. Typical ensemble methods depend on generating variations in the data set in order to generate different and complementary predictors (as in the cases of boosting and bagging). This doesn’t work very well with nearest neighbor predictors, however, because they tend to be insensitive to variations in the data set. So what can we vary? The choice of *k*, the choice of inputs, the choice of distance measure for the nearest neighbors, and some pre-processing options such as whether to adjust for volatility or not.

I am not going to make any variation in inputs as that’s reserved for a post of its own. The idea is pretty simple: it’s essentially a random forest with k-NN predictors instead of decision trees (here’s an interesting paper on it).

So we’re left with k, sum of absolute or sum of square distances, and volatility adjustment. I picked 10 combinations of these options:

The k values were picked at random and I’m sure it’s possible to do better by optimizing them using cross validation.

The signals obviously overlap significantly, and have similar stats when used one-by-one:

The instrument traded is SPY. Additional data is taken from the following instruments for the pattern search: EWY, EWD, EWC, EWQ, EWU, EWA, EWP, EWH, EWL, EFA, EPP, EWM, EWI, EWG, EWO, IWM, QQQ, EWS, EWT, and EWJ. The thresholds in each case are adjusted to result in a similar length of time spent in the market. Position sizing is done based on the 10-day realized volatility of SPY, as described in this post: leverage is equal to 20% divided by 10-day realized annualized standard deviation, with a maximum leverage of 200%. Finally, an IBS filter is applied that allows long positions only when IBS < 0.5 and short positions only when IBS > 0.5.
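The sizing rule above can be sketched in a few lines of Python. This is only an illustration, not the code behind the results in this post: the function names are made up, and the 252-day annualization and the use of the sample standard deviation are assumptions.

```python
import math


def realized_vol(returns, window=10, periods_per_year=252):
    """Annualized sample standard deviation of the last `window` daily returns.

    Assumption: 252 trading days per year.
    """
    if len(returns) < window:
        raise ValueError("not enough returns for the requested window")
    recent = returns[-window:]
    mean = sum(recent) / len(recent)
    variance = sum((r - mean) ** 2 for r in recent) / (len(recent) - 1)
    return math.sqrt(variance * periods_per_year)


def target_leverage(returns, target=0.20, cap=2.0, window=10):
    """Leverage = target volatility / realized volatility, capped at `cap`."""
    vol = realized_vol(returns, window)
    if vol == 0:
        return cap  # assumption: a flat window falls back to the cap
    return min(target / vol, cap)
```

With daily moves of about 1% the 10-day realized volatility is roughly 17% annualized, so the leverage comes out a little above 1; in very quiet periods the 200% cap binds.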

The baseline is the PF3 predictor: k = 75, square distance measure, no volatility adjustment. Here’s the equity curve:

### Averaging

The simplest approach is obviously to just average the 10 forecasts and then use the average value to generate trades. A long position is taken when the average forecast is greater than 15 basis points, and a short position when the average is smaller than -12.5 basis points. Here’s what the equity curve looks like:
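As a minimal Python sketch of the averaging rule (the function name and the +1/-1/0 signal encoding are assumptions; forecasts are expressed as decimal returns, so 15 basis points is 0.0015):

```python
def average_signal(forecasts, long_threshold=0.0015, short_threshold=-0.00125):
    """Average the individual forecasts and map the result to a position.

    Returns +1 (long), -1 (short) or 0 (flat).
    """
    avg = sum(forecasts) / len(forecasts)
    if avg > long_threshold:
        return 1
    if avg < short_threshold:
        return -1
    return 0
```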

It’s interesting to note that the dispersion of forecasts is inversely related to the accuracy of the average: the smaller the standard deviation of the forecasts, the more accurate they are. Unfortunately the effect is marginal and thus not particularly useful for improving the strategy.

### Weighted Averaging

A simple extension that generates slightly better stats is to weight each forecast before averaging. There’s a wide array of stats one can use here (Sharpe/Sortino/MAR ratios are obvious candidates); I picked the mean square error. The inverse of the MSE becomes the forecast’s weight, so that smaller errors result in greater weights. The same thresholds as above are used to generate signals. The weights provide a slight improvement both in terms of Sharpe and MAR ratios. The equity curve:
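A rough Python sketch of inverse-MSE weighting follows. The names are hypothetical, and the post does not say over which window the errors are measured, so the sketch simply takes whatever history it is given. It also normalizes the weights to sum to one, which keeps the combined forecast on the same scale as the inputs.

```python
def inverse_mse_weights(past_forecasts, realized):
    """Weight each predictor by 1 / MSE of its past forecasts.

    past_forecasts: one list of historical forecasts per predictor.
    realized: the realized next-day returns for the same dates.
    A predictor with zero error would need special handling (division by zero).
    """
    weights = []
    for series in past_forecasts:
        mse = sum((f - r) ** 2 for f, r in zip(series, realized)) / len(realized)
        weights.append(1.0 / mse)
    total = sum(weights)
    return [w / total for w in weights]


def weighted_forecast(current_forecasts, weights):
    """Combine today's forecasts using the normalized weights."""
    return sum(f * w for f, w in zip(current_forecasts, weights))
```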

### Voting

Using a threshold for each forecast (>5 basis points for a “long” vote, and <-10 basis points for a “short” vote), each predictor casts a long vote, a short vote, or no vote. The overlap between the votes is significant, between 88% and 97% for different estimators. How many votes should we require for a trade? It quickly becomes obvious that simple majority voting isn’t enough, as only near-unanimous decisions provide worthwhile predictions. The average next-day return when there are between 1 and 8 long votes is 0.4 basis points. The average return after 9 or 10 long votes is 23 basis points.
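The voting rule is easy to sketch in Python. The vote thresholds and the 9-of-10 requirement for longs come from the text above; applying the same 9-vote requirement to shorts is an assumption, as are the function names.

```python
def count_votes(forecasts, long_vote=0.0005, short_vote=-0.0010):
    """Count predictors voting long (> 5 bp) and short (< -10 bp)."""
    longs = sum(1 for f in forecasts if f > long_vote)
    shorts = sum(1 for f in forecasts if f < short_vote)
    return longs, shorts


def vote_signal(forecasts, min_votes=9):
    """Trade only on near-unanimous votes: +1 long, -1 short, 0 flat.

    Assumption: the short side uses the same minimum vote count as the long side.
    """
    longs, shorts = count_votes(forecasts)
    if longs >= min_votes:
        return 1
    if shorts >= min_votes:
        return -1
    return 0
```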

The resulting equity curve looks like this:

### Ordinary Least Squares

It’s also possible to combine the forecasts using regression, with next-day returns as the dependent variable and the k-NN predictor forecasts as the independent ones.

The distribution of forecasts with OLS is very tightly clustered around 0, and for some reason higher forecasts are not associated with higher next-day returns (as they are for the 3 methods above). I don’t really understand why this is the case. The thresholds for trades are 0.5 basis points for a long trade, and -0.5 basis points for a short trade.

An issue here is, of course, multicollinearity due to the similarity of the independent variables. This can lead to, among other problems, overfitting (which is usually characterized by very large absolute values of the coefficients). Ridge regression mitigates that issue by penalizing large coefficients, which shrinks them toward zero.
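For reference, the two estimators can be written side by side. The symbols are assumptions introduced here: $y$ is the vector of next-day returns, $X$ is the matrix whose columns are the 10 k-NN forecasts (intercept omitted for brevity), and $\lambda \ge 0$ is the ridge penalty, whose value is not given in this post.

$$
\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top y
$$

$$
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left( \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right) = (X^\top X + \lambda I)^{-1} X^\top y
$$

When the columns of $X$ are nearly collinear, $X^\top X$ is close to singular and the OLS inverse amplifies noise; adding $\lambda I$ stabilizes the inversion and pulls the coefficients toward zero.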

A potentially interesting idea would be to constrain the coefficients to positive values, which might lessen the overfitting effects and also make much more sense on an intuitive level (after all, we know all the forecasts are similarly accurate, so negative coefficients don’t make much sense).

### Ridge Regression

If multicollinearity is a significant problem, we can use ridge regression to solve it. It offers a significant improvement over the OLS approach, but it still fares badly compared to the one-predictor case. The same thresholds as in the OLS approach are used. Here’s the equity curve:

### Stats

Here are the stats for the single-predictor base case and all the combination methods:

All of them other than the voting failed horribly. I’m not sure why, but it’s good to know. The improvement provided by the voting system is sizable, however. Not only does the voting-based strategy achieve significantly higher risk-adjusted returns, it does it while spending 15% less time in the market. Those results are also easy to improve on by simply adding more predictors. The marginal gain from each new predictor will be diminishing, but there is definitely more value to wring out of it. And this is just with 3-day patterns: we can easily add 2 and 4 day patterns into the mix as well.

### Other Possibilities

A wide array of machine learning methods can be used to combine predictions. Especially if the number of forecasts grew larger, techniques such as random forests or ANNs would be interesting to investigate. As long as simpler methods work very well I think there is little reason to increase the complexity (not to mention the opaqueness) of the strategy.

This is a followup to the Mining for Three Day Candlestick Patterns post. If you haven’t read the original post, do so now because I’m not going to repeat the basic mechanics of the strategy. While the approach was somewhat fruitful, it also had some obvious problems: it only seems to work in bearish or high volatility market regimes, and it couldn’t produce good short signals. The main idea I had to resolve these issues was simply to get more data.

That is easier said than done. Could we use mutual funds or index values to extend the dataset backwards? No, because the daily high/low values are inaccurate. The only alternative we are left with is using data from other instruments. So I picked a broad selection of equity ETFs to include: EWY, EWD, EWC, EWQ, EWU, EWA, EWP, EWH, EWL, EFA, EPP, EWM, EWI, EWG, EWO, IWM, QQQ, EWS, EWT, and EWJ.

The selection was comprehensive and unoptimized. I think you could do some sort of walk-forward optimization that picks the best combination of securities to include in the data set. I’m not sure how much that would help.

The additional data worked fantastically well, resolving both problems. The number of opportunities to trade increased significantly, long signals work very nicely under all market conditions, and predicting negative returns works far better. There was also an unexpected benefit: far less time is needed before the forecasts become usable. In the original implementation I waited 2000 days before starting to use the forecasts. With the extended data set this can be cut to 500, thus letting the backtest cover a longer period.

Performance-wise there were no problems, as the Accord.NET k-d tree implementation that I use is very quick. Finding the nearest 75 points in a data set of approximately 100,000, in 11 dimensions, takes less than 2 milliseconds on my overclocked 2500K.

The settings used in the search are simple: the length of the patterns is 3 days, the 75 closest ones are used to construct a forecast by averaging their next-day returns, and distance is calculated as the sum of squared distances in every dimension. Trades are taken when the forecast is above/below a certain threshold. They are then passed through a filter which only allows long positions when IBS < 0.5 and short positions only when IBS > 0.5.
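The IBS filter can be sketched as follows, using the usual definition of internal bar strength, (close - low) / (high - low). This is an illustrative Python version with made-up function names; returning 0.5 for a bar with no range is an assumption.

```python
def ibs(high, low, close):
    """Internal bar strength: position of the close within the day's range."""
    if high == low:
        return 0.5  # assumption: treat a zero-range bar as neutral
    return (close - low) / (high - low)


def apply_ibs_filter(signal, high, low, close, limit=0.5):
    """Keep longs only when IBS < limit and shorts only when IBS > limit."""
    value = ibs(high, low, close)
    if signal > 0 and value < limit:
        return signal
    if signal < 0 and value > limit:
        return signal
    return 0
```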

It should be noted that using traditional measures of “fit” does not work very well with pattern matching. Adding the above instruments actually increases the RMSE, despite significantly increasing the trading performance of the forecasts.

A look at forecasts vs realized next-day returns:

An important aspect to note is that even marginally positive forecasts work very well. For example, with the extended dataset, forecasts between 5 and 10 basis points resulted in an average 21 bp return the next day. On the other hand, using SPY data only, the return for those forecasts was just 5 basis points. What this means is that there are many more trades to take, which is what allows the strategy to do well in all market environments. Here’s the long-only equity curve:

A couple of charts to analyze the sensitivity of the long-only strategy’s results to changes in inputs (IBS limit and minimum forecast limit):

The additional data also has the benefit of making shorting possible. The equity curve doesn’t look as good, but it’s still a giant improvement over zero predictive ability on the short side:

Finally, the long and short strategies combined, along with the stats:

The concept also seems to work for stocks. For example, I tested a long-only strategy on AAPL, using the same settings as above, both with and without the addition of MSFT data. The Microsoft data improved every aspect of the results, with surprisingly consistent performance over nearly 20 years:

It would be interesting to try to apply this on a more massive scale, by increasing the data set to something like all S&P 500 stocks. Some technical restrictions prevent me from doing that right now, but I’ll come back to the idea in the future.

I have gotten a couple of emails asking me about the topic of analyzing performance, so I decided to detail the tools I use. Measuring performance and attributing success or failure to the right factors is an extremely important part of the trading process. Actually trading a strategy will often reveal aspects that don’t come up in the research stage. Unexpected things happen, revealing previously hidden strengths or weaknesses. Strategies improve or deteriorate through time. Execution issues eat into returns. Patterns emerge that can be exploited to enhance returns or limit risk.

These situations, and performance evaluation in general, are a crucial part of the research/trading/performance loop:

A lack of attention to performance, and the underlying factors that drive it, will have a deleterious effect both on your long-term trading results and the things that you will discover in the research stage.

I’ll demonstrate the tools using two strategies, one of which has been going well, and the other not: 1) a rather generic GTAA momentum/trend-following strategy that has been running for a bit over a year, and 2) an AAPL swing trading strategy that’s been in “trial” mode for the last 6 months or so.

My performance analysis system, the QUSMA Portfolio and Trade Analytics Suite, is primarily based around the concept of a “trade”. A trade is a unit that can contain any number of orders and cash transactions (dividends, taxes, etc.), which are somehow related. A pair trade would include both legs in a single trade, for example. The underlying data is imported using IB’s flex queries which have a very simple and easy to handle XML structure.

Trades are assigned to a “strategy” and can also be assigned any number of tags. Some of the things that I use tags for are: trade direction (long/short/both), trade length, developed/developing country, asset class, etc. Notes with images can also be attached to trades, which is incredibly useful for reviews. Finally, the trades can be filtered on any number of criteria to produce reports, and compared against custom benchmarks.

There are some general principles that summarize my approach to performance measurement:

- Execution and commissions are extremely important.
- Separate timing from sizing.
  - Statistics on trades in both dollar terms and % terms.
- Separate capital allocations to strategies from total capital.
  - Statistics on returns both on capital allocated to a strategy (ROAC) and on total capital (ROTC).
- Always think probabilistically and in terms of expectations.
- The more ways you can find to look at the data, the better.

Simple visual inspection is my starting point, and I think it’s very important. The simple act of staring at charts often leads to new research ideas.

So let’s get started with the graphs and stats. At the top, the standard dollar PnL (daily and close-to-close) and equity curves (both in terms of ROAC and ROTC), which are also plotted against a benchmark:

Next up are the trade statistics. Commissions are right up there: it’s very important to keep in mind how much you are losing to those costs. A few basis points may not seem like much, but they can quickly eat up a significant portion of your profits. Note all the stats are given both in dollar and percentage terms, in order to separate timing effects from sizing effects.

Results by calendar month:

Probably the most important bit, statistics on daily returns, the standard ratios, and so forth. The MAR ratio is probably the most important number for me. The reason is simple: it determines my leverage constraints, and thus my returns. A high Sharpe ratio is meaningless if you can’t lever up. Note how the simple, static benchmark portfolio has destroyed the GTAA approach:
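For readers unfamiliar with it, the MAR ratio is the compound annual growth rate divided by the maximum drawdown. A small Python sketch, with hypothetical function names and an assumed 252 periods per year:

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def cagr(equity, periods_per_year=252):
    """Compound annual growth rate of an equity curve sampled once per period."""
    years = (len(equity) - 1) / periods_per_year
    return (equity[-1] / equity[0]) ** (1 / years) - 1


def mar_ratio(equity, periods_per_year=252):
    """CAGR divided by maximum drawdown."""
    drawdown = max_drawdown(equity)
    if drawdown == 0:
        return float("inf")
    return cagr(equity, periods_per_year) / drawdown
```

The link to leverage is direct: if the worst tolerable drawdown is fixed, the MAR ratio bounds the return that can be earned before that limit is hit.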

Some simple benchmarking stuff:

Histograms of daily returns, and returns per trade. Again, it’s important to look at both dollar and percentage results:

Also, holding period histogram:

Position sizing vs trade returns. Naive risk parity seems to be doing alright:

Trade length vs returns chart, the relationship here is pretty clear.

The movement capture stats measure how good the strategy is at capturing returns. GU is gross upside, or the gross positive returns during the period. UC% is the percentage of that movement that was captured by being long, UM% is the percentage of the movement that was missed by being flat, while UL% is the percentage of the movement that was lost due to being short. The calculations are repeated for downside movement.
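A simplified Python sketch of the upside half of these stats is below. It assumes unit positions (+1 long, 0 flat, -1 short) and sums simple daily returns, ignoring position size and compounding; the actual calculation in the analytics suite may differ. The downside version would mirror it using negative returns.

```python
def upside_capture(returns, positions):
    """Split the instrument's gross upside by what the strategy was doing.

    returns: daily returns of the instrument.
    positions: +1 long, 0 flat, -1 short for each day (assumed unit size).
    """
    gross_up = sum(r for r in returns if r > 0)
    if gross_up == 0:
        return None  # no upside to attribute
    captured = sum(r for r, p in zip(returns, positions) if r > 0 and p > 0)
    missed = sum(r for r, p in zip(returns, positions) if r > 0 and p == 0)
    lost = sum(r for r, p in zip(returns, positions) if r > 0 and p < 0)
    return {
        "GU": gross_up,
        "UC%": 100 * captured / gross_up,
        "UM%": 100 * missed / gross_up,
        "UL%": 100 * lost / gross_up,
    }
```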

Cumulative percent returns, by instrument. A similar chart with dollar PnL by instrument also exists.

Autocorrelation and partial autocorrelation stats based on daily returns:

Standard value at risk calculations, based on resampled historical data. I’ll be adding the option to use parametric methods in the future.

Monte Carlo simulation. It simply uses historical data, either trades or daily returns (either ROAC or ROTC). Sampling can be done with replacement or without (the latter simply re-orders the existing equity curve). There is also an option to use N consecutive days/trades, which can capture volatility clustering and autocorrelation effects. The analysis returns confidence intervals for the equity curve, as well as the cumulative and point distributions of maximum drawdowns.
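The resampling idea can be illustrated with a short Python sketch. It covers only sampling with replacement, optionally in blocks of N consecutive observations; the function names, the fixed seed and the compounding of returns are assumptions, not a description of the actual tool.

```python
import random


def resample_blocks(returns, block=1, rng=random):
    """Draw blocks of `block` consecutive returns, with replacement,
    until the resampled path is as long as the original."""
    n = len(returns)
    path = []
    while len(path) < n:
        start = rng.randrange(0, n - block + 1)
        path.extend(returns[start:start + block])
    return path[:n]


def max_drawdown_of(returns):
    """Maximum drawdown of the compounded equity curve built from `returns`."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = max(worst, 1 - equity / peak)
    return worst


def simulate_drawdowns(returns, runs=1000, block=1, seed=0):
    """Sorted maximum drawdowns across `runs` resampled paths."""
    rng = random.Random(seed)
    return sorted(
        max_drawdown_of(resample_blocks(returns, block, rng)) for _ in range(runs)
    )
```

Reading off an element of the sorted list, for example `dds[int(0.95 * len(dds))]`, gives an approximate percentile of the maximum drawdown distribution. Larger `block` values preserve more of the volatility clustering in the original series.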

Finally, some simple stats and charts on execution. All of my trades are either at the close or the open, so those are the prices I benchmark against. Below are stats from the AAPL strategy’s buy orders around the close.

I think that the biggest weakness in my toolset is the lack of interaction with backtesting results. These can be used in two main ways: 1) comparing theoretical results to real trading results, and 2) as an extended dataset for the risk management functions. Also, I don’t do any stock picking, but if I did that would entail several additions, mainly performance attribution by country, sector, etc. as well as analyzing value/size/momentum factor exposures.

Leave a comment and tell us what you like to use: is the standard stuff enough for you, or do you use any obscure ratios or unique charts?

A simple post on position sizing, comparing three similar volatility-based approaches. In order to test the different sizing techniques I’ve set up a long-only strategy applied to SPY, with 4 different signals:

- UDIDSRI.
- 2-day candlestick KNN search, going long if the expected return is > 0.125%.
- Cutler’s RSI(3): long if RSI <= 10, exit if > 50.
- Long at every 15 day low close.

On top of that sits an IBS filter, allowing long positions only when IBS is below 50%. A position is taken if any of the signals is triggered. Entries and exits at the close of the day, no stops or targets. Results include commissions of 1 cent per share.
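Two of the four signals are simple enough to sketch in Python. Cutler’s RSI uses simple averages of gains and losses rather than Wilder’s smoothing, which reduces to 100 × gains / (gains + losses) over the lookback. The function names and the neutral value of 50 for a flat window are assumptions.

```python
def cutlers_rsi(closes, period=3):
    """Cutler's RSI over the last `period` close-to-close changes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    changes = [
        closes[i] - closes[i - 1]
        for i in range(len(closes) - period, len(closes))
    ]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if gains + losses == 0:
        return 50.0  # assumption: no movement is treated as neutral
    return 100.0 * gains / (gains + losses)


def is_n_day_low_close(closes, n=15):
    """True when the latest close is the lowest of the last `n` closes."""
    if len(closes) < n:
        raise ValueError("need at least n closes")
    return closes[-1] == min(closes[-n:])
```

Under the rules listed above, the RSI signal would go long when `cutlers_rsi(closes) <= 10` and exit once it rises above 50.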

Sizing based on realized volatility uses the 10-day realized volatility, and then adjusts the size of the position such that, if volatility remains unchanged, the portfolio would have an annualized standard deviation of 17%. The fact that the strategy is not always in the market decreases volatility, which is why to get close to the ~11.5% standard deviation of the fixed fraction sizing we need to “overshoot” by a fair bit.

The same idea is used with the GARCH model, which is used to forecast volatility 3 days ahead. That value is then used to adjust size. And again the same concept is used with VIX, but of course option implied volatility tends to be greater than realized volatility, so we need to overshoot by even more, in this case to 23%.

Let’s take a look at the strategy results with the simplest sizing approach (allocating all available capital):

Returns are the highest during volatile periods, and so are drawdowns. This results in an uneven equity curve, and highly uneven risk exposure. There is, of course, no reason to let the market decide these things for us. Let’s compare the fixed fraction approach to the realized volatility- and VIX-based sizing approaches:

These results are obviously unrealistic: nobody in their right mind would use 600% leverage in this type of trade. A Black Monday would very simply wipe you out. These extremes are rather infrequent, however, and leverage can be capped to a lower value without much effect.

With the increased leverage comes an increase in average drawdown, with >5% drawdowns becoming far more frequent. The average time to recovery is also slightly increased. Given the benefits, I don’t see this as a significant drawback. If you’re willing to tolerate a 20% drawdown, the frequency of 5% drawdowns is not that important.

On the other hand, the deepest drawdowns naturally tend to come during volatile periods, and the decrease of leverage also results in a slight decrease of the max drawdown. Returns are also improved, leading to better risk-adjusted returns across the board for the volatility-based sizing approaches.

The VIX approach underperforms, and the main reason is obviously that it’s not a good measure of expected future volatility. There is also the mismatch between the VIX’s 30-day horizon and the much shorter horizon of the trades. GARCH and realized volatility result in very similar sizing, so the realized volatility approach is preferable due to its simplicity.

Posting has been slow lately because I’ve been busy with a bunch of other stuff, including the CFA Level 3 exam last weekend. I’ve also begun work on a very ambitious project: a fully-featured all-in-one backtesting and live trading suite, which is what prompted this post.

Over the last half year or so I’ve been moving toward more complex tools (away from Excel, R, and MATLAB), and generally just writing standalone backtesters in C# for every concept I wanted to try out, only using Multicharts for the simplest ideas. This approach is, of course, incredibly inefficient, but the software packages available to “retail” traders are notoriously horrible, and I have nowhere near the capital I’d need to afford “real” tools like QuantFACTORY or Deltix.

The good thing about knowing how to code is that if a tool doesn’t exist you can just write it, and that’s exactly what I’m doing. Proper portfolio-level backtesting and live trading that’ll be able to easily do everything from intraday pairs trading to long term asset allocation and everything in-between, all under the same roof. On the other hand it’s also tailored to my own needs, and as such contains no plans for things like handling fundamental data. Most importantly it’s my dream research platform that’ll let me go from idea, to robust testing & optimization, to implementation very quickly. Here’s what the basic design looks like:

What’s the point of posting about it? I know there are many other people out there facing the same issues I am, so hopefully I can provide some inspiration and ideas on how to solve them. Maybe it’ll prompt some discussion and idea-bouncing, or perhaps even collaboration.

Most of the essential stuff has already been laid down, so basic testing is already possible. A simple example based on my previous post can showcase some essential features. Below you’ll find the code behind the PatternFinder indicator, which uses the Accord.NET library’s k-d tree and k nearest neighbor algorithm implementation to do candlestick pattern searches as discussed here. Many elements are specific to my system, but the core functionality is trivially portable if you want to borrow it.

Note the use of attributes to denote properties as inputs, and set their default values. Options can be serialized/deserialized for easy storage in files or a database. Priority settings allow the user to specify the order of execution, which can be very important in some cases. Indexer access works with [0] being the current bar, [1] being the previous bar, etc. Different methods for historical and real time bars allow for a ton of optimization to speed up processing when time is scarce, though in this case there isn’t much that can be done.

The VariableSeries class is designed to hold time series, synchronize them across the entire parent object, prevent data snooping, etc. The Indicator and Signal classes are all derived from VariableSeries, which is the basis for the system’s modularity. For example, in the PatternFinder indicator, OHLC inputs can be modified by the user through the UI, e.g. to make use of the values of an indicator rather than the instrument data.

The backtesting analysis stuff is still in its early stages, but again the foundations have been laid. Here are some stats using a two-day PatternFinder combined with IBS, applied on SPY:

Here’s the first iteration of the signal analysis interface. I have added 3 more signals to the backtest: going long for 1 day at every 15 day low close, the set-up Rob Hanna posted yesterday over at Quantifiable Edges (staying in for 5 days after the set-up appears), and UDIDSRI. The idea is to be able to easily spot redundant set-ups, find synergies or anti-synergies between signals, and easily get an idea of the marginal value added by any one particular signal.

And here’s some basic Monte Carlo simulation stuff, with confidence intervals for cumulative returns and PDF/CDF of the maximum drawdown distribution:

Here’s the code for the PatternFinder indicator. Obviously it’s written for my platform, but it should be easily portable. The “meat” is all in CalcHistorical() and GetExpectancy().

```csharp
using System;
using System.Linq;
using Accord.MachineLearning.Structures; // KDTree<T> (later Accord.NET releases use Accord.Collections)

// Indicator, VariableSeries, QSwing, Input and SeriesInput belong to the author's platform.

/// <summary>
/// K nearest neighbor search for candlestick patterns
/// </summary>
public class PatternFinder : Indicator
{
    [Input(3)]
    public int PatternLength { get; set; }

    [Input(75)]
    public int MatchCount { get; set; }

    [Input(2000)]
    public int MinimumWindowSize { get; set; }

    [Input(false)]
    public bool VolatilityAdjusted { get; set; }

    [Input(false)]
    public bool Overnight { get; set; }

    [Input(false)]
    public bool WeighExpectancyByDistance { get; set; }

    [Input(false)]
    public bool Classification { get; set; }

    [Input(0.002)]
    public double ClassificationLimit { get; set; }

    [Input("Euclidean")]
    public string DistanceType { get; set; }

    [SeriesInput("Instrument.Open")]
    public VariableSeries<decimal> Open { get; set; }

    [SeriesInput("Instrument.High")]
    public VariableSeries<decimal> High { get; set; }

    [SeriesInput("Instrument.Low")]
    public VariableSeries<decimal> Low { get; set; }

    [SeriesInput("Instrument.Close")]
    public VariableSeries<decimal> Close { get; set; }

    [SeriesInput("Instrument.AdjClose")]
    public VariableSeries<decimal> AdjClose { get; set; }

    private VariableSeries<double> returns;
    private VariableSeries<double> stDev;
    private KDTree<double> _tree;

    public PatternFinder(QSwing parent, string name = "PatternFinder", int BarsCount = 1000)
        : base(parent, name, BarsCount)
    {
        Priority = 1;
        returns = new VariableSeries<double>(parent, BarsCount);
        // Default of 1 means dividing/multiplying by stDev is a no-op without volatility adjustment.
        stDev = new VariableSeries<double>(parent, BarsCount) { DefaultValue = 1 };
    }

    internal override void Startup()
    {
        // 4 values per day, minus the close-to-close change that the oldest day lacks.
        _tree = new KDTree<double>(PatternLength * 4 - 1);

        switch (DistanceType)
        {
            case "Euclidean":
                _tree.Distance = Accord.Math.Distance.Euclidean;
                break;
            case "Absolute":
                _tree.Distance = AbsDistance;
                break;
            case "Chebyshev":
                _tree.Distance = Accord.Math.Distance.Chebyshev;
                break;
            default:
                _tree.Distance = Accord.Math.Distance.Euclidean;
                break;
        }
    }

    public override void CalcHistorical()
    {
        if (VolatilityAdjusted && CurrentBar > 0)
            returns.Value = (double)(AdjClose[0] / AdjClose[1] - 1);

        if (VolatilityAdjusted && CurrentBar > 11)
            stDev.Value = returns.StandardDeviation(10);

        if (CurrentBar < PatternLength + 1) return;

        // Forecast from the pattern ending on the current bar.
        if (CurrentBar > MinimumWindowSize)
            Value = GetExpectancy(GetCoords());

        double ret = Overnight
            ? (double)(Open[0] / Close[1] - 1)
            : (double)(AdjClose[0] / AdjClose[1] - 1);
        double adjret = ret / stDev[0];

        // Store the pattern ending on the previous bar together with the return that followed it.
        if (Classification)
            _tree.Add(GetCoords(1), adjret > ClassificationLimit ? 1 : 0);
        else
            _tree.Add(GetCoords(1), adjret);
    }

    public override void CalcRealTime()
    {
        if (VolatilityAdjusted && CurrentBar > 0)
            returns.Value = (double)(AdjClose[0] / AdjClose[1] - 1);

        if (VolatilityAdjusted && CurrentBar > 11)
            stDev.Value = returns.StandardDeviation(10);

        if (CurrentBar > MinimumWindowSize)
            Value = GetExpectancy(GetCoords());
    }

    private double GetExpectancy(double[] coords)
    {
        if (!WeighExpectancyByDistance)
            return _tree.Nearest(coords, MatchCount).Average(x => x.Node.Value) * stDev[0];
        else
        {
            // Inverse squared-distance weighting of the neighbors' returns.
            var nodes = _tree.Nearest(coords, MatchCount);
            double totweight = nodes.Sum(x => 1 / Math.Pow(x.Distance, 2));
            return nodes.Sum(x => x.Node.Value * ((1 / Math.Pow(x.Distance, 2)) / totweight)) * stDev[0];
        }
    }

    private static double AbsDistance(double[] x, double[] y)
    {
        return x.Select((t, i) => Math.Abs(t - y[i])).Sum();
    }

    private double[] GetCoords(int offset = 0)
    {
        double[] coords = new double[PatternLength * 4 - 1];
        for (int i = 0; i < PatternLength; i++)
        {
            coords[4 * i] = (double)(Open[i + offset] / Close[i + offset]);
            coords[4 * i + 1] = (double)(High[i + offset] / Close[i + offset]);
            coords[4 * i + 2] = (double)(Low[i + offset] / Close[i + offset]);

            if (i < PatternLength - 1)
                coords[4 * i + 3] = (double)(Close[i + offset] / Close[i + 1 + offset]);
        }
        return coords;
    }
}
```

Coming up Soon™: a series of posts on cross validation, an in-depth paper on IBS, and possibly a theory-heavy paper on the low volatility effect.

I’ve been thinking a lot about candlestick patterns lately but grew tired of trying to generate ideas and instead decided to mine for them. I must confess I didn’t expect much from such a simplistic approach, so I was pleasantly surprised to see it working well. Unfortunately I wasn’t able to discover any short set-ups. The general bias of equity markets toward the upside makes it difficult to find enough instances of patterns that are followed by negative returns.

The idea is to mine past data for similar 3 day patterns, and then use that information to make trading decisions. There are several choices we must make:

- The size of the lookback window. I use an expanding window that starts at 2000 days.
- Once we find similar patterns, how do we choose which ones to use?
- How do we measure the similarity between the patterns?

To fully describe a three day candlestick pattern we need 11 numbers. The close-to-close percentage change from day 1 to day 2, and from day 2 to day 3, as well as the positions of the open, high, and low relative to the close for each day.
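To make the count of 11 concrete, here is a small Python sketch of one possible encoding. It uses price ratios (open/close, high/close, low/close, and close over the previous close), similar to the ratios in the GetCoords() method of the PatternFinder code; the function name and the oldest-first ordering are assumptions.

```python
def encode_pattern(bars):
    """Encode an n-day candlestick pattern as 4 * n - 1 numbers.

    bars: list of (open, high, low, close) tuples, oldest first.
    Each day contributes open/close, high/close and low/close; every day
    after the first also contributes close / previous close.
    """
    coords = []
    for i, (open_, high, low, close) in enumerate(bars):
        coords.extend([open_ / close, high / close, low / close])
        if i > 0:
            coords.append(close / bars[i - 1][3])
    return coords
```

Three days give 3 × 3 intraday ratios plus 2 close-to-close ratios, hence 11 numbers.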

To measure the degree of similarity between any two 3-day patterns, I tried both the sum of absolute differences and the sum of the squared differences between those 11 numbers; the results were quite similar. It would be interesting to try to optimize individual weights for each number, as I imagine some are more important than others.

The final step is to select a number of the closest patterns we find, and simply average their next-day returns to arrive at an expected return.
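Put together, the search and the forecast amount to a plain k-nearest-neighbor average. The Python sketch below uses a brute-force scan for clarity rather than a k-d tree; the function names and the data layout are assumptions.

```python
def squared_distance(a, b):
    """Sum of squared differences between two patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def absolute_distance(a, b):
    """Sum of absolute differences between two patterns."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_forecast(query, history, k=50, distance=squared_distance):
    """Average next-day return of the k historical patterns closest to `query`.

    history: list of (pattern, next_day_return) pairs, where each pattern
    has the same length as `query`.
    """
    if len(history) < k:
        raise ValueError("not enough history for k neighbors")
    nearest = sorted(history, key=lambda item: distance(query, item[0]))[:k]
    return sum(ret for _, ret in nearest) / k
```

A linear scan costs time proportional to the size of the history on every query, which is fine for a single instrument but is the reason a spatial index becomes attractive as the data set grows.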

How do we choose which patterns are “close enough” to use? Choose too few and the sample will be too small. Choose too many and you risk using irrelevant data. That’s a number that we’ll have to optimize.

When comparing the results we also run into another problem: the smaller the sample, the more spread out the expected return estimates will be, which means more trades will be chosen given a certain minimum limit for entry. My solution was to choose a different limit for trade entry, such that all sample sizes would generate the same number of days in the market (300 in this case). Here are the walk-forward results:
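One simple way to equalize time in the market is to set the entry limit at the value of the N-th largest forecast. The Python sketch below does exactly that; it is an assumption about how such a limit could be chosen for comparison purposes, not necessarily how the limits here were computed.

```python
def threshold_for_days(forecasts, target_days=300):
    """Entry limit such that `forecast >= limit` holds on `target_days` days
    (more if several forecasts tie at the limit)."""
    if target_days > len(forecasts):
        raise ValueError("target_days exceeds the number of forecasts")
    return sorted(forecasts, reverse=True)[target_days - 1]
```

Because the limit is derived from the full set of forecasts, it is only suitable for comparing sample sizes on an equal footing, not as a tradable rule.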

The trade-off between sample size and relevance is clear, and the “sweet spot” appears to be somewhere in the 50-150 range or so, for both the absolute difference and squared difference approaches. Depending on how selective you want to be, you can decrease the limit and trade off more trades for lower expected returns. For me, 30 bp is a reasonable area to aim for.

A nice little addition is to use IBS by filtering out any trades with IBS > 50%. Using squared differences, I select the 50 closest patterns. When their average next-day return is greater than 0.2%, a long position is taken. The results are predictably great:

The IBS filter removes close to 40% of days in the market yet maintains essentially the same CAGR, while also more than halving the maximum drawdown.

Let’s take a look at some of the actual patterns. Using squared differences, the 50 closest patterns, and a 0.2% limit, the last successful trade was on February 26, 2013. The expected return on that day was 0.307%. Here’s what those 3 days looked like, as well as the 5 closest historical patterns:

As you can see below, even the 50th closest pattern seems to be, based on visual inspection, rather close. The “main idea” of the pattern seems to be there:

Here are the stats from a bunch of different equity index ETFs, using square differences, the 50 closest patterns, 0.2% expected return limit and the IBS < 0.5 filter.

The 0.2% limit seems to be too low for some of them, producing too many trades. Perhaps setting an appropriate limit per-instrument would be a good idea.

The obvious path forward is to also produce 2-day, 4-day, 5-day, etc. versions, perhaps with optimized distance weighting and some outlier filtering, and combine them all in a nice little ensemble to get your predictions out of. The implementation is left as an exercise for the reader.

Bitcoin seems to be all the rage these days, and I’m jumping on the bandwagon. Quandl tweeted about their bitcoin data today so I decided I’d have a look at it. I have tested a bunch of popular/“standard” ideas, and the results aren’t really surprising, though they do illuminate the trend-y (bubbl-y) character of the bitcoin market. BTC prices do not revert like equities but show strong momentum, both in the short and medium term. IBS is useless, while trend following works like it does everywhere else. The (daily) data covers the period from 17/7/2010 to today.

## Descriptive Stats

The **mean** simple daily return has been **1.012%**, while the annualized **standard deviation** has been **121.70%**. The distribution of returns is obviously fat-tailed (with a **kurtosis** of **8.62**), though somewhat surprisingly (to me at least), slightly positively **skewed** (**0.76**).

## Up/Down Streaks

Strong up streaks tend to be followed by high returns over the medium term, and there has been a surprisingly large number of these streaks given the small amount of data available.

## IBS

IBS does not appear to have any predictive value when it comes to bitcoin returns.

## RSI(3)

No mean reversion to be found here. Using a 3-period Cutler’s RSI, next-day bitcoin returns are **0.392%** when RSI(3) is **below 20**, and **1.763%** when it is **above 80**. The story is pretty much the same if you go for a medium term length for the RSI: high values beget high returns, with no mean reversion in sight.
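For reference, Cutler’s RSI differs from the standard Wilder version in using simple sums (equivalently, simple moving averages) of gains and losses over the window. A minimal sketch, with the flat-window convention being my own assumption:

```python
def cutlers_rsi(closes, period=3):
    """Cutler's RSI: simple (not Wilder-smoothed) averages of gains and losses.

    Returns a list aligned with `closes`; entries are None until
    `period` price changes are available.
    """
    rsi = [None] * len(closes)
    for i in range(period, len(closes)):
        changes = [closes[j] - closes[j - 1] for j in range(i - period + 1, i + 1)]
        gains = sum(c for c in changes if c > 0)
        losses = sum(-c for c in changes if c < 0)
        if gains + losses == 0:
            rsi[i] = 50.0  # assumption: a completely flat window is treated as neutral
        else:
            # Equivalent to 100 - 100 / (1 + gains / losses)
            rsi[i] = 100.0 * gains / (gains + losses)
    return rsi
```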

## Simple Trend Following

The strong trends that bitcoin has shown would have been very profitable to any trend followers. Going long at a new 50-day high close (with an exit at a new 25-day low close), and vice-versa for short positions, would have yielded these equity curves:
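A rough sketch of the breakout rule follows. It assumes that a “new N-day high close” means a close strictly above each of the previous N-1 closes, and that a fresh 50-day extreme flips an opposite position directly; the function name is hypothetical and both `entry` and `exit_` must be at least 2, with `exit_` no larger than `entry`.

```python
def trend_positions(closes, entry=50, exit_=25):
    """Return end-of-day positions (+1 long, -1 short, 0 flat), one per close."""
    positions = []
    pos = 0
    for i, close in enumerate(closes):
        if i >= entry - 1:
            prev_entry = closes[i - entry + 1 : i]  # previous entry-1 closes
            prev_exit = closes[i - exit_ + 1 : i]   # previous exit_-1 closes
            if close > max(prev_entry):
                pos = 1                              # new entry-day high close: go long
            elif close < min(prev_entry):
                pos = -1                             # new entry-day low close: go short
            elif pos == 1 and close < min(prev_exit):
                pos = 0                              # long exits on a new exit_-day low
            elif pos == -1 and close > max(prev_exit):
                pos = 0                              # short exits on a new exit_-day high
        positions.append(pos)
    return positions
```

The position on a given day would earn the following day’s return; costs are ignored here.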

## Day of the Week

Before you jump in, keep in mind that this sort of market can change character very quickly, especially after a big bubble pop. Also consider the fees: Mt. Gox, the most popular exchange, charges an obscene 120 basis points per roundtrip. There are some brokers that will allow you to short bitcoins, and there even appear to be some thinly-traded options and currency futures available…I imagine there are gigantic inefficiencies in the pricing of these instruments (though their legality is probably questionable).

Model risk is the risk that a model is, or will become, unable to perform the tasks it was designed to do. In terms of trading, this can be the risk that a set-up stops working, the risk that a variable loses its predictive power, etc. Ever-changing market conditions mean that model risk is a significant issue for most systematic traders: managing it is an integral part of adapting to new market environments.

Two heuristic rules are commonly used to handle this risk: drawdown-based position sizing, and a maximum drawdown cutoff. The former involves reducing exposure depending on drawdown (e.g. position sizes are halved once the drawdown exceeds 10%); the latter simply stops the strategy if it ever reaches a specified drawdown cutoff.

To investigate the efficacy of these rules, I’m going to use a simple Monte Carlo approach. The basic strategy has returns drawn from a normal distribution with mean 0.20% and standard deviation 1.5%. Model risk is represented by a small chance (0.05% per cycle) that the returns distribution will permanently change to having a mean of -0.05%.

The rules for dealing with the risk are as follows: equity curve-based position sizing will decrease positions by 25% once the drawdown exceeds 5%, and by 50% once it exceeds 10%. The cutoff simply stops trading if the drawdown ever reaches 25%.
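A single simulated path under these rules might look like the sketch below. It is an illustration under stated assumptions rather than the code behind the results: drawdown is measured on the strategy’s own equity, sizing is applied to the next step’s return, and the function and parameter names are made up for the example.

```python
import random


def simulate_path(steps=1000, mu=0.0020, sigma=0.015, fail_prob=0.0005,
                  failed_mu=-0.0005, use_sizing=True, cutoff=0.25, rng=None):
    """Simulate one equity curve; returns (final_equity, max_drawdown, failed)."""
    if rng is None:
        rng = random.Random()
    equity = peak = 1.0
    max_dd = 0.0
    failed = False
    for _ in range(steps):
        # Model risk: small chance per step of a permanent shift to a negative mean.
        if not failed and rng.random() < fail_prob:
            failed = True
        drawdown = 1.0 - equity / peak
        # Cutoff rule: stop trading for good once the drawdown reaches the limit.
        if cutoff is not None and drawdown >= cutoff:
            break
        # Equity curve-based sizing: 75% size beyond 5% drawdown, 50% beyond 10%.
        size = 1.0
        if use_sizing:
            if drawdown > 0.10:
                size = 0.5
            elif drawdown > 0.05:
                size = 0.75
        ret = rng.gauss(failed_mu if failed else mu, sigma)
        equity *= 1.0 + size * ret
        peak = max(peak, equity)
        max_dd = max(max_dd, 1.0 - equity / peak)
    return equity, max_dd, failed
```

Calling this repeatedly with `use_sizing` and `cutoff` switched on or off, and with `fail_prob` set to zero or not, gives the four heuristic combinations with and without model risk.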

Running 10,000 simulations with 1,000 steps each, the results are shown below:

The first thing to note is the obvious fact that, without model risk, these heuristics have a negative effect on risk-adjusted returns. Yes, maximum drawdowns are decreased on average, but at an unacceptable cost to returns. In the case of equity curve-based position sizing, the average drawdown is deeper and longer as well. The lesson should be obvious a priori but deserves to be stated anyway: if you are confident that a strategy will continue to work well in the future, you should abandon such rules.

Note that these results assume that the returns distribution remains constant; some real-world strategies such as trend following futures exhibit higher than average returns after drawdowns, so decreasing exposure at those times would be even more hurtful. The inverse may be true of other strategies.

Things change when we look at the results after including model risk. Both heuristics improve risk-adjusted returns, with the drawdown cutoff being particularly effective. While equity curve-based sizing improves on the vanilla case, it actually harms returns when combining it with the cutoff. This is presumably because the cutoff already takes care of all the failed strategies (and even more: while 38.5% of strategies failed, 42.6% of them hit the drawdown limit) and the variable sizing only serves to hurt the healthy ones.

Setting the drawdown limit for each particular strategy is a bit trickier. The maximum drawdown of a backtest should serve as a guide. This can be augmented either by assuming normal returns and using the results in *On the Maximum Drawdown of a Brownian Motion*, or through Monte Carlo simulation.

In the real world there are, of course, infinite states between the model working perfectly and it not working at all, so one must leave some room for deterioration and temporary changes by widening the cutoff point a bit. Finally, a more rigorous approach would perhaps use some sort of regime change detection and stop trading when the mean of the returns is determined to be below a hurdle, at a particular level of confidence.

Jaffray Woodriff, who runs QIM, a highly successful systematic fund, has provided enough details about his data mining approach in various interviews (particularly the one in the excellent book *Hedge Fund Market Wizards*) that I think I can approximate it. Even though QIM has been lagging a bit the last few years, they have an excellent track record, so their approach is certainly worthy of imitation if possible. They trade commodities, currencies, etc. so the approach seems to be highly portable. And while they suffer from significant price impact issues (not to mention being forced into longer holding periods) due to their size, a small trader could probably do far better with the same strategies.

### Introduction

The approach, as much as he has detailed it, goes as follows:

- Generate random data.
- Mine it for trading strategies.
- The best strategies resulting from the random data are now the benchmark that you have to beat using real data.
- Mine the real data, discard anything that isn’t better than the best models from the random data (this ensures that you have found an actual edge despite the excessive mining).
- Use cross validation to more accurately estimate the performance of the models and avoid curve fitting.
- Test the model out of sample, and retain it if it performs reasonably well compared to the in-sample results.

The point is essentially to generate an environment in which we know that we have no edge whatsoever, mine the data for the best possible results, and then use those as a benchmark that we have to clear in order to prove that an edge exists in the real data.
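In code, the core of the benchmark idea is just a maximum over the no-edge results followed by a filter on the real ones. The sketch below is deliberately generic: `score` stands for whatever performance metric is chosen, and all names are hypothetical.

```python
def random_data_hurdle(random_series, strategies, score):
    """Best score any candidate strategy achieves on any no-edge (random) series."""
    return max(score(s, series) for series in random_series for s in strategies)


def surviving_strategies(real_series, strategies, score, hurdle):
    """Keep only the strategies whose real-data score beats the random-data hurdle."""
    return [s for s in strategies if score(s, real_series) > hurdle]
```

Cross validation and out-of-sample testing would then be applied to the survivors only.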

What they do after this is also quite interesting and important: checking the correlation between the newly discovered models and the models they already use. This ensures that any new “edge” they incorporate is a novel one and not simply a copy of something they already have. Supposedly this approach has yielded over 1500 different signals which they then use to trade, on medium-term horizons (if I remember correctly their average holding period is roughly one week). The issue of combining the predictions of 1500 signals into a decision to trade or not trade is beyond the scope of this post, but it’s a very interesting “ensemble model” problem.

It is clear that the approach requires not only rigorous statistical work, but also tons and tons of computing power (the procedure is highly parallelizable however, so you can just throw hardware at it to make it go faster). One potentially interesting way of tempering that requirement would be using genetic algorithms instead of brute force to search for new strategies. There are tricky issues with that approach, though: constructing the genome so that it can describe all possible trading models we want to look at, for example. How does one encode a wide array of chart patterns in a genome? There do not seem to be obvious/intuitive solutions.

### Generating random data sets

There are several issues that have to be looked at here. Do we randomly sample the real data, or do we take the parameters of that data and plug them into a known statistical distribution to generate completely new numbers? How many times do we repeat this procedure? In either case we are bound to lose some features of real financial time series, but this is probably a good thing since those features may result in true exploitable edges. It is important to generate a healthy number of data series. Some are simply going to be “better” than others for any one particular trading model, so testing over a single randomly generated series is not enough.

In general we want at least the semblance of a “real” data series. As such we can’t simply select random OHLC data; it would just result in a nonsensical time series with giant gaps all over the place. Instead I will use the following procedure:

- Start by selecting a random day’s OHLC points. This forms our first data point.
- Select any random day, and compute the day’s (close to close) percentage return from the previous day.
- Use this value to generate the next fake closing price.
- From that same (real) day, calculate the OHL prices in terms relative to the closing price.
- Use those relative prices to generate the fake OHL prices.
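The steps above can be sketched as a simple bootstrap. This is a minimal version under my own assumptions: bars are plain `(open, high, low, close)` tuples in date order, days are sampled with replacement, and the function name is hypothetical.

```python
import random


def generate_fake_ohlc(bars, length, rng=None):
    """Bootstrap a synthetic OHLC series from real `bars`.

    `bars` is a list of (open, high, low, close) tuples in date order.
    """
    if rng is None:
        rng = random.Random()
    fake = [bars[rng.randrange(len(bars))]]  # a random real day as the first bar
    while len(fake) < length:
        t = rng.randrange(1, len(bars))       # needs a previous day, so skip index 0
        o, h, l, c = bars[t]
        ret = c / bars[t - 1][3] - 1.0        # real close-to-close return
        new_close = fake[-1][3] * (1.0 + ret)
        # Scale the real day's open/high/low relative to its close onto the fake close.
        fake.append((new_close * o / c, new_close * h / c, new_close * l / c, new_close))
    return fake
```

Because each fake bar keeps the real day’s intraday shape, overnight gaps between a fake open and the previous fake close are the same relative size as the real ones.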

I find this approach gives rather good results, producing series that look realistic and give the appearance of trends, different volatility regimes, etc.

### The models

Naturally I can’t test the billions upon billions of models that they test at QIM, and taking the model-agnostic approach is currently beyond my abilities. I can kind of get around the issue by testing a very narrow range of models: moving average crossovers (another simple and interesting thing to test would be 1- or 2-day candlestick patterns). This still leaves a significant number of parameters to test:

- The type of moving average to use (simple, exponential, or Hull)
- The length of each moving average.
- The values that the moving averages will be based on (open, high, low, or close).
- The holding period. I’ll be using a technical entry, but a partially time-based exit. This may or may not be a good idea, but I’m running with it.
- Trend-following vs contrarian (i.e. trade in the direction of the “fast” moving average or against it).

### Evaluating the results

An important question remains: what metric do we use to evaluate the models? The use of cross validation presents unique problems in performance measurement, and we have to take these into account from this stage, because these results will be used for comparison to the real ones later on.

Drawdown is a problematic measure because drawdown extremes tend to be rare. When dividing a set into N folds for cross validation, a set of parameters may be rejected simply because a certain period generated a high drawdown, despite this drawdown being consistent with long-term expectations.

Another issue arises with the use of annualized returns: they may be rather meaningless if the signal fires very frequently. If what we care about is short-term predictability, it may be more prudent to look at average daily returns after a signal, instead of CAGR. This could also be ameliorated by taking trading costs into account, as weak but frequent signals would be filtered out.

In the end, many of these decisions depend on the trader’s choice of style. Every trader must decide for themselves what risks they care about, and in what proportion to each other. As an attempt at a balanced performance metric, I will be using my **Tra**ding **Sy**stem **Co**nsistency, **D**rawdown, **R**eturn **A**symmetry, **Vo**latility, and **P**rofit **Fa**ctor **Co**mbination **M**etric (or TRASYCODRAVOPFACOM for short), which is calculated as follows:

St. Dev. is the annualized standard deviation of daily returns, and the profit factor is calculated based on daily returns.

The TRASYCODRAVOPFACOM still has weaknesses: a set of parameters may pick only a tiny number of trades over the years. If they’re successful enough, it can lead to a high score but a useless signal. To avoid this I’ll also be setting the minimum number of trades to 100, a reasonable hurdle given the 17-year sample.

### The random return results

Using a brute force approach, I collected approximately 704,000 results from 5 randomly generated series. It took several hours on my overclocked i5-2500K, so it’s definitely not a viable “real-world” approach (I am a terrible programmer, so some of the slowness is of my own making). The results look like you’d expect them to, with a few outliers at the top and bottom:

Here are the best values achieved:

Note that this isn’t a “universal” hurdle: it’s a hurdle for this specific subset of moving average signals, on the GBPUSD pair. I am certain that a wide array of signals and data would generate higher hurdles.

### Genetic Algorithm?

Brute force takes ages, even for just 5 return series, which is far too few to draw any conclusions from. Are there any faster ways than brute force to find the best possible results from our random data? If this were a “normal” dataset, I would say yes, of course! However, I was not sure about this case due to the randomly generated data that we are dealing with.

If the data is random, does it follow that the optimal strategy parameters are also randomly distributed? Are they uniformly distributed or are there “clusters” that, due to somehow exploiting the structure of the time series, perform better or worse than the average? The question is: is the performance slope around local maxima smooth, or not? A simple method to make this thing go faster is to throw the problem into a genetic algorithm, but a GA will offer no performance improvement if the performance is uniformly randomly distributed.

Testing this is simple: I just ran a GA search on the same 5 series I brute forced above. If the GA results are similar to the brute force results, we can use the GA and save a lot of time. As long as there are enough populations, and they are large enough (I settled on 4 populations with 40 chromosomes each), the results are “close enough”: roughly 3-20% lower than the brute force (max CAGR was 7.043%, max avg. daily return was 0.176%). It might be a good idea to scale the GA results by, say, an additional 10-20% in order to make up for this deficit.

I then generated 100 series and put the GA to use. Here are the results:

And here are the distributions of maximum values achieved for each individual series:

These results have set the bar rather high. One might imagine that throwing out everything below this hurdle will leave us with very little (nothing?) in the end. But if Woodriff is to be believed, he has found upwards of 1500 signals that perform better than the hurdle (and that’s 1500 signals that were uncorrelated enough with each other that they were added to their models). So there’s got to be a lot of interesting stuff to find!

In part 2 I will take a look at cross validation and what we can do with the real data.

The VXV is the VIX’s longer-term brother; it measures implied volatility 3 months out instead of 30 days out. The ratio between the VIX and the VXV captures the differential between short-term and medium-term implied volatility. Naturally, the ratio spends most of its time below 1, typically only spiking up during highly volatile times.

It is immediately obvious by visual inspection that, just like the VIX itself, the VIX:VXV ratio exhibits strong mean reverting tendencies on multiple timescales. It turns out that it can be quite useful in forecasting SPY, VIX, and VIX futures changes.

### Short-term extremes

A simplistic method of evaluating short-term extremes is the distance of the VIX:VXV ratio from its 10-day simple moving average. When the ratio is at least 5% above the 10SMA, next-day SPY returns are, on average, **0.303%** (front month VIX futures fall by 0.101%). Days when the ratio is more than 5% below the 10SMA are followed by **-0.162%** returns for SPY. The equity curve shows the returns on the long side:
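The short-term signal itself is straightforward to compute. The sketch below assumes the moving average includes the current day’s ratio, which the text does not specify; the function names are hypothetical.

```python
def ratio_deviation(vix, vxv, window=10):
    """Fractional distance of the VIX:VXV ratio from its simple moving average."""
    ratios = [a / b for a, b in zip(vix, vxv)]
    out = [None] * len(ratios)
    for i in range(window - 1, len(ratios)):
        sma = sum(ratios[i - window + 1 : i + 1]) / window  # includes today (assumption)
        out[i] = ratios[i] / sma - 1.0
    return out


def extreme_flags(deviation, threshold=0.05):
    """+1 when the ratio is at least 5% above its SMA, -1 when more than 5% below."""
    flags = []
    for d in deviation:
        if d is None:
            flags.append(0)
        elif d >= threshold:
            flags.append(1)
        elif d < -threshold:
            flags.append(-1)
        else:
            flags.append(0)
    return flags
```

Averaging next-day SPY returns over the days flagged +1 and -1 would give the two conditional means quoted above.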

### Long-term extremes

When the ratio hits a 200-day high, next-day SPY returns have been **0.736%** on average. Implied volatility does not fall as one might expect, however.

More interestingly, the picture is reversed if we look at slightly longer time frames. 200-day VIX:VXV ratio extremes can predict pullbacks in SPY quite well. The average daily SPY return for the 10 days following a 200-day high is **-0.330%**. This is naturally accompanied by increases in the VIX of 1.478% per day (the front month futures show returns of 1.814% per day in the same period). It’s not a fail-proof indicator (it picked the bottom in March 2011), but I like it as a sign that things could get ugly in the near future. We recently saw a new 200-day high on the 19th of December: since then SPY is down approximately 1%.

This is my last post for the year, so I leave you with wishes for a happy new year! May your trading be fun and profitable in 2013.
