My Performance Analysis Tools

I have gotten a couple of emails asking me about the topic of analyzing performance, so I decided to detail the tools I use. Measuring performance and attributing success or failure to the right factors is an extremely important part of the trading process. Actually trading a strategy will often reveal aspects that don’t come up in the research stage. Unexpected things happen, revealing previously hidden strengths or weaknesses. Strategies improve or deteriorate through time. Execution issues eat into returns. Patterns emerge that can be exploited to enhance returns or limit risk.

These situations, and performance evaluation in general, are a crucial part of the research/trading/performance loop:


A lack of attention to performance, and the underlying factors that drive it, will have a deleterious effect both on your long-term trading results and the things that you will discover in the research stage.

I’ll demonstrate the tools using two strategies, one of which has been going well, and the other not: 1) a rather generic GTAA momentum/trend-following strategy that has been running for a bit over a year, and 2) an AAPL swing trading strategy that’s been in “trial” mode for the last 6 months or so.

My performance analysis system, the QUSMA Portfolio and Trade Analytics Suite, is primarily based around the concept of a “trade”. A trade is a unit that can contain any number of related orders and cash transactions (dividends, taxes, etc.). A pair trade would include both legs in a single trade, for example. The underlying data is imported using IB’s flex queries, which have a very simple and easy-to-handle XML structure.

Trades are assigned to a “strategy” and can also be assigned any number of tags. Some of the things that I use tags for are: trade direction (long/short/both), trade length, developed/developing country, asset class, etc. Notes with images can also be attached to trades, which is incredibly useful for reviews. Finally, the trades can be filtered on any number of criteria to produce reports, and compared against custom benchmarks.

A trade and its two associated orders.

There are some general principles that summarize my approach to performance measurement:

  • Execution and commissions are extremely important.
  • Separate timing from sizing.
    • Statistics on trades in both dollar terms and % terms.
  • Separate capital allocations to strategies from total capital.
    • Statistics on returns both on capital allocated to a strategy (ROAC) and on total capital (ROTC).
  • Always think probabilistically and in terms of expectations.
  • The more ways you can find to look at the data, the better.


Simple visual inspection is my starting point, and I think it’s very important. The simple act of staring at charts often leads to new research ideas.

GTAA strategy: a number of losing trades in TLT.


So let’s get started with the graphs and stats. At the top, the standard dollar PnL (daily and close-to-close) and equity curves (both in terms of ROAC and ROTC), which are also plotted against a benchmark:


GTAA strategy cumulative returns on allocated capital. Chart also comes in ROTC flavor.


AAPL strategy cumulative PnL.


Next up are the trade statistics. Commissions are right up there; it’s very important to keep in mind how much you are losing to those costs. A few basis points may not seem like much, but they can quickly eat up a significant portion of your profits. Note that all the stats are given in both dollar and percentage terms, in order to separate timing effects from sizing effects.

AAPL strategy.


Results by calendar month:

AAPL strategy. Also comes in ROTC flavor.


Probably the most important bit: statistics on daily returns, the standard ratios, and so forth. The MAR ratio (CAGR divided by maximum drawdown) is the number that matters most to me. The reason is simple: it determines my leverage constraints, and thus my returns. A high Sharpe ratio is meaningless if you can’t lever up. Note how the simple, static benchmark portfolio has destroyed the GTAA approach:

GTAA strategy. Benchmark is a 20/15/15/20/10/10/10 % mix of SPY/EFA/EEM/IEF/LQD/VNQ/DBC respectively. Stats are also available for ROTC.
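
For reference, here is a minimal Python sketch of the MAR calculation, taking MAR as CAGR divided by maximum drawdown. The function names and the 252-day year are assumptions made for illustration; this is not the code behind the suite.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def mar_ratio(daily_returns, periods_per_year=252):
    """CAGR divided by maximum drawdown, computed from simple daily returns."""
    equity = [1.0]
    for r in daily_returns:
        equity.append(equity[-1] * (1.0 + r))
    years = len(daily_returns) / periods_per_year
    cagr = equity[-1] ** (1.0 / years) - 1.0
    mdd = max_drawdown(equity)
    # No drawdown at all means the ratio is unbounded.
    return cagr / mdd if mdd > 0 else float("inf")
```

As a rough rule of thumb, if you lever a strategy until its historical maximum drawdown equals the drawdown you can tolerate, the return you end up with is approximately that tolerated drawdown times the MAR ratio (ignoring compounding and financing costs).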


Some simple benchmarking stuff:

GTAA strategy vs diversified benchmark.


Histograms of daily returns, and returns per trade. Again, it’s important to look at both dollar and percentage results:

AAPL strategy.

GTAA strategy.


Also, holding period histogram:

AAPL strategy.


Position sizing vs trade returns. Naive risk parity seems to be doing alright:

GTAA strategy.


Trade length vs returns; the relationship here is pretty clear:

GTAA strategy.


The movement capture stats measure how good the strategy is at capturing returns. GU is gross upside, or the gross positive returns during the period. UC% is the percentage of that movement that was captured by being long, UM% is the percentage of the movement that was missed by being flat, while UL% is the percentage of the movement that was lost due to being short. The calculations are repeated for downside movement.

GTAA strategy. Being long-only, only upside movement has been captured.
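
The movement capture stats described above can be sketched in a few lines of Python. The upside definitions follow the text; the source only says the calculations are repeated for the downside, so the mirror-image labels used here (DC%, DM%, DL%) and the idea that downside movement is "captured" by being short are assumptions.

```python
def movement_capture(returns, positions):
    """Split gross up/down movement by the position held on each day.

    returns:   simple daily returns of the instrument
    positions: +1 (long), 0 (flat) or -1 (short) held over each return
    """
    up = {1: 0.0, 0: 0.0, -1: 0.0}
    down = {1: 0.0, 0: 0.0, -1: 0.0}
    for r, pos in zip(returns, positions):
        if r > 0:
            up[pos] += r
        elif r < 0:
            down[pos] += -r

    gross_up = sum(up.values())
    gross_down = sum(down.values())

    def pct(part, total):
        return 100.0 * part / total if total > 0 else 0.0

    return {
        "GU": gross_up,
        "UC%": pct(up[1], gross_up),    # upside captured by being long
        "UM%": pct(up[0], gross_up),    # upside missed by being flat
        "UL%": pct(up[-1], gross_up),   # upside lost by being short
        "GD": gross_down,
        "DC%": pct(down[-1], gross_down),  # assumed: captured by being short
        "DM%": pct(down[0], gross_down),   # assumed: missed by being flat
        "DL%": pct(down[1], gross_down),   # assumed: lost by being long
    }
```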


Cumulative percent returns, by instrument. A similar chart with dollar PnL by instrument also exists.

GTAA strategy.


Autocorrelation and partial autocorrelation stats based on daily returns:

GTAA strategy. High autocorrelation values can be exploited both to enhance returns and for risk management.
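
A sample autocorrelation at a given lag is simple to compute from the daily returns. This is a generic textbook estimator written as an illustrative Python sketch, not necessarily the exact one used in the suite:

```python
def autocorrelation(returns, lag=1):
    """Sample autocorrelation of a return series at the given lag."""
    n = len(returns)
    mean = sum(returns) / n
    dev = [r - mean for r in returns]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series has zero variance")
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom
```

A clearly positive lag-1 value means good days tend to follow good days, which is the kind of pattern that can be used to scale exposure up or down.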


Standard value at risk calculations, based on resampled historical data. I’ll be adding the option to use parametric methods in the future.


GTAA strategy: 10-day value at risk.
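
A minimal sketch of a resampling-based VaR in Python: draw daily returns with replacement, compound them over the horizon, and read off the loss at the chosen confidence level. The function name, the default parameters, and the simple percentile rule are assumptions for illustration.

```python
import random


def resampled_var(daily_returns, horizon=10, confidence=0.95,
                  n_sims=10000, seed=None):
    """Loss (as a positive fraction) not exceeded with the given confidence."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_sims):
        growth = 1.0
        # Sample `horizon` historical daily returns with replacement.
        for r in rng.choices(daily_returns, k=horizon):
            growth *= 1.0 + r
        outcomes.append(growth - 1.0)
    outcomes.sort()
    cutoff = outcomes[int((1.0 - confidence) * n_sims)]
    return max(0.0, -cutoff)
```

Because each day is drawn independently, this version ignores volatility clustering; sampling blocks of consecutive days is one way to retain it.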


Monte Carlo simulation. It simply uses historical data, either trades or daily returns (either ROAC or ROTC). Sampling can be done with replacement or without (the latter simply re-orders the existing equity curve). There is also an option to use N consecutive days/trades, which can capture volatility clustering and autocorrelation effects. The analysis returns confidence intervals for the equity curve, as well as the cumulative and point distributions of maximum drawdowns.

GTAA strategy: there is a 10% chance of a drawdown worse than 18% in the next 500 trading days.
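
The drawdown part of the simulation can be sketched as follows in Python. It covers only sampling with replacement, with an optional block length for drawing N consecutive days; names and defaults are illustrative assumptions.

```python
import random


def max_drawdown(returns):
    """Maximum peak-to-trough decline of the compounded equity curve."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def simulate_drawdowns(returns, horizon=500, block=1, n_sims=2000, seed=None):
    """Sorted max drawdowns from resampled paths built out of blocks of
    `block` consecutive historical returns (drawn with replacement)."""
    rng = random.Random(seed)
    last_start = len(returns) - block
    drawdowns = []
    for _ in range(n_sims):
        path = []
        while len(path) < horizon:
            start = rng.randint(0, last_start)
            path.extend(returns[start:start + block])
        drawdowns.append(max_drawdown(path[:horizon]))
    return sorted(drawdowns)


def drawdown_exceeded_with_probability(sorted_drawdowns, probability):
    """Drawdown level that is exceeded in roughly `probability` of the paths."""
    index = int((1.0 - probability) * len(sorted_drawdowns))
    return sorted_drawdowns[min(index, len(sorted_drawdowns) - 1)]
```

Calling `drawdown_exceeded_with_probability(simulate_drawdowns(returns), 0.10)` gives the kind of statement shown in the caption: the drawdown that has a 10% chance of being exceeded over the simulated horizon.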


Finally, some simple stats and charts on execution. All of my trades are either at the close or the open, so those are the prices I benchmark against. Below are stats from the AAPL strategy’s buy orders around the close.

Top: slippage vs time difference in seconds from benchmark. Middle: slippage by order type. Bottom: Slippage histogram.


I think that the biggest weakness in my toolset is the lack of interaction with backtesting results. These can be used in two main ways: 1) comparing theoretical results to real trading results, and 2) as an extended dataset for the risk management functions. Also, I don’t do any stock picking, but if I did that would entail several additions, mainly performance attribution by country, sector, etc. as well as analyzing value/size/momentum factor exposures.

Leave a comment and tell us what you like to use: is the standard stuff enough for you, or do you use any obscure ratios or unique charts?

Volatility-Based Position Sizing of SPY Swing Trades: Realized vs VIX vs GARCH

A simple post on position sizing, comparing three similar volatility-based approaches. In order to test the different sizing techniques I’ve set up a long-only strategy applied to SPY, with 4 different signals:

On top of that sits an IBS filter, allowing long positions only when IBS is below 50%. A position is taken if any of the signals is triggered. Entries and exits at the close of the day, no stops or targets. Results include commissions of 1 cent per share.
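
For readers unfamiliar with it, IBS (internal bar strength) is the position of the close within the day’s range. A minimal Python sketch of the filter; the function names and the handling of zero-range bars are assumptions made for illustration:

```python
def ibs(high, low, close):
    """Internal bar strength: 0 when closing at the low, 1 at the high."""
    if high == low:
        return 0.5  # assumption: treat a zero-range bar as neutral
    return (close - low) / (high - low)


def passes_ibs_filter(high, low, close, limit=0.5):
    """Allow long entries only when the close sits in the lower part of the range."""
    return ibs(high, low, close) < limit
```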

Sizing based on realized volatility uses the 10-day realized volatility, and then adjusts the size of the position such that, if volatility remains unchanged, the portfolio would have an annualized standard deviation of 17%. The fact that the strategy is not always in the market decreases volatility, which is why to get close to the ~11.5% standard deviation of the fixed fraction sizing we need to “overshoot” by a fair bit.
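
The sizing rule amounts to dividing the volatility target by the current annualized volatility estimate. A minimal Python sketch, assuming close-to-close returns, a sample standard deviation, and a 252-day year; the optional `max_leverage` cap is a hypothetical parameter:

```python
import math
import statistics


def vol_target_leverage(closes, target_vol=0.17, lookback=10,
                        max_leverage=None, periods_per_year=252):
    """Position size (as a multiple of capital) that would hit `target_vol`
    if realized volatility stayed at its recent level."""
    if len(closes) < lookback + 1:
        raise ValueError("need at least lookback + 1 closes")
    window = closes[-(lookback + 1):]
    daily = [b / a - 1.0 for a, b in zip(window, window[1:])]
    realized = statistics.stdev(daily) * math.sqrt(periods_per_year)
    if realized == 0.0:
        raise ValueError("realized volatility is zero")
    leverage = target_vol / realized
    if max_leverage is not None:
        leverage = min(leverage, max_leverage)
    return leverage
```

The same function works for the GARCH and VIX variants if the realized estimate is swapped for the forecast or the implied volatility, with the target raised accordingly.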

The same idea is used with the GARCH model, which is used to forecast volatility 3 days ahead. That value is then used to adjust size. And again the same concept is used with VIX, but of course option implied volatility tends to be greater than realized volatility, so we need to overshoot by even more, in this case to 23%.

Let’s take a look at the strategy results with the simplest sizing approach (allocating all available capital):

Top panel: equity curve. Middle panel: drawdown. Bottom panel: leverage.

Returns are the highest during volatile periods, and so are drawdowns. This results in an uneven equity curve, and highly uneven risk exposure. There is, of course, no reason to let the market decide these things for us. Let’s compare the fixed fraction approach to the realized volatility- and VIX-based sizing approaches:


These results are obviously unrealistic: nobody in their right mind would use 600% leverage in this type of trade. A Black Monday would very simply wipe you out. These extremes are rather infrequent, however, and leverage can be capped to a lower value without much effect.

With the increased leverage comes an increase in average drawdown, with >5% drawdowns becoming far more frequent. The average time to recovery is also slightly increased. Given the benefits, I don’t see this as a significant drawback. If you’re willing to tolerate a 20% drawdown, the frequency of 5% drawdowns is not that important.

On the other hand, the deepest drawdowns naturally tend to come during volatile periods, and the decrease of leverage also results in a slight decrease of the max drawdown. Returns are also improved, leading to better risk-adjusted returns across the board for the volatility-based sizing approaches.

The VIX approach underperforms, and the main reason is obviously that it’s not a good measure of expected future volatility. There is also the mismatch between the VIX’s 30-day horizon and the much shorter horizon of the trades. GARCH and realized volatility result in very similar sizing, so the realized volatility approach is preferable due to its simplicity.


Blueprint for a Backtesting and Trading Software Suite

Posting has been slow lately because I’ve been busy with a bunch of other stuff, including the CFA Level 3 exam last weekend. I’ve also begun work on a very ambitious project: a fully-featured all-in-one backtesting and live trading suite, which is what prompted this post.

Over the last half year or so I’ve been moving toward more complex tools (away from excel, R, and MATLAB), and generally just writing standalone backtesters in C# for every concept I wanted to try out, only using Multicharts for the simplest ideas. This approach is, of course, incredibly inefficient, but the software packages available to “retail” traders are notoriously horrible, and I have nowhere near the capital I’d need to afford “real” tools like QuantFACTORY or Deltix.

The good thing about knowing how to code is that if a tool doesn’t exist you can just write it, and that’s exactly what I’m doing. Proper portfolio-level backtesting and live trading that’ll be able to easily do everything from intraday pairs trading to long term asset allocation and everything in-between, all under the same roof. On the other hand it’s also tailored to my own needs, and as such contains no plans for things like handling fundamental data. Most importantly it’s my dream research platform that’ll let me go from idea, to robust testing & optimization, to implementation very quickly. Here’s what the basic design looks like:


What’s the point of posting about it? I know there are many other people out there facing the same issues I am, so hopefully I can provide some inspiration and ideas on how to solve them. Maybe it’ll prompt some discussion and idea-bouncing, or perhaps even collaboration.

Most of the essential stuff has already been laid down, so basic testing is already possible. A simple example based on my previous post can showcase some essential features. Below you’ll find the code behind the PatternFinder indicator, which uses the Accord.NET library’s k-d tree and k nearest neighbor algorithm implementation to do candlestick pattern searches as discussed here. Many elements are specific to my system, but the core functionality is trivially portable if you want to borrow it.

Note the use of attributes to denote properties as inputs, and set their default values. Options can be serialized/deserialized for easy storage in files or a database. Priority settings allow the user to specify the order of execution, which can be very important in some cases. Indexer access works with [0] being the current bar, [1] being the previous bar, etc. Different methods for historical and real time bars allow for a ton of optimization to speed up processing when time is scarce, though in this case there isn’t much that can be done.


The VariableSeries class is designed to hold time series, synchronize them across the entire parent object, prevent data snooping, etc. The Indicator and Signal classes are all derived from VariableSeries, which is the basis for the system’s modularity. For example, in the PatternFinder indicator, OHLC inputs can be modified by the user through the UI, e.g. to make use of the values of an indicator rather than the instrument data.


The backtesting analysis stuff is still in its early stages, but again the foundations have been laid. Here are some stats using a two-day PatternFinder combined with IBS, applied on SPY:


Here’s the first iteration of the signal analysis interface. I have added 3 more signals to the backtest: going long for 1 day at every 15 day low close, the set-up Rob Hanna posted yesterday over at Quantifiable Edges (staying in for 5 days after the set-up appears), and UDIDSRI. The idea is to be able to easily spot redundant set-ups, find synergies or anti-synergies between signals, and easily get an idea of the marginal value added by any one particular signal.


And here’s some basic Monte Carlo simulation stuff, with confidence intervals for cumulative returns and PDF/CDF of the maximum drawdown distribution:


Here’s the code for the PatternFinder indicator. Obviously it’s written for my platform, but it should be easily portable. The “meat” is all in CalcHistorical() and GetExpectancy().

using System;
using System.Linq;
// KDTree<T>; the namespace depends on the Accord.NET version
// (later releases moved it to Accord.Collections).
using Accord.MachineLearning.Structures;

/// <summary>
/// K nearest neighbor search for candlestick patterns
/// </summary>
public class PatternFinder : Indicator
{
    public int PatternLength { get; set; }

    public int MatchCount { get; set; }

    public int MinimumWindowSize { get; set; }

    public bool VolatilityAdjusted { get; set; }

    public bool Overnight { get; set; }

    public bool WeighExpectancyByDistance { get; set; }

    public bool Classification { get; set; }

    public double ClassificationLimit { get; set; }

    public string DistanceType { get; set; }

    public VariableSeries<decimal> Open { get; set; }

    public VariableSeries<decimal> High { get; set; }

    public VariableSeries<decimal> Low { get; set; }

    public VariableSeries<decimal> Close { get; set; }

    public VariableSeries<decimal> AdjClose { get; set; }

    private VariableSeries<double> returns;
    private VariableSeries<double> stDev;
    private KDTree<double> _tree;

    public PatternFinder(QSwing parent, string name = "PatternFinder", int BarsCount = 1000)
        : base(parent, name, BarsCount)
    {
        Priority = 1;
        returns = new VariableSeries<double>(parent, BarsCount);
        stDev = new VariableSeries<double>(parent, BarsCount) { DefaultValue = 1 };
    }

    internal override void Startup()
    {
        // Each bar contributes O/C, H/C and L/C, plus a close-to-close ratio
        // for every bar except the oldest one: 4 * PatternLength - 1 numbers.
        _tree = new KDTree<double>(PatternLength * 4 - 1);
        switch (DistanceType)
        {
            case "Euclidean":
                _tree.Distance = Accord.Math.Distance.Euclidean;
                break;
            case "Absolute":
                _tree.Distance = AbsDistance;
                break;
            case "Chebyshev":
                _tree.Distance = Accord.Math.Distance.Chebyshev;
                break;
            default:
                _tree.Distance = Accord.Math.Distance.Euclidean;
                break;
        }
    }

    public override void CalcHistorical()
    {
        if (VolatilityAdjusted && CurrentBar > 0)
            returns.Value = (double)(AdjClose[0] / AdjClose[1] - 1);

        if (VolatilityAdjusted && CurrentBar > 11)
            stDev.Value = returns.StandardDeviation(10);

        if (CurrentBar < PatternLength + 1) return;

        // Estimate from the patterns stored so far, before the outcome of
        // the current bar is added to the tree.
        if (CurrentBar > MinimumWindowSize)
            Value = GetExpectancy(GetCoords());

        double ret = Overnight ? (double)(Open[0] / Close[1] - 1) : (double)(AdjClose[0] / AdjClose[1] - 1);
        double adjret = ret / stDev[0];

        // Store the pattern ending on the previous bar together with the
        // return that followed it.
        if (Classification)
            _tree.Add(GetCoords(1), adjret > ClassificationLimit ? 1 : 0);
        else
            _tree.Add(GetCoords(1), adjret);
    }

    public override void CalcRealTime()
    {
        if (VolatilityAdjusted && CurrentBar > 0)
            returns.Value = (double)(AdjClose[0] / AdjClose[1] - 1);

        if (VolatilityAdjusted && CurrentBar > 11)
            stDev.Value = returns.StandardDeviation(10);

        if (CurrentBar > MinimumWindowSize)
            Value = GetExpectancy(GetCoords());
    }

    private double GetExpectancy(double[] coords)
    {
        if (!WeighExpectancyByDistance)
            return _tree.Nearest(coords, MatchCount).Average(x => x.Node.Value) * stDev[0];

        // Inverse squared-distance weighting. Note: an exact match
        // (distance 0) makes the weight infinite and the result NaN.
        var nodes = _tree.Nearest(coords, MatchCount);
        double totweight = nodes.Sum(x => 1 / Math.Pow(x.Distance, 2));
        return nodes.Sum(x => x.Node.Value * ((1 / Math.Pow(x.Distance, 2)) / totweight)) * stDev[0];
    }

    private static double AbsDistance(double[] x, double[] y)
    {
        return x.Select((t, i) => Math.Abs(t - y[i])).Sum();
    }

    private double[] GetCoords(int offset = 0)
    {
        double[] coords = new double[PatternLength * 4 - 1];
        for (int i = 0; i < PatternLength; i++)
        {
            coords[4 * i] = (double)(Open[i + offset] / Close[i + offset]);
            coords[4 * i + 1] = (double)(High[i + offset] / Close[i + offset]);
            coords[4 * i + 2] = (double)(Low[i + offset] / Close[i + offset]);

            if (i < PatternLength - 1)
                coords[4 * i + 3] = (double)(Close[i + offset] / Close[i + 1 + offset]);
        }
        return coords;
    }
}

Coming up Soon™: a series of posts on cross validation, an in-depth paper on IBS, and possibly a theory-heavy paper on the low volatility effect.

Mining for Three Day Candlestick Patterns

I’ve been thinking a lot about candlestick patterns lately but grew tired of trying to generate ideas and instead decided to mine for them. I must confess I didn’t expect much from such a simplistic approach, so I was pleasantly surprised to see it working well. Unfortunately I wasn’t able to discover any short set-ups. The general bias of equity markets toward the upside makes it difficult to find enough instances of patterns that are followed by negative returns.

The idea is to mine past data for similar 3 day patterns, and then use that information to make trading decisions. There are several choices we must make:

  • The size of the lookback window. I use an expanding window that starts at 2000 days.
  • Once we find similar patterns, how do we choose which ones to use?
  • How do we measure the similarity between the patterns?

To fully describe a three day candlestick pattern we need 11 numbers. The close-to-close percentage change from day 1 to day 2, and from day 2 to day 3, as well as the positions of the open, high, and low relative to the close for each day.

To measure the degree of similarity between any two 3-day patterns, I tried both the sum of absolute differences and the sum of the squared differences between those 11 numbers; the results were quite similar. It would be interesting to try to optimize individual weights for each number, as I imagine some are more important than others.

The final step is to select a number of the closest patterns we find, and simply average their next-day returns to arrive at an expected return.
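
Put together, the search can be sketched in a few lines of Python. This is a brute-force illustration under my own naming, not the code used for the results in this post; `history` is assumed to contain only patterns whose next-day return was already known at the time of the estimate.

```python
def pattern_features(bars):
    """Describe a pattern as ratios relative to each day's close.

    bars: list of (open, high, low, close) tuples, oldest first.
    Three bars give 3 * 3 intraday numbers + 2 close-to-close changes = 11.
    """
    feats = []
    for i, (o, h, l, c) in enumerate(bars):
        feats += [o / c - 1.0, h / c - 1.0, l / c - 1.0]
        if i > 0:
            feats.append(c / bars[i - 1][3] - 1.0)
    return feats


def distance(a, b, squared=True):
    """Sum of squared (or absolute) differences between two patterns."""
    if squared:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b))


def expected_return(history, current, k=50, squared=True):
    """Average next-day return of the k historical patterns closest to `current`.

    history: list of (features, next_day_return) pairs.
    """
    ranked = sorted(history, key=lambda item: distance(item[0], current, squared))
    nearest = ranked[:k]
    return sum(ret for _, ret in nearest) / len(nearest)
```

Sorting the whole history on every bar is slow for long samples; a spatial index such as a k-d tree does the same lookup much faster.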

Expected vs realized returns for SPY, 50 closest patterns by absolute difference. Numbers above the bars indicate the number of instances in each bucket.

How do we choose which patterns are “close enough” to use? Choose too few and the sample will be too small. Choose too many and you risk using irrelevant data. That’s a number that we’ll have to optimize.

Histogram of expected return estimates for different sample sizes.

When comparing the results we also run into another problem: the smaller the sample, the more spread out the expected return estimates will be, which means more trades will be chosen given a certain minimum limit for entry. My solution was to choose a different limit for trade entry, such that all sample sizes would generate the same number of days in the market (300 in this case). Here are the walk-forward results:

closest count tests

The trade-off between sample size and relevance is clear, and the “sweet spot” appears to be somewhere in the 50-150 range or so, for both the absolute difference and squared difference approaches. Depending on how selective you want to be, you can decrease the limit and trade off more trades for lower expected returns. For me, 30 bp is a reasonable area to aim for.

A nice little addition is to use IBS by filtering out any trades with IBS > 50%. Using squared differences, I select the 50 closest patterns. When their average next-day return is greater than 0.2%, a long position is taken. The results are predictably great:

equity curves with without IBS

squared 50 closest 0.2pct limit ibs filter results

The IBS filter removes close to 40% of days in the market yet maintains essentially the same CAGR, while also more than halving the maximum drawdown.

Let’s take a look at some of the actual patterns. Using squared differences, the 50 closest patterns, and a 0.2% limit, the last successful trade was on February 26, 2013. The expected return on that day was 0.307%. Here’s what those 3 days looked like, as well as the 5 closest historical patterns:


As you can see below, even the 50th closest pattern seems to be, based on visual inspection, rather close. The “main idea” of the pattern seems to be there:

patterns 50th closest

Here are the stats from a bunch of different equity index ETFs, using squared differences, the 50 closest patterns, a 0.2% expected return limit, and the IBS < 0.5 filter.

ETFs square 50 closest 0.2 ibs filter results

The 0.2% limit seems to be too low for some of them, producing too many trades. Perhaps setting an appropriate limit per-instrument would be a good idea.

The obvious path forward is to also produce 2-day, 4-day, 5-day, etc. versions, perhaps with optimized distance weighting and some outlier filtering, and combine them all in a nice little ensemble to get your predictions out of. The implementation is left as an exercise for the reader.

A Quick Look at Bitcoin Returns

Bitcoin seems to be all the rage these days, and I’m jumping on the bandwagon. Quandl tweeted about their bitcoin data today so I decided I’d have a look at it. I have tested a bunch of popular/“standard” ideas, and the results aren’t really surprising, though they do illuminate the trend-y (bubbl-y) character of the bitcoin market. BTC prices do not revert like equities but show strong momentum, both in the short and medium term. IBS is useless, while trend following works like it does everywhere else. The (daily) data covers the period from 17/7/2010 to today.

Descriptive Stats


The mean simple daily return has been 1.012%, while the annualized standard deviation has been 121.70%. The distribution of returns is obviously fat-tailed (with a kurtosis of 8.62), though somewhat surprisingly (to me at least), slightly positively skewed (0.76).

Up/Down Streaks

Strong up streaks tend to be followed by high returns over the medium term, and there has been a surprisingly large number of these streaks given the small amount of data available.

day streaks returns day streaks updown



IBS does not appear to have any predictive value when it comes to bitcoin returns.

ibs quintiles ibs




No mean reversion to be found here. Using a 3-period Cutler’s RSI, next-day bitcoin returns are 0.392% when RSI(3) is below 20, and 1.763% when it is above 80. The story is pretty much the same if you go for a medium term length for the RSI: high values beget high returns, with no mean reversion in sight.

rsi 3
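
Cutler’s RSI differs from the usual Wilder version in that it uses simple averages of gains and losses over the lookback, with no exponential smoothing. A minimal Python sketch; the handling of a window with no price changes at all is an assumption:

```python
def cutler_rsi(closes, period=3):
    """Cutler's RSI over the last `period` close-to-close changes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    window = closes[-(period + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if gains + losses == 0:
        return 50.0  # assumption: flat window is treated as neutral
    # Equivalent to 100 - 100 / (1 + avg_gain / avg_loss).
    return 100.0 * gains / (gains + losses)
```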

Simple Trend Following

The strong trends that bitcoin has shown would have been very profitable to any trend followers. Going long at a new 50-day high close (with an exit at a new 25-day low close), and vice-versa for short positions, would have yielded these equity curves:

trend following
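
A minimal Python sketch of the breakout rules, for anyone who wants to replicate them. It assumes that a “new N-day high close” means a close above the highest of the previous N closes, and it combines the long and short sides into a single position series; both are my reading of the rules rather than a precise specification.

```python
def breakout_positions(closes, entry=50, exit_=25):
    """Position (+1, 0, -1) held after each day's close.

    Long on a new `entry`-day high close, flat on a new `exit_`-day low close;
    the mirror image for shorts. Assumes exit_ <= entry.
    """
    positions = []
    pos = 0
    for t, close in enumerate(closes):
        if t >= entry:
            entry_window = closes[t - entry:t]
            exit_window = closes[t - exit_:t]
            if close > max(entry_window):
                pos = 1
            elif close < min(entry_window):
                pos = -1
            elif pos == 1 and close < min(exit_window):
                pos = 0
            elif pos == -1 and close > max(exit_window):
                pos = 0
        positions.append(pos)
    return positions
```

The position recorded for day t applies to the return from the close of day t to the close of day t+1, so there is no lookahead.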

Day of the Week



Before you jump in, keep in mind that this sort of market can change character very quickly, especially after a big bubble pop. Also consider the fees: Mt. Gox, the most popular exchange, charges an obscene 120 basis points per roundtrip. There are some brokers that will allow you to short bitcoins, and there even appear to be some thinly-traded options and currency futures available…I imagine there are gigantic inefficiencies in the pricing of these instruments (though their legality is probably questionable).

Heuristics for Managing Model Risk

Model risk is the risk that a model is, or will become, unable to perform the tasks it was designed to do. In terms of trading, this can be the risk that a set-up stops working, the risk that a variable loses its predictive power, etc. Ever-changing market conditions mean that model risk is a significant issue for most systematic traders: managing it is an integral part of adapting to new market environments.

Two heuristic rules are commonly used to handle this risk: drawdown-based position sizing, and a maximum drawdown cutoff. The former involves reducing exposure as the drawdown deepens (e.g. position sizes are halved once the drawdown exceeds 10%); the latter simply stops trading the strategy if it ever reaches a specified drawdown cutoff.

To investigate the efficacy of these rules, I’m going to use a simple Monte Carlo approach. The basic strategy has returns drawn from a normal distribution with mean 0.20% and standard deviation 1.5%. Model risk is represented by a small chance (0.05% per cycle) that the returns distribution will permanently change to having a mean of -0.05%.

Equity curves for 100 simulations, no model risk.

Equity curves for 100 simulations, with model risk. Bold equity curves show simulations in which the model switched to the “failed” state.

The rules for dealing with the risk are as follows: equity curve-based position sizing will decrease positions by 25% if the drawdown is deeper than 5%, and by 50% if the drawdown is deeper than 10%. The cutoff simply stops trading if the drawdown ever reaches 25%.

Equity curves for 100 simulations, with model risk and a drawdown limit of 25%.
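
A single simulated path under these rules can be sketched as follows in Python. The parameter values come from the description above; the function name, the return value, and the assumption that the standard deviation stays at 1.5% after the model fails are mine.

```python
import random


def simulate(steps=1000, use_sizing=False, cutoff=None,
             model_risk=True, seed=None):
    """Simulate one equity curve; returns (final_equity, model_failed)."""
    rng = random.Random(seed)
    equity = peak = 1.0
    mean = 0.002
    failed = False
    for _ in range(steps):
        # 0.05% chance per step of a permanent switch to the "failed" state.
        if model_risk and not failed and rng.random() < 0.0005:
            mean, failed = -0.0005, True
        drawdown = 1.0 - equity / peak
        if cutoff is not None and drawdown >= cutoff:
            break  # drawdown cutoff hit: stop trading for good
        size = 1.0
        if use_sizing:
            if drawdown > 0.10:
                size = 0.5
            elif drawdown > 0.05:
                size = 0.75
        equity *= 1.0 + size * rng.gauss(mean, 0.015)
        peak = max(peak, equity)
    return equity, failed
```

With a 0.05% failure chance per step, roughly 1 - 0.9995^1000, or about 39%, of 1,000-step paths should end up in the failed state.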

Running 10,000 simulations with 1,000 steps each, the results are shown below:


The first thing to note is the obvious fact that, without model risk, these heuristics have a negative effect on risk-adjusted returns. Yes, maximum drawdowns are decreased on average, but at an unacceptable cost to returns. In the case of equity curve-based position sizing, the average drawdown is deeper and longer as well. The lesson should be obvious a priori but deserves to be stated anyway: if you are confident that a strategy will continue to work well in the future, you should abandon such rules.

Note that these results assume that the returns distribution remains constant; some real-world strategies such as trend following futures exhibit higher than average returns after drawdowns, so decreasing exposure at those times would be even more hurtful. The inverse may be true of other strategies.

Things change when we look at the results after including model risk. Both heuristics improve risk-adjusted returns, with the drawdown cutoff being particularly effective. While equity curve-based sizing improves on the vanilla case, it actually harms returns when combining it with the cutoff. This is presumably because the cutoff already takes care of all the failed strategies (and even more: while 38.5% of strategies failed, 42.6% of them hit the drawdown limit) and the variable sizing only serves to hurt the healthy ones.

Setting the drawdown limit for each particular strategy is a bit trickier. The maximum drawdown of a backtest should serve as a guide. This can be augmented either by assuming normal returns and using the results in On the Maximum Drawdown of a Brownian Motion or through Monte Carlo simulation.

In the real world there are, of course, infinite states between the model working perfectly and it not working at all, so one must leave some room for deterioration and temporary changes by widening the cutoff point a bit. Finally, a more rigorous approach would perhaps use some sort of regime change detection and stop trading when the mean of the returns is determined to be below a hurdle, at a particular level of confidence.

Doing the Jaffray Woodriff Thing (Kinda), Part 1

Jaffray Woodriff, who runs QIM, a highly successful systematic fund, has provided enough details about his data mining approach in various interviews (particularly the one in the excellent book Hedge Fund Market Wizards) that I think I can approximate it. Even though QIM has been lagging a bit the last few years, they have an excellent track record, so their approach is certainly worthy of imitation if possible. They trade commodities, currencies, etc. so the approach seems to be highly portable. And while they suffer from significant price impact issues (not to mention being forced into longer holding periods) due to their size, a small trader could probably do far better with the same strategies.


The approach, as much as he has detailed it, goes as follows:

  • Generate random data.
  • Mine it for trading strategies.
  • The best strategies resulting from the random data are now the benchmark that you have to beat using real data.
  • Mine the real data, discard anything that isn’t better than the best models from the random data (this ensures that you have found an actual edge despite the excessive mining).
  • Use cross validation to more accurately estimate the performance of the models and avoid curve fitting.
  • Test the model out of sample, and retain it if it performs reasonably well compared to the in-sample results.

The point is essentially to generate an environment in which we know that we have no edge whatsoever, mine the data for the best possible results, and then use those as a benchmark that we have to clear in order to prove that an edge exists in the real data.

What they do after this is also quite interesting and important: checking the correlation between the newly discovered models and the models they already use. This ensures that any new “edge” they incorporate is a novel one and not simply a copy of something they already have. Supposedly this approach has yielded over 1500 different signals which they then use to trade, on medium-term horizons (if I remember correctly their average holding period is roughly one week). The issue of combining the predictions of 1500 signals into a decision to trade or not trade is beyond the scope of this post, but it’s a very interesting “ensemble model” problem.

It is clear that the approach requires not only rigorous statistical work, but also tons and tons of computing power (the procedure is highly parallelizable however, so you can just throw hardware at it to make it go faster). One potentially interesting way of tempering that requirement would be using genetic algorithms instead of brute force to search for new strategies. There are tricky issues with that approach, though: constructing the genome so that it can describe all possible trading models we want to look at, for example. How does one encode a wide array of chart patterns in a genome? There do not seem to be obvious/intuitive solutions.

Generating random data sets

There are several issues that have to be looked at here. Do we randomly sample the real data or do we use the parameters of that data and plug it into a known statistical distribution to generate completely new numbers? How many times do we repeat this procedure? In either case we are bound to lose some features of real financial time series, but this is probably a good thing since those features may result in true exploitable edges. It is important to generate a healthy number of data series. Some are simply going to be “better” than others for any one particular trading model, so testing over a single randomly generated series is not enough.

In general we want at least the semblance of a “real” data series. As such we can’t simply select random OHLC data; it would just result in a nonsensical time series with giant gaps all over the place. Instead I will use the following procedure:

  • Start by selecting a random day’s OHLC points. This forms our first data point.
  • Select any random day, and compute the day’s (close to close) percentage return from the previous day.
  • Use this value to generate the next fake closing price.
  • From that same (real) day, calculate the OHL prices in terms relative to the closing price.
  • Use those relative prices to generate the fake OHL prices.

I find this approach gives rather good results, producing series that look realistic and give the appearance of trends, different volatility regimes, etc.

fake series
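The procedure above can be sketched in a few lines of Python. This is a minimal reading of the steps, assuming the real data is a list of `(open, high, low, close)` tuples in chronological order; the function name and the `rng` parameter are mine.

```python
import random


def generate_fake_series(bars, length, rng=None):
    """Build a fake OHLC series by resampling real daily bars.

    bars: real (open, high, low, close) tuples in chronological order.
    """
    rng = rng or random.Random()
    fake = [rng.choice(bars)]                # a random real bar is the first point
    while len(fake) < length:
        i = rng.randrange(1, len(bars))      # any day that has a previous day
        o, h, l, c = bars[i]
        ret = c / bars[i - 1][3] - 1.0       # real close-to-close return
        new_close = fake[-1][3] * (1.0 + ret)
        # Real O, H, L relative to that day's close, applied to the fake close.
        fake.append((new_close * o / c, new_close * h / c,
                     new_close * l / c, new_close))
    return fake


# Tiny made-up sample; in practice `bars` would hold years of daily data.
bars = [(100.0, 101.0, 99.0, 100.5), (100.5, 102.0, 100.0, 101.5),
        (101.5, 101.8, 100.2, 100.4), (100.4, 101.0, 99.5, 100.9)]
series = generate_fake_series(bars, 250, random.Random(42))
print(len(series), series[-1])
```

Because each fake bar keeps the shape of a real bar around its close, the high and low always bracket the open and close, while the gap between one fake close and the next fake open is no longer the real overnight gap.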

The models

Naturally I can’t test the billions upon billions of models that they test at QIM, and taking the model-agnostic approach is currently beyond my abilities. I can kind-of get around the issue by testing a very narrow range of models: moving average crossovers (another simple and interesting thing to test would be 1/2 day candlestick patterns). This still leaves a significant number of parameters to test:

  • The type of moving average to use (simple, exponential, or Hull)
  • The length of each moving average.
  • The values that the moving averages will be based on (open, high, low, or close).
  • The holding period. I’ll be using a technical entry, but a partially time-based exit. This may or may not be a good idea, but I’m running with it.
  • Trend-following vs contrarian (i.e. trade in the direction of the “fast” moving average or against it).
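To give a sense of how this parameter space might be enumerated, here is a hedged sketch. The dictionary keys, the example lengths and holding periods, and the rule that the fast average must be shorter than the slow one are assumptions. Only the simple moving average is implemented for the signal, and the time-based exit is left out.

```python
from itertools import product

MA_TYPES = ("simple", "exponential", "hull")
PRICE_FIELDS = ("open", "high", "low", "close")
DIRECTIONS = ("trend", "contrarian")


def parameter_grid(lengths, holding_periods):
    """Yield every parameter combination with fast length < slow length."""
    for ma, fast, slow, field, hold, direction in product(
            MA_TYPES, lengths, lengths, PRICE_FIELDS, holding_periods, DIRECTIONS):
        if fast < slow:
            yield {"ma": ma, "fast": fast, "slow": slow, "field": field,
                   "hold": hold, "direction": direction}


def sma(values, n):
    """Simple moving average; None until n values are available."""
    return [sum(values[i - n + 1:i + 1]) / n if i >= n - 1 else None
            for i in range(len(values))]


def crossover_signals(values, fast, slow, direction="trend"):
    """+1 (long) or -1 (short) on the bar where the fast SMA crosses the slow SMA."""
    f, s = sma(values, fast), sma(values, slow)
    sign = 1 if direction == "trend" else -1
    signals = [0] * len(values)
    for i in range(1, len(values)):
        if f[i - 1] is None or s[i - 1] is None:
            continue                      # still warming up
        prev, cur = f[i - 1] - s[i - 1], f[i] - s[i]
        if prev <= 0 < cur:               # fast crosses above slow
            signals[i] = sign
        elif prev >= 0 > cur:             # fast crosses below slow
            signals[i] = -sign
    return signals


grid = list(parameter_grid(lengths=(5, 10, 20), holding_periods=(1, 5)))
print(len(grid))  # 144
print(crossover_signals([10, 9, 8, 7, 8, 9, 10, 11], fast=2, slow=3))
```

Even this toy grid yields 144 combinations from three lengths and two holding periods, which shows how quickly a brute-force search grows.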

Evaluating the results

An important question remains: what metric do we use to evaluate the models? The use of cross validation presents unique problems in performance measurement, and we have to take these into account from this stage, because these results will be used for comparison to the real ones later on.

Drawdown is a problematic measure because drawdown extremes tend to be rare. When dividing a set into N folds for cross validation, a set of parameters may be rejected simply because a certain period generated a high drawdown, despite this drawdown being consistent with long-term expectations.

Another issue arises with the use of annualized returns: they may be rather meaningless if the signal fires very frequently. If what we care about is short-term predictability, it may be more prudent to look at average daily returns after a signal, instead of CAGR. This could also be ameliorated by taking trading costs into account, as weak but frequent signals would be filtered out.

In the end, many of these decisions depend on the trader’s choice of style. Every trader must decide for themselves which risks they care about, and in what proportion to each other. As an attempt at a balanced performance metric, I will be using my Trading System Consistency, Drawdown, Return Asymmetry, Volatility, and Profit Factor Combination Metric (or TRASYCODRAVOPFACOM for short), which is calculated as follows:

TRASYCODRAVOPFACOM

St. Dev. is the annualized standard deviation of daily returns, and the profit factor is calculated based on daily returns.

The TRASYCODRAVOPFACOM still has weaknesses: a set of parameters may pick only a tiny number of trades over the years. If they’re successful enough, this can lead to a high score but a useless signal. To avoid this I’ll also be setting the minimum number of trades to 100, a reasonable hurdle given the 17-year sample.
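The two named ingredients and the trade-count filter can be sketched as below. This does not reproduce the combined metric itself, only the pieces described in the text. The 252-day annualization factor and the function names are assumptions.

```python
import math
from statistics import stdev

TRADING_DAYS = 252  # assumption: annualization factor for daily data


def annualized_stdev(daily_returns):
    """Sample standard deviation of daily returns, scaled to a year."""
    return stdev(daily_returns) * math.sqrt(TRADING_DAYS)


def profit_factor(daily_returns):
    """Sum of positive daily returns divided by the absolute sum of negative ones."""
    gains = sum(r for r in daily_returns if r > 0)
    losses = -sum(r for r in daily_returns if r < 0)
    return math.inf if losses == 0 else gains / losses


def passes_trade_filter(n_trades, minimum=100):
    """Reject parameter sets that trade too rarely to be a useful signal."""
    return n_trades >= minimum


# Made-up daily returns.
returns = [0.01, -0.005, 0.02, -0.01, 0.0]
print(annualized_stdev(returns), profit_factor(returns), passes_trade_filter(99))
```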

The random return results

Using a brute force approach, I collected approximately 704,000 results from 5 randomly generated series. It took several hours on my overclocked i5-2500K, so it’s definitely not a viable “real-world” approach (I am a terrible programmer, so some of the slowness is of my own making). The results look like you’d expect them to, with a few outliers at the top and bottom:

random brute force CAGR

random brute force PF

random brute force TRASYCODRAVOPFACOM

Here are the best values achieved:

brute force random results maximums

Note that this isn’t a “universal” hurdle: it’s a hurdle for this specific subset of moving average signals, on the GBPUSD pair. I am certain that a wide array of signals and data would generate higher hurdles.

Genetic Algorithm?

Brute force takes ages, even for just 5 return series, which is far too few to draw any conclusions. Are there any faster ways than brute force to find the best possible results from our random data? If this were a “normal” dataset, I would say yes, of course! However, I was not sure about this case due to the randomly generated data that we are dealing with.

If the data is random, does it follow that the optimal strategy parameters are also randomly distributed? Are they uniformly distributed or are there “clusters” that, due to somehow exploiting the structure of the time series, perform better or worse than the average? The question is: is the performance slope around local maxima smooth, or not? A simple method to make this thing go faster is to throw the problem into a genetic algorithm, but a GA will offer no performance improvement if the performance is uniformly randomly distributed.

Testing this is simple: I just ran a GA search on the same 5 series I brute forced above. If the GA results are similar to the brute force results, we can use the GA and save a lot of time. As long as there are enough populations, and they are large enough (I settled on 4 populations with 40 chromosomes each), the results are “close enough”: roughly 3-20% lower than the brute force (max CAGR was 7.043%, max avg. daily return was 0.176%). It might be a good idea to scale the GA results by, say, an additional 10-20% in order to make up for this deficit.
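For readers unfamiliar with the technique, here is a bare-bones multi-population GA. It is a generic sketch, not the implementation used for the results above. The 4 populations of 40 chromosomes come from the text, while tournament selection, uniform crossover, random-reset mutation, elitism, the generation count, and the toy fitness function are all my assumptions.

```python
import random


def run_ga(fitness, bounds, n_pops=4, pop_size=40, generations=30,
           mutation_rate=0.1, rng=None):
    """Maximize `fitness` over integer parameter vectors within `bounds`."""
    rng = rng or random.Random()

    def random_chromosome():
        return [rng.randint(lo, hi) for lo, hi in bounds]

    def tournament(pop):
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    best = None
    for _ in range(n_pops):                       # independent populations
        pop = [random_chromosome() for _ in range(pop_size)]
        for _ in range(generations):
            children = [max(pop, key=fitness)]    # elitism: keep the best as-is
            while len(children) < pop_size:
                p1, p2 = tournament(pop), tournament(pop)
                child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
                for i, (lo, hi) in enumerate(bounds):
                    if rng.random() < mutation_rate:
                        child[i] = rng.randint(lo, hi)   # random-reset mutation
                children.append(child)
            pop = children
        champion = max(pop, key=fitness)
        if best is None or fitness(champion) > fitness(best):
            best = champion
    return best, fitness(best)


def toy_fitness(params):
    """Toy stand-in for a backtest score: a smooth peak at fast=10, slow=50."""
    fast, slow = params
    return -((fast - 10) ** 2 + (slow - 50) ** 2)


bounds = [(2, 100), (2, 200)]
print(run_ga(toy_fitness, bounds, rng=random.Random(0)))
```

A GA like this only helps when good parameter sets sit near other good parameter sets, as they do in the smooth toy function. That is exactly the property in question for the random series.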

I then generated 100 series and put the GA to use. Here are the results:

random GA results maximums

And here are the distributions of maximum values achieved for each individual series:

random GA results

These results have set the bar rather high. One might imagine that throwing out everything below this hurdle will leave us with very little (nothing?) in the end. But if Woodriff is to be believed, he has found upwards of 1500 signals that perform better than the hurdle (and that’s 1500 signals that were uncorrelated enough with each other that they were added to their models). So there’s got to be a lot of interesting stuff to find!

In part 2 I will take a look at cross validation and what we can do with the real data.


The VIX:VXV Ratio

The VXV is the VIX’s longer-term brother; it measures implied volatility 3 months out instead of 30 days out. The ratio between the VIX and the VXV captures the differential between short-term and medium-term implied volatility. Naturally, the ratio spends most of its time below 1, typically only spiking up during highly volatile times.

VIX VXV Ratio Chart

It is immediately obvious by visual inspection that, just like the VIX itself, the VIX:VXV ratio exhibits strong mean reverting tendencies on multiple timescales. It turns out that it can be quite useful in forecasting SPY, VIX, and VIX futures changes.

Short-term extremes

A simplistic method of evaluating short-term extremes is the distance of the VIX:VXV ratio from its 10-day simple moving average. When the ratio is at least 5% above the 10SMA, next-day SPY returns are, on average, 0.303% (front-month VIX futures return -0.101%). Days when the ratio is more than 5% below the 10SMA are followed by -0.162% returns for SPY. The equity curve shows the returns on the long side:
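A minimal sketch of this signal, assuming two aligned lists of daily VIX and VXV closes and assuming the moving average includes the current day (the text does not say either way). Function names are mine.

```python
def ratio_vs_sma(vix, vxv, window=10):
    """Fractional distance of the VIX:VXV ratio from its own SMA; None during warm-up."""
    ratio = [a / b for a, b in zip(vix, vxv)]
    out = []
    for i in range(len(ratio)):
        if i < window - 1:
            out.append(None)
        else:
            avg = sum(ratio[i - window + 1:i + 1]) / window
            out.append(ratio[i] / avg - 1.0)
    return out


def short_term_signal(distance, threshold=0.05):
    """+1 if at least 5% above the SMA, -1 if more than 5% below, else 0."""
    if distance is None:
        return 0
    if distance >= threshold:
        return 1
    if distance < -threshold:
        return -1
    return 0


# Made-up closes: a calm stretch followed by a one-day VIX spike.
vix = [15.0] * 9 + [18.0]
vxv = [18.0] * 10
distances = ratio_vs_sma(vix, vxv)
print(distances[-1], short_term_signal(distances[-1]))
```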

short term EC

Long-term extremes

When the ratio hits a 200-day high, next-day SPY returns have been 0.736% on average. Implied volatility does not fall as one might expect, however.

More interestingly, the picture is reversed if we look at slightly longer time frames. 200-day VIX:VXV ratio extremes can predict pullbacks in SPY quite well. The average daily SPY return for the 10 days following a 200-day high is -0.330%. This is naturally accompanied by increases in the VIX of 1.478% per day (the front month futures show returns of 1.814% per day in the same period). It’s not a fail-proof indicator (it picked the bottom in March 2011), but I like it as a sign that things could get ugly in the near future. We recently saw a new 200-day high on the 19th of December: since then SPY is down approximately 1%.
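Detecting a 200-day high and measuring what follows could look like the sketch below. The function names are mine, and counting the current day as part of the 200-day window is an assumption.

```python
def is_new_high(series, i, lookback=200):
    """True if series[i] is the highest value of the last `lookback` observations."""
    if i < lookback - 1:
        return False                      # not enough history yet
    return series[i] >= max(series[i - lookback + 1:i + 1])


def avg_forward_return(daily_returns, i, horizon=10):
    """Mean daily return over the `horizon` days after day i; None if data runs out."""
    window = daily_returns[i + 1:i + 1 + horizon]
    return sum(window) / horizon if len(window) == horizon else None


# Toy example with a 3-day lookback so it fits on one line.
ratio = [1.0, 2.0, 3.0, 2.0, 5.0]
print([is_new_high(ratio, i, lookback=3) for i in range(len(ratio))])
# [False, False, True, False, True]
```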

200d high cumulative


This is my last post for the year, so I leave you with wishes for a happy new year! May your trading be fun and profitable in 2013.


Holiday Effects in the Chinese Stock Market

Various holiday effects are well documented for developed countries’ stock markets, typically showing abnormal returns around Thanksgiving, Christmas, New Year, and Easter. Do similar effects exist in the Chinese stock market? In this post I’ll take a look at returns to the Shanghai Composite Index (SSECI) during the days surrounding the following holidays: New Year, Chinese New Year, Ching Ming Festival, Labor Day, Tuen Ng Festival, Mid-Autumn Festival. The index only has 22 years of history, so statistical significance is difficult to establish. Despite this, I believe the results are quite interesting[1].

The charts require a bit of explanation: the error bars are 1.65 standard errors wide on each side. As such, if an error bar does not cross the x-axis, the returns on that day are statistically significantly different from zero at the 5% level (by way of a one-tailed t-test). The most interesting holidays are the New Year, Chinese New Year, and Ching Ming Festival, all of which have several days of quite high returns around them.
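The error-bar logic can be sketched as follows, assuming one list of returns per holiday-relative day (one observation per year). The function names and sample numbers are mine.

```python
import math
from statistics import mean, stdev


def error_bar(returns, z=1.65):
    """Return (mean, half_width), where half_width is z standard errors."""
    se = stdev(returns) / math.sqrt(len(returns))
    return mean(returns), z * se


def significant(returns, z=1.65):
    """True if the error bar around the mean does not cross zero."""
    m, half = error_bar(returns, z)
    return abs(m) > half


# Made-up returns for a single day relative to a holiday.
print(error_bar([0.01, 0.02, 0.03]), significant([0.01, 0.02, 0.03]))
```

The 1.65 multiplier is the large-sample (normal) critical value. With only around 22 observations per day, the exact one-tailed t critical value is slightly larger (about 1.72), so borderline bars deserve some caution.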

New Year

new year

Chinese New Year

chinese new year

Ching Ming Festival

The Ching Ming Festival occurs 15 days after the vernal equinox, falling on either April 4th or April 5th.

ching ming festival

Labor Day

labor day

Tuen Ng Festival

The Tuen Ng Festival (A.K.A. Dragon Boat Festival) occurs on the 5th day of the 5th lunar month in the Chinese calendar.

tuen ng festival

Mid-Autumn Festival

The Mid-Autumn Festival falls on the 15th day of the 8th lunar month.

mid autumn festival

Bonus: Day of the Month Effects

Since we’re looking at seasonality effects, why not the day of the month effect as well? Using the same walk-forward methodology as in my previous day of the month effect posts (U.S., Europe, Asia), here are the results for the Shanghai Composite Index:

dotm china EC

dotm china stats

Finally, the average returns for each day of the month over the last 5000 days:

dotm china days

The standard turn of the month effect seems to be present, but only for the first days of the month instead of the last and first days.

And with that, I’d like to wish you all happy holidays! In eggnog veritas.

  1. I want to take this opportunity to thank the C# language designers; without the ChineseLunisolarCalendar class this study would’ve been a major chore.


IBS and Relative Value Mean Reversion

I’m writing a paper on the IBS effect, but it’s taking a bit longer than expected so I thought I’d share some of the results in a blog post. The starting point is a paper by Levy & Lieberman: Overreaction of Country ETFs to US Market Returns, in which the authors find that country ETFs over-react to US returns during non-overlapping trading hours, which gives rise to abnormal returns as the country ETFs revert back the next day. In terms of the IBS effect, this suggests that a high SPY IBS would lead to over-reaction in the country ETFs and thus lower returns the next day, and vice versa.

To quickly recap, Internal Bar Strength (or IBS) is an indicator with impressive (mean reversion) predictive ability for equity indices. It is calculated as follows:

IBS = (Close - Low) / (High - Low)

Using a selection of 32 equity index ETFs, let’s take a look at next-day returns after IBS extremes (top and bottom 20%), split up by SPY’s IBS (top and bottom half):

 returns by SPY IBS

The results were the exact opposite of what I was expecting. Instead of over-reacting to a high SPY IBS, the ETFs under-react to it. A high SPY IBS is followed by higher returns for the ETFs, while a low SPY IBS is followed by lower returns. These results suggest a pair approach using SPY as the first leg of the pair, and ETFs at IBS extremes as the other. For a dollar-neutral strategy, the rules are the following:

  • If SPY IBS <= 50% and ETF IBS > 80%, go long SPY and short the other ETF, in equal dollar amounts.
  • If SPY IBS > 50% and ETF IBS < 20%, go short SPY and long the other ETF, in equal dollar amounts.
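The two rules translate directly into code. In this sketch IBS is computed in the usual way as (close - low) / (high - low). The function names, the neutral value for a zero-range bar, and the position encoding are my assumptions. The pair return is expressed as a fraction of the capital allocated to one leg.

```python
def ibs(high, low, close):
    """Internal Bar Strength: position of the close within the day's range (0 to 1)."""
    if high == low:
        return 0.5                # assumption: treat a zero-range bar as neutral
    return (close - low) / (high - low)


def pair_signal(spy_ibs, etf_ibs):
    """Return (spy_position, etf_position): +1 long, -1 short, 0 flat."""
    if spy_ibs <= 0.5 and etf_ibs > 0.8:
        return (1, -1)            # long SPY, short the ETF
    if spy_ibs > 0.5 and etf_ibs < 0.2:
        return (-1, 1)            # short SPY, long the ETF
    return (0, 0)


def pair_return(positions, spy_ret, etf_ret):
    """Next-day return of the pair as a fraction of the capital in one leg."""
    return positions[0] * spy_ret + positions[1] * etf_ret


# Made-up bars: SPY closes near its low, the ETF near its high.
spy = ibs(high=141.0, low=139.0, close=139.4)
etf = ibs(high=25.0, low=24.0, close=24.9)
positions = pair_signal(spy, etf)
print(positions, pair_return(positions, spy_ret=0.004, etf_ret=-0.006))
```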

The results[1]:

pair strat returns

The numbers are excellent: high returns and relatively few trades with a high win rate. Let’s take a look at the alphas and betas from a regression of the excess returns to the pair strategy, using the Carhart 4 factor model:

four factor regression

Values in bold are statistically significantly different from zero at the 1% level.

On average, this strategy generates a daily alpha of 0.037%, or 9.28% annually, with essentially zero exposure to any of the factors. Transaction costs would certainly eat into this, but given the reasonable number of trades (about 23 trades per year per pair on average) there should be a lot left over. The fact that over 90% of days consist of zero excess returns obscures the features of the actual returns to the strategy. Repeating the regression using only the days in which the strategy is in the market yields the following results:

four factor regression trade days only

Values in bold are statistically significantly different from zero at the 1% level.

Unfortunately, these results are pretty much a historical curiosity at this point. Most of the opportunity has been arbitraged away: during the last 4 years the average return per trade has fallen to 0.150%, less than half the average over the entire sample. The parameters haven’t been optimized, so there may be more profitable opportunities still left by filtering only for more extreme values, but it’s clear that there is relatively little juice left in the approach.

In fact, if we take a closer look at the differences between the returns before and after 2008, the over-reaction hypothesis seems to be borne out by the data (another factor that may be at play here is the heightened correlations we’ve seen in the last years): low SPY IBS leads to higher next-day returns for the ETFs, and vice versa.

pre and post 2008 results

The lesson to take away from these numbers is that cross-market effects can be very significant, especially when global markets are in a state of high correlation. Accounting for the state of US markets in your models can add significant information (and returns) to your IBS approach.

  1. As with any dollar-neutral approach, calculating returns is a tricky matter; in this case I have calculated the returns as a % of the capital allocated to one of the legs.
