2013: Lessons Learned and Revisiting Some Studies

The year is over in a few hours and I thought it would be nice to do a quick review of the year, revisit some studies and the most popular posts of the year, as well as share some thoughts on my performance in 2013 and my goals for 2014.

Revisiting Old Studies

IBS

IBS did pretty badly in 2012, and didn’t manage to reach the amazing performance of 2007-2010 this year either. However, it still worked reasonably well: IBS < 0.5 led to far higher returns than IBS > 0.5, and the highest quarter had negative returns. It still works amazingly well as a filter. Most importantly the magnitude of the effect has diminished. This is partly due to the low volatility we’ve seen this year. After all IBS does best when movements are large, and SPY’s 10-day realized volatility never even broke 20% this year. Here are the stats:

ibs

UDIDSRI

The original post can be found here. Performance in 2013 hasn’t been as good as in the past, but was still reasonably OK. I think the results are, again, at least partially due to the low volatility environment in equities this year.

UDIDSRI performance, close-to-close returns after an zero reading.

UDIDSRI performance, close-to-close returns after a zero reading.

DOTM seasonality

I’ve done 3 posts on day of the month seasonality (US, EU, Asia), and on average the DOTM effect did its job this year. There are some cases where the top quarter does not have the top returns, but a single year is a relatively small sample so I doubt this has any long-term implications. Here are the stats for 9 major indices:

dotm

Day of the month seasonality in 2013

VIX:VXV Ratio

My studies on the implied volatility indices ratio turned out to work pretty badly. Returns when the VIX:VXV ratio was 5% above the 10-day SMA were -0.03%. There were no 200-day highs in the ratio in 2013!

 

Performance

Overall I would say it was a mixed bag for me this year. Returns were reasonably good, but a bit below my long-term expectations. It was a very good year for equities, and my results can’t compete with SPY’s 5.12 MAR ratio, which makes me feel pretty bad. Of course I understand that years like this one don’t represent the long-term, but it’s annoying to get beaten by b&h nonetheless.

Some strategies did really well:

stratgoodOthers did really poorly:
stratbad

The Good

Risk was kept under control and entirely within my target range, both in terms of volatility and maximum drawdown. Even when I was at the year’s maximum drawdown I felt comfortable…there is still “psychological room” for more leverage. Daily returns were positively skewed. My biggest success was diversifying across strategies and asset classes. A year ago I was trading few instruments (almost exclusively US equity ETFs) with a limited number of strategies. Combine that with a pretty heavy equity tilt in the GTAA allocation, and my portfolio returns were moving almost in lockstep with the indices (there were very few shorting opportunities in this year’s environment, so the choice was almost always between being long or in cash). Widening my asset universe combined with research into new strategies made a gigantic difference:

beta

The Bad

I made a series of mistakes that significantly hurt my performance figures this year. Small mistakes pile on top of each other and in the end have a pretty large effect. All in all I lost several hundred bp on these screw-ups. Hopefully you can learn from my errors:

  • Back in March I forgot the US daylight savings time kicks in earlier than it does here in Europe. I had positions to exit at the open and I got there 45 minutes late. Naturally the market had moved against me.
  • A bug in my software led to incorrectly handling dividends, which led to signals being calculated using incorrect prices, which led to a long position when I should have taken a short. Taught me the importance of testing with extreme caution.
  • Problems with reporting trade executions at an exchange led to an error where I sent the same order twice and it took me a few minutes to close out the position I had inadvertently created.
  • I took delivery on some FX futures when I didn’t want to, cost me commissions and spread to unwind the position.
  • Order entry, sent a buy order when I was trying to sell. Caught it immediately so the cost was only commissions + spread.
  • And of course the biggest one: not following my systems to the letter. A combination of fear, cowardice, over-confidence in my discretion, and under-confidence in my modeling skills led to some instances where I didn’t take trades that I should have. This is the most shameful mistake of all because of its banality. I don’t plan on repeating it in 2014.

Goals for 2014

  • Beat my 2013 risk-adjusted returns.
  • Don’t repeat any mistakes.
  • Make new mistakes! But minimize their impact. Every error is a valuable learning experience.
  • Continue on the same path in terms of research.
  • Minimize model implementation risk through better unit testing.

 

Most Popular

Finally, the most popular posts of the year:

  1. The original IBS post. Read the paper instead.
  2. Doing the Jaffray Woodriff Thing. I still need to follow up on that…
  3. Mining for Three Day Candlestick Patterns, which also spawned a short series of posts.

 

I want to wish you all a happy and profitable 2014!

New Paper: The IBS Effect: Mean Reversion in Equity ETFs

I finally finished the first draft of my IBS paper. The results are quite interesting and extremely relevant if you trade equity ETFs. You can read it here.

Abstract:

I investigate mean reversion in equity ETF prices at the daily frequency by employing a simple technical indicator, Internal Bar Strength (IBS). IBS is based on the position of the day’s close in relation to the day’s range. I use it to forecast close-to-close returns with statistically and economically significant results for most instruments. A simple strategy based on IBS generates an average alpha of over 30% p.a. before transaction costs. I show that equity index ETFs have had strong and consistent mean reverting tendencies since the 90s, and that these effects can be exploited as part of a profitable trading strategy. The IBS effect is stronger during times of high volatility, in bear markets, after high-range days, after high-volume days, and early in the week.

Feedback is highly appreciated, either in the comments below or by email to qusmablog at gmail dot com.

Some of the interesting things you’ll find within:

asdf

IBS plotted against average close-to-close returns.

asdf

Cumulative NQ returns at 5 minute intervals after IBS < 0.2 at 15:00 CT.

asdf

Equity curves of a simple RSI(3) strategy on QQQ, with and without IBS filter.

 Update: the comparison chart for the Australian ETF now correctly uses EWA instead of EWO (the Austrian ETF).

k-NN Candlestick Pattern Search Extensions: Combining Forecasts

The second, and probably final, followup to the Mining for Three Day Candlestick Patterns post. Previously, we improved performance by adding more data to the search. In this post we’ll try to improve the system further by combining multiple predictors. The central question is how to combine the forecasts. I test averaging, weighted averaging, regression, and a voting scheme and compare them against a baseline one-predictor strategy.

Set-Up

Combining predictors is a standard tactic in machine learning, but the case of k-NN predictors is a bit of an outlier. Typical ensemble methods depend on generating variations in the data set in order to generate different and complementary predictors (as in the cases of boosting and bagging). This doesn’t work very well with nearest neighbor predictors, however, because they tend to be insensitive to variations in the data set. So what can we vary? The choice of k, the choice of inputs, the choice of distance measure for the nearest neighbors, and some pre-processing options such as whether to adjust for volatility or not.

I am not going to make any variation in outputs as that’s reserved for a post of its own. The idea is pretty simple: it’s essentially a random forest with k-NN predictors instead of decision trees (here’s an interesting paper on it).

So we’re left with k, sum of absolute or sum of square distances, and volatility adjustment. I picked 10 combinations of these options:

choices

The k values were picked at random and I’m sure it’s possible to do better by optimizing them using cross validation.

The signals obviously overlap significantly, and have similar stats when used one-by-one:

Long position threshold: forecast > 5 basis points & IBS < 0.5.

Long signal stats. Long position threshold: forecast > 5 basis points & IBS < 0.5.

asdf

Short signal stats. Short position threshold: forecast < -10 basis points & IBS > 0.5.

The instrument traded is SPY. Additional data is taken from the following instruments for the pattern search: EWY, EWD, EWC, EWQ, EWU, EWA, EWP, EWH, EWL, EFA, EPP, EWM, EWI, EWG, EWO, IWM, QQQ, EWS, EWT, and EWJ. The thresholds in each case are adjusted to result in a similar length of time spent in the market. Position sizing is done based on the 10-day realized volatility of SPY, as described in this post: leverage is equal to 20% divided by 10-day realized annualized standard deviation, with a maximum leverage of 200%. Finally, an IBS filter is applied that allows long positions only when IBS < 0.5 and short positions only when IBS > 0.5.

The baseline is the PF3 predictor: k = 75, square distance measure, no volatility adjustment. Here’s the equity curve:

asdf

PF3 predictor equity curve. $0.005 per share in commissions.

 

Averaging

The simplest approach is obviously to just average the 10 forecasts and then use the average value to generate trades. A long position is taken when the average forecast is greater than 15 basis points, and a short position when the average is smaller than -12.5 basis points. Here’s what the equity curve looks like:

Equity curve using average forecast. $0.005 per share in commissions.

Equity curve using average forecast. $0.005 per share in commissions.

It’s interesting to note that the dispersion of forecasts is inversely related to the accuracy of the average: the smaller the standard deviation of the forecasts, the more accurate they are. Unfortunately effect is marginal and thus not particularly useful for improving the strategy.

 

Weighted Averaging

A simple extension, that generates slightly better stats, is to weigh each forecast before averaging. There’s a wide array of stats one can use here (Sharpe/Sortino/MAR ratios are obvious candidates); I picked the mean square error. The inverse of the MSE becomes the forecast’s weight, so that smaller errors result in greater weights. The same thresholds as above are used to generate signals. The weights provide a slight improvement both in terms of Sharpe and MAR ratios. The equity curve:

Weighted average

Equity curve using weighted average forecast, with weights equal to the inverse of the mean square error. $0.005 per share in commissions.

 

Voting

Using a threshold for each forecast, (>5 basis points for a “long” vote, and <-10 basis points for a “short” vote), each predictor is assigned a long or short vote. The overlap between the votes is significant, between 88% and 97% for different estimators. How many votes should we require for a trade? It quickly becomes obvious that simple majority voting isn’t enough, as only near-unanimous decisions provide worthwhile predictions. The average next-day return when there are between 1 and 8 long votes is 0.4 basis points. The average return after 9 or 10 long votes is 23 basis points.

The resulting equity curve looks like this:

asdf

Equity curve using voting system. 9 or more votes required to take a position. $0.005 per share in commissions.

 

Ordinary Least Squares

It’s also possible to combine the forecasts using regression, with next-day returns as the dependent variable and the k-NN predictor forecasts as the independent ones.

The distribution of forecasts with OLS is very tightly clustered around 0, and for some reason higher forecasts are not associated with higher next-day returns (as they are for the 3 methods above). I don’t really understand why this is the case. The thresholds for trades are 0.5 basis points for a long trade, and -0.5 basis points for a short trade.

An issue here is, of course, multicollinearity due to the similarity of the independent variables. This can lead to, among other problems, overfitting (which is usually characterized by very large absolute values of the coefficients). Using ridge regression solves that issue by limiting the absolute value of coefficients.

A potentially interesting idea would be to constrain the coefficients to positive values, which might lessen the overfitting effects and also make much more sense on an intuitive level (after all, we know all the forecasts are similarly accurate, so negative coefficients don’t make much sense).

asdf

Equity curve using OLS regression. $0.005 per share in commissions.

 

Ridge Regression

If multicollinearity is a significant problem, we can use ridge regression to solve it. It offer significant improvement over the OLS approach, but it still fares badly compared to the one-predictor case. The same thresholds as in the OLS approach are used. Here’s the equity curve:

asdf

Equity curve using ridge regression. $0.005 per share in commissions.

 

Stats

Here are the stats for the single-predictor base case and all the combination methods:

stats

All of them other than the voting failed horribly. I’m not sure why, but it’s good to know. The improvement provided by the voting system is sizable, however. Not only does the voting-based strategy achieve significantly higher risk-adjusted returns, it does it while spending 15% less time in the market. Those results are also easy to improve on by simply adding more predictors. The marginal gain from each new predictor will be diminishing, but there is definitely more value to wring out of it. And this is just with 3-day patterns: we can easily add 2 and 4 day patterns into the mix as well.

Other Possibilities

A wide array of machine learning methods can be used to combine predictions. Especially if the number of forecasts grew larger, techniques such as random forests or ANNs would be interesting to investigate. As long as simpler methods work very well I think there is little reason to increase the complexity (not to mention the opaqueness) of the strategy.

Mining for Three Day Candlestick Patterns

I’ve been thinking a lot about candlestick patterns lately but grew tired of trying to generate ideas and instead decided to mine for them. I must confess I didn’t expect much from such a simplistic approach, so I was pleasantly surprised to see it working well. Unfortunately I wasn’t able to discover any short set-ups. The general bias of equity markets toward the upside makes it difficult to find enough instances of patterns that are followed by negative returns.

The idea is to mine past data for similar 3 day patterns, and then use that information to make trading decisions. There are several choices we must make:

  • The size of the lookback window. I use an expanding window that starts at 2000 days.
  • Once we find similar patterns, how do we choose which ones to use?
  • How do we measure the similarity between the patterns?

To fully describe a three day candlestick pattern we need 11 numbers. The close-to-close percentage change from day 1 to day 2, and from day 2 to day 3, as well as the positions of the open, high, and low relative to the close for each day.

To measure the degree of similarity between any two 3-day patterns, I tried both the sum of absolute differences and the sum of the squared differences between those 11 numbers; the results were quite similar. It would be interesting to try to optimize individual weights for each number, as I imagine some are more important than others.

The final step is to select a number of the closest patterns we find, and simply average their next-day returns to arrive at an expected return.

absolute difference 50 closest exp vs realized

Expected vs realized returns for SPY, 50 closest patterns by absolute difference. Numbers above the bars indicate the number of instances in each bucket.

How do we choose which patterns are “close enough” to use? Choose too few and the sample will be too small. Choose too many and you risk using irrelevant data. That’s a number that we’ll have to optimize.

histogram squared

Histogram of expected return estimates for different sample sizes.

When comparing the results we also run into another problem: the smaller the sample, the more spread out the expected return estimates will be, which means more trades will be chosen given a certain minimum limit for entry. My solution was to choose a different limit for trade entry, such that all sample sizes would generate the same number of days in the market (300 in this case). Here are the walk-forward results:

closest count tests

The trade-off between sample size and relevance is clear, and the “sweet spot” appears to be somewhere in the 50-150 range or so, for both the absolute difference and squared difference approaches. Depending on how selective you want to be, you can decrease the limit and trade off more trades for lower expected returns. For me, 30 bp is a reasonable area to aim for.

A nice little addition is to use IBS by filtering out any trades with IBS > 50%. Using squared differences, I select the 50 closest patterns. When their average next-day return is greater than 0.2%, a long position is taken. The results are predictably great:

equity curves with without IBS

squared 50 closest 0.2pct limit ibs filter results

The IBS filter removes close to 40% of days in the market yet maintains essentially the same CAGR, while also more than halving the maximum drawdown.

Let’s take a look at some of the actual patterns. Using squared differences, the 50 closest patterns, and a 0.2% limit, the last successful trade was on February 26, 2013. The expected return on that day was 0.307%. Here’s what those 3 days looked like, as well as the 5 closest historical patterns:

patterns

As you can see below, even the 50th closest pattern seems to be, based on visual inspection, rather close. The “main idea” of the pattern seems to be there:

patterns 50th closest

Here are the stats from a bunch of different equity index ETFs, using square differences, the 50 closest patterns, 0.2% expected return limit and the IBS < 0.5 filter.

ETFs square 50 closest 0.2 ibs filter results

The 0.2% limit seems to be too low for some of them, producing too many trades. Perhaps setting an appropriate limit per-instrument would be a good idea.

The obvious path forward is to also produce 2-day, 4-day, 5-day, etc. versions, perhaps with optimized distance weighting and some outlier filtering, and combine them all in a nice little ensemble to get your predictions out of. The implementation is left as an exercise for the reader.

IBS and Relative Value Mean Reversion

I’m writing a paper on the IBS effect, but it’s taking a bit longer than expected so I thought I’d share some of the results in a blog post. The starting point is a paper by Levy & Lieberman: Overreaction of Country ETFs to US Market Returns, in which the authors find that country ETFs over-react to US returns during non-overlapping trading hours, which gives rise to abnormal returns as the country ETFs revert back the next day. In terms of the IBS effect, this suggests that a high SPY IBS would lead to over-reaction in the country ETFs and thus lower returns the next day, and vice versa.

To quickly recap, Internal Bar Strength (or IBS) is an indicator with impressive (mean reversion) predictive ability for equity indices. It is calculated as follows:

IBS

Using a selection of 32 equity index ETFs, let’s take a look at next-day returns after IBS extremes (top and bottom 20%), split up by SPY’s IBS (top and bottom half):

 returns by SPY IBS

The results were the exact opposite of what I was expecting. Instead of over-reacting to a high SPY IBS, the ETFs instead under-react to it. A high SPY IBS is followed by higher returns for the ETFs, while a low SPY IBS is followed by lower returns. These results suggest a pair approach using SPY as the first leg of the pair, and ETFs at IBS extremes as the other. For a dollar-neutral strategy, the rules are the following:

  • If SPY IBS <= 50% and ETF IBS > 80%, go long SPY and short the other ETF, in equal dollar amounts.
  • If SPY IBS > 50% and ETF IBS < 20%, go short SPY and long the other ETF, in equal dollar amounts.

The results1:

pair strat returns

The numbers are excellent: high returns and relatively few trades with a high win rate. Let’s take a look at the alphas and betas from a regression of the excess returns to the pair strategy, using the Carhart 4 factor model:

four factor regression

Values in bold are statistically significantly different from zero at the 1% level.

On average, this strategy generates a daily alpha of 0.037%, or 9.28% annually, with essentially zero exposure to any of the factors. Transaction costs would certainly eat into this, but given the reasonable amount of trades (about 23 trades per year per pair on average) there should be a lot left over. The fact that over 90% of days consist of zero excess returns obscures the features of the actual returns to the strategy. Repeating the regression using only the days in which the strategy is in the market yields the following results:

four factor regression trade days only

Values in bold are statistically significantly different from zero at the 1% level.

Unfortunately, these results are pretty much a historical curiosity at this point. Most of the opportunity has been arbitraged away: during the last 4 years the average return per trade has fallen to 0.150%, less than half the average over the entire sample. The parameters haven’t been optimized, so there may be more profitable opportunities still left by filtering only for more extreme values, but it’s clear that there is relatively little juice left in the approach.

In fact if we take a closer look at the differences between the returns before and after 2008, the over-reaction hypothesis seems to be borne out by the data (another factor that may be at play here are the heightened correlations we’ve seen in the last years): low SPY IBS leads to higher next-day returns for the ETFs, and vice versa.

pre and post 2008 results

The lesson to take away from these numbers is that cross-market effects can be very significant, especially when global markets are in a state of high correlation. Accounting for the state of US markets in your models can add significant information (and returns) to your IBS approach.

Footnotes
  1. As with any dollar-neutral approach, calculating returns is a tricky matter; in this case I have calculated the returns as a % of the capital allocated to one of the legs[]

Closing Price in Relation to the Day’s Range, and Equity Index Mean Reversion

UPDATE: read The IBS Effect: Mean Reversion in Equity ETFs instead of this post, it features more recent data and deeper analysis.

The location of the closing price within the day’s range is a surprisingly powerful predictor of next-day returns for equity indices. The closing price in relation to the day’s range (or CRTDR [UPDATE: as reader Jan mentioned in the comments, there is already a name for this: Internal Bar Strength or IBS] if you’re a fan of unpronounceable acronyms) is simply calculated as such:

CRTDR formula

It takes values between 0 and 1 and simply indicates at which point along the day’s range the closing price is located. In this post I will take a look not only at returns forecasting, but also how to use this value in conjunction with other indicators. You may be skeptical about the value of something so extremely simplistic, but I think you’ll be pleasantly surprised.

The basics: QQQ and SPY

First, a quick look at QQQ and SPY next-day returns depending on today’s CRTDR:

SPY CRTDR

QQQ CRTDR

A very promising start. Now the equity curves for each quartile:

spy quartile ECs

QQQ quartile ECs

That’s quite good; consistency through time and across assets is important and we’ve got both in this case. The magnitude of the out-performance of the bottom quartile is very large; I think we can do something useful with it.

There are several potential improvements to this basic approach: using the range of several days instead of only the last one, adjusting for the day’s close-to-close return, and averaging over several days are a few of the more obvious routes to explore. However, for the purposes of this post I will simply continue to use the simplest version.

CRTDR Internationally

A quick look across a larger array of assets, which is always an important test (here I also incorporate a bit of shorting):

All ETF results

Long when CRTDR < 45%, short when CRTDR > 95%. $10k per trade. Including commissions of $0.005 per share, excluding dividends.

One question that comes up when looking at ETFs of foreign indices is about the effect of non-overlapping trading hours. Would we be better off using the ETF trading hours or the local trading hours to determine the range and out predictions? Let’s take a look at the EWU ETF (iShares MSCI United Kingdom Index Fund) vs the FTSE 100 index, with the following strategy:

  • Go long on close if CRTDR < 45%
  • Go short on close if CRTDR > 95%
FTSE vs EWU

FTSE vs EWU CRTDR strategy, 1996-2012. $1m per trade (the number was a technical necessity due to the price of the FTSE 100 index).

Fascinating! This result left me completely stumped. I would love to hear your ideas about this…I have a feeling that there must be some sort of explanation, but I’m afraid I can’t come up with anything realistic.

Trading Signal or Filter?

It should be noted that I don’t actually use the CRTRD as a signal to take trades at all. Given the above results you may find this surprising, but all the positive returns are already captured by other, similar (and better), indicators (especially short-term price-based indicators such as RSI(3)). Instead I use it in reverse: as a filter to exclude potential trades. To demonstrate, let’s have a look at a very simplistic mean reversion system:

  • Buy QQQ at close when RSI(3) < 10
  • Sell QQQ at close when RSI(3) > 50

On average, this will result in a daily return of 0.212%. So we have two approaches in our hands that both have positive expectancy, what happens if we combine them?

  • Go long either on the RSI(3) criteria above OR CRTDR < 50%
QQQ RSI and RSI with CRTDR

RSI(3) and RSI(3) w/ CRTDR strategy applied to QQQ. Commissions not included.

This is a bit surprising: putting together two systems, both of which have positive expectancy, results in significantly lower returns. At this point some may say “there’s no value to be gained here”. But fear not, there are significant returns to be wrung out of the CRTDR! Instead of using it as a signal, what if we use it in reverse as a filter? Let’s investigate further: what happens if we split these days up by CRTDR?

RSI signal returns by CRTDR

Now that’s quite interesting. Combining them has very bad results, but instead we have an excellent method to filter out bad RSI(3) trades. Let’s have a closer look at the interplay between RSI(3) signals and CRTDR:

RSI CRTDR square

Next-day QQQ returns.

And now the equity curves with and without the CRTDR < 50% filter:

QQQ RSI and RSI with CRTDR filter

RSI(3) and RSI(3) w/ CRTDR < 50% filter applied to QQQ. Commissions not included.

That’s pretty good. Consistent performance and out-performance relative to the vanilla RSI(3) strategy. Not only that, but we have filtered out over 35% of trades which not only means far less money spent on commissions, but also frees up capital for other trades.

UPDATE: I neglected to mention that I use Cutler’s RSI and not the “normal” one, the difference being the use of simple moving averages instead of exponential moving averages. I have also uploaded an excel sheet and Multicharts .net signal code that replicate most of the results in the post.