The year is over in a few hours and I thought it would be nice to do a quick review of the year, revisit some studies and the most popular posts of the year, as well as share some thoughts on my performance in 2013 and my goals for 2014.
Revisiting Old Studies
IBS did pretty badly in 2012, and didn’t manage to reach the amazing performance of 2007-2010 this year either. However, it still worked reasonably well: IBS < 0.5 led to far higher returns than IBS > 0.5, and the highest IBS quartile had negative returns. It still works amazingly well as a filter. Most importantly, the magnitude of the effect has diminished. This is partly due to the low volatility we’ve seen this year. After all, IBS does best when movements are large, and SPY’s 10-day realized volatility never even broke 20% this year. Here are the stats:
The original post can be found here. Performance in 2013 hasn’t been as good as in the past, but was still reasonably OK. I think the results are, again, at least partially due to the low volatility environment in equities this year.
UDIDSRI performance, close-to-close returns after a zero reading.
I’ve done 3 posts on day-of-the-month seasonality (US, EU, Asia), and on average the DOTM effect did its job this year. There are some cases where the top quartile does not have the top returns, but a single year is a relatively small sample so I doubt this has any long-term implications. Here are the stats for 9 major indices:
Day of the month seasonality in 2013
My studies on the implied volatility indices ratio turned out to work pretty badly. Returns when the VIX:VXV ratio was 5% above the 10-day SMA were -0.03%. There were no 200-day highs in the ratio in 2013!
Overall I would say it was a mixed bag for me this year. Returns were reasonably good, but a bit below my long-term expectations. It was a very good year for equities, and my results can’t compete with SPY’s 5.12 MAR ratio, which makes me feel pretty bad. Of course I understand that years like this one don’t represent the long-term, but it’s annoying to get beaten by b&h nonetheless.
Some strategies did really well:
Others did really poorly:
Risk was kept under control and entirely within my target range, both in terms of volatility and maximum drawdown. Even when I was at the year’s maximum drawdown I felt comfortable…there is still “psychological room” for more leverage. Daily returns were positively skewed. My biggest success was diversifying across strategies and asset classes. A year ago I was trading few instruments (almost exclusively US equity ETFs) with a limited number of strategies. Combine that with a pretty heavy equity tilt in the GTAA allocation, and my portfolio returns were moving almost in lockstep with the indices (there were very few shorting opportunities in this year’s environment, so the choice was almost always between being long or in cash). Widening my asset universe combined with research into new strategies made a gigantic difference:
I made a series of mistakes that significantly hurt my performance figures this year. Small mistakes pile on top of each other and in the end have a pretty large effect. All in all I lost several hundred bp on these screw-ups. Hopefully you can learn from my errors:
- Back in March I forgot that US daylight saving time kicks in earlier than it does here in Europe. I had positions to exit at the open and I got there 45 minutes late. Naturally the market had moved against me.
- A bug in my software led to incorrectly handling dividends, which led to signals being calculated using incorrect prices, which led to a long position when I should have taken a short. Taught me the importance of testing with extreme caution.
- Problems with reporting trade executions at an exchange led to an error where I sent the same order twice and it took me a few minutes to close out the position I had inadvertently created.
- I took delivery on some FX futures when I didn’t want to, cost me commissions and spread to unwind the position.
- Order entry, sent a buy order when I was trying to sell. Caught it immediately so the cost was only commissions + spread.
- And of course the biggest one: not following my systems to the letter. A combination of fear, cowardice, over-confidence in my discretion, and under-confidence in my modeling skills led to some instances where I didn’t take trades that I should have. This is the most shameful mistake of all because of its banality. I don’t plan on repeating it in 2014.
Goals for 2014
- Beat my 2013 risk-adjusted returns.
- Don’t repeat any mistakes.
- Make new mistakes! But minimize their impact. Every error is a valuable learning experience.
- Continue on the same path in terms of research.
- Minimize model implementation risk through better unit testing.
Finally, the most popular posts of the year:
- The original IBS post. Read the paper instead.
- Doing the Jaffray Woodriff Thing. I still need to follow up on that…
- Mining for Three Day Candlestick Patterns, which also spawned a short series of posts.
I want to wish you all a happy and profitable 2014!
I finally finished the first draft of my IBS paper. The results are quite interesting and extremely relevant if you trade equity ETFs. You can read it here.
I investigate mean reversion in equity ETF prices at the daily frequency by employing a simple technical indicator, Internal Bar Strength (IBS). IBS is based on the position of the day’s close in relation to the day’s range. I use it to forecast close-to-close returns with statistically and economically significant results for most instruments. A simple strategy based on IBS generates an average alpha of over 30% p.a. before transaction costs. I show that equity index ETFs have had strong and consistent mean reverting tendencies since the 90s, and that these effects can be exploited as part of a profitable trading strategy. The IBS effect is stronger during times of high volatility, in bear markets, after high-range days, after high-volume days, and early in the week.
Feedback is highly appreciated, either in the comments below or by email to qusmablog at gmail dot com.
Some of the interesting things you’ll find within:
IBS plotted against average close-to-close returns.
Cumulative NQ returns at 5 minute intervals after IBS < 0.2 at 15:00 CT.
Equity curves of a simple RSI(3) strategy on QQQ, with and without IBS filter.
Update: the comparison chart for the Australian ETF now correctly uses EWA instead of EWO (the Austrian ETF).
The second, and probably final, followup to the Mining for Three Day Candlestick Patterns post. Previously, we improved performance by adding more data to the search. In this post we’ll try to improve the system further by combining multiple predictors. The central question is how to combine the forecasts. I test averaging, weighted averaging, regression, and a voting scheme and compare them against a baseline one-predictor strategy.
Combining predictors is a standard tactic in machine learning, but the case of k-NN predictors is a bit of an outlier. Typical ensemble methods depend on generating variations in the data set in order to generate different and complementary predictors (as in the cases of boosting and bagging). This doesn’t work very well with nearest neighbor predictors, however, because they tend to be insensitive to variations in the data set. So what can we vary? The choice of k, the choice of inputs, the choice of distance measure for the nearest neighbors, and some pre-processing options such as whether to adjust for volatility or not.
I am not going to vary the inputs, as that’s reserved for a post of its own. The idea there is pretty simple: it’s essentially a random forest with k-NN predictors instead of decision trees (here’s an interesting paper on it).
So we’re left with k, sum of absolute or sum of square distances, and volatility adjustment. I picked 10 combinations of these options:
The k values were picked at random and I’m sure it’s possible to do better by optimizing them using cross validation.
The signals obviously overlap significantly, and have similar stats when used one-by-one:
Long signal stats. Long position threshold: forecast > 5 basis points & IBS < 0.5.
Short signal stats. Short position threshold: forecast < -10 basis points & IBS > 0.5.
The instrument traded is SPY. Additional data is taken from the following instruments for the pattern search: EWY, EWD, EWC, EWQ, EWU, EWA, EWP, EWH, EWL, EFA, EPP, EWM, EWI, EWG, EWO, IWM, QQQ, EWS, EWT, and EWJ. The thresholds in each case are adjusted to result in a similar length of time spent in the market. Position sizing is done based on the 10-day realized volatility of SPY, as described in this post: leverage is equal to 20% divided by 10-day realized annualized standard deviation, with a maximum leverage of 200%. Finally, an IBS filter is applied that allows long positions only when IBS < 0.5 and short positions only when IBS > 0.5.
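The position sizing rule above can be sketched in a few lines. This is only an illustration: the use of simple returns, the sample standard deviation, and 252 trading days per year for annualization are my assumptions, and the function names are hypothetical.

```python
import math

def realized_vol(closes, window=10, periods_per_year=252):
    """Annualized sample standard deviation of the last `window` daily returns."""
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closing prices")
    recent = closes[-(window + 1):]
    rets = [recent[i] / recent[i - 1] - 1.0 for i in range(1, len(recent))]
    mean = sum(rets) / window
    variance = sum((r - mean) ** 2 for r in rets) / (window - 1)
    return math.sqrt(variance * periods_per_year)

def target_leverage(vol, target=0.20, cap=2.0):
    """Leverage = target / realized vol, capped at `cap` (200% in the post)."""
    if vol <= 0:
        return cap  # assumption: a zero-volatility reading falls back to the cap
    return min(target / vol, cap)
```

With a realized volatility of 40% the rule halves the position, while anything at or below 10% hits the 200% cap.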
The baseline is the PF3 predictor: k = 75, square distance measure, no volatility adjustment. Here’s the equity curve:
PF3 predictor equity curve. $0.005 per share in commissions.
The simplest approach is obviously to just average the 10 forecasts and then use the average value to generate trades. A long position is taken when the average forecast is greater than 15 basis points, and a short position when the average is smaller than -12.5 basis points. Here’s what the equity curve looks like:
Equity curve using average forecast. $0.005 per share in commissions.
It’s interesting to note that the dispersion of forecasts is inversely related to the accuracy of the average: the smaller the standard deviation of the forecasts, the more accurate they are. Unfortunately the effect is marginal and thus not particularly useful for improving the strategy.
A simple extension, that generates slightly better stats, is to weigh each forecast before averaging. There’s a wide array of stats one can use here (Sharpe/Sortino/MAR ratios are obvious candidates); I picked the mean square error. The inverse of the MSE becomes the forecast’s weight, so that smaller errors result in greater weights. The same thresholds as above are used to generate signals. The weights provide a slight improvement both in terms of Sharpe and MAR ratios. The equity curve:
Equity curve using weighted average forecast, with weights equal to the inverse of the mean square error. $0.005 per share in commissions.
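A rough sketch of the inverse-MSE weighting, under a few assumptions that are not spelled out in the post: the weights are normalized to sum to 1, they are estimated only from forecasts made before the day being traded, and all function names are hypothetical. The thresholds are the 15 and -12.5 basis points quoted for the plain average.

```python
def mse(forecasts, realized):
    """Mean square error of one predictor's past forecasts."""
    return sum((f - r) ** 2 for f, r in zip(forecasts, realized)) / len(realized)

def inverse_mse_weights(past_forecasts, realized):
    """past_forecasts: one list of historical forecasts per predictor."""
    inverse = [1.0 / mse(f, realized) for f in past_forecasts]
    total = sum(inverse)
    return [w / total for w in inverse]  # assumption: weights normalized to sum to 1

def weighted_forecast(todays_forecasts, weights):
    """Combine today's forecasts using the precomputed weights."""
    return sum(f * w for f, w in zip(todays_forecasts, weights))

def signal(forecast_bp, long_bp=15.0, short_bp=-12.5):
    """+1 long, -1 short, 0 flat, using the thresholds quoted for the average forecast."""
    if forecast_bp > long_bp:
        return 1
    if forecast_bp < short_bp:
        return -1
    return 0
```

Because the ten predictors have very similar errors, the weights end up close to equal, which fits the small size of the improvement.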
Each predictor casts a vote based on a threshold applied to its forecast: >5 basis points for a “long” vote and <-10 basis points for a “short” vote. The overlap between the votes is significant, between 88% and 97% for different estimators. How many votes should we require for a trade? It quickly becomes obvious that simple majority voting isn’t enough, as only near-unanimous decisions provide worthwhile predictions. The average next-day return when there are between 1 and 8 long votes is 0.4 basis points. The average return after 9 or 10 long votes is 23 basis points.
The resulting equity curve looks like this:
Equity curve using voting system. 9 or more votes required to take a position. $0.005 per share in commissions.
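The voting rule is simple enough to sketch directly. The thresholds and the 9-vote requirement come from the text above; the function name is hypothetical, and the IBS filter and position sizing are left out to keep the example short.

```python
LONG_VOTE_BP = 5.0     # forecast above this counts as a long vote
SHORT_VOTE_BP = -10.0  # forecast below this counts as a short vote

def vote_signal(forecasts_bp, min_votes=9):
    """Return +1 (long), -1 (short) or 0 (flat) from the predictors' forecasts."""
    long_votes = sum(1 for f in forecasts_bp if f > LONG_VOTE_BP)
    short_votes = sum(1 for f in forecasts_bp if f < SHORT_VOTE_BP)
    if long_votes >= min_votes:
        return 1
    if short_votes >= min_votes:
        return -1
    return 0
```

With 10 predictors and 9 votes required, the long and short conditions can never both be met on the same day.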
Ordinary Least Squares
It’s also possible to combine the forecasts using regression, with next-day returns as the dependent variable and the k-NN predictor forecasts as the independent ones.
The distribution of forecasts with OLS is very tightly clustered around 0, and for some reason higher forecasts are not associated with higher next-day returns (as they are for the 3 methods above). I don’t really understand why this is the case. The thresholds for trades are 0.5 basis points for a long trade, and -0.5 basis points for a short trade.
An issue here is, of course, multicollinearity due to the similarity of the independent variables. This can lead to, among other problems, overfitting (which is usually characterized by very large absolute values of the coefficients). Using ridge regression solves that issue by limiting the absolute value of coefficients.
A potentially interesting idea would be to constrain the coefficients to positive values, which might lessen the overfitting effects and also make much more sense on an intuitive level (after all, we know all the forecasts are similarly accurate, so negative coefficients don’t make much sense).
Equity curve using OLS regression. $0.005 per share in commissions.
If multicollinearity is a significant problem, we can use ridge regression to solve it. It offers a significant improvement over the OLS approach, but it still fares badly compared to the one-predictor case. The same thresholds as in the OLS approach are used. Here’s the equity curve:
Equity curve using ridge regression. $0.005 per share in commissions.
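To make the multicollinearity point concrete, here is a small pure-Python ridge sketch that solves (X'X + lam*I) b = X'y. It is not the code behind the results above: there is no intercept, the penalty `lam` is an arbitrary illustrative value, and the helper names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ridge_fit(X, y, lam):
    """Ridge coefficients without an intercept; X is a list of rows."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

# Two identical predictors: OLS (lam = 0) has no unique solution,
# while the ridge penalty splits the weight evenly between them.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]
coefficients = ridge_fit(X, y, lam=1.0)
```

In this toy case both coefficients come out equal and slightly below 1, which is the shrinkage that keeps near-identical forecasts from receiving large offsetting weights.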
Here are the stats for the single-predictor base case and all the combination methods:
All of them other than the voting failed horribly. I’m not sure why, but it’s good to know. The improvement provided by the voting system is sizable, however. Not only does the voting-based strategy achieve significantly higher risk-adjusted returns, it does it while spending 15% less time in the market. Those results are also easy to improve on by simply adding more predictors. The marginal gain from each new predictor will be diminishing, but there is definitely more value to wring out of it. And this is just with 3-day patterns: we can easily add 2 and 4 day patterns into the mix as well.
A wide array of machine learning methods can be used to combine predictions. Especially if the number of forecasts grew larger, techniques such as random forests or ANNs would be interesting to investigate. As long as simpler methods work very well I think there is little reason to increase the complexity (not to mention the opaqueness) of the strategy.
I’ve been thinking a lot about candlestick patterns lately but grew tired of trying to generate ideas and instead decided to mine for them. I must confess I didn’t expect much from such a simplistic approach, so I was pleasantly surprised to see it working well. Unfortunately I wasn’t able to discover any short set-ups. The general bias of equity markets toward the upside makes it difficult to find enough instances of patterns that are followed by negative returns.
The idea is to mine past data for similar 3 day patterns, and then use that information to make trading decisions. There are several choices we must make:
- The size of the lookback window. I use an expanding window that starts at 2000 days.
- Once we find similar patterns, how do we choose which ones to use?
- How do we measure the similarity between the patterns?
To fully describe a three-day candlestick pattern we need 11 numbers: the close-to-close percentage change from day 1 to day 2 and from day 2 to day 3 (2 numbers), plus the positions of the open, high, and low relative to the close for each of the 3 days (9 numbers).
To measure the degree of similarity between any two 3-day patterns, I tried both the sum of absolute differences and the sum of the squared differences between those 11 numbers; the results were quite similar. It would be interesting to try to optimize individual weights for each number, as I imagine some are more important than others.
The final step is to select a number of the closest patterns we find, and simply average their next-day returns to arrive at an expected return.
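The three steps above (describe the pattern, measure similarity, average the neighbors' next-day returns) can be sketched as follows. Expressing the open, high, and low as percentage offsets from the close is my assumption about what "relative to the close" means, and the function names are hypothetical.

```python
def pattern_features(bars):
    """bars: three (open, high, low, close) tuples, oldest first. Returns 11 numbers."""
    if len(bars) != 3:
        raise ValueError("expected exactly three bars")
    closes = [bar[3] for bar in bars]
    # Two close-to-close changes: day 1 -> day 2 and day 2 -> day 3.
    features = [closes[1] / closes[0] - 1.0, closes[2] / closes[1] - 1.0]
    # Open, high and low of each day as an offset from that day's close.
    for o, h, l, c in bars:
        features.extend([o / c - 1.0, h / c - 1.0, l / c - 1.0])
    return features

def distance(a, b, squared=False):
    """Sum of absolute (default) or squared differences between two patterns."""
    if squared:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_forecast(query, history, k=50, squared=False):
    """history: list of (features, next_day_return) pairs from the lookback window."""
    if not history:
        raise ValueError("history is empty")
    ranked = sorted(history, key=lambda item: distance(query, item[0], squared))
    nearest = ranked[:k]
    return sum(ret for _, ret in nearest) / len(nearest)
```

In a walk-forward test, `history` would contain only patterns whose next-day return was already known on the day of the forecast.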
Expected vs realized returns for SPY, 50 closest patterns by absolute difference. Numbers above the bars indicate the number of instances in each bucket.
How do we choose which patterns are “close enough” to use? Choose too few and the sample will be too small. Choose too many and you risk using irrelevant data. That’s a number that we’ll have to optimize.
Histogram of expected return estimates for different sample sizes.
When comparing the results we also run into another problem: the smaller the sample, the more spread out the expected return estimates will be, which means more trades will be chosen given a certain minimum limit for entry. My solution was to choose a different limit for trade entry, such that all sample sizes would generate the same number of days in the market (300 in this case). Here are the walk-forward results:
The trade-off between sample size and relevance is clear, and the “sweet spot” appears to be somewhere in the 50-150 range or so, for both the absolute difference and squared difference approaches. Depending on how selective you want to be, you can decrease the limit and trade off more trades for lower expected returns. For me, 30 bp is a reasonable area to aim for.
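One way to equalize time in the market across sample sizes is to set each limit at the N-th highest expected return estimate. The sketch below is only one reading of that step, not necessarily how the limits were picked for the results above; the function name is hypothetical, and because it ranks the full sample it is suitable for comparing sample sizes rather than for live trading.

```python
def entry_limit(expected_returns, days_in_market=300):
    """Lowest expected return among the `days_in_market` highest estimates."""
    if not 0 < days_in_market <= len(expected_returns):
        raise ValueError("days_in_market must be between 1 and the sample length")
    return sorted(expected_returns, reverse=True)[days_in_market - 1]

# Toy example: keep the 2 best days out of 4.
estimates = [0.1, 0.5, 0.3, 0.2]
limit = entry_limit(estimates, days_in_market=2)
days_traded = sum(1 for e in estimates if e >= limit)
```

Ties at the limit can let in a few extra days, so the count is approximate when many estimates are identical.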
A nice little addition is to use IBS by filtering out any trades with IBS > 50%. Using squared differences, I select the 50 closest patterns. When their average next-day return is greater than 0.2%, a long position is taken. The results are predictably great:
The IBS filter removes close to 40% of days in the market yet maintains essentially the same CAGR, while also more than halving the maximum drawdown.
Let’s take a look at some of the actual patterns. Using squared differences, the 50 closest patterns, and a 0.2% limit, the last successful trade was on February 26, 2013. The expected return on that day was 0.307%. Here’s what those 3 days looked like, as well as the 5 closest historical patterns:
As you can see below, even the 50th closest pattern seems to be, based on visual inspection, rather close. The “main idea” of the pattern seems to be there:
Here are the stats from a bunch of different equity index ETFs, using squared differences, the 50 closest patterns, a 0.2% expected return limit, and the IBS < 0.5 filter.
The 0.2% limit seems to be too low for some of them, producing too many trades. Perhaps setting an appropriate limit per-instrument would be a good idea.
The obvious path forward is to also produce 2-day, 4-day, 5-day, etc. versions, perhaps with optimized distance weighting and some outlier filtering, and combine them all in a nice little ensemble to get your predictions out of. The implementation is left as an exercise for the reader.
I’m writing a paper on the IBS effect, but it’s taking a bit longer than expected so I thought I’d share some of the results in a blog post. The starting point is a paper by Levy & Lieberman: Overreaction of Country ETFs to US Market Returns, in which the authors find that country ETFs over-react to US returns during non-overlapping trading hours, which gives rise to abnormal returns as the country ETFs revert back the next day. In terms of the IBS effect, this suggests that a high SPY IBS would lead to over-reaction in the country ETFs and thus lower returns the next day, and vice versa.
To quickly recap, Internal Bar Strength (or IBS) is an indicator with impressive (mean reversion) predictive ability for equity indices. It is calculated as follows:
IBS = (Close - Low) / (High - Low)
Using a selection of 32 equity index ETFs, let’s take a look at next-day returns after IBS extremes (top and bottom 20%), split up by SPY’s IBS (top and bottom half):
The results were the exact opposite of what I was expecting. Instead of over-reacting to a high SPY IBS, the ETFs under-react to it. A high SPY IBS is followed by higher returns for the ETFs, while a low SPY IBS is followed by lower returns. These results suggest a pair approach using SPY as the first leg of the pair, and ETFs at IBS extremes as the other. For a dollar-neutral strategy, the rules are the following:
- If SPY IBS <= 50% and ETF IBS > 80%, go long SPY and short the other ETF, in equal dollar amounts.
- If SPY IBS > 50% and ETF IBS < 20%, go short SPY and long the other ETF, in equal dollar amounts.
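The two rules above translate into a small signal function. The name and the (+1, -1, 0) encoding are hypothetical; IBS values are expressed as fractions between 0 and 1, and each leg would be sized to the same dollar amount.

```python
def pair_signal(spy_ibs, etf_ibs):
    """Return (spy_position, etf_position): +1 long, -1 short, 0 flat."""
    if spy_ibs <= 0.5 and etf_ibs > 0.8:
        return (1, -1)   # long SPY, short the other ETF
    if spy_ibs > 0.5 and etf_ibs < 0.2:
        return (-1, 1)   # short SPY, long the other ETF
    return (0, 0)        # no trade
```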
The numbers are excellent: high returns and relatively few trades with a high win rate. Let’s take a look at the alphas and betas from a regression of the excess returns to the pair strategy, using the Carhart 4 factor model:
Values in bold are statistically significantly different from zero at the 1% level.
On average, this strategy generates a daily alpha of 0.037%, or 9.28% annually, with essentially zero exposure to any of the factors. Transaction costs would certainly eat into this, but given the reasonable amount of trades (about 23 trades per year per pair on average) there should be a lot left over. The fact that over 90% of days consist of zero excess returns obscures the features of the actual returns to the strategy. Repeating the regression using only the days in which the strategy is in the market yields the following results:
Values in bold are statistically significantly different from zero at the 1% level.
Unfortunately, these results are pretty much a historical curiosity at this point. Most of the opportunity has been arbitraged away: during the last 4 years the average return per trade has fallen to 0.150%, less than half the average over the entire sample. The parameters haven’t been optimized, so there may be more profitable opportunities still left by filtering only for more extreme values, but it’s clear that there is relatively little juice left in the approach.
In fact, if we take a closer look at the differences between the returns before and after 2008, the over-reaction hypothesis seems to be borne out by the data (another factor that may be at play here is the heightened correlations we’ve seen in recent years): low SPY IBS leads to higher next-day returns for the ETFs, and vice versa.
The lesson to take away from these numbers is that cross-market effects can be very significant, especially when global markets are in a state of high correlation. Accounting for the state of US markets in your models can add significant information (and returns) to your IBS approach.
UPDATE: read The IBS Effect: Mean Reversion in Equity ETFs instead of this post, it features more recent data and deeper analysis.
The location of the closing price within the day’s range is a surprisingly powerful predictor of next-day returns for equity indices. The closing price in relation to the day’s range (or CRTDR [UPDATE: as reader Jan mentioned in the comments, there is already a name for this: Internal Bar Strength or IBS] if you’re a fan of unpronounceable acronyms) is simply calculated as such:
CRTDR = (Close - Low) / (High - Low)
It takes values between 0 and 1 and simply indicates at which point along the day’s range the closing price is located. In this post I will take a look not only at returns forecasting, but also how to use this value in conjunction with other indicators. You may be skeptical about the value of something so extremely simplistic, but I think you’ll be pleasantly surprised.
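As a minimal sketch, the indicator is one line of arithmetic on the day's high, low, and close. Returning 0.5 on a day with no range (high equal to low) is my assumption, since the ratio is undefined there.

```python
def ibs(high, low, close):
    """Position of the close within the day's range: 0 at the low, 1 at the high."""
    day_range = high - low
    if day_range == 0:
        return 0.5  # assumption: treat a zero-range day as neutral
    return (close - low) / day_range
```

A close at the low gives 0, a close at the high gives 1, and a close in the middle of the range gives 0.5.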
The basics: QQQ and SPY
First, a quick look at QQQ and SPY next-day returns depending on today’s CRTDR:
A very promising start. Now the equity curves for each quartile:
That’s quite good; consistency through time and across assets is important and we’ve got both in this case. The magnitude of the out-performance of the bottom quartile is very large; I think we can do something useful with it.
There are several potential improvements to this basic approach: using the range of several days instead of only the last one, adjusting for the day’s close-to-close return, and averaging over several days are a few of the more obvious routes to explore. However, for the purposes of this post I will simply continue to use the simplest version.
A quick look across a larger array of assets, which is always an important test (here I also incorporate a bit of shorting):
Long when CRTDR < 45%, short when CRTDR > 95%. $10k per trade. Including commissions of $0.005 per share, excluding dividends.
One question that comes up when looking at ETFs of foreign indices is about the effect of non-overlapping trading hours. Would we be better off using the ETF trading hours or the local trading hours to determine the range and our predictions? Let’s take a look at the EWU ETF (iShares MSCI United Kingdom Index Fund) vs the FTSE 100 index, with the following strategy:
- Go long on close if CRTDR < 45%
- Go short on close if CRTDR > 95%
FTSE vs EWU CRTDR strategy, 1996-2012. $1m per trade (the number was a technical necessity due to the price of the FTSE 100 index).
Fascinating! This result left me completely stumped. I would love to hear your ideas about this…I have a feeling that there must be some sort of explanation, but I’m afraid I can’t come up with anything realistic.
Trading Signal or Filter?
It should be noted that I don’t actually use the CRTDR as a signal to take trades at all. Given the above results you may find this surprising, but all the positive returns are already captured by other, similar (and better), indicators (especially short-term price-based indicators such as RSI(3)). Instead I use it in reverse: as a filter to exclude potential trades. To demonstrate, let’s have a look at a very simplistic mean reversion system:
- Buy QQQ at close when RSI(3) < 10
- Sell QQQ at close when RSI(3) > 50
On average, this will result in a daily return of 0.212%. So we have two approaches in our hands that both have positive expectancy. What happens if we combine them?
- Go long either on the RSI(3) criteria above OR CRTDR < 50%
RSI(3) and RSI(3) w/ CRTDR strategy applied to QQQ. Commissions not included.
This is a bit surprising: putting together two systems, both of which have positive expectancy, results in significantly lower returns. At this point some may say “there’s no value to be gained here”. But fear not, there are significant returns to be wrung out of the CRTDR! Instead of using it as a signal, what if we use it in reverse as a filter? Let’s investigate further: what happens if we split these days up by CRTDR?
Now that’s quite interesting. Combining them has very bad results, but instead we have an excellent method to filter out bad RSI(3) trades. Let’s have a closer look at the interplay between RSI(3) signals and CRTDR:
Next-day QQQ returns.
And now the equity curves with and without the CRTDR < 50% filter:
RSI(3) and RSI(3) w/ CRTDR < 50% filter applied to QQQ. Commissions not included.
That’s pretty good: consistent performance and out-performance relative to the vanilla RSI(3) strategy. On top of that, we have filtered out over 35% of trades, which not only means far less money spent on commissions but also frees up capital for other trades.
UPDATE: I neglected to mention that I use Cutler’s RSI and not the “normal” one, the difference being the use of simple moving averages instead of exponential moving averages. I have also uploaded an excel sheet and Multicharts .net signal code that replicate most of the results in the post.
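For readers who don't want to open the spreadsheet, here is a rough Python sketch of Cutler's RSI and the filtered entry rule. It is not a port of the uploaded Excel or Multicharts code: the function names are hypothetical, and returning 100 when there are no losses and 50 when prices are completely flat are conventions I am assuming.

```python
def cutler_rsi(closes, period=3):
    """RSI built from simple (not exponential) averages of gains and losses."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    recent = closes[-(period + 1):]
    changes = [recent[i] - recent[i - 1] for i in range(1, len(recent))]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0 if avg_gain > 0 else 50.0  # assumed conventions
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

def filtered_entry(closes, high, low, rsi_limit=10.0, crtdr_limit=0.5):
    """True when RSI(3) < 10 and today's close is in the lower half of the range."""
    day_range = high - low
    crtdr = 0.5 if day_range == 0 else (closes[-1] - low) / day_range
    return cutler_rsi(closes) < rsi_limit and crtdr < crtdr_limit
```

Because it uses a simple average, the value depends only on the last `period` price changes, so it does not need the long warm-up that the exponentially smoothed version does.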