k-NN Candlestick Pattern Search Extensions: More Data
This is a followup to the Mining for Three Day Candlestick Patterns post. If you haven’t read the original post, do so now because I’m not going to repeat the basic mechanics of the strategy. While the approach was somewhat fruitful, it also had some obvious problems: it only seems to work in bearish or high volatility market regimes, and it couldn’t produce good short signals. The main idea I had to resolve these issues was simply to get more data.
That is easier said than done. Could we use mutual funds or index values to extend the dataset backwards? No, because the daily high/low values are inaccurate. The only alternative we are left with is using data from other instruments. So I picked a broad selection of equity ETFs to include: EWY, EWD, EWC, EWQ, EWU, EWA, EWP, EWH, EWL, EFA, EPP, EWM, EWI, EWG, EWO, IWM, QQQ, EWS, EWT, and EWJ.
The selection was comprehensive and unoptimized. I think you could do some sort of walk-forward optimization that picks the best combination of securities to include in the data set. I’m not sure how much that would help.
The additional data worked fantastically well, resolving both problems. The number of opportunities to trade increased significantly, long signals work very nicely under all market conditions, and predicting negative returns works far better. There was also an unexpected benefit: far less time is needed before the forecasts become usable. In the original implementation I waited 2000 days before starting to use the forecasts. With the extended data set this can be cut to 500, thus letting the backtest cover a longer period.
Performance-wise there were no problems, as the Accord .NET k-d tree implementation that I use is very quick. Finding the nearest 75 points in a data set of approximately 100,000, in 11 dimensions, takes less than 2 milliseconds on my overclocked 2500K.
The settings used in the search are simple: the length of the patterns is 3 days, the 75 closest ones are used to construct a forecast by averaging their next-day returns, and distance is calculated as the sum of squared distances in every dimension. Trades are taken when the forecast is above/below a certain threshold. They are then passed through a filter which only allows long positions when IBS < 0.5 and short positions only when IBS > 0.5.
It should be noted that using traditional measures of “fit” does not work very well with pattern matching. Adding the above instruments actually increases the RMSE, despite significantly increasing the trading performance of the forecasts.
A look at forecasts vs realized next-day returns:
An important aspect to note is that even marginally positive forecasts work very well. For example, with the extended dataset, forecasts between 5 and 10 basis points resulted in an average 21 bp return the next day. On the other hand, using SPY data only, the return for those forecasts was just 5 basis points. What this means is that there are many more trades to take, which is what allows the strategy to do well in all market environments. Here’s the long-only equity curve:
A couple of charts to analyze the sensitivity of the long-only strategy’s results to changes in inputs (IBS limit and minimum forecast limit):
The additional data also has the benefit of making shorting possible. The equity curve doesn’t look as good, but it’s still a giant improvement over zero predictive ability on the short side:
Finally, the long and short strategies combined, along with the stats:
The concept also seems to work for stocks. For example, I tested a long-only strategy on AAPL, using the same settings as above, both with and without the addition of MSFT data. The Microsoft data improved every aspect of the results, with surprisingly consistent performance over nearly 20 years:
It would be interesting to try to apply this on a more massive scale, by increasing the data set to something like all S&P 500 stocks. Some technical restrictions prevent me from doing that right now, but I’ll come back to the idea in the future.