Machine learning applied to financial time series has a well-documented failure mode: build 200 features, train a complex model, achieve impressive in-sample performance, and then watch the model produce random noise in live trading. The failure usually traces back to one of three problems — feature look-ahead bias, feature redundancy that creates false confidence, or overfitting to a market regime that no longer exists.
We built a 75-feature ML pipeline for predicting 12-hour forward log returns on crypto perpetual futures. This post is about what we learned from building it properly — with walk-forward validation, IC-based feature selection, and a disciplined approach to evaluating what the model actually learned versus what it memorized.
The Feature Architecture: Three Categories
The features are organized into three distinct categories, each capturing a different dimension of the prediction problem. (The per-category counts below describe the 90+ candidate features; IC-based selection, discussed later, trimmed the set to the final 75.)
Technical Features (65 features)
The bulk of the feature set is technical indicators computed on each symbol's own price history.
Return lookbacks (12 features): Log returns and simple returns at 1h, 6h, 12h, 24h, 48h, and 72h windows. These capture momentum at different timescales — the 1h features capture very short-term momentum/mean-reversion, the 72h features capture medium-term trend effects.
Volatility (8 features): Rolling realized volatility at 12h, 24h, 48h, and 72h windows, plus the ratio of short-term to long-term volatility (volatility regime indicator).
Momentum indicators (15 features): RSI at 14 and 24 periods, Bollinger Band position (where the current price sits within the band, using 20- and 50-period bands), MA crossover signals for the 12/48 and 24/72 MA pairs, and price position (where the current price sits within the 24h and 72h high/low range).
Microstructure proxies (30 features): Volume relative to rolling average at multiple windows, volume-weighted price deviations, up/down volume ratios, candle body ratios, and wick-to-body ratios.
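A minimal pandas sketch of a few of these technical features, assuming hourly OHLCV bars with `open`/`high`/`low`/`close`/`volume` columns (the column names and exact formulas are illustrative, not the production pipeline):

```python
import numpy as np
import pandas as pd

def technical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a few technical features from hourly OHLCV bars.

    Assumes columns: open, high, low, close, volume (hourly index).
    Windows are in hours.
    """
    out = pd.DataFrame(index=df.index)
    logp = np.log(df["close"])

    # Return lookbacks: log returns over several windows
    for w in (1, 6, 12, 24, 48, 72):
        out[f"logret_{w}h"] = logp.diff(w)

    # Rolling realized volatility of 1h log returns
    r1 = logp.diff()
    for w in (12, 24, 48, 72):
        out[f"vol_{w}h"] = r1.rolling(w).std()
    out["vol_ratio_12_72"] = out["vol_12h"] / out["vol_72h"]  # regime indicator

    # Price position within the trailing high/low range
    for w in (24, 72):
        hi = df["high"].rolling(w).max()
        lo = df["low"].rolling(w).min()
        out[f"pricepos_{w}h"] = (df["close"] - lo) / (hi - lo)

    # Relative volume vs. its rolling average
    out["relvol_24h"] = df["volume"] / df["volume"].rolling(24).mean()
    return out
```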
Cross-Asset Features (15 features)
These features capture the symbol's relationship to the broader market.
BTC features (8 features): BTC's own return and volatility at 6h, 12h, 24h, and 72h windows. Every crypto asset's price is partially driven by BTC, and including BTC's state as a feature allows the model to learn this relationship explicitly.
BTC distance features (3 features): The distance of the current BTC price from its 24h, 72h, and 168h moving averages. This captures the BTC trend regime in a continuous rather than binary form.
Rolling BTC correlation (3 features): Each symbol's rolling correlation to BTC over 24h, 72h, and 168h windows. A symbol's current BTC correlation is a meaningful predictor — high-correlation periods tend to produce more directional returns, low-correlation periods more idiosyncratic ones.
Beta features (1 feature): Fast-window beta to BTC. This is distinct from correlation — beta captures the slope of the relationship, not just the tightness. High-beta assets amplify BTC moves; low-beta assets dampen them.
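The rolling correlation and beta features can be sketched as follows, assuming aligned hourly log-return series for the symbol and BTC (the 24h "fast" beta window is an assumption for illustration):

```python
import pandas as pd

def btc_relationship_features(sym_ret: pd.Series, btc_ret: pd.Series) -> pd.DataFrame:
    """Sketch: rolling BTC correlation and fast-window beta for one symbol.

    Both inputs are hourly log-return series on the same index;
    windows (in hours) follow the text.
    """
    out = pd.DataFrame(index=sym_ret.index)

    # Rolling correlation: tightness of the co-movement
    for w in (24, 72, 168):
        out[f"btc_corr_{w}h"] = sym_ret.rolling(w).corr(btc_ret)

    # Beta = Cov(r_sym, r_btc) / Var(r_btc): the slope, not the tightness
    w = 24  # illustrative "fast" window
    cov = sym_ret.rolling(w).cov(btc_ret)
    var = btc_ret.rolling(w).var()
    out["btc_beta_fast"] = cov / var
    return out
```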
Funding Rate Features (10 features — unique to perpetuals)
These features have no equivalent in spot or traditional markets and represent one of the most distinctive data sources available in crypto perpetual futures.
Funding rate level (4 features): The current funding rate, its 3-settlement and 9-settlement moving averages, and how far the current rate deviates from those averages (z-score of funding).
Funding momentum (3 features): Change in funding rate over 8h, 24h, and 72h — the direction and speed of the funding rate's recent movement.
Funding regime (3 features): Percentage of recent funding settlements that were positive (longs paying shorts), average funding level over the trailing 72h, and funding rate volatility.
Funding rate features are particularly powerful because they are market microstructure data unavailable in spot markets. A consistently positive funding rate signals that leveraged longs are paying a continuous fee to maintain positions — when this fee becomes extreme, the incentive to unwind longs grows, creating predictable price pressure. A funding rate z-score in the top 5% of historical observations is a meaningful signal of overextension that has no analogue in traditional equity technical analysis.
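A sketch of the funding features, assuming a series with one value per 8h funding settlement (the 30-settlement z-score lookback is an illustrative choice, not the production value):

```python
import pandas as pd

def funding_features(funding: pd.Series) -> pd.DataFrame:
    """Sketch of funding-rate features from a series of settlement rates
    (one value per 8h settlement)."""
    out = pd.DataFrame(index=funding.index)
    out["funding"] = funding
    out["funding_ma3"] = funding.rolling(3).mean()
    out["funding_ma9"] = funding.rolling(9).mean()

    # Z-score of the current rate vs. a trailing window
    w = 30  # ~10 days of 8h settlements; an assumed lookback
    mu = funding.rolling(w).mean()
    sd = funding.rolling(w).std()
    out["funding_z"] = (funding - mu) / sd

    # Regime: share of recent settlements where longs paid shorts
    out["funding_pos_share"] = (funding > 0).rolling(9).mean()
    return out
```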
Walk-Forward Validation: Why It Matters More Than Cross-Validation
The standard machine learning approach to backtesting — K-fold cross-validation — is inappropriate for time series data. It randomly assigns periods to training and test folds, so a model can be trained on data from late 2024 and then "tested" on data from early 2024. This leaks future information into training.
Walk-forward validation enforces strict temporal ordering:
- Train the model on 2 months of historical data
- Validate hyperparameters on the following 1 month (in-sample tuning)
- Test the trained model on the next 1 month (true out-of-sample)
- Advance the window forward by 1 month
- Retrain from scratch with the expanded dataset
- Repeat until the full backtest period is covered
The model is retrained monthly. Each monthly test period is genuinely out-of-sample — the model had no access to any data from that period during training or validation. The backtest equity curve is assembled from the sequence of these out-of-sample test periods.
This means our LightGBM model is effectively a different trained model each month, adapting to structural changes in market behavior rather than memorizing a fixed historical pattern. A model trained on 2021 data is retired when 2022 data becomes available for retraining — its 2021-specific patterns are not assumed to persist.
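The scheme above can be sketched as a generator over month boundaries. This is a simplified illustration of the expanding-window walk-forward, not the production splitter:

```python
import pandas as pd

def walk_forward_splits(index: pd.DatetimeIndex,
                        train_months: int = 2,
                        val_months: int = 1,
                        test_months: int = 1,
                        step_months: int = 1):
    """Yield (train, val, test) boolean masks over `index` with strict
    temporal ordering. The training window expands each step, matching
    the "retrain from scratch with the expanded dataset" rule."""
    months = index.to_period("M")
    unique = months.unique().sort_values()
    k = 0
    while True:
        tr_end = train_months + k          # expanding: training always starts at month 0
        va_end = tr_end + val_months
        te_end = va_end + test_months
        if te_end > len(unique):
            break
        yield (months.isin(unique[:tr_end]),        # train
               months.isin(unique[tr_end:va_end]),  # validation (hyperparameter tuning)
               months.isin(unique[va_end:te_end]))  # true out-of-sample test
        k += step_months
```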
What the Model Learned: Feature Importance
After running the full walk-forward backtest, we extracted average feature importances across all training epochs. The results are informative about what actually predicts 12-hour forward returns.
Top-performing feature groups:
Return features at 6h and 12h windows consistently rank highest. The model has learned that recent short-to-medium-term momentum is the strongest predictor of near-future returns — consistent with the cross-sectional momentum effect in academic finance. The 1h features are less predictive (too much noise), and the 72h features are less predictive (momentum effect weakens at longer lookbacks in crypto).
Funding rate z-score is the most predictive single funding feature. Extreme funding rates — positive or negative — predict price reversions in the 12h forward window. Perps with funding z-score above 2.0 (heavily positive) tend to underperform; perps with funding z-score below -2.0 (heavily negative) tend to outperform. This reflects the mechanical pressure of high funding costs forcing position liquidations.
BTC volatility regime features (BTC volatility percentile, BTC distance from 72h MA) are among the top cross-asset features. The model has learned that its predictions are more reliable when BTC is in a defined regime (strongly trending or showing specific volatility characteristics) than when BTC is in an intermediate state.
Bollinger Band position outperforms raw RSI. Both are momentum oscillators, but Bollinger Band position is adaptive to current volatility while RSI is not — this makes it more informative in crypto's variable-volatility environment.
What did not make the cut:
Many volume features performed poorly after IC-based selection. Raw volume is noisy and its predictive content is mostly captured by volatility features. We retained a small number of relative volume features but removed most of the 30 microstructure proxies initially included.
The 24-period RSI performed consistently worse than the 14-period RSI — the shorter period captures more of the relevant near-term momentum signal.
IC-Based Feature Selection
Before the final training runs, we computed the Information Coefficient (IC) for each feature — the correlation between the feature's values and 12-hour forward returns, computed across all symbols and all time periods in the training set.
Features with consistently low IC (absolute value below 0.02 across all time periods) were removed from the final feature set. This reduced our initial 90+ candidate features to the final 75, eliminating the weakest signals and reducing the risk of overfitting to noise.
The IC analysis also revealed a regime dependency: some features have high IC in trending markets and near-zero IC in ranging markets, and vice versa. Rather than removing these features entirely, we added a market regime indicator (BTC volatility percentile bin) that allows the model to weight regime-specific features appropriately.
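Per-period cross-sectional rank IC can be sketched like this; the mean-|IC| cutoff in `select_features` is a simplification of the per-period rule described above:

```python
import pandas as pd

def feature_ic(feature: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    """Sketch: per-period cross-sectional rank IC of one feature.

    `feature` and `fwd_ret` are (time x symbol) frames; each row's IC is
    the Spearman correlation across symbols at that timestamp
    (Pearson correlation of within-row ranks).
    """
    return feature.rank(axis=1).corrwith(fwd_ret.rank(axis=1), axis=1)

def select_features(ic_by_feature: dict, threshold: float = 0.02) -> list:
    """Keep features whose mean |IC| clears the threshold."""
    return [name for name, ics in ic_by_feature.items()
            if abs(ics.mean()) >= threshold]
```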
LightGBM: Why Tree-Based Over Neural Networks
We use LightGBM (gradient boosted decision trees) rather than neural networks for this prediction task. The choice is deliberate.
Financial time series prediction on tabular data does not benefit from the architectural advantages of deep learning (hierarchical feature composition, spatial/temporal convolutions). The signal-to-noise ratio in asset return prediction is extremely low — the model is trying to find weak, non-stationary patterns in data dominated by randomness. In this environment, simpler models with strong regularization outperform complex models in out-of-sample tests.
LightGBM's advantages for this task:
- Handles the tabular feature structure natively without transformation
- Regularization (L1, L2, minimum child samples) provides robust overfitting protection
- Fast training allows monthly retraining without computational overhead
- Feature importance outputs support interpretability and IC analysis
- Well-calibrated predictions — its ranking of assets by predicted return is more reliable than the absolute magnitude of predictions
The model configuration uses conservative hyperparameters: 31 leaves (limits tree depth), learning rate 0.05, feature fraction 0.8, bagging fraction 0.8, and early stopping on validation loss with a 50-round patience. These settings are biased toward underfitting rather than overfitting — in low-signal environments, it is better to miss weak signals than to fit noise.
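As a sketch, the configuration described above maps onto a lightgbm-style parameter dict like the following. Only the named settings come from the text; the L1/L2 strengths and `min_child_samples` value are illustrative assumptions:

```python
# Illustrative parameter dict following the lightgbm Python API naming.
lgbm_params = {
    "objective": "regression",
    "num_leaves": 31,          # limits tree complexity
    "learning_rate": 0.05,
    "feature_fraction": 0.8,   # column subsampling per tree
    "bagging_fraction": 0.8,   # row subsampling
    "bagging_freq": 1,         # assumed; bagging_fraction needs a nonzero freq
    "lambda_l1": 0.1,          # illustrative regularization strengths
    "lambda_l2": 0.1,
    "min_child_samples": 100,  # illustrative; larger values resist overfitting
}
# Early stopping with 50-round patience is supplied at train time, e.g.
# lightgbm.train(..., callbacks=[lightgbm.early_stopping(50)]).
```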
The Reality Check: What the IC Numbers Actually Mean
After all the feature engineering and model training, the honest evaluation of predictive quality is sobering. The trained model achieves an average out-of-sample IC of approximately 0.04–0.06 across the full test period.
An IC of 0.05 means the cross-sectional rank correlation between the model's predicted returns and realized returns is about 0.05. This sounds weak. In asset pricing it is actually meaningful — a consistent 0.05 IC across a large universe of assets, implemented with proper position sizing and realistic transaction costs, generates positive risk-adjusted returns.
The insight from Grinold's Fundamental Law of Active Management applies here: a strategy with low per-trade IC but many independent trades (50 symbols, weekly rebalancing, 5 years) accumulates statistical significance that a few high-conviction trades cannot match. The model's edge is not in being dramatically right on any individual prediction — it is in being slightly right, consistently, across many predictions.
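Grinold's law makes the arithmetic concrete: IR ≈ IC * sqrt(breadth), where breadth counts *independent* bets per year. The numbers below are illustrative only; crypto cross-sections are highly correlated, so the effective breadth is far smaller than symbols times rebalances:

```python
import math

ic = 0.05
nominal_breadth = 50 * 52            # 50 symbols, weekly rebalancing, naively independent
ir_nominal = ic * math.sqrt(nominal_breadth)

# Correlated bets shrink the effective number of independent trades;
# 200 is an assumed effective breadth, not a measured value.
effective_breadth = 200
ir_effective = ic * math.sqrt(effective_breadth)
```

Even under the conservative assumption, a small per-trade edge applied across many trades yields a usable information ratio; the nominal figure mainly shows why independence matters to the calculation.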
Takeaways
- 75 features organized into technical, cross-asset, and funding rate categories — each category captures a distinct dimension of the prediction problem
- Walk-forward validation with monthly retraining is the only defensible methodology for financial ML backtesting — K-fold validation produces optimistic results that do not survive live trading
- Funding rate z-score is the most distinctive and predictive feature in crypto perpetual futures — it has no traditional equity equivalent and captures a mechanical pressure effect
- LightGBM outperforms neural networks on this tabular, low-signal-to-noise prediction problem — architectural complexity is not an advantage when the underlying signal is weak
- IC of 0.04–0.06 is modest but meaningful — Grinold's law explains how small per-trade edges compound into positive returns at scale
- Feature selection by IC eliminates noise and focuses the model on the features that actually predict — removing weak features improves out-of-sample performance