The entire development pipeline is built on a single axiom: a model that looks too good on historical data is almost certainly wrong. In quantitative finance, genuine edges are small. Our methodology is designed to detect and eliminate any artificial inflation of performance metrics, accepting modest but real predictive power over spectacular but illusory results.
Success is judged on three criteria: consistency across market regimes, realistic out-of-sample metrics, and robustness under transaction costs. Every design decision prioritises real-world viability over backtest aesthetics.
Any result that exceeds established plausibility thresholds is automatically flagged for review. Features without economic rationale are excluded regardless of their in-sample predictive power.
The system follows an 8-phase sequential pipeline where each phase has strict data-access boundaries. Information flows forward only — no phase may access data generated by a subsequent phase.
All OHLCV data is sourced from institutional-grade feeds across multiple timeframes and subjected to a rigorous cleaning process: timestamp normalisation, duplicate removal, gap detection, and numeric validation. Poor data quality is a common but often overlooked source of false signals in quantitative systems — corrupted prices or misaligned timestamps can create phantom patterns that vanish in live trading.
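As an illustration, a minimal cleaning pass of this kind might look as follows (a sketch assuming a pandas DataFrame with `time`, `open`, `high`, `low`, `close`, and `volume` columns; the column names and expected bar frequency are illustrative, not the production schema):

```python
import pandas as pd

def clean_ohlcv(df: pd.DataFrame, freq: str = "1h") -> pd.DataFrame:
    """Sketch of the cleaning pass: normalise timestamps, remove duplicates,
    detect gaps, and validate numeric sanity."""
    df = df.copy()
    # Timestamp normalisation: parse to UTC and enforce chronological order.
    df["time"] = pd.to_datetime(df["time"], utc=True)
    df = df.sort_values("time").drop_duplicates(subset="time", keep="first")
    # Gap detection: flag bars whose spacing exceeds the expected frequency.
    n_gaps = int((df["time"].diff() > pd.Timedelta(freq)).sum())
    if n_gaps:
        print(f"{n_gaps} gaps detected; review before use")
    # Numeric validation: positive prices, and high/low must bracket open/close.
    valid = (
        (df[["open", "high", "low", "close"]] > 0).all(axis=1)
        & (df["high"] >= df[["open", "close"]].max(axis=1))
        & (df["low"] <= df[["open", "close"]].min(axis=1))
        & (df["volume"] >= 0)
    )
    return df[valid].reset_index(drop=True)
```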
Traditional train/test splits or k-fold cross-validation randomly shuffle temporal data, allowing the model to “peek” at future information. Walk-forward validation respects the arrow of time: the model is always trained on past data and tested on strictly future, unseen data — exactly as it would operate in live trading.
The training set grows with each window (always starts from the beginning of the dataset), ensuring the model leverages all available historical data without ever seeing the test period. A minimum number of years of training data is required before the first test window begins, ensuring sufficient sample size relative to the feature space.
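A minimal sketch of such an anchored, expanding walk-forward splitter (the three-year minimum history and six-month test step are illustrative placeholders, not the production settings):

```python
import pandas as pd

def walk_forward_windows(index: pd.DatetimeIndex,
                         min_train_years: int = 3,
                         test_months: int = 6):
    """Yield (train_mask, test_mask) pairs. The training window is anchored
    at the start of the data and grows; the test window lies strictly after it."""
    test_start = index.min() + pd.DateOffset(years=min_train_years)
    step = pd.DateOffset(months=test_months)
    while test_start < index.max():
        test_end = test_start + step
        train_mask = index < test_start                    # all history up to the test window
        test_mask = (index >= test_start) & (index < test_end)
        if test_mask.any():
            yield train_mask, test_mask
        test_start = test_end                              # roll forward, never backward
```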
Within each walk-forward window, every data-dependent operation, including feature selection, model fitting, and decision-threshold calibration, is performed independently and uses only that window's training data.
The pipeline operates on thousands of events detected across over a decade of market data. This large sample size ensures that performance metrics are statistically meaningful rather than artefacts of small-sample noise. Each individual test window also contains enough trades for reliable per-window assessment.
The 2015–2025 validation period spans fundamentally different market environments: low-volatility pre-2020 markets, the COVID-19 crash and recovery, the 2022 inflation and interest-rate cycle, and the 2024–2025 gold rally. The model is evaluated across all of these regimes, ensuring it does not rely on a single market condition.
Data leakage occurs when information from the future inadvertently contaminates the training process. Even subtle forms of leakage — a feature computed from an unclosed candle, a cross-timeframe alignment error, or a global statistic computed before splitting — can produce dramatically inflated backtest results that collapse in live trading. Our pipeline implements multiple defensive layers.
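One of these layers can be shown directly. Taking feature scaling as an example of a global statistic, the defence is to fit it on the training window only and apply the frozen parameters to the test window (a scikit-learn sketch with stand-in data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.randn(500, 8)   # stand-ins for one window's feature matrices
X_test = np.random.randn(100, 8)

# Fit normalisation statistics on the training window only; the test window
# is transformed with those frozen statistics, so no future information
# leaks backwards into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # never scaler.fit(...) on test data
```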
Trade entries use prices available at the moment of the signal, never prices that would only be known after the fact. This is critical because using future price data for entry decisions would give the model access to information a real trader would not have at decision time.
Every feature in the system is computed using data with timestamps strictly before the event timestamp. For multi-timeframe features, dedicated offsets ensure that only completed bars from higher timeframes are used. For example, a 4-hour bar that opens at 12:00 does not close until 16:00 — it cannot be used for signals generated at 13:00.
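A sketch of this completed-bar rule, assuming a higher-timeframe DataFrame whose `time` column holds bar open times (names and the guard-free lookup are illustrative):

```python
import pandas as pd

def last_completed_bar(htf: pd.DataFrame, event_time: pd.Timestamp,
                       bar_hours: int = 4) -> pd.Series:
    """Return the most recent higher-timeframe bar that had fully closed
    by the event. A bar opening at t closes at t + bar_hours, so a 12:00
    4-hour bar is unavailable to any signal generated before 16:00."""
    close_times = htf["time"] + pd.Timedelta(hours=bar_hours)
    completed = htf[close_times <= event_time]
    return completed.iloc[-1]   # strictly past information only
```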
All reference levels are derived from fully completed periods. No intra-period data is used for level calculation, ensuring the model never has access to information from the current, still-forming period.
During development, several features were identified as mechanically correlated with the target variable — not because they captured genuine market dynamics, but because they encoded information about the trade outcome itself. These features were permanently removed from the pipeline after detection.
The pipeline enforces strict deduplication rules to ensure no market event generates multiple correlated training samples, which would artificially inflate apparent model performance.
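The actual deduplication rules are not disclosed; purely as an illustration, one simple rule of this kind keeps only the first event per instrument within a minimum time gap:

```python
import pandas as pd

def deduplicate_events(events: pd.DataFrame,
                       min_gap: pd.Timedelta = pd.Timedelta("4h")) -> pd.DataFrame:
    """Drop events that follow another event on the same instrument too
    closely, so one underlying market move cannot spawn several
    near-identical, highly correlated training samples."""
    events = events.sort_values(["symbol", "time"])
    gaps = events.groupby("symbol")["time"].diff()
    keep = gaps.isna() | (gaps >= min_gap)   # always keep the first event per symbol
    return events[keep]
```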
All models in the ensemble employ multiple regularisation techniques that constrain model complexity at various levels. These constraints prevent the models from memorising noise in the training data and force them to learn generalisable patterns.
From an initial candidate pool of dozens of features, only the top-ranked features are selected per window using a statistical relevance metric computed exclusively on training data. This prevents the model from finding spurious patterns in irrelevant noise variables.
The final prediction combines multiple fundamentally different algorithms — each with different inductive biases. This reduces the risk that any single model's overfitting drives the overall prediction. Ensemble combination smooths out model-specific noise.
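The document does not name the algorithms; the sketch below shows the general pattern with scikit-learn's soft-voting combination of three structurally different learners (the choice of learners and all hyperparameters are illustrative):

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Three learners with different inductive biases: a linear model, bagged
# trees, and boosted trees. Averaging their probabilities smooths out
# model-specific noise so no single model's overfitting dominates.
ensemble = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=300, max_depth=5)),
        ("boost", GradientBoostingClassifier(learning_rate=0.02, n_estimators=500)),
    ],
    voting="soft",   # average predicted probabilities rather than hard votes
)
# ensemble.fit(X_train, y_train); probs = ensemble.predict_proba(X_test)[:, 1]
```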
Models are configured to learn slowly and incrementally, building predictive power gradually rather than aggressively fitting to the training data. This approach reduces sensitivity to individual training examples and improves generalisation.
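Both ideas, complexity constraints and slow incremental learning, can be seen in a single configuration sketch (illustrative settings on scikit-learn's gradient boosting; the production models and hyperparameters are not disclosed):

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    learning_rate=0.02,    # learn slowly: each tree contributes only a little
    n_estimators=600,      # many small steps instead of a few aggressive ones
    max_depth=3,           # shallow trees limit interaction complexity
    min_samples_leaf=50,   # leaves must summarise many events, not memorise a few
    subsample=0.7,         # stochastic boosting: each tree sees a random subset
)
```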
The pipeline enforces a minimum amount of training data before any model is evaluated. This ensures a sufficient ratio of training samples to features, reducing the probability of finding spurious correlations that would not generalise.
Every feature must have a plausible economic or market-microstructure explanation for why it would predict the target. Features that show statistical significance without a logical mechanism are treated as suspicious and excluded.
The ensemble incorporates mechanisms to account for imbalanced target distributions. This prevents the models from defaulting to the majority class and forces them to learn genuine discriminative patterns for both outcomes.
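One standard mechanism of this kind is class weighting, sketched here with scikit-learn (an illustration, not necessarily the technique used):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 700 + [1] * 300)   # illustrative imbalanced labels
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
# Minority-class errors are up-weighted, so predicting the majority class
# everywhere is no longer the loss-minimising shortcut.
print(dict(zip([0, 1], weights)))           # {0: ~0.71, 1: ~1.67}
```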
The decision threshold is not optimised for raw accuracy — which would favour predicting the majority class. Instead, it is optimised for a risk-adjusted performance metric using only training data, aligning the threshold with real-world trading objectives.
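A sketch of this calibration step, using a simple Sharpe-like proxy as the risk-adjusted objective (the actual metric, threshold grid, and minimum-sample rule are not disclosed):

```python
import numpy as np

def select_threshold(train_probs: np.ndarray, train_pnl: np.ndarray) -> float:
    """Sweep candidate thresholds and keep the one maximising a Sharpe-like
    score of the trades that would have been taken. Computed on TRAINING
    data only, so the test window never influences the decision rule."""
    best_t, best_score = 0.5, -np.inf
    for t in np.arange(0.40, 0.80, 0.01):
        taken = train_pnl[train_probs >= t]
        if len(taken) < 30:                     # require a minimal sample per candidate
            continue
        score = taken.mean() / (taken.std() + 1e-9)
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```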
Feature selection is performed inside each walk-forward window, using only the training portion. A statistical method measures the relevance of each candidate feature with respect to the target variable. Only the top-ranking features are retained for that specific window.
This approach has two key benefits: (1) it prevents look-ahead bias that would occur if feature selection used the full dataset, and (2) it allows the feature set to adapt across market regimes, since different features may become relevant in different periods.
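The relevance metric itself is not named; mutual information is one common choice, and the per-window mechanics look like this (scikit-learn sketch; `k` is an illustrative cut-off):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_features(X_train: np.ndarray, y_train: np.ndarray, k: int = 15):
    """Rank candidate features by mutual information with the target, using
    training rows only, and return the indices of the top k. The same
    indices are then applied unchanged to that window's test split."""
    scores = mutual_info_classif(X_train, y_train, random_state=0)
    return np.argsort(scores)[::-1][:k]
```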
The candidate feature pool spans several broad categories of market information across multiple timeframes. Each feature must have a plausible economic rationale and is evaluated independently in every walk-forward window. The specific categories and formulas are proprietary.
All backtest results incorporate a realistic transaction-cost model.
Many academic and retail backtests ignore transaction costs entirely, producing unrealistically profitable results. Our cost model is applied on every single trade in every walk-forward window.
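The cost components are not enumerated in this document; an illustrative per-trade application might look as follows, with entry taken at the bar's open in line with check #2 below (all cost magnitudes are placeholders, not the system's calibration):

```python
def net_pnl(entry: float, exit_: float, direction: int,
            spread: float = 0.0002, commission: float = 0.00005,
            slippage: float = 0.0001) -> float:
    """Apply an illustrative per-trade cost stack to a gross return.
    `entry` is the open of the signal bar; `direction` is +1 long, -1 short."""
    gross = direction * (exit_ - entry) / entry
    costs = spread + 2 * commission + slippage   # round-trip commission
    return gross - costs
```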
Rather than using fixed position sizes, the system implements multiple layers of adaptive risk control. These layers reduce exposure during adverse conditions rather than filtering trades entirely, preserving sample size while limiting downside risk.
Position size scales with model conviction. Low-confidence predictions receive smaller allocations, while high-conviction signals receive larger ones — within predefined bounds.
Statistical analysis of historical performance by time period identifies windows with consistently weaker results. Exposure is reduced during these periods rather than eliminated.
When the equity curve experiences a drawdown beyond a predefined threshold, position sizes are automatically reduced to prevent cascading losses during unfavourable regimes.
A trailing window of recent trades is monitored for streaks. During cold streaks, position sizing is reduced; as performance recovers, sizing normalises gradually.
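Combining the four layers, an illustrative sizing function might look as follows (the multiplicative scheme, thresholds, and scaling factors are assumptions for the sketch, not the production calibration; each layer can only shrink exposure, never veto the trade):

```python
def position_size(base_risk: float, confidence: float,
                  drawdown: float, recent_win_rate: float) -> float:
    """Stack the adaptive layers multiplicatively on a base risk fraction."""
    size = base_risk
    size *= min(max(confidence, 0.25), 1.0)   # conviction scaling, floored at 25%
    if drawdown > 0.10:                       # circuit breaker beyond 10% drawdown
        size *= 0.5
    if recent_win_rate < 0.40:                # cold-streak throttle on recent trades
        size *= 0.75
    return size
```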
The following table summarises the systematic leakage checks performed on the pipeline. Each check verifies a specific aspect of temporal integrity.
| # | Check | Method | Status |
|---|---|---|---|
| 1 | Walk-forward temporal validation (no random splits) | Structural | ✔ PASS |
| 2 | Entry price uses Open (not Close) of signal bar | Code audit | ✔ PASS |
| 3 | All features use data with timestamp < event time (past only) | Code audit | ✔ PASS |
| 4 | Higher-timeframe features use completed-bar offsets | Code audit | ✔ PASS |
| 5 | Reference levels from completed periods only | Code audit | ✔ PASS |
| 6 | Feature selection computed per-window on training data only | Structural | ✔ PASS |
| 7 | Event deduplication prevents correlated sample inflation | Structural | ✔ PASS |
| 8 | Mechanically predictive features removed from pipeline | Manual review | ✔ PASS |
| 9 | Low-sample event categories removed (no statistical significance) | Statistical | ✔ PASS |
After every evaluation, an automated check compares key metrics against predefined plausibility thresholds. Results that exceed these thresholds are flagged as probable data leakage and trigger a mandatory review. These thresholds are calibrated based on academic literature and industry experience with quantitative trading systems.
| Metric | Assessment Method | Our Result |
|---|---|---|
| Classification Accuracy | Compared against academic benchmarks for financial prediction | ✔ Within realistic range |
| AUC-ROC | Compared against known bounds for genuine predictive edges | ✔ Within realistic range |
| Sharpe Ratio | Compared against achievable risk-adjusted returns per window | ✔ Within realistic range |
| Win Rate | Compared against industry norms for systematic strategies | ✔ Within realistic range |
All pipeline metrics fall within the “realistic” range. This is itself strong evidence against leakage: a leaked model would show dramatically higher numbers, at levels implausible for a genuine predictive edge in financial markets.
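A minimal sketch of the automated gate (the limit values shown are placeholders, not the calibrated thresholds):

```python
# Illustrative plausibility gate: metrics beyond these placeholder bounds
# are flagged as probable leakage and routed to mandatory review.
PLAUSIBILITY_LIMITS = {"accuracy": 0.65, "auc": 0.70, "sharpe": 3.0, "win_rate": 0.70}

def flag_implausible(metrics: dict) -> list:
    """Return the names of any metrics exceeding their plausibility limit."""
    return [name for name, value in metrics.items()
            if value > PLAUSIBILITY_LIMITS.get(name, float("inf"))]

suspicious = flag_implausible({"accuracy": 0.58, "auc": 0.61, "sharpe": 1.4})
assert not suspicious, f"probable leakage: {suspicious}"
```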
The system went through numerous iterations, each addressing specific issues discovered during validation. This iterative process itself demonstrates rigorous self-correction:
| Phase | Focus | Outcome |
|---|---|---|
| Early | Identified and eliminated major data leakage sources | Metrics dropped from “impossibly good” to realistic levels — confirming leakage was present and was successfully removed |
| Mid | Refined feature engineering and target definition | Sharpened the target definition, introduced the ensemble approach, and improved feature engineering; genuine AUC improvement observed |
| Late | Risk management and production readiness | Added adaptive position sizing, circuit breakers, and rolling performance monitors. Smoother equity curve profile |
| Current | Stability and robustness verification | All walk-forward windows validated. All plausibility checks passed. Production EA deployed |
The fact that removing leakage caused performance to decrease is itself a strong validation signal. Removing legitimate features should cost only a modest amount of measured performance; a drop from “impossibly good” to realistic levels indicates the removed features encoded the trade outcome rather than predicted it. Only in a leaked pipeline does fixing the methodology appear to hurt results.