The entire development pipeline is built on a single axiom: a model that looks too good on historical data is almost certainly wrong. In quantitative finance, genuine edges are small. Our methodology is designed to detect and eliminate any artificial inflation of performance metrics, accepting modest but real predictive power over spectacular but illusory results.
Success is judged on three criteria: consistency across market regimes, realistic out-of-sample metrics, and robustness under transaction costs. Every design decision prioritises real-world viability over backtest aesthetics.
Any result that exceeds established plausibility thresholds is automatically flagged for review. Features without economic rationale are excluded regardless of their in-sample predictive power.
The system follows an 8-phase sequential pipeline where each phase has strict data-access boundaries. Information flows forward only — no phase may access data generated by a subsequent phase.
All OHLCV data is sourced from institutional-grade feeds across multiple timeframes and subjected to a rigorous cleaning process: timestamp normalisation, duplicate removal, gap detection, and numeric validation. Poor data quality is a common but often overlooked source of false signals in quantitative systems — corrupted prices or misaligned timestamps can create phantom patterns that vanish in live trading.
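As an illustration, a minimal cleaning pass of this kind might look as follows (a sketch assuming a pandas DataFrame with `time`, `open`, `high`, `low`, `close`, and `volume` columns; the column names and expected bar frequency are illustrative, not the production schema):

```python
import pandas as pd

def clean_ohlcv(df: pd.DataFrame, freq: str = "1h") -> pd.DataFrame:
    """Sketch of the cleaning pass: normalise timestamps, remove duplicates,
    detect gaps, and validate numeric sanity."""
    df = df.copy()
    # Timestamp normalisation: parse to UTC and enforce chronological order.
    df["time"] = pd.to_datetime(df["time"], utc=True)
    df = df.sort_values("time").drop_duplicates(subset="time", keep="first")
    # Gap detection: flag bars whose spacing exceeds the expected frequency.
    n_gaps = int((df["time"].diff() > pd.Timedelta(freq)).sum())
    if n_gaps:
        print(f"{n_gaps} gaps detected; review before use")
    # Numeric validation: positive prices, and high/low must bracket open/close.
    valid = (
        (df[["open", "high", "low", "close"]] > 0).all(axis=1)
        & (df["high"] >= df[["open", "close"]].max(axis=1))
        & (df["low"] <= df[["open", "close"]].min(axis=1))
        & (df["volume"] >= 0)
    )
    return df[valid].reset_index(drop=True)
```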
Traditional train/test splits or k-fold cross-validation randomly shuffle temporal data, allowing the model to “peek” at future information. Walk-forward validation respects the arrow of time: the model is always trained on past data and tested on strictly future, unseen data — exactly as it would operate in live trading.
The training set grows with each window (always starts from the beginning of the dataset), ensuring the model leverages all available historical data without ever seeing the test period. A minimum number of years of training data is required before the first test window begins, ensuring sufficient sample size relative to the feature space.
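A minimal sketch of such an anchored, expanding walk-forward splitter (the three-year minimum history and six-month test step are illustrative placeholders, not the production settings):

```python
import pandas as pd

def walk_forward_windows(index: pd.DatetimeIndex,
                         min_train_years: int = 3,
                         test_months: int = 6):
    """Yield (train_mask, test_mask) pairs. The training window is anchored
    at the start of the data and grows; the test window lies strictly after it."""
    test_start = index.min() + pd.DateOffset(years=min_train_years)
    step = pd.DateOffset(months=test_months)
    while test_start < index.max():
        test_end = test_start + step
        train_mask = index < test_start                    # all history up to the test window
        test_mask = (index >= test_start) & (index < test_end)
        if test_mask.any():
            yield train_mask, test_mask
        test_start = test_end                              # roll forward, never backward
```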
Within each walk-forward window, every data-dependent operation, including feature selection, model fitting, and decision-threshold calibration, is performed independently and uses only that window's training data.
The pipeline operates on thousands of events detected across over a decade of market data. This large sample size ensures that performance metrics are statistically meaningful rather than artefacts of small-sample noise. Each individual test window also contains enough trades for reliable per-window assessment.
The 2015–2025 validation period spans fundamentally different market environments: low-volatility pre-2020 markets, the COVID-19 crash and recovery, the 2022 inflation and interest-rate cycle, and the 2024–2025 gold rally. The model is evaluated across all of these regimes, ensuring it does not rely on a single market condition.
Data leakage occurs when information from the future inadvertently contaminates the training process. Even subtle forms of leakage — a feature computed from an unclosed candle, a cross-timeframe alignment error, or a global statistic computed before splitting — can produce dramatically inflated backtest results that collapse in live trading. Our pipeline implements multiple defensive layers.
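One of these layers can be shown directly. Taking feature scaling as an example of a global statistic, the defence is to fit it on the training window only and apply the frozen parameters to the test window (a scikit-learn sketch with stand-in data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.randn(500, 8)   # stand-ins for one window's feature matrices
X_test = np.random.randn(100, 8)

# Fit normalisation statistics on the training window only; the test window
# is transformed with those frozen statistics, so no future information
# leaks backwards into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # never scaler.fit(...) on test data
```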
Trade entries use prices available at the moment of the signal, never prices that would only be known after the fact. This is critical because using future price data for entry decisions would give the model access to information a real trader would not have at decision time.
Every feature in the system is computed using data with timestamps strictly before the event timestamp. For multi-timeframe features, dedicated offsets ensure that only completed bars from higher timeframes are used. For example, a 4-hour bar that opens at 12:00 does not close until 16:00 — it cannot be used for signals generated at 13:00.
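A sketch of this completed-bar rule, assuming a higher-timeframe DataFrame whose `time` column holds bar open times (names and the guard-free lookup are illustrative):

```python
import pandas as pd

def last_completed_bar(htf: pd.DataFrame, event_time: pd.Timestamp,
                       bar_hours: int = 4) -> pd.Series:
    """Return the most recent higher-timeframe bar that had fully closed
    by the event. A bar opening at t closes at t + bar_hours, so a 12:00
    4-hour bar is unavailable to any signal generated before 16:00."""
    close_times = htf["time"] + pd.Timedelta(hours=bar_hours)
    completed = htf[close_times <= event_time]
    return completed.iloc[-1]   # strictly past information only
```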
All reference levels are derived from fully completed periods. No intra-period data is used for level calculation, ensuring the model never has access to information from the current, still-forming period.
During development, several features were identified as mechanically correlated with the target variable — not because they captured genuine market dynamics, but because they encoded information about the trade outcome itself. These features were permanently removed from the pipeline after detection.
The pipeline enforces strict deduplication rules to ensure no market event generates multiple correlated training samples, which would artificially inflate apparent model performance.
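The actual deduplication rules are not disclosed; purely as an illustration, one simple rule of this kind keeps only the first event per instrument within a minimum time gap:

```python
import pandas as pd

def deduplicate_events(events: pd.DataFrame,
                       min_gap: pd.Timedelta = pd.Timedelta("4h")) -> pd.DataFrame:
    """Drop events that follow another event on the same instrument too
    closely, so one underlying market move cannot spawn several
    near-identical, highly correlated training samples."""
    events = events.sort_values(["symbol", "time"])
    gaps = events.groupby("symbol")["time"].diff()
    keep = gaps.isna() | (gaps >= min_gap)   # always keep the first event per symbol
    return events[keep]
```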
All models in the ensemble employ multiple regularisation techniques that constrain model complexity at various levels. These constraints prevent the models from memorising noise in the training data and force them to learn generalisable patterns.
From an initial candidate pool of dozens of features, only the top-ranked features are selected per window using a statistical relevance metric computed exclusively on training data. This prevents the model from finding spurious patterns in irrelevant noise variables.
The final prediction combines multiple fundamentally different algorithms — each with different inductive biases. This reduces the risk that any single model's overfitting drives the overall prediction. Ensemble combination smooths out model-specific noise.
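The document does not name the algorithms; the sketch below shows the general pattern with scikit-learn's soft-voting combination of three structurally different learners (the choice of learners and all hyperparameters are illustrative):

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Three learners with different inductive biases: a linear model, bagged
# trees, and boosted trees. Averaging their probabilities smooths out
# model-specific noise so no single model's overfitting dominates.
ensemble = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=300, max_depth=5)),
        ("boost", GradientBoostingClassifier(learning_rate=0.02, n_estimators=500)),
    ],
    voting="soft",   # average predicted probabilities rather than hard votes
)
# ensemble.fit(X_train, y_train); probs = ensemble.predict_proba(X_test)[:, 1]
```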
Models are configured to learn slowly and incrementally, building predictive power gradually rather than aggressively fitting to the training data. This approach reduces sensitivity to individual training examples and improves generalisation.
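Both ideas, complexity constraints and slow incremental learning, can be seen in a single configuration sketch (illustrative settings on scikit-learn's gradient boosting; the production models and hyperparameters are not disclosed):

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    learning_rate=0.02,    # learn slowly: each tree contributes only a little
    n_estimators=600,      # many small steps instead of a few aggressive ones
    max_depth=3,           # shallow trees limit interaction complexity
    min_samples_leaf=50,   # leaves must summarise many events, not memorise a few
    subsample=0.7,         # stochastic boosting: each tree sees a random subset
)
```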
The pipeline enforces a minimum amount of training data before any model is evaluated. This ensures a sufficient ratio of training samples to features, reducing the probability of finding spurious correlations that would not generalise.
Every feature must have a plausible economic or market-microstructure explanation for why it would predict the target. Features that show statistical significance without a logical mechanism are treated as suspicious and excluded.
The ensemble incorporates mechanisms to account for imbalanced target distributions. This prevents the models from defaulting to the majority class and forces them to learn genuine discriminative patterns for both outcomes.
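One standard mechanism of this kind is class weighting, sketched here with scikit-learn (an illustration, not necessarily the technique used):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 700 + [1] * 300)   # illustrative imbalanced labels
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
# Minority-class errors are up-weighted, so predicting the majority class
# everywhere is no longer the loss-minimising shortcut.
print(dict(zip([0, 1], weights)))           # {0: ~0.71, 1: ~1.67}
```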
The decision threshold is not optimised for raw accuracy — which would favour predicting the majority class. Instead, it is optimised for a risk-adjusted performance metric using only training data, aligning the threshold with real-world trading objectives.
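A sketch of this calibration step, using a simple Sharpe-like proxy as the risk-adjusted objective (the actual metric, threshold grid, and minimum-sample rule are not disclosed):

```python
import numpy as np

def select_threshold(train_probs: np.ndarray, train_pnl: np.ndarray) -> float:
    """Sweep candidate thresholds and keep the one maximising a Sharpe-like
    score of the trades that would have been taken. Computed on TRAINING
    data only, so the test window never influences the decision rule."""
    best_t, best_score = 0.5, -np.inf
    for t in np.arange(0.40, 0.80, 0.01):
        taken = train_pnl[train_probs >= t]
        if len(taken) < 30:                     # require a minimal sample per candidate
            continue
        score = taken.mean() / (taken.std() + 1e-9)
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```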
Feature selection is performed inside each walk-forward window, using only the training portion. A statistical method measures the relevance of each candidate feature with respect to the target variable. Only the top-ranking features are retained for that specific window.
This approach has two key benefits: (1) it prevents look-ahead bias that would occur if feature selection used the full dataset, and (2) it allows the feature set to adapt across market regimes, since different features may become relevant in different periods.
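The relevance metric itself is not named; mutual information is one common choice, and the per-window mechanics look like this (scikit-learn sketch; `k` is an illustrative cut-off):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_features(X_train: np.ndarray, y_train: np.ndarray, k: int = 15):
    """Rank candidate features by mutual information with the target, using
    training rows only, and return the indices of the top k. The same
    indices are then applied unchanged to that window's test split."""
    scores = mutual_info_classif(X_train, y_train, random_state=0)
    return np.argsort(scores)[::-1][:k]
```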
The candidate feature pool spans several broad categories of market information across multiple timeframes. Each feature must have a plausible economic rationale and is evaluated independently in every walk-forward window. The specific categories and formulas are proprietary.
All backtest results incorporate a realistic transaction-cost model.
Many academic and retail backtests ignore transaction costs entirely, producing unrealistically profitable results. Our cost model is applied on every single trade in every walk-forward window.
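The cost components are not enumerated in this document; an illustrative per-trade application might look as follows, with entry taken at the bar's open in line with check #2 below (all cost magnitudes are placeholders, not the system's calibration):

```python
def net_pnl(entry: float, exit_: float, direction: int,
            spread: float = 0.0002, commission: float = 0.00005,
            slippage: float = 0.0001) -> float:
    """Apply an illustrative per-trade cost stack to a gross return.
    `entry` is the open of the signal bar; `direction` is +1 long, -1 short."""
    gross = direction * (exit_ - entry) / entry
    costs = spread + 2 * commission + slippage   # round-trip commission
    return gross - costs
```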
Rather than using fixed position sizes, the system implements multiple layers of adaptive risk control. These layers reduce exposure during adverse conditions rather than filtering trades entirely, preserving sample size while limiting downside risk.
Position size scales with model conviction. Low-confidence predictions receive smaller allocations, while high-conviction signals receive larger ones — within predefined bounds.
Statistical analysis of historical performance by time period identifies windows with consistently weaker results. Exposure is reduced during these periods rather than eliminated.
When the equity curve experiences a drawdown beyond a predefined threshold, position sizes are automatically reduced to prevent cascading losses during unfavourable regimes.
A trailing window of recent trades is monitored for streaks. During cold streaks, position sizing is reduced; as performance recovers, sizing normalises gradually.
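Combining the four layers, an illustrative sizing function might look as follows (the multiplicative scheme, thresholds, and scaling factors are assumptions for the sketch, not the production calibration; each layer can only shrink exposure, never veto the trade):

```python
def position_size(base_risk: float, confidence: float,
                  drawdown: float, recent_win_rate: float) -> float:
    """Stack the adaptive layers multiplicatively on a base risk fraction."""
    size = base_risk
    size *= min(max(confidence, 0.25), 1.0)   # conviction scaling, floored at 25%
    if drawdown > 0.10:                       # circuit breaker beyond 10% drawdown
        size *= 0.5
    if recent_win_rate < 0.40:                # cold-streak throttle on recent trades
        size *= 0.75
    return size
```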
The following table summarises the systematic leakage checks performed on the pipeline. Each check verifies a specific aspect of temporal integrity.
| # | Check | Method | Status |
|---|---|---|---|
| 1 | Walk-forward temporal validation (no random splits) | Structural | ✔ PASS |
| 2 | Entry price uses Open (not Close) of signal bar | Code audit | ✔ PASS |
| 3 | All features use data with timestamp < event time (past only) | Code audit | ✔ PASS |
| 4 | Higher-timeframe features use completed-bar offsets | Code audit | ✔ PASS |
| 5 | Reference levels from completed periods only | Code audit | ✔ PASS |
| 6 | Feature selection computed per-window on training data only | Structural | ✔ PASS |
| 7 | Event deduplication prevents correlated sample inflation | Structural | ✔ PASS |
| 8 | Mechanically predictive features removed from pipeline | Manual review | ✔ PASS |
| 9 | Low-sample event categories removed (no statistical significance) | Statistical | ✔ PASS |
After every evaluation, an automated check compares key metrics against predefined plausibility thresholds. Results that exceed these thresholds are flagged as probable data leakage and trigger a mandatory review. These thresholds are calibrated based on academic literature and industry experience with quantitative trading systems.
| Metric | Assessment Method | Our Result |
|---|---|---|
| Classification Accuracy | Compared against academic benchmarks for financial prediction | ✔ Within realistic range |
| AUC-ROC | Compared against known bounds for genuine predictive edges | ✔ Within realistic range |
| Sharpe Ratio | Compared against achievable risk-adjusted returns per window | ✔ Within realistic range |
| Win Rate | Compared against industry norms for systematic strategies | ✔ Within realistic range |
All pipeline metrics fall within the “realistic” range. This is itself strong evidence against leakage: a leaked model would show dramatically higher numbers, at levels implausible for a genuine predictive edge in financial markets.
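A minimal sketch of the automated gate (the limit values shown are placeholders, not the calibrated thresholds):

```python
# Illustrative plausibility gate: metrics beyond these placeholder bounds
# are flagged as probable leakage and routed to mandatory review.
PLAUSIBILITY_LIMITS = {"accuracy": 0.65, "auc": 0.70, "sharpe": 3.0, "win_rate": 0.70}

def flag_implausible(metrics: dict) -> list:
    """Return the names of any metrics exceeding their plausibility limit."""
    return [name for name, value in metrics.items()
            if value > PLAUSIBILITY_LIMITS.get(name, float("inf"))]

suspicious = flag_implausible({"accuracy": 0.58, "auc": 0.61, "sharpe": 1.4})
assert not suspicious, f"probable leakage: {suspicious}"
```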
The system went through numerous iterations, each addressing specific issues discovered during validation. This iterative process itself demonstrates rigorous self-correction:
| Phase | Focus | Outcome |
|---|---|---|
| Early | Identified and eliminated major data leakage sources | Metrics dropped from “impossibly good” to realistic levels — confirming leakage was present and was successfully removed |
| Mid | Refined feature engineering and target definition | Sharpened the target definition, introduced the ensemble approach, and improved feature engineering; genuine AUC improvement observed |
| Late | Risk management and production readiness | Added adaptive position sizing, circuit breakers, and rolling performance monitors. Smoother equity curve profile |
| Current | Stability and robustness verification | All walk-forward windows validated. All plausibility checks passed. Production EA deployed |
The fact that removing leakage caused performance to decrease is itself a strong validation signal. Removing legitimate features should cost only a modest amount of measured performance; a drop from “impossibly good” to realistic levels indicates the removed features encoded the trade outcome rather than predicted it. Only in a leaked pipeline does fixing the methodology appear to hurt results.