Validation & Anti-Overfitting Methodology

Karat Killer — Robustness Verification Report
February 2026 · Confidential
Asset: XAUUSD
Data Period: 10+ Years
Validation: Walk-Forward
ML Models: Multi-Model Ensemble
Leakage Tests: All Passed

Executive Summary

This document describes the validation methodology employed during the development of the Sneak Peak Expert Advisor. It details the systematic measures taken to prevent data leakage, avoid overfitting, and ensure that the model's predictive edge — however modest — reflects genuine market patterns rather than statistical artifacts. No proprietary parameters, feature formulas, or trading rules are disclosed.

1 Validation Philosophy

Core Principle

The entire development pipeline is built on a single axiom: a model that looks too good on historical data is almost certainly wrong. In quantitative finance, genuine edges are small. Our methodology is designed to detect and eliminate any artificial inflation of performance metrics, accepting modest but real predictive power over spectacular but illusory results.

What We Optimise For

Consistency across market regimes, realistic out-of-sample metrics, and robustness under transaction costs. Every design decision prioritises real-world viability over backtest aesthetics.

What We Reject

Any result that exceeds established plausibility thresholds is automatically flagged for review. Features without economic rationale are excluded regardless of their in-sample predictive power.

2 Pipeline Architecture

The system follows an 8-phase sequential pipeline where each phase has strict data-access boundaries. Information flows forward only — no phase may access data generated by a subsequent phase.

Data Quality Assurance

All OHLCV data is sourced from institutional-grade feeds across multiple timeframes and subjected to a rigorous cleaning process: timestamp normalisation, duplicate removal, gap detection, and numeric validation. Poor data quality is a common but often overlooked source of false signals in quantitative systems — corrupted prices or misaligned timestamps can create phantom patterns that vanish in live trading.
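To make the cleaning steps concrete, the sketch below applies the checks named above (timestamp normalisation, duplicate removal, gap detection, numeric validation) to a generic OHLCV frame. The column names, the 1-hour bar interval, and the pandas implementation are illustrative assumptions, not the pipeline's actual code.

```python
# Illustrative data-quality pass over raw OHLCV bars (assumed columns:
# time, open, high, low, close, volume; assumed 1-hour bars).
import pandas as pd

def clean_ohlcv(df: pd.DataFrame, bar_interval: str = "1h") -> pd.DataFrame:
    df = df.copy()

    # Timestamp normalisation: force UTC, sort chronologically, drop duplicates.
    df["time"] = pd.to_datetime(df["time"], utc=True)
    df = df.sort_values("time").drop_duplicates(subset="time", keep="first")

    # Numeric validation: discard bars with non-positive prices or inverted ranges.
    valid = (df[["open", "high", "low", "close"]] > 0).all(axis=1)
    valid &= df["high"] >= df[["open", "close", "low"]].max(axis=1)
    valid &= df["low"] <= df[["open", "close", "high"]].min(axis=1)
    df = df[valid]

    # Gap detection: flag missing bars rather than silently interpolating them.
    expected = pd.date_range(df["time"].iloc[0], df["time"].iloc[-1], freq=bar_interval)
    missing = expected.difference(pd.DatetimeIndex(df["time"]))
    if len(missing) > 0:
        print(f"WARNING: {len(missing)} missing bars detected; review before use")

    return df.set_index("time")
```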

1. Multi-Timeframe Data Loading — Multiple timeframes ingested, cleaned, deduplicated
2. Event Detection — Historical trading events identified using only completed bars
3. Feature Engineering — Multi-timeframe features computed from strictly past data
4. Data Preparation — NaN handling, zero-variance removal, temporal indexing
5. Walk-Forward Validation — Multiple expanding windows, per-window feature selection & training
6. Realistic Backtesting — Transaction costs, position sizing, risk management layers
7. Model Export — Conversion to optimised format for production deployment
8. Report Generation — Automated metrics and audit trail

3 Walk-Forward Validation

Why Walk-Forward?

Traditional random train/test splits and standard k-fold cross-validation shuffle temporal data, allowing the model to “peek” at future information. Walk-forward validation respects the arrow of time: the model is always trained on past data and tested on strictly future, unseen data — exactly as it would operate in live trading.

Windows: Multiple, non-overlapping test sets
Test Period: Months per validation window
Min Training: Years of history required before the expanding window's first test period

Expanding Window Design

The training set grows with each window (always starts from the beginning of the dataset), ensuring the model leverages all available historical data without ever seeing the test period. A minimum number of years of training data is required before the first test window begins, ensuring sufficient sample size relative to the feature space.

```
Window 1: [====== TRAIN ======][TEST]
Window 2: [========= TRAIN =========][TEST]
Window 3: [============ TRAIN ============][TEST]
   ...                   ...
Window N: [========================= TRAIN =========================][TEST]
```

Each [TEST] is a period of strictly unseen future data. No random shuffling. No overlap. No future peeking.
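A minimal sketch of the expanding-window splitter described above. The real number of windows, test length, and minimum training span are not disclosed; the three-year and six-month values below are placeholders.

```python
# Hedged sketch of an expanding-window walk-forward splitter.
import pandas as pd

def walk_forward_windows(index: pd.DatetimeIndex,
                         min_train_years: int = 3,    # placeholder, not the real value
                         test_months: int = 6):       # placeholder, not the real value
    """Yield (train_idx, test_idx) pairs. Training always starts at the first
    timestamp and grows; each test block is strictly later and non-overlapping."""
    test_start = index[0] + pd.DateOffset(years=min_train_years)
    while test_start < index[-1]:
        test_end = test_start + pd.DateOffset(months=test_months)
        train_idx = index[index < test_start]
        test_idx = index[(index >= test_start) & (index < test_end)]
        if len(test_idx) > 0:
            yield train_idx, test_idx
        test_start = test_end  # next window: training expands, tests never overlap
```

Because the training set always begins at the first timestamp, later windows see strictly more history, mirroring how the model would be retrained in production.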

Per-Window Isolation

Within each walk-forward window, feature selection, model training, and decision-threshold optimisation are each performed independently, using only that window's training data, as sketched below.
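A hedged sketch of that isolation, reusing the walk_forward_windows generator from the previous sketch. X and y stand for the assumed feature matrix and target series, while select_features, fit_ensemble, optimise_threshold, and evaluate are hypothetical names for the proprietary per-window steps.

```python
# Per-window isolation: every data-dependent decision is made from the
# training slice; the test slice is used once, for evaluation only.
for train_idx, test_idx in walk_forward_windows(X.index):
    X_train, y_train = X.loc[train_idx], y.loc[train_idx]
    X_test, y_test = X.loc[test_idx], y.loc[test_idx]

    selected = select_features(X_train, y_train)              # training data only
    model = fit_ensemble(X_train[selected], y_train)          # training data only
    threshold = optimise_threshold(model, X_train[selected], y_train)

    signals = model.predict_proba(X_test[selected])[:, 1] >= threshold
    evaluate(signals, y_test)                                 # out-of-sample metrics
```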

Statistical Significance Through Sample Size

The pipeline operates on thousands of events detected across over a decade of market data. This large sample size ensures that performance metrics are statistically meaningful rather than artefacts of small-sample noise. Each individual test window also contains enough trades for reliable per-window assessment.

Cross-Regime Robustness

The 2015–2025 validation period spans fundamentally different market environments: low-volatility pre-2020 markets, the COVID-19 crash and recovery, the 2022 inflation and interest-rate cycle, and the 2024–2025 gold rally. The model is evaluated across all of these regimes, ensuring it does not rely on a single market condition.

4 Data Leakage Prevention

What is Data Leakage?

Data leakage occurs when information from the future inadvertently contaminates the training process. Even subtle forms of leakage — a feature computed from an unclosed candle, a cross-timeframe alignment error, or a global statistic computed before splitting — can produce dramatically inflated backtest results that collapse in live trading. Our pipeline implements multiple defensive layers.

4.1 — Temporal Integrity of Entry Points

Trade entries use prices available at the moment of the signal, never prices that would only be known after the fact. This is critical because using future price data for entry decisions would give the model access to information a real trader would not have at decision time.

4.2 — Past-Only Feature Computation

Every feature in the system is computed using data with timestamps strictly before the event timestamp. For multi-timeframe features, dedicated offsets ensure that only completed bars from higher timeframes are used. For example, a 4-hour bar that opens at 12:00 does not close until 16:00 — it cannot be used for signals generated at 13:00.
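The sketch below shows one way to enforce that rule with a one-bar shift on resampled higher-timeframe data; the 5-minute signal timeframe, the H4 close feature, and the column names are assumptions for illustration.

```python
# Completed-bar rule for higher-timeframe features: shift by one resampled bar
# so a signal can only see H4 bars that have already closed.
import pandas as pd

def h4_close_completed_only(m5: pd.DataFrame) -> pd.Series:
    # m5 is assumed to be 5-minute bars indexed by a DatetimeIndex.
    h4 = m5["close"].resample("4h").last()   # value known only when the H4 bar closes
    h4_completed = h4.shift(1)               # at 13:00 this is the bar that closed at 12:00
    # Project back onto the signal timeframe without ever looking forward.
    return h4_completed.reindex(m5.index, method="ffill").rename("h4_close_completed")
```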

4.3 — Level Calculation from Completed Periods

All reference levels are derived from fully completed periods. No intra-period data is used for level calculation, ensuring the model never has access to information from the current, still-forming period.

4.4 — Removal of Mechanically Predictive Features

During development, several features were identified as mechanically correlated with the target variable — not because they captured genuine market dynamics, but because they encoded information about the trade outcome itself. These features were permanently removed from the pipeline after detection.

4.5 — Event Deduplication

The pipeline enforces strict deduplication rules to ensure no market event generates multiple correlated training samples, which would artificially inflate apparent model performance.
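A minimal sketch of this rule: repeated detections of the same underlying event collapse into a single training sample. The event key columns used here (bar timestamp and direction) are illustrative assumptions.

```python
# Event deduplication: keep only the first detection per (timestamp, direction) key
# so one market event cannot contribute multiple correlated training samples.
import pandas as pd

def deduplicate_events(events: pd.DataFrame) -> pd.DataFrame:
    return (events.sort_values("time")
                  .drop_duplicates(subset=["time", "direction"], keep="first"))
```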

5 Anti-Overfitting Measures

Model Regularisation

All models in the ensemble employ multiple regularisation techniques that constrain model complexity at various levels. These constraints prevent the models from memorising noise in the training data and force them to learn generalisable patterns.

Feature Space Control

From an initial candidate pool of dozens of features, only the top-ranked features are selected per window using a statistical relevance metric computed exclusively on training data. This prevents the model from finding spurious patterns in irrelevant noise variables.

Multi-Model Ensemble

The final prediction combines multiple fundamentally different algorithms — each with different inductive biases. This reduces the risk that any single model's overfitting drives the overall prediction. Ensemble combination smooths out model-specific noise.

Gradual Learning Approach

Models are configured to learn slowly and incrementally, building predictive power gradually rather than aggressively fitting to the training data. This approach reduces sensitivity to individual training examples and improves generalisation.

Minimum Training Data Requirements

The pipeline enforces a minimum amount of training data before any model is evaluated. This ensures a sufficient ratio of training samples to features, reducing the probability of finding spurious correlations that would not generalise.

Economic Rationale Filter

Every feature must have a plausible economic or market-microstructure explanation for why it would predict the target. Features that show statistical significance without a logical mechanism are treated as suspicious and excluded.

Class Imbalance Handling

The ensemble incorporates mechanisms to account for imbalanced target distributions. This prevents the models from defaulting to the majority class and forces them to learn genuine discriminative patterns for both outcomes.
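For illustration only, the snippet below shows how regularisation, gradual learning, and class weighting can be combined in a single scikit-learn classifier. The actual ensemble members, libraries, and hyperparameters are proprietary and are not implied here.

```python
# One plausible configuration combining the anti-overfitting measures above
# (values are placeholders, not the EA's calibrated settings).
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    learning_rate=0.05,        # gradual learning: small steps per iteration
    max_depth=3,               # complexity constraint on each tree
    l2_regularization=1.0,     # explicit regularisation
    max_iter=300,
    early_stopping=True,       # stop before the model starts memorising noise
    class_weight="balanced",   # class imbalance handling
)
```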

Risk-Adjusted Threshold Optimisation

The decision threshold is not optimised for raw accuracy — which would favour predicting the majority class. Instead, it is optimised for a risk-adjusted performance metric using only training data, aligning the threshold with real-world trading objectives.
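A sketch of threshold selection driven by a risk-adjusted objective rather than accuracy. The expectancy-style score, the candidate threshold grid, and the minimum trade count are placeholder assumptions; the real objective function is not disclosed.

```python
# Choose the decision threshold on TRAINING predictions only, maximising a
# simple per-trade expectancy instead of raw classification accuracy.
import numpy as np

def choose_threshold(train_probs: np.ndarray, train_outcomes: np.ndarray,
                     reward: float = 1.0, risk: float = 1.0) -> float:
    best_t, best_score = 0.5, -np.inf
    for t in np.arange(0.50, 0.91, 0.01):
        taken = train_probs >= t
        if taken.sum() < 30:                      # require a minimum sample of trades
            continue
        win_rate = (train_outcomes[taken] == 1).mean()
        expectancy = win_rate * reward - (1.0 - win_rate) * risk
        if expectancy > best_score:
            best_t, best_score = t, expectancy
    return best_t
```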

6 Feature Selection Methodology

Per-Window Statistical Selection

Feature selection is performed inside each walk-forward window, using only the training portion. A statistical method measures the relevance of each candidate feature with respect to the target variable. Only the top-ranking features are retained for that specific window.

This approach has two key benefits: (1) it prevents look-ahead bias that would occur if feature selection used the full dataset, and (2) it allows the feature set to adapt across market regimes, since different features may become relevant in different periods.
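As one concrete possibility for the hypothetical select_features step sketched in Section 3, mutual information can serve as the per-window relevance metric. The metric actually used, and the number of features retained, are not disclosed.

```python
# Rank candidate features by mutual information with the target, using only the
# current window's training slice, and keep the top k (k is a placeholder).
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def select_top_features(X_train: pd.DataFrame, y_train: pd.Series, k: int = 20) -> list:
    scores = mutual_info_classif(X_train, y_train, random_state=0)
    ranking = pd.Series(scores, index=X_train.columns).sort_values(ascending=False)
    return ranking.head(k).index.tolist()
```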

Feature Diversity

The candidate feature pool spans several broad categories of market information across multiple timeframes. Each feature must have a plausible economic rationale and is evaluated independently in every walk-forward window. The specific categories and formulas are proprietary.

7 Realistic Cost Model

Transaction Cost Inclusion

All backtest results include a realistic transaction cost model.

Many academic and retail backtests ignore transaction costs entirely, producing unrealistically profitable results. Our cost model is applied on every single trade in every walk-forward window.
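A minimal sketch of per-trade cost deduction. The document does not disclose the cost components or their magnitudes; the spread, commission, and slippage figures below are purely illustrative.

```python
# Deduct assumed round-trip costs (expressed in price points) from each trade's
# gross result before any performance metric is computed.
def net_trade_points(gross_points: float,
                     spread_points: float = 0.30,      # placeholder values
                     commission_points: float = 0.07,
                     slippage_points: float = 0.10) -> float:
    return gross_points - (spread_points + commission_points + slippage_points)
```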

8 Adaptive Risk Management

Multi-Layer Position Sizing

Rather than using fixed position sizes, the system implements multiple layers of adaptive risk control. These layers reduce exposure during adverse conditions rather than filtering trades entirely, preserving sample size while limiting downside risk.

Confidence-Based Sizing

Position size scales with model conviction. Low-confidence predictions receive smaller allocations, while high-conviction signals receive larger ones — within predefined bounds.

Temporal Filters

Statistical analysis of historical performance by time period identifies windows with consistently weaker results. Exposure is reduced during these periods rather than eliminated.

Circuit Breaker

When the equity curve experiences a drawdown beyond a predefined threshold, position sizes are automatically reduced to prevent cascading losses during unfavourable regimes.

Rolling Performance Monitor

A trailing window of recent trades is monitored for streaks. During cold streaks, position sizing is reduced; as performance recovers, sizing normalises gradually.
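The sketch below composes the risk layers described in this section into one sizing function. Every multiplier, bound, and threshold is a placeholder assumption rather than a calibrated value from the EA.

```python
# Illustrative multi-layer position sizing: confidence scaling, a drawdown
# circuit breaker, and a rolling-performance reduction applied in sequence.
def position_size(base_lots: float,
                  model_confidence: float,   # assumed calibrated probability, 0..1
                  current_drawdown: float,   # current equity drawdown, 0..1
                  recent_win_rate: float) -> float:
    size = base_lots

    # Confidence-based sizing within predefined bounds.
    size *= min(max(model_confidence, 0.5), 0.9) / 0.7

    # Circuit breaker: cut exposure beyond a drawdown threshold.
    if current_drawdown > 0.10:
        size *= 0.5

    # Rolling performance monitor: scale down during cold streaks.
    if recent_win_rate < 0.40:
        size *= 0.7

    return round(size, 2)
```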

9 Leakage Audit Results

The following table summarises the systematic leakage checks performed on the pipeline. Each check verifies a specific aspect of temporal integrity.

| # | Check | Method | Status |
|---|-------|--------|--------|
| 1 | Walk-forward temporal validation (no random splits) | Structural | ✔ PASS |
| 2 | Entry price uses Open (not Close) of signal bar | Code audit | ✔ PASS |
| 3 | All features use data with timestamp < event time (past only) | Code audit | ✔ PASS |
| 4 | Higher-timeframe features use completed-bar offsets | Code audit | ✔ PASS |
| 5 | Reference levels from completed periods only | Code audit | ✔ PASS |
| 6 | Feature selection computed per-window on training data only | Structural | ✔ PASS |
| 7 | Event deduplication prevents correlated sample inflation | Structural | ✔ PASS |
| 8 | Mechanically predictive features removed from pipeline | Manual review | ✔ PASS |
| 9 | Low-sample event categories removed (no statistical significance) | Statistical | ✔ PASS |

10 Plausibility Thresholds

Automated Anomaly Detection

After every evaluation, an automated check compares key metrics against predefined plausibility thresholds. Results that exceed these thresholds are flagged as probable data leakage and trigger a mandatory review. These thresholds are calibrated based on academic literature and industry experience with quantitative trading systems.
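A sketch of the automated check. The upper bounds below are illustrative orders of magnitude, not the thresholds actually used by the pipeline.

```python
# Flag any metric that exceeds its plausibility bound; a flag triggers a
# mandatory leakage review before the run can be accepted.
PLAUSIBILITY_BOUNDS = {          # placeholder bounds, for illustration only
    "accuracy": 0.70,
    "auc_roc": 0.75,
    "sharpe_ratio": 3.0,
    "win_rate": 0.70,
}

def flag_implausible(metrics: dict) -> list:
    return [name for name, value in metrics.items()
            if name in PLAUSIBILITY_BOUNDS and value > PLAUSIBILITY_BOUNDS[name]]
```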

| Metric | Assessment | Our Result |
|--------|------------|------------|
| Classification Accuracy | Compared against academic benchmarks for financial prediction | ✔ Within realistic range |
| AUC-ROC | Compared against known bounds for genuine predictive edges | ✔ Within realistic range |
| Sharpe Ratio | Compared against achievable risk-adjusted returns per window | ✔ Within realistic range |
| Win Rate | Compared against industry norms for systematic strategies | ✔ Within realistic range |

All pipeline metrics fall within the “realistic” range. This is itself strong evidence against leakage: a leaked model would show dramatically higher numbers, at levels that genuine predictive power in financial markets cannot sustain.

11 Iterative Development & Version History

The system went through numerous iterations, each addressing specific issues discovered during validation. This iterative process itself demonstrates rigorous self-correction:

| Phase | Focus | Outcome |
|-------|-------|---------|
| Early | Identified and eliminated major data leakage sources | Metrics dropped from “impossibly good” to realistic levels, confirming that leakage was present and was successfully removed |
| Mid | Refined feature engineering and target definition | Introduced the ensemble approach and sharpened the target definition and features; genuine AUC improvement observed |
| Late | Risk management and production readiness | Added adaptive position sizing, circuit breakers, and rolling performance monitors; smoother equity curve profile |
| Current | Stability and robustness verification | All walk-forward windows validated; all plausibility checks passed; production EA deployed |

The fact that removing leakage caused measured performance to decrease is itself a validation signal: in a clean pipeline, dropping features should not collapse implausibly good metrics to realistic ones. Only when the removed features encoded future information does “fixing” the pipeline appear to hurt performance.

12 Summary of Evidence

Why We Believe This Is Not Overfitted