I audited my own "validated" backtest and found the Sharpe I'd been quoting was wrong by 7x. Here's the full teardown.
Six years of QQQ opening-range-breakout data, 112 raw trades, a filter waterfall, a loss autopsy, and a stress test aimed at the exact failure mode that gets backtests torn apart here. Posting the whole thing because I'd rather get this checked before real money touches it than after.
Setup: Solo build, systematic ORB on QQQ/NQ, no ML, deterministic rules only (regime gate, day-of-week filter, signal grade, opening range breakout). Going live on a funded futures account shortly, which is why I spent this weekend trying to break my own numbers before someone else did it for me.
The Sharpe was wrong
Original claim: 3.50 Sharpe. Sounded great. Turned out the annualization method was undocumented and effectively assumed daily trading frequency on a system that fires roughly 10 times a year. Recomputed properly:
- Per-trade Sharpe (mean_R / std_R): 0.49
- Correctly annualized for actual trade frequency: 1.54
3.50 was fiction. 1.54 is defensible. Retired the old number everywhere, including my own notes, and documented the methodology so it's reproducible.
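For anyone who wants to reproduce the annualization, here is a minimal sketch. It assumes the usual square-root-of-frequency scaling (which itself assumes roughly independent trades) and uses the sample counts from this post: 59 trades over 6 years. The function names are illustrative, and whether the original numbers used sample or population standard deviation is not stated, so `stdev` here is an assumption.

```python
import math
import statistics

def per_trade_sharpe(r_multiples):
    """Mean R over sample standard deviation of R, with no annualization."""
    return statistics.mean(r_multiples) / statistics.stdev(r_multiples)

def annualize_sharpe(per_trade, n_trades, years):
    """Scale a per-trade Sharpe by sqrt(trades per year)."""
    trades_per_year = n_trades / years
    return per_trade * math.sqrt(trades_per_year)

# Figures quoted above: per-trade Sharpe 0.49, 59 trades in 6 years.
annualized = annualize_sharpe(0.49, n_trades=59, years=6)
print(round(annualized, 2))  # 1.54
```

The key point is that the scaling factor comes from how often the system actually trades (about 9.8 times a year here), not from the number of trading days in a year.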
The filter waterfall (112 raw trades → 59 filtered)
| Stage | Trades | Win Rate | EV/trade | Sharpe | Max DD |
|---|---|---|---|---|---|
| Raw | 112 | 48.2% | +0.888R | 0.27 | 6.8R |
| + Calendar guard (FOMC/NFP/CPI) | 109 | 48.6% | +0.912R | 0.27 | 6.8R |
| + Friday blocked | 80 | 53.8% | +1.246R | 0.33 | 4.0R |
| + Wed BULL blocked | 70 | 58.6% | +1.479R | 0.37 | 4.0R |
| + Wed BEAR retained only | 61 | 62.3% | +1.539R | 0.38 | 3.0R |
| + Signal grade filter (4-confirmation alignment) | 59 | 57.6% | +0.987R | 0.49 | 3.0R |
Biggest single lever: the Friday filter alone accounts for ~38% of the total edge improvement from raw to final. Friday trades averaged -0.042R across 30 occurrences, essentially free money to remove. Everything else (day-of-week regime interaction, signal grading) matters, but nowhere near as much as just not trading on Fridays.
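A rough sketch of how a cumulative waterfall like the one above can be computed. This is illustrative, not the production backtester: the `Trade` fields, the stage predicates, and the toy trades are all made up for the example, and only three of the filters are shown.

```python
from dataclasses import dataclass

@dataclass
class Trade:
    r: float          # realized R-multiple
    weekday: str      # "Mon" .. "Fri"
    regime: str       # "BULL" or "BEAR"
    event_day: bool   # True on FOMC/NFP/CPI days

# Each stage is applied on top of everything before it.
STAGES = [
    ("Raw", lambda t: True),
    ("+ Calendar guard", lambda t: not t.event_day),
    ("+ Friday blocked", lambda t: t.weekday != "Fri"),
    ("+ Wed BULL blocked", lambda t: not (t.weekday == "Wed" and t.regime == "BULL")),
]

def waterfall(trades):
    """Return (stage, n_trades, win_rate, ev_per_trade) for each cumulative stage."""
    rows = []
    kept = list(trades)
    for name, keep in STAGES:
        kept = [t for t in kept if keep(t)]
        n = len(kept)
        win_rate = sum(t.r > 0 for t in kept) / n if n else 0.0
        ev = sum(t.r for t in kept) / n if n else 0.0
        rows.append((name, n, win_rate, ev))
    return rows

# Toy data, purely for illustration.
trades = [
    Trade(2.0, "Mon", "BULL", False),
    Trade(-1.0, "Fri", "BULL", False),
    Trade(-1.0, "Wed", "BULL", False),
    Trade(1.5, "Wed", "BEAR", False),
    Trade(-1.0, "Tue", "BEAR", True),
]
rows = waterfall(trades)
for name, n, win_rate, ev in rows:
    print(f"{name:<20} n={n} win={win_rate:.1%} EV={ev:+.3f}R")
```

Because the stages are cumulative, the order matters: the marginal effect attributed to any one filter depends on which filters ran before it.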
Loss autopsy—where does the edge actually die
Ran a structural post-mortem on all 59 filtered trades, winners and losers, looking for a taxonomy rather than a magic filter. (Curve-fitting a "what-would-have-avoided-this-loss" rule off 25 losses is how people fool themselves, so I explicitly didn't do that; see below.)
25 losses broke into three types:
- Target-miss reversals (13, 52%): reached ≥1R in favor, then reversed to a full stop
- Slow bleed (11, 44%): sideways chop, stopped late, no real signal
- Immediate reversal (1, 4%): stopped within 3 bars, the classic fakeout, essentially absent
The 52% figure was the interesting one. Half the losses weren't bad entries, they were good entries the market later took back.
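The three buckets can be expressed as a small classifier. This is a sketch under assumptions: the inputs (maximum favorable excursion in R, bars until the stop) are hypothetical field names, and the precedence (check the ≥1R excursion first, then the 3-bar rule) is a choice made here, not something stated above.

```python
from collections import Counter

def classify_loss(mfe_r: float, bars_to_stop: int) -> str:
    """Bucket a losing trade by how it failed.

    mfe_r        -- maximum favorable excursion in R before the stop was hit
    bars_to_stop -- number of bars between entry and the stop
    """
    if mfe_r >= 1.0:
        return "target-miss reversal"   # went >= 1R in favor, then fully reversed
    if bars_to_stop <= 3:
        return "immediate reversal"     # classic fakeout
    return "slow bleed"                 # chopped sideways, stopped late

# Made-up losses as (mfe_r, bars_to_stop), purely for illustration.
losses = [(1.4, 22), (0.2, 2), (0.5, 40), (1.1, 9)]
counts = Counter(classify_loss(mfe, bars) for mfe, bars in losses)
print(counts)
```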
The counterfactual that actually mattered
I'd already built a two-tier exit (bank 50% at +1R, trail the remainder) but never backtested it, because it was execution-layer code, not signal logic. Ran it against the loss autopsy as a historical counterfactual:
| | Backtest (no engine) | With engine |
|---|---|---|
| 13 target-miss losses | -13.0R | |
| 11 slow-bleed losses | -10.8R | |
| 34 winners | +82.0R | |
| Total EV/trade | +0.987R | +1.266R |
The mechanism is boring and mechanical, which is exactly why I trust it: locking half a position at +1R structurally can't be curve-fit to 13 specific historical trades, because it's a rule about R-multiples reached, not about any feature of those particular trades. It generalizes by construction.
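To make the mechanics concrete, here is a minimal sketch of the per-trade arithmetic for "bank 50% at +1R, trail the remainder". The trailing rule itself is not modeled: `runner_exit_r` is a hypothetical input standing in for wherever the trailed half would have exited, and the example values are invented.

```python
def two_tier_r(mfe_r: float, single_exit_r: float, runner_exit_r: float) -> float:
    """R-multiple for one trade under the two-tier exit.

    mfe_r         -- maximum favorable excursion of the trade, in R
    single_exit_r -- result with the original all-in/all-out exit
    runner_exit_r -- exit of the trailed half, in R (assumed input)
    """
    if mfe_r < 1.0:
        # Never reached +1R, so the first tier never fires: outcome unchanged.
        return single_exit_r
    # Half the position is banked at +1R, the other half rides the trail.
    return 0.5 * 1.0 + 0.5 * runner_exit_r

# Target-miss reversal, runner still stopped at the original stop: loss becomes flat.
print(two_tier_r(1.2, -1.0, -1.0))  # 0.0
# Same trade if the trail had moved the runner's stop to breakeven.
print(two_tier_r(1.2, -1.0, 0.0))   # 0.5
# Slow bleed that never reached +1R: untouched.
print(two_tier_r(0.4, -1.0, -1.0))  # -1.0
# Cost side: a 3R winner where the runner trails out at 2.5R.
print(two_tier_r(3.2, 3.0, 2.5))    # 1.75
```

The last case is the part that needs measuring with the same care as the rescued losses: every winner that runs well past +1R gives up half of the move beyond that level.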
Stress-testing against the thing that usually kills these posts
Saw enough "smooth equity curve = look-ahead bias" callouts on posts here to specifically check my own backtester for it. The risk: when a single bar's range contains both the stop and the target, does the backtest assume favorable sequencing (target hit first) when live execution could easily have hit the stop first?
Audited all 93 grade-A trades (pre-final-filter set) for this exact condition:
- 79 trades (84.9%): unambiguous — stop and target far enough apart that same-bar sequencing isn't a question
- 14 trades (15.1%): ambiguous — same-day exit with price between stop and target
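The core of the check is simple to state in code. A minimal sketch, assuming OHLC bars and fixed stop and target prices; the function name and example prices are illustrative:

```python
def is_ambiguous_bar(high: float, low: float, stop: float, target: float) -> bool:
    """True when one bar's range covers both the stop and the target,
    so OHLC data alone cannot say which level traded first.
    Works for longs and shorts, since only the two price levels matter."""
    return low <= min(stop, target) and high >= max(stop, target)

# Long trade with stop 99.0 and target 103.0 (made-up prices).
print(is_ambiguous_bar(high=103.5, low=98.8, stop=99.0, target=103.0))   # True
print(is_ambiguous_bar(high=103.5, low=100.0, stop=99.0, target=103.0))  # False
```

Resolving an ambiguous bar properly needs lower-timeframe or tick data; without it, the conservative choice is to assume the stop traded first.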
Worst-case stress test—force stop-first resolution on all 14 ambiguous trades:
- Original EV: +0.633R (this subset)
- Worst-case EV: +0.449R (-29%)
- After typical live degradation (these numbers work out to a 40% haircut on the worst case): +0.269R, still positive
It's not zero-impact, and I'm not pretending it is. But the edge survives an assumption that's actively hostile to it, which is a meaningfully different claim than "the backtest looks clean." I've now wired live trade tracking to flag these same-bar-ambiguous trades going forward and compare real fills against this worst-case floor. If live underperforms +0.449R on this specific cohort, that's the signal something in the backtester's sequencing assumption was actually wrong, not just theoretically risky.
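A sketch of the stop-first stress test and the arithmetic behind the quoted figures. It assumes a forced stop is booked at exactly -1R (no slippage); the toy trades are invented, and only the last few lines use numbers from this post.

```python
def stop_first_ev(trades):
    """trades: list of (r_multiple, is_ambiguous).
    Ambiguous trades are rebooked as a full -1R stop; all others keep
    their backtest result. Returns EV per trade."""
    stressed = [-1.0 if ambiguous else r for r, ambiguous in trades]
    return sum(stressed) / len(stressed)

# Made-up toy set, purely for illustration.
toy = [(2.0, False), (1.0, True), (-1.0, False), (1.5, True)]
baseline = sum(r for r, _ in toy) / len(toy)
stressed = stop_first_ev(toy)
print(baseline, stressed)  # 0.875 -0.25

# Arithmetic check on the figures quoted above.
drop = (0.449 - 0.633) / 0.633
print(f"{drop:.0%}")            # -29%
print(round(0.449 * 0.60, 3))   # 0.269, i.e. a 40% haircut on the worst case
```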
What I did NOT do (the trap I was trying to avoid)
Did not go hunting for a rule that would have "saved" the 25 losses. That's the classic move that always works and always means nothing: with enough features you can always draw a line around your own losses in hindsight. The asymmetry engine (the two-tier exit above) passed a higher bar: it existed before the autopsy, has a mechanical justification independent of these specific trades, and its cost side (what it gives up on winners) was measured with equal rigor. Anything that only showed up as "add this filter, get 15 more percentage points" got treated as a red flag, not a discovery.
Where it stands
- 59-trade filtered configuration, 57.6% win rate, +1.266R EV with the exit engine active
- Per-trade Sharpe 0.49, correctly annualized ~1.54
- Max drawdown 3.0R across the full filtered sample
- Live drift monitor now tracks rolling EV against this backtest floor, with explicit drift alerts at 10 and 20 trades, and separately tracks live same-bar-ambiguous trades (the live counterpart of the 14 flagged in the backtest) against their own worst-case floor
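A minimal sketch of what a checkpoint-based drift check can look like. The class name, the alert wording, and the choice of a cumulative mean (rather than a fixed rolling window) are assumptions for illustration; the 10- and 20-trade checkpoints and the +0.449R floor are the figures from this post.

```python
class DriftMonitor:
    """Compare live EV per trade against a backtest floor at fixed checkpoints."""

    def __init__(self, floor_r, checkpoints=(10, 20)):
        self.floor_r = floor_r
        self.checkpoints = set(checkpoints)
        self.results = []

    def record(self, r_multiple):
        """Add one live trade. Returns an alert string if this trade lands on
        a checkpoint and the running EV is below the floor, otherwise None."""
        self.results.append(r_multiple)
        n = len(self.results)
        if n not in self.checkpoints:
            return None
        ev = sum(self.results) / n
        if ev < self.floor_r:
            return (f"drift alert at {n} trades: "
                    f"live EV {ev:+.3f}R < floor {self.floor_r:+.3f}R")
        return None

# Made-up live results: alternating +1R / -1R, so the running EV is 0.
monitor = DriftMonitor(floor_r=0.449)
alerts = []
for r in [1.0, -1.0] * 5:
    alert = monitor.record(r)
    if alert:
        alerts.append(alert)
print(alerts)
```

With only 10 or 20 trades the running EV is very noisy, so an alert at these checkpoints is a prompt to investigate, not proof the edge is gone.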
Going live on a funded account shortly. Wanted this checked here first rather than finding out about a hole from a blown drawdown limit.
Genuinely interested in where this is still wrong. What would you attack first: the calendar guard's negligible impact (it only removed 3 of 112 trades; is that suspicious in itself?), the grade-filter methodology, or something in the intrabar sequencing check I haven't thought of?