u/FlashAlphaLab

I built the most honest VRP put credit spread backtest I could. 7 years, 5 symbols. Terrible

I have been trying to make short put credit spreads work as a real strategy. Built a backtest with one config tuned on SPY, then froze every parameter and ran it unchanged on QQQ, IWM, AMZN, NVDA from 2019 to 2026. Then added SPXW as an out-of-sample check using a separate data feed.

The point was to remove the three things that make most VRP backtests lie:

  • Mid-price fills (I posted limits and waited for someone to cross my price)
  • Clean profit-target exits (if my close limit didn't fill, I had to cross the spread)
  • Re-tuning per symbol (the four extra symbols were never touched by the optimizer)

Here is what came out.

The full results table

| Symbol | Trades | Win rate | Sharpe | CAGR | Fill rate |
|---|---|---|---|---|---|
| SPY (tuned) | 335 | 71.6% | 0.23 | +1.05% | 13.7% |
| NVDA | 201 | 72.1% | 0.22 | +0.84% | 6.5% |
| AMZN | 139 | 71.2% | 0.13 | +0.42% | 3.4% |
| QQQ | 162 | 69.8% | -0.08 | -0.23% | 7.9% |
| IWM | 58 | 63.8% | -0.35 | -0.11% | 3.1% |
| SPXW (OOS) | 112 | 61.6% | -0.39 | -0.66% | 100% * |

* SPXW was daily-close resolution from a different feed, so its fill rate is 100% by design. Direction check only.

The win rate held everywhere. 64% to 72% across five very different underlyings. Two of four out-of-sample symbols lost money. SPXW lost money. The best Sharpe was on the symbol I tuned on. Every symbol the strategy had never seen did worse.

Why a 70% win rate makes nothing

The payoff is brutally asymmetric:

| Symbol | Avg win | Avg loss | Ratio |
|---|---|---|---|
| SPY | $369 | -$850 | 1 : 2.3 |
| QQQ | $311 | -$751 | 1 : 2.4 |
| AMZN | $407 | -$933 | 1 : 2.3 |
| NVDA | $385 | -$888 | 1 : 2.3 |
| IWM | $53 | -$130 | 1 : 2.4 |

0.70 × 1 - 0.30 × 2.3 = roughly zero. By the math, a 70% win rate at a 1:2.3 payoff is a coin flip. You win small often and lose big sometimes, and the two almost exactly cancel.
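The same arithmetic can be run per symbol. This is a minimal sketch that uses the win rates and average win/loss figures from the two tables in this post; the function names are illustrative and not taken from the backtest code.

```python
# (win rate, average win in $, average loss in $ as a positive number),
# copied from the results table and the payoff table in this post.
STATS = {
    "SPY":  (0.716, 369, 850),
    "QQQ":  (0.698, 311, 751),
    "AMZN": (0.712, 407, 933),
    "NVDA": (0.721, 385, 888),
    "IWM":  (0.638, 53, 130),
}

def expectancy(win_rate, avg_win, avg_loss):
    """Expected P&L per trade in dollars."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

def breakeven_win_rate(avg_win, avg_loss):
    """Win rate at which the expectancy is exactly zero."""
    return avg_loss / (avg_win + avg_loss)

for symbol, (p, win, loss) in STATS.items():
    print(f"{symbol:5s} expectancy ${expectancy(p, win, loss):+7.2f}  "
          f"breakeven win rate {breakeven_win_rate(win, loss):.1%}")
```

With losses around 2.3 to 2.4 times the size of wins, the breakeven win rate lands near 70% on every symbol, which is why a 70% hit rate leaves almost nothing per trade.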

The win rate is the wrong number to anchor on. Every post that leads with "I have an 80% win rate strategy" and does not show the loss distribution is selling you the setup, not the result.

Only 3-14% of my orders filled, and the ones that did filled at worse prices

| Symbol | Proposed | Filled | Fill rate | Avg slip vs mid |
|---|---|---|---|---|
| SPY | 2,438 | 335 | 13.7% | -$0.037 |
| QQQ | 2,055 | 162 | 7.9% | -$0.042 |
| NVDA | 3,072 | 201 | 6.5% | -$0.044 |
| AMZN | 4,114 | 139 | 3.4% | -$0.040 |
| IWM | 1,896 | 58 | 3.1% | -$0.039 |

A backtest that assumes you fill at mid books all of these orders at a better price than reality gives you. Real life fills 3 to 14 percent of them and the ones that do fill cross at about 4 cents worse than mid, because the orders that actually fill are the ones the market is running through.

On 100-multiplier contracts that is a $4 headwind per fill before the trade even starts. Multiply by the fact that most posts are not even paying attention to fill rate and you can see where the fictional returns come from.
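To make the fill-rate and slip numbers concrete, here is a small sketch that recomputes them from the table in this post. It assumes the slip is quoted per share of the spread price and that one filled order is one spread; both are my reading of the table, not something the post states.

```python
CONTRACT_MULTIPLIER = 100  # standard multiplier for US equity and ETF options

# (proposed orders, filled orders, average slip vs mid in $ per share),
# copied from the fill table in this post.
FILLS = {
    "SPY":  (2438, 335, -0.037),
    "QQQ":  (2055, 162, -0.042),
    "NVDA": (3072, 201, -0.044),
    "AMZN": (4114, 139, -0.040),
    "IWM":  (1896, 58, -0.039),
}

def fill_rate(proposed, filled):
    """Fraction of posted orders that actually filled."""
    return filled / proposed

def slip_per_spread(slip_per_share, multiplier=CONTRACT_MULTIPLIER):
    """Dollar slip versus mid for a single one-lot spread."""
    return slip_per_share * multiplier

for symbol, (proposed, filled, slip) in FILLS.items():
    print(f"{symbol:5s} fill rate {fill_rate(proposed, filled):5.1%}  "
          f"slip per spread ${slip_per_spread(slip):+.2f}")
```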

The exact fill logic I used (post limit at ask_edge, stale-quote guard, patient-then-cross exits) is open-source here: flashalpha-fill-simulator. Plug in your own fill model and rerun if you want to test the sensitivity yourself.
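For readers who want the shape of such a model without opening the repo, here is a rough sketch of the three ideas named above. This is not the code from flashalpha-fill-simulator. The `Quote` fields, the 5-second staleness threshold, and the 3-bar patience window are all hypothetical, and the entry limit is just a parameter because the post does not define `ask_edge`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Quote:
    bid: float    # price at which the spread could be sold right now
    ask: float    # price at which the spread could be bought right now
    age_s: float  # seconds since this quote last updated

MAX_QUOTE_AGE_S = 5.0  # hypothetical stale-quote threshold
PATIENCE_BARS = 3      # hypothetical number of bars to wait before crossing

def entry_fill(quotes: Sequence[Quote], limit_credit: float) -> Optional[int]:
    """Sell-to-open limit: fills only when a fresh bid reaches our price.
    Returns the index of the filling bar, or None if it never fills."""
    for i, q in enumerate(quotes):
        if q.age_s > MAX_QUOTE_AGE_S:
            continue  # stale-quote guard: never fill against an old quote
        if q.bid >= limit_credit:
            return i
    return None

def exit_fill(quotes: Sequence[Quote],
              limit_debit: float) -> Optional[Tuple[int, float, str]]:
    """Buy-to-close at a profit target, patient first, then crossing.
    'pt'   = a fresh ask came down to our limit, so we fill at the limit.
    'pt_x' = mid sat at or below the target for PATIENCE_BARS fresh bars
             in a row without filling us, so we pay the ask."""
    waited = 0
    for i, q in enumerate(quotes):
        if q.age_s > MAX_QUOTE_AGE_S:
            continue
        if q.ask <= limit_debit:
            return i, limit_debit, "pt"
        mid = (q.bid + q.ask) / 2
        if mid <= limit_debit:
            waited += 1
            if waited >= PATIENCE_BARS:
                return i, q.ask, "pt_x"
        else:
            waited = 0  # target no longer touched, start waiting again
    return None
```

The design point is that a fill needs the other side of the market to reach the posted price, while a mid-fill model only needs the midpoint to touch it. Changing `PATIENCE_BARS` is one way to probe how sensitive results are to the exit assumption.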

The profit target hits are mostly forced spread crosses

pt = my close-limit at the 50% target filled cleanly. pt_x = the target was hit but my limit did not catch, so I had to cross the spread to get out.

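| Symbol | pt (clean) | pt_x (crossed) | sl | sl_x | expiry |
|---|---|---|---|---|---|
| SPY | 103 | 133 | 63 | 32 | 4 |
| QQQ | 29 | 76 | 37 | 12 | 8 |
| AMZN | 44 | 47 | 25 | 14 | 9 |
| NVDA | 68 | 71 | 43 | 13 | 6 |
| IWM | 12 | 21 | 7 | 13 | 5 |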

On every single symbol, more profit-target exits required crossing the spread than filled cleanly at the limit. The "close at 50% target" that naive backtests book is the minority outcome in practice. This one effect, invisible in any mid-fill backtest, eats most of the gap between the win rate and the real result.
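The share of profit-target exits that needed a spread cross can be computed directly from the exit counts. A short sketch using the numbers from the table in this post (the helper name is mine):

```python
# (pt, pt_x, sl, sl_x, expiry) exit counts copied from the table in this post.
EXITS = {
    "SPY":  (103, 133, 63, 32, 4),
    "QQQ":  (29, 76, 37, 12, 8),
    "AMZN": (44, 47, 25, 14, 9),
    "NVDA": (68, 71, 43, 13, 6),
    "IWM":  (12, 21, 7, 13, 5),
}

def crossed_share(clean, crossed):
    """Fraction of exits of one type that had to cross the spread."""
    return crossed / (clean + crossed)

for symbol, (pt, pt_x, sl, sl_x, expiry) in EXITS.items():
    total = pt + pt_x + sl + sl_x + expiry
    print(f"{symbol:5s} trades {total:3d}  "
          f"profit-target exits crossed: {crossed_share(pt, pt_x):.0%}")
```

The per-symbol totals match the trade counts in the results table, which is a useful sanity check that every trade is assigned exactly one exit type.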

The high-conviction regime lost money

My signal outputs risk_on, neutral, reduce, or risk_off. risk_on means "deploy full size." Here is what each regime actually earned:

| Symbol | neutral P&L | risk_on P&L |
|---|---|---|
| SPY | +$5,604 | +$2,174 |
| QQQ | -$3,000 | +$1,364 |
| NVDA | +$5,918 | +$266 |
| AMZN | +$4,669 | -$1,644 |
| IWM | +$145 | -$910 |

On four of five symbols, the regime my signal is most confident about earned less than the middle-of-the-road neutral state. Same pattern showed up independently on SPXW (95 risk_on trades lost $5,266; 17 neutral trades made $625).

That is the mechanism behind every "+5,000% backtest" VRP post that later blows up. The signal sizes you in hardest exactly when it shouldn't.

Where I'm landing

I went into this trying to find the version of put credit spreads that actually works. After the honest backtest, I cannot find one that beats passive SPY on a risk-adjusted basis once you pay realistic execution costs and refuse to refit per symbol.

So I am starting to think the defined-risk structure is the wrong call. With a spread you cap the win (the credit) and cap the loss (the long put you bought for protection), but the long put is so far out of the money in 20-30 delta land that it barely helps in the years that hurt, while it costs you real money in the years that don't.

I am now seriously thinking about going back to short straddles or strangles and managing the position by hand. Undefined risk, yes. But when the underlying moves you actually get to do something about it. Roll the untested side in, adjust the strike, flip to an iron when needed. With a spread you just sit there and watch the short leg get tested while the long leg does nothing useful until you are already at -80% of max.

Managing the position beats hoping a far OTM long put saves you, if you have the discipline to actually do it. The spread feels safer, but seven years of data says the safety was already priced in.

What I'm actually doing next

Strangles is the headline lead, but I'm not committing to it yet. There are bigger levers I haven't pulled, and I want to be honest with myself about the ranking.

  1. Patient execution before anything else. The companion fill-model study showed mid-fills give strongly positive results, honest post-and-wait fills give breakeven. That gap is the biggest single number in the whole project, larger than anything I'd realistically get from a smarter signal. Before signal v3, I want to test posting the limit and letting it work for 2-5 minutes instead of cancelling fast. If fill rate goes from 7% to 20% at similar slip, I've nearly tripled deployable capital on the same edge.
  2. A simpler signal probably beats the classifier. The cleanest single-feature edge in the data is the VIX regime: calm under 12 was -$0.66 realized, crisis 30+ was +$0.54. A dumb floor rule ("don't sell when VIX is under 14, stop at 2x credit") probably captures most of the conditional edge with none of the regime-classifier complexity. The classifier was already wrong about its own confidence — risk_on underperformed neutral — which tells me "more signal" might just be the wrong goal.
  3. Treat this as a sleeve, not a strategy. A 0.48 Sharpe in isolation is not tradeable. As one sleeve in a portfolio where P&L is theta + VRP rather than delta, it diversifies an equity core. The right question stops being "find a profile that beats SPY" and becomes "what is the right sleeve allocation in a multi-strategy book."
  4. 2022 defines sizing. Everything else is leverage. Every other year was easy. 2022 was the survival test. Whatever size does not hurt under 2022 conditions is the deployable size. Anything above that is leverage I will regret in the next bad year.
  5. Drop single-name diversification. The cross-symbol study showed single names took the worst 2022 drawdowns. Drift profile is different, tails fatter. SPY and SPX are the home. AMZN/NVDA do not "diversify", they add risk.
  6. Strangles with management is a project, not a conclusion. Yes, the defined-risk wing barely earns its cost. But uncapping the loss has its own problems, and "manage by hand" is alpha-bearing skill that backtests do not measure well. Build the managed-strangle engine, let the data decide, then commit.

Short-term plan: patient-execution variant first, sleeve-sizing math second, dumb VIX-floor signal third. Managed strangles is the test after that, not the headline pivot.
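The "dumb floor rule" from item 2 is small enough to write down in full. A minimal sketch follows. The post does not say whether "stop at 2x credit" means the loss or the cost to close, so this assumes the stop fires when buying the spread back costs twice the credit received. The function names are illustrative.

```python
VIX_FLOOR = 14.0     # from the post: "don't sell when VIX is under 14"
STOP_MULTIPLE = 2.0  # from the post: "stop at 2x credit"

def may_open(vix: float, floor: float = VIX_FLOOR) -> bool:
    """Entry gate: only sell premium when VIX is at or above the floor."""
    return vix >= floor

def stop_hit(credit: float, cost_to_close: float,
             multiple: float = STOP_MULTIPLE) -> bool:
    """Assumption: the stop fires once closing costs multiple x the credit."""
    return cost_to_close >= multiple * credit

# Example: a spread sold for 1.00 with VIX at 16 that now costs 2.10 to close.
print(may_open(16.0), stop_hit(1.00, 2.10))
```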

Caveats

  • This is the unlevered version. Half-Kelly, default 0.05, cap 0.25. Absolute returns and drawdowns are small (1-8%). A leveraged version exists but I am not citing it because it has not been validated out-of-sample, and that is exactly the kind of number that turns into a "+5,000%" post.
  • Out-of-sample in symbol, not in time. The signal thresholds were chosen on SPY 2018-2026. Different window, results could shift.
  • No commissions or fees modeled. Real account would be worse, not better.
  • IWM is statistically thin (58 trades total). Treat its negative Sharpe as directional, not conclusive.
  • SPXW is daily-close resolution so its Sharpe and fill rate are not apples to apples with the 1-minute symbols. It is a direction check.
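For anyone unfamiliar with the sizing terms in the first caveat, here is one way half-Kelly with a default and a cap could look. The post does not spell out how the 0.05 default and the 0.25 cap are applied, so this sketch assumes the default is a fallback when no trade statistics exist yet and the cap is a hard ceiling on the fraction of capital at risk.

```python
DEFAULT_FRACTION = 0.05  # from the post; assumed to be the fallback size
CAP_FRACTION = 0.25      # from the post; assumed to be a hard ceiling

def kelly_fraction(win_rate, avg_win, avg_loss):
    """Kelly for a two-outcome bet: f = p - (1 - p) / b, with b = win / loss."""
    b = avg_win / avg_loss
    return win_rate - (1.0 - win_rate) / b

def half_kelly_size(win_rate=None, avg_win=None, avg_loss=None):
    """Half of the Kelly fraction, floored at zero and capped."""
    if win_rate is None or avg_win is None or avg_loss is None:
        return DEFAULT_FRACTION
    half = 0.5 * kelly_fraction(win_rate, avg_win, avg_loss)
    return max(0.0, min(half, CAP_FRACTION))

# SPY figures from the tables in this post: 71.6% wins, $369 win, $850 loss.
print(f"{half_kelly_size(0.716, 369, 850):.4f}")
```

On the SPY numbers the half-Kelly fraction comes out at roughly 3% of capital, well below the cap, which fits the small absolute returns and drawdowns described above.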

TL;DR

  • 7 years, 5 symbols, one config tuned on SPY then frozen
  • 70% win rate held everywhere, Sharpe ranged from +0.23 to -0.39
  • Two of four out-of-sample symbols lost money, SPXW lost money
  • The win rate is meaningless without the 1:2.3 win/loss ratio next to it (it nets to zero)
  • Only 3-14% of my orders even filled, and those filled at -$0.04 vs mid
  • Most "profit target" exits were forced spread crosses, not clean limit fills
  • The signal's high-conviction regime earned less than its middle regime
  • Plan: better execution first, then sleeve-sizing math, then a simpler VIX-floor signal, then test managed strangles

The search for money God continues...

u/FlashAlphaLab — 2 days ago

Hey guys, been heads-down for the last couple weeks building a proper SPY put credit spread backtest and I think I have a few findings worth sharing. Posting the highlights here, full writeup with all the tables and numbers is linked at the bottom.

TL;DR - what made the exercise very difficult was adding realistic fill simulation. Every strategy that was making +5,000% suddenly started stopping out on drawdown or going outright bankrupt. Once the simulator stopped letting me fill at the mid, half the "winning" cells flipped negative. Once I added a stop loss, the cells that had been wiping out flipped to category-leading. Almost everything I assumed was either wrong or required one specific guardrail to not blow up.

Setup

  • grid cells across (delta, DTE, profit-take, stop-loss)
  • % drawdown circuit breaker that just halts the whole run when things go bad

The eight findings

1. Mid-fill backtests overstate CAGR by 30-60%. Switching from "fill at mid" to "post a limit and wait" dropped numbers across the entire grid. Several cells flipped from positive to negative. The strategy didn't change - only the simulator's assumption about who fills whom. If your backtest fills at mid you're silently gifting yourself 4-7 cents per contract every trade.

2. The stop loss IS the strategy at short DTE. Same cell, same fills, same period. With a stop at 100% of credit collected: +5,439%. Without one: -100% wipeout on a single tail event.

3. There's a sweet spot in the middle of the grid. Mid-DTE, mid-PT, low delta. Calmar above 2, Sharpe above 1.5, low double-digit drawdown. I'm not posting the exact coordinates - it's the only thing in this whole project worth keeping private and the recipe is useless without the same historical chains anyway. What I will say is what the sweet spot is not: not the highest CAGR cell (those have 30%+ drawdowns), not the highest Sharpe cell (39.7% drawdown), not the shortest or longest DTE, not the highest delta.

4. SL=200% is worse than no stop loss. Yes really. Same 45Δ DTE30 cell at three stop settings:

By the time the loss has grown to 200% of credit, you're deep ITM and gamma is doing the marking. You stop out at a terrible price after letting the position breathe past recovery. Either pick tight or pick none. The middle is the trap.

5. I built a fancy 3-layer signal. A one-line boolean flag beat it. Premium / Danger / Stabilization composites, z-scored macro inputs, continuous Kelly multiplier, the whole nine yards. Then ran t-tests across all 16k trades. The single strongest predictor was a one-liner with t > 8 - the kind of thing you can compute in three lines of pandas. My fancy composite added something on top, but the signal-to-noise was mostly in the simple flag.

The lesson: if your "edge" is a 47-feature gradient-boosted model, check what one obvious flag does. Most signal engineering is just expensive ways to discover one boring flag.
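The flag itself is not named here, but the test is easy to reproduce on any boolean split of a trade log. Below is a sketch of Welch's t-statistic using only the standard library. The P&L lists are made-up numbers to exercise the function, not results from this backtest.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Made-up per-trade P&L, split by whether a hypothetical flag was on at entry.
pnl_flag_on = [120.0, 80.0, 150.0, -40.0, 110.0, 95.0]
pnl_flag_off = [-200.0, 60.0, -150.0, 30.0, -90.0, 10.0]

print(f"t = {welch_t(pnl_flag_on, pnl_flag_off):.2f}")
```

Running the same function on the output of a complex model and on each raw input flag is a cheap way to see how much the complexity adds.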

6. Multi-DTE flexibility cost me 9pp of CAGR. Seemed obvious - rank candidates across 30/45/60-DTE chains every entry, pick the best EV/risk, let the term structure tell you which tenor is most attractive. Built it. Ran it. Pooling underperformed focused single-DTE by ~9pp of CAGR. The ranker rationally preferred 30-DTE most of the time. When it picked 45-DTE it was specifically because the 30-DTE chain looked worse than usual that bar - meaning the 45-DTE bucket got adversely selected. More degrees of freedom = the optimizer can also fail in more ways.

7. Higher delta moves equity vol, not alpha. I expected higher delta to mean more edge - more credit per contract, more theta. Instead, raising delta from 10 to 30 to 45 raised CAGR and raised MaxDD in roughly equal proportion. Calmar stayed approximately flat. Delta is just how big you want the swings.
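Calmar is CAGR divided by maximum drawdown, so scaling both by the same factor leaves it unchanged. A small sketch of the two calculations follows. The equity curve and the CAGR/drawdown pairs are hypothetical and only illustrate the ratio.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def calmar(cagr, max_dd):
    """Calmar ratio: annualized return per unit of maximum drawdown."""
    return cagr / max_dd

# Hypothetical curve: peak of 120, trough of 90.
print(f"max drawdown {max_drawdown([100, 120, 90, 130]):.0%}")

# Hypothetical low-delta and high-delta cells: three times the return,
# three times the drawdown, same Calmar.
print(calmar(0.10, 0.05), calmar(0.30, 0.15))
```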

8. The bug log was humbling. Every one of these would have inflated my headline numbers, and most were caught only after a code review:

  • Mid-fill (flipped multiple losing strategies positive)
  • EV-sorted tiebreak when two limits crossed same bar (small but real oracle)

I have an archive_v3_grid/ folder with the same engine on the same data with all the bias bugs intact. Every cell in there returned -1% to -8% CAGR. So when the post-fix numbers turned positive, that wasn't simulator noise - that was what the bias was masking.

If you're building your own backtest and the numbers look great on the first run, you have a bug. Find it.

What I'd actually do differently if you're starting

  • Build the limit-fill model first. Don't bother running anything until your fills aren't free.
  • Don't add multi-DTE flexibility until you've maxed out single-DTE. More flexibility = more ways to be wrong.

Caveats

Standard ones. All numbers are in-sample over 2019-2026. The cell choice and the signal thresholds were both made looking at the same window the engine ran against. Walk-forward train/test split (2019-2022 train, 2023-2026 evaluate untouched) is the next gate before anything goes live. This is a backtest, not a track record.
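The walk-forward gate described above amounts to a date split on the trade log. Here is a minimal sketch, assuming trades are `(entry_date, pnl)` tuples and the split is made on entry date; both are illustrative choices, not the engine's actual data model.

```python
from datetime import date

TRAIN_END = date(2022, 12, 31)  # post: 2019-2022 is the training window
TEST_START = date(2023, 1, 1)   # post: 2023-2026 is evaluated untouched

def split_by_entry_date(trades):
    """Split (entry_date, pnl) tuples into train and test lists."""
    train = [t for t in trades if t[0] <= TRAIN_END]
    test = [t for t in trades if t[0] >= TEST_START]
    return train, test

# Made-up trades, only to show the split.
trades = [
    (date(2019, 3, 1), 120.0),
    (date(2022, 12, 30), -300.0),
    (date(2023, 1, 3), 90.0),
    (date(2025, 6, 15), 45.0),
]
train, test = split_by_entry_date(trades)
print(len(train), len(test))
```

Parameters and thresholds would be chosen on `train` only, and `test` would be scored once with everything frozen. A trade opened late in 2022 that closes in 2023 straddles the boundary, so a stricter version might split on exit date instead.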

Full writeup with all the tables and exact numbers: Link

Happy to answer questions on methodology in the comments.

u/FlashAlphaLab — 25 days ago