
I built the most honest VRP put credit spread backtest I could. 7 years, 5 symbols. Terrible
I have been trying to make short put credit spreads work as a real strategy. Built a backtest with one config tuned on SPY, then froze every parameter and ran it unchanged on QQQ, IWM, AMZN, NVDA from 2019 to 2026. Then added SPXW as an out-of-sample check using a separate data feed.
The point was to remove the three things that make most VRP backtests lie:
- Mid-price fills (I posted limits and waited for someone to cross my price)
- Clean profit-target exits (if my close limit didn't fill, I had to cross the spread)
- Re-tuning per symbol (the four extra symbols were never touched by the optimizer)
Here is what came out.
The full results table
| Symbol | Trades | Win rate | Sharpe | CAGR | Fill rate |
|---|---|---|---|---|---|
| SPY (tuned) | 335 | 71.6% | 0.23 | +1.05% | 13.7% |
| NVDA | 201 | 72.1% | 0.22 | +0.84% | 6.5% |
| AMZN | 139 | 71.2% | 0.13 | +0.42% | 3.4% |
| QQQ | 162 | 69.8% | -0.08 | -0.23% | 7.9% |
| IWM | 58 | 63.8% | -0.35 | -0.11% | 3.1% |
| SPXW (OOS) | 112 | 61.6% | -0.39 | -0.66% | 100% * |
* SPXW was daily-close resolution from a different feed, so its fill rate is 100% by design. Direction check only.
The win rate held everywhere. 64% to 72% across five very different underlyings. Two of four out-of-sample symbols lost money. SPXW lost money. The best Sharpe was on the symbol I tuned on. Every symbol the strategy had never seen did worse.
Why a 70% win rate makes nothing
The payoff is brutally asymmetric:
| Symbol | Avg win | Avg loss | Ratio |
|---|---|---|---|
| SPY | $369 | -$850 | 1 : 2.3 |
| QQQ | $311 | -$751 | 1 : 2.4 |
| AMZN | $407 | -$933 | 1 : 2.3 |
| NVDA | $385 | -$888 | 1 : 2.3 |
| IWM | $53 | -$130 | 1 : 2.4 |
0.70 × 1 - 0.30 × 2.3 = 0.01, which is roughly zero. Put another way, the breakeven win rate at a 1:2.3 payoff is 2.3 / 3.3, about 69.7%, so a 70% win rate is a coin flip in expectation. You win small often and lose big sometimes, and the two almost exactly cancel.
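The same arithmetic can be run per symbol. This is a minimal sketch that uses only the win rates and average win/loss figures from the two tables above; the function names are illustrative and not taken from the backtest code. Because the inputs are rounded table values, the outputs are a sanity check, not a reproduction of the reported CAGR.

```python
# (win_rate, avg_win, avg_loss) copied from the tables in the post.
STATS = {
    "SPY": (0.716, 369.0, 850.0),
    "QQQ": (0.698, 311.0, 751.0),
    "AMZN": (0.712, 407.0, 933.0),
    "NVDA": (0.721, 385.0, 888.0),
    "IWM": (0.638, 53.0, 130.0),
}


def expectancy(win_rate, avg_win, avg_loss):
    """Expected dollars per trade: p * avg_win - (1 - p) * avg_loss."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


def breakeven_win_rate(avg_win, avg_loss):
    """Win rate at which expectancy is exactly zero."""
    return avg_loss / (avg_win + avg_loss)


for symbol, (p, win, loss) in STATS.items():
    ev = expectancy(p, win, loss)
    be = breakeven_win_rate(win, loss)
    print(f"{symbol}: win rate {p:.1%}, breakeven {be:.1%}, expectancy ${ev:+.2f}/trade")
```

The useful column is the gap between the actual win rate and the breakeven win rate: on every symbol it is a point or two at most, in either direction.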
The win rate is the wrong number to anchor on. Every post that leads with "I have an 80% win rate strategy" and does not show the loss distribution is selling you the setup, not the result.
Only 3-14% of my orders filled, and the ones that did filled at worse prices
| Symbol | Proposed | Filled | Fill rate | Avg slip vs mid |
|---|---|---|---|---|
| SPY | 2,438 | 335 | 13.7% | -$0.037 |
| QQQ | 2,055 | 162 | 7.9% | -$0.042 |
| NVDA | 3,072 | 201 | 6.5% | -$0.044 |
| AMZN | 4,114 | 139 | 3.4% | -$0.040 |
| IWM | 1,896 | 58 | 3.1% | -$0.039 |
A backtest that assumes mid fills books every one of these orders, and books them at a better price than reality gives you. With post-and-wait limits, only 3% to 14% of them fill, and the ones that do fill about 4 cents worse than mid, because the orders that actually fill are the ones the market is running through.
On 100-multiplier contracts that is a $4 headwind per fill before the trade even starts. Add the fact that most posted backtests do not track fill rate at all and you can see where the fictional returns come from.
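For anyone who wants to check the fill-rate and slippage numbers, here is a small sketch that recomputes them from the table above. The counts and slip values are copied from the post; the helper names and the pooled figure are my own additions for illustration.

```python
# (proposed, filled, avg_slip_vs_mid_per_share) copied from the table above.
ORDERS = {
    "SPY": (2438, 335, -0.037),
    "QQQ": (2055, 162, -0.042),
    "NVDA": (3072, 201, -0.044),
    "AMZN": (4114, 139, -0.040),
    "IWM": (1896, 58, -0.039),
}
CONTRACT_MULTIPLIER = 100  # standard multiplier for US equity and ETF options


def fill_rate(proposed, filled):
    """Fraction of proposed orders that actually filled."""
    return filled / proposed


def slip_dollars(slip_per_share, multiplier=CONTRACT_MULTIPLIER):
    """Slippage per one-lot spread in dollars (negative means worse than mid)."""
    return slip_per_share * multiplier


for symbol, (proposed, filled, slip) in ORDERS.items():
    print(f"{symbol}: fill rate {fill_rate(proposed, filled):.1%}, "
          f"slip ${slip_dollars(slip):+.2f} per one-lot")

total_proposed = sum(p for p, _, _ in ORDERS.values())
total_filled = sum(f for _, f, _ in ORDERS.values())
print(f"pooled: {total_filled} of {total_proposed} "
      f"({fill_rate(total_proposed, total_filled):.1%})")
```

Pooled across the five symbols that is 895 fills out of 13,575 proposed orders, about 6.6%.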
The exact fill logic I used (post limit at ask_edge, stale-quote guard, patient-then-cross exits) is open-source here: flashalpha-fill-simulator. Plug in your own fill model and rerun if you want to test the sensitivity yourself.
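To make "post a limit and wait for someone to cross" concrete, here is a toy sketch of a post-and-wait entry fill. It is not the code from the repository mentioned above: the `Quote` fields, the `simulate_sell_limit` name, and the `max_wait` and `max_quote_age` defaults are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Quote:
    ts: float   # seconds since the order was posted
    bid: float  # net credit the market will pay for the spread right now
    ask: float  # net credit at which the market offers the spread
    age: float  # seconds since this quote last updated


def simulate_sell_limit(limit: float,
                        quotes: Iterable[Quote],
                        max_wait: float = 60.0,
                        max_quote_age: float = 5.0) -> Optional[float]:
    """Return the fill price if a fresh bid reaches our limit in time, else None."""
    for q in quotes:
        if q.ts > max_wait:
            break  # order is cancelled unfilled
        if q.age > max_quote_age:
            continue  # stale-quote guard: never fill against an old quote
        if q.bid >= limit:
            return limit  # someone crossed our price; no price improvement assumed
    return None


quotes = [
    Quote(ts=1.0, bid=0.92, ask=1.08, age=0.5),   # mid is 1.00, bid is below our limit
    Quote(ts=20.0, bid=1.01, ask=1.10, age=0.4),  # market runs through our limit
]
print(simulate_sell_limit(1.00, quotes))
```

A mid-fill backtest would have booked the first quote as a fill at 1.00. Here the order only fills once the bid itself reaches the limit, which is the adverse-selection effect described above.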
The profit target hits are mostly forced spread crosses
pt = my close-limit at the 50% target filled cleanly. pt_x = the target was hit but my limit did not catch, so I had to cross the spread to get out.
| Symbol | pt (clean) | pt_x (crossed) | sl | sl_x | expiry |
|---|---|---|---|---|---|
| SPY | 103 | 133 | 63 | 32 | 4 |
| QQQ | 29 | 76 | 37 | 12 | 8 |
| AMZN | 44 | 47 | 25 | 14 | 9 |
| NVDA | 68 | 71 | 43 | 13 | 6 |
| IWM | 12 | 21 | 7 | 13 | 5 |
On every single symbol, more profit-target exits required crossing the spread than filled cleanly at the limit. The "close at 50% target" that naive backtests book is the minority outcome in practice. This one effect, invisible in any mid-fill backtest, eats most of the gap between the win rate and the real result.
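The share of profit-target exits that needed a spread cross can be read straight off the table. A short sketch, with the exit counts copied from the post and the helper name chosen for illustration:

```python
# (pt, pt_x, sl, sl_x, expiry) exit counts copied from the table above.
EXITS = {
    "SPY": (103, 133, 63, 32, 4),
    "QQQ": (29, 76, 37, 12, 8),
    "AMZN": (44, 47, 25, 14, 9),
    "NVDA": (68, 71, 43, 13, 6),
    "IWM": (12, 21, 7, 13, 5),
}


def crossed_share(clean, crossed):
    """Fraction of exits of one type that had to cross the spread."""
    return crossed / (clean + crossed)


for symbol, (pt, pt_x, sl, sl_x, expiry) in EXITS.items():
    print(f"{symbol}: {crossed_share(pt, pt_x):.1%} of profit-target exits crossed, "
          f"{pt + pt_x + sl + sl_x + expiry} trades total")

pooled_clean = sum(row[0] for row in EXITS.values())
pooled_crossed = sum(row[1] for row in EXITS.values())
print(f"pooled: {crossed_share(pooled_clean, pooled_crossed):.1%}")
```

Pooled across the five symbols, 348 of 604 profit-target exits (about 58%) were crossed. The row totals also match the trade counts in the results table, which is a useful consistency check.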
The high-conviction regime lost money
My signal outputs risk_on, neutral, reduce, or risk_off. risk_on means "deploy full size." Here is what each regime actually earned:
| Symbol | neutral P&L | risk_on P&L |
|---|---|---|
| SPY | +$5,604 | +$2,174 |
| QQQ | -$3,000 | +$1,364 |
| NVDA | +$5,918 | +$266 |
| AMZN | +$4,669 | -$1,644 |
| IWM | +$145 | -$910 |
On four of five symbols, the regime my signal is most confident about earned less than the middle-of-the-road neutral state. Same pattern showed up independently on SPXW (95 risk_on trades lost $5,266; 17 neutral trades made $625).
That is the mechanism behind every "+5,000% backtest" VRP post that later blows up. The signal sizes you in hardest exactly when it shouldn't.
Where I'm landing
I went into this trying to find the version of put credit spreads that actually works. After the honest backtest, I cannot find one that beats passive SPY on a risk-adjusted basis once you pay realistic execution costs and refuse to refit per symbol.
So I am starting to think the defined-risk structure is the wrong call. With a spread you cap the win (the credit) and cap the loss (the long put you bought for protection), but the long put is so far out of the money in 20-30 delta land that it barely helps in the years that hurt, while it costs you real money in the years that don't.
I am now seriously thinking about going back to short straddles or strangles and managing the position by hand. Undefined risk, yes. But when the underlying moves you actually get to do something about it. Roll the untested side in, adjust the strike, flip to an iron when needed. With a spread you just sit there and watch the short leg get tested while the long leg does nothing useful until you are already at -80% of max.
Managing the position beats hoping a far OTM long put saves you, if you have the discipline to actually do it. The spread feels safer, but seven years of data says the safety was already priced in.
What I'm actually doing next
Strangles is the headline lead, but I'm not committing to it yet. There are bigger levers I haven't pulled, and I want to be honest with myself about the ranking.
- Patient execution before anything else. The companion fill-model study showed mid-fills give strongly positive results, while honest post-and-wait fills give breakeven. That gap is the biggest single number in the whole project, larger than anything I'd realistically get from a smarter signal. Before signal v3, I want to test posting the limit and letting it work for 2-5 minutes instead of cancelling fast. If fill rate goes from 7% to 20% at similar slip, I've nearly tripled deployable capital on the same edge.
- A simpler signal probably beats the classifier. The cleanest single-feature edge in the data is the VIX regime: calm under 12 was -$0.66 realized, crisis 30+ was +$0.54. A dumb floor rule ("don't sell when VIX is under 14, stop at 2x credit") probably captures most of the conditional edge with none of the regime-classifier complexity. The classifier was already wrong about its own confidence — risk_on underperformed neutral — which tells me "more signal" might just be the wrong goal.
- Treat this as a sleeve, not a strategy. A 0.23 Sharpe in isolation is not tradeable. As one sleeve in a portfolio where P&L is theta + VRP rather than delta, it diversifies an equity core. The right question stops being "find a profile that beats SPY" and becomes "what is the right sleeve allocation in a multi-strategy book."
- 2022 defines sizing. Everything else is leverage. Every other year was easy. 2022 was the survival test. Whatever size does not hurt under 2022 conditions is the deployable size. Anything above that is leverage I will regret in the next bad year.
- Drop single-name diversification. The cross-symbol study showed single names took the worst 2022 drawdowns. Drift profile is different, tails fatter. SPY and SPX are the home. AMZN/NVDA do not "diversify", they add risk.
- Strangles with management is a project, not a conclusion. Yes, the defined-risk wing barely earns its cost. But uncapping the loss has its own problems, and "manage by hand" is alpha-bearing skill that backtests do not measure well. Build the managed-strangle engine, let the data decide, then commit.
Short-term plan: patient-execution variant first, sleeve-sizing math second, dumb VIX-floor signal third. Managed strangles is the test after that, not the headline pivot.
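The dumb VIX-floor rule from the list above is small enough to write down. This is a sketch under assumptions: the 14 floor and the 2x stop come from the rule as quoted, but reading "stop at 2x credit" as "close when the cost to buy back the spread reaches twice the credit received" is my interpretation, and the function names are hypothetical.

```python
VIX_FLOOR = 14.0      # "don't sell when VIX is under 14"
STOP_MULTIPLE = 2.0   # "stop at 2x credit" (interpretation: debit to close >= 2x credit)


def should_open(vix: float, floor: float = VIX_FLOOR) -> bool:
    """Only sell a spread when VIX is at or above the floor."""
    return vix >= floor


def stop_hit(entry_credit: float, current_debit: float,
             multiple: float = STOP_MULTIPLE) -> bool:
    """True once the cost to close reaches `multiple` times the credit received."""
    return current_debit >= multiple * entry_credit


print(should_open(12.5))     # calm regime, stand aside
print(should_open(31.0))     # crisis regime, allowed to sell
print(stop_hit(1.00, 2.10))  # spread now costs 2.10 to close against 1.00 credit
```

Two thresholds and no classifier: if this captures most of the conditional edge, the regime model is not earning its complexity.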
Caveats
- This is the unlevered version. Half-Kelly, default 0.05, cap 0.25. Absolute returns and drawdowns are small (1-8%). A leveraged version exists but I am not citing it because it has not been validated out-of-sample, and that is exactly the kind of number that turns into a "+5,000%" post.
- Out-of-sample in symbol, not in time. The signal thresholds were chosen on SPY 2018-2026. Different window, results could shift.
- No commissions or fees modeled. Real account would be worse, not better.
- IWM is statistically thin (58 trades total). Treat its negative Sharpe as directional, not conclusive.
- SPXW is daily-close resolution so its Sharpe and fill rate are not apples to apples with the 1-minute symbols. It is a direction check.
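For the sizing caveat, here is a sketch of what "half-Kelly, default 0.05, cap 0.25" could look like. It uses the textbook binary-outcome Kelly formula, f = p - (1 - p) / b with b = avg win / avg loss. Treating 0.05 as the fallback when no statistics are available yet, and flooring the result at zero, are assumptions on my part and not a description of the actual sizing code.

```python
from typing import Optional


def half_kelly_fraction(win_rate: Optional[float] = None,
                        avg_win: Optional[float] = None,
                        avg_loss: Optional[float] = None,
                        default: float = 0.05,
                        cap: float = 0.25) -> float:
    """Half of the binary Kelly fraction, clamped to [0, cap].

    Falls back to `default` when statistics are missing (assumed behaviour).
    """
    if win_rate is None or avg_win is None or avg_loss is None:
        return default
    payoff = avg_win / avg_loss                    # b: dollars won per dollar lost
    kelly = win_rate - (1.0 - win_rate) / payoff   # textbook binary Kelly
    return min(max(0.5 * kelly, 0.0), cap)


print(half_kelly_fraction())                   # no stats yet, use the default
print(half_kelly_fraction(0.716, 369, 850))    # SPY figures from the tables
print(half_kelly_fraction(0.698, 311, 751))    # QQQ figures: negative edge
```

On the SPY table values this gives a fraction of about 0.03, and on QQQ it floors at zero, which lines up with how thin the edge is in the results table.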
TL;DR
- 7 years, 5 symbols, one config tuned on SPY then frozen
- 70% win rate held everywhere, Sharpe ranged from +0.23 to -0.39
- Two of four out-of-sample symbols lost money, SPXW lost money
- The win rate is meaningless without the 1:2.3 win/loss ratio next to it (it nets to zero)
- Only 3-14% of my orders even filled, and those filled at -$0.04 vs mid
- Most "profit target" exits were forced spread crosses, not clean limit fills
- The signal's high-conviction regime earned less than its middle regime
- Plan: better execution first, then sleeve-sizing math, then a simpler VIX-floor signal, then test managed strangles
The search for money God continues...