
Per-bucket Platt scaling on a 425-bet sports model: the single per-sport fit was making it worse
Working on a sports prediction model in my spare time and ran into a calibration issue that surprised me. Sharing in case it's useful, or anyone has thoughts.
Setup: my journal has ~425 graded singles across NBA, MLB, and MMA. I was applying per-sport Platt scaling (one A/B sigmoid fit per sport) as a final residual correction after the Elo + form + market-anchor blend. Standard pattern from the binary-classifier literature.
Worked fine until I bucketed the journal by verdict tier (STRONG BET vs GOOD BET) and noticed that the two NBA tiers were miscalibrated in opposite directions:
NBA STRONG (eff_n=28, 30d half-life): A = -6.83, B = +4.23
NBA GOOD (eff_n=22): A = +2.65, B = -1.77
MLB STRONG (eff_n=46): A = +3.57, B = -2.52
MLB GOOD (eff_n=127): A = +2.10, B = -1.12
The NBA STRONG and GOOD slopes disagree in sign. The single per-sport NBA fit (A = -1.36) was averaging those two opposite errors, so it miscorrected both buckets: its slope is too shallow for STRONG and has the wrong sign for GOOD.
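For anyone who wants to try a fit like this, here is a minimal sketch of a time-decayed Platt fit in plain Python. It is illustrative only and rests on assumptions that are not spelled out above: the sigmoid is applied to the raw probability as sigmoid(A * p + B), eff_n is the Kish effective sample size of the decay weights, and the fit is unregularized maximum likelihood via Newton steps. All function names are hypothetical.

```python
import math

def decay_weight(age_days, half_life_days=30.0):
    """Weight that halves every `half_life_days` (30d half-life by default)."""
    return 0.5 ** (age_days / half_life_days)

def effective_n(weights):
    """Kish effective sample size; an ASSUMED definition of eff_n."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def sigmoid(z):
    z = max(-35.0, min(35.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(probs, outcomes, weights, iters=50):
    """Weighted ML fit of q = sigmoid(A * p + B) using Newton's method.

    No regularization: on a perfectly separable bucket the coefficients
    will run away, which is one more reason tiny buckets are risky.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for p, y, w in zip(probs, outcomes, weights):
            q = sigmoid(a * p + b)
            err = w * (q - y)            # gradient of weighted log loss
            g_a += err * p
            g_b += err
            curv = w * q * (1.0 - q)     # Hessian contribution
            h_aa += curv * p * p
            h_ab += curv * p
            h_bb += curv
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:             # degenerate Hessian: stop updating
            break
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b
```

With only two free parameters the Newton solve is a 2x2 system, so no linear algebra library is needed. A real version would probably want a prior or ridge term on A and B.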
Fixed by switching to per-(sport, verdict_tier) Platt with a fallback chain: per-bucket when eff_n ≥ 12, else per-sport, else global, else identity. Verdict tier at inference is inferred from the probability band (p ≥ 0.68 = STRONG, 0.55 ≤ p < 0.68 = GOOD), since the actual verdict label isn't known until edge is computed downstream.
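The fallback chain can be sketched roughly as follows. This is a hypothetical illustration, not the production code: the names (`infer_tier`, `calibrate`, the `Fit` tuple) are made up, the sigmoid is assumed to act on raw p as sigmoid(A * p + B), and the same eff_n ≥ 12 threshold is assumed at the per-sport and global levels, which the description above does not specify.

```python
import math
from typing import Dict, Optional, Tuple

MIN_EFF_N = 12.0  # per-bucket threshold; reused for other levels as an assumption

# One fitted calibrator: (A, B, eff_n)
Fit = Tuple[float, float, float]

def infer_tier(p: float) -> Optional[str]:
    """Map a raw probability to a verdict tier using the p-bands."""
    if p >= 0.68:
        return "STRONG"
    if p >= 0.55:
        return "GOOD"
    return None  # below the GOOD band: no tier, so no per-bucket fit

def usable(fit: Optional[Fit]) -> bool:
    """A fit is usable if it exists and has enough effective samples."""
    return fit is not None and fit[2] >= MIN_EFF_N

def calibrate(
    p: float,
    sport: str,
    bucket_fits: Dict[Tuple[str, str], Fit],
    sport_fits: Dict[str, Fit],
    global_fit: Optional[Fit],
) -> float:
    """Apply the first usable fit: bucket, then sport, then global, else identity."""
    tier = infer_tier(p)
    candidates = [
        bucket_fits.get((sport, tier)) if tier else None,
        sport_fits.get(sport),
        global_fit,
    ]
    for fit in candidates:
        if usable(fit):
            a, b, _ = fit
            return 1.0 / (1.0 + math.exp(-(a * p + b)))
    return p  # identity fallback
```

One consequence of this design is a hard switch: a bucket crossing eff_n = 12 jumps from the per-sport coefficients to its own in a single step.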
Delta on synthetic predictions:
NBA STRONG @ p=0.78: per-sport -> 0.358, per-bucket -> 0.251 (-10.7pt)
MLB GOOD @ p=0.62: per-sport -> 0.516, per-bucket -> 0.546 (+3.0pt)
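As a sanity check, the per-bucket numbers can be approximately reproduced from the coefficients listed earlier, assuming the calibrated probability is sigmoid(A * p + B) applied to the raw p. That functional form is an assumption, and `platt` is a hypothetical helper name. The per-sport side can't be checked the same way because the per-sport B isn't listed.

```python
import math

def platt(p, a, b):
    """Calibrated probability under the ASSUMED form sigmoid(a * p + b)."""
    return 1.0 / (1.0 + math.exp(-(a * p + b)))

# Per-bucket coefficients as listed earlier in the post
nba_strong = platt(0.78, -6.83, 4.23)
mlb_good = platt(0.62, 2.10, -1.12)

print(f"NBA STRONG @ p=0.78 -> {nba_strong:.3f}")
print(f"MLB GOOD   @ p=0.62 -> {mlb_good:.3f}")
```

Both values land within about 0.001 of the reported 0.251 and 0.546, which is what you'd expect from coefficients rounded to two decimals.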
Calibration audit is live at lakeshore-edge.com/model if anyone wants the raw data. The per-bucket coefficients are refit each time the journal reflect loop runs.
Caveats I'm aware of:
- 425 bets is still tiny for a 2-parameter sigmoid per bucket (the NBA buckets sit at eff_n of 28 and 22)
- Verdict-tier inference from p-band has its own selection bias (high-p picks become STRONG more often, so the bucket fit is fitting on a non-random subset)
- Time-decay weighting (30d half-life) is plausible but not validated against a held-out window
Question for the sub: has anyone done per-bucket calibration in this kind of small-sample regime? Specifically, I'm deciding between hierarchical Bayes (partially pooling information across buckets) and the simpler fallback chain I'm running now.
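To make the comparison concrete, the crudest stand-in for partial pooling is to shrink each bucket's (A, B) toward its parent fit in proportion to eff_n, instead of switching hard at a threshold. This is only a sketch of the idea, not a real hierarchical model: `shrink_fit` and the prior strength `k` are hypothetical, and blending in coefficient space is itself an approximation.

```python
def shrink_fit(bucket, parent, eff_n, k=20.0):
    """Blend bucket (A, B) toward parent (A, B).

    lam = eff_n / (eff_n + k) is the weight on the bucket's own fit;
    k is a HYPOTHETICAL prior strength (in units of effective bets)
    that would itself need tuning or validation.
    """
    lam = eff_n / (eff_n + k)
    return tuple(lam * c + (1.0 - lam) * p for c, p in zip(bucket, parent))
```

With k = 20, a bucket at eff_n = 28 would keep 28/48, or about 58%, of its own coefficients. A bucket at eff_n = 0 collapses to the parent fit, which matches what the fallback chain does below its threshold, but without the discontinuity.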